AMD Vega 10, Vega 11, Vega 12 and Vega 20 Rumors and Discussion

I would have liked a younger brother for my R9 390.
44 Polaris CUs, a 384-bit memory interface, 64 ROPs. It would have been a nice stopgap and, just like Hawaii before it, one of the more balanced GCN GPUs.
 
So what actually went wrong? Did RTG just spend all of their time developing a "good" pro driver and leave the gaming/consumer one for later? I mean, the card is a beast, but nothing really works as intended.
Seems to be something software-related, going off the Linux open vs. proprietary drivers. AMD actively contributes to the open one, and it's 30-500% ahead of the closed driver in OpenGL benchmarks, even giving the 1080 Ti a run while still being a work in progress. Phoronix has good benchmarks with those numbers.

The pro and compute drivers would be a different team and resource model. Possibly even a higher priority if AMD anticipated a supply constraint on Vega with Ryzen hogging fab space. Focusing on higher margin Vega parts is logical.

I would have liked a younger brother for my R9 390.
44 Polaris CUs, a 384-bit memory interface, 64 ROPs. It would have been a nice stopgap and, just like Hawaii before it, one of the more balanced GCN GPUs.
Should work well for console gaming too! I'm still withholding judgement until the drivers mature a bit.
 
A smaller Vega would indeed be a good thing; imagine something like a 580 but with the clock speeds of Vega, it might give the 1070 a run for its money. I'm still not sold on HBM2 for consumer cards; maybe GDDR6 will be a good fit for the next AMD GPUs.
 
I think the brunt of what we are seeing is the result of a strategic decision AMD made years ago that locked their high-end GPUs into the HBMx development path and the GloFo foundry process, with the former being a mistake and the latter an economic necessity. The combination has not worked out nearly as well as hoped.
 
It's not about rates, it's about saturation.

Producer-consumer intermediate buffers are finite.

For illustrative purposes: if you have a triangle buffer that spends 100% of its time full, with 99% of the triangles culled after they leave the buffer, then anything more than 1% culling before the triangles reach the buffer is going to be a win.
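To put toy numbers on that (everything below is illustrative, nothing is a measured hardware figure): if the buffer is permanently full, the fixed-function consumer drains it at some constant rate, so useful throughput is just that drain rate times the fraction of buffered triangles that will actually survive.

```python
# Toy saturation model -- all numbers are made up for illustration.
DRAIN_RATE = 2.0          # triangles/clock the fixed-function stage pulls from the full buffer
VISIBLE_FRACTION = 0.01   # only 1% of submitted triangles survive post-buffer culling

def useful_throughput(pre_cull_efficiency):
    """pre_cull_efficiency: fraction of the doomed 99% removed before the buffer."""
    doomed = (1.0 - VISIBLE_FRACTION) * (1.0 - pre_cull_efficiency)
    visible_share = VISIBLE_FRACTION / (VISIBLE_FRACTION + doomed)
    return DRAIN_RATE * visible_share  # visible triangles assembled per clock

for eff in (0.0, 0.5, 0.9, 0.99):
    print(f"pre-cull {eff:.0%} of the junk -> {useful_throughput(eff):.3f} visible tris/clock")
```

The fixed-function stage never gets any faster; culling before the buffer just stops it wasting its fixed drain rate on triangles that were never going to be drawn.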
Ok. So that's the benefit of merging multiple shaders into one: the buffer that you're talking about resides in between two earlier conventional shader stages.

Now, is your 99%/1% example realistic for some games or is it just for the sake of explanation? Would games optimize their triangle strips such that you get a more balanced ratio between culled and non-culled triangles?

Conventional GPUs have finite buffers for triangles, which have to be assembled before being culled, with culling done by fixed function hardware. Even if this hardware ran at infinite speed, the finite buffer before it would limit throughput due to saturation.
Do those buffers reside in DRAM/L2 cache or are we talking hard RAMs on the die? From what you write, it seems to be the latter.

To make the saturated buffer situation worse, the buffer has to hold attribute data, not just vertex position data. The alternative would be to defer non-position attribute shading until after culling, which is how the primitive shader kills more than one bird with one stone.
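A minimal CPU-side sketch of that "defer the attribute work until after culling" idea; the vertex layout, the shading stand-ins and the simple winding test are all my own placeholders, not AMD's actual primitive shader code:

```python
# Hypothetical sketch: shade positions first, cull, and only then pay for the attributes.
def cross_z(ax, ay, bx, by):
    return ax * by - ay * bx

def backfacing(p0, p1, p2):
    # Screen-space winding test: non-positive signed area counts as back-facing here.
    return cross_z(p1[0] - p0[0], p1[1] - p0[1],
                   p2[0] - p0[0], p2[1] - p0[1]) <= 0.0

def shade_position(v):      # stand-in for the position half of the vertex shader
    return v["pos"]

def shade_attributes(v):    # stand-in for the (expensive) non-position attribute math
    return {"normal": v["nrm"], "uv": v["uv"]}

def primitive_shade(triangles):
    survivors = []
    for tri in triangles:
        positions = [shade_position(v) for v in tri]      # cheap, position-only work
        if backfacing(*positions):
            continue                                      # culled before any buffering
        attributes = [shade_attributes(v) for v in tri]   # only survivors pay for attributes
        survivors.append((positions, attributes))
    return survivors
```

Everything downstream of this only ever sees front-facing triangles, and attributes are only computed (and buffered) for triangles that will actually reach the rasteriser.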
Got it. So those non-pos attributes are created by one of the pre-buffer shader stages and need to be stored temporarily. And, again, those are not stored in DRAM either. And there are probably not enough triangles in flight to warrant the round trip to DRAM or L2 cache to store them there.

What bugs me is that AMD could have spent the last few years experimenting with this on Fury X, RX 480 or whatever. There is nothing here that old hardware couldn't do.
Now that gets us back to part of my initial question: what kind of extra HW is needed for primitive shaders that wasn't already there?
 
Another 1980MHz Vega 64 on the Overclockers UK forum; however, it is behind a 1817MHz result with the same HBM2 memory speed:

https://forums.overclockers.co.uk/threads/firestrike-ultra-4k-benchmark.18629665/#post-27044551

14nm+ could help with clocks further.
From what I have been seeing of people overclocking RX Vega, it appears that core overclocking may be too bugged on the current drivers to draw any conclusions about its effectiveness. I'm seeing wildly inconsistent results from core overclocks compared to just setting a +50% power target, maxing out the HBM2 overclock, and undervolting to minimize throttling. I'm really looking forward to GN's upcoming video on undervolting Vega 56, as they appear to have the best grasp of the foibles of overclocking on the current drivers.
 
On a side note, it was mentioned that reviewers found an extra non-functional Vega package with their samples. If it's actual silicon, any chance someone could send one to the same person who got the Polaris and Fiji shots?
There are apparently some Zen pics now.
 
On a side note, it was mentioned that reviewers found an extra non-functional Vega package with their samples. If it's actual silicon, any chance someone could send one to the same person who got the Polaris and Fiji shots?
There are apparently some Zen pics now.
That would be fantastic. And we've had Zen shots directly from AMD for a while now.
 
Ok. So that's the benefit of merging multiple shaders into one: the buffer that you're talking about resides in between two earlier conventional shader stages.
In a conventional GPU, primitive assembly is fixed function and that's where fixed-function culling occurs (page 6 of the whitepaper). That stage needs buffering, and if you swamp the buffer with useless triangles which have lots of attributes, that's the saturation problem.
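A back-of-the-envelope illustration of the "lots of attributes" part; the buffer size and vertex layouts are invented, the point is just the scaling:

```python
# Invented sizes: a small on-die parameter buffer and two hypothetical vertex layouts.
BUFFER_BYTES = 64 * 1024           # assumed post-VS buffer capacity
POSITION_BYTES = 4 * 4             # xyzw position, fp32
ATTRIBUTE_BYTES = 8 * 4 * 4        # e.g. 8 extra vec4 attributes, fp32

def triangles_buffered(bytes_per_vertex, verts_per_triangle=3):
    return BUFFER_BYTES // (bytes_per_vertex * verts_per_triangle)

print("position-only  :", triangles_buffered(POSITION_BYTES), "triangles in flight")
print("full attributes:", triangles_buffered(POSITION_BYTES + ATTRIBUTE_BYTES), "triangles in flight")
```

Same buffer, roughly an order of magnitude fewer triangles in flight once every vertex drags its full attribute payload along; fill it with doomed triangles and the useful ones queue behind them.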

In a pipeline configured with a GS, there will need to be vertex buffering to hold the output of the VS while it queues to be assembled into workgroups for GS shading.

So primitive shader "replaces" either:
  • programmable VS + fixed-function culling (though I expect culling in primitive assembly isn't turned off)
  • programmable VS + programmable GS
Now, is your 99%/1% example realistic for some games or is it just for the sake of explanation? Would games optimize their triangle strips such that you get a more balanced ratio between culled and non-culled triangles?
99%/1% was illustrative. But a very high culling rate is realistic: e.g. very-high-poly-count animated characters have a lot of back-facing triangles. When you start shadow mapping with multiple light sources, each light will have an independent set of back-facing triangles.

But it's saturation that's the underlying problem. When you move the buffering problem into a primitive shader you now gain access to VGPR and LDS capacity: megabytes, so saturation is now a function of load-balancing and thread allocation.
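For a rough sense of scale, using the public GCN per-CU figures (4 SIMDs with a 64 KB vector register file each, plus 64 KB of LDS) applied to a 64-CU part; treat it as a ceiling, since not all of it is free for this:

```python
# Ballpark on-chip storage reachable by shader code on a 64-CU GCN/Vega part.
CUS = 64
VGPR_BYTES_PER_CU = 4 * 64 * 1024   # 4 SIMDs x 64 KB vector register file each
LDS_BYTES_PER_CU = 64 * 1024        # local data share per CU

total_mib = CUS * (VGPR_BYTES_PER_CU + LDS_BYTES_PER_CU) / (1024 * 1024)
print(f"~{total_mib:.0f} MiB of VGPR + LDS across the chip")  # ~20 MiB
```

That's orders of magnitude more than any small fixed-function parameter buffer, which is why the problem shifts from raw capacity to load balancing and thread allocation.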

Games can do all sorts of tricks to improve the ratio of desired versus culled triangles, e.g.
  • using level of detail (a reduced count of triangles for high-poly models that are distant from the camera, so a lower percentage of all scene triangles; see the sketch after this list)
  • occlusion querying (choosing not to draw triangles because they're known to be occluded, e.g. by terrain or other objects, or outside the view frustum)
  • using geometry shaders to perform developer-originated culling
sebbbi has written a lot on this topic.
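As a trivial sketch of the first trick (the distance thresholds and mesh names are placeholders of my own):

```python
# Hypothetical distance-based LOD selection: distant objects submit far fewer
# triangles, most of which would have been culled or ended up sub-pixel anyway.
LODS = [
    (10.0, "hero_mesh_lod0"),          # close to the camera: full detail
    (40.0, "hero_mesh_lod1"),          # mid range: reduced triangle count
    (float("inf"), "hero_mesh_lod2"),  # far away: lowest detail
]

def select_lod(distance_to_camera):
    for max_distance, mesh in LODS:
        if distance_to_camera < max_distance:
            return mesh
    return LODS[-1][1]  # fallback, unreachable with the inf sentinel above

print(select_lod(5.0), select_lod(25.0), select_lod(300.0))
```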

Do those buffers reside in DRAM/L2 cache or are we talking hard RAMs on the die? From what you write, it seems to be the latter.
Conventionally they'd be RAM near the fixed-function block that consumes the data. The fixed-function block is a choke point in a sense: all the data it receives can come from any of the programmable shader units, since all of the programmable shader units can run VS (or GS).

Got it. So those non-pos attributes are created by one of the pre-buffer shader stages and need to be stored temporarily. And, again, those are not stored in DRAM either. And there are probably not enough triangles in flight to warrant the round trip to DRAM or L2 cache to store them there.
Yes, primitive shading (mostly) obviates the capacity problem while dealing with in-flight vertices (triangles) and their shaded attributes until they are rasterised. Don't forget the rasteriser still has an input buffer: it is throughput limited, so it uses a buffer to smooth its handling of work, but this buffer can still be saturated even after primitive shader culling has already done a sterling job of reducing the effective triangle count.

GS was originally designed with a mode to stream out vertices to DRAM, ready for a second pass (or third etc.) through VS, before pixel shading (if there was even going to be pixel shading). So back in the D3D10 era, it was possible to come up with algorithms where stream out to DRAM was viable, but it was extremely rarely used.

NVidia's distributed geometry architecture has for years used L2 as an adaptive buffer to support the distribution of triangle data around the GPU after vertex shading.

Now that gets us back to part of my initial question: what kind of extra HW is needed for primitive shaders that wasn't already there?
None that I can discern.
 
Who says they weren't?

I can't help thinking that AMD seems to be late to the party. The primitive shader concept is applicable to all GPUs going right back to D3D10 at least.

Does anyone really believe NVidia isn't doing this already?

What bugs me is that AMD could have spent the last few years experimenting with this on Fury X, RX 480 or whatever. There is nothing here that old hardware couldn't do.

EDITED: late night sloppiness fixed

Does that mean that Nvidia also has primitive shaders?
I never read anything about this.
 
AMD's bridgman is posting some interesting tidbits on Phoronix:
AMD bridgman said:
Draw Stream Binning Rasterizer is not being used in the open drivers yet. Enabling it would be mostly in the amdgpu kernel driver, but optimizing performance with it would be mostly in radeonsi and game engines.

HBCC is not fully enabled yet, although we are using some of the foundation features like 4-level page tables and variable page size support (mixing 2MB and 4KB pages). On Linux we are looking at HBCC more for compute than for graphics, so SW implementation and exposed behaviour would be quite different from Windows where the focus is more on graphics. Most of the work for Linux would be in the amdgpu kernel driver.

Primitive Shader support - IIRC this is part of a larger NGG feature (next generation geometry). There has been some initial work done for primitive shader support IIRC but don't know if anything has been enabled yet. I believe the work would mostly be in radeonsi but haven't looked closely.

For both DSBR and NGG/PS I expect we will follow the Windows team's efforts, while I expect HBCC on Linux will get worked on independently of Windows efforts.
https://www.phoronix.com/forums/for...opengl-proprietary-driver?p=970697#post970697
 
If it's just code, why did AMD push this so much in slides and at the "tech launch" a few months ago? And can they "enable" this on Fiji and Polaris too?
 
AMD's words on that slide were:
Primitive Shaders
New hardware shader stage combining vertex and primitive phases
  • Enables early primitive culling in shaders
  • Faster processing of content with high culling potential
  • Faster rendering of depth pass
  • Speed-up for vertex shaders with attribute computations
A world of potential uses
  • Shadow maps
  • Multi-view and multi-resolution rendering
  • Particles
Hm, they explicitly say "hardware shader stage". But given how carefully you've had to weigh their wording lately... Maybe it's just worded this way because it only made sense to enable this now that the geometry engines can share data via the L2 cache.
 
Nvidia can do 8?
Hard to tell when their math changes and they don't give details on how they arrive at that number.
 