AMD Vega 10, Vega 11, Vega 12 and Vega 20 Rumors and Discussion

I expect NVidia is doing the same thing. I am going to guess NVidia has been doing it for a long time.

It's just code.
Wouldn't be surprised if that's how they bin and create tiles.

But AMD quotes that they can discard 17 primitives per clock. Nvidia can only do 8. So where does the advantage come from?
That 17 was only 11 a few months ago, so AMD is still finding ways to evaluate or cull more. Best guess is the packed math and the scalar unit being a bit more versatile. Backface culling at really low precision, as that alone should remove more than half.
 
And that implies a hard limit of 8 in your eyes? Ok, so be it.

I'd say we're looking at a different limitation here. I would think the R/W rate of the L2 cache partitions (not the aggregate!) might be limiting.
 
But AMD quotes that they can discard 17 primitives per clock. Nvidia can only do 8. So where does the advantage come from?
Well, going from the bandwidth figures and assuming one triangle per vertex and just X, Y, Z for the vertex (12 bytes)... so say a long non-indexed strip. A 1733 MHz chip with 4 triangles per clock will burn through over 83 GB/s of bandwidth on input assembly alone. If it's indexed geometry (assuming 32-bit indices), that figure will double.
That's without any drawing. I'm just pointing this out because once you get to these insanely high primitive rates some weird stuff will start popping out.
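For reference, here is the back-of-the-envelope arithmetic behind that figure (my own check, assuming exactly one 12-byte vertex fetched per triangle):

```python
clock_hz = 1733e6        # 1733 MHz
tris_per_clock = 4
bytes_per_vertex = 12    # x, y, z as 32-bit floats
# Long non-indexed strip: roughly one new vertex fetched per triangle.
bytes_per_second = clock_hz * tris_per_clock * bytes_per_vertex
print(bytes_per_second / 1e9)  # ~83.2 GB/s on input assembly alone
```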
 
AMD's words on that slide were:
Primitive Shaders
New hardware shader stage combining vertex and primitive phases
Enables early primitive culling in shaders
Faster processing of content with high culling potential
Faster rendering of depth pass
Speed-up for vertex shaders with attribute computations
A world of potential uses
Shadow maps
Multi-view and multi-resolution rendering
Particles
Hm, they explicitly say „hardware shader stage“. But given how carefully you've had to weigh their wording lately... Maybe it's just worded this way because it made sense to enable this now that the geometry engines can share data via the L2 cache.
Vertex shader is a hardware shader stage. So is a fragment shader stage. And so on...

It's "hardware" because the GPU is cognisant of the type of shader and can use that, as well as the data associated with each thread for that shader type, as inputs into load-balancing. The hardware also knows how to connect a source of data for a stage with the stage itself and then how to connect the results from that stage with the next stage (or the buffer for that stage).

When you look at how they actually work, all these types of shader are just code. You populate a buffer and/or some registers with the right data, you optionally put some other data into LDS and voila, you have all the data that a "hardware stage" shader requires.

When a game developer writes a compute shader to do the same job as the primitive shader, they are responsible for setting up the data connections and working out how it should be load balanced.
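A rough sketch of what that developer-side version looks like (my illustration with hypothetical buffers, using numpy to stand in for a compute dispatch; this is not AMD's primitive shader): you read the index and position buffers yourself, cull, and write out a compacted index buffer plus a count that would feed an indirect draw.

```python
import numpy as np

def cull_and_compact(indices, positions):
    # Hand-rolled "culling compute pass": the shader is just code reading
    # buffers; nothing here is a fixed-function hardware stage.
    tris = indices.reshape(-1, 3)
    v0, v1, v2 = positions[tris[:, 0]], positions[tris[:, 1]], positions[tris[:, 2]]
    # Backface test: sign of the 2D signed area in screen space.
    area2 = ((v1[:, 0] - v0[:, 0]) * (v2[:, 1] - v0[:, 1]) -
             (v2[:, 0] - v0[:, 0]) * (v1[:, 1] - v0[:, 1]))
    visible = area2 > 0.0  # assuming counter-clockwise front faces
    # Compacting the survivors and producing the draw count is exactly the
    # data plumbing and load balancing the GPU does for you when the shader
    # runs as a "hardware" stage.
    return tris[visible].ravel(), int(visible.sum())
```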

So the hardware aspect here is controlling how to start and feed the type of shader (vertex, fragment, primitive etc.) and what to do with the data it produces. The shader itself is just code.

A primitive shader accepts data just like a vertex shader does. It outputs data just like a geometry shader does (if there is one defined by the programmer when setting up the pipeline, otherwise, just like a vertex shader does).

The white paper also refers to a surface shader. A surface shader accepts data in the same way as a vertex shader and outputs data just like a hull shader does.

Both of these are examples of a type of shader that already fits into the model of the graphics pipeline that the hardware has been designed to process. Vertices, patches and triangles are well defined already. So these new shader types are really just a re-configuration of the hardware, working with geometry-related data types that the GPU already knows how to handle.
 
Well, going from the bandwidth figures and assuming one triangle per vertex and just X, Y, Z for the vertex (12 bytes)... so say a long non-indexed strip. A 1733 MHz chip with 4 triangles per clock will burn through over 83 GB/s of bandwidth on input assembly alone. If it's indexed geometry (assuming 32-bit indices), that figure will double.
That's without any drawing. I'm just pointing this out because once you get to these insanely high primitive rates some weird stuff will start popping out.
A case where you perform multi-frustum rendering, e.g. for VR:

Single Pass Stereo

should benefit greatly from "primitive shader" functionality.
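As a crude illustration of why (my sketch with hypothetical matrices, not how the hardware actually does it): the triangle is fetched and assembled once, then tested against each view's frustum, so the per-view culling cost is amortised across all views.

```python
import numpy as np

def views_needing_triangle(v0, v1, v2, view_proj_matrices):
    # Fetch/assemble the triangle once, then do a trivial frustum reject per
    # view; only the views where it survives would actually get the triangle.
    surviving = []
    for view_index, m in enumerate(view_proj_matrices):
        clip = [m @ np.append(v, 1.0) for v in (v0, v1, v2)]
        outside_left = all(c[0] < -c[3] for c in clip)   # all verts left of the frustum
        outside_right = all(c[0] > c[3] for c in clip)   # all verts right of the frustum
        if not (outside_left or outside_right):
            surviving.append(view_index)
    return surviving
```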
 
And that implies a hard limit of 8 in your eyes? Ok, so be it.

I'd say we're looking at a different limitation here. I would think the R/W rate of the L2 cache partitions (not the aggregate!) might be limiting.
It looks like a hard limit. If you look at the values for the 1080 Ti and the 1080, they have nearly the same limits. So this seems to be a hardware limit of the Pascal architecture.

@MDolance
AMD stated that they do culling before vertex data is written. So the bandwidth should not be that high?
 
I thought it was already known that GP104 has a 6 triangle setup limit? Compared to AMD's 4.
 
It's in the slide that says "over 2x peak throughput per clock", which seems to be what @Ryan Smith is commenting on when talking about the 11 polygons per clock.
So, we are still missing any possible reference to „draw“?

Let me help you with your link, where Ryan says quite clearly where he got this information. Which is, btw, why he put it in quotation marks - because he is not commenting, he is quoting.
->„And while AMD's presentation and comments itself don't go into detail on how they achieved this increase in throughput, buried in the footnote for AMD's slide deck is this nugget: "Vega is designed to handle up to 11 polygons per clock with 4 geometry engines."
[my bold]
 
So, we are still missing any possible reference to „draw“?
You're suggesting that Vega 10 at 1.5 GHz is discarding 16.5 billion triangles per second?
At a generous 60 FPS, that's 275 million triangles per frame. Does discarding 275M triangles per frame even make any sense?
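For reference, the arithmetic behind those numbers (my own check, taking the 1.5 GHz clock and 11 primitives per clock at face value):

```python
clock_hz = 1.5e9
prims_per_clock = 11
prims_per_second = clock_hz * prims_per_clock   # 16.5 billion per second
prims_per_frame = prims_per_second / 60         # ~275 million at 60 FPS
print(prims_per_second, prims_per_frame)
```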
 
You're suggesting that Vega 10 at 1.5 GHz is discarding 16.5 billion triangles per second?
At a generous 60 FPS, that's 275 million triangles per frame. Does discarding 275M triangles per frame even make any sense?
I am not suggesting anything, just going by the most recent information published by AMD and not reading anything into their marketing slides that is not in there.

You are the one making the assertions, i.e. „draw“, even though you put question marks behind them.
 