AMD Vega 10, Vega 11, Vega 12 and Vega 20 Rumors and Discussion

Anarchist4000 · Aug 19, 2017

Jawed said:
I expect NVidia is doing the same thing. I am going to guess NVidia has been doing it for a long time.

It's just code.

Wouldn't be surprised if that's how they bin and create tiles.

Digidi said:
But AMD says they can discard 17 primitives per clock. Nvidia can only do 8. So where does the advantage come from?

That 17 was only 11 a few months ago, so AMD is still finding ways to evaluate or cull more. My best guess is that the packed math and the scalar unit are a bit more versatile. Possibly backface culling at really low precision, as that should remove more than half.
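For readers unfamiliar with the test being discussed, here is a minimal sketch of backface culling via the sign of the screen-space triangle area. It assumes 2D post-projection vertices and a counter-clockwise front-face convention; the function names are hypothetical and this is not a description of how Vega actually implements it. Only the sign of the area matters, which is why such a test can tolerate reduced precision.

```python
def signed_area_2x(v0, v1, v2):
    """Twice the signed area of a screen-space triangle given as (x, y) tuples."""
    return (v1[0] - v0[0]) * (v2[1] - v0[1]) - (v2[0] - v0[0]) * (v1[1] - v0[1])


def is_backfacing(v0, v1, v2):
    # Assumption: counter-clockwise winding (positive area) is front-facing.
    return signed_area_2x(v0, v1, v2) < 0


def cull_backfaces(triangles):
    """Return only the triangles that are not backfacing."""
    return [tri for tri in triangles if not is_backfacing(*tri)]


triangles = [
    ((0, 0), (4, 0), (0, 4)),  # counter-clockwise: kept
    ((0, 0), (0, 4), (4, 0)),  # clockwise: culled
]
kept = cull_backfaces(triangles)
print(len(kept), "of", len(triangles), "triangles survive")
```

Zero-area triangles pass this particular test; a real pipeline would typically discard those separately.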

Digidi · Aug 19, 2017

CarstenS said:
Nvidia can do 8?
Hard to tell when their math changes and they don't give details on how they arrive at that number.

Yes, if you look at your Beyond3D Suite test, Nvidia can do 11421 MTriangles/s. With a clock speed of 1835 MHz, you get 6.2 triangles per clock.
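The division behind that figure, as a small sketch. The two inputs are the numbers quoted in the post; the function name is made up for illustration.

```python
def triangles_per_clock(mtriangles_per_s, clock_mhz):
    # MTriangles/s divided by MHz: the factors of 10^6 cancel out.
    return mtriangles_per_s / clock_mhz


rate = triangles_per_clock(11421, 1835)
print(f"{rate:.2f} triangles per clock")  # prints "6.22 triangles per clock"
```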

http://www.pcgameshardware.de/Radeo...3/Tests/Benchmark-Preis-Release-1235445/3/#a1

CarstenS · Aug 19, 2017

And that implies a hard limit of 8 in your eyes? Ok, so be it.

I'd say we're looking at a different limitation here. I would think the R/W rate of the L2 cache partitions (not aggregate!) might be limiting.

MDolenc · Aug 19, 2017

Digidi said:
But AMD says they can discard 17 primitives per clock. Nvidia can only do 8. So where does the advantage come from?

Well, going from the bandwidth figures and assuming one triangle per vertex and just X, Y, Z for the vertex (12 bytes)... So say a long non-indexed strip. A 1733 MHz chip with 4 triangles per clock will burn over 83 GB/s of bandwidth on input assembly alone. If it's indexed geometry (assuming 32-bit indices), that figure will double.
That's without any drawing. I'm just pointing this out because once you get to these insanely high primitive rates some weird stuff will start popping out.
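The bandwidth estimate above can be reproduced with a short sketch. The clock, triangle rate and 12-byte vertex come from the post. The indexed case assumes three 32-bit indices per triangle (12 extra bytes), which is one reading that yields the stated doubling; the post does not spell that out.

```python
def input_bandwidth_gb_s(clock_mhz, tris_per_clock, bytes_per_tri):
    """Input-assembly bandwidth in GB/s (1 GB = 10^9 bytes)."""
    return clock_mhz * 1e6 * tris_per_clock * bytes_per_tri / 1e9


VERTEX_BYTES = 3 * 4  # X, Y, Z as 32-bit floats
INDEX_BYTES = 3 * 4   # assumption: three 32-bit indices per triangle

non_indexed = input_bandwidth_gb_s(1733, 4, VERTEX_BYTES)
indexed = input_bandwidth_gb_s(1733, 4, VERTEX_BYTES + INDEX_BYTES)

print(f"non-indexed: {non_indexed:.1f} GB/s")  # prints "non-indexed: 83.2 GB/s"
print(f"indexed:     {indexed:.1f} GB/s")      # prints "indexed:     166.4 GB/s"
```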

Jawed · Aug 19, 2017

CarstenS said:
AMD's words on that slide were:

Primitive Shaders

New hardware shader stage combining vertex and primitive phases
Enables early primitive culling in shaders

Faster processing of content with high culling potential
Faster rendering of depth pass
Speed up for vertex shaders with attribute computations
A world of potential uses

Shadow maps
Multi-view and multi-resolution rendering
Particles

Hm, they explicitly say „hardware shader stage“. But given how carefully you have had to weigh every word lately... Maybe it's just worded this way because enabling it made sense now that the geometry engines can share data via the L2 cache.

Vertex shader is a hardware shader stage. So is a fragment shader stage. And so on...

It's "hardware" because the GPU is cognisant of the type of shader and can use that, as well as the data associated with each thread for that shader type, as inputs into load-balancing. The hardware also knows how to connect a source of data for a stage with the stage itself and then how to connect the results from that stage with the next stage (or the buffer for that stage).

When you look at how they actually work, all these types of shader are just code. You populate a buffer and/or some registers with the right data, you optionally put some other data into LDS and voila, you have all the data that a "hardware stage" shader requires.

When a game developer writes a compute shader to do the same job as the primitive shader, they are responsible for setting up the data connections and working out how it should be load balanced.

So the hardware aspect here is controlling how to start and feed the type of shader (vertex, fragment, primitive etc.) and what to do with the data it produces. The shader itself is just code.

A primitive shader accepts data just like a vertex shader does. It outputs data just like a geometry shader does (if there is one defined by the programmer when setting up the pipeline, otherwise, just like a vertex shader does).

The white paper also refers to a surface shader. A surface shader accepts data in the same way as a vertex shader and outputs data just like a hull shader does.

Both of these are examples of a type of shader that already fits into the model of the graphics pipeline that the hardware has been designed to process. Vertices, patches and triangles are well defined already. So these new shader types are really just a re-configuration of the hardware, working with geometry-related data types that the GPU already knows how to handle.

Jawed · Aug 19, 2017

MDolenc said:
Well, going from the bandwidth figures and assuming one triangle per vertex and just X, Y, Z for the vertex (12 bytes)... So say a long non-indexed strip. A 1733 MHz chip with 4 triangles per clock will burn over 83 GB/s of bandwidth on input assembly alone. If it's indexed geometry (assuming 32-bit indices), that figure will double.
That's without any drawing. I'm just pointing this out because once you get to these insanely high primitive rates some weird stuff will start popping out.

The case where you render to multiple frustums, e.g. for VR (Single Pass Stereo), should benefit greatly from "primitive shader" functionality.

Digidi · Aug 19, 2017

CarstenS said:
And that implies a hard limit of 8 in your eyes? Ok, so be it.

I'd say we're looking at a different limitation here. I would think the R/W rate of the L2 cache partitions (not aggregate!) might be limiting.

It looks like a hard limit. If you compare the values for the 1080 Ti and the 1080, they have nearly the same limits. So this seems to be the hardware limit of the Pascal architecture.

@MDolenc
AMD stated that they do culling before vertex data is written. So the bandwidth should not be that high?

Malo · Aug 19, 2017

I thought it was already known that GP104 has a 6 triangle setup limit? Compared to AMD's 4.

Digidi · Aug 19, 2017

It's not about triangle setup. It's about culling, and AMD has the primitive shader. If it's activated, AMD can put out 17 polygons per clock because of fast culling.

CarstenS · Aug 19, 2017

Discard.

Deleted member 13524 · Aug 20, 2017

CarstenS said:
Discard.

So it can discard 17 polygons but draw 11 polygons per clock?

3dcgi · Aug 20, 2017

Malo said:
I thought it was already known that GP104 has a 6 triangle setup limit? Compared to AMD's 4.

GP104 can rasterize 4 triangles. You're thinking of the bigger chip.

ToTTenTranz said:
So it can discard 17 polygons but draw 11 polygons per clock?

Forget you ever heard 11 per clock. That number shouldn't have been published. It was referring to discard rate though.

Malo · Aug 20, 2017

3dcgi said:
GP104 can rasterize 4 triangles. You're thinking of the bigger chip.

Gotcha, thanks for the clarification.

CarstenS · Aug 20, 2017

ToTTenTranz said:
So it can discard 17 polygons but draw 11 polygons per clock?

Where does your link say „draw“? Anandtech quotes quite clearly what AMD meant: "Vega is designed to handle up to 11 polygons per clock with 4 geometry engines."

Digidi · Aug 20, 2017

It's up to 17 per clock in the white paper (page 7) and should come with driver 17.320 (footnote 7 on the last page).
http://radeon.com/_downloads/vega-whitepaper-11.6.17.pdf

@Rys we will see the advantage in your Beyond3D suite.

Deleted member 13524 · Aug 20, 2017

CarstenS said:
Where does your link say „draw“?

In the slide that says "over 2x peak throughput per clock", which seems to be what @Ryan Smith is commenting on when talking about the 11 polygons per clock.

CarstenS · Aug 20, 2017

ToTTenTranz said:
In the slide that says "over 2x peak throughput per clock", which seems to be what @Ryan Smith is commenting on when talking about the 11 polygons per clock.

So, we are still missing any possible reference to „draw“?

Let me help you with your link, where Ryan says quite clearly where he got this information. Which is, btw, why he put it in quotation marks: because he does not comment, he quotes.
->„And while AMD's presentation and comments itself don't go into detail on how they achieved this increase in throughput, buried in the footnote for AMD's slide deck is this nugget: "Vega is designed to handle up to 11 polygons per clock with 4 geometry engines."“
[my bold]

Deleted member 13524 · Aug 20, 2017

CarstenS said:
So, we are still missing any possible reference to „draw“?

You're suggesting that Vega 10 at 1.5 GHz is discarding 16.5 billion triangles per second?
At a generous 60 FPS, that's 275 million triangles per frame. Does discarding 275M triangles per frame even make any sense?
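The arithmetic in that post, as a small sketch. The clock, the 11 per clock figure and the 60 FPS target are the post's own numbers; the 17 per clock line uses the white paper figure cited earlier in the thread, and the function name is hypothetical.

```python
def polygons_per_frame(clock_hz, polys_per_clock, fps):
    """Peak polygons handled per frame at a given clock and frame rate."""
    per_second = clock_hz * polys_per_clock
    return per_second / fps


CLOCK_HZ = 1.5e9
FPS = 60

per_second_11 = CLOCK_HZ * 11                       # 16.5 billion per second
per_frame_11 = polygons_per_frame(CLOCK_HZ, 11, FPS)
per_frame_17 = polygons_per_frame(CLOCK_HZ, 17, FPS)

print(f"{per_second_11 / 1e9:.1f} billion/s")         # prints "16.5 billion/s"
print(f"{per_frame_11 / 1e6:.0f} million per frame")  # prints "275 million per frame"
print(f"{per_frame_17 / 1e6:.0f} million per frame")  # prints "425 million per frame"
```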

Digidi · Aug 20, 2017

Yes, it makes sense. AMD promoted a sample scene from Deus Ex with 220 million polygons in the scene, of which you later see only 2 million.
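A rough sanity check of those scene figures. It assumes the 1.5 GHz clock used earlier in the thread and treats the white paper's 17 per clock as a sustained discard rate, which is a best case and not a measured result.

```python
SCENE_POLYGONS = 220e6    # polygons submitted, per the post
VISIBLE_POLYGONS = 2e6    # polygons that end up visible, per the post
CLOCK_HZ = 1.5e9          # assumption: clock used earlier in the thread
DISCARDS_PER_CLOCK = 17   # assumption: white paper peak, sustained

culled = SCENE_POLYGONS - VISIBLE_POLYGONS
culled_fraction = culled / SCENE_POLYGONS
cull_time_ms = culled / (CLOCK_HZ * DISCARDS_PER_CLOCK) * 1e3

print(f"culled: {culled_fraction:.1%} of the scene")
print(f"best-case time spent discarding: {cull_time_ms:.2f} ms")
```

Under these assumptions, discarding alone takes roughly half of a 16.7 ms frame at 60 FPS, so the peak discard rate is not an absurd number for such a scene.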

https://www.pcper.com/reviews/Graph...w-Redesigned-Memory-Architecture/Primitive-Sh

CarstenS · Aug 20, 2017

ToTTenTranz said:
You're suggesting that Vega 10 at 1.5 GHz is discarding 16.5 billion triangles per second?
At a generous 60 FPS, that's 275 million triangles per frame. Does discarding 275M triangles per frame even make any sense?

I am not suggesting anything, just going by the most recent information published by AMD and not interpreting anything into their marketing slides which is not in there.

You are the one making the assertions, i.e. „draw“, even though you put question marks after them.
