AMD Vega 10, Vega 11, Vega 12 and Vega 20 Rumors and Discussion

In synthetic tests, Polaris gets 2x the geometry throughput of previous-gen cards, if I'm not mistaken.
It does not, except in pretty specific circumstances with triangles well below 1 pixel in size (like 20 triangles per pixel), as I mentioned before. Even then, each engine never exceeds 1 triangle per clock; older GPUs are just pretty slow in that range.

[Two benchmark graphs]
 
But IMHO Razor has a point. AMD claimed over 2x for Polaris compared to Fiji and claims up to 2.7x for Vega, also compared to Fiji, so it makes sense that the output will be similar to Polaris.
 
Why 11, particularly if a shader is involved?
Even for fixed hardware, the numbers usually rounded out pretty well. Fiji has 4 geometry engines, 4 triangles per clock. Smaller GPUs might drop to fractional rates for a single engine.
Why 11/4? I don't see why a shader would hit this restriction, particularly if more than one wavefront is running per shader engine. If the front ends are equivalent, what is stopping it from being 12?
 
Maybe it is just better culling, and 11 is somehow the maximum culling efficiency. It might require very small triangles, with 64% of them being culled. This would be similar to Polaris, which also achieves its 2x gain only with very, very small triangles.
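Where could a figure like 64% come from? A minimal sketch, assuming (as later posts in the thread suppose) that the four engines can handle up to 11 triangles per clock but still draw at most 4: everything above the draw rate has to be culled, which gives 7/11. The helper name `min_cull_fraction` is made up for illustration.

```python
from fractions import Fraction

def min_cull_fraction(handled_per_clock: int, drawn_per_clock: int) -> Fraction:
    """Smallest share of incoming triangles that must be culled so the front end
    can keep handling `handled_per_clock` while drawing at most `drawn_per_clock`."""
    if handled_per_clock <= drawn_per_clock:
        return Fraction(0)
    return Fraction(handled_per_clock - drawn_per_clock, handled_per_clock)

share = min_cull_fraction(11, 4)  # assumption: 11 handled, 4 drawn per clock
print(share, f"{float(share):.1%}")  # 7/11 63.6%
```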
 
But IMHO Razor has a point. AMD claimed over 2x for Polaris compared to Fiji.
Where? I checked most of their presentations, but not a single one compares Polaris to Fiji in any aspect. They compared Polaris to Pitcairn, Tonga and even Hawaii in some aspects, but never to Fiji. The only chart related to geometry performance compares Polaris (Prim Discard Off) to Polaris (Prim Discard On).
 
Why 11, particularly if a shader is involved?
Even for fixed hardware, the numbers usually rounded out pretty well. Fiji has 4 geometry engines, 4 triangles per clock. Smaller GPUs might drop to fractional rates for a single engine.
Why 11/4? I don't see why a shader would hit this restriction, particularly if more than one wavefront is running per shader engine. If the front ends are equivalent, what is stopping it from being 12?
For a shader, you are completely right: it seems almost impossible for this to be a hard cap in a shader-based solution.
And for fixed-function hardware? Maybe they do it in a more hierarchical fashion and move the culling closer to the CUs, as nV is doing with their polymorph engines. Imagine each group of CUs (sharing an L1-I$ and L1-sD$) has such an engine. That would make at least 16 such groups in Vega10 (potentially a few more, as there appears to be a trend toward making the groups a bit smaller [more bandwidth per CU?]). If such an engine processed 2 triangles every 3 clock cycles or so (or, to float a wild idea, if the shader core ran at 1.5GHz while some other parts stayed at ~1GHz), you would end up pretty close to 11. I guess there are other ways to arrive at this seemingly strange number.
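The arithmetic behind the hierarchical idea, as a sketch: `aggregate_rate` is a hypothetical helper, and the group counts and the 2-triangles-per-3-clocks rate are the speculative numbers from the post, not known Vega specs.

```python
from fractions import Fraction

def aggregate_rate(groups: int, tris: int, clocks: int) -> Fraction:
    """Triangles per clock if each of `groups` culling units handles
    `tris` triangles every `clocks` cycles (hypothetical per-CU-group engines)."""
    return Fraction(groups * tris, clocks)

# Speculative inputs from the post: 16 or more CU groups, 2 triangles every 3 clocks
for groups in (16, 17, 18):
    rate = aggregate_rate(groups, 2, 3)
    print(groups, rate, f"{float(rate):.2f}")
```

With 16 groups this lands at about 10.7 triangles per clock; two more groups would already reach 12, so the guess is sensitive to the exact group count.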
 
I think the key here is that the four geometry engines can now work together (i.e. share geometry data for 11 triangles out of 12 vertices).

Edit: FWIW, here is the wording of the footnote on the geometry throughput slide: "Data based on AMD Engineering design of Vega. Radeon R9 Fury X has 4 geometry engines and a peak of 4 polygons per clock. Vega is designed to handle up to 11 polygons per clock with 4 geometry engines. This represents an increase of 2.6x."

Note the use of "handle up to 11 polygons per clock".
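Taking the footnote's per-clock peaks at face value, a quick check of the ratio; it comes out slightly above the 2.6x the footnote states, and the footnote doesn't say what accounts for the difference.

```python
from fractions import Fraction

FIJI_PEAK = 4   # polygons per clock for Radeon R9 Fury X, per the footnote
VEGA_PEAK = 11  # "up to 11 polygons per clock", per the footnote

ratio = Fraction(VEGA_PEAK, FIJI_PEAK)
print(ratio, float(ratio))  # 11/4 2.75
```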
 
Where? I checked most of their presentations, but not a single one compares Polaris to Fiji in any aspect. They compared Polaris to Pitcairn, Tonga and even Hawaii in some aspects, but never to Fiji. The only chart related to geometry performance compares Polaris (Prim Discard Off) to Polaris (Prim Discard On).

Yes, but Fiji's peak throughput is no better than Hawaii's. Apart from that, in the tech PDF for Polaris they even claim over 3x better geometry performance compared to earlier generations, but only under tessellation. So for now I take the "up to 2.7x" with a grain of salt; it is impossible to say what it will deliver under which circumstances.
 
I think the key here is that the four geometry engines can now work together (i.e. share geometry data for 11 triangles out of 12 vertices).

Edit: FWIW, here is the wording of the footnote on the geometry throughput slide: "Data based on AMD Engineering design of Vega. Radeon R9 Fury X has 4 geometry engines and a peak of 4 polygons per clock. Vega is designed to handle up to 11 polygons per clock with 4 geometry engines. This represents an increase of 2.6x."

Note the use of "handle up to 11 polygons per clock".
Nice idea (though it would probably need 13 vertices, as a triangle strip needs N+2 vertices for N triangles). Anyway, such "handling" could help quite a bit for dense (tessellated) meshes (they fall predominantly in the same screen tile), even if that means only up to 11 triangles per clock can be culled and 4 drawn. NV GPUs aren't really better in this respect (see hardware.fr benchmarks above); they also can't draw as many triangles as they can cull. So I wouldn't see it as that much of a marketing stunt.
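The N+2 rule, sketched: a hypothetical `strip_triangles` helper expands a strip into independent triangles. The winding fix-up on odd triangles is one common convention; a given API may order the indices differently while preserving the same winding.

```python
def strip_triangles(strip: list[int]) -> list[tuple[int, int, int]]:
    """Expand a triangle-strip vertex sequence into separate triangles.

    Every vertex after the first two closes one triangle with the two before it,
    so N triangles need N + 2 vertices. Odd triangles swap their first two
    indices so that all triangles keep the same winding order.
    """
    tris = []
    for i in range(len(strip) - 2):
        a, b, c = strip[i], strip[i + 1], strip[i + 2]
        tris.append((a, b, c) if i % 2 == 0 else (b, a, c))
    return tris

tris = strip_triangles(list(range(13)))  # 13 vertices
print(len(tris))   # 11
print(tris[:3])    # [(0, 1, 2), (2, 1, 3), (2, 3, 4)]
```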
 
Apart from that, in the tech PDF for Polaris they even claim over 3x better geometry performance compared to earlier generations…
No.

"The new filtering algorithm can improve performance by up to 3.5X (fig. 6), and the benefits are more pronounced in scenes with many polygons."

fig. 6 shows performance with and without Primitive Discard:
[Image: fig. 6]


"Based on AMD internal small prim filter test as of 6/14/2016. Primitive assembly rates with prim filter ON vs. OFF: 18 tri/px (3.947 vs. 1.255), 32 tri/px (3.901 vs. 1.773), 50 tri/px (3.760 vs. 1.402), 72 tri/px (3.303 vs. 1.187), 98 tri/px (3.928 vs. 1.171), 128 tri/px (3.870 vs. 1.171). System configuration: Radeon™ RX 480, Core i7-6700K, 16GB DDR4-2666, Windows 10 x64, Radeon™ Software 16.5.2."

Not a single slide compares geometry performance of Polaris to any previous gen GCN. Feel free to prove me wrong.
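The footnote's own numbers can be turned into per-density speedups. A quick sketch; it treats the figures as primitives per clock, which is how the posts here read them (the footnote itself gives no unit).

```python
# (triangles per pixel, rate with prim filter ON, rate with prim filter OFF),
# copied from the RX 480 footnote quoted above
DATA = [
    (18, 3.947, 1.255),
    (32, 3.901, 1.773),
    (50, 3.760, 1.402),
    (72, 3.303, 1.187),
    (98, 3.928, 1.171),
    (128, 3.870, 1.171),
]

# Speedup from the filter at each triangle density
speedups = {density: on / off for density, on, off in DATA}

for density, s in speedups.items():
    print(f"{density:>3} tri/px: {s:.2f}x")

best = max(speedups, key=speedups.get)
print(f"largest speedup: {speedups[best]:.2f}x at {best} tri/px")
print(f"highest rate with filter ON: {max(on for _, on, _ in DATA)}")
```

By this reading the largest ratio is about 3.35x (at 98 tri/px), and no rate with the filter on reaches 4, in line with each of the four engines never exceeding 1 triangle per clock.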
 
Nice idea (though it would probably need 13 vertices, as a triangle strip needs N+2 vertices for N triangles). Anyway, such "handling" could help quite a bit for dense (tessellated) meshes (they fall predominantly in the same screen tile), even if that means only up to 11 triangles per clock can be culled and 4 drawn.
I think I can do with 12.
[Image: 11 triangles sharing 12 vertices]

And at least for a triangle fan it'd be n+1 in any case.
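The n+1 count holds for a closed fan, where the last triangle reuses the first rim vertex. A minimal sketch with a hypothetical `fan_triangles` helper: 12 vertices (hub plus 11 rim vertices) give 10 triangles as an open fan and 11 once the fan is closed. Submitted as a non-indexed fan primitive, closing it means sending the first rim vertex a second time, so it is 12 distinct vertices but 13 submitted ones.

```python
def fan_triangles(hub: int, rim: list[int], closed: bool = False) -> list[tuple[int, int, int]]:
    """Triangles of a fan around `hub` over the rim vertices in order.

    An open fan over k rim vertices yields k - 1 triangles (N + 2 vertices for
    N triangles); closing it back to the first rim vertex adds one more (N + 1).
    """
    tris = [(hub, rim[i], rim[i + 1]) for i in range(len(rim) - 1)]
    if closed and len(rim) > 2:
        tris.append((hub, rim[-1], rim[0]))
    return tris

rim = list(range(1, 12))  # 11 rim vertices; with hub 0 that is 12 vertices
print(len(fan_triangles(0, rim)))               # 10 (open fan)
print(len(fan_triangles(0, rim, closed=True)))  # 11 (closed fan)
```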

NV GPUs aren't really better than this (see hardware.fr benchmarks above), they also can't draw as many triangles as they can cull. So I wouldn't see it that much as a marketing stunt.
The marketing stunt I was referring to would have been the 2.6× improvement teaser over the prior-generation Fiji: since the VTF rate would not necessarily improve, the 2.6× might apply only in limited use cases.
 
No.

"The new filtering algorithm can improve performance by up to 3.5X (fig. 6), and the benefits are more pronounced in scenes with many polygons."

fig. 6 shows performance with and without Primitive Discard:
[Image: fig. 6]


"Based on AMD internal small prim filter test as of 6/14/2016. Primitive assembly rates with prim filter ON vs. OFF: 18 tri/px (3.947 vs. 1.255), 32 tri/px (3.901 vs. 1.773), 50 tri/px (3.760 vs. 1.402), 72 tri/px (3.303 vs. 1.187), 98 tri/px (3.928 vs. 1.171), 128 tri/px (3.870 vs. 1.171). System configuration: Radeon™ RX 480, Core i7-6700K, 16GB DDR4-2666, Windows 10 x64, Radeon™ Software 16.5.2."

Not a single slide compares geometry performance of Polaris to any previous gen GCN. Feel free to prove me wrong.
And as I said, AMD's numbers don't show any geometry rate above 4 primitives/clock. And this performance delta was measured with ridiculously densely tessellated meshes (and 4xAA on top of it for some reason, probably because it slows down the process without primitive discard even more). Look at the number of triangles per pixel for the different bars!
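Why densities like these favor discard, as a rough bound: with 4xAA there are 4 sample points per pixel, so in a single-layer, watertight mesh at most 4 triangles per pixel can cover any sample at all. This sketch assumes nothing about how AMD's filter actually decides what to discard; it only bounds how many triangles could possibly produce coverage. The helper name is hypothetical.

```python
def max_covering_fraction(tris_per_pixel: float, samples_per_pixel: int) -> float:
    """Upper bound on the share of triangles that can cover at least one sample.

    Assumes a single-layer, watertight mesh: every sample point lies inside at
    most one triangle, so at most `samples_per_pixel` triangles per pixel can
    produce coverage, no matter how many triangles there are.
    """
    return min(1.0, samples_per_pixel / tris_per_pixel)

# Densities from AMD's footnote, with 4xAA (4 samples per pixel)
for density in (18, 32, 50, 72, 98, 128):
    share = max_covering_fraction(density, 4)
    print(f"{density:>3} tri/px: at most {share:.1%} of triangles cover a sample")
```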
 
I think I can do with 12.
[Image: 11 triangles sharing 12 vertices]
Well, the topology of that geometry isn't inherently clear, i.e. one would need an additional index buffer or something to clarify which vertices should be connected to form a triangle (there are a lot of possibilities to form different triangles from the same 12 vertices). I don't know if one could use this in the general case. Triangle strips don't suffer from this; they are unambiguous without needing additional information.
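Two numbers make the trade-off concrete. The explicit topology costs indices: 11 triangles take 13 indices as a strip but 33 as an indexed list. And for a patch that is topologically a disk (no holes), Euler's formula fixes the triangle count at T = b + 2i - 2 (b outline vertices, i interior ones), so 11 triangles from 12 vertices needs exactly one interior vertex, while a strip, whose vertices all lie on the outline, tops out at 10. The helper names below are made up for illustration.

```python
def strip_indices(n_tris: int) -> int:
    """Index count for n triangles as one strip: two to start, then one per triangle."""
    return n_tris + 2

def list_indices(n_tris: int) -> int:
    """Index count for n triangles as an indexed triangle list: three per triangle."""
    return 3 * n_tris

def disk_triangles(boundary: int, interior: int) -> int:
    """Triangle count of any triangulated disk (no holes) from Euler's formula,
    with `boundary` vertices on the outline and `interior` vertices inside."""
    return boundary + 2 * interior - 2

print(strip_indices(11), list_indices(11))   # 13 33
print(disk_triangles(12, 0))                 # 10: 12 vertices, all on the outline
print(disk_triangles(11, 1))                 # 11: 12 vertices, one of them interior
```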
And at least for a triangle fan it'd be n+1 in any case.
Only for a closed fan. And I thought fans are not supported anymore since D3D10.
The marketing stunt I was referring to would have been the 2.6× improvement teaser over the prior-generation Fiji: since the VTF rate would not necessarily improve, the 2.6× might apply only in limited use cases.
It's always going to be an improvement in limited scenarios. The question is how restrictive the limits are. If Vega could sustain significantly higher geometry rates in a lot of cases with geometry shaders, dense meshes, or tessellation (where Polaris also still lacks performance compared to nV), the claim would be justified in my opinion.
 
Well, the topology of that geometry isn't inherently clear, i.e. one would need an additional index buffer or something to clarify which vertices should be connected to form a triangle (there are a lot of possibilities to form different triangles from the same 12 vertices). I don't know if one could use this in the general case. Triangle strips don't suffer from this; they are unambiguous without needing additional information.
I thought fans are not supported anymore since D3D10.
It's always going to be an improvement in limited scenarios. The question is how restrictive the limits are. If Vega could sustain significantly higher geometry rates in a lot of cases with geometry shaders, dense meshes, or tessellation (where Polaris also still lacks performance compared to nV), the claim would be justified in my opinion.
I didn't know fans weren't supported anymore. (Edit: you're right, so no fans in D3D. What about OpenGL or Vulkan?) In any case, for marketing purposes it would be sufficient if there is some kind of construct allowing for the cited peak rate; there is no need for general cases.

As for Nvidia and their respective stunt: as you said, it will all boil down to how often you can use your hardware's abilities. :)
 
I didn't know fans weren't supported anymore. (Edit: you're right, so no fans in D3D. What about OpenGL or Vulkan?) In any case, for marketing purposes it would be sufficient if there is some kind of construct allowing for the cited peak rate; there is no need for general cases.

As for Nvidia and their respective stunt: as you said, it will all boil down to how often you can use your hardware's abilities. :)
I tend to go with what Ryan at PCPer said about Vega: in theory the best case will be 8 polygons/clock if wrapped by the driver, and the 2.7x is achieved through API/coding.
Earlier in his article, before mentioning those figures, he says:
as AMD told it to me, “with the right knowledge” you can discard game based primitives at an incredible rate. This right knowledge though is the crucial component – it is something that has to be coded for directly and isn’t something that AMD or Vega will be able to do behind the scenes.
And we know Wasson and Raja were doing interviews and providing further information after the preview, along with Mike Mantor.
Cheers
 
I tend to go with what Ryan at PCPer said about Vega: in theory the best case will be 8 polygons/clock if wrapped by the driver, and the 2.7x is achieved through API/coding.
Earlier in his article, before mentioning those figures, he says:
And we know Wasson and Raja were doing interviews and providing further information after the preview, along with Mike Mantor.
Cheers


Yep, and do we know how many geometry engines Vega has?

Something is funny about "up to 11 tris per clock"; it doesn't make much sense unless something is stopping it from going further.
 
It does not, except in pretty specific circumstances with triangles well below 1 pixel in size (like 20 triangles per pixel), as I mentioned before. Even then, each engine never exceeds 1 triangle per clock; older GPUs are just pretty slow in that range.

[Two benchmark graphs]


Even this shows a huge improvement for Polaris though.

AMD always shows best-case numbers, and they could be talking about tessellated, sub-pixel triangles in this case too. They didn't specify that; all they stated was that they were comparing to Fiji, with no specifics on software, app, etc. They actually gave more information on Polaris in their white paper.

How many times has their "marketing" material really just covered limited scenarios? I can't remember the last time they used realistic, real-world numbers (averages).
 
So they envision a future where a programmable geometry pipeline can offer even higher throughput than the traditional way of doing it allows.
As I said in the edit above. It allows more flexible use of shaders for geometry stuff.
Sounds to me like shaders are definitely involved in the process. How much, and how far, is still up for debate.
And you just made up that primitive shaders are needed for the stated 11 prims/clock. This is nowhere to be found in AMD's slides or statements.
It's not far-fetched to theorize that they are needed to reach a certain limit, not with all the hints being thrown at us by AMD marketing and interviews.

Please show me a benchmark where Polaris has 2x the throughput of Fiji, Hawaii or even Tonga
I found some cases where Polaris has been shown to achieve performance equal to or above Fiji in some titles; Forza Horizon 3 and Fallout 4 come to mind.
 