Performance evolution between GCN versions - Tahiti vs. Tonga vs. Polaris 10 at same clocks and CUs

There are some misunderstandings here. No GCN part has required the DS to execute on the same CU as the HS, though it sometimes does.
With no patch control point shader and no vertex shader, GCN 1.1 tessellation performs quite well (HS body only + a DS that takes SV_PrimitiveId and SV_Barycentrics as input). Primitive rate becomes the bottleneck. Tiny triangles also cause various bottlenecks in the rasterization pipeline (Polaris improved this).
I suspect people give far too much credit to Nvidia's tiled rendering, though without a benchmark that can disable this feature there's no way to prove anything. Low voltage and high clock speeds are the primary weapons of Maxwell and Pascal.
It is recording and reordering triangles to improve screen locality. The synergy with tile-based compression (DCC & lossless depth compression) is clearly there. Screen locality also (trivially) improves render target cache hit rate. I am not expecting it to behave like PowerVR TBDR (= no overdraw), but it should easily be able to save 20%+ of render target bandwidth in common cases. Nvidia could also use slightly more complex DCC algorithms, as tiling should hide the DCC latency better and invoke the DCC hardware less often. This gives further bandwidth gains.
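
For illustration, here is a minimal CPU-side sketch of the binning idea. The 32-pixel tile size, data layout and conservative bounding-box test are my own assumptions, not Nvidia's actual implementation; it only shows the general technique of sorting submitted triangles into screen tiles before rasterizing:

```python
TILE = 32  # hypothetical tile size in pixels

def bin_triangles(triangles, screen_w, screen_h):
    """triangles: list of ((x0, y0), (x1, y1), (x2, y2)) in pixel coordinates."""
    tiles_x = (screen_w + TILE - 1) // TILE
    tiles_y = (screen_h + TILE - 1) // TILE
    bins = {}  # (tx, ty) -> list of triangle indices, kept in submission order
    for i, tri in enumerate(triangles):
        xs = [v[0] for v in tri]
        ys = [v[1] for v in tri]
        # Conservative test: add the triangle to every tile its bounding box
        # touches. Real hardware would use an exact tile/edge test.
        for ty in range(max(0, int(min(ys)) // TILE),
                        min(tiles_y - 1, int(max(ys)) // TILE) + 1):
            for tx in range(max(0, int(min(xs)) // TILE),
                            min(tiles_x - 1, int(max(xs)) // TILE) + 1):
                bins.setdefault((tx, ty), []).append(i)
    return bins

# Rasterizing one bin at a time keeps the depth/color traffic for that tile
# resident in cache, which is where the render target bandwidth saving comes from.
```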

One case where the Nvidia tiling really helps is particle rendering (rgba16f output). Particles are most often 2-triangle quads. Nvidia can bin thousands of particles to tiles before rasterizing them. Particles close to each other spatially (from the same emitter) are likely also close in the triangle list, meaning that they get binned together. Particle effects (a big explosion) close to the camera are the number one reason for big frame dips in games. One explosion is < 1000 particles = it gets binned at once. So instead of hammering the memory bandwidth (read + write) with 100x full-screen rgba16f overdraw (of the nearby explosion smoke particles), we get a single read + a single write. This is a huge saving.
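
Some back-of-the-envelope numbers to make the saving concrete, assuming a 1080p rgba16f target and the 100x overdraw figure from above (the exact sizes are illustrative):

```python
width, height = 1920, 1080
bytes_per_pixel = 8            # rgba16f
overdraw = 100                 # nearby full-screen smoke layers

pixels = width * height
# Immediate mode: every blended layer reads and writes the render target in memory.
immediate = pixels * bytes_per_pixel * overdraw * 2
# Binned: blend all layers on chip, then one read + one write per pixel.
binned = pixels * bytes_per_pixel * 2

print(f"immediate-mode traffic: {immediate / 2**30:.2f} GiB per frame")  # ~3.09 GiB
print(f"binned traffic:         {binned / 2**30:.3f} GiB per frame")     # ~0.031 GiB
```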

Good example of potential gains (this technique blends particles in LDS):
http://www.slideshare.net/DevCentra...ndering-using-direct-compute-by-gareth-thomas
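
The core idea in that talk is to accumulate all particles covering a tile in fast on-chip storage (LDS / groupshared memory) and write the tile to the rgba16f target once. The Python below is only a CPU-side analogy of that data flow; the tile size, names and particle format are made up for illustration:

```python
TILE = 32  # hypothetical tile size in pixels

def shade_tile(tile_color, particles_in_tile):
    """tile_color: list of TILE*TILE (r, g, b, a) tuples, standing in for LDS."""
    for p in particles_in_tile:                    # back-to-front order
        for idx, alpha in p["covered_pixels"]:     # (pixel index in tile, coverage)
            r, g, b, a = tile_color[idx]
            pr, pg, pb = p["color"]
            # Standard "over" blend, but against on-chip storage instead of
            # the render target in memory.
            tile_color[idx] = (pr * alpha + r * (1 - alpha),
                               pg * alpha + g * (1 - alpha),
                               pb * alpha + b * (1 - alpha),
                               alpha + a * (1 - alpha))
    return tile_color  # written back to the render target exactly once
```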
 
Personally I like AMD GPUs. During the past year I have been writing almost purely compute shaders. I have been consistently amazed at how well the old Radeon 7970 fares against the GTX 980 in compute shaders. I gladly leave all the rasterization and geometry processing woes to others :)

No wonder I still have my 7970s as my main compute GPUs for ray tracing; they are solid and strike a good balance between power consumption and compute performance.

To note, these are the original reference 7970s (not the GHz Edition, but flashed with the GHz BIOS), with an over-engineered and solid PWM.

There are some misunderstandings here. No GCN part has required the DS to execute on the same CU as the HS, though it sometimes does. Also, AMD does load balance rasterization across the GPU. Probably in a similar fashion to Nvidia at a high level.

I suspect people give far too much credit to Nvidia's tiled rendering, though without a benchmark that can disable this feature there's no way to prove anything. Low voltage and high clock speeds are the primary weapons of Maxwell and Pascal.


I agree completely with that. Something to keep in mind is that Pascal GPUs run around the 1.8 GHz mark, while Polaris runs much lower... The architectures and their respective performance differences are not that big; I'm pretty sure a 480 would not pale against a 1080 if it were running at the same speed.

With Pascal, Nvidia has nearly the same number of SPs as in the previous generation. The SM configuration is different, but at nearly the same shader count you end up with a performance difference that is not far from the clock speed difference between a 980 and a 1070; the same goes for the Titan X.

I can compare Fire Strike scores (DX11):

980 (2048 SP) = 11,686
1070 (1920 SP) = 16,229

Score difference = 38.9%

980 = 1216 MHz boost
1070 = 1683 MHz boost

38.4% more core speed.
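
A quick sanity check on those ratios, using the scores and boost clocks quoted above:

```python
score_980, score_1070 = 11686, 16229
clock_980, clock_1070 = 1216, 1683   # MHz boost clocks

print(f"score gain: {score_1070 / score_980 - 1:.1%}")   # ~38.9%
print(f"clock gain: {clock_1070 / clock_980 - 1:.1%}")   # ~38.4%
# The performance gap tracks the clock gap almost exactly, even though the
# 1070 has slightly fewer SPs (1920 vs 2048).
```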
 
I'd have to agree with this. The tiled rasterization likely helps, but there should be ways to overcome it, and it would be situational. We'd be seeing far larger performance gaps if it made that large a difference. If AMD had the cash to fine-tune critical paths and were clocked significantly higher, would there really be that much of a difference?

But higher clocks are also a result of the design of the chip. Maybe NV is trading some mm² of die size for higher clocks; maybe AMD would need to remove some SPs to achieve higher clocks within the same die size; but in the end both are making those trade-offs. Maybe GF is worse than TSMC, but AMD decided to move its production to GF. Maybe AMD's architecture really does own NV under DX12 and Vulkan, but then they bet on people moving to Win10 quickly and on software developers scrapping all their tried, tested, and optimized engines, middleware, and tools to move to DX12. At least the last was always highly unlikely, because software developers also have to meet release dates, and anything coming out until 2018 will have to run well on DX11 and was most likely started on DX11.

I am growing tired of AMD always being the victim.
 
There was an improvement with GCN 1.1 that helps in this situation, primarily by reducing the latency of the HS.
Every time I hear some tidbits about issues that were fixed with GCN 1.1, I feel glad that the consoles are not GCN 1.0.
 