Performance evolution between GCN versions - Tahiti vs. Tonga vs. Polaris 10 at same clocks and CUs

sebbbi · Sep 21, 2016

3dcgi said:
There are some misunderstandings here. No GCN part has required the DS to execute on the same CU as the HS, though it sometimes does.

With no patch control point shader and no vertex shader, GCN 1.1 tessellation performs quite well (HS body only + DS that takes SV_PrimitiveId and SV_Barycentrics as input). Primitive rate becomes the bottleneck. Tiny triangles also cause various bottlenecks in the rasterization pipeline (Polaris improved this).

3dcgi said:
I suspect people give far too much credit to Nvidia's tiled rendering though without a benchmark that can disable this feature there's no way to prove anything. Low voltage and high clock speeds are the primary weapons of Maxwell and Pascal.

It is recording and reordering triangles to improve screen locality. The synergy with tile based compression (DCC & lossless depth compression) is clearly there. Screen locality also (trivially) improves render target cache hit rate. I am not expecting it to behave like PowerVR TBDR (= no overdraw), but it should be easily able to save 20%+ of render target bandwidth in common cases. Nvidia could also use slightly more complex DCC algorithms, as tiling should hide the DCC latency better and invoke DCC hardware less often. This gives further bandwidth gains.

One case where the Nvidia tiling really helps is particle rendering (rgba16f output). Particles are most often 2 triangle quads. Nvidia can bin thousands of particles to tiles before rasterizing them. Particles close to each other spatially (from the same emitter) are likely also close in the triangle list, meaning that they get binned together. Particle effects (big explosion) close to the camera are the number one reason for big frame dips in games. One explosion is < 1000 particles = gets binned at once. So instead of hammering the memory bandwidth (read + write) with 100x full screen rgba16f overdraw (of the nearby explosion smoke particles), we get a single read + a single write. This is a huge saving.

Good example of potential gains (this technique blends particles in LDS):
http://www.slideshare.net/DevCentra...ndering-using-direct-compute-by-gareth-thomas
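
A rough back-of-envelope sketch of the particle case above. The 1920x1080 target size and the assumption that every particle layer covers the whole screen are illustrative choices, not figures from the post; caches and compression are ignored, so treat the result as an idealized upper bound rather than a measurement.

```python
# Idealized render target traffic for blended rgba16f particles.
# ASSUMPTIONS (not from the post): 1920x1080 target, each layer covers the full screen,
# no cache hits, no compression.
WIDTH, HEIGHT = 1920, 1080
BYTES_PER_PIXEL = 8      # rgba16f: 4 channels x 2 bytes each
OVERDRAW_LAYERS = 100    # the "100x full screen overdraw" case from the post


def blend_traffic_bytes(layers: int) -> int:
    """Bytes moved when each blended layer reads and then writes the whole target."""
    return WIDTH * HEIGHT * BYTES_PER_PIXEL * 2 * layers


# Without binning: every particle layer does a read + write against memory.
immediate = blend_traffic_bytes(OVERDRAW_LAYERS)
# With binning: layers are blended on chip, so memory sees one read + one write.
binned = blend_traffic_bytes(1)

print(f"immediate: {immediate / 2**20:.0f} MiB per frame")
print(f"binned:    {binned / 2**20:.0f} MiB per frame")
print(f"ratio:     {immediate // binned}x")
```

Under these assumptions the saving scales linearly with the overdraw depth, which is why a single nearby explosion is the worst case for an immediate-mode path.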

lanek · Sep 21, 2016

sebbbi said:
Personally I like AMD GPUs. During the past year I have been writing almost purely compute shaders. I have been consistently amazed how well the old Radeon 7970 fares against GTX 980 in compute shaders. I gladly leave all the rasterization and geometry processing woes to others

No wonder I still have my 7970s as my main compute GPUs for raytracing; they are solid and strike a good balance between power consumption and compute performance.

Note that those are the original reference 7970s (not GHz Edition cards, but flashed with the GHz BIOS), with over-engineered, solid PWM.

3dcgi said:
There are some misunderstandings here. No GCN part has required the DS to execute on the same CU as the HS, though it sometimes does. Also, AMD does load balance rasterization across the GPU. Probably in a similar fashion to Nvidia at a high level.

I suspect people give far too much credit to Nvidia's tiled rendering though without a benchmark that can disable this feature there's no way to prove anything. Low voltage and high clock speeds are the primary weapons of Maxwell and Pascal.

Agree completely with that. Something to keep in mind is that Pascal GPUs run around the 1.8 GHz mark, while Polaris runs way lower... The architectures and their respective performance differences are not so big; I'm pretty sure a 480 would not pale against a 1080 if it were running at the same speed.

Nvidia's Pascal has nearly the same number of SPs as the previous generation. The SM configuration is different, but at nearly the same shader count you end up with a performance difference that seems close to the difference in clock speed between a 980 and a 1070. The same goes for the Titan X.

I can compare with Firestrike scores (DX11):

980 (2048 SP) = 11,686
1070 (1920 SP) = 16,229

Score difference = 38.9%

980 = 1216 MHz boost
1070 = 1683 MHz boost

38.4% more core speed.
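
The comparison above can be reproduced with a few lines. This is only a sketch of the arithmetic, using the scores and boost clocks quoted in the post; the dictionary layout and the `percent_gain` helper are illustrative names, not part of any benchmark tool.

```python
# Firestrike scores and boost clocks as quoted in the post.
cards = {
    "980": {"score": 11686, "boost_mhz": 1216},
    "1070": {"score": 16229, "boost_mhz": 1683},
}


def percent_gain(new: float, old: float) -> float:
    """Relative increase of `new` over `old`, in percent."""
    return (new / old - 1.0) * 100.0


score_gain = percent_gain(cards["1070"]["score"], cards["980"]["score"])
clock_gain = percent_gain(cards["1070"]["boost_mhz"], cards["980"]["boost_mhz"])

print(f"score gain: {score_gain:.1f}%")
print(f"clock gain: {clock_gain:.1f}%")
```

Note that this compares rated boost clocks only; the clocks actually sustained during the benchmark run may differ.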

seahawk · Sep 21, 2016

Anarchist4000 said:
I'd have to agree with this. The tiled rasterization likely helps, but there should be ways to overcome it and it would be situational. We'd be seeing far larger performance gaps if it made that large of a difference. If AMD had the cash to fine tune critical paths and were clocked significantly higher would there really be that much of a difference?

But higher clocks are also a result of the design of the chip. Maybe NV is trading some mm² of die size for higher clocks, maybe AMD would need to remove some SPs to achieve higher clocks within the same die size, but in the end both are making those decisions. Maybe GF is worse than TSMC, but AMD decided to move its production to GF. Maybe AMD's architecture really owns NV under DX12 and Vulkan, but then they did bet on people moving to Win10 quickly and on software developers scrapping all their tried, tested and optimized engines, middleware and tools to move to DX12. At least the last was always highly unlikely, because software developers also have to meet release dates, and anything coming out until 2018 will have to run well on DX11 and was most likely started on DX11.

I am growing tired of AMD always being the victim.

Deleted member 2197 · Sep 21, 2016

Then and Now: Six Generations of $200 Mainstream Radeon GPUs Compared
http://www.techspot.com/article/123...ream/&usg=ALkJrhgWI3D9iWuyKx9HM_JRBp-3xXbzLw/

3dcgi · Sep 21, 2016

sebbbi said:
With no patch control point shader and no vertex shader, GCN 1.1 tessellation performs quite well (HS body only + DS that takes SV_PrimitiveId and SV_Barycentrics as input).

There was an improvement with GCN 1.1 that helps in this situation, primarily by reducing the latency of the HS.

sebbbi · Sep 21, 2016

3dcgi said:
There was an improvement with GCN 1.1 that helps in this situation, primarily by reducing the latency of the HS.

Every time I hear some tidbits about issues that were fixed with GCN 1.1, I feel glad that the consoles are not GCN 1.0.

Ethatron · Sep 22, 2016

pharma said:
Then and Now: Six Generations of $200 Mainstream Radeon GPUs Compared
http://www.techspot.com/article/123...ream/&usg=ALkJrhgWI3D9iWuyKx9HM_JRBp-3xXbzLw/

Wow, watts/frame is a metric now.

Razor1 · Sep 22, 2016

same as performance per watt, just broken out in a different way lol.

MrFox · Sep 22, 2016

It should be joules per frame :runaway:

Esrever · Sep 23, 2016

calories per frame

Razor1 · Sep 23, 2016

hmm bacon! wait now I'm hungry lol.

sonen · Sep 23, 2016

MrFox said:
It should be joules per frame

Or frames per joule, which would be equivalent to performance per watt. Sounds good to me - why call it broken?
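
The unit argument can be checked in a couple of lines: a watt is a joule per second, so (frames/s) / (J/s) = frames/J. A minimal sketch follows; the 60 fps and 150 W figures are made-up illustrative numbers, not measurements from the article.

```python
def frames_per_joule(fps: float, watts: float) -> float:
    """Performance per watt: (frames/s) / (J/s) reduces to frames per joule."""
    return fps / watts


def joules_per_frame(fps: float, watts: float) -> float:
    """Energy spent on one frame; the reciprocal of frames per joule."""
    return watts / fps


# Hypothetical card: 60 fps at 150 W (illustrative values only).
fps, watts = 60.0, 150.0
print(f"{frames_per_joule(fps, watts):.3f} frames per joule")
print(f"{joules_per_frame(fps, watts):.3f} joules per frame")
```

Both numbers carry the same information; one is simply the inverse of the other.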
