AMD Vega Hardware Reviews

Slide 19

http://developer.amd.com/wordpress/media/2013/06/2620_final.pdf

It's a key differentiator as compared with the VLIW SIMD design: "Vector back-to-back wavefront instruction issue", versus "Interleaved wavefront instruction required" for VLIW.

I have got respectable performance out of GCN with just a single wavefront per SIMD (i.e. more than 128 VGPR allocation). Depends on ALU:MEM and incoherent control flow, in the end.
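For context, the arithmetic behind that parenthetical, as a minimal sketch assuming GCN's 256 VGPRs per SIMD lane (illustrative numbers only):

# GCN has 256 VGPRs per SIMD lane, so the number of wavefronts that can be
# resident on one SIMD is bounded by the per-wave VGPR allocation.
TOTAL_VGPRS = 256
for vgprs_per_wave in (64, 96, 128, 130):
    print(vgprs_per_wave, TOTAL_VGPRS // vgprs_per_wave)
# 64 -> 4 waves, 96 -> 2, 128 -> 2, 130 -> 1: anything over 128 VGPRs
# leaves a single wavefront per SIMD, as described above.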

32KiB (shared by several CUs) of I$ is plenty large enough for fairly complex compute (a single very heavy kernel). Multiple, large, competing kernels sharing I$ is obviously going to be a factor with the various kernels seen by graphics. Still doesn't change the fact that GCN was designed explicitly for back-to-back execution of instructions from a single wavefront.
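As a rough sanity check on "plenty large enough" (back-of-the-envelope only, assuming GCN's 4- or 8-byte instruction encodings):

# How many GCN instructions fit in the shared 32 KiB instruction cache?
ICACHE_BYTES = 32 * 1024
for encoding_bytes in (4, 8):          # GCN ops use 32-bit or 64-bit encodings
    print(encoding_bytes, ICACHE_BYTES // encoding_bytes)
# ~8192 short or ~4096 long instructions, so a single heavy kernel of a few
# thousand instructions can stay fully resident in the cache.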
 
A workload with a single wavefront per SIMD, and the same fully cached kernel for all CUs sharing a front end, would go far in making the CU's fetch arbitration straightforward, perhaps more so after Polaris increased its buffer sizes and added whatever it calls instruction prefetch.
It makes it seem like GCN can heavily exercise its capacity for back-to-back issue within a wavefront on a SIMD if the workload gives the CU nothing else to switch to.
 
How so? Comparable performance subjectively in a variable refresh environment.

Well, Antal says in the blog:
„Even if we pitted Radeon RX Vega against the mightier GeForce 1080Ti,[…]“
which might be interpreted as RX Vega having been compared to the 1080 Ti as well as to the 1080 non-Ti, if you take into account the context of the preceding paragraphs.

If you take this sentence from Antal's blog
„That said, the biggest difference between the two systems was the price based simply on the monitor itself, with the G-Sync display costing $300 more than its FreeSync counterpart.“
you might conclude that RX Vega, in the edition shown in Budapest, will sell for roughly the GTX 1080's price level.
 
How so? Comparable performance subjectively in a variable refresh environment.
Well, in addition to what Carsten said, there is AMD's choice of the 1080 for the comparison, there are AMD's earlier demos which indicated 1080 performance, and there is the FE's gaming performance, which is around the 1080's. And then there are the blind tests and the lack of fps counters. Too many signs now point to 1080-level performance for the air-cooled Vega. Maybe the water-cooled version will be noticeably better than the 1080?
 
What that sounds like is the perfect mining card that no consumer would ever get their hands on.

AMD, on the other hand, would be delighted to see the RX580/570 back on the shelves for gamers.
This is actually an interesting point. Vega's high power consumption makes it a poor choice for mining, meaning retail prices for Vega FX could be substantially lower than those of similarly performing cards with lower TDP.
 
Well, in addition to what Carsten said, there is AMD's choice of the 1080 for the comparison, there are AMD's earlier demos which indicated 1080 performance, and there is the FE's gaming performance, which is around the 1080's. And then there are the blind tests and the lack of fps counters. Too many signs now point to 1080-level performance for the air-cooled Vega. Maybe the water-cooled version will be noticeably better than the 1080?
Yep, you're right. Given everything previously available, and now the blind tests and the follow-up article, 1080-level speed for RX Vega looks confirmed. I imagine the water-cooled version will be able to sustain the higher clocks (around 1600 MHz), but it will likely be no more than 10% faster.

Now let's move on to SIGGRAPH and announce pricing!
 
32KiB (shared by several CUs) of I$ is plenty large enough for fairly complex compute (a single very heavy kernel). Multiple, large, competing kernels sharing I$ is obviously going to be a factor with the various kernels seen by graphics. Still doesn't change the fact that GCN was designed explicitly for back-to-back execution of instructions from a single wavefront.
It's the multiple large kernels, amplified by async, that would be the concern, along with DPP and other conditions requiring wait states, where back-to-back execution is obviously problematic. I was never disputing that GCN is capable of back-to-back instructions from the same wave, only that the scheduling would prefer not to execute a single wave indefinitely, two waves being more ideal. In fact, an unbounded VALU-heavy wave would ideally be the lowest priority for scheduling.

It's a complex issue, but for multiple thread groups, especially with async, I would think keeping them somewhat in lockstep is preferable: ensure all related waves hit the cache before another kernel or SIMD trashes it. Taken a step further, periodically switching groups so as to always have one stalled on every possible bottleneck would increase utilization, assuming limited OoO execution capabilities. Instruction prefetch, as a non-dependent fetch to prime a cache, could do that, but there are other bottlenecks. It would make sense for the schedulers to prioritize whichever waves aren't likely to be bound by the currently observed bottlenecks. I know AMD has mentioned they actively track some of those metrics. HWS could use them, but so could wave scheduling.

Back to my original software scheduling point: Nvidia was explicitly addressing temporary registers with the RF caching. Unlike VGPRs, they would block other waves from executing, while waves not using them could still be scheduled and execute. The programming/scheduling model would likely prefer short execution bursts, allowing the hardware scheduler's prioritization mechanisms to balance the load. Not an issue for Nvidia, as they lack the async and hardware scheduling there. So while a single wave could execute indefinitely, that would be less than ideal. A CU/group-level scalar for controlling timing could really help there.

Polaris increased its buffer sizes and added whatever it calls instruction prefetch.
Polaris also added HWS, so it stands to reason some conventional wisdom may have changed. As mentioned above, non-dependent cached memory accesses make sense for the prefetch, as an OoO optimization to prime caches. Then a barrier placed close to the expected latency would ensure all waves in a group hit the cache ASAP, or at the very least stall only a short time.
 
I have got respectable performance out of GCN with just a single wavefront per SIMD (i.e. more than 128 VGPR allocation). Depends on ALU:MEM and incoherent control flow, in the end.
It's true that occupancy = 1 is enough for a GCN SIMD to achieve peak performance, but only in massive ALU-heavy kernels that don't do any memory operations. Occupancy = 1 prevents any co-issuing, so the scalar unit can't give you perf gains (bad if you have ALU operations that could run at wave granularity). And obviously each memory access = stall, even when it hits the caches. Even an L1$ access is a >100 cycle stall (but one instruction takes 4 cycles to execute, so the stall is 4x less in instructions). LDS accesses stall the SIMD and barriers stall the SIMD. The same is obviously true for texture sampling (even from L1$). So in practice occupancy = 1 is useless, but occupancy = 2 is already much better, because it can co-issue vector + scalar + mem and hide infrequent LDS latency and infrequent K$ scalar unit loads (constants). But it will still stall the SIMD every time you do vector memory loads (unless they hit L1$). This is still practical for some highly ALU-dense kernels that mostly operate in registers and in LDS and have huge loops inside the kernel.

Most shaders require SIMD occupancy of 4-7 for peak performance. Never seen any shader gaining anything from higher occupancy than 8. I have seen some shaders that reach peak perf at SIMD occupancy of 3. I would guess that 4K demos and shadertoy shaders with no texture sampling (pure analytic SDF sphere tracing) reach peak perf already at occupancy of 2. But this obviously means that you can't do any groupshared memory (LDS) optimizations or any communication (LDS or cross lane ops) between lanes. Shadertoy only supports pixel shaders, so there's no communication, so this problem doesn't exist.
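A crude back-of-the-envelope model of why the occupancy numbers above fall out that way (my own sketch; it ignores co-issue, caches and instruction mix, and the latency/ALU figures are placeholders):

# Each wave on a GCN SIMD issues at most one instruction per 4 cycles.
# If a wave runs M vector-ALU instructions between memory operations and
# each memory op costs about L cycles, the number of waves needed to keep
# the SIMD busy is roughly 1 + L / (4 * M).
def waves_to_hide(mem_latency_cycles, alu_per_mem):
    return 1 + mem_latency_cycles / (4 * alu_per_mem)

print(waves_to_hide(120, 30))   # ~2  : very ALU-dense kernel
print(waves_to_hide(120, 10))   # ~4  : typical-ish shader
print(waves_to_hide(350, 20))   # ~5+ : misses going past L1$

Which lines up loosely with the 4-7 occupancy figure quoted above for shaders that actually touch memory.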
 
It's true that occupancy = 1 is enough for a GCN SIMD to achieve peak performance, but only in massive ALU-heavy kernels that don't do any memory operations. Occupancy = 1 prevents any co-issuing, so the scalar unit can't give you perf gains (bad if you have ALU operations that could run at wave granularity). And obviously each memory access = stall, even when it hits the caches. Even an L1$ access is a >100 cycle stall (but one instruction takes 4 cycles to execute, so the stall is 4x less in instructions). LDS accesses stall the SIMD and barriers stall the SIMD. The same is obviously true for texture sampling (even from L1$). So in practice occupancy = 1 is useless, but occupancy = 2 is already much better, because it can co-issue vector + scalar + mem and hide infrequent LDS latency and infrequent K$ scalar unit loads (constants). But it will still stall the SIMD every time you do vector memory loads (unless they hit L1$). This is still practical for some highly ALU-dense kernels that mostly operate in registers and in LDS and have huge loops inside the kernel.
This is why I wrote about heavy kernels: kernels that can run for hundreds to thousands of milliseconds (I don't do graphics), e.g. sorting millions of lists that are each hundreds to thousands of elements long.

Most shaders require SIMD occupancy of 4-7 for peak performance. Never seen any shader gaining anything from higher occupancy than 8. I have seen some shaders that reach peak perf at SIMD occupancy of 3. I would guess that 4K demos and shadertoy shaders with no texture sampling (pure analytic SDF sphere tracing) reach peak perf already at occupancy of 2. But this obviously means that you can't do any groupshared memory (LDS) optimizations or any communication (LDS or cross lane ops) between lanes. Shadertoy only supports pixel shaders, so there's no communication, so this problem doesn't exist.
The real problem with GCN is that the compiler stops your sweet algorithm that consumes N VGPRs from running sweetly, because the compiler decides to add a ridiculous number of additional VGPRs, trashing your carefully organised memory access patterns designed for N VGPRs (and the resulting wavefront count). So then you have to hunt down a lesser algorithm to work around the compiler.
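To make that cliff concrete, a hypothetical sketch using GCN's 256 VGPRs per SIMD lane and 4-register allocation granularity (the kernel sizes are made up):

# A kernel tuned for 36 VGPRs fits 7 waves per SIMD; if the compiler
# "helpfully" allocates a few more, occupancy steps down a whole level.
def waves_per_simd(vgprs, total=256, granule=4, max_waves=10):
    alloc = -(-vgprs // granule) * granule   # round up to allocation granule
    return min(max_waves, total // alloc)

for vgprs in (36, 40, 48, 68, 84, 86, 129):
    print(vgprs, waves_per_simd(vgprs))
# 36 -> 7, 40 -> 6, 48 -> 5, 68 -> 3, 84 -> 3, 86 -> 2, 129 -> 1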
 
HWS were already present in Tonga and Fiji. They are useful, but not magical unicorns.
HWS were introduced with Polaris as I recall, but, being programmable, were backported to Tonga and Fiji. In reference to the whitepaper and scheduling, HWS didn't exist at the time of publication: the paper is dated 2011 and, going by the URL, I believe it was released in 2013. Nor did a readily usable implementation of async compute exist; Mantle was being experimented with at that time. It stands to reason that the scheduling process is actively evolving over time, which also explains the programmable hardware in the first place.

Most shaders require SIMD occupancy of 4-7 for peak performance. Never seen any shader gaining anything from higher occupancy than 8.
While not affecting individual shader performance, it might make sense to use those extra waves for leading-edge pixels if possible: trigger page faults with HBCC as soon as possible, since those waves will likely stall, and then use the duration of the frame to cover the stall. It wouldn't be surprising if the raster pattern on Vega changed direction, going opposite to camera movement, if that's even possible for the drivers given the order the application draws in, or programmable via wave distribution.
 