AMD Vega 10, Vega 11, Vega 12 and Vega 20 Rumors and Discussion

The ACEs would be arbitrating the creation of wavefronts, and by extension the work groups.
Just want to point out that the order should be reversed. The ACEs conceptually create only workgroups from the dispatch grids, and track their completion. There is another intermediate layer that shuffles workgroups across the ACEs and the graphics pipeline, and then pipes them into CUs as wavefronts with regard to several constraints, e.g. LDS allocation size.
 
True, but there would be more execution units and double the streams. The patent also mentioned that timing issues would be avoided as execution was mutually exclusive: a wave executed vector or scalar ops, but not both simultaneously. That in my mind leaves a lot of performance on the table, as the SIMD would be idle while executing the scalar path.
Usually the scalar path is solely used for control flow, which would have dependencies on the output from the vector pipeline. You also already have SMT to fill the slots as much as possible, as different categories of instructions from different wavefronts can be co-issued. I don't think allowing scalar-vector co-issuing within a wavefront would help much.
 
Usually the scalar path is solely used for control flow, which would have dependencies on the output from the vector pipeline. You also already have SMT to fill the slots as much as possible. I don't think allowing scalar-vector co-issuing within a wavefront would help much.
If they can co-issue from different waves it shouldn't be a problem. My understanding was they could only co-issue from the same wave, leaving the scalar idle most of the time. Co-issue from multiple waves would make it much easier to craft a scalar only wave to handle prefetching. It would also raise questions why the scalar wasn't already fully capable of all operations for extracting scalar code from the waves.
 
If they can co-issue from different waves it shouldn't be a problem. My understanding was they could only co-issue from the same wave, leaving the scalar idle most of the time. Co-issue from multiple waves would make it much easier to craft a scalar only wave to handle prefetching. It would also raise questions why the scalar wasn't already fully capable of all operations for extracting scalar code from the waves.
http://developer.amd.com/wordpress/media/2013/06/2620_final.pdf
Page 12

A CU will issue the instructions of a kernel for a wave-front sequentially
– Use of predication & control flow enables any single work-item a unique execution path

• Every clock cycle, waves on one SIMD are considered for instruction issue.
• At most, one instruction from each category may be issued.
• At most one instruction per wave may be issued.
• Up to a maximum of 5 instructions can issue per cycle, not including “internal” instructions
– 1 Vector Arithmetic Logic Unit (ALU)
– 1 Scalar ALU or Scalar Memory Read
– 1 Vector memory access (Read/Write/Atomic)
– 1 Branch/Message - s_branch and s_cbranch_<cond>
– 1 Local Data Share (LDS)
– 1 Export or Global Data Share (GDS)
– 1 Internal (s_nop, s_sleep, s_waitcnt, s_barrier, s_setprio)
 
It would also raise questions why the scalar wasn't already fully capable of all operations for extracting scalar code from the waves.
Not sure what "scalar code" you mean. If you mean things that are uniform across a wavefront, it just needs a way to be expressed in the shader languages.

If what you mean is scalarization of, say, a diverged branch, the main obstacle is the data path design, I'd say. This is the same as saying the scalar unit can access the lane RFs at any time. You would need extra read/write ports for every single RF, likely 3R1W if it is meant to be fully capable. That said, it is possible for them to coalesce the memory/LDS/export ops under the control of the scalar unit and share the data path.
 
Graphics shaders rarely, if ever, have incoherent control flow. Only in compute can you really make an argument that incoherent control flow is essential.
Not true. I have seen very complex graphics shaders. Longest so far was 1500 instructions (and it was a vertex shader!!).

A simple example of branching in a pixel shader: triplanar mapping. Most terrain renderers use triplanar mapping. Naive triplanar mapping always samples 3 textures (xyz axis planar projections) and blends them according to the surface world space normal. You get a big perf boost by branching out the texture reads, blend & uv math according to the xyz blend weights (bias the weights and test against zero). This is a commonly used optimization (the Witcher series for example uses it, so do we and many others).

The triplanar mapping branch is incoherent in areas where the normal changes rapidly and at discontinuities. Optimized triplanar mapping basically splits the unit sphere (normal vector) into 26 regions. Near the edges of each of these regions the branch will be incoherent.
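Roughly, the branched version looks like this in HLSL (a minimal sketch; the texture/sampler names and the 0.05 weight bias are illustrative, not taken from any particular engine):

```hlsl
Texture2D    g_TexX, g_TexY, g_TexZ;   // per-axis planar projections (illustrative names)
SamplerState g_Sampler;

// Called from a pixel shader.
float3 TriplanarBranched(float3 worldPos, float3 normal)
{
    // Blend weights from the world space normal, biased so that small
    // contributions become exactly zero and their projections can be skipped.
    float3 w = saturate(abs(normalize(normal)) - 0.05);
    w /= (w.x + w.y + w.z);

    // Gradients are taken before the branches so SampleGrad stays valid
    // under divergent control flow.
    float2 uvX = worldPos.yz, uvY = worldPos.xz, uvZ = worldPos.xy;
    float2 dxX = ddx(uvX), dyX = ddy(uvX);
    float2 dxY = ddx(uvY), dyY = ddy(uvY);
    float2 dxZ = ddx(uvZ), dyZ = ddy(uvZ);

    float3 color = 0;
    [branch] if (w.x > 0) color += w.x * g_TexX.SampleGrad(g_Sampler, uvX, dxX, dyX).rgb;
    [branch] if (w.y > 0) color += w.y * g_TexY.SampleGrad(g_Sampler, uvY, dxY, dyY).rgb;
    [branch] if (w.z > 0) color += w.z * g_TexZ.SampleGrad(g_Sampler, uvZ, dxZ, dyZ).rgb;
    return color;
}
```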

Another good example of incoherent branching in a pixel shader is g-buffer projected decals (almost all engines do this). You render a bounding object and branch out pixels that do not hit the projection area. Highly incoherent, for example, when the decal is projected over a thin object, vegetation or a fence.

And the last one is used everywhere: vegetation rendering with an alpha mask. Modern engines will branch out the rest of the pixel shader after reading the alpha map. This is a huge perf boost. Obviously branching will be incoherent near the alpha clip edges. In some extreme cases like chain link fences (further away) the branching at warp/wave granularity doesn't save anything (too coarse). Branching at pixel granularity would cut the cost by 70%+.
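A minimal sketch of that pattern (resource names, the 0.5 threshold and the hardcoded light direction are illustrative only):

```hlsl
Texture2D    g_AlphaMask;    // illustrative resources
Texture2D    g_Albedo;
SamplerState g_Sampler;

float4 PSVegetation(float4 svPos  : SV_Position,
                    float2 uv     : TEXCOORD0,
                    float3 normal : NORMAL) : SV_Target
{
    // Read only the small alpha mask first.
    float  alpha = g_AlphaMask.Sample(g_Sampler, uv).r;
    float2 dx = ddx(uv), dy = ddy(uv);   // gradients taken before the branch

    float3 color = 0;
    [branch]
    if (alpha >= 0.5)
    {
        // The rest of the shader runs only for covered pixels; near the alpha
        // clip edges this branch is incoherent at wave granularity.
        float3 albedo = g_Albedo.SampleGrad(g_Sampler, uv, dx, dy).rgb;
        float  ndotl  = saturate(dot(normalize(normal), normalize(float3(0.4, 1.0, 0.2))));
        color = albedo * ndotl;
    }
    else
    {
        discard;   // fully masked-out pixel
    }
    return float4(color, 1.0);
}
```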

Conclusion: packing waves by execution mask would definitely also result in gains for pixel shaders.
 
Ok that clears up a lot of concerns. I thought the scalar and vector ALUs were more tightly coupled.

Not sure what "scalar code" you mean. If you mean things that are uniform across a wavefront, it just needs a way to be expressed in the shader languages.

If what you mean is scalarization of, say, a diverged branch, the main obstacle is the data path design, I'd say. This is the same as saying the scalar unit can access the lane RFs at any time. You would need extra read/write ports for every single RF, likely 3R1W if it is meant to be fully capable. That said, it is possible for them to coalesce the memory/LDS/export ops under the control of the scalar unit and share the data path.
Scalar code meaning any code that could otherwise execute on the vector ALU without being vectorized, so uniform across the wavefront. Was the limitation just a software issue and not a hardware issue? My understanding was that the scalar unit was primarily limited to control flow and not a superset of vector instructions, which prevented that. Scalar/uniform code extraction should otherwise already work, simplified if a shader language allowed for it.

Diverged branch scalarization would be a separate issue, but should otherwise work with a fully featured scalar unit. It would likely bottleneck the scalar unit, not accounting for datapath issues. I'd agree that the datapath design would be the concern, which is why I suggested packed vectors with a more local porting configuration. Copy a vector register into 16 scalar ones and loop through as needed. Slightly different feature set than the existing scalar unit, but you wouldn't need to fully port all the RFs.
 
Just want to point out that the order should be reversed. The ACEs conceptually create only workgroups from the dispatch grids, and track their completion. There is another intermediate layer that shuffles workgroups across the ACEs and the graphics pipeline, and then pipes them into CUs as wavefronts with regard to several constraints, e.g. LDS allocation size.

I felt I had written something incorrectly a while after posting, but by the time I figured out I had the relationship inverted it seemed too late to go back. Other items like the wavefronts having a flag for whether they are part of a workgroup with more than one wavefront are also more appropriate for an entity lower in the hierarchy.
 
Graphics shaders rarely, if ever, have incoherent control flow. Only in compute can you really make an argument that incoherent control flow is essential.

Except I can't help wondering that if you ditched all of the fixed function geometry hardware (especially TS) and the rasteriser and relied upon shader hardware that can handle incoherence and early-out "properly", you'd get something that was worth having. And it wouldn't choke on bottlenecks for stupid global architectural layout reasons (like shader engine count).

But I've been wishing this would happen for years now and it hasn't.
It hasn't happened for the TS because the fixed function logic is so small that you couldn't even add another CU if you removed the TS. As long as rasterization is the primary task, fixed function hardware is here to stay.
 
Ok that clears up a lot of concerns. I thought the scalar and vector ALUs were more tightly coupled.


Scalar code meaning any code that could otherwise execute on the vector ALU without being vectorized, so uniform across the wavefront. Was the limitation just a software issue and not a hardware issue? My understanding was that the scalar unit was primarily limited to control flow and not a superset of vector instructions, which prevented that. Scalar/uniform code extraction should otherwise already work, simplified if a shader language allowed for it.
It is not primarily "limited" to control flow; rather, the nature of the programming model prevents it from being more than control flow. I wouldn't be surprised if the compilers already take trivial cases down the scalar path, say a constant pointer with a constant offset. It also stores uniform data right now, like texture/buffer descriptors and kernel argument pointers. It is just that you can't possibly go beyond this without human intervention IMO. Whether it is worth resolving at runtime is debatable too, considering that uniform code paths should be quite identifiable by developers.
 
It is not primarily "limited" to control flow; rather, the nature of the programming model prevents it from being more than control flow. I wouldn't be surprised if the compilers already take trivial cases down the scalar path, say a constant pointer with a constant offset. It also stores uniform data right now, like texture/buffer descriptors and kernel argument pointers. It is just that you can't possibly go beyond this without human intervention IMO. Whether it is worth resolving at runtime is debatable too, considering that uniform code paths should be quite identifiable by developers.
GCN compilers already do optimizations for coherent loads & math. The most common case of course is a load from a constant address -> the compiler emits a scalar load and stores the value in a scalar register.

This article introduces a simple algorithm for automatic scalar (coherent) code extraction:
http://hwacha.org/papers/scalarization-cgo2013.pdf

All memory loads & ALU instructions based solely on compile time constants, root constants or SV_GroupID (group >= wave) can be trivially converted to scalar loads & instructions (operating on scalar registers). This can be propagated: all inputs are coherent -> all outputs are coherent. This is also why constant buffer loads can often be optimized as scalar loads (and stored in the scalar cache). Indexing an array in a constant buffer with anything calculated from SV_ThreadID obviously is the exception, and the compiler generates vector loads and vector ALU from code like that.

Unfortunately the GCN scalar unit doesn't have a floating point instruction set. Scalar propagation cannot continue over float instructions. When the first float instruction is met, the compiler must broadcast the scalar to a vector register and emit SIMD vector math. I have been hoping for a long time that a future GCN would add a full float instruction set to the scalar unit. This would allow a fully automated scalarization process.
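A small compute shader illustrating both points (buffer names and the 64-thread group are assumptions on my part, chosen so one group maps exactly to one GCN wave):

```hlsl
cbuffer TileConstants
{
    float4 g_TileParams[256];              // hypothetical per-tile constants
};
StructuredBuffer<float4>   g_PerThreadData; // hypothetical per-thread input
RWStructuredBuffer<float4> g_Output;

[numthreads(64, 1, 1)]                      // 64 threads = exactly one GCN wave
void CSMain(uint3 groupId : SV_GroupID, uint3 dtid : SV_DispatchThreadID)
{
    // Index depends only on SV_GroupID -> uniform across the wave -> the
    // compiler can emit a scalar load and keep the value in SGPRs.
    float4 tileParams = g_TileParams[groupId.x];

    // Index depends on SV_DispatchThreadID -> per-lane -> vector load into VGPRs.
    float4 data = g_PerThreadData[dtid.x];

    // First float operation on the uniform value: since the scalar unit has no
    // float ALU, tileParams gets broadcast to VGPRs here and the math is vector.
    g_Output[dtid.x] = data * tileParams;
}
```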

The compiler could also be better at scalar extraction. Right now it does it only for some simple known cases. For example it can't produce scalar code if I divide SV_ThreadID by 64 (...128, 192, 256...) and do a load based on that address. The compiler could emit a scalar load to a scalar register (and propagate the status further -> more scalar loads & math).

If the group is 64 threads (for example an 8x8 tile in screen space) then SV_GroupID is perfect for scalarization (1:1 mapping to waves). All per tile operations (culling, etc) could be offloaded to the scalar unit. But of course this would require float instruction support in the scalar unit.

For groups larger than that it becomes slightly harder. If the group X size is 64, then groupThreadId.y is wave coherent, and everything based on that could be automatically turned into scalar code. However this would require the programmer to specify awkward group sizes, such as 64x4. This for example mimics a 16x16 group with 8x8 subgroups. You of course need to do custom math to remap threads to actual screen pixels (luckily GCN already has combined single cycle shift+mask instructions). I haven't tried whether the AMD GCN shader compiler does scalarization automatically based on SV_GroupThreadID.y (if the group width is a multiple of 64). I could try it on some integer heavy compute shader. It obviously wouldn't help with float math heavy shaders.
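A sketch of that remap, assuming a hypothetical output texture and the 64x4 group described above:

```hlsl
RWTexture2D<float4> g_Output;   // hypothetical output target

// 64x4 threads = four GCN waves, together covering a 16x16 pixel tile.
[numthreads(64, 4, 1)]
void CSMain(uint3 groupId : SV_GroupID, uint3 gtid : SV_GroupThreadID)
{
    // gtid.y (0..3) is constant within each 64-wide wave and selects the 8x8 sub-tile.
    uint2 subTile = uint2(gtid.y & 1, gtid.y >> 1) * 8;

    // gtid.x (0..63) is remapped to a pixel inside the 8x8 sub-tile with shift+mask.
    uint2 inTile = uint2(gtid.x & 7, gtid.x >> 3);

    // Final screen pixel inside this group's 16x16 tile.
    uint2 pixel = groupId.xy * 16 + subTile + inTile;

    g_Output[pixel] = float4(0, 0, 0, 0);   // placeholder for the actual per-pixel work
}
```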
 
Am I correct in assuming this is using a wait count for a memory write at one pipe and then another wait count on the memory pipe reading it back in? I'm not sure what other wait counts would help with that scenario.
Actually, wait counts are only used for the variable latencies of memory accesses. Transferring data between scalar and vector registers does NOT require the use of the waitcnt instruction. Sebbi's remark probably confused it with the sometimes required wait states. But that is something different, as it covers fixed latencies, where dependencies of the operands are not checked in hardware. That means the compiler (or the programmer, if you write ISA code) has to insert a certain number of wait states (either NOPs or some independent instructions, if available) into the instruction stream before accessing the registers in question. It basically reflects the actual fixed latency between the scalar and vector register files.
This is of course not necessary when a vector instruction accesses a broadcasted scalar register, as the register accesses (scalar as well as vector) are pipelined and happen quite a few cycles ahead of the critical 4 cycle latency loop of the execution (the result is written back after that 4 cycle execution loop of course), roughly the same way it happens in basically all CPUs. This was a question earlier in the thread, where nobody stated it clearly, as far as I have seen. One can easily decide which instruction from which wavefront is going to be issued way before the preceding instructions have actually calculated any result, as long as it is guaranteed that the values will be ready, which is easy with fixed latencies for all ALU instructions, result forwarding/bypassing and largely guaranteed collision free register accesses (as in GCN). The total pipeline length in GCN could easily be 12 cycles (including operand fetch and result writeback) or so without impacting the 4 cycle latency for ALU operations (the VLIW architectures had at least 12 stages after the sequencer and 8 cycles ALU latency). This length is not that important; in some way it defines the minimum latency of a memory access.
 
One can easily decide which instruction from which wavefront is going to be issued way before the preceding instructions have actually calculated any result, as long as it is guaranteed that the values will be ready, which is easy with fixed latencies for all ALU instructions, result forwarding/bypassing and largely guaranteed collision free register accesses (as in GCN). The total pipeline length in GCN could easily be 12 cycles (including operand fetch and result writeback) or so without impacting the 4 cycle latency for ALU operations (the VLIW architectures had at least 12 stages after the sequencer and 8 cycles ALU latency). This length is not that important; in some way it defines the minimum latency of a memory access.

One of the spitballed scenarios was a low port count setup with 3 operand cycles and 1 exe, with writeback not done until a later cycle--potentially the next execution phase 4 clocks later. That could give 8. There are hints of depth in the upstream stages for items that interact with the instruction buffer and scalar unit, like the wait states for setting the hardware flag for VSKIP directly.
That seems to hint at 4+ cycles for actions that interact with or read from the upstream logic and hardware state.

It seems like it would be easier to handle contention between the various operations from other domains that hit the vector register file if there were more ports than the bare minimum to support just the VALUs.
 
So now that we know that TSMC's 16FF gets substantially better performance than Samsung's 14FF (which should be equivalent to GF's 14FF) because of GP107's lower clocks, how likely is it that Vega will be manufactured by TSMC (or even #gasp# Samsung's 10nm)?
 
So now that we know that TSMC's 16FF gets substantially better performance than Samsung's 14FF (which should be equivalent to GF's 14FF) because of GP107's lower clocks, how likely is it that Vega will be manufactured by TSMC (or even #gasp# Samsung's 10nm)?

Perhaps mostly equivalent?
GF's licensing led to a "copy-smart" where said copy was supposed to be a backup source on the Samsung+GF side versus TSMC--where Samsung may have been a minority supplier with no clear evidence of GF being used.

On one hand, this might do something to redeem Polaris if its basis is that much poorer, but I'd like to see additional data points on GP107 and what its voltages, clocks, and power consumption come out to be.

It's mixing up a lot of marketing points, AMD's "roadmap" versus the reality of Polaris, and Vega's place in it, but it might not be good unless that roadmap's projection has been discarded.
Possibly, if you want a significant architectural improvement you don't want Vega to be on 10nm. The power improvement from the process and HBM would seemingly leave as much slack for a change as FinFET did with Polaris versus 28nm.
 
I wonder if Vega 10 is likely to get 128 ROPs, coupled with that HBM2. That would go a long way toward making single GPU/card 60fps 4K viable in an AMD product.
 
So now that we know that TSMC's 16FF gets substantially better performance than Samsung's 14FF (which should be equivalent to GF's 14FF) because of GP107's lower clocks, how likely is it that Vega will be manufactured by TSMC (or even #gasp# Samsung's 10nm)?

Let us wait for the real power consumption and OC potential of the GP107 chips before drawing this conclusion.
 
Actually, wait counts are only used for the variable latencies of memory accesses. Transferring data between scalar and vector registers does NOT require the use of the waitcnt instruction. Sebbi's remark probably confused it with the sometimes required wait states. But that is something different, as it covers fixed latencies, where dependencies of the operands are not checked in hardware. That means the compiler (or the programmer, if you write ISA code) has to insert a certain number of wait states (either NOPs or some independent instructions, if available) into the instruction stream before accessing the registers in question. It basically reflects the actual fixed latency between the scalar and vector register files.
This is of course not necessary when a vector instruction accesses a broadcasted scalar register, as the register accesses (scalar as well as vector) are pipelined and happen quite a few cycles ahead of the critical 4 cycle latency loop of the execution (the result is written back after that 4 cycle execution loop of course), roughly the same way it happens in basically all CPUs. This was a question earlier in the thread, where nobody stated it clearly, as far as I have seen. One can easily decide which instruction from which wavefront is going to be issued way before the preceding instructions have actually calculated any result, as long as it is guaranteed that the values will be ready, which is easy with fixed latencies for all ALU instructions, result forwarding/bypassing and largely guaranteed collision free register accesses (as in GCN). The total pipeline length in GCN could easily be 12 cycles (including operand fetch and result writeback) or so without impacting the 4 cycle latency for ALU operations (the VLIW architectures had at least 12 stages after the sequencer and 8 cycles ALU latency). This length is not that important; in some way it defines the minimum latency of a memory access.
Thanks for correcting me. I confused the waitcnt (LGKM_CNT) with the SGPR->VGPR move, but it is instead there for the scalar load (scalar memory loads obviously also need waitcnt). Often these two happen sequentially (1. the cbuffer value gets loaded to a scalar, 2. wait for the scalar load to finish, 3. move the scalar register to a vector register). Sometimes I see nops inserted when I do SGPR<->VGPR register transfers, but as you said, often the compiler can reorder instructions to mask the latency. Waitcnt is also used for cross-lane ops (and LDS accesses), but GCN3 added some special-case faster cross-lane ops that don't need it (but instead need a fixed amount of wait states).
 
A good read if you are interested in GCN wait counts and latency hiding:
https://bartwronski.com/2014/03/27/gcn-two-ways-of-latency-hiding-and-wave-occupancy/

I have one remark. This article mentioned that Xbox 360 didn't require as much occupancy tuning as GCN. I however remember spending weeks solely optimizing my register counts on Xbox 360 shaders, and tuning the VS/PS register balance of each pass separately. Too bad PC HLSL never adopted the [isolate] attribute. It was a perfect tool for preventing the compiler from moving your loads up too much (wasting registers). https://msdn.microsoft.com/en-us/library/bb313977(v=xnagamestudio.31).aspx. Now you must rely on hacks if you want to forcefully limit the scope of your registers.
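For what it's worth, here is a purely illustrative sketch of the kind of hack I mean (my own assumption, not an officially supported mechanism): gate the late-used load behind a branch on a constant the compiler cannot fold away, so it is discouraged from hoisting the load to the top of the shader and stretching the register lifetime.

```hlsl
// g_AlwaysOne is a hypothetical constant the application always sets to 1.
cbuffer Hacks { uint g_AlwaysOne; };
Texture2D    g_DetailTex;
SamplerState g_Sampler;

float3 LateDetail(float2 uv, float2 dx, float2 dy)
{
    float3 detail = 0;
    [branch]
    if (g_AlwaysOne != 0)
    {
        // Kept close to its use instead of drifting towards the shader start.
        detail = g_DetailTex.SampleGrad(g_Sampler, uv, dx, dy).rgb;
    }
    return detail;
}
```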
 
The latest disclosures about the PS4 Pro mention two changes considered to be in the graphics architecture pipeline for AMD:
2x rate FP16, and an improved method for distributing geometry among compute units.

The PS4 Pro also includes an additional hardware path for recording what triangle went to which pixels in a separate buffer in parallel with the Z-buffer. That wasn't cited as a future item like the prior two items, but it seems like it could be useful in a future GPU for the same reasons as it would be for the console.

http://www.eurogamer.net/articles/d...tation-4-pro-how-sony-made-a-4k-games-machine
 