AMD Vega 10, Vega 11, Vega 12 and Vega 20 Rumors and Discussion

Well, at least AnandTech has been using GCN 1.0 for Southern Islands, GCN 1.1 for Hawaii and Bonaire, GCN 1.2 for Tonga and Fiji, and GCN 1.3 for Polaris. If Vega turns out to be the largest departure from Southern Islands, then following that logic calling it GCN 2.0 makes some sense.
Though if we follow AMD's nomenclature of using GCNx for each transition, GCN2 already exists in Bonaire + Hawaii and Vega should be GCN5.

IMHO anandtech and all others should start using the same names as AMD, as it would avoid further confusion.
GCN 1.1 was called 2.0 before it was released, 1.2 was possibly called 2.0 before it was released (can't remember for sure), and 1.3 was again called 2.0 for sure.
AMD themselves have now settled on 1st, 2nd, 3rd and 4th gen GCN, while Vega will be 5th - there are so far no indications that it would be a bigger departure than 7.x > 8.x was (I think 1.2 was the first 8.x? or was it 1.1 already?)
 
GCN 1.1 was called 2.0 before it was released, 1.2 was possibly called 2.0 before it was released (can't remember for sure), and 1.3 was again called 2.0 for sure.
AMD themselves have now settled on 1st, 2nd, 3rd and 4th gen GCN, while Vega will be 5th - there are so far no indications that it would be a bigger departure than 7.x > 8.x was (I think 1.2 was the first 8.x? or was it 1.1 already?)
I think it's because of the rumors of it being a larger transition than the typical GCN iteration. There would seemingly be ample evidence that may be the case as well. FP16, ID buffer, geometry scheduling on PS4 Pro, Zen APU requirements, etc. Then those patents/papers suggesting they were looking at scalars and variable SIMDs around the early design phase. Plus they'll need CR and OIT for DX12 along with any compatibility instructions for Scorpio/XB1 if porting is to be trivial.
 
GCN 1.1 was called 2.0 before it was released, 1.2 was possibly called 2.0 before it was released (can't remember for sure), and 1.3 was again called 2.0 for sure.

I don't know if you have a typo there, but I'm pretty sure AMD used "3rd gen" for the architectures that implemented lossless color compression (Tonga, Fiji and Carrizo IGP):

[AMD slide image]


Which is what anandtech calls GCN 1.2.
I doubt they would call 1.3 GCN 2, or vice versa.
 
I think it's because of the rumors of it being a larger transition than the typical GCN iteration. There would seemingly be ample evidence that may be the case as well. FP16, ID buffer, geometry scheduling on PS4 Pro, Zen APU requirements, etc. Then those patents/papers suggesting they were looking at scalars and variable SIMDs around the early design phase. Plus they'll need CR and OIT for DX12 along with any compatibility instructions for Scorpio/XB1 if porting is to be trivial.
We have no evidence whatsoever that the many patents are related to Vega/5th gen GCN.
 
Well, at least AnandTech has been using GCN 1.0 for Southern Islands, GCN 1.1 for Hawaii and Bonaire, GCN 1.2 for Tonga and Fiji, and GCN 1.3 for Polaris. If Vega turns out to be the largest departure from Southern Islands, then following that logic calling it GCN 2.0 makes some sense.
Though if we follow AMD's nomenclature of using GCNx for each transition, GCN2 already exists in Bonaire + Hawaii and Vega should be GCN5.

IMHO anandtech and all others should start using the same names as AMD, as it would avoid further confusion.

I don't think Anandtech has ever referred to Polaris as GCN 1.3. I've only ever heard them say 4th generation GCN wrt Polaris (then again, since we don't have a full review of the RX 480, nor even a partial one of the RX 460, let's say they haven't referred to it yet).

But yes, I agree that to avoid any confusion, we should stick to what AMD calls it, i.e. GCN 4 for Polaris.
 
I don't think Anandtech has ever referred to Polaris as GCN 1.3. I've only ever heard them say 4th generation GCN wrt Polaris (then again, since we don't have a full review of the RX 480, nor even a partial one of the RX 460, let's say they haven't referred to it yet).

But yes, I agree that to avoid any confusion, we should stick to what AMD calls it, i.e. GCN 4 for Polaris.

Anandtech continues to use the "old" 1.1 and 1.2 naming scheme for consistency's sake, but they use "GCN 4" to describe Polaris.

http://www.anandtech.com/show/9886/amd-reveals-polaris-gpu-architecture/2

"Thankfully for Polaris, RTG is revising their naming policies in order to present a clearer technical message about the architecture. Beginning with Polaris, RTG will be using Polaris as something of an umbrella architecture name – what RTG calls a macro-architecture – meant to encompass several aspects of the GPU. The end result is that the Polaris architecture name isn’t all that far removed from what would traditionally be the development family codenames (e.g. Evergreen, Southern Islands, etc), but with any luck we should be seeing more consistent messaging from RTG and we can avoid needing to create unofficial version numbers to try to communicate the architecture. To that end the Polaris architecture will encompass a few things: the fourth generation Graphics Core Next core architecture."

And later on, they state:

"Officially RTG has not assigned a short-form name to this architecture at this time, but as reading the 8-syllable “fourth generation GCN” name will get silly rather quickly, for now I’m going to call it GCN 4."

So that's why they mix and match GCN 4 with GCN 1.1/1.2 in tables like this one:

http://www.anandtech.com/show/10446/the-amd-radeon-rx-480-preview
 
I think it's because of the rumors of it being a larger transition than the typical GCN iteration. There would seemingly be ample evidence that may be the case as well. FP16, ID buffer, geometry scheduling on PS4 Pro, Zen APU requirements, etc.
FP16 (int + float) was already introduced in GCN3. Adding 2x fp16 ops is not a huge change. GCN3 already introduced bigger changes to the ISA. GCN4 introduced a vastly improved geometry pipeline. Is this "geometry scheduling" the same thing or something else? So far I would guess that Vega is an iterative improvement, not a radically new design.
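To make the "2x fp16" point concrete, this is roughly what packed half math looks like from the programming side. A minimal CUDA sketch (not AMD-specific, just illustrating two 16-bit ops per 32-bit lane; the kernel name and data layout are made up):

```
#include <cuda_fp16.h>

// One __half2 register holds two 16-bit floats; a single __hfma2 does a fused
// multiply-add on both halves. That pairing is where the "2x fp16 rate" comes
// from compared to a plain fp32 FMA. Requires sm_53 or newer.
__global__ void fma_half2(const __half2* a, const __half2* b,
                          const __half2* c, __half2* out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = __hfma2(a[i], b[i], c[i]);  // two fp16 FMAs in one instruction
}
```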

Vega will be the first GPU to show the actual benefits of AMD's new geometry pipeline. Polaris shows huge gains in synthetic benchmarks (pretty close to Nvidia), but not much in real games; it is bottlenecked by other factors. But we all saw how badly Fury X was bottlenecked by fixed function hardware in games. It had huge bandwidth, 64 CUs and awesome compute performance, but not enough geometry throughput. My guess is that Vega is what Fury X should have been without the bottlenecks, and of course with 8/12/16 GB memory, a new process and the full DX 12.1 feature set. AMD is not lacking much from the DX 12.1 spec (they already have the highest tier resource binding model), so they don't need radical changes to the architecture.
 
FP16 (int + float) was already introduced in GCN3. Adding 2x fp16 ops is not a huge change.
Not huge, but an added feature nonetheless. The 2xfp16 is what I was referring to. That should at the very least add some new unique instructions.

GCN4 introduced a vastly improved geometry pipeline. Is this "geometry scheduling" the same thing or something else?
It's what was laid out in the DF PS4 Pro article. Scheduling patches across CUs during tessellation. I'd even go so far as to consider a move away from fixed function geometry units, but that would be a step further than what was proposed for PS4 with the cross-CU scheduling. Another possibility might be multiple fixed function units and async graphics. XB already had two command processors for that.

So far I would guess that Vega is an iterative improvement, not a radically new design.
I don't think it will be a radical change either, but still quite significant. The geometry bottlenecks should be largely removed. The big change I still think will be the scalar unit, if they improve it. Maybe they simply make the existing scalar processor more robust, but with any added load from SM6 and the wave level ops it just doesn't seem like it will be fast enough. Especially not for scalarization or messy waves. That's why I'm still leaning towards the scalar unit being replicated out 4x to run alongside the SIMDs while consuming vector or possibly scalar instructions. It still looks very similar to existing GCN, but peak IPC increases without requiring a whole lot more transistors or bandwidth. That 16+1+1 setup would have 3x the vector IPC with maybe 25% more transistors/area, provided the waves are small enough. That could be far more efficient if the workload was heavily diverging or scalar. For full waves the design maybe has 20% higher peak throughput depending on the scheduling hardware.
 
Scheduling patches across CUs during tessellation. I'd even go so far as to consider a move away from fixed function geometry units, but that would be a step further than what was proposed for PS4 with the cross-CU scheduling.
IIRC GCN3 already fixed the tessellator load balancing issue. It resulted in a moderate perf boost in high tessellation factor benchmarks. GCN4 further improved tessellation performance. Now it is pretty close to Nvidia.
The big change I still think will be the scalar unit, if they improve it. Maybe they simply make the existing scalar processor more robust, but with any added load from SM6 and the wave level ops it just doesn't seem like it will be fast enough.
Cross lane ops do not use the scalar unit on GCN. Old cross lane permutes (GCN2) used the LDS crossbar to swizzle lanes (variable latency). GCN3 introduced fast fixed latency DPP operations (data permuted inside the SIMD vector reg file). DPP instructions are a perfect fit to implement SM 6.0 wave ops (vote, broadcast, reduction, prefix sum) and quad ops. WaveOnce is the only construct that could benefit from a more powerful (and fully featured) scalar unit. My prediction is that WaveOnce will not be used as much as the other features, since it only helps AMD GCN and only integer math and loads/stores. All other wave ops speed up Nvidia and Intel GPUs as well.
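For reference, here's how those wave ops look when written out by hand. A minimal CUDA warp-level sketch (lane-shift shuffles standing in for what DPP row shifts do on GCN; the function and variable names are made up):

```
// Vote, broadcast, reduction and inclusive prefix sum at warp/wave granularity.
__device__ int wave_ops_demo(int v)
{
    const unsigned full = 0xffffffffu;
    int lane = threadIdx.x & 31;

    unsigned vote  = __ballot_sync(full, v > 0);   // vote: one bit per lane
    int      bcast = __shfl_sync(full, v, 0);      // broadcast from lane 0

    int sum = v;                                   // tree reduction
    for (int ofs = 16; ofs > 0; ofs >>= 1)
        sum += __shfl_down_sync(full, sum, ofs);

    int scan = v;                                  // inclusive prefix sum
    for (int ofs = 1; ofs < 32; ofs <<= 1) {
        int up = __shfl_up_sync(full, scan, ofs);
        if (lane >= ofs) scan += up;
    }
    return scan ^ bcast ^ sum ^ (int)vote;         // just to keep results live
}
```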

I would love to see a powerful fully featured scalar unit (float math, typed/image loads/stores) and an automatically scalarizing compiler. With a good compiler, this would result in a noticeable perf & perf/watt improvement. AMD's architecture uses the scalar unit to be more flexible than competitors, but this also is a perf/watt tradeoff (vs fixed function constant buffer and resource descriptor hardware). So it would be important that AMD takes full advantage of this improved flexibility, to offload repeated work/data away from the SIMD vector path. This would compensate for the perf/watt downsides. But I don't think the scalar unit would need to be a big redesign from the current one. Make it fully featured and focus on a new compiler that automatically offloads as much as possible to the (improved) scalar unit.
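To make "offload repeated work/data away from the SIMD vector path" concrete, this is the kind of transform I mean, done here by hand. A minimal CUDA sketch (the material table and tile index are hypothetical; on GCN the broadcast lane would simply be the scalar unit doing a scalar load):

```
// Hand-scalarized version of a wave-uniform load: when a value is provably the
// same for every lane (here a per-tile material constant), do the load once and
// broadcast it, instead of letting all lanes repeat identical work.
__device__ float shade_scalarized(const float* materialTable, int tileMaterial,
                                  float albedo)
{
    const unsigned full = 0xffffffffu;
    float k = 0.0f;
    if ((threadIdx.x & 31) == 0)             // "scalar" work: one lane only
        k = materialTable[tileMaterial];     // uniform load, once per wave
    k = __shfl_sync(full, k, 0);             // broadcast to the whole wave
    return albedo * k;                       // per-lane (vector) work stays wide
}
```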
 
IIRC GCN3 already fixed the tessellator load balancing issue. It resulted in a moderate perf boost in high tessellation factor benchmarks. GCN4 further improved tessellation performance. Now it is pretty close to Nvidia.
"The work distributor in PS4 Pro is very advanced. Not only does it have the fairly dramatic tessellation improvements from Polaris, it also has some post-Polaris functionality that accelerates rendering in scenes with many small objects... So the improvement is that a single patch is intelligently distributed between a number of compute units, and that's trickier than it sounds because the process of sub-dividing and rendering a patch is quite complex."
http://www.eurogamer.net/articles/d...tation-4-pro-how-sony-made-a-4k-games-machine
Going off that, they've further improved it. Improved culling, possibly from that ID buffer, and an iteration on the balancing would seem likely.

Cross lane ops do not use the scalar unit on GCN. Old cross lane permutes (GCN2) used the LDS crossbar to swizzle lanes (variable latency). GCN3 introduced fast fixed latency DPP operations (data permuted inside the SIMD vector reg file). DPP instructions are a perfect fit to implement SM 6.0 wave ops (vote, broadcast, reduction, prefix sum) and quad ops. WaveOnce is the only construct that could benefit from a more powerful (and fully featured) scalar unit. My prediction is that WaveOnce will not be used as much as the other features, since it only helps AMD GCN and only integer math and loads/stores. All other wave ops speed up Nvidia and Intel GPUs as well.
DPP still had limited patterns while the scalar could allow any programmable permutations. So for vote, broadcast, etc. that are uniform the SIMD works well, but that breaks down if you start swizzling random lanes. It's a more robust feature set. That same hardware could actually exist within the scalar, as logic would exist within the cache/registers to perform operations like that. It would be required for the scalar to skip inactive lanes while pipelining. That's likely how the current implementation works, as it's boolean logic for the most part. Reading a bunch of bools into 16x32 bit registers is a bit wasteful otherwise. I'd agree it's not likely to see a lot of use, but given the design it's a rather trivial addition. Some HPC loads likely benefit from it as well. For example a relatively small (64 element?) serial sorting algorithm compiled out of a C library that you wouldn't want to dump back to the CPU to complete. Some internal culling functionality may be able to leverage that capability as well.

I would love to see a powerful fully featured scalar unit (float math, typed/image loads/stores) and an automatically scalarizing compiler. With a good compiler, this would result in a noticeable perf & perf/watt improvement. AMD's architecture uses the scalar unit to be more flexible than competitors, but this also is a perf/watt tradeoff (vs fixed function constant buffer and resource descriptor hardware). So it would be important that AMD takes full advantage of this improved flexibility, to offload repeated work/data away from the SIMD vector path. This would compensate for the perf/watt downsides. But I don't think the scalar unit would need to be a big redesign from the current one. Make it fully featured and focus on a new compiler that automatically offloads as much as possible to the (improved) scalar unit.
I'd definitely agree with most of that. Keep in mind they're targeting this for more than just graphics. So a fully featured scalar makes quite a bit of sense and is bordering on required.

http://lpgpu.org/wp/wp-content/uploads/2014/09/lpgpu_scalarization_169_Jan_Lucas.pdf
Definitely give this a read. A bit basic in concepts, but it has some benchmarks and is really close to what I was proposing. My approach would have the benefit of the temporary registers effectively handling the scalarization and alignment issues while keeping the traditional GCN structure. It would not however be very effective at dependent scalar math beyond its temporary registers. That's why I suggested keeping the existing scalar in addition to the new ones: Vector ALU, Scalar ALU (vector input/loops/DSP), Scalar ALU (scalar input). I doubt sticking with the existing scalar would be fast enough, as it accounts for maybe 2-3% of the throughput of a CU. Great for uniform workloads, but a serious bottleneck if you really start taking advantage of it.

EDIT: A better description of the scalar I suggested would be a DSP, as the functionality is extremely similar.
 
DPP still had limited patterns while the scalar could allow any programmable permutations. So for vote, broadcast, etc. that are uniform the SIMD works well, but that breaks down if you start swizzling random lanes.
GCN3 also has ds_permute. Fully programmable (random index per lane) permute. Supports both push/pull semantics.

http://gpuopen.com/amd-gcn-assembly-cross-lane-operations/
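In CUDA terms, a pull-style programmable permute is just a shuffle where every lane supplies its own source-lane index (the push variant has no direct shuffle equivalent there). A minimal sketch:

```
// Pull-semantics permute: each lane reads from an arbitrary lane of its choice,
// so any permutation, including a fully random one, is a single instruction.
// Roughly what ds_bpermute gives you on GCN3.
__device__ int pull_permute(int value, int srcLane)
{
    return __shfl_sync(0xffffffffu, value, srcLane & 31);
}
```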
Keep in mind they're targeting this for more than just graphics. So a fully featured scalar makes quite a bit of sense and is bordering on required.

http://lpgpu.org/wp/wp-content/uploads/2014/09/lpgpu_scalarization_169_Jan_Lucas.pdf
Agreed. And modern graphics rendering is also ~50% compute shaders. Our game is almost 100% compute. Robust automatic scalarization would help both professional and gaming workloads. The results in this article show similar gains to many others. Close to a 10% perf gain is a common case, while at the same time reducing power consumption by 10%. This is around a 20% improvement in perf/watt. It would help AMD a lot vs Nvidia. And this is existing CUDA code, not optimized at all for scalarization. Code specially designed to exploit scalar unit can easily run 30%-50% faster (with practically zero power consumption increase).
 
They have all been "sidegrades" until now, the biggest one being Polaris (the one that maybe can be considered a real upgrade).
 
They have all been "sidegrades" until now, the biggest one being Polaris (the one that maybe can be considered a real upgrade).
GCN3 was a good iterative upgrade. Delta color compression offered a nice reduction in ROP bandwidth usage. It allowed mid-tier cards (Radeon 285) to beat the previous gen high end (the new card with a 256 bit bus was competitive with the old 384 bit bus). GCN3 also doubled the geometry throughput. Highly important vs Nvidia. GCN3 also was a big improvement for cross lane ops. Unfortunately we need to wait for DX12 SM 6.0 to see the performance impact of this change.

Without these improvements AMD would be even more behind Nvidia at the moment. Unfortunately for AMD, Nvidia's Fermi->Kepler->Maxwell upgrades were huge. Nvidia did radical architectural changes to both their compute units and their fixed function units both times. Pascal (consumer) was mostly a shrink, but a very well done shrink of an already excellent architecture.
 
Unfortunately for AMD, Nvidia's Fermi->Kepler->Maxwell upgrades were huge. Nvidia did radical architectural changes to both their compute units and their fixed function units both times. Pascal (consumer) was mostly a shrink, but a very well done shrink of an already excellent architecture.

That's my point. Even the GCN3 to Polaris step was smaller than what Kepler to Maxwell was.
 
GCN3 also has ds_permute. Fully programmable (random index per lane) permute. Supports both push/pull semantics.
Fully programmable, but a different performance tier as indicated by the GPUOpen documentation (8.6 TB/s vs 51.6 TB/s for Fiji) for DPP and permute. Not exactly slow, but much more optimal paths could exist. That bandwidth difference will likely be tied to power. The theorized scalar would have its own separate pool nearby, along with the ability to run instructions with the cross-lane at less than optimal wave sizes. Depending on the algorithm that may or may not be more efficient, and at the very least it would leave LDS free for other commands.

And this is existing CUDA code, not optimized at all for scalarization. Code specially designed to exploit scalar unit can easily run 30%-50% faster (with practically zero power consumption increase).
CUDA code, and designed with a much narrower SIMT width in mind than the SIMD GCN uses. The TrueAudio DSPs were about as close as a high performance scalar got. Might also explain why they did away with them in recent designs. For a Zen APU coprocessor, DSP functionality could be a huge gain in addition to the scalar for graphics. There is an entire market built around that style of processing; getting some crossover with graphics could be significant, along with audio processing.

I guess the biggest question is how busy is that scalar in your experience? If that utilization is already high, making it more flexible isn't likely to help very much.

That's my point. Even the GCN3 to Polaris step was smaller than what Kepler to Maxwell was.
Alternating designs isn't what I would call aggregate progress. Yes there have been some good advancements in compression and the tiled rasterization (if they use it all the time), but for the most part they've just been adding processors. The largest gains are attributed to doubling down on certain aspects of the design. P100 is a different distribution, but I wouldn't say the architecture changed much. The ratios just went back to a more even distribution of compute performance as opposed to being focused on FP32 and they built out the caches a bit.
 
That's my point. Even the GCN3 to Polaris step was smaller than what Kepler to Maxwell was.
Kepler -> Maxwell was a HUGE change for Nvidia.

1. Brand new SM design. Split to quadrants. Less shared stuff (less communication/distance overhead). Doubled max thread group count per SM. Split LDS / L1 cache and increased LDS amount. Combined L1 and texture caches, etc. Source: https://devblogs.nvidia.com/paralle...ould-know-about-new-maxwell-gpu-architecture/

2. New register files. Simpler banking and register operand reuse cache. One of the potential reasons Maxwell/Pascal reaches very high clocks. Source: https://github.com/NervanaSystems/maxas/wiki/SGEMM.

On Maxwell there are 4 register banks, but unlike on Kepler (also with 4 banks) the assignment of banks to numbers is very simple. The Maxwell assignment is just the register number modulo 4. On Kepler it is possible to arrange the 64 FFMA instructions to eliminate all bank conflicts. On Maxwell this is no longer possible. Maxwell, however, provides something to make up for this and at the same time offers the capability to significantly reduce register bank traffic and overall chip power draw. This is the operand reuse cache. The operand reuse cache has 8 bytes of data per source operand slot. An instruction like FFMA has 3 source operand slots. Each time you issue an instruction there is a flag you can use to specify if each of the operands is going to be used again. So the next instruction that uses the same register in the same operand slot will not have to go to the register bank to fetch its value. And with this feature you can see how a register bank conflict can be averted.

3. Tiled rasterizer. Records big chunks of vertex shader output to temporary on-chip buffer (L2 cache), splits to big tiles and executes per tile. Big savings in memory bandwidth. Source: http://www.realworldtech.com/tile-based-rasterization-nvidia-gpus/

4. Kepler was feature level 11_0 hardware. Behind even AMD GCN 1. Maxwell added all feature level 11_1, 12_0 and 12_1 features: 3d tiled resources, typed UAV load, conservative raster, rasterizer order views.

5. Maxwell added hardware thread group shared memory atomics. I have personally measured up to 3x perf boost in some of my shaders (Kepler -> Maxwell). When combined with other improvements, Maxwell is finally faster than GCN in compute shaders.
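To illustrate point 5, a shared-memory histogram is the classic case where this matters. A minimal CUDA sketch (a generic 256-bin byte histogram, not any specific game shader):

```
// The atomicAdd on __shared__ memory is the operation Maxwell accelerated in
// hardware; Kepler emulated shared atomics with a lock/update/unlock loop,
// which is why shaders that hammer LDS/shared atomics sped up so much.
__global__ void histogram256(const unsigned char* data, int n, unsigned int* out)
{
    __shared__ unsigned int bins[256];
    for (int i = threadIdx.x; i < 256; i += blockDim.x)
        bins[i] = 0;
    __syncthreads();

    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n;
         i += gridDim.x * blockDim.x)
        atomicAdd(&bins[data[i]], 1u);   // shared memory atomic
    __syncthreads();

    for (int i = threadIdx.x; i < 256; i += blockDim.x)
        atomicAdd(&out[i], bins[i]);     // merge block results to global memory
}
```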
Fully programmable, but a different performance tier as indicated by the GPUOpen documentation (8.6 TB/s vs 51.6 TB/s for Fiji) for DPP and permute. Not exactly slow, but much more optimal paths could exist. That bandwidth difference will likely be tied to power. The theorized scalar would have its own separate pool nearby, along with the ability to run instructions with the cross-lane at less than optimal wave sizes. Depending on the algorithm that may or may not be more efficient, and at the very least it would leave LDS free for other commands.
8.6 TB/s is the LDS memory bandwidth. This is what you get when you communicate over LDS instead of using cross lane ops such as DPP or ds_permute. Cross lane ops do not touch LDS memory at all. LDS data sharing between threads needs both LDS write and LDS read. Two instructions and two waits. GPUOpen documentation doesn't describe the "bandwidth" of the LDS crossbar that is used by ds_permute. It describes the memory bandwidth and ds_permute doesn't touch the memory at all (and doesn't suffer from bank conflicts).

DPP is definitely faster in cases where you need multiple back-to-back operations, such as prefix sum (7 DPP in a row). But I can't come up with any good examples (*) where you need a huge amount of random ds_permutes (DPP handles common cases such as prefix sums and reductions just fine). Usually you have enough math to completely hide the LDS latency (even for real LDS instructions with bank conflicts). ds_permute shouldn't be a problem. The current scalar unit is also good enough for fast votes. A compare generates a 64 bit scalar register mask. You can read it back a few cycles later (constant latency, no waitcnt needed).

(*) One example where you need lots of DPP quad permutes is "emulating" vec4 code (4 threads run one vec4 "thread"). 2 cycle extra latency might be problematic in this particular case. Intel's GPU supports vec4 execution mode. They are using it for vertex, geometry, hull and domain shaders. One advantage of vec4 emulation is that it reduces your branching granularity to 16 "threads". It is also easier to fill the GPU, as you need 4x less "threads". A fun idea that could be fast enough with DPP, but definitely not fast enough with ds_permute.
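As a sketch of the vec4-emulation idea (four consecutive lanes hold the x/y/z/w of one logical "thread"), a quad-local butterfly is all you need for something like a dot product. In CUDA shuffle terms (quad permutes would play this role on GCN; the function name is made up):

```
// Lanes 4k..4k+3 hold the four components of one logical vec4 "thread".
// Two XOR shuffles sum within each aligned group of 4 lanes, leaving every
// lane of the quad holding the full dot product.
__device__ float vec4_dot_emulated(float a_comp, float b_comp)
{
    const unsigned full = 0xffffffffu;
    float p = a_comp * b_comp;          // one component product per lane
    p += __shfl_xor_sync(full, p, 1);   // swap within pairs
    p += __shfl_xor_sync(full, p, 2);   // swap pairs within the quad
    return p;
}
```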
I guess the biggest question is how busy is that scalar in your experience? If that utilization is already high, making it more flexible isn't likely to help very much.
I have never seen it more than 40% busy. However if the scalar unit supported floating point code, the compiler could offload much more work to it automatically. Right now the scalar unit and scalar registers are most commonly used for constant buffer loads, branching and resource descriptors. I am not a compiler writer, so I can't estimate how big a refactoring really good automatic scalar offloading would require.
 
sebbi, since you already touched on the topic I might as well ask here: is it true that the delta color compression only works on framebuffer accesses*? So with the move to more and more compute, would the overall percentage gains become much smaller than in a classical renderer? Same ofc for Nvidia's implementation.

*and maybe with other limitations?
 