AMD Vega 10, Vega 11, Vega 12 and Vega 20 Rumors and Discussion

Discussion in 'Architecture and Products' started by ToTTenTranz, Sep 20, 2016.

  1. Kaotik

    Kaotik Drunk Member
    Legend

    Joined:
    Apr 16, 2003
    Messages:
    8,108
    Likes Received:
    1,802
    Location:
    Finland
    GCN 1.1 was called 2.0 before it was released, 1.2 was possibly called 2.0 before it was released (can't remember for sure), and 1.3 was definitely called 2.0 again.
    AMD themselves have now settled on 1st, 2nd, 3rd and 4th gen GCN, while Vega will be 5th - there's so far no indication that it would be a bigger departure than 7.x > 8.x was (I think 1.2 was the first 8.x? or was 1.1 already?)
     
    pharma and Razor1 like this.
  2. Anarchist4000

    Veteran Regular

    Joined:
    May 8, 2004
    Messages:
    1,439
    Likes Received:
    359
    I think it's because of the rumors of it being a larger transition than the typical GCN iteration. There would seemingly be ample evidence that this may be the case as well: FP16, the ID buffer, geometry scheduling on PS4 Pro, Zen APU requirements, etc. Then there are those patents/papers suggesting they were looking at scalars and variable SIMDs around the early design phase. Plus they'll need CR and OIT for DX12, along with any compatibility instructions for Scorpio/XB1 if porting is to be trivial.
     
  3. ToTTenTranz

    Legend Veteran Subscriber

    Joined:
    Jul 7, 2008
    Messages:
    9,727
    Likes Received:
    4,395
    I don't know if you have a typo there, but I'm pretty sure AMD called the architectures that implemented lossless color compression (Tonga, Fiji and Carrizo IGP) "3rd gen GPUs":

    [IMG]

    Which is what Anandtech calls GCN 1.2.
    I doubt they would call 1.3 "GCN 2", or vice-versa.
     
    no-X likes this.
  4. Kaotik

    Kaotik Drunk Member
    Legend

    Joined:
    Apr 16, 2003
    Messages:
    8,108
    Likes Received:
    1,802
    Location:
    Finland
    I meant that in the general press it was called 2.0 (and 1.2) before AMD settled on "3rd gen GCN".
     
    ToTTenTranz likes this.
  5. Kaotik

    Kaotik Drunk Member
    Legend

    Joined:
    Apr 16, 2003
    Messages:
    8,108
    Likes Received:
    1,802
    Location:
    Finland
    We have no evidence whatsoever that the many patents are related to Vega/5th gen GCN
     
  6. Anarchist4000

    Veteran Regular

    Joined:
    May 8, 2004
    Messages:
    1,439
    Likes Received:
    359
    True, but just the fact that they exist provides an idea of what technologies they were looking at.
     
  7. Erinyes

    Regular

    Joined:
    Mar 25, 2010
    Messages:
    647
    Likes Received:
    92
    I don't think Anandtech has ever referred to Polaris as GCN 1.3. I've only ever heard them say 4th generation GCN wrt Polaris (then again, since we don't have a full review of the RX 480, nor even a partial one of the RX 460, let's say they haven't referred to it yet).

    But yes, I agree that to avoid any confusion we should stick to what AMD calls it, i.e. GCN 4 for Polaris.
     
  8. ImSpartacus

    Regular Newcomer

    Joined:
    Jun 30, 2015
    Messages:
    252
    Likes Received:
    199
    Anandtech continues to use the "old" 1.1 and 1.2 naming scheme for consistency's sake, but they use "GCN 4" to describe Polaris.

    http://www.anandtech.com/show/9886/amd-reveals-polaris-gpu-architecture/2

    "Thankfully for Polaris, RTG is revising their naming policies in order to present a clearer technical message about the architecture. Beginning with Polaris, RTG will be using Polaris as something of an umbrella architecture name – what RTG calls a macro-architecture – meant to encompass several aspects of the GPU. The end result is that the Polaris architecture name isn’t all that far removed from what would traditionally be the development family codenames (e.g. Evergreen, Southern Islands, etc), but with any luck we should be seeing more consistent messaging from RTG and we can avoid needing to create unofficial version numbers to try to communicate the architecture. To that end the Polaris architecture will encompass a few things: the fourth generation Graphics Core Next core architecture."

    And later on, they state:

    "Officially RTG has not assigned a short-form name to this architecture at this time, but as reading the 8-syllable “fourth generation GCN” name will get silly rather quickly, for now I’m going to call it GCN 4."

    So that's why they mix and match GCN 4 with GCN 1.1/1.2 in tables like this one:

    http://www.anandtech.com/show/10446/the-amd-radeon-rx-480-preview
     
    Ryan Smith likes this.
  9. sebbbi

    Veteran

    Joined:
    Nov 14, 2007
    Messages:
    2,924
    Likes Received:
    5,288
    Location:
    Helsinki, Finland
    FP16 (int + float) was already introduced in GCN3. Adding 2x fp16 ops is not a huge change. GCN3 already introduced bigger changes to the ISA. GCN4 introduced a vastly improved geometry pipeline. Is this "geometry scheduling" the same thing or something else? So far I would guess that Vega is an iterative improvement, not a radically new design.
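    As a minimal sketch of what "2x fp16 ops" means in practice (CUDA's __half2 intrinsics are used here as a stand-in for GCN's packed-math instructions, and the kernel itself is hypothetical):

    Code:
    #include <cuda_fp16.h>

    // Two fp16 values packed into one 32-bit register, processed by a single
    // instruction (__hfma2 = packed fused multiply-add, i.e. 2 fp16 FMAs per op).
    __global__ void scale_bias_fp16x2(const __half2* in, __half2* out,
                                      __half2 scale, __half2 bias, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            out[i] = __hfma2(in[i], scale, bias);
    }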

    Vega will be the first GPU to show the actual benefits of AMD's new geometry pipeline. Polaris shows huge gains in synthetic benchmarks (pretty close to Nvidia), but not much in real games. It is bottlenecked by other factors. But we all saw how badly Fury X was bottlenecked by fixed function hardware in games. It had huge bandwidth and 64 CUs and awesome compute performance, but not enough geometry throughput. My guess is that Vega is what Fury X should have been without the bottlenecks, and of course with 8/12/16 GB memory, a new process and the full DX 12.1 feature set. AMD is not lacking much from the DX 12.1 spec (they already have the highest tier resource binding model), so they don't need radical changes to the architecture.
     
    Tim likes this.
  10. Anarchist4000

    Veteran Regular

    Joined:
    May 8, 2004
    Messages:
    1,439
    Likes Received:
    359
    Not huge, but an added feature nonetheless. The 2xfp16 is what I was referring to. That should at the very least add some new unique instructions.

    It's what was laid out in the DF PS4 Pro article. Scheduling patches across CUs during tessellation. I'd even go so far as to consider a move away from fixed function geometry units, but that would be a step further than what was proposed for PS4 with the cross-CU scheduling. Another possibility might be multiple fixed function units and async graphics. XB already had two command processors for that.

    I don't think it will be a radical change either, but still quite significant. The geometry bottlenecks should be largely removed. The big change, I still think, will be that scalar unit if they improve it. Maybe they simply make the existing scalar processor more robust, but with any added load from SM6 and the wave-level ops it just doesn't seem like it will be fast enough. Especially not for scalarization or messy waves. That's why I'm still leaning towards the scalar unit being replicated out 4x to run alongside the SIMDs while consuming vector or possibly scalar instructions. It still looks very similar to existing GCN, but peak IPC increases without requiring a whole lot more transistors or bandwidth. That 16+1+1 setup would have 3x the vector IPC with maybe 25% more transistors/area, provided small enough waves. That could be far more efficient if the workload was heavily diverging or scalar. For full waves the design maybe has 20% higher peak throughput, depending on scheduling hardware.
     
  11. gamervivek

    Regular Newcomer

    Joined:
    Sep 13, 2008
    Messages:
    714
    Likes Received:
    220
    Location:
    india
  12. sebbbi

    Veteran

    Joined:
    Nov 14, 2007
    Messages:
    2,924
    Likes Received:
    5,288
    Location:
    Helsinki, Finland
    IIRC GCN3 already fixed the tessellator load balancing issue. It resulted in a moderate perf boost in high tessellation factor benchmarks. GCN4 further improved tessellation performance. Now it is pretty close to Nvidia.
    Cross lane ops do not use the scalar unit on GCN. Old cross lane permutes (GCN2) used the LDS crossbar to swizzle lanes (variable latency). GCN3 introduced fast fixed latency DPP operations (data permuted inside the SIMD vector reg file). DPP instructions are a perfect fit to implement SM 6.0 wave ops (vote, broadcast, reduction, prefix sum) and quad ops. WaveOnce is the only construct that could benefit from a more powerful (and fully featured) scalar unit. My prediction is that WaveOnce will not be used as much as the other features, since it only helps AMD GCN, and only for integer math and loads/stores. All other wave ops speed up Nvidia and Intel GPUs as well.
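    For reference, the wave ops in question can be sketched like this (CUDA warp intrinsics stand in for GCN DPP / SM 6.0 wave ops here; 32-wide warps instead of 64-wide waves, and the function is purely illustrative):

    Code:
    __device__ float wave_ops_demo(float v, bool pred)
    {
        const unsigned full = 0xffffffffu;
        int lane = threadIdx.x & 31;

        unsigned ballot = __ballot_sync(full, pred); // vote: mask of lanes where pred is true
        float first = __shfl_sync(full, v, 0);       // broadcast from lane 0

        // Reduction: butterfly sum; lane 0 ends up with the warp-wide total.
        float sum = v;
        for (int offset = 16; offset > 0; offset >>= 1)
            sum += __shfl_down_sync(full, sum, offset);

        // Inclusive prefix sum: log2(32) = 5 shuffle steps on a 32-wide warp
        // (a 64-wide GCN wave needs more steps).
        float scan = v;
        for (int offset = 1; offset < 32; offset <<= 1) {
            float up = __shfl_up_sync(full, scan, offset);
            if (lane >= offset) scan += up;
        }

        return first + sum + scan + (float)__popc(ballot);
    }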

    I would love to see a powerful fully featured scalar unit (float math, typed/image loads/stores) and an automatically scalarizing compiler. With a good compiler, this would result in a noticeable perf & perf/watt improvement. AMD's architecture uses the scalar unit to be more flexible than competitors, but this also is a perf/watt tradeoff (vs fixed function constant buffer and resource descriptor hardware). So it would be important that AMD takes full advantage of this improved flexibility, to offload repeated work/data away from the SIMD vector path. This would compensate for the perf/watt downsides. But I don't think the scalar unit would need to be a big redesign from the current one. Make it fully featured and focus on a new compiler that automatically offloads as much as possible to the (improved) scalar unit.
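    To illustrate the kind of transformation an automatically scalarizing compiler would perform (CUDA is used only by analogy, since GCN's scalar unit has no direct CUDA equivalent; the material lookup is a made-up example and assumes material_id is uniform across the wave):

    Code:
    struct Material { float roughness; float metalness; };

    __device__ float shade_scalarized(const Material* materials,
                                      int material_id,  // assumed uniform across the warp
                                      float n_dot_l)
    {
        const unsigned full = 0xffffffffu;

        // Unscalarized code would load materials[material_id] in every lane.
        // Here only lane 0 loads it and the values are broadcast - roughly the
        // work a scalarizing compiler would turn into scalar loads on GCN.
        float rough = 0.0f, metal = 0.0f;
        if ((threadIdx.x & 31) == 0) {
            Material m = materials[material_id];
            rough = m.roughness;
            metal = m.metalness;
        }
        rough = __shfl_sync(full, rough, 0);
        metal = __shfl_sync(full, metal, 0);

        // Per-lane (vector) work stays in the vector path.
        return n_dot_l * (1.0f - rough) + metal * 0.04f;
    }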
     
    #332 sebbbi, Nov 12, 2016
    Last edited: Nov 12, 2016
    pharma and Razor1 like this.
  13. Anarchist4000

    Veteran Regular

    Joined:
    May 8, 2004
    Messages:
    1,439
    Likes Received:
    359
    Going off that, they've further improved it. Improved culling, possibly from that ID buffer, and an iteration on the load balancing would seem likely.

    DPP still had limited patterns while the scalar could allow any programmable permutations. So for vote, broadcast, etc. that are uniform the SIMD works well, but that breaks down if you start swizzling random lanes. It's a more robust feature set. That same hardware could actually exist within the scalar, as logic would exist within the cache/registers to perform operations like that. It would be required for the scalar to skip inactive lanes while pipelining. That's likely how the current implementation works, as it's boolean logic for the most part. Reading a bunch of bools into 16x32 bit registers is a bit wasteful otherwise. I'd agree it's not likely to see a lot of use, but given the design it's a rather trivial addition. Some HPC loads likely benefit from it as well. For example, a relatively small (64 element?) serial sorting algorithm compiled out of a C library that you wouldn't want to dump back to the CPU to complete. Some internal culling functionality may be able to leverage that capability as well.
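    The distinction can be sketched in CUDA terms (__shfl_sync with a runtime-computed lane index is the closest analogue to a fully programmable permute; the helper below is hypothetical):

    Code:
    // "Pull" semantics: each lane reads v from whatever source lane it computes
    // at runtime (fully programmable), whereas DPP only covers a small fixed set
    // of patterns (shifts, rotates, row broadcasts/mirrors).
    __device__ float pull_from_any_lane(float v, int src_lane)
    {
        return __shfl_sync(0xffffffffu, v, src_lane & 31);
    }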

    I'd definitely agree with most of that. Keep in mind they're targeting this for more than just graphics. So a fully featured scalar makes quite a bit of sense and is bordering on required.

    http://lpgpu.org/wp/wp-content/uploads/2014/09/lpgpu_scalarization_169_Jan_Lucas.pdf
    Definitely give this a read. A bit basic in concepts, but it has some benchmarks and is really close to what I was proposing. My approach would have the benefit of the temporary registers effectively handling the scalarization and alignment issues while keeping the traditional GCN structure. It would not, however, be very effective at dependent scalar math beyond its temporary registers. That's why I suggested keeping the existing scalar in addition to the new ones: Vector ALU, Scalar ALU (vector input/loops/DSP), Scalar ALU (scalar input). I doubt sticking with the existing scalar would be fast enough, as it accounts for maybe 2-3% of the throughput of a CU. Great for uniform workloads, but a serious bottleneck if you really start taking advantage of it.

    EDIT: A better description of the scalar I suggested would be a DSP, as the functionality is extremely similar.
     
    #333 Anarchist4000, Nov 12, 2016
    Last edited: Nov 12, 2016
  14. sebbbi

    Veteran

    Joined:
    Nov 14, 2007
    Messages:
    2,924
    Likes Received:
    5,288
    Location:
    Helsinki, Finland
    GCN3 also has ds_permute. Fully programmable (random index per lane) permute. Supports both push/pull semantics.

    http://gpuopen.com/amd-gcn-assembly-cross-lane-operations/
    Agreed. And modern graphics rendering is also ~50% compute shaders. Our game is almost 100% compute. Robust automatic scalarization would help both professional and gaming workloads. The results in this article show similar gains to many others. Close to a 10% perf gain is a common case, while at the same time reducing power consumption by 10%. That is around a 20% improvement in perf/watt (1.10 / 0.90 ≈ 1.22). It would help AMD a lot vs Nvidia. And this is existing CUDA code, not optimized at all for scalarization. Code specially designed to exploit the scalar unit can easily run 30%-50% faster (with practically zero power consumption increase).
     
    Alessio1989 and Razor1 like this.
  15. el etro

    Newcomer

    Joined:
    Mar 9, 2014
    Messages:
    95
    Likes Received:
    12
    They have all been "sidegrades" until now, the biggest one being Polaris (the one that can maybe be considered a real upgrade).
     
  16. sebbbi

    Veteran

    Joined:
    Nov 14, 2007
    Messages:
    2,924
    Likes Received:
    5,288
    Location:
    Helsinki, Finland
    GCN3 was a good iterative upgrade. Delta color compression offered a nice reduction in ROP bandwidth usage. It allowed mid-tier cards (Radeon 285) to beat the previous gen high end (the new card with a 256-bit bus was competitive with the old 384-bit bus). GCN3 also doubled the geometry throughput. Highly important vs Nvidia. GCN3 was also a big improvement for cross lane ops. Unfortunately we need to wait for DX12 SM 6.0 to see the performance impact of this change.

    Without these improvements AMD would be even further behind Nvidia at the moment. Unfortunately for AMD, Nvidia's Fermi->Kepler->Maxwell upgrades were huge. Nvidia made radical architectural changes to both their compute units and their fixed function units both times. Pascal (consumer) was mostly a shrink, but a very well done shrink of an already excellent architecture.
     
  17. el etro

    Newcomer

    Joined:
    Mar 9, 2014
    Messages:
    95
    Likes Received:
    12
    That's my point. Even the GCN3 to Polaris step was smaller than the Kepler to Maxwell one.
     
  18. Anarchist4000

    Veteran Regular

    Joined:
    May 8, 2004
    Messages:
    1,439
    Likes Received:
    359
    Fully programmable, but a different performance tier, as indicated by the GPUOpen documentation (8.6 TB/s vs 51.6 TB/s for Fiji) for DPP and permute. Not exactly slow, but much more optimal paths could exist. That bandwidth difference will likely be tied to power. The theorized scalar would have its own separate pool nearby, along with the ability to run cross-lane instructions at less than optimal wave sizes. Depending on the algorithm that may or may not be more efficient, and it would at the very least leave LDS free for other commands.

    CUDA code, and designed with a much narrower SIMT width in mind than the SIMD GCN uses. The TrueAudio DSPs were about as close as AMD got to a high performance scalar. That might also explain why they did away with them in recent designs. For a Zen APU coprocessor, DSP functionality could be a huge gain in addition to the scalar for graphics. There is an entire market built around that style of processing; getting some crossover with graphics could be significant, along with audio processing.

    I guess the biggest question is how busy is that scalar in your experience? If that utilization is already high, making it more flexible isn't likely to help very much.

    Alternating designs isn't what I would call aggregate progress. Yes, there have been some good advancements in compression and the tiled rasterization (if they use it all the time), but for the most part they've just been adding processors. The largest gains are attributable to doubling down on certain aspects of the design. P100 is a different distribution, but I wouldn't say the architecture changed much. The ratios just went back to a more even distribution of compute performance, as opposed to being focused on FP32, and they built out the caches a bit.
     
  19. sebbbi

    Veteran

    Joined:
    Nov 14, 2007
    Messages:
    2,924
    Likes Received:
    5,288
    Location:
    Helsinki, Finland
    Kepler -> Maxwell was a HUGE change for Nvidia.

    1. Brand new SM design. Split to quadrants. Less shared stuff (less communication/distance overhead). Doubled max thread group count per SM. Split LDS / L1 cache and increased LDS amount. Combined L1 and texture caches, etc. Source: https://devblogs.nvidia.com/paralle...ould-know-about-new-maxwell-gpu-architecture/

    2. New register files. Simpler banking and register operand reuse cache. One of the potential reasons Maxwell/Pascal reaches very high clocks. Source: https://github.com/NervanaSystems/maxas/wiki/SGEMM.

    3. Tiled rasterizer. Records big chunks of vertex shader output to temporary on-chip buffer (L2 cache), splits to big tiles and executes per tile. Big savings in memory bandwidth. Source: http://www.realworldtech.com/tile-based-rasterization-nvidia-gpus/

    4. Kepler was feature level 11_0 hardware. Behind even AMD GCN 1. Maxwell added all feature level 11_1, 12_0 and 12_1 features: 3d tiled resources, typed UAV load, conservative raster, rasterizer order views.

    5. Maxwell added hardware thread group shared memory atomics. I have personally measured up to 3x perf boost in some of my shaders (Kepler -> Maxwell). When combined with other improvements, Maxwell is finally faster than GCN in compute shaders.
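    A minimal CUDA sketch of the primitive point 5 refers to (a hypothetical per-block histogram; shared memory atomics of this kind were emulated with lock/retry sequences on Kepler and got native hardware support on Maxwell):

    Code:
    __global__ void block_histogram(const unsigned char* data, int n,
                                    unsigned int* global_hist)
    {
        __shared__ unsigned int hist[256];

        // Clear the per-block shared-memory histogram cooperatively.
        for (int i = threadIdx.x; i < 256; i += blockDim.x)
            hist[i] = 0;
        __syncthreads();

        // Thread group shared memory atomics: the operation Maxwell accelerated.
        for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n;
             i += gridDim.x * blockDim.x)
            atomicAdd(&hist[data[i]], 1u);
        __syncthreads();

        // Fold the block-local result into the global histogram.
        for (int i = threadIdx.x; i < 256; i += blockDim.x)
            atomicAdd(&global_hist[i], hist[i]);
    }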
    8.6 TB/s is the LDS memory bandwidth. This is what you get when you communicate over LDS instead of using cross lane ops such as DPP or ds_permute. Cross lane ops do not touch LDS memory at all. LDS data sharing between threads needs both an LDS write and an LDS read. Two instructions and two waits. The GPUOpen documentation doesn't describe the "bandwidth" of the LDS crossbar that is used by ds_permute. It describes the memory bandwidth, and ds_permute doesn't touch the memory at all (and doesn't suffer from bank conflicts).
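    To spell out the two paths being compared (CUDA shared memory standing in for LDS; the helper functions are hypothetical and assume a block of at most 256 threads):

    Code:
    // Path 1: exchange through scratchpad memory - a write, a wait, then a read.
    __device__ float neighbor_via_shared(float v)
    {
        __shared__ float buf[256];   // assumes blockDim.x <= 256
        int tid = threadIdx.x;
        buf[tid] = v;                // LDS/shared write
        __syncthreads();             // wait until the write is visible
        return buf[tid ^ 1];         // LDS/shared read of the neighbour's value
    }

    // Path 2: cross-lane op - register to register, no shared memory traffic
    // (the analogue of DPP / ds_permute on GCN).
    __device__ float neighbor_via_shuffle(float v)
    {
        return __shfl_xor_sync(0xffffffffu, v, 1);
    }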

    DPP is definitely faster in cases where you need multiple back-to-back operations, such as a prefix sum (7 DPP in a row). But I can't come up with any good examples (*) where you need a huge amount of random ds_permutes (DPP handles common cases such as prefix sums and reductions just fine). Usually you have enough math to completely hide the LDS latency (even for real LDS instructions with bank conflicts). ds_permute shouldn't be a problem. The current scalar unit is also good enough for fast votes. A compare generates a 64-bit scalar register mask. You can read it back a few cycles later (constant latency, no waitcnt needed).

    (*) One example where you need lots of DPP quad permutes is "emulating" vec4 code (4 threads run one vec4 "thread"). The 2-cycle extra latency might be problematic in this particular case. Intel's GPU supports a vec4 execution mode. They are using it for vertex, geometry, hull and domain shaders. One advantage of vec4 emulation is that it reduces your branching granularity to 16 "threads". It is also easier to fill the GPU, as you need 4x fewer "threads". A fun idea that could be fast enough with DPP, but definitely not fast enough with ds_permute.
    I have never seen it more than 40% busy. However, if the scalar unit supported floating point code, the compiler could offload much more work to it automatically. Right now the scalar unit and scalar registers are most commonly used for constant buffer loads, branching and resource descriptors. I am not a compiler writer, so I can't estimate how big a refactoring really good automatic scalar offloading would require.
     
    #339 sebbbi, Nov 14, 2016
    Last edited: Nov 14, 2016
    Alexko, Lightman, Razor1 and 4 others like this.
  20. CarstenS

    Veteran Subscriber

    Joined:
    May 31, 2002
    Messages:
    4,789
    Likes Received:
    2,049
    Location:
    Germany
    sebbbi, since you already touched on the topic I might as well ask here: is it true that the delta-c compression only works on framebuffer accesses*? So with the move to more and more compute, the overall percentage gains would become much smaller than in a classical renderer? Same ofc for Nvidia's implementation.

    *and maybe with other limitations?
     