AMD Vega 10, Vega 11, Vega 12 and Vega 20 Rumors and Discussion

Like I said, it's harder to distinguish what's "new architecture" and what isn't, as so much DNA of the "old architecture" always gets carried over. It still doesn't change the fact that in some contexts RDNA is still referred to as GCN.
Like no-X said, it is worse for compute in terms of transistor budget. CDNA looks to be GCN 1.4.x (since it's gfx9xx, 908 IIRC); the biggest changes relative to 1.4.1 (Vega 20) should be the removal of some graphics-related blocks.
The transistor-per-TFLOP disparity doesn't seem too crippling, particularly since choices made for consumer Navi's target market could have been adjusted to favor compute if it were a similarly dedicated design. That aside, there could be other issues, like the risk of involving Navi in products aimed at datacenter and HPC clients.
Navi at this time has some bugs that would make it less compelling for compute in particular (memory addressing mode bugs, LDS bugs, etc.), which, along with AMD's limited software support, may be part of why Navi's compute performance and implementation have been poor or subject to some significant errors.

Driver changes for Arcturus may point to some significant changes, as in some form of vector unit that is architecturally distinct from the existing SIMDs, potentially targeting large vectors/matrices with multiple levels of precision/accumulation.
The driver changes also mention possible placeholders or existing graphics elements even though the graphics command processor was specifically missing. It could be that some amount of geometry and pixel capability remains, or it was less disruptive to the architecture or driver base to leave them as-is than to totally remove them.
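Purely as an illustration of what "multiple levels of precision/accumulation" could mean in practice (this is a generic numpy sketch, not a description of AMD's actual matrix instructions), the usual pattern is low-precision inputs feeding a higher-precision accumulator:

```python
import numpy as np

# Generic mixed-precision matrix multiply-accumulate, the kind of operation a
# dedicated matrix unit would target: FP16 inputs, FP32 accumulation.
# This is only a conceptual sketch, not AMD's MFMA semantics.

M = N = K = 16
a = np.random.rand(M, K).astype(np.float16)
b = np.random.rand(K, N).astype(np.float16)
c = np.zeros((M, N), dtype=np.float32)

# Accumulate in FP32 even though the inputs are FP16; the wider accumulator is
# what keeps long dot products from drifting.
c += a.astype(np.float32) @ b.astype(np.float32)

ref = a.astype(np.float64) @ b.astype(np.float64)
print("max error vs FP64 reference:", np.abs(c - ref).max())
```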
 
There is more to GPGPU performance than simply the number of TFLOPS on the chip. Otherwise TeraScale would have been the greatest compute architecture known rather than generally incapable of it. Or take a look at Nvidia's diverged designs: their compute-focused ones have more SRAM and fewer TFLOPS per area than the gaming ones. Things like larger and higher-bandwidth caches, or needing fewer waves in flight to stay occupied, bring a larger benefit the more sophisticated a shader is and the less coherent its memory access. And rasterisation has very simple shaders and highly coherent access in the grand scheme of massively parallel algorithms.
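A minimal roofline-style estimate makes the same point numerically; all the figures below are illustrative assumptions rather than the specs of any real chip:

```python
# Toy roofline estimate: attainable GFLOP/s is capped either by raw compute or
# by memory bandwidth times the kernel's arithmetic intensity.
# All figures are illustrative assumptions, not vendor specs.

def attainable_gflops(peak_gflops, bandwidth_gbs, flops_per_byte):
    """Classic roofline: min(compute roof, bandwidth * arithmetic intensity)."""
    return min(peak_gflops, bandwidth_gbs * flops_per_byte)

# Hypothetical compute-focused part: fewer peak FLOPS, more bandwidth/cache.
# Hypothetical gaming-focused part: more peak FLOPS, less bandwidth.
compute_part = dict(peak_gflops=13000, bandwidth_gbs=1000)
gaming_part  = dict(peak_gflops=15000, bandwidth_gbs=450)

for intensity in (1, 4, 16, 64):  # FLOPs per byte of memory traffic
    a = attainable_gflops(flops_per_byte=intensity, **compute_part)
    b = attainable_gflops(flops_per_byte=intensity, **gaming_part)
    print(f"{intensity:>3} FLOP/B: compute-part {a:>7.0f}  gaming-part {b:>7.0f} GFLOP/s")
```

With simple, coherent workloads (high effective intensity thanks to caches and coalescing) the higher-TFLOPS part wins; once access patterns get incoherent, the part with more bandwidth and cache comes out ahead despite fewer TFLOPS.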
Yes, but the discussion wasn't about some kind of abstract comparison, it was about two particular architectures, two particular GPUs. Real-world compute performance per transistor of Navi 10 is often worse than real-world compute performance per transistor of Vega 20, too. Real-world results are even worse for Navi than the theoretical comparison I was talking about. So… I'm not sure what the point of your reply is. Both Navi's theoretical compute performance per transistor and its real-world compute performance per transistor are worse than Vega's. That's the reason why AMD decided to split development of the gaming and compute architectures. If they planned to use Navi/RDNA for compute in the future (and the Vega-derived Arcturus were just a short-term solution), they wouldn't make an entire roadmap for a separate architecture. Because in that case there would not be a separate architecture; it would still be RDNA/Navi, just like the gaming one. But it isn't.
 
Yes, but the discussion wasn't about some kind of abstract comparison, it was about two particular architectures, two particular GPUs. Real-world compute performance per transistor of Navi 10 is often worse than real-world compute performance per transistor of Vega 20, too. Real-world results are even worse for Navi than the theoretical comparison I was talking about. So… I'm not sure what the point of your reply is. Both Navi's theoretical compute performance per transistor and its real-world compute performance per transistor are worse than Vega's. That's the reason why AMD decided to split development of the gaming and compute architectures. If they planned to use Navi/RDNA for compute in the future (and the Vega-derived Arcturus were just a short-term solution), they wouldn't make an entire roadmap for a separate architecture. Because in that case there would not be a separate architecture; it would still be RDNA/Navi, just like the gaming one. But it isn't.
Given many OpenCL applications also fail to start or give incorrect results, I suspect the performance has more to do with the state of the drivers than the hardware.
 
CDNA is essentially a rebranded GCN for now. I fully expect it to switch to the same RDNA base architecture down the road.

Why would you even say that...?

Dr. Su herself told us about AMD's divergence on this point and why they sequestered the RDNA team into silence while developing it: because she is a gamer herself and wanted to develop a gaming-only architecture.

Now AMD has spent those resources and molded an entire gaming industry behind it (RDNA2) even before any of us gets to see it. We all got a taste of RDNA1, but that was a hybrid design, not the full uarch. So now Microsoft, Sony, Samsung & Google have all bought into what RDNA2 can do...

You think AMD will fully switch away from what they have been working towards..? Two different dGPUs in two different fields, geared/engineered for efficiency with no cross-market inefficiencies...

[Image: AMD-CDNA-vs-RDNA-arc.jpg, AMD's CDNA vs. RDNA architecture slide]


That grey area is one architecture for everything = Nvidia

RDNA & CDNA are not general purpose, they are specific to their fields....
 
Why would you even say that...?
The conclusion makes sense to me too. We know that CDNA in the form of Arcturus is basically GCN (Vega, though I guess the G in the name would be a bit inappropriate here...) with some graphics bits stripped off. So CDNA2 could really be anything, and IMHO it makes a whole lot of sense if it really were the same as some RDNA version (unless they actually stick to GCN even). Despite the flashy diagram, I don't expect AMD to really develop two completely separate architectures. Separate chips, yes (although I have to say I am still somewhat sceptical about the viability of even that approach, but apparently AMD is willing to go there), but there's no real evidence it's really going to be a separate architecture other than in marketing name.
 
Why would you even say that...?
Because (in case you're interested in a serious answer): Vega 20 has 43.2 (edit: corrected from 43.4) GFLOPS of compute per mm², while Navi 10 has 40.4, both in their fastest incarnations. And that is with Vega 20's insanely wide memory controllers and half-rate DP, neither of which is free in terms of die space.

And with Vega, compute applications work, whereas Navi still has issues. You don't want your next supercomputer installation with 100k cards to choke on the first day.
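For reference, a density figure like that is just peak FP32 rate divided by die area; a quick sketch of the arithmetic (the clocks and die sizes below are the commonly quoted figures and should be treated as assumptions):

```python
# Rough FP32 density: GFLOPS/mm^2 = (CUs * 64 lanes * 2 FLOP/clk * clock MHz / 1000) / die area.
# Clock and die-area figures are the commonly quoted ones, taken as assumptions here.

def gflops_per_mm2(cus, clock_mhz, die_mm2):
    peak_gflops = cus * 64 * 2 * clock_mhz / 1000.0
    return peak_gflops / die_mm2

# Navi 10 (RX 5700 XT 50th Anniversary): 40 CU, ~1980 MHz boost, ~251 mm^2
print(round(gflops_per_mm2(40, 1980, 251), 1))   # ~40.4

# Vega 20: 64 CU, ~331 mm^2; a ~1.75 GHz clock lands near the quoted ~43 GFLOPS/mm^2
print(round(gflops_per_mm2(64, 1750, 331), 1))   # ~43.3
```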
 
The conclusion makes sense to me too. We know that CDNA in the form of Arcturus is basically GCN (Vega, though I guess the G in the name would be a bit inappropriate here...) with some graphics bits stripped off. So CDNA2 could really be anything, and IMHO it makes a whole lot of sense if it really were the same as some RDNA version (unless they actually stick to GCN even). Despite the flashy diagram, I don't expect AMD to really develop two completely separate architectures. Separate chips, yes (although I have to say I am still somewhat sceptical about the viability of even that approach, but apparently AMD is willing to go there), but there's no real evidence it's really going to be a separate architecture other than in marketing name.

So AMD's slide is wrong and they will not be utilizing two different graphics architectures...?

And you are trying to say that with a straight face, even though AMD's CEO was on stage telling us otherwise just a week ago, and then went into the reasoning behind why they are doing it: because it allows AMD to leverage each architecture to fully benefit customers who sit at two different ends of the spectrum, which one uarch can't make happy. The reasoning is pretty simple, so perhaps you didn't understand it, don't care, or are just dismissing it..?

RDNA = gaming-optimized uarch
CDNA = compute-optimized uarch


Really not that hard to understand.
 
I dunno if I'd call RDNA a "GCN fork" really. This implies that the other branch will still be alive for a long time, and I don't see why this would be the case, especially with the alleged perf/watt improvements of RDNA2. The latter will likely destroy GCN in compute workloads just like RDNA1 destroys it in gaming. Perf/watt is very important in the HPC space.
 
I dunno if I'd call RDNA a "GCN fork" really. This implies that the other branch will still be alive for a long time, and I don't see why this would be the case, especially with the alleged perf/watt improvements of RDNA2. The latter will likely destroy GCN in compute workloads just like RDNA1 destroys it in gaming. Perf/watt is very important in the HPC space.
Who's to say many of the same improvements can't be applied to the GCN base too? I mean, we have no clue what's being improved and how.
(edit: also, it helps that in certain contexts RDNA is referred to as GCN 1.5 and RDNA+DLOps as 1.5.1, while for example Vega is 1.4 and Vega 20 is 1.4.1)
 
Who's to say many of the same improvements can't be applied to the GCN base too? I mean, we have no clue what's being improved and how.
(edit: also, it helps that in certain contexts RDNA is referred to as GCN 1.5 and RDNA+DLOps as 1.5.1, while for example Vega is 1.4 and Vega 20 is 1.4.1)
That's hardly relevant to the underlying h/w though. CUDA has a straight "Compute Capability" metric, for example, running from 1 up to whatever it is now, which goes as far back as G80/Tesla. This doesn't mean that Turing is a fork of G80.
 
The conclusion makes sense to me too. We know that CDNA in the form of Arcturus is basically GCN (Vega, though I guess the G in the name would be a bit inappropriate here...) with some graphics bits stripped off. So CDNA2 could really be anything, and IMHO it makes a whole lot of sense if it really were the same as some RDNA version (unless they actually stick to GCN even). Despite the flashy diagram, I don't expect AMD to really develop two completely separate architectures. Separate chips, yes (although I have to say I am still somewhat sceptical about the viability of even that approach, but apparently AMD is willing to go there), but there's no real evidence it's really going to be a separate architecture other than in marketing name.

Some elements that seem likely to benefit CDNA that showed up with Navi are the doubled L0(RDNA)/L1(GCN) bandwidth and an apparently more generous allocation for the scalar register file. The RDNA L1 cache is read-only and it may be that compute loads with a lot of write traffic might be outside its optimum, but on the other hand I'm not sure what Arcturus is doing with subdividing the GPU's broader resources. The larger number of CUs and the lack of a 3d graphics ring might point to it acting more like a set of semi-independent shader engines managed by a subset of the ACEs, and that sort of subdivision might still align with what Navi did with the hierarchy.
The longer cache lines could be a wrinkle in memory coherence, since RDNA's cache granularity is now out of step with the CPU hierarchy. It can be handled with a little bit of extra tracking, however.
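As a rough sketch of what that "extra tracking" could look like (sizes and protocol here are illustrative, not a description of AMD's actual coherence scheme), a 128-byte GPU line can simply carry per-64-byte sector state so it can answer CPU-granularity requests:

```python
# Illustrative only: a 128-byte GPU cache line tracked as two 64-byte sectors so
# that CPU-granularity coherence requests don't force whole-line invalidations.

GPU_LINE = 128
CPU_LINE = 64
SECTORS = GPU_LINE // CPU_LINE   # two valid/dirty bits per GPU line

class SectoredLine:
    def __init__(self, addr):
        self.base = addr - (addr % GPU_LINE)
        self.valid = [False] * SECTORS
        self.dirty = [False] * SECTORS

    def _sector(self, addr):
        return (addr - self.base) // CPU_LINE

    def write(self, addr):
        s = self._sector(addr)
        self.valid[s] = self.dirty[s] = True

    def cpu_invalidate(self, addr):
        # A 64-byte CPU-side invalidation only drops one sector,
        # not the whole 128-byte line.
        s = self._sector(addr)
        self.valid[s] = self.dirty[s] = False

line = SectoredLine(0x1000)
line.write(0x1000)            # touches sector 0
line.write(0x1040)            # touches sector 1
line.cpu_invalidate(0x1040)
print(line.valid)             # [True, False]
```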
How the WGP arrangement may help or hinder (outside of bugs) may need further vetting. It seems like WGP mode can help heavier shader types, but per the RDNA whitepaper there are some tradeoffs like shared request queues that might be less helpful in compute.

Some elements like the current formulation Wave64 may not be full replacements for native 64-wide wavefronts, as there are some restrictions in instances where the execution mask is all-zero, which is a failure case for RDNA. RDNA loses some of the skip modes that GCN has, drops some of the cross-lane options, and drops some branching instructions. On the other hand, it does have some optimizations for skipping some instructions automatically if they are predicated off.
CDNA's emphasis on compute, and Arcturus potentially having much more evolved matrix instructions and hardware, could make the case for a different kind of Wave64, or a switch in emphasis where it's preferred to keep the prior architectural width.
Usually HPC is less concerned with backwards compatibility, but perhaps AMD's tools or existing code may still tend towards the old style?
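For readers less familiar with execution masks, here is a conceptual sketch (Python pseudocode, not actual RDNA/GCN behaviour) of the per-lane masking and the all-zero case being discussed:

```python
# Conceptual sketch of per-lane execution masking and the "exec mask all zero"
# case; not actual RDNA/GCN hardware behaviour.

WAVE = 32  # or 64

def masked_vadd(dst, a, b, exec_mask):
    """Apply a vector add only to the lanes whose exec bit is set."""
    if not any(exec_mask):
        # All-zero exec mask: either the hardware skips the instruction
        # automatically (predicated-off skip) or the compiler has to branch
        # around it; which side handles it is the awkward part.
        return dst
    return [x + y if m else d for d, x, y, m in zip(dst, a, b, exec_mask)]

a, b, dst = list(range(WAVE)), [10] * WAVE, [0] * WAVE

diverged = [i % 2 == 0 for i in range(WAVE)]    # half the lanes active
print(masked_vadd(dst, a, b, diverged)[:4])     # [10, 0, 12, 0]

none_active = [False] * WAVE                    # the all-zero case
print(masked_vadd(dst, a, b, none_active)[:4])  # unchanged: [0, 0, 0, 0]
```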
Physically, the clock speed emphasis may be partially blunted. Arcturus-related code commits seem to be giving up some of the opportunistic up-clocking graphics products use, with the argument being that the broader compute hardware would wind up throttling anyway. If the upper clock range is less likely to be used, perhaps the implementation choices would emphasize leakage and density versus expending transistors and pipeline stages on the "multi-GHz" range AMD seems to be claiming for RDNA2.

On top of all that, there seem to be errata for Navi that may be particularly noticeable for compute and might have delayed any RDNA-like introduction into the development pipeline for HPC.

I'm curious what AMD's managed to cull from Arcturus. For example, the driver changes make note of not having a 3d engine, but there are still references to setting up values for the geometry front ends and primitive FIFOs. Also unclear is what that means for the command processor, since besides graphics it is usually the device that the system uses to set up and manage the overall GPU. Losing it doesn't seem to gain much other than a little rectangle in the middle of the chip. If it is gone or somehow re-engineered, perhaps it has more to do with some limitation in interfacing with a much larger number of CUs rather than the area cost of a microcontroller.

Because (in case you're interested in a serious answer): Vega 20 has 43.2 (edit: corrected from 43.4) GFLOPS of compute per mm², while Navi 10 has 40.4, both in their fastest incarnations. And that is with Vega 20's insanely wide memory controllers and half-rate DP, neither of which is free in terms of die space.

And with Vega, compute applications work, whereas Navi still has issues. You don't want your next supercomputer installation with 100k cards to choke on the first day.

One item of note with regard to Vega 20's wide memory controllers is that while they are wide, Navi 10's GDDR6 controllers are physically large. From rough pixel counting of the Fritzchens Fritz die shots for the two, Navi's memory sections have an area in the same range as Vega 20's, which would have a corresponding impact on the FLOPS/mm²; at least my initial attempts at measuring seem to indicate Navi 10's is noticeably larger. Vega 20 is a larger chip, which usually means the overhead of miscellaneous blocks and IO tends to be lower versus what smaller dies must contend with.
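For anyone wanting to reproduce that kind of estimate, the method is just the block's share of die-shot pixels times the known die area; the pixel counts below are placeholders to show the method, not my actual measurements:

```python
# Die-shot "pixel counting": block area = die area * (block pixels / die pixels).
# The pixel counts here are made-up placeholders, purely to show the method.

def block_area_mm2(block_pixels, die_pixels, die_area_mm2):
    return die_area_mm2 * block_pixels / die_pixels

navi10_phy = block_area_mm2(block_pixels=120_000, die_pixels=2_000_000, die_area_mm2=251)
vega20_phy = block_area_mm2(block_pixels=95_000,  die_pixels=2_000_000, die_area_mm2=331)
print(f"Navi 10 memory PHY estimate: {navi10_phy:.1f} mm^2")
print(f"Vega 20 memory PHY estimate: {vega20_phy:.1f} mm^2")
```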

If the references to a new architectural register type and matrix hardware are what they seem to me, Arcturus is going to have a large rise in FLOPS/mm², with the impact dependent on the precision choices and granularity chosen. That wouldn't be an apples-to-apples comparison, though.

One thing I did find recently is some discussion of the issues GPU code generation has to deal with in the Mesa code, in which there are some additional details about some of the bug flags for RDNA.
https://gitlab.freedesktop.org/mesa...18f4a3c8abc86814143bf/src/amd/compiler/README
It's not just hardware bugs (it includes some unflattering documentation issues) and not just RDNA, but RDNA has a list of hardware problems. I think some of those would be more objectionable for compute, and maybe one reason why the consoles seem to have gone for a more fully-baked RDNA2.

(edit: fixed some grammar)
 
So AMD's slide is wrong and they will not be utilizing two different graphics architectures...?
Not saying the slide is wrong; it just does not quite tell the full truth, perhaps.
From what I can tell from the open source drivers, CDNA really is GCN (Vega). Yes, there might be some tweaks underneath here and there, but it seems like a stretch to claim this is really a different architecture.
Of course in the future it could diverge more between graphics and compute products; I'm just not convinced this makes sense, hence IMHO it is more likely CDNA will remain a very close relative of an existing graphics architecture. This does not really contradict anything that was said or presented in the slides.
 
Some elements like the current formulation Wave64 may not be full replacements for native 64-wide wavefronts, as there are some restrictions in instances where the execution mask is all-zero, which is a failure case for RDNA. RDNA loses some of the skip modes that GCN has, drops some of the cross-lane options, and drops some branching instructions. On the other hand, it does have some optimizations for skipping some instructions automatically if they are predicated off.

Without hoping to derail too much: for a layman, how is adapting to architectural quirks like this done in practice? Is it dealt with by updating the compiler to handle it and asking developers to keep certain code and asset restrictions in mind, or is it mainly up to engine coders to tailor for specific behaviour?
 
Without hoping to derail too much: for a layman, how is adapting to architectural quirks like this done in practice? Is it dealt with by updating the compiler to handle it and asking developers to keep certain code and asset restrictions in mind, or is it mainly up to engine coders to tailor for specific behaviour?
Some of the public information on the bugs or issues come from compiler commits, such as for LLVM or Mesa, so some amount of compiler adaptation is occurring.
The effectiveness of compilers or how often some of the dropped features found use is something I don't know.
VSKIP was discussed by AMD in the past as getting compiler support, but that assembly was also given as a viable option.

Wave64 versus Wave32 has been discussed primarily as a compiler choice, based on some evolving set of heuristics. I presume the heuristics would pay attention to bug flags and the properties of the code being evaluated as to whether it should go for one mode or the other, perhaps being conservative if the analysis is incomplete. Or, if such changes are not handled well, that may be corroborated by Navi's woes with compute workloads.
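As a layman-level illustration of what such a heuristic might look like (the rules and the bug-flag name below are invented for illustration, not the actual LLVM/Mesa logic):

```python
# Toy wave-size heuristic: pick Wave32 or Wave64 per shader from a few coarse
# properties plus workaround flags. Rules and flag names are invented here.

def pick_wave_size(heavy_divergence, needs_64_wide_cross_lane, hw_bug_flags, is_compute):
    if needs_64_wide_cross_lane:
        return 64                      # code written around 64-wide cross-lane ops
    if "wave64_exec_quirk" in hw_bug_flags:
        return 32                      # conservative fallback when a workaround flag is set
    if heavy_divergence:
        return 32                      # narrower waves waste fewer lanes on divergence
    return 64 if is_compute else 32    # coarse default per pipeline type

print(pick_wave_size(False, True,  set(), True))                   # 64
print(pick_wave_size(True,  False, set(), True))                   # 32
print(pick_wave_size(False, False, {"wave64_exec_quirk"}, False))  # 32
```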

Changes like this are part of why assembly can be harder to justify except in fields where performance is paramount and there's already an assumption of significant code optimization. What first comes to mind is HPC, although even then much of it isn't going to go to that extent and AMD is still working to make up the software deficit.
 
There was a patch on Arcturus talking about something new called "AccVGPRs". Previously there have been mentions of AGPRs, but to my knowledge it has never been clarified what the "A" stood for. Is it safe to assume that it stands for "Accelerator"?

New update
https://lists.freedesktop.org/archives/amd-gfx/2020-March/047222.html
Some previous mentions of AGPRs
https://github.com/llvm-mirror/llvm/commit/6644a1885fccc43708cf4486b7f31a9168826ca4
https://github.com/llvm-mirror/llvm/commit/cb57db03360f8247a475e77dc895f7adb573c0b1
 
AccVGPRs were mentioned before, for example in the first of your additional links. In fact, they were mentioned as far back as July 2019.
 