AMD Vega 10, Vega 11, Vega 12 and Vega 20 Rumors and Discussion

AMD's Background On The ROCm OpenCL Stack
5 July 2017
A ROCm (Radeon Open Compute) developer at AMD has shared some of their background work on their OpenCL compiler stack, including the LLVM focus, as well as some of their current performance focuses for this open-source compute offering.
...
Note one significant changes the compiler now generate GCN ISA binary object directly. With this change, it makes it easier for the compiler supports Inline ASM support for all of our languages ( OpenCL, HCC, HIP) and also native assembler and disassembler support. It is also a critical foundation for our math library and MiOpen projects.
...
For the last year, we have spent more time focusing on FIJI and Vega10 with Deep Learning Frameworks, MIOpen, and GEMM solvers. We also have been filling in the gaps in LLVM for the optimization we need for GPU Computing, also improving the scheduler, register allocator, loop optimizer and lot more. It is a bit of work as you can imagine. But we already saw where the effort been worth it since it faster on a number of the codes.
...
On Ray Tracer we are just starting our performance analysis and optimization that more specific to this class of work, What you see over the summer is we will be focusing on optimization for the compiler for currency mining and raytracing. I just have to stage this work in with the team. I saw you referenced Phoronix article, for ROCm 1.5 the new compiler was faster than LLVM/HSAIL/SC on FIJI for Blender, but for Luxmark we were slower.
...
One thing I will leave you with is we build standardized loader and linker and object format, with this it allows us to do some you never could do with AMGGPUpro driver, upgrades the compiler before we release a new driver. So we can now address issue independently of the base driver for OpenCL, HCC, and HIP and the base LLVM compiler foundation.
http://www.phoronix.com/scan.php?page=news_item&px=ROCm-OpenCL-Background
 
What about the relatively small increase from the 550 MHz difference (~34%) between Vega FE at 1050 MHz and at 1600 MHz?
Well it's not 550MHz because the card actually just stays at around 1400MHz most of the time, so it's more of a 350MHz / 33% difference.
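A quick sketch of the clock arithmetic in this exchange, using only the three clocks quoted above. The helper name `clock_gain` is made up for illustration. Measured against the 1050 MHz baseline, the rated clock would be roughly a 52% increase, so the "~34%" in the quoted post looks like 550 MHz taken relative to 1600 MHz instead; that reading is an assumption.

```python
def clock_gain(base_mhz, boosted_mhz):
    """Return (MHz difference, fractional gain relative to the base clock)."""
    delta = boosted_mhz - base_mhz
    return delta, delta / base_mhz


BASE = 1050       # lower clock in the comparison (from the thread)
RATED = 1600      # upper clock quoted in the thread
SUSTAINED = 1400  # clock the card reportedly holds most of the time

for label, clock in (("rated", RATED), ("sustained", SUSTAINED)):
    delta, gain = clock_gain(BASE, clock)
    print(f"{label}: +{delta} MHz, {gain:.0%} over {BASE} MHz")

# Same 550 MHz gap expressed relative to the higher clock instead of the base.
print(f"550 MHz relative to {RATED} MHz: {550 / RATED:.0%}")
```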
 
Maybe RX Vega will be 8GB @ higher memclock?

Even if they overclock those 4-Hi stacks to death, if RX Vega keeps showing a performance-per-CU-per-clock that is lower than Polaris then this chip is a total clusterfuck for gaming.
Final RX Vega gaming drivers need to show at least a 20% boost in games compared to whatever the FE is running right now.
Either that or AMD will have to be selling RX Vega for $350, because at these performance and power levels the miners won't touch the thing.
 
Here we see 15% to 670% boosts at the same clocks, but in gaming this translates to nothing?
I don't buy it.
Fury doesn't have Pro drivers. They should have used a Radeon Pro Duo in single-GPU mode or a FirePro/Radeon Pro.

Even if they overclock those 4-Hi stacks to death, if RX Vega keeps showing a performance-per-CU-per-clock that is lower than Polaris then this chip is a total clusterfuck for gaming.
Final RX Vega gaming drivers need to show at least a 20% boost in games compared to whatever the FE is running right now.
Either that or AMD will have to be selling RX Vega for $350, because at these performance and power levels the miners won't touch the thing.
I think the results are quite clear - the current drivers are indeed so-called "Fiji-drivers".
To my understanding Vega should have everything Polaris has + its own improvements on top. Fiji has neither, so its driver won't be taking paths that utilize the improvements.
 
Even if they overclock those 4-Hi stacks to death, if RX Vega keeps showing a performance-per-CU-per-clock that is lower than Polaris then this chip is a total clusterfuck for gaming.

Sure.

I'm just positing one of the things they'll do for the "gamer" version. I don't know what they'll end up showing.
 
I think the results are quite clear - the current drivers are indeed so-called "Fiji-drivers".
To my understanding Vega should have everything Polaris has + its own improvements on top. Fiji has neither, so its driver won't be taking paths that utilize the improvements.

Not everything Polaris did should have needed drivers to use. Polaris had expanded instruction buffers, improved L2 access handling, and some form of instruction prefetch.
It seems like Vega should have some architectural elements that don't need explicit hand-holding, given how much AMD has said it changed.
 
Yeah, I agree. In my uneducated opinion, something in the chip is broken and they can't fix it in time, so we have a Fiji on (not that efficient) steroids. But I guess we'll never know for sure (like the R600/MSAA thing).
 
It seems like Vega should have some architectural elements that don't need explicit hand-holding
In GN's testing, it appears that Vega FE has better geometry and tessellation performance than Fury X at the same clock, at least in the FireStrike GFX 1 test (which is geometry heavy).

[Image: vega-v-furyx-firestrike-normal_fps.png]
 
Not everything Polaris did should have needed drivers to use. Polaris had expanded instruction buffers, improved L2 access handling, and some form of instruction prefetch.
It seems like Vega should have some architectural elements that don't need explicit hand-holding, given how much AMD has said it changed.

Then if Vega FE is getting an almost exact carbon copy of the gaming results of a GCN3 Fiji at similar clocks, whereas GCN4 Polaris 10 showed significant improvements over GCN3 Tonga, what do you think is happening here?
Broken chip using Fiji fallback driver path just to work? Broken chip using Vega driver but since it's broken this is all it can do? Healthy chip using Fiji fallback driver waiting for Vega driver?
 
I thought Tonga->Polaris was only ~+5% (as opposed to Tahiti->Polaris which was well over +30%)

Relevant thread

BTW that thread has a lot of other very relevant information on the other topics at hand, well worth a re-read.
 
I thought Tonga->Polaris was only ~+5% (as opposed to Tahiti->Polaris which was well over +30%)
Depends on the game. computerbase.de shows a 15% boost for Witcher 3. Gameworks games (i.e. geometry-intensive games) seem to show the largest boost, between 10 and 15%. Other games go down to 4%.

At the very least, Vega should be showing this 4-15% advantage over Fiji at equal clocks, though needing 40% more transistors for that sounds a bit embarrassing.
 
Polaris has better delta colour compression than Fiji, so Vega should too. That should negate Vega's slight bandwidth shortfall.
 
The "Fiji fallback driver" or "Fiji drivers" meme needs to stop. That's not how it works or should be described. Otherwise you should start calling Volta drivers "Pascal drivers" or "Maxwell drivers". There is obviously commonality -- that's just how software engineering needs to work for a GPU -- but calling it a Fiji driver or Fiji fallback is wrong.
 
The "Fiji fallback driver" or "Fiji drivers" meme needs to stop. That's not how it works or should be described. Otherwise you should start calling Volta drivers "Pascal drivers" or "Maxwell drivers". There is obviously commonality -- that's just how software engineering needs to work for a GPU -- but calling it a Fiji driver or Fiji fallback is wrong.

Can we call them "Vega drivers without any game-specific optimization" then? In the AMD demos we saw Doom running on Vega at 70+ fps, whereas in the Vega FE tests we saw on the sites it barely surpasses 50 fps. Can this be related to game-specific optimization being turned off in Vega for the sake of stability?
 
Then if Vega FE is getting an almost exact carbon copy of the gaming results of a GCN3 Fiji at similar clocks, whereas GCN4 Polaris 10 showed significant improvements over GCN3 Tonga, what do you think is happening here?
Broken chip using Fiji fallback driver path just to work? Broken chip using Vega driver but since it's broken this is all it can do? Healthy chip using Fiji fallback driver waiting for Vega driver?

There's a constellation of items that occurred to me that seem like they could factor into it, from my standpoint as an outside observer.
(TLDR: possible fallbacks to unevolved hardware, tradeoffs for same performance per clock, maybe it's not that different)

Architectural features can be disabled or enabled even if they do not need explicit targeting by the driver at runtime.
Some of the most significant alterations could be complex enough or flaky enough to not be turned on if they interact poorly with real-world code.
Risk reduction may have those new features placed in areas that can safely fall back to architectural "bones" that are proven to work. The geometry front end and rasterizer changes could fall back to bones based heavily on Fiji, given how closely they match on various parameters.
The primitive discard accelerator in Polaris is something in the same area, and that might mean it doesn't exist separately enough to work if those items aren't active--or there's another possibility I will get to later.

It may also be possible that even with some automatic improvements that there are penalties. One of the changes in the driver patches is a change in how many loads Vega can issue before having to wait, by a factor of 4. That could be a sign of something new not yet revealed, or something GCN was found to be limited by earlier, or a sign that Vega's L2/mem hierarchy needs that much more latency hiding. Something like trading off for higher clocks (or a higher-overhead data fabric?) might leave the designers purposefully aiming for roughly the same per-clock performance, with bonuses and penalties balancing out more closely than one would expect.
One specific oddity to the wait count encoding is that the counter is split, with the portion corresponding to the standard GCN wait count sitting in one portion of the encoding, and the remaining bits set on the far side opposite all the other counts. That may point to wanting some level of binary compatibility, since a more significant architectural departure shouldn't care about a few bits. That could be risk mitigation, the hardware isn't that different, or the different portion is similarly layered on top of something that isn't that different.
Something similar may apply as to why the ROPs changing in position relative to the L2 or the supposed increase in delta compression with Polaris didn't apply here, if some of the features are bypassed or add their own disadvantages. Polaris added some kind of coalescing of L2 client requests, but that might not apply the same way if Vega's L2 changed more significantly.
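To make the split wait-count encoding more concrete, here is a rough sketch of how such a field could be packed and unpacked. The bit positions are an assumption based on my reading of the open-source LLVM AMDGPU backend (low four vmcnt bits at [3:0], expcnt at [6:4], lgkmcnt at [11:8], two extra Vega vmcnt bits at [15:14]), not something stated in this thread, and the function names are made up for illustration. Going from 4 to 6 bits takes the counter from 16 to 64 values, which would match the factor of 4 mentioned above.

```python
# (shift, width) for each field of the s_waitcnt immediate.
# ASSUMPTION: layout as I understand it from the LLVM AMDGPU backend.
VMCNT_LO = (0, 4)   # classic GCN vector-memory wait count
EXPCNT = (4, 3)
LGKMCNT = (8, 4)
VMCNT_HI = (14, 2)  # extra Vega bits, placed beyond all other fields


def _pack(value, shift, width):
    return (value & ((1 << width) - 1)) << shift


def _unpack(word, shift, width):
    return (word >> shift) & ((1 << width) - 1)


def encode_waitcnt(vmcnt, expcnt, lgkmcnt, vega=True):
    """Pack the three counters into a 16-bit immediate."""
    max_vmcnt = 63 if vega else 15
    if not 0 <= vmcnt <= max_vmcnt:
        raise ValueError(f"vmcnt must be in 0..{max_vmcnt}")
    word = (_pack(vmcnt, *VMCNT_LO)
            | _pack(expcnt, *EXPCNT)
            | _pack(lgkmcnt, *LGKMCNT))
    if vega:
        word |= _pack(vmcnt >> 4, *VMCNT_HI)  # upper two bits go on the far side
    return word


def decode_vmcnt(word, vega=True):
    """Recover vmcnt, rejoining the split halves on Vega."""
    vmcnt = _unpack(word, *VMCNT_LO)
    if vega:
        vmcnt |= _unpack(word, *VMCNT_HI) << 4
    return vmcnt


# For any vmcnt that fits in 4 bits, both layouts give the same word,
# which is the binary-compatibility property speculated about above.
print(hex(encode_waitcnt(5, 7, 15, vega=False)))
print(hex(encode_waitcnt(5, 7, 15, vega=True)))
print(decode_vmcnt(encode_waitcnt(37, 0, 0)))
```

Under these assumed positions, an older decoder that only reads bits [3:0] would still see a valid, if smaller, count, which fits the compatibility reading but does not prove it.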

Comparing Tonga to Polaris may be somewhat unrepresentative because Tonga seemed a little more flaky than usual, as far as gauging improvements versus the Fiji/Vega range.

One other scenario I noted I would get to:
The device IDs for Ellesmere, Baffin, and Greenland showed up quite some time ago. Vega FE was still called Greenland in some areas.
GPU hardware even in the same "generation" isn't wholly equivalent. Tonga and Fiji have some decimal point differences in some of their IP identifiers, so what goes into the various blocks may be variably up to date.
Perhaps some elements were taken as snapshots at times we wouldn't expect, and some of the CU changes that went into Polaris were added after whatever snapshot was taken by Greenland, like specific tweaks to the instruction buffer size. Perhaps some of the other geometry and compression tweaks happened uniquely within the Polaris branch.

In addition to this, Polaris had features touted at launch that had already been touted at launch for Fiji and Carrizo. Given the overlapping timeframes of the projects and varying delays, perhaps we need to wait and see how many slides Vega RX cribs from prior gens.


Another random bit of comparison:

Since AMD has been touting Vega as being the first GPU with Infinity Fabric:
One of the marketed benefits is a more modular command fabric, which I noted with the Zen slides has a sensor and control loop that measures and reacts every millisecond.
This was noted in various reviews and AMD's Ryzen page for its precision boost tech. I believe it is also mentioned for EPYC and its more distributed system.

For comparison:
http://www.anandtech.com/show/7457/the-radeon-r9-290x-review/5

For the 290X and 260X, when combined with the IR 3567B controller AMD is currently using, this translates into the ability to switch voltages as frequently as every 10 microseconds, and to do so by switching between upwards of 255 voltage steps.

AMD has put various items like bees in a jar: its highly responsive GPU management, its futuristic fabric, what level of responsiveness was achieved, how much that responsiveness matters, and an unknown level of integration of these items into Vega's hardware. Then it shook the jar. I'm curious what falls out in a month or so.
 