AMD Vega Hardware Reviews

Why would it be using "Fiji" drivers and not even Polaris-derived ones?

Can someone please actually LINK to a statement from AMD that gives credence to the "Fiji" driver theory? I mean, we are on the Internet, where URLs are a thing.
http://www.pcgameshardware.de/AMD-Radeon-Grafikkarte-255597/Specials/Vega-10-HBM2-GTX-1080-1215734/
There's one; it should be mentioned more or less directly in all the articles about AMD's Doom demos at the Tech Summit last December.

As for why "Fiji-based" rather than "Polaris-based", I would guess it comes mostly down to the similar unit configuration in the GPU and the HBM memory controllers.

edit: and here's the reddit post suggesting Fiji-drivers
edit: ffs autoparsing all the reddit links
reddit.com /r/Amd/comments/6kdwea/vega_fe_doesnt_seem_to_be_doing_tiled/

The reason it's not doing tiled rasterization is that it has not been "turned on" yet. Vega is using the fallback rendering method instead of the tiled one.
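For readers less familiar with the terms, here is a minimal Python sketch of what tiled (binned) rasterization means conceptually versus an immediate-mode fallback. The tile size and the bounding-box binning are illustrative assumptions, not a description of AMD's actual DSBR hardware.

```python
# Minimal sketch of binned (tiled) rasterization, just to illustrate the idea --
# not AMD's actual DSBR implementation. Triangles are sorted into screen-space
# tiles first, so each tile's framebuffer traffic can stay in a small on-chip buffer.

TILE = 32  # hypothetical tile size in pixels

def bin_triangles(triangles, screen_w, screen_h):
    """triangles: list of ((x0, y0), (x1, y1), (x2, y2)) in pixel coordinates."""
    tiles_x = (screen_w + TILE - 1) // TILE
    tiles_y = (screen_h + TILE - 1) // TILE
    bins = {(tx, ty): [] for tx in range(tiles_x) for ty in range(tiles_y)}
    for tri in triangles:
        xs = [v[0] for v in tri]
        ys = [v[1] for v in tri]
        # Conservative binning by bounding box: the triangle goes into every
        # tile its bbox overlaps, clamped to the screen.
        tx0 = max(0, int(min(xs)) // TILE)
        tx1 = min(tiles_x - 1, int(max(xs)) // TILE)
        ty0 = max(0, int(min(ys)) // TILE)
        ty1 = min(tiles_y - 1, int(max(ys)) // TILE)
        for ty in range(ty0, ty1 + 1):
            for tx in range(tx0, tx1 + 1):
                bins[(tx, ty)].append(tri)
    return bins

# In the fallback (immediate-mode) path, triangles are instead rasterized in
# submission order across the whole screen, touching the framebuffer in DRAM far more often.
bins = bin_triangles([((5, 5), (60, 10), (20, 70))], 1920, 1080)
print(sum(len(v) for v in bins.values()), "tile entries for one triangle")
```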

When it was first discovered that Maxwell used tile-based rendering, there was talk about a lot of software that needed to be written or rewritten in order to utilize it correctly, and Nvidia implemented that in their drivers.

Vega is using a predominantly Fiji driver, and this feature has not been "turned on". In fact, all but one of the new features in Vega are non-functional right now, the exception being the pixel engine being connected to the L2 cache, since that is hardwired. I tore apart the new drivers in IDA today, and the code paths between Fiji and Vega are very close and only differ slightly.

This arch is a massive change from anything they have released with GCN. They built fallbacks into the hardware because of the massive changes. It's a protection against poorly written games and gives AMD a starting point for driver development. Hell, even architecturally Vega is essentially Fiji at its most basic; that's why it is performing exactly like it, because none of its new features are enabled or have a driver published for them yet. It is performing like a Fury X at 1400 MHz because that is exactly how every computer is treating it.
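The "Fury X at 1400 MHz" claim is easy to sanity-check with back-of-the-envelope clock scaling. The sketch below assumes a stock Fury X at ~1050 MHz and a Vega FE running around 1400 MHz in sustained loads; real game results will of course deviate.

```python
# Back-of-the-envelope check of the "Fury X at 1400 MHz" claim: if none of the new
# features are active, Vega FE should roughly track Fiji scaled by clock alone.
# Clock figures are assumptions for illustration (Fury X ~1050 MHz stock,
# Vega FE around ~1400 MHz sustained).

FURY_X_CLOCK_MHZ = 1050
VEGA_FE_CLOCK_MHZ = 1400
SHADERS = 4096  # both Fiji (Fury X) and Vega 10 have 4096 shaders

def peak_fp32_tflops(shaders, clock_mhz):
    # 2 FLOPs per shader per clock (fused multiply-add)
    return shaders * clock_mhz * 1e6 * 2 / 1e12

fury = peak_fp32_tflops(SHADERS, FURY_X_CLOCK_MHZ)
vega = peak_fp32_tflops(SHADERS, VEGA_FE_CLOCK_MHZ)
print(f"Fury X peak: {fury:.1f} TFLOPS, Vega FE peak: {vega:.1f} TFLOPS")
print(f"Expected uplift from clocks alone: {vega / fury - 1:.0%}")  # ~33%
```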
 
But it would not have had Vega's FP16 and INT8 performance.

If you look at the size of GP100 and GV100, it seems like dual rate FP16 might come with a price in die size.
GP100 has 1:2 FP64, as well as more cache. That's your culprit for die size. FP16 shouldn't take nearly the die size penalty we see on Vega.
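To put rough numbers on the rate discussion: a quick sketch of how the peak rates fall out of the advertised ratios. The ratios are the published ones (Vega FE: 2:1 packed FP16, 1:16 FP64; GP100: 2:1 FP16, 1:2 FP64); the 1.6 GHz clock is an assumption for illustration.

```python
# Rough peak-rate arithmetic for the FP16/FP64 rate discussion.
# Ratios are the advertised ones; the clock is an assumption for illustration.

def peak_rates(shaders, clock_ghz, fp16_ratio, fp64_ratio):
    fp32 = shaders * clock_ghz * 2 / 1000.0  # TFLOPS, 2 FLOPs/clk (FMA)
    return {"fp32": fp32, "fp16": fp32 * fp16_ratio, "fp64": fp32 * fp64_ratio}

vega_fe = peak_rates(4096, 1.6, fp16_ratio=2, fp64_ratio=1 / 16)
print({k: round(v, 1) for k, v in vega_fe.items()})
# -> {'fp32': 13.1, 'fp16': 26.2, 'fp64': 0.8}
# The dual-rate FP16 comes from packing two 16-bit operands per 32-bit register lane,
# which should be far cheaper in area than GP100-style 1:2 FP64 plus its extra cache.
```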
 
http://www.pcgameshardware.de/AMD-Radeon-Grafikkarte-255597/Specials/Vega-10-HBM2-GTX-1080-1215734/
There's one; it should be mentioned more or less directly in all the articles about AMD's Doom demos at the Tech Summit last December.

As for why "Fiji-based" rather than "Polaris-based", I would guess it comes mostly down to the similar unit configuration in the GPU and the HBM memory controllers.

edit: and here's the reddit post suggesting Fiji-drivers
edit: ffs autoparsing all the reddit links
reddit.com /r/Amd/comments/6kdwea/vega_fe_doesnt_seem_to_be_doing_tiled/

If true, it should be easy enough for AMD to show us a benchmark with tiling activated in at least one game, so we can see the improvements that will come...
 
¯\_(ツ)_/¯
AMD did indeed say at the time that those early Doom demos were using the then-current Fiji drivers... now, why that would still be the case more than 6 months later is another mystery...

And how would you get the Int8 and FP16 throughput Vega is showing with a Fiji driver?
 
From the reddit post:

I tore apart the new drivers in IDA today and the code paths between Fiji and Vega are very close and only differ slightly.

Is anyone here capable of replicating this process?
 
Everything seen so far suggests the Vega FE is currently acting just as if it were Fiji, instead of having all the fancy new features and whatnot enabled.
Even if we take this statement at face value, would it explain the performance?

What kind of performance improvement can you reasonably expect from the Vega improvements?

As I understand it, HBCC does little unless the software is written for it, so it wouldn't make a difference for today's games.

NCU supposedly has better IPC. That's great, but how much of a difference does that really make? Are we talking here about being better in cases of warp divergence? For games, would that give better than 5% on average?

Then there's tiling. A nice boost for BW, but that doesn't explain why the FE has a hard time keeping up with a 1080, which has 20% less BW. And let's not forget that the impact of extra BW on gaming performance isn't that high. (That is: if you increase memory clocks by 10%, gaming performance typically goes up by less than 5%.)
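As a quick illustration of that bandwidth-sensitivity point: the sketch below just encodes the rule of thumb quoted above; the 0.4 sensitivity value is an assumption in line with "<5% for +10% memory clocks", not a measured number.

```python
# Rule-of-thumb sketch: gaming performance scales at well under 1:1 with memory
# bandwidth, so even a large tiling-driven effective-BW gain has a bounded payoff.
# The 0.4 sensitivity is an assumed value consistent with the observation above.

def perf_gain_from_bw(bw_gain, sensitivity=0.4):
    """Estimated relative FPS gain for a given relative effective-bandwidth gain."""
    return bw_gain * sensitivity

for bw_gain in (0.10, 0.25, 0.50):
    print(f"+{bw_gain:.0%} bandwidth -> roughly +{perf_gain_from_bw(bw_gain):.0%} FPS")
```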

FP16 isn't currently used a lot (if at all) in games, so that doesn't make a difference either.

Something else is going on.
 
There should be some synergy between the binning rasterizer and better divergence handling. Divergence handling wasn't mentioned by AMD when it started talking about Vega in more detail, though.
 
There should be some synergy between the binning rasterizer and better divergence handling. Divergence handling wasn't mentioned by AMD when it started talking about Vega in more detail, though.
The reason I brought up divergence handling is that it's one of the only things I could immediately come up with. :) I thought GCN already had a pretty efficient shader core. But I'm all ears about other potential improvements.
 
Regarding Die Size: I am not 100% sure that the 1/16th DP rate we're seeing advertised for current productization is really the maximum available with this Vega ASIC. I am not saying that it has to have half-rate DP or so, but it's a possibility that it's more than 1/16th.
 
The reason I brought up divergence handling is that it's one of the only things I could immediately come up with. :) I thought GCN already had a pretty efficient shader core. But I'm all ears about other potential improvements.

(*late correction: Meant to say 4 threads per CU below)

GCN does have a pretty coarse threshold for getting utilization; testing of compute loads shows the architecture takes longer to spin up, needing groups that approach its significantly wider wavefronts and having at least 4 threads per SIMD (then probably at least 4 more to help hide a death by a thousand low-latency cuts, and now the number of registers per wave comes into play).
I suppose divergence handling can partially apply to that.
Another way is finding something to match Volta's ability to not roll over and die with divergent or irreducible control flow with synchronization in the mix.

Outside of that, there was discussion about how nice it would be if there were a scalar equivalent to the floating-point and other capabilities in the SIMDs. Also, there were some discussions from a console context that probably apply generally about kernels or shaders benefiting if it were possible to pull more than one scalar or LDS value per clock into a vector operation.

There are some signs that AMD has tried to improve the shared fetch front end and instruction caches, which might be a scaling barrier.

Register file capacity has not changed, so occupancy is still a source of pressure. I suppose FP16 is the hoped-for mitigation, but one reason why it's not as flexible is that packed math doesn't expand items like the conditional masks or the EXEC mask.
There was some discussion about wanting some way to reduce occupancy pressure, or to allow execution on pending registers when utilizing GCN's ability to fire off multiple loads. Vega's waitcnt for memory is significantly higher, with an odd way of being split up so as not to conflict with existing encodings, which points to the NCU perhaps not being as large a departure as some statements indicate. Whether that means new things, or perhaps something really increasing the memory subsystem's latency, is unknown.

Then there's the static allocation of registers based on the worst-case consumption of a kernel. I think AMD might have some speculative work on this, but no mention in any roadmap or marketing.
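For reference, this is roughly how that static, worst-case allocation caps occupancy on a GCN SIMD (256 VGPRs per lane, at most 10 wavefronts per SIMD); the kernel VGPR counts in the example are made up, and SGPR/LDS limits are ignored.

```python
# Sketch of how static, worst-case register allocation caps occupancy on a GCN SIMD.
# Each SIMD has a 256-entry VGPR file per lane and can hold at most 10 wavefronts;
# SGPR and LDS limits are ignored for simplicity. Kernel VGPR counts are made up.

MAX_WAVES_PER_SIMD = 10
VGPRS_PER_SIMD_LANE = 256

def waves_per_simd(vgprs_per_wave):
    # The whole worst-case allocation is reserved for the lifetime of the wave,
    # even if peak register pressure only occurs in a rarely taken branch.
    return min(MAX_WAVES_PER_SIMD, VGPRS_PER_SIMD_LANE // vgprs_per_wave)

for vgprs in (24, 48, 84, 128, 200):
    print(f"{vgprs:3d} VGPRs/wave -> {waves_per_simd(vgprs)} waves per SIMD")
# 24 -> 10 (capped), 48 -> 5, 84 -> 3, 128 -> 2, 200 -> 1
```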

There's the desire to have better memory consistency/coherence (quick release, timestamped coherence, etc.), with various papers from AMD but no mention so far here.

It's more of an aesthetic preference on my part, but I feel some of the low-level details leaking into the architecture with regards to wait states or specific architectural quirks could use tightening up.

Maybe more bandwidth from the L1 (Knights Landing is up to 128B)?

*edit: Maybe start looking at the whole cadence and CU implementation at some point?
 
Are you sure that they need 4 warps to keep the CU busy? I thought, with a 4 deep pipeline, a 64 wide warp, and a 16 wide SIMD, they could do it with just 2 warps?
 
For full vector throughput, there needs to be a wavefront active for each of the 4 16-wide SIMDs in a CU. The CU moves round-robin between the SIMDs, although given single-instruction-per-wavefront issue and other sundry stall sources, it helps to have additional wavefronts to fill in the blanks.
Thrashing and other issues (perhaps bigger L1s would help?) can make that a problem, and there are certain types of compute that can get away with fewer wavefronts with unusually large register allocations.

GCN might be somewhat more prone to switching between threads, or need it more than other architectures do. That's a potentially cheaper way to hide latency, but perhaps the level of switching behavior needs review at the latest nodes: losing coherence in the data paths and traffic patterns, and perhaps losing some of the predictability for power/clock gating, may be more costly.
GCN's write-through L1s to distant L2 cache hierarchy might need a look as well.
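A toy model of the round-robin issue described above, just to show why a wave64 instruction executed over a 16-wide SIMD lines up with one wavefront per SIMD (4 per CU) for peak vector throughput. Everything else (stalls, scalar/memory instructions, per-wavefront issue limits) is deliberately ignored.

```python
# Toy model of round-robin vector issue on a GCN CU. A wave64 instruction occupies
# a 16-wide SIMD for 4 cycles, and the CU issues to a different SIMD each cycle,
# so with one ready wavefront per SIMD every vector slot can be filled.

NUM_SIMDS = 4
WAVE_SIZE = 64
SIMD_WIDTH = 16
CYCLES_PER_VECTOR_OP = WAVE_SIZE // SIMD_WIDTH  # 4

def vector_utilization(num_wavefronts, cycles=1000):
    busy = [0] * NUM_SIMDS                                   # cycles each SIMD stays occupied
    waves = [i % NUM_SIMDS for i in range(num_wavefronts)]   # wave -> assigned SIMD
    issued = 0
    for cycle in range(cycles):
        simd = cycle % NUM_SIMDS                             # one SIMD considered per cycle
        if busy[simd] == 0 and simd in waves:
            busy[simd] = CYCLES_PER_VECTOR_OP
            issued += 1
        busy = [max(0, b - 1) for b in busy]
    # Each issue keeps a SIMD's 16 lanes busy for 4 cycles.
    return issued * CYCLES_PER_VECTOR_OP / (cycles * NUM_SIMDS)

for n in (1, 2, 4, 8):
    print(f"{n} wavefronts -> ~{vector_utilization(n):.0%} of peak vector throughput")
# 1 -> ~25%, 2 -> ~50%, 4 or more -> ~100% in this idealized model
```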
 
Are you sure that they need 4 warps to keep the CU busy? I thought, with a 4 deep pipeline, a 64 wide warp, and a 16 wide SIMD, they could do it with just 2 warps?

Oops, I saw where I made a mistake. I meant CU in the earlier post, but somehow wrote SIMD.
 