AMD Vega 10, Vega 11, Vega 12 and Vega 20 Rumors and Discussion

Will Volta have FP16 too? If so, yes, I don't see why Futuremark should ignore this. It's a new feature...

For me it's not about "ignoring". Feature Level 12_1 has existed since September 2014 with Maxwell v2, and Futuremark still hasn't shipped anything that uses it. The same goes for nVidia-specific VR features like "Multi-res Shading", which isn't supported in VRMark even though a few games use it.

It just seems strange to me. I don't really care much, but it casts a shadow over their neutrality between vendors.

/edit: It reminds me a little bit of the Jon Peddie report, which was sponsored by AMD to showcase that AMD is powering most "gaming devices".
 
Anyone wondering where all the die size of Vega went, wonder no more. It has 45MB of SRAM on die. GP100 has 23MB, GV100 has 33MB.
I'm wondering where all this goes. Maybe we can put the pieces together. The most obvious ones:
- 4 MiB L2 cache
- 64× 64 KiB LDS
- 64× 4 KiB scalar RF
- 64× 4× 64 KiB vector RF
- 64× 16 KiB L1 cache

If I'm not mistaken, that's only 25,856 KiB.
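For reference, the tally works out like this (a quick sketch in Python; the per-structure sizes are the usual GCN figures rather than confirmed Vega numbers, and the quoted "45 MB" is taken at face value as MiB):

```python
KiB = 1

l2_cache  = 4 * 1024 * KiB      # 4 MiB L2
lds       = 64 * 64 * KiB       # 64 CUs x 64 KiB LDS each
scalar_rf = 64 * 4 * KiB        # 64 CUs x 4 KiB scalar register file
vector_rf = 64 * 4 * 64 * KiB   # 64 CUs x 4 SIMDs x 64 KiB vector RF
l1_cache  = 64 * 16 * KiB       # 64 CUs x 16 KiB vector L1

known = l2_cache + lds + scalar_rf + vector_rf + l1_cache
print(known)               # 25856 KiB, i.e. ~25.25 MiB

# Against a claimed ~45 MiB of on-die SRAM, roughly 20 MiB is unaccounted for.
print(45 * 1024 - known)   # 20224 KiB
```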
 
I might have missed it in the discussion, but has there been much talk of the move to 8 shader engines? It seems a fundamentally new aspect of Vega vs Fiji. I wonder if this would also explain the fairly consistent performance from 1080p through 4K?
I haven't seen a reference for this. The count of geometry engines is listed as four, and those are distributed at one per engine.
 
One of the slides shows a grouping of eight CUs, but without mentioning explicitly eight shader engines.
 
Futuremark creates a benchmark sponsored by AMD. Yikes, guess that means the end of their business model...
You mean like 3DMark Vantage, which had nVidia PhysX GPU acceleration contributing to the final score? At least FP16 calculations aren't vendor-specific.

Rest assured though, they mentioned "demo", not benchmark.


A bunch of Intel integrated stuff has FP16 also, so it's not really just AMD.

According to our own Intel graphics resident, it's on every Gen8 (Broadwell) iGPU and newer.
And if FP16 is heavily used in post-processing calculations, I could see Intel GPUs being able to take the post-processing effects all to themselves when a weaker dGPU is detected (say, laptops with Polaris 11/12 or GP107/108). After all, we're looking at a minimum of 800 GFLOPS FP16 even in the broadly available GT2 models. That could make quite the difference on <2 TFLOPS mobile GPUs like the GTX 1050 and the MacBook's Radeon Pro 460.
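The 800 GFLOPS figure roughly checks out (back-of-envelope sketch; the EU count and clock below are typical Gen8/Gen9 GT2 values I'm assuming, not a specific SKU):

```python
# Rough FP16 throughput for an Intel Gen8/Gen9 GT2 iGPU (assumed values).
eus             = 24                    # typical GT2 configuration
fp32_per_eu_clk = 2 * 4 * 2             # 2 SIMD-4 FPUs per EU, FMA counted as 2 ops
fp16_per_eu_clk = fp32_per_eu_clk * 2   # FP16 runs at double rate
boost_clock_ghz = 1.05                  # assumed boost clock

gflops_fp16 = eus * fp16_per_eu_clk * boost_clock_ghz
print(gflops_fp16)                      # ~806 GFLOPS FP16
```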

It reminds me a little bit of the Jon Peddie report, which was sponsored by AMD to showcase that AMD is powering most "gaming devices".
That AMD powers most gaming devices should be common sense to everyone, since the 2013 consoles alone are responsible for well over half of the active gamers worldwide (the ones who pay $50 for their games, not the F2P/<$1 mobile crowd, of course). AMD simply hired a consultant to tell them exactly how far they were above everyone else. Those are numbers they can now throw at developers who might be hesitant to optimize for their GPU architecture.
That market research may have been partially responsible for the Bethesda deal, and for this new Ubisoft deal.





I'm wondering where all this goes. Maybe we can put the pieces together. The most obvious ones:
- 4 MiB L2 cache
- 64× 64 KiB LDS
- 64× 4 KiB scalar RF
- 64× 4× 64 KiB vector RF
- 64× 16 KiB L1 cache

If I'm not mistaken, that's only 25,856 KiB.
Buffers for HBCC?
 
There is nothing wrong with Futuremark supporting FP16. They should offer it in a way that lets you compare FP32 to FP16, maybe with Time Spy. It's a neutral feature, just like DX12. Of course, how much anyone cares is another matter; they still don't support Vulkan, and Time Spy was a bit meh. FP16 would actually seem more of a core feature to support if it became relevant.

Well Futuremark is (IMHO) mostly irrelevant, but Far Cry 5 is really nice news. Where did you get that info? Do they say it's coming for the PC version or did they just mention the PS4 Pro version?


EDIT: Found it:




So Ubisoft's Dunia is the second engine to officially support FP16 shaders on the PC.




EDIT2:

Nope, Wolfenstein II: The New Colossus (2 months away) with idTech6 is bringing the whole thing: Vulkan, FP16 and intrinsic shaders (like Doom before it).


I wouldn't care as much if it were just Far Cry; I'm not that interested in the series. But Wolfenstein has been looking really good: great characters, and the visuals are so sweet. I would start working heavily towards getting Vega if the above comes to fruition. Vulkan, FP16, intrinsic shaders? Pascal does not support double-rate FP16 :love::love:. That game should run beautifully on Vega at max settings.

On the competitive side, Pascal does not support this, so if FP16 is a "simple enough" addition for developers to support and it does produce significant performance gains, that would be the biggest thing for Vega. If they get anywhere near a real 25 TFLOPS GPU, it should be well above a 1080 Ti.
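For what it's worth, the 25 TFLOPS figure is just the shader count times the advertised boost clock at double rate (sketch; sustained clocks will be lower in practice):

```python
# Peak Vega 64 throughput at an assumed 1.55 GHz boost clock.
stream_processors = 4096
boost_clock_ghz   = 1.55

fp32_tflops = stream_processors * 2 * boost_clock_ghz / 1000   # FMA = 2 ops/clock
fp16_tflops = fp32_tflops * 2                                  # packed FP16 at 2x rate
print(round(fp32_tflops, 1), round(fp16_tflops, 1))            # ~12.7 FP32, ~25.4 FP16
```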

What do the developers familiar with FP16 and FP32 performance think, though? I'd imagine that if the performance gains are there, this could spread like wildfire.

Another question I have is whether consoles are using this. It was said the PS4 Pro supports doing two FP16 operations at once. Adoption in the console space that can easily transfer to the PC would help adoption.
 
One of the slides shows a grouping of eight CUs, but without mentioning explicitly eight shader engines.
GCN can subdivide the CUs within a shader engine. There is a shader array ID bit tracked per wavefront in the Southern Islands ISA, and even though AMD stopped documenting the context register that holds it in later revisions, the driver patches have maintained it. Shader arrays have almost always been documented as one per engine, but perhaps Vega's shader engines have two.
 
FP16 might actually be more than just a gimmick used in benchmarks. The PS4 Pro supports double-rate FP16, after all. If developers optimize for that, why not run the same optimizations on RX Vega?
 
My interpretation is that the LC card is bundle-only. A real pity, since that was the card I was planning on getting at $599. Guess I'm waiting for AIB versions instead, though I'm worried about the price/perf...
Yep, I don't think miners will even target the Vega 64 Liquid; it's Vega 56 they should be worried about.
Does anyone know the underlying reason for the min frame rates? The typical bottleneck that gets hit when they happen?
In the context of the Vega launch? Cherry-picking? Like testing games where they know NV has trouble with DX12 (like BF1 SP), and testing with less than max settings (Sniper 4, Far Cry Primal, Doom, Deus Ex, Ashes). Also, COD Infinite Warfare has a bug in the current NV driver where erratic fps drops are frequent (it literally drops to 30 fps no matter the card or settings). And Fury X having better min fps than all NV cards was a dead giveaway.

There is also clever product placement: for example, in this slide the GTX 1080 has better minimums than Vega in several titles, but it is placed to the left to create the illusion of being consistently behind. (Also observe COD Infinite Warfare, where the 980 Ti and 1080 have the same minimums due to the above-mentioned bug.)
[Slide: slides-raja-38.jpg]

[Slide: slides-raja-35.jpg]
 
Buffers for HBCC?

The full slide deck appears to have a fuzzy wafer shot for Vega.
A fair number of my earlier guesses from the artistic rendering of the die were off. It's actually more standard in terms of layout than I had thought. The CU blocks, caches, and RBEs are where they usually are. Visually, the caches and register files light up where one would expect, and the center strip shows a decent amount of storage as well. The L2 seems to be subdivided into 8 blocks on a side.

That does leave the HBCC area in AMD's diagram, which doesn't appear to be hit by the light to really show off any arrays. However, the diagram and the shadowed regions of the wafer shot show a region of pretty high regularity that could contain a goodly amount of storage. If it's buffering multiple memory pools and tracking irregular page residency and use history, it could add up to a fair amount.

Then there are the memory controllers and the coupled Infinity Fabric, which should have some pretty deep buffers. The extent of the fabric's presence is still not clear, although, as noted, a lot of the GPU proper looks rather standard.
 
I'm wondering where all this goes. Maybe we can put the pieces together. The most obvious ones:
- 4 MiB L2 cache
- 64× 64 KiB LDS
- 64× 4 KiB scalar RF
- 64× 4× 64 KiB vector RF
- 64× 16 KiB L1 cache
You are missing the ROP caches. These are likely much larger in Vega because of the tiled rasterizer. Also the parameter cache (for vertex shader -> pixel shader interpolants). The vertex interpolants also need to be stored for a much longer time because of the tiled rasterizer, which might require quite a bit of extra storage.

So my guess is that "draw stream binning rasterizer" is the biggest reason for the added SRAM.
 
The reason I'm asking about the min frame rates is that I doubt they have anything to do with raw compute power.

If Vega is a GP102-class device (as seen with SPECviewperf) with some unexplained compute bottleneck, its min frame rates should be more or less in line with a Titan Xp.
 
Especially as the G-Sync monitors go lower in minimum frequency than the FreeSync example, which does not work with the NV cards anyway.

FreeSync supports frame doubling just like G-Sync. Under 48 fps on that monitor, the effective refresh would land at 94 Hz and below as frames get doubled.
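A minimal sketch of how low-framerate compensation lands on those numbers (illustrative only; the 48-144 Hz window is assumed from the monitor discussed above):

```python
def lfc_refresh(fps, vrr_min=48, vrr_max=144):
    """Repeat each frame until the effective refresh falls inside the VRR window."""
    refresh = fps
    while refresh < vrr_min:
        refresh *= 2          # scan each frame out twice (or 4x, 8x, ...)
    return min(refresh, vrr_max)

print(lfc_refresh(47))        # 94 Hz
print(lfc_refresh(20))        # 80 Hz
```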

They are also ignoring the extra cost of a G-Sync vs FreeSync monitor in that example.
 
I'm wondering where all this goes. Maybe we can put the pieces together. The most obvious ones:
- 4 MiB L2 cache
- 64× 64 KiB LDS
- 64× 4 KiB scalar RF
- 64× 4× 64 KiB vector RF
- 64× 16 KiB L1 cache

If I'm not mistaken, that's only 25,856 KiB.
SIMD PROCESSING UNIT WITH LOCAL DATA SHARE AND ACCESS TO A GLOBAL DATA SHARE OF A GPU

If I had to guess, per-SIMD LDS? The filing and publication dates on that are only months old.

In accordance with a further embodiment of the present invention, the private write space within the LDS is variable. By way of example, and not limitation, private write space is assigned as one register per thread, accommodating up to sixteen wavefronts, or, alternatively, sixteen registers and only one wavefront. One skilled in the relevant arts will recognize that a number of combinations of number of registers assigned per thread, and the resulting total number of wavefronts which may be accommodated by the LDS, exist, and the above grouping is provided by way of example, and not limitation. In an additional embodiment, the wavefronts can also be grouped into variable size groups of threads.
There was a Linux driver patch the other day adjusting waves per CU from 40 to 16 in no uncertain terms, so this patent is likely Vega, considering it's already published. It seems to be doing some sort of recursive 16:1 reduction involving pixels. At first I thought it was a tensor thing, but "pixels".

Buffers for HBCC?
That's a distinct possibility, but I can't imagine it's more than a megabyte. Even 128 KB should be able to handle 16 GB of RAM plus a few tags and whatever else the controller requires. CPUs have had page tables for a while, and nothing there approaches that scale.
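A quick sizing check on that (a sketch; HBCC page granularity isn't public, so both a coarse and a fine page size are shown as assumptions):

```python
def metadata_budget(pool_bytes, page_bytes, table_bytes=128 * 1024):
    """Return (page count, bytes of metadata available per page) for a 128 KB table."""
    pages = pool_bytes // page_bytes
    return pages, table_bytes / pages

GiB = 1024 ** 3
print(metadata_budget(16 * GiB, 2 * 1024 * 1024))   # (8192, 16.0)  -> roomy with 2 MB pages
print(metadata_budget(16 * GiB, 64 * 1024))         # (262144, 0.5) -> tight with 64 KB pages
```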
 
You are missing the ROP caches. These are likely much larger in Vega because of the tiled rasterizer. Also the parameter cache (for vertex shader -> pixel shader interpolants). The vertex interpolants also need to be stored for a much longer time because of the tiled rasterizer, which might require quite a bit of extra storage.

So my guess is that "draw stream binning rasterizer" is the biggest reason for the added SRAM.

The wafer shot does have some portions where the RBE sections are well-lit. They seem mostly free of large arrays.
I would think that the rasterizer would mostly interact with depth information, which is something like 4 KB per RBE in Hawaii. AMD has been pretty consistent with RBE cache sizes across GCN, and GCN's counts seem consistent with prior GPUs.
Even if significantly larger, it's starting from a very low point and AMD may be counting on the L2 to mitigate any capacity needs from now on.

I would have thought the actual binning and rasterizer component and most of their storage would be nearer the geometry front ends, given how intimately it links back to primitive setup. It might be a reason for the front end to be constrained if it does heavily rely on the RBE caches rather than hosting its own storage and a snapshot of the bin information. The RBEs are stretched out further in physical terms, and could be servicing requests from the rasterizer while still under load from the CUs.

That could explain why AMD's so gung-ho with primitive shaders and programmable methods that try to keep things away from that path, and may align with why Vega's culled and non-culled performance currently seem to be drawing from a similarly limited pool in some of the synthetics--although the new method was supposedly off for those.
 
So, I'm not surprised to see that gaming performance with Vega FE is hobbled by both immature power management and inactive DSBR.

Is there a decent slide deck or white paper anywhere? Is there a full architecture description? Weren't more architectural features due to be revealed at Siggraph? I haven't seen any hint of anything new so far from Siggraph, judging by this thread.
 