AMD Vega 10, Vega 11, Vega 12 and Vega 20 Rumors and Discussion

I'm puzzled as to why these features still need enablement when Vega was publicly demonstrated in 2016.

For the same reason that Fermi GF100 needed a GF110/GF100b revision? This story with the Infinity Fabric is eerily similar to the interconnect problems Fermi allegedly suffered from according to JHH. I wonder if we might still see a fixed Vega with much better performance (by the way where is Charlie saying Vega is broken and unfixable now? :D He has been awfully quiet of late about GPUs, made for nice drama :( ) I believe more in a reborn Vega RV670 style (but still quite a big chip) than in miraculous drivers.
 
Intel's Gen 9 architecture supports 2x FP16. And they have the majority of the market. Doesn't that indicate there's a significant base for supporting half float in games?
To be fair, Intel's drivers for a long time did not expose that under DX12 either. Strangely though, it was available under DX11.3 as "Pixel Shader Precision 16/32-bit" and "Other Stage Precision 16/32-bit". (I found log files for at least .4352 & .4404.)

Recent drivers (I don't remember which version exactly) made it available in DX12.

True, but even with that support, many of those systems aren't up to the task of playing much beyond platformers and some esports if that.
Quite the contrary, every watt or fraction thereof saved in tightly constrained mobile chips frees up some leeway for higher clocks somewhere else on the chip.
 
For the same reason that Fermi GF100 needed a GF110/GF100b revision? This story with the Infinity Fabric is eerily similar to the interconnect problems Fermi allegedly suffered from according to JHH.
While GF100b also enabled full-rate FP16, yes, it's eerie. Especially when you take into account that Fermi was Nvidia's first try at fully distributed geometry and Vega is AMD's first chip where geometry can be shared across the shader engines (and I am not talking about that very slim line indicating load balancing in former quad-engine Radeons).
 
Oh yes, they surely would be. Unfortunately, as long as it is arbitrarily disabled in the drivers, developers cannot count on there being an installed base, so those who are being ruled by their CFOs would think twice about using it. There needs to be consistent driver support, even if it is only beneficial in terms of register space saved.
Are you sure about that? Because I saw a few reddit posts with people running benchmarks alongside Fiji to match the Instinct-based solution, and it seems that it's "disabled" on DX, but on Vulkan it shows as clearly enabled.
 
I am sure about the DirectX part, yes. That's the game developers' API of choice and what the discussion was about. Good to know that it's enabled on Vulkan. Hope there's some massive uptake there.

Seems like you're linking a private reddit? I cannot see it.
r/realAMD
"/r/AMD is full of anti AMD shills and shit posters, this is the place for real AMD enthusiasts."
 
Anarchist thinks that Volta MPS is a clear indication of the failure of Pascal (and prior) GPU uArch because it is clearly emulating ACEs which establishes GCN's absolute superiority in the market.
It's definitely a hindrance for asynchronous behavior and efficiently accelerating synchronization. Nothing new there and rather obvious with Volta adding the hardware capability for "performance critical" parts of the MPS server now. Interrupting the CPU for every warp dispatch when presented with asynchronous behavior isn't exactly ideal with the latency involved. Same issue with high priority compute tasks. Not exactly a secret that Nvidia has been dragging their feet in regard to low level APIs.
 
I am sure about the DirectX part, yes. That's the game developers' API of choice and what the discussion was about. Good to know that it's enabled on Vulkan. Hope there's some massive uptake there.

Seems like you're linking a private reddit? I cannot see it.
r/realAMD
"/r/AMD is full of anti AMD shills and shit posters, this is the place for real AMD enthusiasts."

[image: PEYGyV3.png]
 
Yes, my statement was not about technical capability for FP16 per se. It was about the Packed Math in consoles, which is only available on PS4 Pro. The AMD GPU in Xbox One X does not support it, so it's not like this Vega advantage translates directly to PC in the same way. Unless Packed Math is handled entirely by the driver, that is, requiring no input from the developer beyond defining the variable as FP16.
There are a few reasons why PC games do not already support fp16. The first reason is that Vega was just launched. It is the first discrete GPU with 2x rate fp16. Most recent Intel GPUs also have 2x rate fp16. There was simply no reason to bring fp16 shader code to PC (it adds testing cost).

The second reason is that PS4 Pro was launched last Christmas. It takes time until developers add fp16 support for their shaders. Some AAA devs, such as DICE, have already started doing this, but it takes time to modify large AAA shader code bases to support fp16 on PS4 Pro. Vega is actually very good for PS4 Pro fp16 adoption, since fp16 optimizations now benefit both PS4 Pro and AMD's latest PC GPUs. You can now also test your fp16 code on PC (Vega GPU on a development workstation + PS4 Pro devkit), potentially improving your iteration time. I am sure we will see fp16 more in the future (both PS4 and PC). The fact that Intel GPUs also benefit from fp16 code is another bonus. Everybody knows that Nvidia will eventually follow suit, as they already have 2x rate fp16 on their mobile GPUs and their professional GPUs (*), so putting developer effort into fp16 code will benefit all PC GPUs in the future. Now is the right time to start spending effort on it.

Packed math is managed by the compiler. The compiler handles it similarly to the old vec4/VLIW architectures. You don't need to manually write packed (vec2) code. You simply use the new half float types (min16float in HLSL) instead of the existing float types. The compiler automatically packs two of them into each 32-bit register. The compiler obviously needs to be clever to pack them in a way that allows the most efficient usage of 2x rate packed math instructions. But this vec2 packing is a simpler problem than the vec4 (or vec4+1) packing of previous generation GPUs. GPU compiler programmers already have experience with stuff like this.
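
To make that concrete, here is a small sketch in HLSL (the texture, sampler and function names are made up for illustration; only min16float itself is the real language feature being discussed):

// Hypothetical pixel-shader snippet: the only change from an fp32 version
// is the variable types. On hardware with packed math (Vega, PS4 Pro) the
// compiler may pack pairs of min16float values into one 32-bit register and
// emit 2x rate instructions; elsewhere they can legally run at fp32 precision.
Texture2D<float4> albedoTex   : register(t0);
SamplerState      linearClamp : register(s0);

min16float3 ShadeFog(float2 uv, min16float fogFactor, min16float3 fogColor)
{
    min16float3 albedo = (min16float3)albedoTex.Sample(linearClamp, uv).rgb;
    return lerp(fogColor, albedo, fogFactor);   // math stays in half precision
}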

(*) Nvidia Volta has new Tensor cores for machine learning. Nvidia doesn't need to keep 2x rate fp16 anymore as a professional feature for machine learning. Tensor cores are better for this task. I would guess that future consumer GPUs simply lack tensor cores (or have them disabled).
 
Intel's Gen 9 architecture supports 2x FP16. And they have the majority of the market. Doesn't that indicate there's a significant base for supporting half float in games?

Intel may have a majority of the market but that doesn't mean gen9 has a majority within that majority.
How would you define significant?
 
Intel Gen9 supports 2x rate fp16. However AAA game devs mostly concentrate on discrete GPUs, since most gamers playing AAA games have a discrete GPU. And non-AAA devs mostly don't care about low level optimizations such as fp16, since only a subset of GPUs and OSs support it. Windows XP and Vista do not support fp16 types in HLSL (if min16float is used on these OSes, the game crashes). For these devs, broad hardware support is more important than getting some extra performance out of the latest GPUs.

Intel's 2x rate fp16 support is mostly designed for mobile workloads. OpenGL ES uses fp16 by default (you need to specially define highp if you need more precision). Intel tried to get a foothold in the mobile market with their CPUs and GPUs, but failed. Soon these features will also be useful in modern PC games. I would also assume that Windows 8 and Windows 10 desktop rendering uses fp16 heavily, as it has been optimized for ultraportables and tablets. fp16 desktop composition saves power, and if Windows 10 uses it, it will also be a good thing for Vega.
 
True, but even with that support, many of those systems aren't up to the task of playing much beyond platformers and some esports if that. The hope would be Ryzen Mobile raising the bar a bit, being more affordable, and pushing Intel to do the same. Get more of those integrated systems closer to midrange, likely through more affordable EDRAM or a single HBM2 stack to get the necessary bandwidth.

They are capable of running them at least on low settings, if not better. FP16 would not only make that better but also save power in mobile.

And there's no evidence of Ryzen mobile doing it significantly better. They would need HBM2 integrated to do so, and such memory is in extremely costly devices (Knights Landing/Tesla/Vega). In that regard Iris parts are just as capable.

Intel may have a majority of the market but that doesn't mean gen9 has a majority within that majority.
How would you define significant?

Skylake has been in the market since late 2015. I would think the volume is quite significant. While in terms of total PCs in the world that number may be a fraction, you are still talking in numbers likely close to 100 million.

I was only replying in the context that integrated AMD parts would help to proliferate the support. It would be way lower in terms of volume.

Intel Gen9 supports 2x rate fp16. However AAA game devs mostly concentrate on discrete GPUs, since most gamers playing AAA games have a discrete GPU.

Thanks for the explanation.
 
And there's no evidence of Ryzen mobile doing it significantly better. They would need HBM2 integrated to do so, and such memory is in extremely costly devices (Knights Landing/Tesla/Vega). In that regard Iris parts are just as capable.
I'm not suggesting it will be better as much as more reasonably priced. Iris Pro, like most of Intel's lineup with only four cores, is relatively expensive, limiting the share of the market that is capable. AMD may have far larger APUs providing performance as well.

I was only replying in the context that integrated AMD parts would help to proliferate the support. It would be way lower in terms of volume.
If AMD produces mid range APUs it could definitely proliferate support as it would displace much of the discrete volume. The question is how high they go. Overtaking 580/1060 may not be unreasonable depending on the designs that show up. A system substituting HBM2 for system memory would have a lot of bandwidth and not be that much more expensive.
 
For the same reason that Fermi GF100 needed a GF110/GF100b revision? This story with the Infinity Fabric is eerily similar to the interconnect problems Fermi allegedly suffered from according to JHH. I wonder if we might still see a fixed Vega with much better performance (by the way where is Charlie saying Vega is broken and unfixable now? :D He has been awfully quiet of late about GPUs, made for nice drama :( ) I believe more in a reborn Vega RV670 style (but still quite a big chip) than in miraculous drivers.

From AMD's slides and the ISA doc's diagrams of the memory system, it's not clear if the Infinity Fabric is near any of the disabled features. The GPU's cache system appears to be its own domain, with the fabric between the L2 and memory controllers. That simplified arrangement, and the fact that the fabric is based on a mature protocol, doesn't seem to leave much room for it to cause problems. As a hardware fabric, it should be mostly invisible to software.

Koduri stated that Vega's fabric is optimized for servers, but I'm not sure what would be limiting it other than perhaps some additional overhead for items like generally unused error correction or expanded addressing. In fact, I'm not sure what "server-optimized" really adds if all the fabric is doing is sitting between memory, GPU, and standard IO.
There's the flash controller and I think the IO for that, though its impact should be modest.
 
From AMD's slides and the ISA doc's diagrams of the memory system, it's not clear if the Infinity Fabric is near any of the disabled features. The GPU's cache system appears to be its own domain, with the fabric between the L2 and memory controllers. That simplified arrangement, and the fact that the fabric is based on a mature protocol, doesn't seem to leave much room for it to cause problems. As a hardware fabric, it should be mostly invisible to software.

Koduri stated that Vega's fabric is optimized for servers, but I'm not sure what would be limiting it other than perhaps some additional overhead for items like generally unused error correction or expanded addressing. In fact, I'm not sure what "server-optimized" really adds if all the fabric is doing is sitting between memory, GPU, and standard IO.
There's the flash controller and I think the IO for that, though its impact should be modest.

My comment about the Infinity Fabric was more related to its power consumption than the disabled features. Although, by association, some features could have been disabled because they added further power consumption (whereas the Infinity Fabric itself is not something you can disable). A bit far-fetched, I know, but Vega suffers from high power consumption just like Fermi did, while introducing some sort of new interconnect, like Fermi did as well.
 
My comment about the Infinity Fabric was more related to its power consumption than the disabled features. Although, by association, some features could have been disabled because they added further power consumption (whereas the Infinity Fabric itself is not something you can disable). A bit far-fetched, I know, but Vega suffers from high power consumption just like Fermi did, while introducing some sort of new interconnect, like Fermi did as well.
It's been some time, so I am not sure which interconnects were cited as being problematic. GPUs have quite a few, with Fermi having at least the intra-SM interconnect, a distribution interconnect for sharing geometry, and the connection to the caches.

Unless Vega's Infinity Fabric is more invasive than described, it's a mesh with some rather predictable directionality for most of its traffic. Fermi's interconnects had various behaviors and potentially higher degrees of connectivity.
For GCN, there was always some kind of link between the L2s and their respective memory controllers. It seems like the fabric slots itself as a midpoint between the L2 and controller (HBCC?) block, with some level of perpendicular traffic related to the relatively modest needs of the miscellaneous sections of the GPU. I presume that's more overhead than the prior bespoke connections, but it seems like it's not uprooting the really complex parts.
 
There are a few reasons why PC games do not already support fp16. The first reason is that Vega was just launched. It is the first discrete GPU with 2x rate fp16. Most recent Intel GPUs also have 2x rate fp16. There was simply no reason to bring fp16 shader code to PC (it adds testing cost).

The second reason is that PS4 Pro was launched last Christmas. It takes time until developers add fp16 support for their shaders. Some AAA devs, such as DICE, have already started doing this, but it takes time to modify large AAA shader code bases to support fp16 on PS4 Pro. Vega is actually very good for PS4 Pro fp16 adoption, since fp16 optimizations now benefit both PS4 Pro and AMD's latest PC GPUs. You can now also test your fp16 code on PC (Vega GPU on a development workstation + PS4 Pro devkit), potentially improving your iteration time. I am sure we will see fp16 more in the future (both PS4 and PC). The fact that Intel GPUs also benefit from fp16 code is another bonus. Everybody knows that Nvidia will eventually follow suit, as they already have 2x rate fp16 on their mobile GPUs and their professional GPUs (*), so putting developer effort into fp16 code will benefit all PC GPUs in the future. Now is the right time to start spending effort on it.

Packed math is managed by the compiler. The compiler handles it similarly to the old vec4/VLIW architectures. You don't need to manually write packed (vec2) code. You simply use the new half float types (min16float in HLSL) instead of the existing float types. The compiler automatically packs two of them into each 32-bit register. The compiler obviously needs to be clever to pack them in a way that allows the most efficient usage of 2x rate packed math instructions. But this vec2 packing is a simpler problem than the vec4 (or vec4+1) packing of previous generation GPUs. GPU compiler programmers already have experience with stuff like this.

(*) Nvidia Volta has new Tensor cores for machine learning. Nvidia doesn't need to keep 2x rate fp16 anymore as a professional feature for machine learning. Tensor cores are better for this task. I would guess that future consumer GPUs simply lack tensor cores (or have them disabled).
But ATI was using fp16 in games back in the day, wasn't it?
 
OpenGL ES uses fp16 by default (you need to specially define highp if you need more precision).

Vertex shaders default to highp precision. Fragment shaders don't have a default precision and you must specify highp/mediump/lowp either via the precision statement (precision highp float) or declare each variable in the shader with the required precision (mediump vec4 sum).
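
As a minimal sketch of what that looks like in an ES 2.0 style fragment shader (the uniform/varying names are made up for illustration): declare a default precision for the whole shader, then override individual variables where fp16 would not be enough.

// Fragment shader (GLSL ES 2.0 style). Fragment shaders have no default
// float precision, so a default must be declared or every float qualified.
precision mediump float;        // fp16-class precision is fine for colour math

uniform sampler2D uAlbedo;
uniform highp vec2 uTexelSize;  // per-variable override where fp16 would lose precision
varying highp vec2 vUV;         // texture coordinates usually want highp

void main()
{
    mediump vec4 sum = texture2D(uAlbedo, vUV)
                     + texture2D(uAlbedo, vUV + uTexelSize);
    gl_FragColor = 0.5 * sum;
}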
 