AMD Vega 10, Vega 11, Vega 12 and Vega 20 Rumors and Discussion

There's not much difference... other than 2x higher bandwidth and 4x higher density per stack?

It's not even in AMD's best interest to nerf this solution with old GPU IP. What does AMD stand to gain by delivering a more mediocre GPU here? More people becoming convinced that Nvidia GPUs are better?

As far as the JEDEC standard is concerned, there isn't so much an HBM/HBM2 split as a more final, more broadly acceptable revision of an earlier niche implementation.
If it were Polaris linked to a Gen 9 or later IGP, there would be a notable discontinuity in DX12 feature-level support. Vega would be the first to reach a similar level of completeness, although it's not clear how often a system would leverage the two units in a way where that would matter.
 
We already know that AMD does do "mix'n'matching" as needed, though, as the PS4 Pro's APU incorporates parts of Vega into what looks like a Polaris backbone.
 

Given its apparent binary compatibility with the Sea Islands-based PS4, the Pro might have had some constraints on what it could adopt from Vega.
If the goal is to match the broad DX12 compliance of Gen 9+, conservative rasterization, rasterizer-ordered views, and FP16 minimum precision show up as the three notable gaps for everything before GCN5. (There's also a virtual addressing difference, though it's unclear how important that would be in this case.)

Conservative raster is presumably built into the fixed-function hardware, and there are indications in the Vega ISA doc that raster-order handling is integrated into the ISA via its Primitive Ordered Pixel Shading hooks. For whatever reason, DX12 doesn't consider Polaris' FP16 treatment equivalent to Gen 9's.

The first two features seem to pull in the guts of Vega, and FP16 may also have some refinement in Vega over prior versions.
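For context on how those three items surface to software, here's a minimal C++ sketch that just queries the standard DX12 capability bits on an already-created ID3D12Device (device setup is omitted, and this says nothing about how any particular GPU implements the features):

```cpp
// Minimal sketch: query the three capability gaps discussed above on an
// already-created ID3D12Device* (device creation itself is omitted).
#include <d3d12.h>
#include <cstdio>

void ReportFeatureGaps(ID3D12Device* device)
{
    D3D12_FEATURE_DATA_D3D12_OPTIONS opts = {};
    if (SUCCEEDED(device->CheckFeatureSupport(
            D3D12_FEATURE_D3D12_OPTIONS, &opts, sizeof(opts))))
    {
        // Conservative rasterization tier (0 = not supported).
        std::printf("Conservative raster tier: %d\n",
                    static_cast<int>(opts.ConservativeRasterizationTier));
        // Rasterizer-ordered views, the DX12 face of raster-order handling.
        std::printf("ROVs supported: %d\n", opts.ROVsSupported);
        // Minimum-precision (FP16) shader support flags.
        std::printf("Min precision support: 0x%x\n",
                    static_cast<unsigned>(opts.MinPrecisionSupport));
    }
}
```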



It's an introduction mostly to the idea of primitive culling for the GPU. Triangles can often be out of view, facing the wrong direction, or too small to be rendered. Cutting those triangles out as soon as possible can help with efficiency and performance, such as by not polluting on-die caches with excess vertex data.
AMD claims that their GPUs can wind up discarding half of the triangles submitted to them, and that the geometry front end could avoid losing a cycle per culled primitive and avoid thrashing its vertex parameter cache.

How much of that half of submitted triangles isn't already compensated for by the hardware, or how much that fraction shows up in overall performance for Vega and pre-Vega GPUs, is not clear.
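As a concrete illustration of the triangle classes being discussed, here's a minimal CPU-side C++ sketch of two of the usual per-triangle tests (clip-space frustum rejection and a signed-area backface test); the struct and function names are illustrative only, not anything from AMD's hardware or drivers. The "too small to hit a sample point" case is sketched further down the thread.

```cpp
struct Vec4 { float x, y, z, w; };   // clip-space vertex position

// Trivially rejected only if all three vertices fall outside the same clip
// plane (e.g. x > w for every vertex); partially visible triangles are kept.
bool OutsideFrustum(const Vec4& a, const Vec4& b, const Vec4& c)
{
    auto allOutside = [&](bool (*pred)(const Vec4&)) {
        return pred(a) && pred(b) && pred(c);
    };
    return allOutside([](const Vec4& v) { return v.x >  v.w; }) ||
           allOutside([](const Vec4& v) { return v.x < -v.w; }) ||
           allOutside([](const Vec4& v) { return v.y >  v.w; }) ||
           allOutside([](const Vec4& v) { return v.y < -v.w; }) ||
           allOutside([](const Vec4& v) { return v.z >  v.w; }) ||
           allOutside([](const Vec4& v) { return v.z < 0.0f; });
}

// Backfacing if the signed area of the projected triangle is non-positive
// (counter-clockwise front faces assumed; perspective divide applied here).
bool Backfacing(const Vec4& a, const Vec4& b, const Vec4& c)
{
    const float x0 = a.x / a.w, y0 = a.y / a.w;
    const float x1 = b.x / b.w, y1 = b.y / b.w;
    const float x2 = c.x / c.w, y2 = c.y / c.w;
    return (x1 - x0) * (y2 - y0) - (x2 - x0) * (y1 - y0) <= 0.0f;
}
```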
 

Though we can take a good guess from Wolfenstein 2's optional (software) GPU culling, which is faster on AMD hardware including Vega but slower on Nvidia, and say "not enough". Vega was/is supposed to have improved culling in the form of "primitive shaders", but the whole thing, while supposedly present on Vega, doesn't work. Whether it could work in the future and there's simply trouble implementing it at the driver level (entirely unsurprising if true), or whether the feature is somehow faulty on Vega, is unknown to anyone outside AMD at the moment.
 
They've provided some (albeit vague) numbers for NGG in the whitepaper. It probably works, at least to some extent.
The feature is called "Primitive Shader", implying that it is a fully programmable, shader-based culling system. It is not fixed-function hardware. If I understood properly, the developer needs to write these primitive shaders to improve the culling rate. So far no graphics API exposes this feature. There's also discussion about the possibility of AMD autogenerating these primitive shaders in future drivers. Without a full spec of the system, I can't really say whether this is possible in the general case, or how hard a problem it is to solve.
GPU culling was originally designed to avoid GCN2 hardware bottlenecks (consoles). GCN3, GCN4 and GCN5 all reduced geometry-related bottlenecks, making techniques like these slightly less useful. There are still lots of GCN1/GCN2 cards around. For example, the R7 360, R9 390 and 390X were GCN2 based. Only the 380, 380X and Fury used GCN3. Also, in the 400 series, everything below the RX 460 is based on GCN2 and GCN1.

GCN's geometry bottleneck is mostly visible when you have high triangle-per-pixel density. Consoles usually render at 900p or 1080p, which is roughly 1.8x-2.6x fewer pixels than 1440p. The GPU executes significantly fewer pixel shader instances per triangle on average at 900p than at 1440p, which results in significantly worse GPU utilization when geometry is the bottleneck. Result = culling gives a significant advantage.
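(Rough arithmetic behind that ratio: 2560x1440 is about 3.69M pixels versus 2.07M at 1920x1080 and 1.44M at 1600x900, i.e. roughly 1.8x and 2.6x fewer pixels respectively. With the same triangle count, the average number of pixel shader invocations per triangle shrinks by about the same factor, so the fixed per-triangle front-end cost becomes a larger share of the frame.)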

GPU culling is still a very good technique, but the biggest impact can be seen on GCN1/GCN2 hardware and/or at lower rendering resolutions. These benchmarks are 1440p on GCN5. Apparently the game doesn't have enough triangle density or depth complexity (occlusion) to see a benefit from GPU culling in this scenario. We don't know exactly what their algorithm is doing. I am assuming it is similar to Frostbite's: https://www.slideshare.net/gwihlidal/optimizing-the-graphics-pipeline-with-compute-gdc-2016. Frostbite's algorithm is designed for GCN2, but it still shows significant gains on GCN3 (Fury X), especially when culling is done using async compute. The culling cost could easily be reduced by removing some culling steps that modern GPUs handle efficiently.
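For reference, the characteristic step from the Frostbite deck linked above is the small-primitive test: snap the triangle's screen-space bounding box and cull it if it cannot cover any sample point. A minimal C++ sketch of just that step (the name and the integer-sample-point convention are simplifications for the example, not Frostbite's actual shader code):

```cpp
#include <algorithm>
#include <cmath>

// Returns true if a triangle's screen-space bounding box cannot contain any
// sample point, assuming (for simplicity) samples at integer coordinates.
// This is the kind of step a compute culling pass could drop on GPUs whose
// fixed-function front end already rejects such triangles efficiently.
bool SmallPrimitiveCull(float x0, float y0, float x1, float y1,
                        float x2, float y2)
{
    const float minX = std::min({x0, x1, x2}), maxX = std::max({x0, x1, x2});
    const float minY = std::min({y0, y1, y2}), maxY = std::max({y0, y1, y2});
    // No integer sample column in [minX, maxX], or no sample row in
    // [minY, maxY] => the triangle cannot hit any sample point.
    return std::ceil(minX) > std::floor(maxX) ||
           std::ceil(minY) > std::floor(maxY);
}
```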
 
Which is part of their next-gen geometry pipeline (NGG).

That should be fairly obvious.

Wasn't that confirmed by @Rys?
These might be obvious things to you, but apparently some people believe that Primitive Shaders are broken. I simply see a lack of software. I will believe shader autogeneration is feasible to implement when I see it.

My personal opinion is that primitive shaders are exactly what we (GPU-driven rendering early adopters) asked for. Just give us an API to write the primitive shaders ourselves. This shader stage is definitely a better solution to a real problem than geometry/hull/domain shaders (which are all struggling to be used by anybody).
 
Apparently the game doesn't have enough triangle density or depth complexity (occlusion) to see a benefit from GPU culling in this scenario.
The game is indeed light on geometry complexity; almost all characters and objects have fairly low polygon counts, and areas are limited in scope as well.
 
That should be fairly obvious.
Not fixed function, but still using some of the same hardware. It was described as being like writing assembly to keep it efficient, so there is likely some setup work required, or AMD would just use compute shaders and call it a day. The 4 shader engines are likely still in play, with some synchronization work required.

Well yeah, but @Rys was pretty concrete about it.
Makes me wonder what caused the delays resulting in the current state of Vega.
I'm not sure that they are delayed so much as that standardizing what could be the start of a next-generation graphics pipeline takes time. The automatic part would be invisible: some driver optimizations may be making use of them already, but there's no way of really knowing short of a driver update significantly improving performance.
 
Yes, but it's been 4 months since the FE launch, along with several new Vega-based products and driver updates.
Even if released, devs would still need to use them. The automatic method may very well be what we've already seen, just with AMD using it for limited internal optimizations.
 
sebbbi,
Do you see the whole concept of primitive shaders more as a way to unchain programmers from a geometry pipeline straitjacket, or as something that has the potential to significantly increase overall rendering performance?
Jawed already argued that it can fix significant bottlenecks, but from your latest comments it seems to me that AMD had already fixed quite a few of those in the fixed-function pipeline?

Or is it a circular thing: the lack of a freely programmable pipeline (and its performance impact in some cases) makes programmers currently avoid techniques because of their potential impact?

While I understand how a bad resource issue can impact performance (as pointed out by Jawed), I don't have a feel for how much this really impacts performance today. Maybe the hopes for primitive shaders as a magic performance solution for Vega in today's (!) games are just not justified.
 
The feature is called "Primitive Shader", implying that it is a fully programmable, shader-based culling system. It is not fixed-function hardware. If I understood properly, the developer needs to write these primitive shaders to improve the culling rate. So far no graphics API exposes this feature. There's also discussion about the possibility of AMD autogenerating these primitive shaders in future drivers. Without a full spec of the system, I can't really say whether this is possible in the general case, or how hard a problem it is to solve.

Earlier disclosures from AMD gave the opposite impression.

For space reasons, I will just try to summarize the following sequence of posts:
https://forum.beyond3d.com/posts/1997692/
Link to tweet indicating primitive shaders are meant for the most part to be automatic. (Later tweet says more control might be considered, but is not promised).

https://forum.beyond3d.com/posts/1997699/
Confirmed that at the time primitive shaders were disabled: the developer API was not ready and automatic generation was inoperative.

https://forum.beyond3d.com/posts/1997709/
At least at that point, it was unclear how to reasonably expose primitive shaders to devs. Apparently it is not straightforward to implement (akin to writing assembly) and difficult to realize gains over the automatic generation (the driver's general level of code generation is supposedly high).


The automatic generation path seems like it could have a conceptual link to Sony's triangle sieve optimization, which was a compilation flag that stripped all but position calculations from a vertex shader, after which frustum/facing/coverage tests could be used to cull. Integrating what was originally a separate invocation into one of the front-end shaders seems like it wouldn't hold any mystery.
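As a rough illustration of that sieve idea (a hypothetical sketch, with names and data layout invented for the example rather than taken from Sony's or AMD's tooling): keep only the position path of the vertex shader, run it over the index buffer, cull, and emit a compacted index list so the full shader only runs for surviving triangles.

```cpp
#include <cstdint>

struct Vec4 { float x, y, z, w; };

// Hypothetical position-only vertex shader: the compiler keeps just the
// model-view-projection transform and strips all attribute work.
// (Assumes tightly packed xyz positions and a row-major 4x4 matrix.)
static Vec4 PositionOnlyVS(const float* p, const float* mvp)
{
    return {
        mvp[0]  * p[0] + mvp[1]  * p[1] + mvp[2]  * p[2] + mvp[3],
        mvp[4]  * p[0] + mvp[5]  * p[1] + mvp[6]  * p[2] + mvp[7],
        mvp[8]  * p[0] + mvp[9]  * p[1] + mvp[10] * p[2] + mvp[11],
        mvp[12] * p[0] + mvp[13] * p[1] + mvp[14] * p[2] + mvp[15],
    };
}

// One representative cull test (backface via signed area); the frustum and
// coverage tests sketched earlier in the thread would slot in the same way.
static bool Culled(const Vec4& a, const Vec4& b, const Vec4& c)
{
    const float x0 = a.x / a.w, y0 = a.y / a.w;
    const float x1 = b.x / b.w, y1 = b.y / b.w;
    const float x2 = c.x / c.w, y2 = c.y / c.w;
    return (x1 - x0) * (y2 - y0) - (x2 - x0) * (y1 - y0) <= 0.0f;
}

// The "sieve" pre-pass: run only the position path over the index buffer and
// emit a compacted index list; the full vertex shader (positions plus
// attributes) then runs only for the surviving triangles.
uint32_t SieveTriangles(const uint32_t* indices, uint32_t indexCount,
                        const float* positions, const float* mvp,
                        uint32_t* surviving)
{
    uint32_t out = 0;
    for (uint32_t i = 0; i + 2 < indexCount; i += 3) {
        const Vec4 a = PositionOnlyVS(positions + 3 * indices[i + 0], mvp);
        const Vec4 b = PositionOnlyVS(positions + 3 * indices[i + 1], mvp);
        const Vec4 c = PositionOnlyVS(positions + 3 * indices[i + 2], mvp);
        if (!Culled(a, b, c)) {
            surviving[out++] = indices[i + 0];
            surviving[out++] = indices[i + 1];
            surviving[out++] = indices[i + 2];
        }
    }
    return out;
}
```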
It is curious that AMD claims the driver can spit out highly optimal primitive shader code, yet keeps not using it.
That this is proving difficult to expose for manual implementation may point to specific glass jaws or internal quirks that could leave dev-written code prone to error, or fragile in the face of changing conditions or devices.

I haven't thought too hard about this, but one area I was curious about was culling triangles that do not hit a sample point. For MSAA, that was one area optimized in Polaris with the primitive discard accelerator. Not knowing how that element was implemented, wouldn't having the GPU's sampling behavior available statically allow for specialized hardware/instructions to improve on code size and per-clock work versus the general ISA?
 
GPU culling is still a very good technique, but the biggest impact can be seen on GCN1/GCN2 hardware and/or at lower rendering resolutions. These benchmarks are 1440p on GCN5. Apparently the game doesn't have enough triangle density or depth complexity (occlusion) to see a benefit from GPU culling in this scenario. We don't know exactly what their algorithm is doing. I am assuming it is similar to Frostbite's: https://www.slideshare.net/gwihlidal/optimizing-the-graphics-pipeline-with-compute-gdc-2016. Frostbite's algorithm is designed for GCN2, but it still shows significant gains on GCN3 (Fury X), especially when culling is done using async compute. The culling cost could easily be reduced by removing some culling steps that modern GPUs handle efficiently.

The 1080p benchmarks bear this out, with an RX 480 gaining nearly ten percent in minimum framerate: https://wccftech.com/wolfenstein-ii-deferred-rendering-and-gpu-culling-performance-impact/

Anyway, AMD promised that the initial implementation of Primitive Shaders wouldn't need to be touched by developers to work, and apparently they're still disabled on current Vega hardware. Raja leaving AMD for Intel seems to support the notion that Vega was an overall failure in achieving its stated goals, and that those responsible may have been let go as a result. Speculation on "inside baseball" perhaps, but Vega certainly didn't do a whole lot over Polaris for most game performance versus clockspeed, and he's the second major engineer on Vega to seek work elsewhere.
 
Raja leaving AMD for Intel seems to support the notion that Vega was an overall failure in achieving its stated goals, and that those responsible may have been let go as a result.
Or frustrated with a lack of resources. That letter did indicate more funding going towards RTG. It wouldn't make much sense to saddle Intel with a flawed design and then leave to allegedly go work on it either.
 
My personal opinion is that primitive shaders are exactly what we (GPU-driven rendering early adopters) asked for. Just give us an API to write the primitive shaders ourselves. This shader stage is definitely a better solution to a real problem than geometry/hull/domain shaders (which are all struggling to be used by anybody).
Would you trust devs such as those behind the Batman games or Project Cars with a specific tool that would make AMD cards run far better than they're supposed to? I can see why AMD is trying to make this automatic in their drivers, because I'm pretty sure the usual suspects would use it in such a way that they'd be able to tank AMD performance once more (unless AMD is able to force it via async compute regardless of what the devs do).
 