AMD Vega Hardware Reviews

So looking at this graph from the documentation, where is Vega standing right now?

[Image: graph from the documentation]

Is it using the "Vega Native Pipeline"? That does look like 4 geometry engines at ~1.6GHz (i.e. same per-clock performance as Fiji). The tallest bar looks like 17*1.6 = ~27.
To be honest, it doesn't really look like that manual mode will often (if ever) be used, so I'd say the main question right now should be how much performance AMD can extract from the auto mode. Maybe that's something that will start low and then evolve through the months/years.
 
Is it using the "Vega Native Pipeline"? That does look like 4 geometry engines at ~1.6GHz (i.e. same per-clock performance as Fiji). The tallest bar looks like 17*1.6 = ~27.
To be honest, it doesn't really look like that manual mode will often (if ever) be used, so I'd say the main question right now should be how much performance AMD can extract from the auto mode. Maybe that's something that will start low and then evolve through the months/years.
It would be relatively straightforward to write a tool that goes through a vertex shader and extracts only the code related to the SV_Position calculation. You could run this code first to determine whether the primitive is visible or not, and only after that allocate parameter cache storage and execute all the other vertex shader code (fetching the remaining vertex parameters and writing them to the parameter cache). I don't know whether this alone is enough to get a big perf boost, or whether you would want to add some coarse-grained tests first (which would need higher-level knowledge of the geometry data and the vertex shader code). If AMD releases an OpenGL or Vulkan primitive shader extension, it would give us better documentation of how it works exactly, which would help the discussion. Right now we can't do anything other than speculate whether it can be automated or not.
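Roughly what I have in mind, as a CPU-side sketch (all names and the trivial transform/cull here are made up for illustration; this is not how AMD's hardware or driver actually implements it):

Code:
// Hypothetical two-phase "split vertex shader" model. Phase 1 runs only the
// SV_Position-related code and culls; phase 2 (attribute fetch + parameter
// cache writes) runs only for surviving triangles.

struct Float4 { float x, y, z, w; };

// Stand-in for the position part of a vertex shader (here a pass-through).
static Float4 transform_position(const float* pos, unsigned v)
{
    return { pos[3 * v + 0], pos[3 * v + 1], pos[3 * v + 2], 1.0f };
}

// Conservative frustum cull in clip space: reject only if all three vertices
// are outside the same clip plane.
static bool triangle_visible(const Float4& a, const Float4& b, const Float4& c)
{
    const Float4* p[3] = { &a, &b, &c };
    for (int axis = 0; axis < 2; ++axis) {
        for (int s = -1; s <= 1; s += 2) {
            int outside = 0;
            for (int i = 0; i < 3; ++i) {
                float v = (axis == 0) ? p[i]->x : p[i]->y;
                if (s * v > p[i]->w) ++outside;
            }
            if (outside == 3) return false;
        }
    }
    return true;
}

// Stand-in for "all other vertex shader code": fetch the remaining
// attributes and write them to the parameter cache (here just a copy).
static void write_parameters(const float* attr, unsigned v, float* param_cache)
{
    param_cache[2 * v + 0] = attr[2 * v + 0];
    param_cache[2 * v + 1] = attr[2 * v + 1];
}

void shade_triangle(const float* positions, const float* attributes,
                    const unsigned idx[3], float* param_cache)
{
    // Phase 1: position-only code, no attribute fetches yet.
    Float4 p0 = transform_position(positions, idx[0]);
    Float4 p1 = transform_position(positions, idx[1]);
    Float4 p2 = transform_position(positions, idx[2]);

    if (!triangle_visible(p0, p1, p2))
        return;                       // hidden: no parameter cache allocation

    // Phase 2: only visible triangles pay for the rest of the vertex shader.
    for (int i = 0; i < 3; ++i)
        write_parameters(attributes, idx[i], param_cache);
}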
 
Right now we can't do anything other than speculate whether it can be automated or not.

I confess I don't really understand whether your doubt refers to the specific process you described or to the ability to automate it at all, but @Rys has confirmed that it can and will be automated.
 
I confess I don't really understand whether your doubt refers to the specific process you described or to the ability to automate it at all, but @Rys has confirmed that it can and will be automated.


To be honest, there have been some contradictory statements even between RTG members over the last few weeks. I hope Rys is right about this and the "automatic" mode will be efficient enough to show gains in most games.
 
I confess I don't really understand whether your doubt refers to the specific process you described or to the ability to automate it at all, but @Rys has confirmed that it can and will be automated.
I was mostly thinking about doing it in the most optimal way. A developer could, for example, include lower-precision (conservative) vertex data for culling purposes, saving bandwidth for hidden primitives versus loading full-width vertex positions. The developer would also likely want to separate the position data from the rest of the vertex (SoA-style layout) to maximize data cache utilization for primitives that are hidden. You don't want to interleave position data and other vertex data into the same cache lines if the GPU only reads the position data; you want to split the position data out so it packs tightly into cache lines. Things like that. If the driver is automating early vertex transform + culling, it can't do all the optimizations a developer could do, because the driver can't really change your software's data layouts in memory.
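To illustrate the cache-line argument (hypothetical layouts, nothing AMD-specific):

Code:
// Interleaved (AoS): even a culling pass that reads only the position drags
// the normal/UV/tangent of every hidden vertex into the same cache lines.
struct VertexAoS {
    float position[3];
    float normal[3];
    float uv[2];
    float tangent[4];   // 48 bytes per vertex; positions are only 12 of them
};

// Split streams (SoA-style): positions are packed tightly, so the culling /
// position-transform pass touches only position cache lines. The other
// streams are read only for primitives that survive.
struct VertexStreams {
    const float* positions;   // 3 floats per vertex, contiguous
    const float* normals;     // fetched only for visible primitives
    const float* uvs;
    const float* tangents;
};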

But my vertex transform code (both at Ubisoft and at Second Order) tends to split the position data, so at least my code should run very well if AMD's driver splits the vertex shader control flow into an SV_Position part and a part that writes to the parameter cache. I am interested to see more documentation about this feature, and if they automate it, I would like to see developer guidelines for maximizing the effectiveness of their system.

I am especially interested in knowing whether their system could split the SV_Position-related code into a completely separate step that is executed at the tile binning stage. This way you could execute the parameter-cache-writing part only for visible triangles, and you wouldn't need to store the parameters in tile buffers either. This would be a huge win. However, automatic splitting isn't always a win: in the worst case you need to process the position transform code twice (and that might be expensive if you, for example, blend multiple displacement maps in a terrain renderer).
 
I am especially interested in knowing whether their system could split the SV_Position-related code into a completely separate step that is executed at the tile binning stage. This way you could execute the parameter-cache-writing part only for visible triangles, and you wouldn't need to store the parameters in tile buffers either. This would be a huge win.
I am guessing this is what the whitepaper means when it says:

We can envision even more uses for this technology in the future, including deferred vertex attribute computation, multi-view/multi-resolution rendering, depth pre-passes, particle systems, and full-scene graph processing and traversal on the GPU
 
See, this is the correct way to get a sale of both a Vega GPU and a Ryzen CPU.

Now that you've got it, AMD, please stop this Vega 64 and games and monitors and CPUs and mouse pads and anti-mining speech bundle nonsense
 
Cool, but I want to see a sample with a dedicated HBC buffer D:


@sebbbi: did you spot anything interesting in the Vega ISA manual? http://developer.amd.com/wordpress/media/2017/08/Vega_Shader_ISA_28July2017.pdf

EDIT: what about this instruction?

V_SCREEN_PARTITION_4SE_B32

D.u = TABLE[S0.u[7:0]].

TABLE:
0x1, 0x3, 0x7, 0xf, 0x5, 0xf, 0xf, 0xf, 0x7, 0xf, 0xf, 0xf, 0xf, 0xf, 0xf, 0xf,
0xf, 0x2, 0x6, 0xe, 0xf, 0xa, 0xf, 0xf, 0xf, 0xb, 0xf, 0xf, 0xf, 0xf, 0xf, 0xf,
0xd, 0xf, 0x4, 0xc, 0xf, 0xf, 0x5, 0xf, 0xf, 0xf, 0xd, 0xf, 0xf, 0xf, 0xf, 0xf,
0x9, 0xb, 0xf, 0x8, 0xf, 0xf, 0xf, 0xa, 0xf, 0xf, 0xf, 0xe, 0xf, 0xf, 0xf, 0xf,
0xf, 0xf, 0xf, 0xf, 0x4, 0xc, 0xd, 0xf, 0x6, 0xf, 0xf, 0xf, 0xe, 0xf, 0xf, 0xf,
0xf, 0xf, 0xf, 0xf, 0xf, 0x8, 0x9, 0xb, 0xf, 0x9, 0x9, 0xf, 0xf, 0xd, 0xf, 0xf,
0xf, 0xf, 0xf, 0xf, 0x7, 0xf, 0x1, 0x3, 0xf, 0xf, 0x9, 0xf, 0xf, 0xf, 0xb, 0xf,
0xf, 0xf, 0xf, 0xf, 0x6, 0xe, 0xf, 0x2, 0x6, 0xf, 0xf, 0x6, 0xf, 0xf, 0xf, 0x7,
0xb, 0xf, 0xf, 0xf, 0xf, 0xf, 0xf, 0xf, 0x2, 0x3, 0xb, 0xf, 0xa, 0xf, 0xf, 0xf,
0xf, 0x7, 0xf, 0xf, 0xf, 0xf, 0xf, 0xf, 0xf, 0x1, 0x9, 0xd, 0xf, 0x5, 0xf, 0xf,
0xf, 0xf, 0xe, 0xf, 0xf, 0xf, 0xf, 0xf, 0xe, 0xf, 0x8, 0xc, 0xf, 0xf, 0xa, 0xf,
0xf, 0xf, 0xf, 0xd, 0xf, 0xf, 0xf, 0xf, 0x6, 0x7, 0xf, 0x4, 0xf, 0xf, 0xf, 0x5,
0x9, 0xf, 0xf, 0xf, 0xd, 0xf, 0xf, 0xf, 0xf, 0xf, 0xf, 0xf, 0x8, 0xc, 0xe, 0xf,
0xf, 0x6, 0x6, 0xf, 0xf, 0xe, 0xf, 0xf, 0xf, 0xf, 0xf, 0xf, 0xf, 0x4, 0x6, 0x7,
0xf, 0xf, 0x6, 0xf, 0xf, 0xf, 0x7, 0xf, 0xf, 0xf, 0xf, 0xf, 0xb, 0xf, 0x2, 0x3,
0x9, 0xf, 0xf, 0x9, 0xf, 0xf, 0xf, 0xb, 0xf, 0xf, 0xf, 0xf, 0x9, 0xd, 0xf, 0x1

4SE version of LUT instruction for screen partitioning/filtering.

This opcode is intended to accelerate screen partitioning in the 4SE case only. 2SE and 1SE cases use normal ALU instructions.
This opcode returns a 4-bit bitmask indicating which SE backends are covered by a rectangle from (x_min, y_min) to (x_max, y_max).
With 32-pixel tiles the SE for (x, y) is given by { x[5] ^ y[6], y[5] ^ x[6] } . Using this formula we can determine which SEs are covered by a larger rectangle.

The primitive shader must perform the following operation before the opcode is called.

1. Compute the bounding box of the primitive (x_min, y_min) (upper left) and (x_max, y_max) (lower right), in pixels.

2. Check for any extents that do not need to use the opcode:
if ((x_max/32 - x_min/32 >= 3) OR (y_max/32 - y_min/32 >= 3)) (tile size of 32), then all backends are covered.

3. Call the opcode with this 8 bit select: { x_min[6:5],y_min[6:5], x_max[6:5], y_max[6:5] } .

4. The opcode will return a 4 bit mask indicating which backends are covered, where bit 0 indicates SE0 is covered and bit 3 indicates SE3 is covered.
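For what it's worth, the documented sequence is easy to model in software. Here's my own sketch (not AMD's code; it matches the published table for the handful of entries I spot-checked, but treat it as illustrative):

Code:
// Software model of V_SCREEN_PARTITION_4SE_B32 for a pixel bounding box.
// Bit i of the returned mask means shader engine i is covered.
static unsigned se_of_tile(unsigned tx, unsigned ty)
{
    // 32-pixel tiles: tx bit 0 is x[5], tx bit 1 is x[6] (same for y).
    unsigned x5 = tx & 1, x6 = (tx >> 1) & 1;
    unsigned y5 = ty & 1, y6 = (ty >> 1) & 1;
    return ((x5 ^ y6) << 1) | (y5 ^ x6);   // SE index = { x[5]^y[6], y[5]^x[6] }
}

unsigned se_coverage_mask(unsigned x_min, unsigned y_min,
                          unsigned x_max, unsigned y_max)
{
    unsigned tx0 = x_min / 32, tx1 = x_max / 32;
    unsigned ty0 = y_min / 32, ty1 = y_max / 32;

    // Step 2 above: if the tile extent difference is >= 3 in either
    // direction, all backends are covered.
    if (tx1 - tx0 >= 3 || ty1 - ty0 >= 3)
        return 0xF;

    // Otherwise OR together the SE owning each covered 32x32 tile; this is
    // what the hardware LUT encodes from the 8-bit
    // { x_min[6:5], y_min[6:5], x_max[6:5], y_max[6:5] } select.
    unsigned mask = 0;
    for (unsigned ty = ty0; ty <= ty1; ++ty)
        for (unsigned tx = tx0; tx <= tx1; ++tx)
            mask |= 1u << se_of_tile(tx, ty);
    return mask;
}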
 
See, this is the correct way to get a sale of both a Vega GPU and a Ryzen CPU.

Now that you've got it, AMD, please stop this Vega 64 and games and monitors and CPUs and mouse pads and anti-mining speech bundle nonsense
Not for a long time; they're only coming to mobile this year, and to desktop next year, apparently.
 
AMD is still trying to figure out how to expose the feature to developers in a sensible way. Even more so than DX12, I get the impression that it's very guru-y. One of AMD's engineers compared it to doing inline assembly. You have to be able to outsmart the driver (and the driver needs to be taking a less than highly efficient path) to gain anything from manual control.
I'm curious what tools would be made available, or what architectural/ISA features would enable practical adoption.
If it's on the order of that almost vestigial ISA instruction for shader engine tile coverage, then where is the ROI for developers generally?

It wouldn't just be about outsmarting (or "un-dumbing", as the case may be) the driver, but doing so for yesterday's and tomorrow's drivers. Then there's worrying about doing so along the axis of hardware implementations, and again along the axis of the implementation and evolution of the software.

It should be noted that AMD's internal engineering teams, with allegedly years of warning, couldn't get it to work acceptably for one GPU, and as the initial implementers of the tech the number of drivers they need to make smart is one. This is with more intimate knowledge than anyone else will ever have. And its ease of use and robustness is on the order of assembly?

Is AMD hinting at scenarios where the number of GPU implementations is very low and drivers get a lot dumber?
That doesn't seem out of line with their long-term scalability plans and chiplet GPUs, and an earlier interview with Raja Koduri where he discussed more explicit management of multi-GPU.
Possibly a console?

Some of the elements, such as the one Vega ISA instruction that explicitly mentions primitive shaders, show a somewhat kludgy attempt to support the method: it bakes in various architectural assumptions and requires additional work just to use a thing rife with fixed values.
Perhaps an instruction fuzzer like one created for x86 (https://github.com/xoreaxeaxeax/sandsifter) could tease out whether there's more that hasn't surfaced or was lost/broken.
I get that this is a long-term play, but it feels like AMD got caught out on something and is trying to push things ahead in spite of it.


I am especially interested in knowing whether their system could split the SV_Position-related code into a completely separate step that is executed at the tile binning stage. This way you could execute the parameter-cache-writing part only for visible triangles, and you wouldn't need to store the parameters in tile buffers either. This would be a huge win. However, automatic splitting isn't always a win: in the worst case you need to process the position transform code twice (and that might be expensive if you, for example, blend multiple displacement maps in a terrain renderer).
To what level of AMD's DSBR functionality is this taken? Visibility as calculated from the already mentioned frustum and facing tests, or all the way through the deferred HSR step and the sub-pixel-granularity operations after the scan converter?
I think there is some implicit suppression of parameter writes in the export pipeline, per AMD's recommendation that position exports be done in advance of parameter exports. There's some driver discussion weighing exporting positions early enough that the parameter write is quashed (which wouldn't save the math already done) against stalling the pipeline if the parameters take too long and the buffers fill up waiting on the data.

One potential downside to this is the addition of shader invocations inside a unit whose latency-hiding capability is limited, at least as described in patents. It's also described as sitting after primitive setup, which potentially contradicts having code evaluated from the shader as described.
The shader code's assumptions may also be challenged by the batching functionality's dynamic behavior and potential timing and data hazards if this splits across shader engines--which have some pattern of screen-space and by extension bin ownership.
It may avoid saturating the tile's state storage, although the relationship between varying levels of complexity for the rasterizer (fixed batch, dynamically removing culled primitives, etc.) could grow complicated.

Saving parameter writes by serializing across this unit may also lead to variable latency in pixel wavefront launch after the DSBR, if it's not until very near the point of packing and initializing pixel shading wavefronts that the GPU knows which parameter calculations need to be written. Writing blindly and sometimes not using the results is inefficient and inconsistent, but on the other hand it is not an obligatory serial step.

The PS4's triangle sieve method could cull indices that failed the non-pixel visibility checks before they reach the vertex shader. It seems like getting to the level of the DSBR's output would require a similar method, but one that could be applied to a graphics rather than a compute shader. Otherwise, if not going with an extra pass for a visibility buffer, possibly use an ID buffer path like the PS4 Pro's to pass primitive IDs to pixel shaders that have been expanded with the capability to do the work?
 
I think the idea of a dumb driver has merit. There are allusions to multiple kinds of higher-level shading in the whitepaper: primitive, surface, deferred attribute, etc.

As I described earlier, a non-compute shader looks to the GPU like a set of metadata defining the inputs, the shader itself, and metadata for the output. The input metadata specifies data sources and launch setup, and the output metadata describes what to do with the shader result. For conventional shaders, the patterns found in the metadata are well defined: there's a well-defined set of vertex buffer usage models, or the fragment shader has options such as whether it is able to write depth. These patterns are so well defined that they're baked into the hardware as simple options that are off or on, and each set is geared towards one or more shader types.

These new higher-level shaders sound as if they are improvisational. It could be that AMD has generalised the hardware support for non-compute shaders. In a sense it would appear as though non-compute shaders now have fully exposed interfaces for input, launch and output metadata. If true, this would mean that there isn't explicit hardware support for a primitive shader, or a surface shader, etc. Each of these new types of shader is constructed using the low-level hardware parameters.

In effect the entire pipeline configuration has, itself, become fully programmable:
  • Want a buffer? Where? What does it connect? What's the load-balancing algorithm? How elastic do you want it?
  • Want a shader? Which input types and input buffers do you want? Load-balancing algorithm? Which outputs do you want to use?
These concepts are familiar to the graphics API user, as there are many options when defining pipeline state. But this would seem to be about taking the graphics API out of the graphics pipeline! Now it's a pipeline API expressed directly in hardware. Graphics is just one use-case for this pipeline API.
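To make that concrete, here's a purely hypothetical descriptor (none of these types or names exist in any real AMD driver; they only illustrate the idea that the pipeline configuration becomes data):

Code:
enum class LaunchMode { PerVertex, PerPrimitive, PerPixel, PerWorkgroup };
enum class BufferKind { OffChipMemory, OnChipRing, ParameterCache };

// A buffer connecting two stages: where it lives, how big it may grow,
// and which stages produce/consume it (the load-balancing contract).
struct StageBuffer {
    BufferKind kind;
    unsigned   max_bytes;
    unsigned   producer_stage;
    unsigned   consumer_stage;
};

// A shader stage: an ISA blob plus the metadata describing its inputs,
// launch rule and outputs; no baked-in notion of "vertex" or "pixel".
struct StageDesc {
    const void* shader_binary;
    LaunchMode  launch;
    unsigned    input_buffer;    // index into PipelineDesc::buffers
    unsigned    output_buffer;
};

// The whole virtual pipeline is just an array of stages and buffers; a
// "primitive shader" or "surface shader" is one particular configuration.
struct PipelineDesc {
    StageBuffer buffers[8];
    StageDesc   stages[8];
    unsigned    num_buffers;
    unsigned    num_stages;
};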

Hence the talk of "low level assembly" to make it work. That idea reminds me of the hoop-jumping required to use GDS in your own algorithms and to make LDS persist across kernel invocations. I've personally not done this stuff, but this has been part of the programming model for a long long time and so a full featured pipeline API in the hardware would be the next step and, well, not that surprising.

Of course Larrabee is still a useful reference point in all this talk of a software defined GPU - you could write your own pipeline instead of using what came out of the box courtesy of the driver.

To generalise a pipeline like this is going to require lots of on-die memory, in a hierarchy of capacities, latencies and connectivities. Sounds like a cache hierarchy? Maybe ... or maybe something more focussed on producer-consumer workloads, which isn't a use-case that caches support directly (cache-line locking is one thing needed to make caches support producer-consumer). GDS, itself, was created in order to globally manage buffer usage across the GPU, providing atomics to all CUs so that a single kernel invocation can organise itself across workgroups.

So, in this model the driver doesn't know about primitive and surface shaders. It just exposes hardware capabilities. The driver team has to code the definitions of these pipelines and then produce a set of metrics that define how the driver would choose which kind of pipeline to setup. So if the driver detects that the graphics programmer is writing to a shadow buffer (lots of little clues in the graphics pipeline state definition!) it would deploy primitive shader type 5B. The driver doesn't know it's a primitive shader, the hardware doesn't either, it is merely a kind of pipeline that produces the effect that AMD is calling "primitive shader".

This would mean there is no one-size-fits-all primitive shader, or surface shader, etc. Amongst other things it relates to what sebbbi was talking about earlier: simple things like the buffer layout can have a large effect on performance (SoA versus AoS being a simple example of why performance can be horrible). Then there's the algorithm itself.
 
For conventional shaders, the patterns found in the metadata are well defined: there's a well-defined set of vertex buffer usage models ...

GCN doesn't know what a vertex buffer is already. The only fixed-semantic parts of a vertex shader are SV_Position and brethren. You can generate triangles without an index buffer (the draw contains the number of invocations, just like a dispatch); the index buffer only tells connectivity to the subsequent fixed-function units (tessellator and rasterizer) and is indeed hardware-fetched. The so-called vertex buffer doesn't exist as a hardware feature, it's all pull model. Now, as a tribute to reality, you are actually able to query some of those rasterizer parameters in the vertex shader, because otherwise it would suck hard. But it's not a must-have requirement; you can easily go triangle-list-only, bufferless, SV_GroupID/SV_ThreadID-style vertex generation. I believe it's reasonable to suspect that dispatch "thread" IDs and draw "triangle" IDs and "instance" IDs are actually the same thing/register in the hardware.
Which means ... that kind of vertex shader is only a compute shader with some special outputs and some custom outputs. These custom outputs (a.k.a. parameters, a.k.a. attributes) could well be memory (and _is_ memory in some cases for tessellation shaders and geometry shaders right now, and in general when you overcommit parameters).
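In pseudo-C++ it looks roughly like this (nothing here is a real API, it just shows the shape of it):

Code:
// A "vertex shader" that is really just a compute kernel: the only input is
// the flat invocation ID, everything else is pulled from buffers the shader
// itself chooses to read. SV_Position is merely one designated output; the
// rest are "custom" outputs (parameters/attributes) that could live in memory.
struct Float4 { float x, y, z, w; };

struct VertexOutputs {
    Float4 sv_position;   // consumed by the fixed-function rasterizer path
    float  params[8];     // parameter cache or memory, depending on the design
};

VertexOutputs vertex_as_compute(unsigned vertex_id,        // == thread ID
                                const float* positions,    // pull model
                                const float* attributes)
{
    VertexOutputs out;
    out.sv_position = { positions[3 * vertex_id + 0],
                        positions[3 * vertex_id + 1],
                        positions[3 * vertex_id + 2], 1.0f };
    for (int i = 0; i < 8; ++i)
        out.params[i] = attributes[8 * vertex_id + i];
    return out;
}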
Now, this simple thing literally looks like a "primitive shader" doing only SV_Position already. The truth is that "primitive shader" has nothing to do with what the code does; it has only to do with how the code is scheduled. Once you (read: AMD) are brave enough to shove the change in scheduling down the throat of the software layer, you look at the deficiencies and try to get back to the original performance. That means more dedicated parameter caches, for example, so there's no need to go to memory. Maybe enough of your parameters even make it past the hidden-surface removal of the TBR without causing early flush points/tile execution. But in the end this is standard optimization procedure for hardware designers, it's not a feature.
It sucks, really hard, to make something that is not in the slightest represented in the major APIs; in this case I assume it happened not so much because of innovation pressure, but because a low-risk chance opened up: console developers have been pulling AMD's legs since the PS4's release, and they feel confident that there is a net gain even if the PC ignores it entirely. Not the first time. At least they managed to morph the first tessellator iteration into the domain-shader pipeline stage (the hull shader was implicit in that old tessellator of theirs, along with a ton of the metadata).

So, I believed they generalized the graphics-API shader already when they did GCN1. I'm surprised they don't see the pixel as a primitive too and blur the distinction between PS (primitive shader) and PS (pixel shader). Oops ...

This is how I see this in the future:

amplification (ff) -> position calculation (code) -> cull homogeneous (ff) -> attribute source caching (code) -> bin/cull tile (ff) -> attribute calculation (code) -> rasterize/amplification (ff) -> attribute interpolation and new calculation (code) -> sort (ff) -> blend (code) -> dump (ff)

I included a possible programmable blend fixed-function (ff) stage in there. And you see most of the ff is mostly thread invocation or killing + a little bit extra, and most of the code is compute + a little extra. If you are able to actually specify the pipeline yourself you can do whatever you want. Even a screen-size compute shader writing to the ROPs without a rasterizer.

Edit: I feel it, practical decoupled shading will be possible soon, I'm a believer. :D
 
Excellent post, I generally agree.

So, I believed they generalized the graphics-API shader already when they did GCN1. I'm surprised they don't see the pixel as a primitive too and blur the distinction between PS (primitive shader) and PS (pixel shader). Oops ...
I think you're right in the sense that GCN's existing stages (something like: IA, VS, HS, TS, DS, GS, SO, PA, RS, PS, RBE) are "wirable" in a variety of ways to make all the kinds of pipelines that current graphics APIs support. And I think your observations on the console design-pull, the bias in the hardware towards API experimentation that console developers are happy to engage with, are correct. I also think you're right in the sense that some clever configuration of the stages can bring about novel virtual pipelines on GCN.

I think Vega takes the next step towards the goal of a software defined pipeline, where you don't even have a conventional graphics API, the graphics types of vertex, primitive or fragment aren't baked in and inter-stage buffer configuration is determined at run time. I think we both agree this would be neat.

I'm guessing that Vega introduces flexibility in pipeline configuration. The key question here is how AMD would code these "new" shader types on Polaris or earlier iterations of GCN. If they revert to compute shaders, can they wire up the inter-stage buffers and keep the data on die? Can these older GPUs do load-balanced scheduling of arbitrary pipelines? I'm guessing this is the difference with Vega: it has a pipeline abstraction, not just the shader type abstraction seen in older GCN.

This is how I see this in the future:

amplification (ff) -> position calculation (code) -> cull homogeneous (ff) -> attribute source caching (code) -> bin/cull tile (ff) -> attribute calculation (code) -> rasterize/amplification (ff) -> attribute interpolation and new calculation (code) -> sort (ff) -> blend (code) -> dump (ff)

I included a possible programmable blend fixed-function (ff) stage in there. And you see most of the ff is mostly thread invocation or killing + a little bit extra, and most of the code is compute + a little extra. If you are able to actually specify the pipeline yourself you can do whatever you want. Even a screen-size compute shader writing to the ROPs without a rasterizer.
Yes, very much in agreement there.

Edit: I feel it, practical decoupled shading will be possible soon, I'm a believer. :D
Key word: practical.

Keeping data on die and having fine-grained load-balancing amongst the stages of an arbitrary pipeline really will be a revolution. Though some might argue that since we're getting close to the compute density required just to swap to path-traced rendering, we might as well wait a bit longer...

Anyway, Vega appears to be indicative that it's one thing to build the flexible hardware, quite another to get it working as intended with software.
 