AMD Vega Hardware Reviews

I also think you're right in the sense that some clever configuration of the stages can bring about novel virtual pipelines on GCN.

The sad news is that the only way to get this into user-land any time soon appears to be building your own ecosystem like Nvidia did, except Nvidia's motives weren't exactly about pushing the pipeline paradigm, I suppose.

I think Vega takes the next step towards the goal of a software defined pipeline, where you don't even have a conventional graphics API, the graphics types of vertex, primitive or fragment aren't baked in and inter-stage buffer configuration is determined at run time. I think we both agree this would be neat.

Understatement much. :D

I'm guessing that Vega introduces flexibility in pipeline configuration. The key question is how AMD would code these "new" shader types on Polaris or earlier iterations of GCN. If they revert to compute shaders, can they wire up the inter-stage buffers and keep the data on die? Can these older GPUs do load-balanced scheduling of arbitrary pipelines? I'm guessing this is the difference with Vega: it has a pipeline abstraction, not just the shader-type abstraction seen in older GCN.
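To make the "revert to compute shaders" option concrete, here is a minimal sketch with CUDA standing in for a GCN compute shader, since there's no public ISA-level example to point at: two logical stages fused into one kernel, with the inter-stage data staged in on-chip shared memory (the LDS analogue) instead of round-tripping through DRAM. The 192-thread workgroup, the non-indexed vertex layout and the trivial backface test are illustrative assumptions, not AMD's actual path.

```
// Sketch only: a "vertex" stage and a "primitive cull" stage fused into one
// compute kernel. The intermediate screen positions live in shared memory
// (the on-chip LDS analogue), so they never leave the die.
#include <cuda_runtime.h>

__global__ void vs_then_cull(const float4* __restrict__ objPos,   // non-indexed triangle list
                             unsigned*     __restrict__ visibleTris,
                             unsigned*     __restrict__ visibleCount,
                             unsigned      numVerts)
{
    __shared__ float2 screen[192];        // inter-stage buffer; launch with blockDim.x == 192

    unsigned v = blockIdx.x * 192 + threadIdx.x;

    // "Stage 1": one thread per vertex. A real shader would apply the full
    // transform; a bare perspective divide keeps the sketch short.
    if (v < numVerts) {
        float4 p = objPos[v];
        screen[threadIdx.x] = make_float2(p.x / p.w, p.y / p.w);
    }
    __syncthreads();                      // the hand-rolled stage boundary

    // "Stage 2": one thread per triangle (64 per block), reading vertices
    // produced by other threads straight out of the on-chip buffer.
    if (threadIdx.x < 64) {
        unsigned tri = blockIdx.x * 64 + threadIdx.x;
        if (tri * 3 + 2 < numVerts) {
            float2 a = screen[threadIdx.x * 3 + 0];
            float2 b = screen[threadIdx.x * 3 + 1];
            float2 c = screen[threadIdx.x * 3 + 2];
            float area2 = (b.x - a.x) * (c.y - a.y) - (b.y - a.y) * (c.x - a.x);
            if (area2 > 0.0f)             // assume counter-clockwise = front-facing
                visibleTris[atomicAdd(visibleCount, 1u)] = tri;
        }
    }
}
```

This works within one workgroup; the open question above is exactly the part the sketch dodges, namely buffering and load balancing across workgroups, which is where a real pipeline abstraction would have to earn its keep.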

I don't think it's a "possibility" problem; it's a "performance" problem. You compete with your own older architectures and with the competition's architectures. You've fallen into a local minimum and have no way out without making your product less competitive, unless you manage to jump from one local minimum to another that's further away but better.
We programmers are (I hope) well aware of local-minimum problems; they are everywhere, in literally every scope of code and hardware (and life :)), because global minima have a computation problem. There is a huge acceptance problem about this in culture, business and marketing: everything needs to be the best or die, and nothing should ever be allowed to become worse.
Anyway, you can't optimize the architecture for one paradigm and be optimal under the other as well. The tradeoffs made to make GCN1 competitive under e.g. a DX11-style API make GCN1 non-competitive under a "Primitive Shader"-style API. It's the effect of the optimization pass I mentioned before, not because there is a feature inhibitor somewhere in the chip.

Keeping data on die and having fine-grained load-balancing amongst the stages of an arbitrary pipeline really will be a revolution. Though some might argue that since we're getting close to the compute density required just to swap to path-traced rendering, we might as well wait a bit longer...

I wonder if memoryless busses are such a good thing. Imagine the data path had storage capacity, by design, in the specs, in the API ... and you could do something with it, specifically for the FF stuff. Like another tier alongside registers and caches in the traditional memory hierarchy. It would certainly make some things easier and more robust. I'm not a hardware designer; it's possible there is no such sweet spot, and it's futile to think about it. :)

Anyway, Vega appears to be indicative that it's one thing to build the flexible hardware, quite another to get it working as intended with software.

Always. But you don't want the deadlock; you have to break out, one way or the other.
 
The gaming market right now is the consoles. Unfortunately, PC gamers have just received "console ports" over the last 5+ years.

Yes, many game companies have stuck with the PC while developing their engines for consoles. But that is the past: EVERYTHING is unified now and will be even more so under 4K. The Xbox One X is essentially a high-end HTPC for the household, but one built on x86, 64-bit Windows and an AMD APU.


Vega is forward-thinking and built for a new era of gaming compute. Volta is coming and will try to offer all the stuff Vega does, because modern game engines are already on board with AMD. Even Volta's slides seem lacking.
 
It was apparent that graphics APIs were going in the wrong direction when a single new DX11 feature required two new hard-coded programmable shader stages (hull and domain shaders) and one new fixed-function stage (tessellator). Apparently nobody learned from the obvious failure of geometry shaders. You can't simply design graphics APIs around random use cases extracted from offline/production rendering. I have waited for configurable shader stages (with on-chip storage to pass data between them) since DX10. All we have got is some IHV-specific hacks around the common cases that aren't exposed to developers. But I don't have high hopes of getting fully configurable shader stages anytime soon, since the common shading languages (HLSL, GLSL) haven't improved much either. HLSL & GLSL design is still based on the DX9 era. Good for 1:1 input : output (pixel and vertex shaders), but painful for anything else.
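To illustrate that last point with a sketch (CUDA as a stand-in, since the shading languages in question can't express this any more cleanly): even the simplest non-1:1 stage, a keep-or-drop filter producing 0 or 1 outputs per input, already drags in warp ballots, a shared staging slot and a global atomic just to pack its outputs. keepPredicate() and the buffer names are assumptions for the example.

```
#include <cuda_runtime.h>

__device__ bool keepPredicate(float x) { return x > 0.0f; }   // placeholder test

__global__ void compact(const float* __restrict__ in, unsigned n,
                        float* __restrict__ out, unsigned* __restrict__ outCount)
{
    unsigned i = blockIdx.x * blockDim.x + threadIdx.x;
    bool keep  = (i < n) && keepPredicate(in[i]);

    // Warp-level packing: the ballot gives the warp's keep mask, popc of the
    // lower lanes gives each surviving thread its slot within the warp.
    unsigned mask    = __ballot_sync(0xffffffffu, keep);
    unsigned lane    = threadIdx.x & 31u;
    unsigned myOffs  = __popc(mask & ((1u << lane) - 1u));
    unsigned warpTot = __popc(mask);

    // One global atomic per warp reserves a contiguous output range.
    __shared__ unsigned warpBase[32];                  // enough for blockDim.x <= 1024
    if (lane == 0)
        warpBase[threadIdx.x >> 5] = atomicAdd(outCount, warpTot);
    __syncwarp();

    if (keep)
        out[warpBase[threadIdx.x >> 5] + myOffs] = in[i];
}
```

Amplification is worse still, and none of this boilerplate is the algorithm; it's plumbing the language should own.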
 
It was apparent that graphics APIs were going in the wrong direction when a single new DX11 feature required two new hard-coded programmable shader stages (hull and domain shaders) and one new fixed-function stage (tessellator). Apparently nobody learned from the obvious failure of geometry shaders. You can't simply design graphics APIs around random use cases extracted from offline/production rendering. I have waited for configurable shader stages (with on-chip storage to pass data between them) since DX10. All we have got is some IHV-specific hacks around the common cases that aren't exposed to developers. But I don't have high hopes of getting fully configurable shader stages anytime soon, since the common shading languages (HLSL, GLSL) haven't improved much either. HLSL & GLSL design is still based on the DX9 era. Good for 1:1 input : output (pixel and vertex shaders), but painful for anything else.
I think the problem with software-defined pipelines is that on-die buffering will be rapidly exhausted, especially if there's no bound on the number of stages in the pipeline. Which suggests that caching is the only way to do arbitrary buffering for these pipelines.

So then we loop back round to the subject of cache-line locking and other techniques(?) to make a cache behave in a manner that makes software defined pipelines viable.

Buffer usage lies at the heart of load-balancing algorithms, so a configurable pipeline requires tight control over load-balancing as well as buffer apportionment.

As a compromise, one might argue that a mini-pipeline (only two kernels and one intermediate buffer, with the pipeline's results going off-die) could be a useful first API improvement. One could also argue that such a mini-pipeline could produce output conforming to the later stages of the conventional graphics pipeline, in which case it wouldn't even need to send its results off-die. That sounds like a "primitive shader" or "surface shader", doesn't it?
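For what that mini-pipeline might look like to the programmer, here's a hedged sketch with CUDA standing in for a future API: "stage A" amplifies each input into 1..4 intermediate records held in shared memory, a barrier acts as the stage boundary, and "stage B" is then spread across all threads of the workgroup regardless of which thread produced each record, so the intermediate buffer stays on die and the load gets rebalanced at the seam. expand(), shade() and the 4x amplification bound are assumptions of the example, not anything AMD has described.

```
#include <cuda_runtime.h>

#define BLOCK   128
#define MAX_AMP 4                         // assumed worst-case amplification

__device__ unsigned expand(float x, unsigned idx, float* dst) {   // "stage A" body
    unsigned count = (idx % MAX_AMP) + 1u;                         // 1..4 records
    for (unsigned k = 0; k < count; ++k) dst[k] = x + (float)k;
    return count;
}

__device__ float shade(float rec) { return rec * 2.0f; }           // "stage B" body

__global__ void two_stage(const float* __restrict__ in, unsigned n,
                          float* __restrict__ out, unsigned* __restrict__ outCount)
{
    __shared__ float    inter[BLOCK * MAX_AMP];   // the on-die intermediate buffer
    __shared__ unsigned interCount;
    if (threadIdx.x == 0) interCount = 0;
    __syncthreads();

    // Stage A: each thread appends a variable number of records on chip.
    unsigned i = blockIdx.x * BLOCK + threadIdx.x;   // launch with blockDim.x == BLOCK
    if (i < n) {
        float tmp[MAX_AMP];
        unsigned c    = expand(in[i], i, tmp);
        unsigned base = atomicAdd(&interCount, c);
        for (unsigned k = 0; k < c; ++k) inter[base + k] = tmp[k];
    }
    __syncthreads();                                 // the stage boundary

    // Stage B: the whole workgroup redistributes the records evenly,
    // independent of which thread produced them.
    for (unsigned r = threadIdx.x; r < interCount; r += BLOCK)
        out[atomicAdd(outCount, 1u)] = shade(inter[r]);
}
```

The obvious catch, and the point being made here, is that both the buffer and the rebalancing stop at the workgroup boundary; a real API would need the hardware to do this across the whole chip.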

I still think we could have a viable Larrabee like graphics processor now. The available compute in these latest GPUs is barely growing (50% growth in two years from AMD - that is actually horrifying to me) which I think proves that we're truly in the era of smarter graphics (software defined on-chip pipelines) than the nonsense of wrestling with fixed function TS, RS and OM (what else?). All the transistors spent on anything but cache and ALUs just seem wasted most of the time in modern rendering algorithms.
 
LC Vega 64 vs GTX 1080 in VR games

[Chart: unconstrained FPS, RX Vega 64 Liquid vs. GTX 1080 across 10 VR games]

http://www.babeltechreviews.com/rx-vega-64-liquid-10-vr-games-vs-the-gtx-1080-gtx-1080-ti/6/
 
I think the idea of a dumb driver has merit. There are allusions to multiple kinds of higher-level shading in the whitepaper: primitive, surface, deferred attribute, etc.
There are higher-level concepts here, although whether they would be specific forms of complex shaders or overall renderer algorithms that become more practical with them isn't clear to me. The exception would be the surface shader, which AMD gives as a name for a position in the pipeline matching the merged LS-HS stages. Until there's more information, I am wary of reading too much into what may be more aspirational or version-2.0 possibilities. The merged stages are consistent with making the output of the more general vertex stage and its variants more flexible, but it looks like stages remain separated at junctions with fixed-function hardware and/or changes in item count, be it amplification or decimation.

The programmable stages do not appear to be able to fully encapsulate the tessellation hardware or the geometry shader stage without at least some recognition of the discontinuity, although the GS stage seems closer to being generalized, aside from the VS pass-through stage.
The VS to PS transition point is an area of fixed-function hardware and amplification, potentially covering the post-batch path of the DSBR through scan conversion and the wavefront initialization done by the SPI.

A hypothetical splitting of parameter cache writes from the other primitive shader functions, such that only parameters feeding visible pixels are written, would require straddling the VS to DSBR to PS path, which has some interestingly complex behaviors to traverse.


As I described earlier, a non-compute shader looks to the GPU as a set of metadata defining the inputs, the shader itself and metadata for the output.
I think it also relies on the GPU to implicitly maintain semantics, ordering, and tighter consistency than the more programmable shader array provides. The graphics domain's tracking of internal pipelines, protection, and higher-level behavior provides a certain amount of overarching state that the programs are often only dimly aware of, like the hierarchical culling and compression methods. It's not entirely unlike some of the meta-state handled for speculative reasons in CPU pipelines (branch prediction, stack management, prefetching, etc.), or their respective forms of context switching and kernel management.

The idea of having side-band hardware functions is also not unique to the GPU domain and has some ongoing adoption for non-legacy reasons. Compression is making its way into CPUs, where it saves on some critical resource through autonomous, opportunistic meta-work, without injecting that work into the instruction stream that benefits from it.

In effect the entire pipeline configuration has, itself, become fully programmable:
It might not quite be there yet, since load-balancing pain points seem to be part of where the old divisions remain, and the VS to PS division persists for some reason despite them being ostensibly similarly programmable.

Hence the talk of "low level assembly" to make it work. That idea reminds me of the hoop-jumping required to use GDS in your own algorithms and to make LDS persist across kernel invocations.
There are implicit elements in current recommendations, like the amount of ALU work between position and parameter exports, in an attempt to juggle internal cache thrashing against pipeline starvation. It's part of why I'm leery of exposing byzantine internal details to software. Low-level details bleeding into higher levels of abstraction have a tendency to obligate future revisions to honor them, or to force developers to juggle variations. It's an area where having abstractions, or a smarter driver, would stop today's hardware from strangling generation N+1.

Of course Larrabee is still a useful reference point in all this talk of a software defined GPU - you could write your own pipeline instead of using what came out of the box courtesy of the driver.
It would be interesting to see how some of the initial conditions would be revisited. The default rasterization scheme was willing to accept some amount of the front-end work being serialized to a single core, an area where today's GPUs are now more parallel at various points than they were then.

Maybe ... or maybe something more focussed on producer-consumer workloads, which isn't a use-case that caches support directly (cache-line locking is one thing needed to make caches support producer-consumer).
Line locking seems to be something that comes up as an elegant solution to specific loads, but much mightier coherent system architectures have taken a look at this and designers consistently shoot it down for behaviors more complex than briefly holding a cache line for atomics and the like. There are implicit costs to impinging on a coherent space that have stymied such measures for longer than GPUs have been a concept.

More advanced synchronization proposals tend towards registering information with some form of arbiter or specialized context like transactions or synchronization monitors.
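As a down-to-earth sketch of the "register with an arbiter" flavour (CUDA here, purely to illustrate the pattern rather than anything AMD or Intel have described): producers reserve slots in a global queue through a single atomic counter, which plays the arbiter role, and a consumer launch drains the queue afterwards. No cache line is ever pinned. makeWork()/doWork() and the drop-on-overflow policy are assumptions of the example.

```
#include <cuda_runtime.h>

struct WorkQueue {
    float*    items;     // capacity elements, allocated by the host
    unsigned* count;     // the "arbiter": a single atomic append counter
    unsigned  capacity;
};

__device__ float makeWork(unsigned i)                           { return i * 0.25f; }
__device__ void  doWork(float item, float* out, unsigned slot)  { out[slot] = item + 1.0f; }

// Producers reserve a slot via the arbiter; work that overflows is dropped.
__global__ void producer(WorkQueue q, unsigned n)
{
    unsigned i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    float item    = makeWork(i);
    unsigned slot = atomicAdd(q.count, 1u);      // reservation, no line locking
    if (slot < q.capacity)
        q.items[slot] = item;
}

// Launched after the producer grid completes (e.g. next launch in the stream),
// so no spin-waiting or cross-grid ordering tricks are needed.
__global__ void consumer(WorkQueue q, float* __restrict__ out)
{
    unsigned produced = min(*q.count, q.capacity);
    for (unsigned i = blockIdx.x * blockDim.x + threadIdx.x;
         i < produced;
         i += gridDim.x * blockDim.x)
        doWork(q.items[i], out, i);
}
```

Batch-synchronous on purpose; the interesting and hard part of the fine-grained proposals is letting the consumer run concurrently without the spin-wait hazards this sketch sidesteps.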


I still think we could have a viable Larrabee like graphics processor now. The available compute in these latest GPUs is barely growing (50% growth in two years from AMD - that is actually horrifying to me) which I think proves that we're truly in the era of smarter graphics (software defined on-chip pipelines) than the nonsense of wrestling with fixed function TS, RS and OM (what else?). All the transistors spent on anything but cache and ALUs just seem wasted most of the time in modern rendering algorithms.
Increasing programmability does not lessen the dependence on compute, which apparently has slowed for AMD during its laurel-resting period. It's also potentially not a straightforward comparison between Larrabee's x86 cores and GCN. The memory hierarchies remain quite different, and the threading models differ. GCN's domain-specific hardware still performs functions the CUs are not tasked with that Larrabee's cores can/must handle.

Without actual implementations, we wouldn't know how many truths of the time would hold up now.
It's not even necessarily the ALUs and caches being worried about, with the increasing networks of DVFS sensors, data fabrics, interconnects, and all the transistors AMD said it spent on wire delay and clock driving. Sensors, controllers, offload engines, and wires seem to be showing up more.
 
Line locking seems to be something that comes up as an elegant solution to specific loads, but much mightier coherent system architectures have taken a look at this and designers consistently shoot it down for behaviors more complex than briefly holding a cache line for atomics and the like. There are implicit costs to impinging on a coherent space that have stymied such measures for longer than GPUs have been a concept.

More advanced synchronization proposals tend towards registering information with some form of arbiter or specialized context like transactions or synchronization monitors.
Line locking was part of the Larrabee approach, if I remember right. It's worth bearing in mind that spilling off die is basically a good way to lose GPU acceleration entirely.

Increasing programmability does not lessen the dependence on compute, which apparently has slowed for AMD during its laurel-resting period.
Yes, for the first time NVidia was ahead.

It's also potentially not a straightforward comparison between Larrabee's x86 cores and GCN. The memory hierarchies remain quite different, and the threading models differ. GCN's domain-specific hardware still performs functions the CUs are not tasked with that Larrabee's cores can/must handle.
Larrabee was within roughly a factor of 2 back when it could have happened.

Now GPUs are implementing algorithms in hardware (some kind of tile-binned rasterisation, which was explicitly part of what Larrabee did in its rasterisation) and using data-dependent techniques such as delta colour compression because the API is such a lumbering dinosaur and developers still don't have the liberty to do the optimisations that they want to do. Unless they're on console, in which case they have slightly more flexibility.

Without actual implementations, we wouldn't know how many truths of the time would hold up now.
It's not even necessarily the ALUs and caches being worried about, with the increasing networks of DVFS sensors, data fabrics, interconnects, and all the transistors AMD said it spent on wire delay and clock driving. Sensors, controllers, offload engines, and wires seem to be showing up more.
I don't see how any of that stuff would be beyond the wit of Intel.
 
Line locking was part of the Larrabee approach, if I remember right. It's worth bearing in mind that spilling off die is basically a good way to lose GPU acceleration entirely.
I saw speculation for it, but I did not find a definitive instruction for locking a cache line. I only found certain references like a prefetch instruction that would set a fetched line to exclusive. If that's the equivalent of locking, it would have been a significant naming collision with the usual meaning of the word for cache lines.

Now GPUs are implementing algorithms in hardware (some kind of tile-binned rasterisation, which was explicitly part of what Larrabee did in its rasterisation) and using data-dependent techniques such as delta colour compression because the API is such a lumbering dinosaur and developers still don't have the liberty to do the optimisations that they want to do. Unless they're on console, in which case they have slightly more flexibility.
It does not look like compression is something that exists because of the API. Putting compression functionality as specialized hardware at specific points in the memory hierarchy relieves the executing code of needing compression/decompression routines embedded in the most optimized routines. An opportunistic method can reduce complexity and power consumption by not engaging a compressor outside of less common events like cache line writeback, which is not something another thread within a core or a different fully-featured core can intercept.
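As a toy illustration of that point (emphatically not AMD's actual DCC format, just the general shape of the idea): a delta encoder for an 8x8 tile of 32-bit pixels that stores one anchor value plus small signed deltas, and gives up whenever a delta won't fit. Hardware can attempt this opportunistically at cache-line writeback; doing it in software would mean dragging a loop like this into every hot export path.

```
#include <stdint.h>

// Returns the compressed size in bytes (4-byte anchor + 63 one-byte deltas),
// or 0 when any delta overflows a signed byte and the tile must stay raw
// (64 * 4 = 256 bytes). Plain host-side C for clarity; the point is what the
// hardware spares the shader from doing, not a real DCC bitstream.
static int compress_tile_delta8(const uint32_t src[64], uint8_t dst[67])
{
    uint32_t anchor = src[0];
    dst[0] = (uint8_t)(anchor);
    dst[1] = (uint8_t)(anchor >> 8);
    dst[2] = (uint8_t)(anchor >> 16);
    dst[3] = (uint8_t)(anchor >> 24);

    for (int i = 1; i < 64; ++i) {
        int64_t d = (int64_t)src[i] - (int64_t)anchor;
        if (d < -128 || d > 127)
            return 0;                 // not compressible: leave the tile uncompressed
        dst[3 + i] = (uint8_t)(int8_t)d;
    }
    return 4 + 63;
}
```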

I don't see how any of that stuff would be beyond the wit of Intel.
I would expect Intel could have good implementations of those techniques, since it made note of its enhancements for duty cycling and superior integration with the L3 hierarchy in its consumer chips. I was noting that the old metrics used to judge something like RV770 and Larrabee have been joined by a raft of other considerations in the intervening time period. Xeon Phi doesn't maintain Larrabee's graphics or consumer focus, which makes it an unclear indicator of what Larrabee could have been if it hadn't been frozen in time by its cancellation.
 
Come on, give AMD a little time to tweak Vega's drivers; it's a new architecture, even if based on an existing one.
 
What do you think, how much longer do AMD's driver programmers need? Back in December, they had the fallback mode up and running with what appears to be near-final Vulkan performance in Doom.
 
I was not involved in that game test; I only had to hand over the cards.
"However, test measurements with Radeon Software 17.8.2 as well as the GeForce 385.41 drivers showed no performance improvements whatsoever over the earlier drivers - neither in the CPU-limited nor in the GPU-limited case."

This, however, tells you that my colleague cross-checked with newer drivers and there were no performance changes. I guess he simply did not want to lie about the driver version used; maybe that was too honest. :)
 
Not strictly a Vega review, but nonetheless:
http://www.pcgameshardware.de/Playe...s-Battlegrounds-neue-Benchmarks-Vega-1236260/

Vega seems very strong in PUBG at 1080p compared to the GeForce cards!

And a disaster at the resolutions above that... it makes no sense at all; the drivers are probably still quite buggy (either the 1080p results are skewed by incorrect rendering, or the higher resolutions are bugged, or both, although in that case there should be a mention of incorrect rendering in the review).
 