I also think you're right in the sense that some clever configuration of the stages can bring about novel virtual pipelines on GCN.
I think Vega takes the next step towards the goal of a software-defined pipeline, where you don't even have a conventional graphics API, the graphics types of vertex, primitive, or fragment aren't baked in, and inter-stage buffer configuration is determined at run time. I think we both agree this would be neat.
I'm guessing that Vega introduces flexibility in pipeline configuration. The key question is how AMD would implement these "new" shader types on Polaris or earlier iterations of GCN. If they revert to compute shaders, can they wire up the inter-stage buffers and keep the data on die? Can these older GPUs do load-balanced scheduling of arbitrary pipelines? I'm guessing this is the difference with Vega: it has a pipeline abstraction, not just the shader-type abstraction seen in older GCN.
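To make the "pipeline abstraction vs. shader-type abstraction" distinction concrete, here's a purely hypothetical C++ sketch. None of these names (StageDesc, InterStageBuffer, PipelineDesc) correspond to any real AMD or API structure; the point is only that stages become generic kernels and the inter-stage buffers are configured when the pipeline is assembled, rather than being baked into fixed vertex/primitive/fragment slots.

```cpp
// Purely illustrative: a "pipeline as run-time data" description, not any real API.
#include <cstdint>
#include <string>
#include <vector>

struct StageDesc {
    std::string kernel;        // generic compute kernel, no fixed vertex/fragment type
    uint32_t    outputStride;  // bytes produced per work item
    bool        canAmplify;    // stage may emit 0..N outputs per input
};

struct InterStageBuffer {
    uint32_t stride;           // element size agreed between producer and consumer
    uint32_t ringEntries;      // sized to stay resident in on-chip storage
};

struct PipelineDesc {
    std::vector<StageDesc>        stages;  // arbitrary count, not a fixed VS/HS/DS/GS/PS set
    std::vector<InterStageBuffer> links;   // links[i] connects stages[i] -> stages[i+1]
};

// A three-stage "pipeline" assembled at run time; the driver/hardware would be
// free to load-balance these stages across the shader array.
PipelineDesc makeExamplePipeline() {
    return PipelineDesc{
        { {"fetch_and_transform", 32, false},
          {"cull_and_expand",     16, true },
          {"shade",                8, false} },
        { {32, 1024}, {16, 2048} }
    };
}
```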
Keeping data on die and having fine-grained load-balancing amongst the stages of an arbitrary pipeline really will be a revolution. Though some might argue that since we're getting close to the compute density required just to swap to path-traced rendering, we might as well wait a bit longer...
Anyway, Vega seems to show that it's one thing to build the flexible hardware, and quite another to get it working as intended in software.
It was apparent that graphics APIs were going in the wrong direction when a single new DX11 feature required two new hard-coded programmable shader stages (hull and domain shaders) and one new fixed-function stage (tessellator). Apparently nobody learned from the obvious failure of geometry shaders. You can't simply design graphics APIs around random use cases extracted from offline/production rendering.

I have waited for configurable shader stages (with on-chip storage to pass data between them) since DX10. All we have got is some IHV-specific hacks around the common cases that aren't exposed to developers. But I don't have high hopes of getting fully configurable shader stages anytime soon, since the common shading languages (HLSL, GLSL) haven't improved much either. HLSL & GLSL design is still based on the DX9 era: good for 1:1 input:output (pixel and vertex shaders), but painful for anything else.

I think the problem with software-defined pipelines is that on-die buffering will be rapidly exhausted, especially if there's no bound on the number of stages in the pipeline, which suggests that caching is the only way to do arbitrary buffering for these pipelines.
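As an entirely illustrative (non-HLSL) way of seeing the "1:1 is easy, anything else is painful" point above: a vertex/pixel-style stage is a pure one-to-one map, while an amplifying stage has to negotiate output space with its neighbours. The sketch below is plain C++ with a hypothetical atomic append counter standing in for whatever on-chip storage management the hardware or language would otherwise have to provide.

```cpp
#include <atomic>
#include <cstddef>
#include <vector>

struct Item { float x; };

// 1:1 stage: output i depends only on input i. Trivially parallel, no
// coordination needed -- this is the case HLSL/GLSL were designed around.
void mapStage(const std::vector<Item>& in, std::vector<Item>& out) {
    out.resize(in.size());
    for (size_t i = 0; i < in.size(); ++i)
        out[i].x = in[i].x * 2.0f;
}

// Amplifying stage: each input may produce 0..3 outputs. The producer must
// reserve space via an atomic counter, and the consumer cannot assume any
// fixed correspondence between input and output indices.
// Note: 'out' must be preallocated to the worst case (in.size() * 3).
void amplifyStage(const std::vector<Item>& in, std::vector<Item>& out,
                  std::atomic<size_t>& writeCursor) {
    for (const Item& item : in) {
        const size_t count = item.x > 0.0f ? 3 : 0;       // data-dependent fan-out
        const size_t base  = writeCursor.fetch_add(count); // reserve output slots
        for (size_t k = 0; k < count; ++k)
            out[base + k].x = item.x + static_cast<float>(k);
    }
}
```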
Fury X better than Vega64 in some VR titles ... what?

From what I see in that table, only with reduced details (the 2-line entries), except for The Unspoken, where Vega performs so abysmally that it has to be a driver bug.
I think the idea of a dumb driver has merit. There are allusions to multiple kinds of higher-level shading in the whitepaper: primitive, surface, deferred attribute, etc.

There are higher-level concepts, although whether they would be specific forms of complex shaders or overall renderer algorithms that become more practical with them isn't clear to me. The exception would be the surface shader, which AMD gives as a name for a position in the pipeline matching the merged LS-HS stages. Until there's more information, I am wary of reading too much into what may be more aspirational or version-2.0 possibilities. The merged stages are consistent with making the output of the more general Vertex stage and its variants more general, but it looks like stages remain separated at junctions with fixed-function hardware and/or where item counts vary--be it amplification or decimation.
As I described earlier, a non-compute shader looks to the GPU as a set of metadata defining the inputs, the shader itself and metadata for the output.

I think it also looks to the GPU to implicitly maintain semantics, ordering, and tighter consistency than the more programmable shader array does. The graphics domain's tracking of internal pipelines, protection, and higher-level behavior provides a certain amount of overarching state that the programs are often only dimly aware of, like the hierarchical culling and compression methods. It's not entirely unlike some of the meta-state handled for speculative reasons in CPU pipelines (branch prediction, stack management, prefetching, etc.), or their respective forms of context switching and kernel management.
In effect the entire pipeline configuration has, itself, become fully programmable:

It might not quite be there yet, since load-balancing pain points seem to be part of where the old divisions remain, and the VS to PS division persists for some reason despite them being ostensibly similarly programmable.
Hence the talk of "low level assembly" to make it work. That idea reminds me of the hoop-jumping required to use GDS in your own algorithms and to make LDS persist across kernel invocations.

There are implicit elements in the current recommendations, like the amount of ALU work between position and parameter exports, in an attempt to juggle internal cache thrashing against pipeline starvation. It's part of why I'm leery of exposing byzantine internal details to software. Low-level details bleeding into higher levels of abstraction have a tendency to obligate future revisions to honor them, or to force developers to juggle variations. It's an area where having abstractions, or a smarter driver, would stop today's hardware from strangling generation N+1.
Of course Larrabee is still a useful reference point in all this talk of a software-defined GPU - you could write your own pipeline instead of using what came out of the box courtesy of the driver.

It would be interesting to see how some of the initial conditions would be revisited. The default rasterization scheme was willing to accept some amount of the front-end work being serialized to a core, an area where today's GPUs are now more parallel than they were at the time.
Maybe ... or maybe something more focussed on producer-consumer workloads, which isn't a use-case that caches support directly (cache-line locking is one thing needed to make caches support producer-consumer).

Line locking seems to be something that comes up as an elegant solution to specific loads, but much mightier coherent system architectures have taken a look at this and designers consistently shoot it down for behaviors more complex than briefly holding a cache line for atomics and the like. There are implicit costs to impinging on a coherent space that have stymied such measures for longer than GPUs have been a concept.
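For what "caches don't support producer-consumer directly" means in practice, here's a minimal sketch (ordinary C++, not GPU code, and purely illustrative) of a single-producer/single-consumer ring buffer. The data sits in a fixed buffer that could notionally be on-chip, but all of the flow control lives in explicit head/tail indices and acquire/release ordering that the cache itself knows nothing about.

```cpp
#include <atomic>
#include <cstddef>

template <typename T, size_t N>
class SpscRing {
    static_assert((N & (N - 1)) == 0, "N must be a power of two");
    T buf_[N];
    std::atomic<size_t> head_{0};  // written by the producer
    std::atomic<size_t> tail_{0};  // written by the consumer
public:
    bool push(const T& v) {
        const size_t h = head_.load(std::memory_order_relaxed);
        if (h - tail_.load(std::memory_order_acquire) == N) return false; // full: producer stalls
        buf_[h & (N - 1)] = v;
        head_.store(h + 1, std::memory_order_release);
        return true;
    }
    bool pop(T& v) {
        const size_t t = tail_.load(std::memory_order_relaxed);
        if (head_.load(std::memory_order_acquire) == t) return false;     // empty: consumer stalls
        v = buf_[t & (N - 1)];
        tail_.store(t + 1, std::memory_order_release);
        return true;
    }
};
```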
I still think we could have a viable Larrabee-like graphics processor now. The available compute in these latest GPUs is barely growing (50% growth in two years from AMD - that is actually horrifying to me), which I think proves that we're truly in the era of smarter graphics (software-defined on-chip pipelines) rather than the nonsense of wrestling with fixed-function TS, RS and OM (what else?). All the transistors spent on anything but cache and ALUs just seem wasted most of the time in modern rendering algorithms.

Increasing programmability does not lessen the dependence on compute, which apparently has slowed for AMD during its laurel-resting period. It's also potentially not a straightforward comparison between Larrabee's x86 cores and GCN. The memory hierarchies remain quite different, and the threading models differ. GCN's domain-specific hardware still performs functions that the CUs are not tasked with, but that Larrabee's cores can/must handle.
Line locking seems to be something that comes up as an elegant solution to specific loads, but much mightier coherent system architectures have taken a look at this and designers consistently shoot it down for behaviors more complex than briefly holding a cache line for atomics and the like. There are implicit costs to impinging on a coherent space that have stymied such measures for longer than GPUs have been a concept.

Line locking was part of the Larrabee approach, if I remember right. It's worth bearing in mind that spilling off die is basically a good way to lose GPU acceleration entirely.
More advanced synchronization proposals tend towards registering information with some form of arbiter or specialized context like transactions or synchronization monitors.
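As a CPU-side analogy for that "register with a monitor" style, and only as an illustration of the pattern rather than a claim about any particular hardware proposal, the sketch below contrasts with the spin-on-indices approach above: waiters block on a condition variable and are woken by the arbiter holding the lock, instead of repeatedly poking at a shared (or locked) cache line.

```cpp
#include <condition_variable>
#include <mutex>
#include <queue>

template <typename T>
class MonitorQueue {
    std::queue<T> q_;
    std::mutex m_;
    std::condition_variable cv_;
public:
    void push(T v) {
        { std::lock_guard<std::mutex> lk(m_); q_.push(std::move(v)); }
        cv_.notify_one();                          // the monitor wakes one registered waiter
    }
    T pop() {
        std::unique_lock<std::mutex> lk(m_);
        cv_.wait(lk, [&] { return !q_.empty(); }); // block until the condition holds
        T v = std::move(q_.front());
        q_.pop();
        return v;
    }
};
```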
Increasing programmability does not lessen the dependence on compute, which apparently has slowed for AMD during its laurel-resting period.

Yes, for the first time NVidia was ahead.
It's also potentially not a straightforward comparison between Larrabee's x86 cores and GCN. The memory hierarchies remain quite different, and the threading models differ. GCN's domain-specific hardware still performs functions that the CUs are not tasked with, but that Larrabee's cores can/must handle.

Larrabee was approximately within a factor of 2 back when it could have happened.
Without actual implementations, we wouldn't know how many truths of the time would hold up now.

I don't see how any of that stuff would be beyond the wit of Intel.
It's not even necessarily the ALUs and caches being worried about, with the increasing networks of DVFS sensors, data fabrics, interconnects, and all the transistors AMD said it spent on wire delay and clock driving. Sensors, controllers, offload engines, and wires seem to be showing up more.
Line locking was part of the Larrabee approach, if I remember right. It's worth bearing in mind that spilling off die is basically a good way to lose GPU acceleration entirely.

I saw speculation for it, but I did not find a definitive instruction for locking a cache line. I only found certain references like a prefetch instruction that would set a fetched line to exclusive. If that's the equivalent of locking, it would have been a significant naming collision with the usual meaning of the word for cache lines.
Now GPUs are implementing algorithms in hardware (some kind of tile-binned rasterisation, which was explicitly part of what Larrabee did in its rasterisation) and using data-dependent techniques such as delta colour compression because the API is such a lumbering dinosaur and developers still don't have the liberty to do the optimisations that they want to do. Unless they're on console, in which case they have slightly more flexibility.

It does not look like compression is something that exists because of the API. Putting compression functionality in specialized hardware at specific points in the memory hierarchy relieves the executing code of having compression/decompression routines embedded in its most optimized paths. An opportunistic method can reduce complexity and power consumption by only engaging a compressor on less common events like cache-line writeback, which is not something another thread within a core, or a different fully-featured core, can intercept.
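To illustrate the "opportunistic, engaged only at writeback" idea, and only as a toy model since the real delta colour compression formats are proprietary and more sophisticated, here's a small C++ sketch that deltas a block of pixels against an anchor value and simply gives up if the deltas don't fit a narrower encoding, leaving the block uncompressed.

```cpp
#include <cstdint>

struct BlockResult {
    bool     compressed;   // false -> write the raw block, no further work done
    uint32_t anchor;       // first pixel, stored in full
    int16_t  deltas[63];   // remaining pixels as narrow signed deltas
};

BlockResult tryDeltaCompress(const uint32_t (&pixels)[64]) {
    BlockResult r{true, pixels[0], {}};
    for (int i = 1; i < 64; ++i) {
        const int64_t d = static_cast<int64_t>(pixels[i]) - static_cast<int64_t>(r.anchor);
        if (d < INT16_MIN || d > INT16_MAX) {  // delta too large: give up cheaply
            r.compressed = false;
            return r;
        }
        r.deltas[i - 1] = static_cast<int16_t>(d);
    }
    return r;
}
```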
I don't see how any of that stuff would be beyond the wit of Intel.

I would expect Intel could have good implementations of those techniques, since it made note of its enhancements for duty cycling and superior integration with the L3 hierarchy in its consumer chips. I was noting that the old metrics used to judge something like RV770 and Larrabee have been joined by a raft of other considerations in the intervening time period. Xeon Phi doesn't maintain Larrabee's graphics or consumer focus, which makes it an unclear indicator of what Larrabee could have been if it hadn't been frozen in time by its cancellation.
Not strictly a vega review, but nonetheless:
http://www.pcgameshardware.de/Playe...s-Battlegrounds-neue-Benchmarks-Vega-1236260/
Vega seems very strong in PUBG in 1080p compared to Geforce cards!

I know it's difficult at times, but how did you choose the review drivers? I thought support for Playerunknown's Battlegrounds started with WHQL 385.41.