AMD Vega 10, Vega 11, Vega 12 and Vega 20 Rumors and Discussion

Discussion in 'Architecture and Products' started by Deleted member 13524, Sep 20, 2016.

  1. CSI PC

    Veteran

    Joined:
    Sep 2, 2015
    Messages:
    2,050
    Likes Received:
    844
Remember I quoted Ryan at PCPer, who stated it can be wrapped into the driver package and that would in theory provide a performance boost; this came in the section where he seems to have been given additional information by AMD.
I appreciate you do not put much weight on what he said due to a different opinion on its context, but he was also absolutely correct when he stated that to fully utilise it, it must be accessed via an API/coding/SDK, and this is backed up in the interview Razor linked, as Scott mentions it there.
The 11 polygons, listening to Scott, is a theoretical maximum when/if using 4 Geometry Engines, because he goes on to say "realistically"; it rather reminds me of Async Compute and the difference between theoretical maximum and real-world results.
He did say "I think that is an example and not a specific product detail about the number of Geometry Engines in a particular chip". That could be construed to mean either that there are multiple designs (one being the traditional 4 Geometry Engines) with the information still to come out, or that all the various chips have more than 4 Geometry Engines.
I think we will see multiple types: a standard version still with 4 Geometry Engines, as the first one is meant/rumoured to be 4096 cores with the same CU count (known from the released FP32/clock info). If you can increase the Geometry Engines (and associated functions), then it makes sense to also increase the CU count and all that entails, since the limitation has been removed.

    Cheers
     
    #861 CSI PC, Jan 18, 2017
    Last edited: Jan 18, 2017
    pharma and Razor1 like this.
  2. CSI PC

    Veteran

    Joined:
    Sep 2, 2015
    Messages:
    2,050
    Likes Received:
    844
Still sounded like Scott to me, just with audio peak-limiter issues *shrug*.
Yeah, makes sense, but for AMD this is a benefit separate from aspects of the Primitive Shader, and draw-stream binning may help anyway to some extent in various games.
Also worth noting Scott clarifies the massive polygon count in the Deus Ex example: per frame it is actually a fraction of that, since the quoted number also includes what is around the visible scene.
    Cheers
     
    #862 CSI PC, Jan 18, 2017
    Last edited: Jan 18, 2017
  3. CarstenS

    Legend Subscriber

    Joined:
    May 31, 2002
    Messages:
    5,800
    Likes Received:
    3,920
    Location:
    Germany
You really meant that as troll bait, didn't you? That article is as vague as can be while adding no significant information beyond what's directly contained in the slides.
     
    Razor1, Lightman and DavidGraham like this.
  4. 3dilettante

    Legend Alpha

    Joined:
    Sep 15, 2003
    Messages:
    8,579
    Likes Received:
    4,799
    Location:
    Well within 3d
    My interpretation of that interview was that the 11 was not being derived in the same way as the Fiji count, with the latter being a literal physical maximum of the hardware. If there's a modifier of "achievable" with some kind of test program, that would leave room for a nicer round number in the hardware, like 12 (or 16 if it's "realistic" and 16-wide SIMD utilization penalties were on the order of 20-30% back in the day of Larrabee).

    That article seems to be using some acronyms and initialisms in ways I haven't seen before. There are some potential tidbits (graphics pipeline has an equal wave dispatch throughput as the HWS/ACEs?, extended BAR and mapped memory accesses), but I don't know how much I can run with them with the surrounding information being rather contrary to my understanding of the terminology, architecture, and the general interpretation of the architecture by all the other articles and interviews I've seen (NCU is an ACE?, Geometry Command Processor?).

One potential item to contemplate is the nature of the allegedly single, big Intelligent Workload Distributor. Its ability to load-balance across primitives and calls has been discussed before, although without more detail on the hardware's mapping I am curious what is being balanced versus being re-ordered.
One possibility: if a geometry processor running a primitive shader (or more traditional geometry-related shaders) finds that execution is starting to move or spawn vertices outside of the local raster tile, an earlier-speculated feedback loop from the geometry engine to the IWD could take the split point and shader information and spawn an instance of it (which would filter out the geometry it doesn't need, concurrently or later), rather than trying to route triangles from one overworked shader engine to the others.
A banked IWD could provide a queue/buffer per tile range, and the outcome of scheduling work across calls or primitives would fall out of it without centralizing decision-making in one monolithic unit, which I would see as a potential scalability barrier.

If the shader engine and raster/RBE allocation are less static, then perhaps a more traditional load-balancing could occur if the heuristics can weigh the cost of that kind of distribution versus soldiering through locally. Compute may be able to benefit if the lanes in the IWD can serve as work queues that can be checked or stolen from.
     
    CarstenS and Lightman like this.
  5. CSI PC

    Veteran

    Joined:
    Sep 2, 2015
    Messages:
    2,050
    Likes Received:
    844
Yes, because of how the operation/function probably works with the Primitive Shader rather than just HW scaling; that is why I brought Async Compute into this as another example, including its theoretical maximum benefit compared to reality, which also depends a lot on the developers and the API.
There is no nice round number relating to the hardware-operation structure that, IMO, logically fits with what you see with Fiji.
From an absolute performance perspective, in theory it would be up to 2.75 times more powerful if everything aligned, with perfect programming/API and a perfect scene, going by the 11-polygon footnote versus a 4 Geometry Engine solution. Yes, it will never happen, just as you will never see the absolute theoretical maximum performance gain from Async Compute.
Even "over 2x" is probably stretching it for many PC games when they talk about the various Vega-related functions *shrug*
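For what it's worth, the arithmetic behind that up-to-2.75x figure is just the ratio of the two per-clock peaks (a quick sketch; the 11 and 4 polygons/clock values are the numbers being discussed above, not anything beyond the footnote):

```python
# Theoretical per-clock geometry scaling from the thread's numbers:
# Fiji's 4 geometry engines at 1 triangle/clock each, versus the
# "up to 11 polygons per clock" Vega footnote.
fiji_polys_per_clock = 4      # 4 geometry engines x 1 polygon/clock
vega_polys_per_clock = 11     # "up to" figure from AMD's footnote

speedup = vega_polys_per_clock / fiji_polys_per_clock
print(f"theoretical per-clock geometry speedup: {speedup:.2f}x")  # 2.75x
```

Which is exactly why "up to" is doing so much work in that footnote: it is a ratio of peaks, not a performance claim.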

Anyway, remember what that footnote relating to the up to 11 polygons also said.
This suggests to me the source goes beyond just the marketing team, especially as it was also included within the preview slides.
    Cheers
     
  6. revan

    Newcomer

    Joined:
    Nov 9, 2007
    Messages:
    55
    Likes Received:
    18
    Location:
    look in the sunrise ..will find me
Well, I know the man has a bad mouth, but I thought he might have some inside info... When information is sparse (and even "sparse" may be an overstatement where Vega is concerned), you have to look for bits of information even in foul places like Fudzilla or SemiAccurate (sometimes water lilies grow out of the mud, as the saying goes). But if nothing of value comes from there, so be it; we can go on with what we have right now... which is not much, truth be told.
     
    CarstenS likes this.
  7. 3dilettante

    Legend Alpha

    Joined:
    Sep 15, 2003
    Messages:
    8,579
    Likes Received:
    4,799
    Location:
    Well within 3d
    Why not add up the throughput of the physical paths in 4 supposedly parallel engines?

    Fiji's number is the theoretical max of its physical hardware. Any discussion of having an ideal scene or running at 11 for Vega should have made Fiji's listing be 2.x-3.x, and AMD has never shied away from giving the physical ALU throughput of the whole GPU when talking about compute. Every architecture falls short of theoretical peak, so that footnote wouldn't be treating the two equivalently unless there's something physically different with one of the pipelines or there's something architecturally in the way of using the full width of 4 independent units.

    The value in an "up to" is generally the peak. It suggests to me that the footnote was mangled, regardless of source.
     
  8. CSI PC

    Veteran

    Joined:
    Sep 2, 2015
    Messages:
    2,050
    Likes Received:
    844
I think a lot of us may not be disagreeing much, just coming from different points of view.
Both are theoretical maxima in the context of polygons/clock; it's just that Vega has greater flexibility that goes beyond HW scaling of the Geometry Engine. You still end up with up to 2.75x if those engineers are comfortable with the maximum potential theory of both.
Remember Async Compute and AMD saying it could improve performance by up to 46% in DX12?
At least that was with a publicly presented shader demo they provided.
Yet in reality the benefit is a fraction of that, when the greatest gain in this regard for Doom under Vulkan (as an example) came from the API extensions rather than Async Compute.
What was the theoretical maximum performance gain of Async Compute on Fiji compared to GCN1, where it is currently disabled, and what is the real benefit?
The point is, I do not think one can fit the Primitive Shader into the same template as Fiji and the traditional use of the 4 geometry engines; only the end results and the theoretical versus real-world numbers can be compared.
For real-world figures (well, as much as a focused engineering demo gives; it would be much higher than games, being closer to the ideal theory), we have not yet been presented with a new shader demo like we saw with Async Compute.
    Cheers
     
    #868 CSI PC, Jan 18, 2017
    Last edited: Jan 18, 2017
  9. 3dilettante

    Legend Alpha

    Joined:
    Sep 15, 2003
    Messages:
    8,579
    Likes Received:
    4,799
    Location:
    Well within 3d
This is rearranging AMD's footnote, which might have the effect of correcting it if it is inaccurate, but that isn't how the footnote is written. The 2.5x was given without the modifier; it was the 11 polygons part that was "up to".
Pairing a statement of a design's physical characteristic on one side with a derived or measured value on the other only has a few consistent interpretations, and they don't really make much sense. The 2.75x factor without caveats actually makes it harder to find a coherent interpretation, since one half of the comparison is an unwavering constant.
    That then leaves the question of what is physically preventing fully consistent handling either on a per-clock basis (units cannot behave the same on every cycle) or on a parallel pipeline basis (one or more units cannot function the same as the others).
    Alternately, and possibly more likely, there isn't a sensible way to interpret combining incompatible measurements.

    That's the difference between a performance claim for some set of software workloads, and a statement of an architectural characteristic. AMD didn't claim that AC made the GPU capable of more FMA operations per clock.

    Even if there is a non-traditional component to the throughput in Vega due to the primitive shader or some other feature, there's still a finite set of inputs or paths in the hardware to process it, and those have a finite amount of output.
    If this cannot be calculated, then it means something else is being misapplied or misworded, like the "polygon" one architecture is processing is not the same as the other.
     
    pTmdfx likes this.
  10. CSI PC

    Veteran

    Joined:
    Sep 2, 2015
    Messages:
    2,050
    Likes Received:
    844
I think we can agree it is a fudge until they present a new engineering shader demo like they did for Async Compute. It still will not necessarily reflect real-world situations/games, but it would be closer to the theoretical maximum with a more valid number; the same can be said for the draw-stream binning rasterizer.
    Cheers
     
    #870 CSI PC, Jan 19, 2017
    Last edited: Jan 19, 2017
  11. sebbbi

    Veteran

    Joined:
    Nov 14, 2007
    Messages:
    2,924
    Likes Received:
    5,296
    Location:
    Helsinki, Finland
Just optimized my console game (GCN2). Got a 15% gain from async compute + barrier refactoring (two weeks' work). Should still see another 10% when everything is fully overlapped. I'd say 25% is easily achievable in modern game code. If the engine architecture was designed around async compute from the start, you could even double your performance in heavily geometry-bound cases (100% geometry pipeline usage during the whole frame; the next frame's lighting and post-processing overlapped via async compute).

    Theoretical maximum gain of async compute... That is a silly question:
    Worst case shader for graphics pipe is a single wave wide fully serial shader (with various stalls). Occupies one wave (out of max 40) of a single CU (of 64). 0.03% GPU wide occupancy. It could theoretically saturate one SIMD (if no memory ops) = 0.3% GPU utilization. If you run any async compute simultaneously, you'll get 100x+ GPU utilization gains. Of course the graphics pipe could be just waiting for fences (waiting to load some huge texture from CPU memory). In this case async compute gives you infinite gains (over zero work done) :)

    Minimum gain is negative. In the worst case async compute can trash caches or increase latency -> stalls. This question is as hard to answer properly as "how much multithreading helps performance?".
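The worst-case numbers above fall out of simple arithmetic (a sketch assuming a Fiji-class GPU: 64 CUs, up to 40 resident waves per CU, 4 SIMDs per CU):

```python
# Worst-case graphics occupancy, per the post above: a single fully
# serial wave resident on a GPU with 64 CUs and 40 wave slots per CU.
cus = 64
waves_per_cu = 40
simds_per_cu = 4

wave_occupancy = 1 / (cus * waves_per_cu)    # fraction of wave slots in use
simd_utilization = 1 / (cus * simds_per_cu)  # one SIMD busy, rest idle

print(f"occupancy: {wave_occupancy:.2%}")    # ~0.04% (quoted above as 0.03%)
print(f"SIMD util: {simd_utilization:.2%}")  # ~0.39% (quoted above as 0.3%)
```

Anything launched on the async queues in that situation is nearly pure gain, which is why the "maximum gain" question has no meaningful single answer.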
     
  12. CSI PC

    Veteran

    Joined:
    Sep 2, 2015
    Messages:
    2,050
    Likes Received:
    844
Well, about as silly as the maximum gain from the Primitive Shader or draw-stream binning rasterizer, which go beyond HW scaling of the architecture due to their function/design, and the figures provided by AMD, including the footnote that has been the topic of the last several pages.
Nice results; you will have managed the best results I have read for Async Compute (considering console and other developer comments historically) if you hit 25%. Gears of War 4 is anywhere from 2% to 8%, Doom under Vulkan I thought was around 8% to 12%, AoTS around 10% at best (depends upon resolution), etc.
But fair to say you optimised for console, while Vega so far is a dGPU, like the presentation for Async Compute with Fiji?
That said, still one of the best figures I have heard mentioned, so nice work if you manage to get to 25%.
And do you agree that the figure can fluctuate notably depending upon settings, resolution, and scene for the same game?

Even putting aside the theory of the Async Compute maximum, AMD showed with their specific Async Compute shader demo on PC with Fiji that they managed a 46% performance gain, fitting my point that the theoretical maximum/engineering demo probably gets closest to the ideal, compared with real-world dGPU PC game gains, including your results.
I will be curious to see whether AMD present any new engineering shader demo for Vega regarding the changes; it is needed, because the figures focused on 'dGPU' Vega are meaningless except as theoretical maximums/peaks.
    Thanks
     
    #872 CSI PC, Jan 19, 2017
    Last edited: Jan 19, 2017
    Razor1 and pharma like this.
  13. sebbbi

    Veteran

    Joined:
    Nov 14, 2007
    Messages:
    2,924
    Likes Received:
    5,296
    Location:
    Helsinki, Finland
    Heinrich4, Lightman, Alexko and 3 others like this.
  14. CSI PC

    Veteran

    Joined:
    Sep 2, 2015
    Messages:
    2,050
    Likes Received:
    844
Thanks, and yeah, that is around some of the figures I have read that other devs managed. Even 30% would still be short of what AMD achieved on PC with the engineering demo specific to Async Compute, but that is to be expected; it will be interesting to see how close well-optimised console game development can get to their figure.
This is not a multi-platform release, and that is an interesting point, as I would need to check the trend: is your game multi-platform, exclusive to one console, on both consoles, or on all platforms?
The trend, I feel, will be the same with the Vega functions and game rendering engines: single-console exclusives with the highest gains, console multi-platform next (specifically Microsoft and Sony), and multi-platform consoles-plus-PC last.
Yeah, nothing earth-shattering there, and as one would expect, but food for thought on how such technology benefits play out when talking broadly about a GPU architecture and its performance increase over previous generations.

    Thanks
     
    #874 CSI PC, Jan 19, 2017
    Last edited: Jan 19, 2017
  15. sebbbi

    Veteran

    Joined:
    Nov 14, 2007
    Messages:
    2,924
    Likes Received:
    5,296
    Location:
    Helsinki, Finland
I am sure you can get higher gains than AMD's presentation showed.

    Simple example:
    - Use 100% (16.6ms) of your main/graphics pipe time to rasterize (shadow maps and g-buffer). Normally you could use only up to 8 ms for these tasks.
    - Use a very lightweight g-buffer technique: https://forum.beyond3d.com/threads/modern-textureless-deferred-rendering-techniques.57611/. These techniques are almost entirely fixed function (geometry & ROP) bound. Very little ALU/BW/sampler usage.
    - Continue with lighting and post processing in async compute pipeline (overlaps with the next frame). Ensure that lighting + post takes roughly 16.6ms (equal time to rasterization tasks).

I would expect to see up to 60-80% perf gains compared to executing both rasterization and lighting/post sequentially. The downside is that the graphics pipe and compute pipe frame lengths need to be roughly equal to get perfect results, meaning you really need to spend that 16.6 ms rasterizing triangles in every scene. This is hard to achieve, as viewport rendering tends to be the most fluctuating part of scene rendering. When you double your geometry budget (8 ms -> 16 ms), you also increase the fluctuation. Hopefully somebody tries techniques like this and presents the results at GDC/SIGGRAPH. It would be interesting to see how big a jump they can make, and what kind of problems they hit.
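The overlap gain here can be sketched as a two-stage pipeline whose throughput is limited by the longer stage (a toy model under that assumption, not actual engine numbers):

```python
# Pipelined-frame sketch: the graphics pipe rasterizes frame N while
# async compute does lighting/post for frame N-1. Frame throughput is
# set by the longer of the two stages, so the gain over running them
# sequentially depends on how well-balanced they are.
def overlap_gain(t_raster_ms, t_compute_ms):
    sequential = t_raster_ms + t_compute_ms      # both stages back-to-back
    overlapped = max(t_raster_ms, t_compute_ms)  # pipeline bottleneck
    return sequential / overlapped - 1

print(f"{overlap_gain(16.6, 16.6):.0%}")  # perfectly balanced: 100%
print(f"{overlap_gain(16.6, 10.0):.0%}")  # imbalanced: 60%
```

The imbalanced case illustrates why fluctuating viewport cost eats into the ideal 2x: any cycle where one pipe finishes early is lost overlap.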
     
    #875 sebbbi, Jan 20, 2017
    Last edited: Jan 20, 2017
    Heinrich4, Razor1, CSI PC and 2 others like this.
  16. Jawed

    Legend

    Joined:
    Oct 2, 2004
    Messages:
    11,716
    Likes Received:
    2,137
    Location:
    London
    http://www.freepatentsonline.com/y2016/0378565.html

    METHOD AND APPARATUS FOR REGULATING PROCESSING CORE LOAD IMBALANCE

    I have no idea if this kind of hybrid of both work-donation and work-stealing has been done before. PS3 game developers appeared to pioneer work-stealing but I don't know if work-donation was a part of those kinds of solutions. I'm not aware of arenas other than game development where work-stealing is commonly used (I presume it is still used) and know nothing about work-donation.

    What's really interesting with some of these diagrams is seeing "work" in L1 and L2 caches.
     
  17. 3dilettante

    Legend Alpha

    Joined:
    Sep 15, 2003
    Messages:
    8,579
    Likes Received:
    4,799
    Location:
    Well within 3d
    Perhaps it's too new to be in Vega? The known introduction of the binning rasterizer was first mooted in 2013, and then continued in August 2016.

There are some elements to this that make it unclear how transparent it is to software, or what workgroups and wavefronts are in this scheme, with wavefronts able to enqueue and dequeue work items after they have been allocated and are executing, and those items able to be pulled into other workgroups or other processors.
It seems there would be challenges if this were implemented in the current way of doing things, where a whole workgroup's contexts are allocated fully in advance, and a different workgroup/wavefront currently wouldn't align itself with a foreign context's particulars or have room to support another context on top of its own.
It seems like there's some kind of indirection in the execution loop (not 1:1 between a workgroup and the code/context it is running?), making thread contexts uniform (no negotiation for LDS, register space, etc. before taking items?) to make that happen.

    The various ways for monitoring and moving work can lean on specialized hardware, with work being able to be donated locally and stolen more globally. Keeping track of the queue pointers could provide a way to know how much work there is, by knowing the current pointer's offset from the queue's base. Fiddling with that value atomically via a specialized path should help maintain integrity, and I think it might allow scoped operations that could be synchronized via CU-level or GDS-level barriers, at a minimum.
    It would seem like some kind of protection would be put in place so that the relevant ranges can be checked by the dedicated hardware without being hit by some external buggy or malicious access.
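As a rough illustration of the hybrid donate/steal queueing being discussed (a toy single-threaded sketch; all names are illustrative rather than from the patent, and real hardware would use atomic head/tail pointer updates on a dedicated path rather than Python data structures):

```python
# Toy model of per-workgroup work queues: an owner "donates" surplus
# items to its own queue and pops from its own end (LIFO, cache-warm),
# while an idle workgroup steals from the opposite end (FIFO) of a
# busier queue. Queue depth (tail offset minus head offset) tells the
# distributor how much work is pending.
from collections import deque

class WorkgroupQueue:
    def __init__(self):
        self.items = deque()

    def donate(self, item):  # local enqueue by the owning workgroup
        self.items.append(item)

    def take(self):          # owner pops newest item from its own end
        return self.items.pop() if self.items else None

    def steal(self):         # thief dequeues oldest item from the far end
        return self.items.popleft() if self.items else None

    def depth(self):         # pending work, i.e. tail - head
        return len(self.items)

q_busy, q_idle = WorkgroupQueue(), WorkgroupQueue()
for wave in range(4):
    q_busy.donate(f"wave{wave}")

# An idle workgroup finds its own queue empty and steals the oldest item.
stolen = q_busy.steal() if q_idle.depth() == 0 else q_idle.take()
print(stolen)  # wave0
```

Taking from opposite ends is the usual work-stealing trick to keep owner and thief from contending on the same pointer; whether the patent's scheme does this is not stated.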
     
  18. Jawed

    Legend

    Joined:
    Oct 2, 2004
    Messages:
    11,716
    Likes Received:
    2,137
    Location:
    London
    In the fixed function view of the graphics pipeline, work is solely divided into wavefronts. Compute is when workgroups become entangled in one of a short list of configurations (e.g. a workgroup of 2 wavefronts though of course a compute wavefront may be defined by a workgroup that is populated by a single wavefront).

    So, graphics wavefronts are always at the most finely-grained level of tasks in such a system.

This document refers solely to work that has not started execution on a processor core. Once work is underway, it is no longer compatible with a work donation queue, and therefore cannot be donated or stolen. The document is simply describing a distributed pool of available work whose tasks (wavefronts or groups of wavefronts) can be migrated around the component parts of the pool. When tasks are initially assembled and pushed into the distributed pool, a "best estimate" can be made about which processor core should localise the available work.

I think it's also worth noting that a processor core in this document is not well defined. E.g. translating this concept into GCN, it can be argued that a set of CUs actually forms a core, since work-queuing and program caching operate at a level above the compute unit. See the 2nd image on this page:

    http://www.guru3d.com/articles-pages/amd-radeon-hd-7970-review,5.html

    where you can see that K$ and I$ are shared by 4 compute units. The important point here is that program code size for individual kernels may prevent the co-existence of distinct kernels within a "processor core" of 3 or 4 compute units. We know that GCN has a general problem with code size and work distribution, because "code length" is a parameter frequently referenced in optimisation guidelines.

So code complexity, especially with competing graphics kernels or with very long kernels whose code needs to be paged into I$, effectively becomes another parameter the work distributor has to handle, implying that it would also need to be a parameter of work-stealing. There's no point trashing I$ (or K$) within a destination processor core by stealing work from another core.

    This seems like a low-quality document to be honest. I couldn't find any aspect of it that solves the supposed problem outlined by:

    In the end I see this as a fully transparent system. A graphics wavefront that hasn't commenced work is defined by quite a small amount of data, e.g. for a wavefront of fragment quads there's the coordinates of each quad and its gradient (usually several quads share gradient information) and the kernel Id (which translates into program Id and the state Id) shared by all the quads. What else?

    For a compute kernel, individual workgroups (and therefore their constituent wavefronts) are defined by "top-left" coordinates and kernel Id.

    Generally, the state associated with work that has not yet been commenced is pretty small and easy to move around the GPU. The most arduous state is DS, since it derives from TS output which is entangled with HS state. NVidia solved this problem ages ago and it's been tedious waiting for AMD to tackle it properly.

    Sadly this document doesn't touch on that subject.

    (One might argue that tessellation is a dying concept in modern rendering - if it isn't already dead.)
     
  19. Anarchist4000

    Veteran

    Joined:
    May 8, 2004
    Messages:
    1,439
    Likes Received:
    359
    Perhaps, although AMD has an affinity towards using FPGAs to manage their queues of late. This may very well be the work distribution mechanism we've seen mentioned.

With a heterogeneous configuration this may make a lot more sense. While not explicitly stated in that patent, a variable-SIMD or heterogeneous design might use a similar technique. For example, a parallel workload moved onto a scalar unit wouldn't easily be transferable until some portion of the work completed. ACEs sort of implement this at a global level, but if work were broken into subroutines within a CU/cluster for more efficient scheduling, a more local mechanism would be warranted. That would account for donation and stealing: throw a wave/workgroup back on the local queue when the width changes.

    I'd hazard a guess this is related to the CWSR (compute wave save restore) they added a year or so ago. If the instructions were transitioned to subroutines, practical if moving a wave onto a CPU/scalar or narrower SIMD, work technically would not have begun execution. Possibly extending the queuing mechanism upwards over a cluster or set of CUs. ACE/HWS-like functionality kicking in at migration, compaction, and synchronization points. Run a parallel code path, then reschedule/migrate within a CU/cluster for a specialized path.

    It also mentions that the L1/L2 cache are shown as on-chip memory, but "appreciated that they may be any suitable memory, whether on-chip or off-chip." That's very similar to what the ACEs were doing when spilling contexts into ram.
     
    #879 Anarchist4000, Jan 22, 2017
    Last edited: Jan 23, 2017
  20. 3dilettante

    Legend Alpha

    Joined:
    Sep 15, 2003
    Messages:
    8,579
    Likes Received:
    4,799
    Location:
    Well within 3d
The current method for allocation is that all wavefronts in a workgroup are initialized on a CU up-front, if only to avoid potential deadlock on synchronization operations.
    However, the proposal seems to weaken the link between a workgroup and the software threads it is running, and later has workgroups or even wavefronts initiating queue/de-queue actions, so execution is at least partially underway or there's some meta-programming capable of asserting a different form of ownership over software threads by other software threads.

    "The work donation process allows a workgroup associated with a particular processing core to donate unprocessed workloads (e.g. a software thread waiting or needing to be executed) to another workgroup associated with the same processing core, such that unprocessed workloads may be transferred from one execution unit to another."

    "Similarly, workgroup queue N 118 stores work queue elements representing unprocessed workloads that may be executed by software threads associated with workgroup N 136."

    "The method begins at block 504, where wavefronts in each workgroup enqueue and dequeue elements from and to each workgroup queue using processing core scope atomic operations. For example, the hybrid donation and stealing control module 140 may allow wavefronts associated with workgroup 1 138 and workgroup N 136 to enqueue and dequeue queue elements to workgroup queue 1 116 and workgroup queue N 118."

The lifetime and scope of the workgroup relative to the instructions it executes seem to be changed. Perhaps this is a case of awkward wording in the document, but it seems to be saying workgroups are resident on a core and have ownership of some kind over their associated queue of pending elements. They can then take on new elements for execution, without specifying that there's uniformity or consistency between them and the arbitrary elements they've just stolen.

    Intra-CU work-item movement is more compatible with the current method, since items like barriers and synchronization are still CU-level. It's still not a full match without further elaboration since the current architecture has effectively kicked everything off inside of a workgroup at that time. The movement between processors is even further out.

    What I find different is that it is saying that the workgroups or their constituent wavefronts are performing the action of donating their items--which hints at something different since the current method will not allow a workgroup or wavefront to be in a state capable of action until all wavefronts are already in a state of fetching instructions into their instruction buffers and executing.

    I think the determination of core may be informed by where the per-wave instruction buffers and program counters are maintained, which AMD's GCN presentation shows are present within a CU. The determination of program flow and the various special instructions can be encapsulated by everything contained in the CU (buffer-consumed instructions, scalar pipe, CU-level barriers), if the instruction buffers can be treated as a small L0 or L1 instruction cache.
    (edit: On reflection, potentially not an L1 cache due to its consumption of instructions, although complicated somewhat by the supposed "prefetch" that might allow reuse. It would function like instruction buffers in CPUs prior to including instruction caches, which in no way invalidated that they were cores.)

    The discussion of wavefronts taking actions and specially tagging instructions means it's not transparent to something.
     
    #880 3dilettante, Jan 23, 2017
    Last edited: Jan 23, 2017