AMD Vega Hardware Reviews

HWS was introduced with Polaris as I recall, but being programmable it was backported to Tonga and Fiji. As for the whitepaper and scheduling, HWS didn't exist at the time of publication; the paper is dated 2011 and, going off the URL, I believe it was released in 2013. Nor was there a readily usable implementation of async compute then; Mantle was still being experimented with at that time. It stands to reason that the scheduling process has been actively evolving over time, which would also explain the programmable hardware in the first place.
I'm pretty sure HWS was introduced before Polaris, but you're right that initially (or at least in the initial slides) Tonga/Fiji didn't have it. I think it was some Hot Chips event (or some other such more professionally oriented event) where they first published Fiji slides with HWS.
 
Polaris also added HWS, so it stands to reason that some conventional wisdom may have changed. As mentioned above, non-dependent cached memory accesses make sense as a prefetch, an OoO-style optimization to prime the caches, followed by some barrier placed close to the expected latency to ensure all waves in a group hit the cache ASAP, or at the very least stall only for a short time.
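To make that concrete, here is a minimal HIP-flavoured C++ kernel sketch of the pattern as I read it; the kernel, its arguments, and the amount of filler work are all made up for illustration, not taken from any AMD code.

```cpp
#include <hip/hip_runtime.h>

// Illustrative only: issue the non-dependent load first, overlap it with
// independent ALU work, then hit a workgroup barrier near the expected
// latency so every wave in the group reaches the (now primed) cache line
// at roughly the same time before consuming it.
__global__ void primed_read(const float* __restrict__ table,
                            const float* __restrict__ input,
                            float* __restrict__ output, int n)   // assumes n > 0
{
    int gid = blockIdx.x * blockDim.x + threadIdx.x;
    int idx = (gid < n) ? gid : n - 1;   // clamp so every lane reaches the barrier

    // 1. Non-dependent load issued early; nothing below needs it yet,
    //    so the wave keeps issuing while the miss is serviced.
    float primed = table[idx];

    // 2. Independent work that overlaps the memory latency.
    float acc = input[idx];
    for (int i = 0; i < 16; ++i)
        acc = acc * 1.0001f + 0.5f;

    // 3. Barrier placed near the expected latency: all waves in the
    //    workgroup line up here before anyone consumes the cached value.
    __syncthreads();

    if (gid < n)
        output[gid] = acc + primed;
}
```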

From scattered forum posts about this, HWS is a multithreaded processor that handles run lists consisting of kernels to be launched, and can dynamically assign queues in memory to the ACEs. The two HWS blocks in the diagrams are actually one block that can run two threads. The HWS does not handle dispatch itself, and so it is one step removed from the dispatch pipeline that tries to arbitrate for wavefronts. This is in turn one or more levels removed from having visibility or communication with the CU's instruction fetch arbitration or execution progress. The CUs can readily function without it, since HWS can be disabled, and there is no obvious channel between the two domains or reason for them to notice. Since HWS is handing things off to ACEs, it doesn't seem to need awareness of the CUs either.
 
While it's not very well documented, my take is that the ACEs handle dispatch and the dependencies of their assigned queues. I thought I read a while ago that there are 8 pointers per ACE and probably some registers for tracking progress, with the HWS instructing the ACEs where to dispatch a wave as needed and load-balancing between them. I took the ACEs and HWS to be the same hardware, also covering other functions for HSA, etc.; they should all have the same exposure, as the HWS used to be ACEs. They seem like PLDs sitting on a data bus, polling register values and making decisions based on whatever program was loaded. Somewhere in there they handle task graphs and dependencies.

ACEs shouldn't be involved in instruction dispatch beyond passing along a pointer to the kernel for fetching. The documentation I recall was old, but they had 8 pointers and would step through lists of kernel pointers potentially numbering in the tens of thousands. Their only concern should be kernel progress, i.e. how many waves have been dispatched, for determining indices.

I could actually see similar hardware for scheduling in each CU, but again, nothing there would ever require public documentation, as even the ACE/HWS runs signed firmware.
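As a rough sketch of the model pieced together in the last two posts, here is how the run-list / queue-slot relationship might look; every structure and field name here is my own guess for illustration, not anything from AMD documentation.

```cpp
#include <array>
#include <cstdint>
#include <deque>
#include <vector>

// Rough model of the description above (all names are hypothetical): the HWS
// walks an oversubscribed run list of queues resident in memory and binds a
// subset of them onto the limited hardware queue slots ("8 pointers per ACE");
// the ACEs then step through the dispatch packets in whichever queues they
// currently hold. The HWS never launches wavefronts itself.
struct QueueDescriptor {
    uint64_t ring_base = 0;            // software queue of kernel dispatch packets
    bool     has_work  = false;
};

struct Ace {
    std::array<QueueDescriptor*, 8> slots{};   // queues currently mapped to this ACE
};

struct Hws {
    std::deque<QueueDescriptor*> run_list;     // may hold far more queues than
                                               // the ACEs have slots for

    // One scheduling pass: bind runnable queues to any free ACE slot,
    // rotating idle queues to the back of the run list.
    void schedule(std::vector<Ace>& aces) {
        for (Ace& ace : aces) {
            for (QueueDescriptor*& slot : ace.slots) {
                if (slot != nullptr || run_list.empty()) continue;
                QueueDescriptor* q = run_list.front();
                run_list.pop_front();
                if (q->has_work) slot = q;      // hand the queue to the ACE
                else             run_list.push_back(q);
            }
        }
    }
};
```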
 
From some descriptions on the Phoronix forums by AMD staff, they are actually not the same hardware. It was AMD's artistic license to draw ACE-labelled squares for the externally visible queues, just as it continued to be artistic license to place two HWS blocks in the diagrams when there is in actuality only one.

The HWS does not have the same dispatch capability as an ACE, and an ACE is one end of a pipeline used to arbitrate, construct, and launch workgroups. My impression is that there is further hardware that does much of the allocation and arbitration work, so what the ACEs are actually aware of is unclear to me.
The HWS hardware is further away from any change in the behavior of wavefront instruction issue and caching than the ACEs are, so it seems even less likely to have changed the realities of CU execution.
 
I'm not disputing that HWS lacks dispatch capability, just that the hardware performing that function is interchangeable. There are four blocks as I understand it, each of which can exist as 1 HWS, 2 ACEs, or something else entirely.

I don't believe either unit has any involvement with the instruction caches beyond a single pointer to the kernel for reference. Instruction fetching is left to the CUs or some other hardware block.

These blocks are very simple as I understand it: a handful of registers and some basic math capability. Simple microcontrollers that may be working with each other, with no direct wiring to the CUs or anything. The CUs, on the other hand, will need some direction on how to communicate with the units, as a firmware update could relocate registers. My thinking is that when a CU needs work it pings an ACE, the HWS evaluates metrics from that CU and the in-progress kernels on the ACE, and then the ACE dispatches a wave based on that result, accounting for priority, age, dependencies, etc. I could be completely wrong on that, but it seems likely they are working together. HWS was added later to extend some capability that didn't exist in early GCN iterations; for that reason it shouldn't be critical, but it should have added better async handling. GCN1 was simple round robin across queues as I understand it. Some of the console guys may have a better grasp on that original behavior, as it should be the XB1/PS4 method.
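For the GCN1 behaviour mentioned at the end, a toy round-robin arbiter might look something like the sketch below; the queue count, structure, and names are purely illustrative assumptions on my part.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>
#include <optional>

// Toy version of the "simple round robin across queues" behaviour attributed
// to early GCN above. The point is only that there is no priority, age, or
// load-balancing input: just the next queue in turn that has a wave ready.
struct Queue {
    uint32_t pending_waves = 0;        // waves still waiting to be launched
};

struct RoundRobinArbiter {
    std::array<Queue, 8> queues{};
    std::size_t cursor = 0;

    // Pick the next queue with work, starting after the last one served,
    // and consume one wave from it.
    std::optional<std::size_t> pick() {
        for (std::size_t i = 0; i < queues.size(); ++i) {
            std::size_t idx = (cursor + i) % queues.size();
            if (queues[idx].pending_waves > 0) {
                --queues[idx].pending_waves;
                cursor = idx + 1;      // resume after this queue next time
                return idx;
            }
        }
        return std::nullopt;           // nothing runnable this pass
    }
};
```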
 
There is an intermediate layer called the SPI that sits between the front-ends (the ACEs and all the graphics front-ends) and the CUs for the purpose of work distribution. It receives dispatches from the front-ends and assigns workgroups to CUs with idle resources.

This is not something that is well documented, but you can find relevant clues in the GPU hardware register docs, the ISA docs, and the open-source drivers. It was also illustrated in the diagrams of the VGLeaks PS4/XB1 GPU architecture leaks.
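A minimal sketch of that distribution step, assuming the resources that gate placement are wave slots, VGPRs, and LDS (the budgets, names, and the linear scan below are illustrative guesses, not the real interface):

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Sketch of the distribution step described above: front-ends hand finished
// workgroup launch requests to a distributor, which places each one on a CU
// that still has the resources the group needs.
struct WorkgroupRequest {
    uint32_t vgprs_per_wave;     // register demand per wavefront
    uint32_t lds_bytes;          // group-local LDS allocation
    uint32_t waves;              // wavefronts in this workgroup
};

struct CuState {
    uint32_t free_vgprs     = 4 * 256 * 64;  // 4 SIMDs x 256 VGPRs x 64 lanes
    uint32_t free_lds       = 64 * 1024;     // 64 KiB of LDS per CU
    uint32_t free_waveslots = 40;            // 10 wave slots per SIMD
};

// Return the index of a CU that can accept the whole workgroup, reserving
// its resources, or -1 if no CU currently has room and the request must wait.
inline int place_workgroup(std::vector<CuState>& cus, const WorkgroupRequest& wg)
{
    for (std::size_t i = 0; i < cus.size(); ++i) {
        CuState& cu = cus[i];
        const uint32_t vgprs_needed = wg.vgprs_per_wave * 64u * wg.waves;
        if (cu.free_vgprs >= vgprs_needed &&
            cu.free_lds >= wg.lds_bytes &&
            cu.free_waveslots >= wg.waves) {
            cu.free_vgprs     -= vgprs_needed;
            cu.free_lds       -= wg.lds_bytes;
            cu.free_waveslots -= wg.waves;
            return static_cast<int>(i);
        }
    }
    return -1;
}
```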
 

Just a minor correction: ACEs spawn workgroups (not dispatches), according to https://www.slideshare.net/DevCentralAMD/gs4152-michael-mantor.

So the assumption is that the ACEs get pinged back only after an outstanding workgroup is completed. Since ACE commands work at the granularity of dispatches (grids/cubes), these ping-backs should not touch the software queues directly, only the ACE's internal grid-to-group dispatcher, which in turn wakes queues as appropriate.
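Something like the following toy bookkeeping is what I imagine that grid-to-group dispatcher doing; the counters and function names are hypothetical, chosen only to illustrate that completion ping-backs update the dispatcher's state rather than the software queue.

```cpp
#include <cstdint>

// Hypothetical bookkeeping for the grid-to-group dispatcher described above:
// the ACE's command is a whole dispatch (grid), workgroups are carved off one
// at a time, and the per-workgroup completion ping-back only updates these
// counters; the software queue is advanced only when the whole grid drains.
struct GridDispatch {
    uint32_t total_workgroups = 0;
    uint32_t launched         = 0;   // workgroups handed to the distribution layer
    uint32_t completed        = 0;   // workgroups that have signalled completion
};

// Carve off the next workgroup index of the grid, if any remain.
inline bool launch_next(GridDispatch& d, uint32_t& wg_index)
{
    if (d.launched == d.total_workgroups) return false;
    wg_index = d.launched++;
    return true;
}

// Completion ping-back: returns true once the whole dispatch has drained,
// which is the point at which the owning queue can be woken / retired.
inline bool workgroup_done(GridDispatch& d)
{
    return ++d.completed == d.total_workgroups;
}
```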
 
https://www.slideshare.net/mobile/D...-gcn-architecture-a-crash-course-by-layla-mah

Slight correction: they "dispatch one wavefront" per cycle, creating workgroups as needed; it's at the bottom of one of the slides from another presentation. So the 4 waves in a workgroup take 4 cycles to schedule. All of this is subject to change, as HWS didn't exist at that point and the ACEs are programmable.
 
The one-wavefront-per-cycle figure, as far as I know, belongs to the publicly invisible SPI, which schedules resources at the granularity of workgroups because of the presence of group-level semantics and resources (LDS & barriers).
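As a trivial worked example of that figure (the wave size is standard GCN; the 256-work-item group size is just an arbitrary example I picked):

```cpp
#include <cstdint>
#include <cstdio>

// Trivial worked example of the one-wavefront-per-cycle figure: a workgroup
// is allocated as a unit (LDS, barriers), but its waves are still issued one
// per cycle, so a 256-work-item group takes 4 cycles to launch.
int main()
{
    const uint32_t wave_size       = 64;   // GCN wavefront width
    const uint32_t workgroup_items = 256;  // e.g. a 16x16 tile
    const uint32_t waves = (workgroup_items + wave_size - 1) / wave_size;

    std::printf("waves per workgroup: %u\n", static_cast<unsigned>(waves));              // 4
    std::printf("cycles to launch at 1 wave/cycle: %u\n", static_cast<unsigned>(waves)); // 4
    return 0;
}
```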
 

Seems kind of dumb to release a review this late when the conclusion includes this statement: "After a year of teasing us with tales of the Vega architecture, it's just days away from releasing three desktop-oriented gaming cards and a driver that'll enable the much-discussed Draw Stream Binning Rasterizer. If the impact of that software falls flat, this could be much ado about very little."

So why not wait until the new driver is out? The review is already late.
 
It contains comparisons to P6000 in professional workloads. This is where Vega FE is supposed to shine.

Also, the review was originally published in German around Vega FE's release; it just took them this long to translate it.
 

I know it's an old review just refreshed in English, but why would they only test a $1,000 GPU against the $5,500+ Quadro version and not the ~$2,000 P5000 or $1,200 Titan XP?

It just seems odd to me to leave that kind of information off the graphs, or at least off the "feature comparison" on page 1. They say "Included in the mix is Nvidia's Quadro P6000, which is around three times more expensive than AMD's Frontier Edition board," which doesn't make any sense, because it's not 3x, it's 5x+.

https://www.bhphotovideo.com/c/product/1319335-REG/hp_z0b12at_nvidia_quadro_p6000.html is the cheapest I can find @ $5,600

There is one cheaper on Amazon, but it's from a 3rd-party seller with a total of 37 reviews in their lifetime... so I'd be hard-pressed to send them $5k. HP sells it in one of their systems for $5,500 as well, so that seems to be the going rate.

Maybe it's just me, but it seems wrong to compare a product against one that costs 5x as much without making it very clear that that is the case. I mean, how often do we compare a 1050 vs. a 1080 Ti to see which delivers better FPS?
 
That it does. This is the only Vega SKU that looks appealing to me, at this point. That is not to say no other SKUs will appeal to anyone else, just that this is the only one I would consider buying. I'm sure it will cost an arm and a leg though.

I think they said that the SSG will cost $7k.
 
In Germany, you can order a P6000 for about 3,200 EUR at amazon.de (sold by them as well). Currently, though, it's out of stock; maybe it was in stock when the review was initially published?
 
Truthfully, if I were filthy rich this would be the card I'd want, or even a couple of them. How cool would it be to virtually end all loading times in games and streaming hiccups in free-roaming games?
Does the SSG storage actually show up as an HDD? At least they have some sort of API for it, currently supported (in beta) only by Adobe Premiere & After Effects.
 