AMD Architecture Discussion

[...], which ensures no overlapping pixels gets packed separately with a different wavefront with a larger ID.
Oops, typo. It should be: which ensures overlapping pixels get packed separately with a different wavefront with a larger ID (that is, in submission order).

I am not quite sure whether Intel GenX works this way, but it was something I read in this paper, which is where the idea was derived from.
 
It does not necessarily work at the granularity of pixels though. At least in theory, the post-rasteriser wavefront packing can be batched and sorted, which ensures no overlapping pixels gets packed separately with a different wavefront with a larger ID.
It might be in theory, but I don't recall it being done this way for GCN. Some of the proposed methods might locally batch or defer, but this provides no global guarantee.

Static ownership. The global 2D screen space is already interleaved in Z order between the rasterisers. So the sub-partitioning ideally would be interleaving in Z order within the local 2D space of the global partitions.
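As an illustration of the Z-order (Morton) interleave being described, here is a minimal sketch in Python. The 16-bit coordinate width and the modulo mapping from curve index to rasteriser are assumptions made for the example, not a description of how GCN actually assigns ownership.

```python
def morton_2d(x: int, y: int) -> int:
    """Interleave the bits of x and y into a Z-order (Morton) index."""
    def spread(v: int) -> int:
        # Spread the low 16 bits of v so there is a zero bit between each bit.
        v &= 0xFFFF
        v = (v | (v << 8)) & 0x00FF00FF
        v = (v | (v << 4)) & 0x0F0F0F0F
        v = (v | (v << 2)) & 0x33333333
        v = (v | (v << 1)) & 0x55555555
        return v
    return spread(x) | (spread(y) << 1)

def owning_rasterizer(tile_x: int, tile_y: int, num_rasterizers: int = 4) -> int:
    # Hypothetical static ownership: consecutive Z-order tiles rotate across
    # rasterizers, giving each one an interleaved footprint of the screen.
    return morton_2d(tile_x, tile_y) % num_rasterizers

# Example: which rasterizer would own each tile of a 4x4 grid.
for ty in range(4):
    print([owning_rasterizer(tx, ty) for tx in range(4)])
```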

Did the Realworldtech tests of deferred tiling show a Z-order handling of screen space? The testing showed a more linear scan. The tiles are configured to match the stride length already, and the ROPs are set up to match up the physical channels. Is there a reference to the space filling curve being used for ROPs?

Yes, but then at least for alpha blending, AFAIU the pixel export process would have to reorder out-of-order exports back into the submission order for stable and consistent results.
I'm assuming you mean with the proposed new scheme?
The current scheme is that no such reordering is done and the results can be unstable and inconsistent.
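To make the "reorder back into submission order" idea concrete, below is a toy sketch of a reorder buffer sitting in front of the blender. The per-export submission ID and the buffer itself are hypothetical, used only to illustrate the proposed scheme, not features of the current export path.

```python
import heapq

class ExportReorderBuffer:
    """Releases exports to the blender strictly in submission-ID order,
    buffering anything that arrives early. Purely illustrative."""
    def __init__(self):
        self.next_id = 0   # next submission ID the blender may consume
        self.pending = []  # min-heap of (submission_id, payload)

    def push(self, submission_id: int, payload) -> list:
        heapq.heappush(self.pending, (submission_id, payload))
        released = []
        # Drain everything that is now contiguous with the release pointer.
        while self.pending and self.pending[0][0] == self.next_id:
            released.append(heapq.heappop(self.pending)[1])
            self.next_id += 1
        return released

rob = ExportReorderBuffer()
print(rob.push(1, "frag B"))   # [] -- must wait for submission ID 0
print(rob.push(0, "frag A"))   # ['frag A', 'frag B'], back in submission order
```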
 
6. Since there is already ordered sync at CU group level, perhaps a workgroup (that needs only unordered sync) can also be allowed to span over CUs within the same group. That would require the LDSes moving a bit further from the CUs though, and not as pipelined as it was.
Currently they have LDS and GDS, which are roughly equivalent hardware, but with GDS having additional sync capabilities. The LDS moving away I don't see as a problem. I've been envisioning them more as the memory controllers, part of the export pipeline when joined with other CUs, and synchronization mechanisms. A significant amount of their workload moved to that proposed scalar. Having a full crossbar it would be reasonably efficient at repacking data and some of the interpolation tasks. It should work well for de/compression as well without blocking SIMD execution by my thinking.

Along the lines of the current scalar unit being more flexible, it could even do some of the cache work if a thread was designed around alternating scalar and vector pathways. SIMDs for heavy lifting and scalar doing control flow and data sharing. An extension of the current capabilities with more robust hardware and SGPRs moved into VGPRs. Along the lines of pushing a workgroup onto a SIMD. Certainly not required, but an option to explore. Avoids register limitations, but some vector space is likely wasted.

So in the end, say with 64CU Greenland, it may have 16 CU groups, each of which has 4 CUs, 4 Color ROPs, 16 Z/Stencil ROPs, 128KB L2 Cache and 256KB shared LDS.
In theory it also has 16 memory channels which might line up well with that 16 CU group organization. Could be a baseline to extrapolate the organization as only the HBM designs had that many channels.
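As a quick sanity check of the arithmetic implied by that grouping (the per-group figures are the speculation quoted above, not a known Greenland configuration):

```python
# Chip-level totals implied by the proposed 16-group organization.
groups = 16
per_group = {
    "CUs": 4,
    "color_ROPs": 4,
    "z_stencil_ROPs": 16,
    "L2_KB": 128,
    "LDS_KB": 256,
    "memory_channels": 1,  # one HBM channel per group, as suggested above
}
totals = {k: v * groups for k, v in per_group.items()}
print(totals)
# {'CUs': 64, 'color_ROPs': 64, 'z_stencil_ROPs': 256,
#  'L2_KB': 2048, 'LDS_KB': 4096, 'memory_channels': 16}
```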

Just to point out for future clarification, coherence wouldn't change the problem that OIT solutions seek to solve: wavefronts reaching their export phase for transparent data at varying times based on the dynamic behavior of the CUs involved, leading to inconsistent final output. Having multiple threads spread across different cores in a coherent multicore CPU would have similar problems in the absence of any additional synchronization.
Raster-order views inject a stall if it is determined that a pixel in a ROP tile is currently being worked on by a primitive earlier in submission order.
Not directly, but I'm imagining a significant rework of the export pipeline along with ROP and LDS/GDS behavior with the mesh. At least with Navi, although I guess Vega could have it. OIT would be relevant when all the transparent samples got read back in and processed. This would be up to the programmer if they desired unlimited samples or some sort of compression like PixelSync. My theory might consider breaking the submission order in favor of binning and synchronization. ROVs with opaque samples, while not ideal from a performance standpoint, wouldn't necessarily care about the submission order. It should address certain scaling concerns, although you may need to be careful reading back samples. Tiled rasterization, where all geometry is known along with unknown samples, could help here.

I'm having some difficulty parsing this, and what parts are different or the same as before.
The programmable CUs are the first stage already, if the definition of the second stage for the ROPs is basic blending and data manipulation.
Having the working cache compressed for the second stage would be different, but it is a change that would likely make parallel work significantly more painful to achieve.
I don't follow how QR is supposed to be conditional on the CUs doing something strange. It's either being used or it isn't. It's a little late to use it if it was not used and then we realize something strange happened. That would sound like some kind of very optimistic speculation with an intractable fallback case.
I'm still working this idea out myself. My definition of the second stage was more along the lines of aggregating results at a macro level. One ROP per memory channel arbitrating exports from each CU or cluster. It might be more than just a ROP along the lines of GDS/LDS with the heavy lifting having occurred in the first stage. As I mentioned before, I was expecting some of these ideas to be related to Navi and the "scalability" on the roadmap.

The QR wouldn't be conditional, but an option for efficiently coalescing results within multiple CUs prior to export. Possibly part of the solution for that first stage. Partitioning cache resources, possibly in different locations, to a task allocated to a group of CUs. It would make more sense for a multi-chip solution, and with the paper presented would seemingly be a better match to that timeframe. It's definitely optimistic speculation as I wasn't expecting some of this to occur with Vega. It doesn't seem out of the realm of possibility though, just need to rethink a lot of conventional wisdom on past designs.

As noted in that paper, ~2048 bits of interface per core was prohibitive in terms of die area lost to pad space. (The paper's projections are actually optimistic for what we have currently for bump pitch.)
In that regard, could you clarify what you mean by having most of the interface disabled? These are physical objects with non-zero area that does not become available if they are turned off.
The physical implementation of the network spans both the multi-core processor die as well as the silicon interposer, with shorter core-to-core links routed across the multi-core processor die and the longer-distance indirect network links routed across the interposer.

Selective concentration is employed to limit the area overheads of vertical connections (i.e., micro-bumps) between the multi-core processor and interposer layers.
Not all of that 2048 needs pads though. The vast majority would likely exist within the die like most FPGAs as that is where most nodes on the mesh should reside. Selective grouping would reduce the number further. The portion through the interposer would be reserved for multi-chip solutions, including the Zen APU. The interconnect I'm envisioning would be a full mesh, say 17 nodes all linked together through variably sized channels. All global channels would total 2048 bits (probably with another channel or two for IO and communication). Local channels would likely increase that number within a fixed organization. For a point-to-point route all but one link within a channel would be disabled. All links would be enabled if broadcasting or requiring a shared bus for whatever reason. Four nodes, for example, might share a channel for some purpose. The paper alluded to this as "indirect links on the interposer".
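A toy model of that channel gating, assuming per-link enables on a 17-node full mesh; the node count comes from the description above, everything else is illustrative.

```python
class MeshChannel:
    """One node's outbound channel in the speculated full mesh: physical links
    to every other node with per-link enables, i.e. the subtractive routing
    described above. Entirely illustrative."""
    def __init__(self, src: int, num_nodes: int = 17):
        self.src = src
        self.enabled = {dst: False for dst in range(num_nodes) if dst != src}

    def point_to_point(self, dst: int):
        # Gate off every link except the one to dst.
        for d in self.enabled:
            self.enabled[d] = (d == dst)

    def broadcast(self):
        # Enable every link, e.g. for a shared bus or a broadcast.
        for d in self.enabled:
            self.enabled[d] = True

    def active_links(self):
        return [d for d, on in self.enabled.items() if on]

ch = MeshChannel(src=0)
ch.point_to_point(5)
print(ch.active_links())       # [5]
ch.broadcast()
print(len(ch.active_links()))  # 16
```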

Export is given to what wavefront is able to negotiate an export buffer over the export bus at the end of a thread group's life span. ROV has a more global view, hence why Pixelsync stalls a group until earlier ones are able to finish their export. Without it, it's a race like any other concurrent access situation. At least with opaque geometry the constraint is what is closest, rather than the order of what was overdrawn/culled.
Two stage arbitration. LDS for local export and arbitration within a CU cluster, GDS (one per memory channel) arbitrating over global memory and IO resources from each cluster. PixelSync would only have to stall if the buffer was filled and blending/compression required. It would simply append the result, possibly opaque results, until read back and optimized. The ID buffer may stem from that as well if using it to bin geometry.

LDS should also be closely tied to a GDS unit and hopefully prioritize traffic to memory in order to spill cache. Allow the LDS unit to reasonably efficiently coalesce results that won't fit into L2 prior to writing out results through GDS that may have to arbitrate through a large number of channels and stalls.
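A rough sketch of what that two-stage arbitration could look like, with the "LDS" stage as a per-cluster queue and the "GDS" stage as a per-memory-channel round-robin arbiter. The policy and granularity are assumptions for illustration, not GCN behaviour.

```python
from collections import deque

class TwoStageExportArbiter:
    """Per-cluster local queues (the 'LDS' stage) drained by one global
    arbiter per memory channel (the 'GDS' stage). Illustrative only."""
    def __init__(self, num_clusters: int):
        self.local = [deque() for _ in range(num_clusters)]  # per-cluster queues
        self.rr = 0                                          # round-robin pointer

    def local_export(self, cluster: int, payload):
        self.local[cluster].append(payload)

    def channel_grant(self):
        # One grant per cycle: next non-empty cluster in round-robin order.
        n = len(self.local)
        for i in range(n):
            c = (self.rr + i) % n
            if self.local[c]:
                self.rr = (c + 1) % n
                return c, self.local[c].popleft()
        return None

arb = TwoStageExportArbiter(num_clusters=4)
arb.local_export(2, "tile 17")
arb.local_export(0, "tile 3")
print(arb.channel_grant())  # (0, 'tile 3')
print(arb.channel_grant())  # (2, 'tile 17')
```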

I'm assuming you mean with the proposed new scheme?
The current scheme is that no such reordering is done and the results can be unstable and inconsistent.
With the ID buffer added to PS4 and "future AMD architectures" some reordering seems likely. Tiling or binning of geometry makes a lot of sense as demonstrated with Nvidia's solution. With the ROV capabilities described above and a theorized scalar that could efficiently sort the results it makes even more sense. Scaling should be easier if order is less of a concern.
 
Not directly, but I'm imagining a significant rework of the export pipeline along with ROP and LDS/GDS behavior with the mesh. At least with Navi, although I guess Vega could have it. OIT would be relevant when all the transparent samples got read back in and processed. This would be up to the programmer if they desired unlimited samples or some sort of compression like PixelSync.
I have a more fundamental question as to why a higher-level consideration like API-level fragment submission order would be brought down to the level of memory consistency. OIT or related solutions are concerned with indeterminism between primitives due to data races, which the individual threads could still run into even if they were operating under the strongest memory model.

My theory might consider breaking the submission order in favor of binning and synchronization.
GCN has an option for relaxing ordering requirements, and it already does so when it can. This is specifically for scenarios where it is known that there will be no overlap, or for something like a depth-only pass where order doesn't matter.

It should address certain scaling concerns, although you may need to be careful reading back samples. Tiled rasterization, where all geometry is known along with unknown samples, could help here.
Without the additional modifier of "deferred", the tiling does not change from the status quo. AMD's GPUs already tile their graphics hardware for the purposes of spatial locality.
AMD has proposed some form of binning or deferment to a limited number of primitives, however, this is not a guarantee since the limits of the bin can be exceeded, special conditions or operations can force the GPU to fall back to immediate rendering.
One such scenario may be when depth cannot cull output in the case of transparency.

The QR wouldn't be conditional, but an option for efficiently coalescing results within multiple CUs prior to export.
What is the more efficient method? With a tiled setup like the current raster output, the CUs are either writing to different locations that QR doesn't affect, or they do and they compromise the assumption QR makes that data is rarely read after writing.

Not all of that 2048 needs pads though.
Pad area is baked in. The large via and pad make the area around them unable to be patterned for other uses, and around that area is the silicon hard-wired to interface with the pad. If those pads are not needed, the matching portion of the interface going off-die and related silicon need to be removed at the design level.

The vast majority would likely exist within the die like most FPGAs as that is where most nodes on the mesh should reside.
Do you mean an FPGA like the models from Xilinx that have slices put on top of an interposer?
If not, then there's no pad space to worry about because it's not trying to go off-die.
If yes, then the die area used by the interface is there, and the hope would be that it gets used.

Two stage arbitration. LDS for local export and arbitration within a CU cluster, GDS (one per memory channel) arbitrating over global memory and IO resources from each cluster.
The "one per memory channel" arbiter right now is the export/GDS path already as far as the GCN ISA is concerned.
Using the LDS in this manner points to further changes, in light of the risk of forcing stalls simply due to being unable to allocate LDS storage and because the LDS isn't supposed to span different workgroups--which the export/GDS path does.


With the ID buffer added to PS4 and "future AMD architectures" some reordering seems likely.
The ID buffer as described is a write path that works in parallel with depth.
It's not clear in this case what reordering is needed if it works in concert with depth.
The ability to track the order and tag which primitive covers a pixel can be associated with some kind of binning, although in the nearest proposals from AMD to the ID buffer that mention something like this, pixels go into a wavefront for a given primitive after culling based on opaque geometry coverage. I'm not sure I can safely interpret whether it can handle transparent outputs that don't fall behind the nearest opaque sample.
 
It might be in theory, but I don't recall it being done this way for GCN. Some of the proposed methods might locally batch or defer, but this provides no global guarantee.

Did the Realworldtech tests of deferred tiling show a Z-order handling of screen space? The testing showed a more linear scan. The tiles are configured to match the stride length already, and the ROPs are set up to match up the physical channels. Is there a reference to the space filling curve being used for ROPs?
The one tested by RWT was... Radeon HD 6670, which was pretty old.

Here is what GCN looks like.
https://michaldrobot.com/2014/04/01/gcn-execution-patterns-in-full-screen-passes/


I'm assuming you mean with the proposed new scheme?
The current scheme is that no such reordering is done and the results can be unstable and inconsistent.
This is quite surprising if it is true. I have read long ago that alpha blending is applied in primitive submission order. This is also why OIT is considered order independent.
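A quick numeric check of why order matters for standard "over" blending, which is exactly the ordering requirement that OIT techniques set out to remove:

```python
def over(src_rgb, src_a, dst_rgb):
    """Standard 'over' blend: src composited on top of dst."""
    return tuple(src_a * s + (1.0 - src_a) * d for s, d in zip(src_rgb, dst_rgb))

background = (0.0, 0.0, 0.0)
red   = ((1.0, 0.0, 0.0), 0.5)
green = ((0.0, 1.0, 0.0), 0.5)

# Red submitted first, then green on top:
a = over(green[0], green[1], over(red[0], red[1], background))
# Green submitted first, then red on top:
b = over(red[0], red[1], over(green[0], green[1], background))

print(a)  # (0.25, 0.5, 0.0)
print(b)  # (0.5, 0.25, 0.0) -- a different result, so submission order matters
```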
 
I have a more fundamental question as to why a higher-level consideration like API-level fragment submission order would be brought down to the level of memory consistency. OIT or related solutions are concerned with indeterminism between primitives due to data races, which the individual threads could still run into even if they were operating under the strongest memory model.
Simplifying the submission model and some scaling considerations. Submitting fragments from two overlapping draws simultaneously. Cases where races are more likely to be encountered or certain restrictions relaxed. The solution could very well be something we haven't seen or is years out. Some async behaviors could also be problematic, so erring on the side of caution here. I'm also partial to using the theorized scalars for geometry processing and tessellation, which changes the requirements a bit. Fiji had 4 geometry processors; my design in theory could have ~256 using the scalars.

For example collect 16 samples for a single pixel. The ROP just collects the samples so order is irrelevant. Then have a programmable stage invoked to consolidate those results either through culling, compression, or streaming. Let the programmer decide what gets kept. Then start binning again. Once the shading finishes, finalize all the bins with a shader pass that makes sense for the programmer. Culling opaque geometry, highly transparent multiple samples, or whatever method suits the application best prior to fragment shading, postprocessing, or timewarp. That could be deferred, immediate or a hybrid like Doom. Might not be as efficient as the current design, but it's programmable along with the capabilities that come with it. The timewarp especially may benefit from overlapping opaque samples.
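Below is a minimal sketch of that collect-then-resolve flow, assuming a 16-entry per-pixel bin and a back-to-front resolve pass. The capacity and the resolve policy are illustrative choices, not a description of existing hardware.

```python
class PixelSampleBin:
    """Collection just appends (depth, colour, alpha) samples in whatever
    order they arrive; a later programmable pass sorts and resolves them."""
    def __init__(self, capacity: int = 16):
        self.capacity = capacity
        self.samples = []  # (depth, rgb, alpha); arrival order is irrelevant

    def collect(self, depth, rgb, alpha) -> bool:
        if len(self.samples) >= self.capacity:
            return False   # full: a real design would compress or spill here
        self.samples.append((depth, rgb, alpha))
        return True

    def resolve(self, background=(0.0, 0.0, 0.0)):
        colour = background
        # Back-to-front: farthest sample first, nearest last.
        for depth, rgb, alpha in sorted(self.samples, reverse=True):
            colour = tuple(alpha * s + (1.0 - alpha) * d
                           for s, d in zip(rgb, colour))
        return colour

p = PixelSampleBin()
p.collect(0.3, (0.0, 1.0, 0.0), 0.5)  # nearer green fragment
p.collect(0.7, (1.0, 0.0, 0.0), 0.5)  # farther red fragment
print(p.resolve())                    # (0.25, 0.5, 0.0) regardless of arrival order
```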

Without the additional modifier of "deferred", the tiling does not change from the status quo. AMD's GPUs already tile their graphics hardware for the purposes of spatial locality.
AMD has proposed some form of binning or deferment to a limited number of primitives, however, this is not a guarantee since the limits of the bin can be exceeded, special conditions or operations can force the GPU to fall back to immediate rendering.
One such scenario may be when depth cannot cull output in the case of transparency.
What if the number of primitives wasn't limited? Close proximity to a CPU, dedicated cache, or some compression (on IDs) could relax those limits considerably. Better approximation of tile size could help as well. Moving away from the fixed function primitive engines with a significantly large bin could make it reasonable. This would be a significant evolution of the current method, in large part because of reorganizing the ROP design.

What is the more efficient method? With a tiled setup like the current raster output, the CUs are either writing to different locations that QR doesn't affect, or they do and they compromise the assumption QR makes that data is rarely read after writing.
The CUs "should" be writing to different locations. There could be some corner cases.

Pad area is baked in. The large via and pad make the area around them unable to be patterned for other uses, and around that area is the silicon hard-wired to interface with the pad. If those pads are not needed, the matching portion of the interface going off-die and related silicon need to be removed at the design level.
...
Do you mean an FPGA like the models from Xilinx that have slices put on top of an interposer?
If not, then there's no pad space to worry about because it's not trying to go off-die.
If yes, then the die area used by the interface is there, and the hope would be that it gets used.
Yes and no to the Xilinx model as this is a hybrid approach. For an MCM there will be some portion going off die. Either to the CPU, another GPU, FPGA, etc. All FPGAs I've encountered use a subtraction method for their interconnect. One to all, then start disabling links at a specific node to create the desired connections. They're called Field Programmable Gate Arrays because you start gating off arrays to get the desired result. For example, take a 32-bit data path, connect it to 10 separate 32-bit channels simultaneously, then gate off 9 of them to form a direct connection between points. It's a bit more complicated as they can generally route around conflicts by bridging at various points.

The "one per memory channel" arbiter right now is the export/GDS path already as far as the GCN ISA is concerned.
Using the LDS in this manner points to further changes, in light of the risk of forcing stalls simply due to being unable to allocate LDS storage and because the LDS isn't supposed to span different workgroups--which the export/GDS path does.
This would be a significant change, one that I'm not currently expecting, but may be possible. Given increased data sharing this should reduce the burden on GDS for tasks falling within a single workgroup. Doesn't currently work this way, but with increasingly larger chips it should avoid some of the contention as the cache gets segmented and specialized towards local units. Along the lines of increasing cache levels with GDS becoming L3, LDS to L2, and actual local possibly rolled into a scalar.

The ID buffer as described is a write path that works in parallel with depth.
It's not clear in this case what reordering is needed if it works in concert with depth.
The ability to track the order and tag which primitive covers a pixel can be associated with some kind of binning, although in the nearest proposals from AMD to the ID buffer that mention something like this, pixels go into a wavefront for a given primitive after culling based on opaque geometry coverage. I'm not sure I can safely interpret whether it can handle transparent outputs that don't fall behind the nearest opaque sample.
Other than a bin for IDs to index I'd agree the capabilities are a bit unclear. How they choose to handle it will be interesting. There are also motion vectors for the spacewarp techniques in VR. Even overlapping opaque geometry might be worth retaining along with the transparency for those techniques. Probably a handful of papers that could be written on techniques for that subject.
 
The one tested by RWT was... Radeon HD 6670, which was pretty old.

Here is what GCN looks like.
https://michaldrobot.com/2014/04/01/gcn-execution-patterns-in-full-screen-passes/
That's tagging the execution pattern, but that's not necessarily going to give what regions of the 2D screen belong to a given rasterizer, which should be a static assignment. It gives a 512x32 scan band, but indicates that wavefront issue follows Z order.

This is quite surprising if it is true. I have read long ago that alpha blending is applied in primitive submission order. This is also why OIT is considered order independent.
Apologies, I misread the sentence and missed the alpha portion. That should be applied in primitive order.

For example collect 16 samples for a single pixel. The ROP just collects the samples so order is irrelevant.
Which "order" is this? Order of wavefront write-back, shader program order, or source primitive submission order?
Hardware is nominally aware of the first. QR governs what happens with the second. The last exists at another level of abstraction where it's not really on the cache to know the difference.
Having the ROP "collect" samples seems to take this out of QR's purview entirely, since this is an invalidation protocol, not a value broadcast one. The ROP would need to be running some kind of polling loop if it's still participating in the coherence protocol.

What if the number of primitives wasn't limited?
At some point the finite storage available for the unit will be exceeded and the debate is how much the performance drops. If it's effectively unbounded, you can consider the whole screen just one big bin--rendering it moot or just transforming the whole thing into a TBDR.

Yes and no to the Xilinx model as this is a hybrid approach. For an MCM there will be some portion going off die. Either to the CPU, another GPU, FPGA, etc. All FPGAs I've encountered use a subtraction method for their interconnect.
This would be mixing up nomenclature from the NoC paper, which only addresses links in terms of connections that go into the interposer. The memory controllers and cores are part of the interposer mesh. The internal concentrating networks on the main die fall outside the scope, so what you are describing is the on-die portion of the concentrated network scheme.
 
Which "order" is this? Order of wavefront write-back, shader program order, or source primitive submission order?
Hardware is nominally aware of the first. QR governs what happens with the second. The last exists at another level of abstraction where it's not really on the cache to know the difference.
Having the ROP "collect" samples seems to take this out of QR's purview entirely, since this is an invalidation protocol, not a value broadcast one. The ROP would need to be running some kind of polling loop if it's still participating in the coherence protocol.
Any order in the case of a large bin. QR, or quite possibly another technique, would still be applicable to any compression techniques where the data is still somewhat dependent on the other samples. It would mimic blending in requirements, but you would still be able to extract all the samples. It would be applicable if another tile or cluster had to export data for any reason. Keeping certain counters or metrics in cache should accelerate the process. Still a question of whether a ROP is a fixed hardware unit or runs atop the CU. Some form of export caching mechanism.

At some point the finite storage available for the unit will be exceeded and the debate is how much the performance drops. If it's effectively unbounded, you can consider the whole screen just one big bin--rendering it moot or just transforming the whole thing into a TBDR.
TBDR is what I'm envisioning. I foresee designs with rather large caches in the future. With the APU strategy AMD seems to be pursuing there is 20MB on the CPU alone that could be linked in. An EDRAM node that only exists on GPUs could also do it. With future designs stacking a cache die atop a logic die on an interposer, that capacity should increase significantly. I'd agree binning is currently limited, but one big bin in the future doesn't seem unreasonable. Especially if we see any development of reconstruction techniques as opposed to brute force rasterization. ASW for example changed the VR requirements rather significantly, and that technique could extend to most games. Occluded geometry was a significant issue for that technique that an ID buffer or ROV could address.

This would be mixing up nomenclature from the NoC paper, which only addresses links in terms of connections that go into the interposer. The memory controllers and cores are part of the interposer mesh. The internal concentrating networks on the main die fall outside the scope, so what you are describing is the on-die portion of the concentrated network scheme.
Sorry, mix-up in the concentration nomenclature there. The interposer routing would be concentrated. The un-concentrated portion would primarily reside on the GPU die where most of the nodes reside. A portion of that would likely extend to the interposer to connect separate processors. That's what I was getting at in regards to concentration.
 
Random thought on NCU. Integrated PCIE switch as opposed to the bridge chips they've been using for SSGs, dual adapter, or what seems likely to be a fabric/network for the Instinct line. APUs wouldn't require one as the CPU obviously has ample PCIE lanes. Further, what if it is an actual CPU, albeit a small one, that acts as the command processor, handles external routing, etc.? A larger APU like Naples wouldn't need one, and smaller APUs have the CPU portion already. A standalone Vega might be designed to lean on a CPU for the command processor and in theory has the off-die link designed in.
 
The L3 cache is a victim private to the core complex though. So unless there is an optional memory-side LLC (which is possible), or evictions can bounce between L3 slices, the GPU is not gonna thrash any of these CPU caches.
That assumes one or more cores aren't used exclusively as the command processor. A design that AMD has been hinting at. That's 2MB of L3 per core, which is probably sufficient for binning geometry on a relatively small APU. In the case of a dedicated ARM core they likely have nearly all available cache if they really want it. We already know they have plans to tightly couple the CPU to GPU, just a question of implementation. That assumes they don't have an additional pool.
 
That assumes one or more cores aren't used exclusively as the command processor. A design that AMD has been hinting at.
Any reference for this?
The command processor(s) currently have a very heavy specialization in grabbing a command packet and looking it up in a dedicated code store, then coordinating through dedicated queues and signals.
Taking that function so far away from the rest of the GPU seems problematic.
A high-performance core doesn't necessarily behave that consistently, and the cache hierarchy is not built to push updates or give consistently low latency all the way out to the L3/interconnect. Even with all the coherence talk from AMD, I'm not certain that the memory model will be even that tight, since HSA and QR promise something far weaker.

That aside, the command processor is not particularly concerned with binning geometry. That would happen somewhere in the dedicated hardware in the shader engines. There could be a processor or state machine (or many) in there, since it's bespoke processors all the way down.

edit:
The following about the PS4 being jailbroken has links to a presentation that includes discussion of the multiple proprietary F32 cores in the command processor.
https://forum.beyond3d.com/posts/1959060/
 
Any reference for this?
I have to run in 30m, but I'll try to dig up some stuff later. I've seen some suggestions about it in XB1 documentation and some patents, will take some time to dig up though.

The command processor(s) currently have a very heavy specialization in grabbing a command packet and looking it up in a dedicated code store, then coordinating through dedicated queues and signals.
Taking that function so far away from the rest of the GPU seems problematic.
A high-performance core doesn't necessarily behave that consistently, and the cache hierarchy is not built to push updates or give consistently low latency all the way out to the L3/interconnect. Even with all the coherence talk from AMD, I'm not certain that the memory model will be even that tight, since HSA and QR promise something far weaker.
I'm talking in terms of APUs here, so while the command processor would be further away, it wouldn't be that far. Nothing like CPU to discrete graphics. This design would be foregoing the specialization in favor of adaptability. There would be a likely split between HWS/ACE and possibly a similar unit in functionality for graphics? Something hinted at with the PS4 Pro article. With dedicated hardware keeping queues close to the hardware, moving the CP away might not be that problematic. It need only fill the queues, which occurs at a far lower frequency, and it's on the same package with a configurable mesh. With the HSA work there could very well be specialized hardware to speed translations as even the CPU might have to interpret, generate, or even execute GPU commands.

That aside, the command processor is not particularly concerned with binning geometry. That would happen somewhere in the dedicated hardware in the shader engines. There could be a processor or state machine (or many) in there, since it's bespoke processors all the way down.
Didn't mean to imply that the CPU or command processor would be doing the binning. Just that it's a likely node on the network that would have adequate storage for a bin. Possibly analogous to NCU or a specialized core for the implementation. Given the mesh it should be reasonably linked in. Say an ARM core with extra PCIE lanes to drive an SSG, external network, or just house a large cache that could be the core's memory channel.
 
I'm talking in terms of APUs here, so while the command processor would be further away, it wouldn't be that far. Nothing like CPU to discrete graphics.
The context was using cores in the separate CPU memory cache hierarchy.
Discretes would be difficult, although the board distance is not the major contributor to latency.
Even in an APU, it's a significant change since the comparison was using cores insulated behind 3+ cache layers and an interconnect with X number of links, northbridge, and an IOMMU, versus a command processor block that is physically wired and speaks the language of hardware it is interfacing with.

This design would be foregoing the specialization in favor of adaptability.
I'm not following what it's adapting to, or the restriction since they are programmable. The job currently is to process API/device commands and signals for a bunch of esoteric hardware in a timely and reliable fashion. The meat of the graphical/computational load isn't there, but there is an assumption that it is available and consistent for hardware expecting interactions handled in a fixed/consistent number of cycles--no page faults, context changes, evicted microcode tables, mesh congestion, or tens to hundreds of cycles of fun cache/northbridge variability.

There would be a likely split between HWS/ACE and possibly a similar unit in functionality for graphics? Something hinted at with the PS4 Pro article.
There's apparently some kind of work distributor, although as described it sounds like it's somewhere past the command processor, in the area of the setup pipelines and shader engines.

With dedicated hardware keeping queues close to the hardware, moving the CP away might not be that problematic. It need only fill the queues, which occurs at a far lower frequency, and it's on the same package with a configurable mesh.
The CP is dedicated hardware, and how does one fill the queues inside the GPU from the host side of the cache hierarchy, unless there's some sort of "processor" that monitors and pulls in "commands" on the other?

With the HSA work there could very well be specialized hardware to speed translations
This is the microcode engines in the command processors, and why there are discrepancies in which GPUs could take the necessary microcode updates to support it.
From my interpretation, HSA's command packets for non-host agents are assumed to be handled by a specialized controller, since HSA says they must be processed without taking up kernel execution resources.
 
I'm talking in terms of APUs here, so while the command processor would be further away, it wouldn't be that far. Nothing like CPU to discrete graphics. This design would be foregoing the specialization in favor of adaptability. There would be a likely split between HWS/ACE and possibly a similar unit in functionality for graphics? Something hinted at with the PS4 Pro article. With dedicated hardware keeping queues close to the hardware, moving the CP away might not be that problematic. It need only fill the queues, which occurs at a far lower frequency, and it's on the same package with a configurable mesh. With the HSA work there could very well be specialized hardware to speed translations as even the CPU might have to interpret, generate, or even execute GPU commands.
Queues are in shared memory, regardless of which existing programming model is used (incl. HSA). The oversubscription-capable command processors only bind a selected few at a time, stream the commands from memory, and context-switch them back to memory as appropriate.

Especially for HSA, this is actually something defined clearly — heterogeneous agents communicate in the user space through shared memory with a common queueing format and doorbell interrupt mechanics.
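For illustration, a toy model of that user-mode queue plus doorbell flow; the packet fields and ring mechanics here are heavily simplified relative to the actual HSA/AQL definitions.

```python
from dataclasses import dataclass, field

@dataclass
class KernelDispatch:
    # A much-simplified stand-in for an AQL kernel dispatch packet.
    grid: tuple           # (x, y, z) work-item counts
    workgroup: tuple      # (x, y, z) workgroup size
    kernel_object: int    # pointer to the kernel code
    kernarg_address: int  # pointer to the kernel arguments

@dataclass
class UserModeQueue:
    """Producer writes packets into a ring in shared memory and rings a
    doorbell; the agent's packet processor consumes from its read pointer."""
    size: int = 8
    ring: list = field(default_factory=list)
    write_index: int = 0
    read_index: int = 0
    doorbell: int = -1

    def __post_init__(self):
        self.ring = [None] * self.size

    def enqueue(self, packet: KernelDispatch):
        assert self.write_index - self.read_index < self.size, "queue full"
        self.ring[self.write_index % self.size] = packet
        self.write_index += 1
        self.doorbell = self.write_index - 1  # signal the packet processor

    def packet_processor_poll(self):
        # What the agent-side packet processor does once the doorbell fires.
        consumed = []
        while self.read_index <= self.doorbell:
            consumed.append(self.ring[self.read_index % self.size])
            self.read_index += 1
        return consumed

q = UserModeQueue()
q.enqueue(KernelDispatch((256, 1, 1), (64, 1, 1), 0x1000, 0x2000))
print(len(q.packet_processor_poll()))  # 1
```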
 
The context was using cores in the separate CPU memory cache hierarchy.
Discretes would be difficult, although the board distance is not the major contributor to latency.
Even in an APU, it's a significant change since the comparison was using cores insulated behind 3+ cache layers and an interconnect with X number of links, northbridge, and an IOMMU, versus a command processor block that is physically wired and speaks the language of hardware it is interfacing with.
Significant change, but not necessarily unreasonable. It may be possible to strip out the northbridge and IOMMU and interface more directly to an AQL queue. Say a 16-core Naples treating 16x4 CU clusters as an extension of its resources as opposed to one large device. Each core reserving 4 CUs and the mesh interconnect tailored to suit that design. Shaders are capable of moving data and the CPU could be wired into the control mechanism. According to the HSA requirements a kernel should be capable of filling its own queue. The CP aggregates commands from across a bus, but if there was only a single source and no need for global synchronization it could probably be bypassed. If the control network was part of the mesh design the CPU could tap into that along with some specialized hardware.

I'm not following what it's adapting to, or the restriction since they are programmable. The job currently is to process API/device commands and signals for a bunch of esoteric hardware in a timely and reliable fashion. The meat of the graphical/computational load isn't there, but there is an assumption that it is available and consistent for hardware expecting interactions handled in a fixed/consistent number of cycles--no page faults, context changes, evicted microcode tables, mesh congestion, or tens to hundreds of cycles of fun cache/northbridge variability.
This would be an ARM core tailor built to drive the GPU. There could be a FPGA or custom hardware along with it, however I think the ARM core might be simpler to interface. The point of adaptability, or keeping it separate, would be to have different versions for the desired task. One with a large cache as addressable memory/ESRAM, perhaps certain IO mechanisms (NVMe/SSG, Infiniband, 10GbE, USB3.1/Thunderbolt, display outputs), etc that might be desirable to update without completely changing the GPU design. I've even seen some 60Gbps external switched fabrics mentioned. Swapping the ARM core and updating the interposer and PCB would be all that is required to significantly modify the design without wasting die space. Without changing the GPU die they could update to a newer Displayport standard for instance. Some of those features wouldn't make much sense in consumer products just as display outputs wouldn't in server chips. An extension of that ambidextrous strategy they keep mentioning. Virtualization and their security mechanisms might want it as well.

This is the microcode engines in the command processors, and why there are discrepancies in which GPUs could take the necessary microcode updates to support it.
From my interpretation, HSA's command packets for non-host agents are assumed to be handled by a specialized controller, since HSA says they must be processed without taking up kernel execution resources.
Currently yes, but there has been mention of streamlining the scheduling process. Dedicating a core could be construed as not taking up kernel execution resources. Limiting the scope to smaller datasets, similar to using a CU to accelerate SSE-style instructions for a single core, being able to interface more directly might be preferable. They've also mentioned "Initial SW GPU scheduler (disabled by default)". Link Which is a bit strange considering all their hardware scheduling capabilities, but would suggest the ability to bypass the hardware. It would make sense from the standpoint of heavily partitioning the device and lowering latency. The dedicated hardware shouldn't be needed if the maximum scope of synchronization is a single CU or possibly a cluster. In addition to the mesh connecting the CPU to each cluster directly.

I'll see if I can find that ARM core slide, but it was a while ago I recall seeing it. I think it had to do with XB1 bypassing the CP in favor of software for indirect execution or something.
 
They've also mentioned "Initial SW GPU scheduler (disabled by default)". Link
If you have been following AMDKFD for a long time, you would have known that the "SW scheduler" was said to be only for GPU bring-up and debug purposes. What AMD prefers is GPU-side hardware scheduling with over-subscription, which runs independently from the CPU and communicates with the CPU via shared memory.

https://lkml.org/lkml/2014/11/8/161
 
It seems the "initial SW scheduler" is actually something specific to the graphics stack, according to bridgman.

https://www.phoronix.com/forums/for...er-being-implemented-for-amdgpu-kernel-driver

This is more for OpenGL workloads (and compute going through kernel graphics driver I guess).

The HSA stack uses a hardware-based scheduler, but the primary purpose of that scheduler is to provide support for an arbitrary number of user-space compute rings. The kernel graphics driver doesn't need that because it uses a single kernel ring and allows multiple processes to take turns submitting to it. The GPU scheduler Alex just posted basically maintains a number of kernel rings that are not connected directly to the hardware, and feeds work from those rings into that "single kernel ring" which the HW pulls from.
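A toy model of the arrangement bridgman describes, with several software-side rings drained into the single ring the hardware pulls from; the round-robin policy and the ring sizes are made up for the example.

```python
from collections import deque

class SoftwareGpuScheduler:
    """Kernel-side rings that hardware never sees, fed round-robin into the
    one ring the hardware actually pulls from. Illustrative only."""
    def __init__(self, num_sw_rings: int, hw_ring_size: int = 4):
        self.sw_rings = [deque() for _ in range(num_sw_rings)]
        self.hw_ring = deque(maxlen=hw_ring_size)
        self.next = 0

    def submit(self, ring: int, job):
        self.sw_rings[ring].append(job)

    def kick(self):
        # Feed the hardware ring while it has space and any SW ring has work.
        while len(self.hw_ring) < self.hw_ring.maxlen:
            for i in range(len(self.sw_rings)):
                r = (self.next + i) % len(self.sw_rings)
                if self.sw_rings[r]:
                    self.hw_ring.append(self.sw_rings[r].popleft())
                    self.next = (r + 1) % len(self.sw_rings)
                    break
            else:
                break  # nothing left to schedule

sched = SoftwareGpuScheduler(num_sw_rings=3)
sched.submit(0, "process A draw")
sched.submit(2, "process B compute")
sched.kick()
print(list(sched.hw_ring))  # ['process A draw', 'process B compute']
```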
 
It doesn't seem like the HSA specification has been interpreted correctly, and the understanding of the programming model doesn't feel right either.

According to the HSA requirements a kernel should be capable of filling its own queue.
A kernel does not have a queue.

An application process can request multiple user-mode queues from any specific HSA agent, and enqueue packets to the queues. Packets (so far) can be either a kernel dispatch packet or a wait packet. A kernel is just a piece of program.

A kernel dispatch contains any arbitrary number of work-items that can be organised as a one, two or three dimension space. In reality, it is just X size, Y size, Z size, a kernel pointer, a pointer to kernel arguments, the workgroup size and a few other flags and stuff. It is up to the HSA agent to break a kernel dispatch down in whatever ways they prefer.
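As a small worked example of those packet fields, assuming GCN's 64-wide wavefronts; the helper below is illustrative, not an HSA API.

```python
import math

WAVE_SIZE = 64  # GCN wavefront width

def workgroups_for_dispatch(grid, workgroup):
    """One way an agent might break a dispatch down: grid size divided
    (rounding up) by the workgroup size in each dimension."""
    return tuple(math.ceil(g / w) for g, w in zip(grid, workgroup))

grid, wg = (1000, 1, 1), (256, 1, 1)
print(workgroups_for_dispatch(grid, wg))               # (4, 1, 1) workgroups
print(math.ceil((wg[0] * wg[1] * wg[2]) / WAVE_SIZE))  # 4 wavefronts per workgroup
```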

In a GCN GPU, all CUs sit behind a common scheduling layer, which is connected to multiple packet processors (ACEs) that can serve multiple processes at a time. Not "CPU cores" or "threads". We are talking about application processes that own a virtual address space.

The CP aggregates commands from across a bus, but if there was only a single source and no need for global synchronization it could probably be bypassed. If the control network was part of the mesh design the CPU could tap into that along with some specialized hardware.
The packet (command) processor itself does not aggregate packets. The packets are written to the coherent memory, and the packet processor pulls them in upon being notified by other agents through user-mode signalling (doorbell interrupt).
 
Significant change, but not necessarily unreasonable. It may be possible to strip out the northbridge and IOMMU and interface more directly to an AQL queue.
And the cache hierarchy, variable execution time, and lack of a domain-specific microcode store. On top of that, the command processor generates additional signals and instructions based on the contents of the packets, which generally should mean data amplification and a loss of abstraction if this is exposed to the host core complex. This is compromising the CPU's effectiveness in general if hooks for GPU packet processing and arbitrary hardware fiddling are added there.

This would be an ARM core tailor built to drive the GPU.
There are ARM cores for embedded or control with items like local memory storage and an emphasis on fixed cycle costs for all operations. What is not done with them is taking that and putting it into the CPU host complex. There are dozens of similar cores already driving the GPU, for which the ISA is not a particularly relevant distinction. It may be the case that the ISA is changed, but the role doesn't shift because of it.

There could be a FPGA or custom hardware along with it, however I think the ARM core might be simpler to interface. The point of adaptability, or keeping it separate, would be to have different versions for the desired task. One with a large cache as addressable memory/ESRAM, perhaps certain IO mechanisms (NVMe/SSG, Infiniband, 10GbE, USB3.1/Thunderbolt, display outputs), etc that might be desirable to update without completely changing the GPU design.
All of these change already without the command processor needing to be re-architected, most of those already change pretty freely since they are already behind microcontrollers or an IO stack. The ESRAM or cache would be an architectural change, but as in the case of the Xbox One, not particularly relevant since it's not there for the command processor.
 