AMD Architecture Discussion

Discussion in 'Architecture and Products' started by Anarchist4000, Dec 19, 2016.

  1. lanek

    Veteran

    Joined:
    Mar 7, 2012
    Messages:
    2,469
    Likes Received:
    315
    Location:
    Switzerland
    As I haven't really followed much of the discussion I could be interpreting it badly, but don't Vega and Zen introduce a new interconnect link, with extremely high bandwidth compared to the previous one (and with nearly no latency at all)? The R&D must have been extremely costly for AMD (in terms of people and time spent working on it).
     
  2. pTmdfx

    Newcomer

    Joined:
    May 27, 2014
    Messages:
    249
    Likes Received:
    129
    AMD (as reported by EETimes) says that they have a new data fabric that can scale not only for their mobile and server SoCs, but also to the scale of what Vega and beyond need (512 GB/s+). There is no detail about it, other than it being a network-on-chip and having only one variant, with coherency ("comes only in a coherent version"). What we called GMI before is apparently the inter-chip component of this new architecture.

    So we are talking about both the inter-chip link AND the on-chip cache coherent interconnect.

    http://www.eetimes.com/document.asp?doc_id=1330981&page_number=2
     
    pharma likes this.
  3. Anarchist4000

    Veteran Regular

    Joined:
    May 8, 2004
    Messages:
    1,439
    Likes Received:
    359
    Just the abstract on that makes a ton of sense. I'll need to read through that later when I get more time.

    One possibility is that they have the cache mapped to different pools, and in turn to interconnect channels. That would be easier to maintain than spreading cache across an increasing number of logical units. Cache size could then be tailored towards a specific task. Combine 6 CUs for one task, 4 CUs for another, etc., and disable some of the interconnect links. So while it uses a mesh, not all the links would be active. Have a pool of interconnect links/buses from which to map out logical, task-specific hardware units. That would go a long way towards alleviating the spaghetti ball of an exponentially large mesh. Treat Vega as a giant FPGA, but instead of adders they'd be using entire CUs. That works well with what I've seen of the paper you linked, as it might be increasing effective CU size in the process. Great for cache, but the cadence which normally accounts for synchronization would be interesting.
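    To make the "pool of links" idea concrete, here is a tiny Python sketch of a mesh whose links can be selectively disabled to carve out task-specific CU clusters. This is purely illustrative speculation; the mesh size, cluster shape and all names are invented.

    ```python
    # Model a mesh of CU nodes whose links can be disabled so that groups
    # of CUs (6 for one task, 4 for another, etc.) become isolated,
    # task-specific clusters. Everything here is a made-up illustration.
    from itertools import product

    class CUMesh:
        def __init__(self, rows, cols):
            self.nodes = set(product(range(rows), range(cols)))
            # Start with every nearest-neighbour link enabled.
            self.links = {frozenset((a, b))
                          for a in self.nodes for b in self.nodes
                          if abs(a[0] - b[0]) + abs(a[1] - b[1]) == 1}

        def isolate(self, cluster):
            """Disable every link crossing the cluster boundary, leaving
            the cluster's CUs connected only to each other."""
            cluster = set(cluster)
            self.links = {link for link in self.links
                          if len(link & cluster) != 1}

    mesh = CUMesh(4, 4)                              # 16 "CUs" on a 4x4 mesh
    mesh.isolate({(0, 0), (0, 1), (1, 0), (1, 1)})   # carve off a 4-CU cluster
    print(len(mesh.links))                           # 20 of 24 links stay active
    ```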

    That's what we were discussing. The EETimes article said it was a mesh interconnect in excess of 512 GB/s for Vega, and that they could rework it in a matter of hours as opposed to months. The debate was over just how to scale and implement that. My previous assumption was that the link they were describing was just a point-to-point interconnect between GPU and CPU: a faster, lower-latency PCIe link thanks to the interposer. Given the mesh, it seems highly likely it's an FPGA-style interconnect.
     
  4. pTmdfx

    Newcomer

    Joined:
    May 27, 2014
    Messages:
    249
    Likes Received:
    129
    Kind of. But I was thinking about bandwidth amplification, which 3dilettante has touched on.

    One key thing about Quick Release (QR) IMO is that it can be layered easily, and it is designed with the assumption of weak consistency, so that the only external coherence traffic would be invalidations. So it is possible to have (1) a building block of a small number of CUs and a private L2 for local bandwidth amplification; and (2) multiple building blocks on the mesh, with a per-channel L3 handling GPU-local atomics and global write combining. Both levels could then use the same interconnect (e.g. a ring locally, the mesh globally).
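    A minimal sketch of that two-level arrangement, assuming QR-style weak consistency where the only traffic crossing a block boundary is an invalidation; the block sizes and class names are invented:

    ```python
    # Speculative model: small CU blocks share a private L2 (local
    # bandwidth amplification), and blocks sit on a global mesh with a
    # per-memory-channel L3 handling atomics and write combining. Under
    # weak consistency the L3 only broadcasts invalidations on a write,
    # with no ownership tracking.

    class L2Block:
        """A building block: a few CUs behind one private L2."""
        def __init__(self, n_cus):
            self.n_cus = n_cus
            self.lines = {}                # addr -> locally cached data

        def invalidate(self, addr):
            self.lines.pop(addr, None)     # the only inbound external traffic

    class MeshL3:
        """Per-channel L3 on the global mesh: serves misses and
        broadcasts invalidations to the blocks on a write."""
        def __init__(self, blocks):
            self.blocks = blocks
            self.memory = {}

        def write(self, addr, data, writer):
            self.memory[addr] = data
            for block in self.blocks:      # weak consistency: invalidate only
                if block is not writer:
                    block.invalidate(addr)

    blocks = [L2Block(n_cus=4) for _ in range(16)]   # e.g. 16 blocks of 4 CUs
    l3 = MeshL3(blocks)
    l3.write(0x1000, b"tile", writer=blocks[0])
    ```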
     
  5. Anarchist4000

    Veteran Regular

    Joined:
    May 8, 2004
    Messages:
    1,439
    Likes Received:
    359
    The timeline for that paper will be interesting. It may have come too late (~2014, best I can tell) in Vega's development cycle to get implemented; the fabric design might allow for it though. The applications will be interesting, as it enables capabilities that GPUs have avoided in the past. It may address the ROP scaling issues rather efficiently with the layering as you described, with obvious benefits for compute when needed. Might be the closest thing to compute compression we've seen yet. I could see delta compression working alongside that rather nicely for certain workloads.

    In the case of Scorpio they could probably map an entire block of EDRAM into a CU cluster. For that matter an entire node on the mesh could be nothing but specialized cache, ROPs, or geometry engines. Definitely changes the programming model for certain elements.

    One other interesting possibility is that the mesh could in theory be used like a series of buses to broadcast data. Or possibly a <2048b memory channel. Unsure what applications that might have, but if every CU had to read the same data it could work.
     
  6. pTmdfx

    Newcomer

    Joined:
    May 27, 2014
    Messages:
    249
    Likes Received:
    129
    Not quite sure how it would fix ROP scaling.

    GPUs are already screen-space partitioned in several stages (rasteriser, ROP, etc.), so it seems meshes could fit well. Say, as ROPs are bound to shader engines and memory channels, they could virtualise the export bus to run on top of the NoC by exploiting locality in the mesh.
     
  7. Anarchist4000

    Veteran Regular

    Joined:
    May 8, 2004
    Messages:
    1,439
    Likes Received:
    359
    https://www.eecis.udel.edu/~lxu/resources/TOP-PIM: Throughput-Oriented Programmable Processing in Memory.pdf
    If, for example, the ROPs weren't in the CUs and you needed to coalesce the writes a bit more. Still reading through that one myself.

    Throwback to the Navi discussion in March/April. Hitting on all the same stuff we're looking at now. Looks like parts of it happened with Vega though.
    https://forum.beyond3d.com/threads/amd-navi-speculation-rumours-and-discussion.57684/
     
  8. 3dilettante

    Legend Alpha

    Joined:
    Sep 15, 2003
    Messages:
    8,125
    Likes Received:
    2,883
    Location:
    Well within 3d
    To clarify, the ROPs as we know them are not in the CUs, and are currently not coherent at all.
    Their usage model has very separated synchronization points, and various behaviors break the Quick Release assumption of rare read after write traffic.
    ROPs move tiles of data at a time, and the current method of use when dealing with depth and transparency puts a higher premium on reading data that was written earlier. Delta compression is currently not coherent and depending on the compressor's choice of tile and format can generate a variable and wide swath of invalidations based on whether a single value change flips the delta for the tile and possibly any other tiles/control words in the hierarchy.

    How ROPs could be brought into the coherent domain would need to be outlined given some of the current design choices.
    If they were brought in, it would seemingly mean bringing some kind of tiled memory pipeline into the neighborhood of the CU, or as a subsidiary export pipeline in the CU proper. There could be uses for doing so, including expanding the programmability of export operations, creating a different memory access type with different sizes and visibility controls, and possibly using ordering information and delta compression values as inputs to the shaders or the shader engine's wavefront logic.
     
  9. Anarchist4000

    Veteran Regular

    Joined:
    May 8, 2004
    Messages:
    1,439
    Likes Received:
    359
    It's the programmable blending and OIT I was looking to address. At least the OIT should be coming with Vega, and ideally the blending, as they're similar. In theory that's somehow coherent along with the interconnect. My thinking was a two-stage design with a programmable first stage in the CU(s), as opposed to a programmable ROP elsewhere, and a second stage with some basic blending or data-manipulation logic. Possibly actual data compression and indexing to conserve space as opposed to just bandwidth. The QR would be for cases where programmable units in different CUs did something strange: computing a tile-wide value, perhaps, once the CUs simplified results. Perhaps related to DCC or a compression scheme. Ensure another CU didn't make any changes that would affect the outcome if several were working together. That would be at a rather coarse granularity.
     
  10. Razor1

    Veteran

    Joined:
    Jul 24, 2004
    Messages:
    4,232
    Likes Received:
    749
    Location:
    NY, NY
    OIT was in consoles before, if I'm not mistaken; Sega comes to mind, but it was later dropped due to limitations. Depending on what you want and the accuracy you want, it's extremely expensive, and this is probably why it hasn't been used in GPUs much, so I wouldn't think it would be even remotely a reason for the CUs' and ROPs' needs.

    PS: OIT performance has more to do with register pressure, local GPU cache, and how the communication between the CUs in different blocks is handled.
     
  11. pTmdfx

    Newcomer

    Joined:
    May 27, 2014
    Messages:
    249
    Likes Received:
    129
    OIT can always be implemented in DX11 using a per-pixel linked list, and AMD's TressFX uses it. The problems are the unbounded memory growth (if not hard-capped), the information loss (if hard-capped) and the scattered access pattern.

    He probably meant Rasterizer Ordered Views, which are IIRC essential for approximated/adaptive OIT.
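    For illustration, a CPU-side Python model of the per-pixel linked list approach (real implementations, TressFX included, do the append phase in a pixel shader with UAV atomics such as InterlockedExchange on the head-pointer buffer); the buffer layout and names here are just for the sketch:

    ```python
    # Phase 1 appends fragments to a shared node buffer and swaps the
    # per-pixel head pointer; phase 2 walks each pixel's list, sorts by
    # depth and blends back to front.

    NULL = -1
    heads = {}          # pixel (x, y) -> index of newest node for it
    nodes = []          # shared node buffer: (depth, rgba, next_index)

    def append_fragment(pixel, depth, rgba):
        """Phase 1: the atomic head-pointer exchange step."""
        nodes.append((depth, rgba, heads.get(pixel, NULL)))
        heads[pixel] = len(nodes) - 1    # unbounded growth unless capped

    def resolve(pixel, dest):
        """Phase 2: gather the list, sort far-to-near, blend over dest."""
        frags, i = [], heads.get(pixel, NULL)
        while i != NULL:
            depth, rgba, i = nodes[i]
            frags.append((depth, rgba))
        for _, (r, g, b, a) in sorted(frags, reverse=True):  # back to front
            dest = tuple(c * (1 - a) + s * a
                         for c, s in zip(dest, (r, g, b)))
        return dest

    append_fragment((3, 5), depth=0.7, rgba=(1.0, 0.0, 0.0, 0.5))
    append_fragment((3, 5), depth=0.4, rgba=(0.0, 0.0, 1.0, 0.5))
    print(resolve((3, 5), dest=(0.0, 0.0, 0.0)))   # (0.25, 0.0, 0.5)
    ```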
     
    Razor1 likes this.
  12. pMax

    Regular Newcomer

    Joined:
    May 14, 2013
    Messages:
    327
    Likes Received:
    22
    Location:
    out of the games
    I tried to follow this discussion, but I honestly do not understand why you are talking about the bus interconnects.
    AMD stated they moved to a coherent bus: coherence is handled at the MC access level, as the DC level is already too late.
    So, given that they were doing coherency between the GPU MMU and the CPU MMU using the PCI interconnect or something like that, IMHO this means they finally made an Intel-like fully unified northbridge.
    At least, that is what I understood.
     
  13. pTmdfx

    Newcomer

    Joined:
    May 27, 2014
    Messages:
    249
    Likes Received:
    129
    A cache coherent interconnect doesn't mean everything has to be cache coherent. It just means cache coherency is maintained, and clients can very well bypass the coherence domain as needed; the GPU is an obvious one.
     
  14. pMax

    Regular Newcomer

    Joined:
    May 14, 2013
    Messages:
    327
    Likes Received:
    22
    Location:
    out of the games
    ...yet I miss why people are discussing bumps and so on: those terminations must end in DCT/PHYs, so do you multiply them as well (edit: rethinking, maybe they were thinking of interconnecting the various MCs with bumps?)? Plus, as they are talking about coherence, that happens in the northbridge. Coherence between the CPU and GPU blocks means a unified NB across both. So, again, I miss the point?
    (Plus, once you have a fully coherent 512 GB/s bus, why would you add an extra incoherent bus? You use what you already have, with a different protocol maybe.)
    Also, a fully coherent, fully unified northbridge allows you to put an L3 behind the CPU and GPU, thus getting those nifty Intel benefits with large L3 caches.
     
  15. pTmdfx

    Newcomer

    Joined:
    May 27, 2014
    Messages:
    249
    Likes Received:
    129
    There are multiple facets of coherence being talked about here. You are focusing on system-level cache coherence. We are discussing coherence and consistency in the GPU's private domain, which is preferably incoherent with the system while reusing the cache coherence fabric (protocol and interconnect).


    Maintaining cache coherency has overheads, and in some cases things are intentionally designed to avoid it for maximum efficiency. Take a GPU writing to its local frame buffer, which is designed to be coherent with the CPU only at API boundaries. There is no point in putting it within the coherence domain; that gives you a hell of a lot of ownership and invalidation traffic at the system level that wasn't necessary at all. Moreover, it is not "adding an extra incoherent bus": modern SoC interconnects usually support both coherent and incoherent accesses under the same umbrella.
     
    #35 pTmdfx, Dec 23, 2016
    Last edited: Dec 23, 2016
  16. pTmdfx

    Newcomer

    Joined:
    May 27, 2014
    Messages:
    249
    Likes Received:
    129
    If you need an example, the AMBA 4 ACE white paper might help. Starting from page 9, it explains that the interconnect supports both "non-shared" transactions that are non-coherent at the system level, and various levels of coherent transactions (broadly: caching vs. non-caching clients).
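    As a toy illustration of that split (ReadNoSnoop/WriteNoSnoop and ReadShared/WriteUnique are actual ACE transaction types; the routing function below is made up):

    ```python
    # One interconnect, but each transaction declares whether it needs
    # snooping, so coherent and non-coherent traffic share the fabric.
    from enum import Enum

    class Txn(Enum):
        READ_NO_SNOOP = "ReadNoSnoop"     # non-shared: bypasses coherency
        WRITE_NO_SNOOP = "WriteNoSnoop"
        READ_SHARED = "ReadShared"        # coherent read by a caching master
        WRITE_UNIQUE = "WriteUnique"      # coherent write, non-caching master

    def route(txn: Txn) -> str:
        """Same fabric either snoops the other masters' caches or goes
        straight to memory, depending on the transaction type."""
        if txn in (Txn.READ_NO_SNOOP, Txn.WRITE_NO_SNOOP):
            return "direct to memory controller (no snooping)"
        return "snoop other caching masters, then memory on miss"

    print(route(Txn.READ_NO_SNOOP))   # e.g. a GPU streaming its frame buffer
    print(route(Txn.READ_SHARED))     # e.g. a CPU cluster on shared data
    ```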
     
  17. Anarchist4000

    Veteran Regular

    Joined:
    May 8, 2004
    Messages:
    1,439
    Likes Received:
    359
    https://software.intel.com/en-us/gamedev/articles/rasterizer-order-views-101-a-primer

    This is the approach I was thinking of, accelerated with those high-performance scalars I've been theorizing about when a list is complete or overflowing. They'd be perfect for the sorting or the linked lists, so long as those were small enough to fit within the 256B-1KB (guessing) scratch pad. Keep in mind I was theorizing packed vectors (ideally all the ROV samples) to feed them data, where the scattering is a bit less of a concern; you just end up with the larger initial memory allocation. Use the traditional ROP design for aggregating results, then the CU/scalars if it starts overflowing as highlighted above, along with finalizing those results in a programmable manner.

    In the case of an overflow it would either compress, or get a new memory allocation to link to. One which likely isn't readily available to the ROP. Dump all the samples to a new linked location and start collecting again. Maintains performance until a shader decides to make use of all the samples.

    Extension of ROV. Traditional ROP handling opaque samples, possibly appending to a list of transparent samples. The CU stage, along with that theorized scalar, doing the sort, blending, and overflow.

    http://www.eecg.toronto.edu/~enright/micro14-interposer.pdf
    Current thinking was a 2048b(?) interface on each CU or SE. Part of that would be tied directly to memory, another part remapped to a configurable network topology through an internal crossbar to create the mesh. The coherence would be a protocol over whatever topology was configured. Most of the interface would be disabled, as it would be used by other nodes and would require a really large crossbar and the ability to consume data in each node.
     
  18. pTmdfx

    Newcomer

    Joined:
    May 27, 2014
    Messages:
    249
    Likes Received:
    129
    A few wild ideas:

    1. Let's assume the rasterisers already own large interleaved partitions of the screen space in Z order (see the sketch after this list).

    2. A CU group (let's say four CUs) can further own a fixed subpartition in the logical space of its parent rasteriser.

    3. Since the subpartitions are fixed, maintaining the API ordering can be done locally in the CU group. This means the synchronisation needed by ROV can be maintained locally between just four CUs.

    4. Exporting to the ROP AFAIU has the same API (re)ordering requirement as ROV, except that you can only write data out with a predefined atomic operation. As the CU group now maintains the API order for a fixed set of partitions in the screen space, perhaps ROPs can be tied to the same mapping, and moved into a CU group too. With a mesh NoC, it would not be a problem if the partitioning is configured with proximity to memory controllers in mind (?).

    5. Then the pixel exporting process would effectively become no different from ROV, except that ROV reads and writes like a UAV in the protected section, while pixel exporting is just writing stuff to the local ROP.

    6. Since there is already ordered sync at CU group level, perhaps a workgroup (that needs only unordered sync) can also be allowed to span over CUs within the same group. That would require the LDSes moving a bit further from the CUs though, and not as pipelined as it was.

    So in the end, say with 64CU Greenland, it may have 16 CU groups, each of which has 4 CUs, 4 Color ROPs, 16 Z/Stencil ROPs, 128KB L2 Cache and 256KB shared LDS.
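    For illustration, a small Python sketch of the static Z-order ownership assumed in (1) and (2): interleave the x/y bits of a tile address into a Morton index and derive fixed owners from it. The tile size and the rasteriser/group counts are invented.

    ```python
    # Z-order (Morton) mapping of screen tiles to fixed owners.
    def morton(x, y, bits=16):
        """Interleave the bits of x and y into a Z-order index."""
        z = 0
        for i in range(bits):
            z |= ((x >> i) & 1) << (2 * i) | ((y >> i) & 1) << (2 * i + 1)
        return z

    def owners(px, py, tile=8, n_rast=4, n_groups=4):
        """Map a pixel to a fixed (rasteriser, CU group) pair."""
        z = morton(px // tile, py // tile)     # Z-order index of the tile
        rasteriser = z % n_rast                # global interleave (step 1)
        cu_group = (z // n_rast) % n_groups    # fixed sub-partition (step 2)
        return rasteriser, cu_group

    # Neighbouring tiles land on different owners, so nearby screen work
    # spreads out while each tile's owner never changes:
    for px in (0, 8, 16, 24):
        print((px, 0), "->", owners(px, 0))
    ```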
     
    #38 pTmdfx, Dec 26, 2016
    Last edited: Dec 26, 2016
    Malo likes this.
  19. 3dilettante

    Legend Alpha

    Joined:
    Sep 15, 2003
    Messages:
    8,125
    Likes Received:
    2,883
    Location:
    Well within 3d
    Just to point out for future clarification, coherence wouldn't change the problem that OIT solutions seek to solve: wavefronts reaching their export phase for transparent data at varying times based on the dynamic behavior of the CUs involved, leading to inconsistent final output. Having multiple threads spread across different cores in a coherent multicore CPU would have similar problems in the absence of any additional synchronization.
    Raster-order views inject a stall if it is determined that a pixel in a ROP tile is currently being worked on by a primitive earlier in submission order.
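    A toy model of that stall, assuming a per-tile record of in-flight primitive IDs in submission order (the mechanism and names are illustrative, not the actual PixelSync implementation):

    ```python
    # Each ROP tile keeps the set of primitive IDs (submission order)
    # still in flight for it; a later primitive may enter its ordered
    # section only when no earlier one is still working on that tile.
    in_flight = {}      # tile -> set of primitive IDs currently shading it

    def try_enter(tile, prim_id):
        """ROV-style gate: stall if an earlier primitive holds the tile."""
        live = in_flight.setdefault(tile, set())
        if any(p < prim_id for p in live):
            return False                   # stall: earlier prim not retired
        live.add(prim_id)
        return True

    def retire(tile, prim_id):
        in_flight[tile].discard(prim_id)

    assert try_enter(tile=(2, 2), prim_id=7)       # first prim proceeds
    assert not try_enter(tile=(2, 2), prim_id=9)   # later prim must wait
    retire(tile=(2, 2), prim_id=7)
    assert try_enter(tile=(2, 2), prim_id=9)       # now it can enter
    ```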

    I'm having some difficulty parsing this, and what parts are different or the same as before.
    The programmable CUs are the first stage already, if the definition of the second stage for the ROPs is basic blending and data manipulation.
    Having the working cache compressed for the second stage would be different, but it would likely make parallel work significantly more painful to achieve.
    I don't follow how QR is supposed to be conditional on the CUs doing something strange. It's either being used or it isn't. It's a little late to use it if it was not used and then we realize something strange happened. That would sound like some kind of very optimistic speculation with an intractable fallback case.

    As noted in that paper, ~2048 bits of interface per core was prohibitive in terms of die area lost to pad space (the paper's projections are actually optimistic relative to the bump pitches we have currently).
    In that regard, could you clarify what you mean by having most of the interface disabled? These are physical objects with non-zero area, and that area doesn't become available if they are turned off.

    Currently, 2D screen space is partitioned into rectangles and the rasterizers each get a subset, which helps match up with the way the address space is striped to allow optimal utilization of the DRAM channels. A rasterizer can handle one primitive per clock and 16 pixels are rasterized per clock, so it takes four clocks to fill the 16 quads of a 64-lane wavefront.

    Is this static ownership, or some kind of dynamically balanced ownership? There is a decent amount of static partitioning based on a shader engine having a rasterizer with a static mapping, but the programmable resources are more flexible within that subset to handle variability in demand.

    Export is granted to whichever wavefront is able to negotiate an export buffer over the export bus at the end of a thread group's life span. ROV has a more global view, hence why PixelSync stalls a group until earlier ones are able to finish their export. Without it, it's a race like any other concurrent-access situation. At least with opaque geometry the constraint is what is closest, rather than the order of what was overdrawn/culled.
     
  20. pTmdfx

    Newcomer

    Joined:
    May 27, 2014
    Messages:
    249
    Likes Received:
    129
    It does not necessarily work at the granularity of pixels though. At least in theory, the post-rasteriser wavefront packing can be batched and sorted, which ensures overlapping pixels get packed into separate wavefronts with larger IDs (that is, in submission order).
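    A sketch of that packing in Python: fragments arrive in submission order and are placed greedily into the earliest wavefront that has no fragment for the same pixel, so overlapping pixels land in wavefronts with strictly increasing IDs. The lane count and data layout are invented.

    ```python
    # Conflict-free wavefront packing: no wavefront ever holds two
    # fragments covering the same pixel, so ordered export only has to
    # respect wavefront IDs.
    def pack_wavefronts(fragments, lanes=64):
        """fragments: list of (prim_id, pixel) in submission order.
        Returns a list of wavefronts, each a list of fragments."""
        waves = []                           # wavefront i has ID i
        for frag in fragments:
            _, pixel = frag
            for wave in waves:               # earliest wave without conflict
                if len(wave) < lanes and all(p != pixel for _, p in wave):
                    wave.append(frag)
                    break
            else:
                waves.append([frag])         # overlap (or full): new wave
        return waves

    frags = [(0, (1, 1)), (0, (1, 2)), (1, (1, 1)), (2, (1, 1))]
    for wid, wave in enumerate(pack_wavefronts(frags)):
        print("wavefront", wid, wave)
    # Pixel (1, 1) ends up in wavefronts 0, 1 and 2, in submission order.
    ```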

    Static ownership. The global 2D screen space is already interleaved in Z order between the rasterisers. So the sub-partitioning ideally would be interleaving in Z order within the local 2D space of the global partitions.

    Yes, but then, at least for alpha blending, AFAIU the pixel export process would have to reorder out-of-order exports back into submission order for stable and consistent results. In this sense, even if blending opaque fragments does not require ordering but only a depth check, the exporting process as a whole would still have an ordering requirement that can share a synchronization mechanic with ROV.

    Moreover, the idea is that with static partitioning and mapping of the screen space, the synchronization can be localised.
     
    #40 pTmdfx, Dec 27, 2016
    Last edited by a moderator: Dec 27, 2016