AMD Architecture Discussion

http://www.eetimes.com/document.asp?doc_id=1330981&page_number=2

Interesting article with some stuff I hadn't seen before.

The fabric is described as a superset of HyperTransport, and the article also states that there is only a coherent version of it.
It states that Vega should introduce a mesh version, with bandwidths of 512 GB/s and up.
Classical HyperTransport has a baked-in crossbar, which would be a significant contributor to the complexity of changing an SoC; it seems AMD has changed this somehow.

It's curious how that plays with GCN as we know it.
If Vega starts at 512 GB/s for memory bandwidth, it's a step down if the fabric is introduced between the GPU L2 and L1, where aggregate bandwidth is already greater than 1 TB/s in high-end GPUs; given the protocol's overhead and coherent nature, that seems excessive as long as the dozens of L1s write through every 4 cycles.
If it's outside the L2s or in the memory controllers, that should allow the promised 1:1 memory-to-fabric bandwidth ratio and wouldn't disrupt the classic GCN L1-L2 model. In that case, though, a coherent fabric would seem a bit excessive in a discrete GPU context if coherence is handled the way it usually is.
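As a rough sanity check on the "greater than 1 TB/s" point, here is a back-of-envelope sketch; the CU count, clock, and per-CU L1-L2 width below are assumptions for a Fiji-class part, not figures from the article:

```python
# Back-of-envelope comparison of the quoted 512 GB/s fabric figure against
# aggregate GCN L1<->L2 traffic. All inputs below are assumptions.
cus = 64                 # assumed CU count (Fiji-class part)
clock_ghz = 1.05         # assumed engine clock
bytes_per_cu_cycle = 64  # assumed L1<->L2 width per CU (one line per clock)

l1_l2_bw_tb_s = cus * bytes_per_cu_cycle * clock_ghz / 1000
print(f"Aggregate L1<->L2 bandwidth: ~{l1_l2_bw_tb_s:.1f} TB/s")
print("Quoted fabric bandwidth:      0.512 TB/s")
# ~4.3 TB/s vs. 0.512 TB/s: putting the fabric between L1 and L2 would be
# a large step down, which is the point made above.
```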

The SenseMI system mentioned for Zen would be using some form of the fabric, or at least a control variant of it. Could the control protocol be kept as a parallel fabric, rather than running over the data paths?
 
The fabric is described as a superset of HyperTransport, and the article also states that there is only a coherent version of it.
It states that Vega should introduce a mesh version, with bandwidths of 512 GB/s and up.
Classical HyperTransport has a baked-in crossbar, which would be a significant contributor to the complexity of changing an SoC; it seems AMD has changed this somehow.

It's curious how that plays with GCN as we know it.
If Vega starts at 512 GB/s for memory bandwidth, it's a step down if the fabric is introduced between the GPU L2 and L1, where aggregate bandwidth is already greater than 1 TB/s in high-end GPUs; given the protocol's overhead and coherent nature, that seems excessive as long as the dozens of L1s write through every 4 cycles.
If it's outside the L2s or in the memory controllers, that should allow the promised 1:1 memory-to-fabric bandwidth ratio and wouldn't disrupt the classic GCN L1-L2 model. In that case, though, a coherent fabric would seem a bit excessive in a discrete GPU context if coherence is handled the way it usually is.

The SenseMI system mentioned for Zen would be using some form of the fabric, or at least a control variant of it. Could the control protocol be kept as a parallel fabric, rather than running over the data paths?
Assuming the fabric is designed to make a distinction between non-coherent and coherent accesses (for accelerator & peripheral IPs), they could move the L2 into the shader engines (or finer-grained blocks) by exploiting the fact (IIRC) that CU read/write accesses are generally non-coherent, except when the GLC/SLC bits are set or it is an atomic op. The coherent protocol can then be used to back atomic operations and coherent accesses to the system.
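A minimal sketch of that routing split, purely illustrative (the names and structure are mine, not AMD's):

```python
# Ordinary CU accesses stay on an engine-local (non-coherent) path, while
# GLC/SLC-flagged accesses and atomics go out over the coherent fabric.
from dataclasses import dataclass

@dataclass
class MemRequest:
    addr: int
    is_atomic: bool = False
    glc: bool = False   # "globally coherent" bit on the instruction
    slc: bool = False   # "system level coherent" bit

def route(req: MemRequest) -> str:
    if req.is_atomic or req.glc or req.slc:
        return "coherent fabric"   # backed by the coherent protocol
    return "local L2"              # engine-private, non-coherent path

print(route(MemRequest(0x1000)))                  # local L2
print(route(MemRequest(0x1000, is_atomic=True)))  # coherent fabric
```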

How the export bus and ROPs are going to play their part is a serious question, though...
 
Assuming the fabric is designed to make a distinction between non-coherent and coherent accesses (for accelerator & peripheral IPs), they could move the L2 into the shader engines (or finer-grained blocks) by exploiting the fact (IIRC) that CU read/write accesses are generally non-coherent, except when the GLC/SLC bits are set or it is an atomic op. The coherent protocol can then be used to back atomic operations and coherent accesses to the system.

Naively moving the L2 slices to a per-engine arrangement would break GCN's coherence method, which relies on per-channel assignment that physically prevents a location from being cached in more than one coherent location.
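For reference, a minimal sketch of that per-channel assignment; the interleave granularity and channel count below are assumed for illustration, not GCN's actual parameters:

```python
# Every physical line maps to exactly one L2 slice/channel, so a location
# can never be cached in two coherent places at once.
LINE_BYTES = 64
INTERLEAVE_BYTES = 256   # assumed channel interleave granularity
NUM_CHANNELS = 16        # assumed number of memory channels / L2 slices

def l2_slice_for(addr: int) -> int:
    return (addr // INTERLEAVE_BYTES) % NUM_CHANNELS

# Two accesses that hit the same line always land on the same slice:
assert l2_slice_for(0x1234_0000) == l2_slice_for(0x1234_0000 + LINE_BYTES - 1)
```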

If it's a superset of HyperTransport, it does support non-coherent accesses. There's a non-coherent form of HT, although the article seems to be saying there's no non-coherent option for the Infinity Fabric.
For HT, it's a bit set in the command portion of the packet. That may be more costly if moved into the parts of the GPU that were "dumber" in the past, if it were to have the 8 or 12 bytes of overhead per 4-64 bytes of payload that the main protocol has.
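Taking those overhead and payload figures at face value, the wire efficiency works out roughly as follows (a simple sketch, nothing protocol-specific beyond the numbers above):

```python
# Fraction of wire bandwidth that is payload for an HT-style packet,
# using the 8-12 byte overhead and 4-64 byte payload figures above.
for overhead in (8, 12):
    for payload in (4, 64):
        eff = payload / (payload + overhead)
        print(f"overhead={overhead:2d} B, payload={payload:2d} B -> "
              f"{eff:.0%} payload efficiency")
```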
 
The fabric is described as a superset of HyperTransport, and the article also states that there is only a coherent version of it.
It states that Vega should introduce a mesh version, with bandwidths of 512 GB/s and up.
Classical HyperTransport has a baked-in crossbar, which would be a significant contributor to the complexity of changing an SoC; it seems AMD has changed this somehow.
Why make a mesh with memory channels though? I can't imagine they needed to improve the ability to copy data within memory. A crossbar would aggregate memory channels; a mesh would in theory connect them together. That would only make sense if the channels were closely associated with a shader engine or CU. Say 4 channels per SE with 16 CUs each, like Fiji. Then have the mesh connect each SE along with other IO. A large mesh would also be increasingly problematic to implement.

The SenseMI system mentioned for Zen would be using some form of the fabric, or at least a control variant of it. Could the control protocol be kept as a parallel fabric, rather than running over the data paths?
A separate control fabric for SenseMI along with the ACE/HWS metrics might make sense. I can't imagine SenseMI requires that much communication, nor do the ACE/HWS metrics.

If it's outside the L2s or in the memory controllers, that should allow the promised 1:1 memory-to-fabric bandwidth ratio and wouldn't disrupt the classic GCN L1-L2 model. In that case, though, a coherent fabric would seem a bit excessive in a discrete GPU context if coherence is handled the way it usually is.
Multiple coherent pools? L2 coherency is a bit pointless if splitting graphics and compute into separate CUs. Or separate users in the case of virtualization.
 
Naively moving the L2 slices to a per-engine arrangement would break GCN's coherence method, which relies on per-channel assignment that physically prevents a location from being cached in more than one coherent location.
As said in the previous post, L2 acts as a point of coherence only for memory accesses that meet any of the three conditions (GLC/SLC/atomics). For the bulk of the accesses, the L2 is not coherent due to non-coherent L1 caching. So for accesses under those conditions, the private L2 instances can be coordinated by the coherent data fabric.

If it's a superset of HyperTransport, it does support non-coherent accesses. There's a non-coherent form of HT, although the article seems to be saying there's no non-coherent option for the Infinity Fabric.
I am rather inclined to read it as "having no variant that drops all the cache coherence mechanics". Non-coherent accesses are rather essential for an SoC-friendly (rather than CPU-centric) interconnect to achieve optimal performance, AFAIK.

For HT, it's a bit set in the command portion of the packet. That may be more costly if moved into the parts of the GPU that were "dumber" in the past, if it were to have the 8 or 12 bytes of overhead per 4-64 bytes of payload that the main protocol has.
I believe nothing prevents the on-chip network from using its own custom internal encoding, though; it just has to appear to be HT at certain boundaries. Moreover, AMD's generations of on-chip HT interconnect all separate the control datapath from the data datapath, IIRC.
 
Why make a mesh with memory channels though? I can't imagine they needed to improve the ability to copy data within memory. A crossbar would aggregate memory channels; a mesh would in theory connect them together.
It would be a mesh that is capable of supporting full memory bandwidth, which historically AMD has not been able to offer coherently in its chips due to the mentioned mixture of multiple on-die interconnects. A crossbar would be the most consistent in terms of latency, but as implemented it is not modular enough to isolate parts of the system from knock-on effects and revalidation when the client count or the types of clients change.
AMD's APUs and CPUs have generally had limited changes to the number of crossbar clients and a rather confused mixture of on-die connections for various sub-components.

By contrast, a modular system that has interconnect stops of a fixed complexity and a more abstracted message protocol can be scaled, and individual stops modified, without potentially revalidating or cascading changes across the whole chip, which might be what AMD is alluding to.
For a potentially similar scenario:
http://www.realworldtech.com/sandy-bridge/8/

Sandy Bridge is tied together with a high bandwidth coherent interconnect that spans the three major domains. Nehalem and Westmere used crossbar interconnects, which are extremely efficient and high bandwidth for a small number of agents – but must be redesigned to vary the number of agents. In contrast, Nehalem-EX and Westmere-EX both rely on a ring topology where the wiring and design effort scales better with the number of agents.

However, per that same article the ring bus is still simpler than a mesh, although due to scaling issues at high client counts Knights Landing does go with a mesh. That might be more like a system of perpendicular rings.
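To make the scaling argument concrete, a toy comparison of link counts for the topologies being discussed, using the usual textbook formulas (nothing AMD-specific):

```python
# Wiring growth for a crossbar vs. a ring vs. a square 2D mesh.
import math

def links(topology: str, n: int) -> int:
    if topology == "crossbar":
        return n * n                   # every agent connected to every agent
    if topology == "ring":
        return n                       # one hop link per agent
    if topology == "mesh":
        side = math.isqrt(n)           # assume a square, fully populated mesh
        return 2 * side * (side - 1)   # horizontal + vertical links
    raise ValueError(topology)

for n in (4, 16, 64):
    print(n, {t: links(t, n) for t in ("crossbar", "ring", "mesh")})
# The crossbar's quadratic growth is what forces a redesign as agents are
# added, while ring and mesh wiring grow roughly linearly with agent count.
```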

Since Vega is on an interposer, there has been research done by AMD on implementing a mesh network in the interposer. However, a practical implementation became some kind of butterfly network, or concentrated the mesh into a smaller number of stops shared by multiple clients on the chip above it.

Multiple coherent pools? L2 coherency is a bit pointless if splitting graphics and compute into separate CUs. Or separate users in the case of virtualization.
Graphics and compute can feed into one another, and the L2 also supports the atomics the CUs rely on. It's comparatively cheap the way it's implemented now.
 
As said in the previous post, L2 acts as a point of coherence only for memory accesses that meet any of the three conditions (GLC/SLC/atomics). For the bulk of the accesses, the L2 is not coherent due to non-coherent L1 caching. So for accesses under those conditions, the private L2 instances can be coordinated by the coherent data fabric.
Coherence is maintained by a read/write miss to the next level of the hierarchy as needed, since there is no snooping built into the system. In the case of localized L2s and their atomic units, that would mean SLC and L2 atomics resolve to some kind of global miss, unless the L2s start snooping or there's another cache.

For non-coherent accesses, unless there's a change to the sliced L2 design, they're going to somehow have to behave at an SE level as if they are striped across all memory channels. Capacity-wise, each would act as if it were an L2 1/(SE count) the size of the overall L2.
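In numbers (the total L2 size and SE count below are assumed, roughly Fiji-like):

```python
# Effective capacity seen by each SE if the L2 is split per engine while
# addresses remain striped across all channels. Figures are assumptions.
total_l2_kib = 2048      # assumed total L2 (2 MiB)
shader_engines = 4       # assumed SE count
per_se_l2_kib = total_l2_kib // shader_engines
print(f"Each SE-local L2 would effectively behave like a {per_se_l2_kib} KiB "
      f"cache instead of the shared {total_l2_kib} KiB")
```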

On a side note, there's also a system request queue that is centralized in the classical HT northbridge whose fate is somewhat unclear in this superset fabric. A mesh and the more distributed architecture of a GPU seem to indicate that this may have changed, although as a superset of the original protocol some provision for a queue would remain.
 
Coherence is maintained by a read/write miss to the next level of the hierarchy as needed, since there is no snooping built into the system. In the case of localized L2s and their atomic units, that would mean SLC and L2 atomics resolve to some kind of global miss, unless the L2s start snooping or there's another cache.
Yes, cache coherence between distributed L2s is what I meant. It feels like an inevitable step to me, as they can't possibly indefinitely grow the crossbar. (The crossbar in Fiji is probably a fat tree instead of a full 64x16 one. But still.)

Though I just recalled that the L1D is write-through with a dirty byte mask. Decentralizing the L2 would definitely be a problem even for non-coherent accesses in this case, hmm.

Edit: Hmm, on second thought it is not a problem if the dirty byte mask is carried through by the fabric to the memory controller.

On a side note, there's also a system request queue that is centralized in the classical HT northbridge whose fate is somewhat unclear in this superset fabric. A mesh and the more distributed architecture of a GPU seem to indicate that this may have changed, although as a superset of the original protocol some provision for a queue would remain.
The SRI seems to aggregate requests from the cores to reduce the complexity of the NB crossbar. So I guess it would be gone if one has a ring or a mesh, especially when the Zen quad-core blocks presumably present only one outward-facing interface.
 
It would be a mesh that is capable of supporting full memory bandwidth
Is that really necessary though? Given the latency hiding mechanics, no single node should require full bandwidth. It might make more sense to tailor the links to the anticipated demands.

A crossbar would be the most consistent in terms of latency, but as implemented it is not modular enough to isolate parts of the system from knock-on effects and revalidation when the client count or the types of clients change.
Planning for inconsistency might be a better approach with SenseMI. Use the links to compensate for different clocks between stops. Slow/throttled processors would be less of a concern in that case.

By contrast, a modular system that has interconnect stops of a fixed complexity and a more abstracted message protocol can be scaled, and individual stops modified, without potentially revalidating or cascading changes across the whole chip, which might be what AMD is alluding to.
This seems likely. Say 4 stops for each SE, 1 for IO, possibly one spare for additional chips on the interposer. Definitely tracks with being able to rework the network in a matter of hours. FPGAs might be a good reference here, as each interconnect typically becomes a layer. With a 2048b memory bus that is likely a fair approximation. I doubt AMD is planning 10+ metal layers. It would be interesting if all interconnects occurred in the interposer. Avoid really thick 14nm wafers that way. Could also plop logic dice all over the interposer that way.

Since Vega is on an interposer, there has been research done by AMD on implementing a mesh network in the interposer. However, a practical implementation became some kind of butterfly network, or concentrated the mesh into a smaller number of stops shared by multiple clients on the chip above it.
16+ 128b channels in a mesh is one hell of a spaghetti ball. I'd hate to see a mesh with a 4096 bit bus like Fiji and twice the stops. KNL was still using DDR. Higher clocked narrow links would help, but that also plays against the HBM design. 512b per SE with 128b links for the primary mesh should be more doable. Drop to 32b for some of the off chip IO. Still leaves a question of how many vacant stops get built in to allow new configurations with an interposer? I was expecting this stuff to show up with Navi, not Vega. My thinking on Navi was stacking cache or possibly memory on the logic die. Memory/cache on top, logic in the middle, interconnects/interposer on the bottom. Then use multiple power efficient dice to increase real estate. Might be something to keep in mind for the current Vega design. Abstract each memory controller from the mesh and keep data localized.

Although I guess that breaks down on HPC where the ROPs aren't utilized.

Decentralizing the L2 would definitely be a problem in this case, hmm.
If the links were fast enough, along with high priority settings, it might work. I can't imagine centralizing the L2 for the entire chip in one location is very beneficial from a power standpoint. It could be its own stop with the GCP, ARM core, or CPU, with wide links to cut latency.
 
If the links were fast enough, along with high priority settings, it might work. I can't imagine centralizing the L2 for the entire chip in one location is very beneficial from a power standpoint. It could be its own stop with the GCP, ARM core, or CPU, with wide links to cut latency.
It is a problem for non-coherent accesses if the interconnect handles dirty lines at the granularity of lines, but not at the granularity of bytes (i.e. dirty byte masks). Specifically, if two CUs edit the same cache line with unequal dirty byte masks, then while the order is nondeterministic for writes to the same byte, the non-overlapping changes from both sides should be visible regardless. If the interconnect handles writes without the dirty byte mask, the memory controller would not be able to "diff" the change but would simply overwrite the conflicting line, i.e. either version of it wins.

That said, I believe modern interconnects already do this to minimise bandwidth use (as does the GCN L1 -> L2 path, despite having a full 64B interconnect), so it might not be a problem.
 
It is a problem for non-coherent accesses if the interconnect handles dirty lines at the granularity of lines, but not at the granularity of bytes (i.e. dirty byte masks). Specifically, if two CUs edit the same cache line with unequal dirty byte masks, then while the order is nondeterministic for writes to the same byte, the non-overlapping changes from both sides should be visible regardless. If the interconnect handles writes without the dirty byte mask, the memory controller would not be able to "diff" the change but would simply overwrite the conflicting line, i.e. either version of it wins.
Not implying line granularity, but the ability to update the entire line in one(?) cycle with a wide interconnect. One made possible by effectively having the routes on the chip/interposer. Especially for an SE, which could be moving a lot of data. Connecting multiple chips generally doesn't assume a large number of lanes being possible.

Maybe they're going for a Shield TV 2 competitor with P12.
Would make sense. Something related to that Magnum FPGA board that showed up on the manifests. That Raja interview in your link would seem to suggest it's a bit larger than P11, although that doesn't mean faster. Splitting the difference between P10 and P11 with low clocks for maximum efficiency would make some sense. Maybe a different process targeted towards mobile. Still, that seems like a better market for Vega APUs.
 
The ARM "semi-custom design win" is still to show up, unless it's also been scrapped.

Regardless, there are AotS results in there, so even if it's an APU, it's using an x86 CPU.
 
Is that really necessary though? Given the latency hiding mechanics, no single node should require full bandwidth. It might make more sense to tailor the links to the anticipated demands.
Perhaps it matters more in a compute, APU, or multi-chip scenario. What's not entirely clear right now is how Vega makes use of it, since the number is very high for a lot of interconnect work, but also low compared to GCN's historical figures for internal cache bandwidth.

I doubt AMD is planning 10+ metal layers. It would be interesting if all interconnects occurred in the interposer. Avoid really thick 14nm wafers that way. Could also plop logic dice all over the interposer that way.
I'm not sure all interconnects could do that. The interposers are 65nm, and the actual bumps, like with HBM, are at a ~40-50 um pitch. There's supposed to be improvement there someday, but it's orders of magnitude above the size of some of the highest-bandwidth on-die paths. This was one area where Intel was in the past skeptical of how quickly the pitches could be scaled.
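For a rough sense of scale, using the pitch figure above (the signal count below is an illustrative assumption, not a specific product's):

```python
# Microbump density at the quoted ~40-50 um pitch vs. the signal count of
# one wide interface. The 1024-signal figure is an assumed example.
pitch_um = 45
bumps_per_mm2 = (1000 / pitch_um) ** 2
print(f"~{bumps_per_mm2:.0f} microbumps per mm^2")

signals = 1024           # assumed data signals for one wide HBM-like link
area_mm2 = signals / bumps_per_mm2
print(f"~{area_mm2:.1f} mm^2 of bump field just for {signals} data signals")
# On-die wire pitches are hundreds of times finer, hence the
# "orders of magnitude" gap for the widest internal paths.
```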
 
Not implying line granularity, but the ability to update the entire line in one(?) cycle with a wide interconnect. One made possible by effectively having the routes on the chip/interposer. Especially for an SE, which could be moving a lot of data. Connecting multiple chips generally doesn't assume a large number of lanes being possible.
That's not quite relevant to my point though. Sorry if it wasn't clear enough. I was commenting on a certain aspect of memory consistency in GCN, which AFAIK relies on write combining in the L2 with a dirty byte mask.

To illustrate with a simple example, let's say two CUs reference the same cache line. Initially it is AAAAAAAA. The first CU modifies it locally as ABABDDDD, and the second CU modifies it locally as CACAEEEE. Write combining with dirty byte masks would eventually yield CBCB????, where `?` is nondeterministic.

If it is winner-takes-all, all locations would be nondeterministic without synchronisation. In other words, non-competing writes by different CUs to the same cache line would interfere with each other. This is definitely bad.
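The example above, written out as a small sketch (the merge function is purely illustrative, not GCN's actual mechanism):

```python
def merge(base: bytes, writes) -> bytes:
    """writes: (local_copy, dirty_mask) pairs applied in arrival order.
    Only the bytes each CU actually wrote (mask == 1) update the line."""
    line = bytearray(base)
    for local, mask in writes:
        for i, dirty in enumerate(mask):
            if dirty:
                line[i] = local[i]
    return bytes(line)

base = b"AAAAAAAA"
cu0 = (b"ABABDDDD", [0, 1, 0, 1, 1, 1, 1, 1])  # CU0 wrote bytes 1, 3, 4-7
cu1 = (b"CACAEEEE", [1, 0, 1, 0, 1, 1, 1, 1])  # CU1 wrote bytes 0, 2, 4-7

# With dirty byte masks, non-overlapping writes from both CUs survive;
# only the contended bytes 4-7 depend on arrival order (CBCB????):
print(merge(base, [cu0, cu1]))  # b'CBCBEEEE'
print(merge(base, [cu1, cu0]))  # b'CBCBDDDD'

# Winner-takes-all (whole-line writeback, no masks): the last arrival
# clobbers the other CU's non-conflicting bytes as well:
print(merge(base, [(cu0[0], [1] * 8), (cu1[0], [1] * 8)]))  # b'CACAEEEE'
```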
 
I'm not sure all interconnects could do that. The interposers are 65nm, and the actual bumps, like with HBM, are at a ~40-50 um pitch. There's supposed to be improvement there someday, but it's orders of magnitude above the size of some of the highest-bandwidth on-die paths. This was one area where Intel was in the past skeptical of how quickly the pitches could be scaled.
It would definitely have to be restricted to major stops or points requiring the bandwidth. So "all" probably wasn't a good description. Zen, for example, should be one (8c16t?) node with ideally full memory bandwidth for an APU. The same logic would apply to a second GPU; however, if it adds 4+ stops that could make the mesh a bit large. A smaller bus for control, GCP/ACE/HWS, masks, etc. might be able to maintain the mesh. That still doesn't really address the full bandwidth issue. Maybe design the chip so each link gets cut in half to accommodate more nodes? Rely on the interposer to change the network, along with some firmware.

That's not quite relevant to my point though. Sorry if it wasn't clear enough. I was commenting on a certain aspect of memory consistency in GCN, which AFAIK relies on write combining in the L2 with a dirty byte mask.
I understand what you were saying. I was suggesting that with wider links, or an interesting implementation, the transaction time could be reduced. That would avoid some collisions; however, some synchronization would still be required. The only exception is if the mesh is actually signal-based as opposed to data-based, where the L2 runs at the speed of a wide bus. Treat it like a register port as opposed to a network. Sort of like how FPGAs handle their interconnects. FPGAs might be a good example of a mesh interconnect, as they are giant configurable meshes. The L2 would likely run a bit slower, and any clients off package wouldn't work very well. Addressing would be interesting. That might still be worthwhile though; treat it more like an L3 than an L2.
 
It would definitely have to be restricted to major stops or points requiring the bandwidth. So "all" probably wasn't a good description. Zen, for example, should be one (8c16t?) node with ideally full memory bandwidth for an APU. The same logic would apply to a second GPU; however, if it adds 4+ stops that could make the mesh a bit large. A smaller bus for control, GCP/ACE/HWS, masks, etc. might be able to maintain the mesh. That still doesn't really address the full bandwidth issue. Maybe design the chip so each link gets cut in half to accommodate more nodes? Rely on the interposer to change the network, along with some firmware.


I understand what you were saying. I was suggesting that with wider links, or an interesting implementation, the transaction time could be reduced. That would avoid some collisions; however, some synchronization would still be required. The only exception is if the mesh is actually signal-based as opposed to data-based, where the L2 runs at the speed of a wide bus. Treat it like a register port as opposed to a network. Sort of like how FPGAs handle their interconnects. FPGAs might be a good example of a mesh interconnect, as they are giant configurable meshes. The L2 would likely run a bit slower, and any clients off package wouldn't work very well. Addressing would be interesting. That might still be worthwhile though; treat it more like an L3 than an L2.
Time to bring this up again: http://research.cs.wisc.edu/multifacet/papers/hpca14_quick_release.pdf
 