AMD: Navi Speculation, Rumours and Discussion [2017-2018]

Well at least there'd be close to zero chiplet to "HBM" power, if each chiplet is capped by a stack of memory. The troublesome question is what's in the chiplet at the base of a stack of memory. Just ROPs?
Having the ROPs sit separately under a stack, while some other chiplet holds the rest of the GPU, would raise the question of how much goes back and forth between them, under the current assumption that it's all on-chip.
How many bytes can a CU export per cycle? Each SE's back end for a large GPU can manage 128B/cycle with peak pixel export, which in peak scenarios would allow one SE to mostly saturate an interconnect to one chiplet with bandwidth paired to an HBM2 stack's bandwidth. That might not include depth, which still can get out of the shader array even if the shader itself doesn't write it directly.
That wouldn't take into account any values travelling in the other direction, if depth/HiZ come back up from that section or any other internal sideband traffic is now moving off-chip.
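To put very rough numbers on that, here's a back-of-envelope sketch; the 1.5 GHz core clock and the 256 GB/s per-stack figure are my assumptions, not anything published:

```python
# Back-of-envelope: can one SE's export path saturate a stack-matched link?
# All figures below are assumptions for illustration only.
export_bytes_per_cycle = 128        # peak pixel export per SE back end (B/cycle)
core_clock_hz = 1.5e9               # assumed GPU core clock
hbm2_stack_bw = 256e9               # assumed per-stack HBM2 bandwidth (2 Gbps x 1024 bit / 8)

se_export_bw = export_bytes_per_cycle * core_clock_hz
print(f"SE peak export: {se_export_bw / 1e9:.0f} GB/s, "
      f"{se_export_bw / hbm2_stack_bw:.0%} of one HBM2 stack")
# -> SE peak export: 192 GB/s, 75% of one HBM2 stack
```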

It would be a change with some complex evaluation to weigh the amplification of the ROP caches against the possibly higher demands of operations like blending or multiple targets/samples. I'm not sure it's stated which side of the export process some of those might be held on.
One tension would be that even localized overdraw intercepted by a check at the depth buffer would matter more, since it wouldn't be hidden on-chip any longer. Feeding the front end with more accurate depth information would cost bandwidth in ways it didn't before.

It could sprout another set of buffers/compressors on either side, although wouldn't that also require walking back Vega's change of making the RBEs L2 clients?

I suppose the ideal would be if there were a way to put the ROPs and just enough respective producers/consumers on the same die to keep things on-chip, like having some but not all CUs and control logic near them. Otherwise, making the ROPs more independent/programmable or replacing them with programmable logic might make the balancing act easier conceptually.
 
Not sure if there's a misunderstanding on my side, but all the 2D resources for a GPU are 2D tiled and completely separable. The metadata for depth/color compression is also tiled and separable. I see no problem with concurrent processing of different areas of the same resource by different chiplets, because all the access is through state-enforced locality. There is exactly one edge case, UAVs that overlap in time or location, and that's either restrictable by "Concurrent Access Views" or handled by a tuned coherency protocol. It's not difficult to rethink algorithms so they don't use the most contentious and hazardous memory access pattern.
 
I have flu and bed beckons, so this will be quick:
Having the ROPs sit separately under a stack, while some other chiplet holds the rest of the GPU, would raise the question of how much goes back and forth between them, under the current assumption that it's all on-chip.
This is why Xenos worked quite well...

How many bytes can a CU export per cycle? Each SE's back end for a large GPU can manage 128B/cycle with peak pixel export, which in peak scenarios would allow one SE to mostly saturate an interconnect to one chiplet with bandwidth paired to an HBM2 stack's bandwidth.
For graphics, shader export only has to be able to keep up with rasterisation rate. If you're foolish enough to generate gargantuan fragments and your pixel shader is short enough to run at rasterisation rate, then yes the GPU's going to apply the brakes.

That might not include depth, which still can get out of the shader array even if the shader itself doesn't write it directly.
That wouldn't take into account any values travelling in the other direction, if depth/HiZ come back up from that section or any other internal sideband traffic is now moving off-chip.
Cop-out time: I'm assuming that hierarchical-Z will disappear in Navi, in favour of DSBR. Depth at fragment precision compresses really well, so it's practical to fetch a tile of depth in the bin-preparation phase.

It would be a change with some complex evaluation to weigh the amplification of the ROP caches against the possibly higher demands of operations like blending or multiple targets/samples. I'm not sure it's stated which side of the export process some of those might be held on.
One tension would be that even localized overdraw intercepted by a check at the depth buffer would matter more, since it wouldn't be hidden on-chip any longer. Feeding the front end with more accurate depth information would cost bandwidth in ways it didn't before.
DSBR should read depth and write depth once per tile. The aggregate bytes of depth written in these conditions should easily beat hierarchical-Z queries/updates. Every fragment-quad that is despatched for shading must also have its depth written to hierarchical-Z.

DSBR, per batch, will only read/write depth once, but hierarchical-Z and colour-write will operate many times in the worst case: triangles in back-to-front order.
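As a toy illustration of that worst case (the tile size, depth format and overdraw factor below are assumptions, and compression is ignored):

```python
# Toy depth-traffic comparison for one screen tile, back-to-front triangles.
tile_pixels = 32 * 32       # assumed tile size
depth_bytes = 4             # 32-bit depth per pixel, uncompressed
overdraw    = 6             # assumed layers covering the tile, back-to-front

dsbr_bytes = 2 * tile_pixels * depth_bytes             # read once + write once per batch
imr_bytes  = overdraw * 2 * tile_pixels * depth_bytes  # each layer tests and updates depth
print(f"DSBR: {dsbr_bytes // 1024} KiB, immediate-mode worst case: {imr_bytes // 1024} KiB")
# -> DSBR: 8 KiB, immediate-mode worst case: 48 KiB
```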

It could sprout another set of buffers/compressors on either side, although wouldn't that also require walking back Vega's change of making the RBEs L2 clients?
I've long posited that L2 would be in the base logic die beside the ROPs in the "lightweight" PIM scenario...

I suppose the ideal would be if there were a way to put the ROPs and just enough respective producers/consumers on the same die to keep things on-chip, like having some but not all CUs and control logic near them. Otherwise, making the ROPs more independent/programmable or replacing them with programmable logic might make the balancing act easier conceptually.
D3D12 has programmable blending. But, ignoring that, DSBR (fine-grained portion) can actually be done in PIM too. Shaders export vertex data for coarse-grained binning, followed by sub-streams of vertices (and their attributes) despatched to each chiplet that owns the tile and does BR (i.e. fine-grained) and ROP.

So each chiplet with its stacked memory has, in its logic die, an L2 that supports various tile sizes. The tile sizes for both BR and ROP are affected in the same way by pixel format. Fatter pixels mean smaller tiles for both binning and colour/Z buffers.
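A minimal sketch of that relationship, assuming a fixed (and entirely made-up) per-tile storage budget in the logic die:

```python
import math

# Tile dimensions for a fixed per-tile buffer budget; fatter pixels -> smaller tiles.
tile_budget_bytes = 64 * 1024               # assumed budget per tile in the logic die
for bytes_per_pixel in (4, 8, 16, 32):      # e.g. RGBA8 up to RGBA32F
    pixels = tile_budget_bytes // bytes_per_pixel
    side = math.isqrt(pixels)
    print(f"{bytes_per_pixel:2d} B/pixel -> roughly {side}x{side} tile")
# 4 B/pixel gives roughly 128x128, 32 B/pixel roughly 45x45
```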

EDIT: Transmitting vertices and their attributes from the PIM chiplets to the shader chiplets is work. Which is why it would be good to defer attribute computation until you have decided to shade any of that triangle's fragments. Which is one of the puzzle pieces of primitive shader.
 
For the purposes of comparing interconnects, it's more of a function of aggregate inter-chip bandwidth. The assumption is point-to-point links between chips. EPYC is a fully-connected solution and provides more aggregate bandwidth, whereas Nvidia's proposal only connects a chip to its two neighbors horizontally and vertically. How many DRAM stacks are there for two chips, and what are their respective areas?
Going to 4 chips helps give more area to compensate for redundant silicon and to provide a more tangible improvement versus the older node's 600-800mm2 chips.
Initially, for a first iteration, I would expect something like 2x300mm2 on 7nm, with 2 stacks of HBM per chip. I think redundant silicon is overstated (initially); ultimately you need to build a desirable product. Being able to hit a performance tier others can't early in 7nm's life would be one of those desirable products.

There are now questions to answer concerning how accesses are handled between the local requestors, remote requestors, and each chip's memory.
I would expect it to treat them both the same. So for that to work, we would need to understand how much locality already exists or could easily exist, and what control there is over scheduling, memory address allocation, etc. If locality is really 50% remote / 50% local and there is nothing you can do about it, then my idea has a big problem. But I find that hard to believe.


There are metadata caches for the graphics domain (various compression methods, hierarchical structures) that are currently incoherent, and buffers/caches for the output of front-end stages or command processors whose relationship with the L2 is unclear. Some may spill to the L2, while others, like the MBs of vertex parameter cache, may generally avoid it. There are some queues/buffers used by the shader engines that the Vega ISA hints shaders can address simultaneously within the chip, and MSG instructions that have global impact but might not store into the GDS.
The tradeoff in redundant work versus taking bandwidth or synchronization across the chip would have to be looked at.
None of this sounds particularly horrendous; if they are already incoherent, sharing them across 2 chiplets doesn't sound that bad. I assume that all coherent data is ultimately the responsibility of the L2.

The L1-L2 crossbar is diagrammed in some documents as being inside the GPU domain proper, and not exposed to the infinity fabric. If Vega has something similar to Fury's crossbar (or crossbar-like structure), it's somewhere like 2-3x the HBM's bandwidth. Which bandwidth should the cross-chip interconnect cater to? The assumption so far has been for it to match DRAM bandwidth.
There are two things to care about, right: average and peak average. I assume 2-3x for the internal structure is because it all has a lock-step clock and it has to handle that peak average. When going across chips, there is going to have to be some form of boundary between the PHY/its buffers and the internal crossbar. At this point you could decouple and have a dynamic external frequency (I believe EPYC does this already); then that bandwidth becomes less of an issue, and it's more about power consumption when it hits that peak bandwidth for sustained amounts of time.

To me, going for 2 chiplets seems significantly easier than >2, and if this is the path AMD is going down, that should be exploited until it is no longer an advantage. The other thing that would be very cool (not that I think they would do it), if they did go down this type of path, is if the RR follow-on did some form of optional HBM (1 stack, something like Samsung's cheap HBM proposal) plus an optional inter-chiplet link. Then AMD could do an APU, an APU + HBM, and an APU + chiplet + 2x HBM. I think AMD is trying to drive Si interposers (via HBM) because they are very actively trying to drive a direction.
 
"GTFO Raja" because.. AMD recovered a substantial chunk in discrete graphics marketshare despite the laughable R&D budget they had for GPUs during the last 3-4 years since the inception of RTG?
Because Vega cards sold out everywhere for several months despite their prices being inflated?

I hate AMD's handling of Vega's release just as much as the next guy. Non-existent communication, the pricing debacle, forcing Vega 56 down reviewers' throats some 48 hours before ending the review embargo, and even not choosing the "power saving mode" as the default power plan all seem completely ridiculous to me.
But the cards aren't exactly sitting on the shelves with no one picking them up. And given the budget that RTG was given, together with the downsides of being stuck with GlobalFoundries' lacking 14nm, I'd say Vega and Polaris represented a spectacular handling of resources.

The Vega “sold out” narrative would be impressive if there was any indication that it was actually available in volume. The fact that AMD has been losing market share ever since its introduction underscores just how much of a “success” it has really been. It’s Fury all over again: “sold out” for months after launch, with very few units actually moving when all was said and done.
 
Not sure if there's a misunderstanding on my side, but all the 2D resources for a GPU are 2D tiled and completely separable. The metadata for depth/color compression is also tiled and separable. I see no problem with concurrent processing of different areas of the same resource by different chiplets, because all the access is through state-enforced locality. There is exactly one edge case, UAVs that overlap in time or location, and that's either restrictable by "Concurrent Access Views" or handled by a tuned coherency protocol. It's not difficult to rethink algorithms so they don't use the most contentious and hazardous memory access pattern.

The RAW hazard handling for resources that have associated metadata includes flushing the affected caches. Vega can avoid flushing its L2, if its alignment is respected, but the metadata cache would still be flushed. This is a heavier operation since it is a pipeline flush that includes among other things a microcode engine stall to avoid a race condition within the command processor. The operation and its front-end stall are not quite transparent in the single-GPU case to start with.

This is why Xenos worked quite well...
To clarify, is the chiplet stacked memory a part of the GPU's general memory pool, or is there more than one path to the stack? Xenos had a dedicated path for its GPU output, separate from and higher bandwidth than the DDR bus.
My interpretation of the stacked solution is that there would be one signal path to the stack, which the other proposals matched with the HBM's bandwidth.

The EDRAM was sized to be able to support a worst-case amplification of that 32GB/s payload to 256GB/s internally, although that included the choice to not use the compression techniques available at the time. That 8x amplification of export bus to ROP traffic might not hold as strongly once basic color sample compression or now DCC come into play.
Not using all channels or not having all lanes marked valid could also be an opportunity for an external bus to compact or compress data, although it would make less sense as long as the export bus is on-die.
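Roughly, with assumed compression ratios (nothing here is measured):

```python
# Xenos sized its EDRAM for a worst-case blow-up of export traffic into ROP traffic.
export_bus_bw = 32e9       # parent-die to daughter-die export bus
rop_side_bw   = 256e9      # worst-case uncompressed ROP read/write bandwidth
print(f"uncompressed amplification: {rop_side_bw / export_bus_bw:.0f}x")

for ratio in (2.0, 4.0):   # assumed average colour-compression ratios
    print(f"~{ratio:.0f}:1 compression -> effective amplification "
          f"{rop_side_bw / ratio / export_bus_bw:.0f}x")
# -> 8x uncompressed, dropping to 4x and 2x with the assumed ratios
```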

Cop-out time: I'm assuming that hierarchical-Z will disappear in Navi, in favour of DSBR. Depth at fragment precision compresses really well, so it's practical to fetch a tile of depth in the bin-preparation phase.
Something like the 32-pixel tiles mentioned in the Vega ISA for the shader engine coverage instruction?

I've long posited that L2 would be in the base logic die beside the ROPs in the "lightweight" PIM scenario...
Is there some other pool of L2 on the other side of the connection to the PIM, in order to maintain some amount of bandwidth amplification and coherence?

Initially, for a first iteration, I would expect something like 2x300mm2 on 7nm, with 2 stacks of HBM per chip. I think redundant silicon is overstated (initially); ultimately you need to build a desirable product. Being able to hit a performance tier others can't early in 7nm's life would be one of those desirable products.
That seems a touch large if we are to believe AMD's projection of cost/mm2 for 7nm on a 250mm2 die, assuming they aren't being overly pessimistic. If this is a Fiji-like arrangement (~600mm2 of GPU, 4 stacks), it presumably needs some additional measures like a larger interposer or some other means of making the larger HBM2 stacks fit, or some kind of multi-exposure interposer like that used for GP100.
The bandwidth between chips would likely be proportional to the HBM interfaces, since it would favor a wide and slow connection, and the HBM stacks dominate the length of a die edge.

None of this sounds particularly horrendous; if they are already incoherent, sharing them across 2 chiplets doesn't sound that bad. I assume that all coherent data is ultimately the responsibility of the L2.
Why the elements are incoherent and whether something else can use a given piece of data can determine how expensive they can be. A fair number of them are assuming they won't be shared, so tidying up the architecture would mean making sure the data stays private. Other cases, like the process of geometry setup, have caches and queues that are global and visible, just with hardwired assumptions like the limits on shader engine counts, which we don't know how AMD intends to deal with in a scalable fashion.

The L2's coherence is based on a pretty simple assumption that a given entry can only exist in one specific L2 slice. Caching another chip's location would break that assumption, but addresses are usually interleaved between channels to get more even bandwidth utilization. One or the other would need adjustment.
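A minimal sketch of that single-owner assumption and why fine-grained interleaving across two chips strains it (the slice count, line size and hash below are all made up):

```python
# One cache line has exactly one home L2 slice. Interleaving lines across two
# chips means half of a chip's accesses are homed remotely unless something changes.
SLICES_PER_CHIP = 16
CHIPS = 2
LINE = 64                                    # bytes per cache line

def home(addr: int) -> tuple[int, int]:
    """Return (chip, slice) that exclusively owns the line containing addr."""
    line = addr // LINE
    chip = line % CHIPS                      # fine-grained channel-style interleave
    slice_id = (line // CHIPS) % SLICES_PER_CHIP
    return chip, slice_id

print(home(0x0000), home(0x0040), home(0x0080))
# -> (0, 0) (1, 0) (0, 1): consecutive lines alternate chips, so either the
# interleave gets coarser or slices must start caching remote lines.
```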

There are two things to care about, right: average and peak average. I assume 2-3x for the internal structure is because it all has a lock-step clock and it has to handle that peak average.
The L2's bandwidth is such because it is supposed to provide bandwidth amplification over what the memory bus can provide, and the many L1s are write-through and very small. The L1s are there to at most compensate for L2 latency and coalesce the most recent vector loads, rather than avoid consuming L1/L2 bandwidth.
Vega now has the L2 servicing the ROPs as well, which is a new client with an unclear impact at this point.

When going across chips, there is going to have to be some form of boundary between the PHY/its buffers and the internal crossbar. At this point you could decouple and have a dynamic external frequency (I believe EPYC does this already); then that bandwidth becomes less of an issue, and it's more about power consumption when it hits that peak bandwidth for sustained amounts of time.
The fabric is pretty consistent about memory and link bandwidth being equivalent with nice ratios of clock and link width. xGMI and its more stringent error margins are the outlier.
Lower activity might scale down some elements, but GPU loads are miss-prone and the caches themselves are not as good about reducing misses.

To me, going for 2 chiplets seems significantly easier than >2, and if this is the path AMD is going down, that should be exploited until it is no longer an advantage.
A specific solution that slots into a Fiji-like package would at least be somewhat plausible as a product. Just by opting for an interposer, it rules anything greater out and leaves the one-chip solution saddled with an interposer. AMD's estimates might point to this being somewhat Vega-like in cost for a single-chip implementation. It's not EPYC-like at this point, which I think we agree upon.
I've commented on this before, but AMD uses "chiplet" for a specific form of chip that cannot function on its own without a specialized interposer and secondary silicon either mounted alongside or operating in the interposer itself. It's even less like EPYC and even more far-out in terms of whether it is feasible for some time. The more alternatives to interposers gain success, the more there's a chance AMD's speculating down a wasted effort.
 
The RAW hazard handling for resources that have associated metadata includes flushing the affected caches. Vega can avoid flushing its L2, if its alignment is respected, but the metadata cache would still be flushed. This is a heavier operation since it is a pipeline flush that includes among other things a microcode engine stall to avoid a race condition within the command processor. The operation and its front-end stall are not quite transparent in the single-GPU case to start with.

Can you explain to me the cross-chiplet RAW hazard you have in mind? Transitioning a render target/depth-stencil from being written to being read by texture filtering has such massive overhead (decompression) that I cannot imagine something else competing with that. I'm not including DX11-style order-guaranteeing paradigms from past ages in my thoughts about how this could work, but rather DX12-style multi-chiplet barrier support for shared resources and direct concurrent graphics queue usage patterns.
 
Can you explain to me the cross-chiplet RAW hazard you have in mind?
The inciting hazard is not specific to multi-chip setups; it's just intra-frame readback of resources that have metadata.
The flush of affected caches is not transparent to the driver, and includes queuing a command for the command processor to stall in order to prevent a race condition with one of several sub-components running ahead through the command stream.
Managing it isn't transparent to the driver or software, which was one direction proposed for an EPYC-like solution for Navi's chips behaving as if they were one unit.
 
Managing it isn't transparent to the driver or software, which was one direction proposed for an EPYC-like solution for Navi's chips behaving as if they were one unit.

Oh, I see. I think that's weird to want, because you lose all the flexibility it could give.
On the other hand, EPYC doesn't look like 1 CPU core transparently just wider. Maybe that "complete" degree of transparency wasn't exactly asked for from chiplets (or threadlets, corelets, modulets), rather something more realistic. :) Aren't we halfway into lego silicon anyway?
 
To clarify, is the chiplet stacked memory a part of the GPU's general memory pool, or is there more than one path to the stack? Xenos had a dedicated path for its GPU output, separate from and higher bandwidth than the DDR bus.
My interpretation of the stacked solution is that there would be one signal path to the stack, which the other proposals matched with the HBM's bandwidth.
In Infinity Fabric, compute nodes are multi-ported. The question is whether the base logic die under a stack of memory is a compute node or a memory node in an IF system.

If we treat it as PIM, it's a compute node, perhaps implying it's multi-ported. The port count might be low (2, say) just because the only peers are CU chiplet(s). I'm assuming there's no reason for PIMs to talk directly to each other. There's likely a maximum of 8 CU chiplets, and we might not see more than 4 CU chiplets until whatever comes after Navi. For a package consisting of 4 CU chiplets and 4 PIMs, 2 ports on each PIM would make all CUs a maximum of two hops from memory.
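One possible wiring that satisfies that, as a toy check; the ring of CU chiplets and the choice of which two CUs each PIM attaches to are my guesses, not a known Navi layout:

```python
from collections import deque

# Toy model: 4 CU chiplets in a ring, 4 PIMs, each PIM's 2 ports going to two
# adjacent CU chiplets. Purely illustrative wiring.
links = {f"CU{i}": set() for i in range(4)}
links.update({f"PIM{i}": set() for i in range(4)})

def connect(a, b):
    links[a].add(b)
    links[b].add(a)

for i in range(4):
    connect(f"CU{i}", f"CU{(i + 1) % 4}")    # CU ring
    connect(f"PIM{i}", f"CU{i}")             # PIM port 1
    connect(f"PIM{i}", f"CU{(i + 1) % 4}")   # PIM port 2

def hops(src, dst):
    """Breadth-first search: minimum number of links between two chiplets."""
    queue, seen = deque([(src, 0)]), {src}
    while queue:
        node, dist = queue.popleft()
        if node == dst:
            return dist
        for peer in links[node] - seen:
            seen.add(peer)
            queue.append((peer, dist + 1))

print(max(hops(f"CU{c}", f"PIM{p}") for c in range(4) for p in range(4)))  # -> 2
```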

If the CUs are 3-ported then each CU has two neighbouring CUs and one PIM as its peers.

So there's a latency/bandwidth trade-off in sizing those hops. When a node is acting as a relay for another node, that implies an overhead on each port for peer traffic, which in itself is substantial.

Obviously GPUs don't mind latency in general. So then bandwidth is the real question. In graphics, ROPs are the primary bandwidth hog, but too much imbalance in favour of intra-PIM bandwidth is going to hurt compute.

This is the fundamental question that we're really struggling for data on, so I'm not trying to suggest it's easy. Obviously an interposer is seen as a solution to bandwidth amongst chips. I don't share your utter disdain for interposers, for what it's worth. They, or something very similar, are the target of lots of research simply because the pay-off is so great.

The EDRAM was sized to be able to support a worst-case amplification of that 32GB/s payload to 256GB/s internally, although that included the choice to not use the compression techniques available at the time. That 8x amplification of export bus to ROP traffic might not hold as strongly once basic color sample compression or now DCC come into play.
Not using all channels or not having all lanes marked valid could also be an opportunity for an external bus to compact or compress data, although it would make less sense as long as the export bus is on-die.
I agree with all that. There will be some residual amplification. In my proposal with binned rasterisation actually occurring in the PIMs, this also implies a particular kind of vertex traffic taking a "long trip" whereas in current GPUs the trip is a bit shorter (not much, though, it's still chip-wide traffic - in Vega all that traffic uses the single L2 as a staging point). So the vertex traffic adds some pressure.

Something like the 32-pixel tiles mentioned in the Vega ISA for the shader engine coverage instruction?
I suppose so.

Is there some other pool of L2 on the other side of the connection to the PIM, in order to maintain some amount of bandwidth amplification and coherence?
There will be some kind of memory, even if solely to assemble packets.

I think it would be useful to think in terms of bandwidth amplification and coherence separately. ROPs and BR rely entirely upon both of these things. I think it gets quite hard to classify how the other clients depend on either of these factors. Vega now provides L2 as backing for "all" clients (obviously TEX/Compute have L1, so they're partially isolated). For example, texels are fairly likely to only ever see a single CU in any reasonably short (intra-frame) period of time. So that's not amplification. It's barely coherence, too. And so on.

I'm feeling too ill to spend an hour or two thinking about the ins-and-outs of cache usage factors on a client by client basis. With a full classification of cache clients I'm not even sure we'd have something meaningful for the next step of speculation.

I'm now wondering what fixed-function units should be in PIM, beyond ROPs and BR. With AMD saying that all graphics blocks in Vega are clients of L2, and assuming that L2 is in PIM, it would seem there needs to be a really good reason to place a block in a CU chiplet instead.

The L2's coherence is based on a pretty simple assumption that a given entry can only exist in one specific L2 slice. Caching another chip's location would break that assumption, but addresses are usually interleaved between channels to get more even bandwidth utilization. One or the other would need adjustment.
Remember these are fairly coarse tilings of texture and render targets. Large bursts to/from memory imply substantial tiles.
 