AMD: Navi Speculation, Rumours and Discussion [2017-2018]

Discussion in 'Architecture and Products' started by Jawed, Mar 23, 2016.

Thread Status:
Not open for further replies.
  1. 3dilettante

    Legend Alpha

    Joined:
    Sep 15, 2003
    Messages:
    8,122
    Likes Received:
    2,873
    Location:
    Well within 3d
    Having the ROPs sit separately under a stack while the rest of the GPU is on another chiplet raises the question of how much traffic goes back and forth to them, given the current assumption that it all stays on-chip.
    How many bytes can a CU export per cycle? Each SE's back end for a large GPU can manage 128B/cycle of peak pixel export, which in peak scenarios would allow one SE to mostly saturate an interconnect whose bandwidth is matched to an HBM2 stack's. That might not include depth, which can still leave the shader array even if the shader itself doesn't write it directly.
    That doesn't take into account any values travelling in the other direction, if depth/HiZ data comes back up from that section, or any other internal sideband traffic that now moves off-chip.
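    A quick back-of-envelope sketch of that claim; the core clock and HBM2 figures are assumptions for illustration, not vendor specs:

    ```python
    # One shader engine's 128 B/cycle peak pixel export against the
    # bandwidth of a single HBM2 stack (assumed numbers).
    export_bytes_per_cycle = 128        # per SE, from the figure above
    gpu_clock_hz = 1.5e9                # assumed core clock

    se_export_bw = export_bytes_per_cycle * gpu_clock_hz   # bytes/s
    hbm2_stack_bw = 256e9               # ~256 GB/s for a 1024-bit, 2 Gb/s stack

    print(f"SE export:  {se_export_bw / 1e9:.0f} GB/s")
    print(f"HBM2 stack: {hbm2_stack_bw / 1e9:.0f} GB/s")
    print(f"one SE fills {se_export_bw / hbm2_stack_bw:.0%} of a stack-matched link")
    ```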

    It would be a change requiring some complex evaluation, weighing the amplification of the ROP caches against the possibly higher demands of operations like blending or multiple targets/samples. I'm not sure it's stated which side of the export process some of those might be held on.
    One tension is that even localized overdraw intercepted by a check at the depth buffer becomes more important, since it would no longer be hidden on-chip. Feeding the front end with more accurate depth information would cost bandwidth in ways it didn't before.

    It could sprout another set of buffers/compressors on either side, although wouldn't that also require walking back Vega's making the RBEs L2 clients?

    I suppose the ideal would be if there were a way to put the ROPs and just enough respective producers/consumers on the same die to keep things on-chip, like having some but not all CUs and control logic near them. Otherwise, making the ROPs more independent/programmable or replacing them with programmable logic might make the balancing act easier conceptually.
     
  2. Ethatron

    Regular Subscriber

    Joined:
    Jan 24, 2010
    Messages:
    859
    Likes Received:
    262
    Not sure if there's a misunderstanding on my side, but all the 2D resources for a GPU are 2D-tiled and completely separable. The metadata for depth/color compression is also tiled and separable. I see no problem with concurrent processing of different areas of the same resources by different chiplets, because all the access is through state-enforced locality. There is exactly one edge case, UAVs that overlap in time or in location, and that is either restrictable by "Concurrent Access Views" or addressable by a tuned coherency protocol. It's not difficult to rethink algorithms so they don't use the most contentious and hazardous memory access pattern.
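    A minimal sketch of what that state-enforced locality could look like; the tile size and chiplet count are picked purely for illustration:

    ```python
    # A 2D render target split into tiles, with each tile (and its
    # compression metadata) owned by exactly one chiplet, so concurrent
    # writes to different areas never overlap.
    TILE_W, TILE_H = 64, 64
    NUM_CHIPLETS = 4

    def owning_chiplet(x: int, y: int) -> int:
        """Map a pixel to the chiplet owning its tile (checkerboard interleave)."""
        tx, ty = x // TILE_W, y // TILE_H
        return (tx + ty) % NUM_CHIPLETS

    # Pixels in the same tile always resolve to the same owner; pixels in
    # different tiles may be processed by different chiplets concurrently.
    assert owning_chiplet(10, 10) == owning_chiplet(20, 30)   # same tile
    print(owning_chiplet(10, 10), owning_chiplet(100, 10))    # different tiles
    ```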
     
    Anarchist4000 and Gubbi like this.
  3. Jawed

    Legend

    Joined:
    Oct 2, 2004
    Messages:
    10,873
    Likes Received:
    767
    Location:
    London
    I have flu and bed beckons, so this will be quick:
    This is why Xenos worked quite well...

    For graphics, shader export only has to be able to keep up with rasterisation rate. If you're foolish enough to generate gargantuan fragments and your pixel shader is short enough to run at rasterisation rate, then yes the GPU's going to apply the brakes.

    Cop-out time: I'm assuming that hierarchical-Z will disappear in Navi, in favour of DSBR. Depth at fragment precision compresses really well, so it's practical to fetch a tile of depth in the bin-preparation phase.

    DSBR should read depth and write depth once per tile. The aggregate bytes of depth written in these conditions should easily beat hierarchical-Z queries/updates. Every fragment-quad that is despatched for shading must also have its depth written to hierarchical-Z.

    DSBR, per batch, will only read/write depth once, but hierarchical-Z and colour-write will operate many times in the worst case: triangles in back-to-front order.
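    A toy comparison of per-tile depth traffic under those two regimes; tile size, depth format, and the overdraw factor are all assumptions:

    ```python
    # DSBR touches the depth tile once per binned batch, while the
    # worst-case immediate-mode pass re-reads/re-writes per layer of
    # back-to-front coverage (ignoring compression for simplicity).
    TILE_PIXELS = 32 * 32
    DEPTH_BYTES = 4            # 32-bit depth
    overdraw = 6               # assumed layers of back-to-front coverage

    dsbr_bytes = 2 * TILE_PIXELS * DEPTH_BYTES            # one read + one write
    imr_bytes  = 2 * TILE_PIXELS * DEPTH_BYTES * overdraw # touched per layer

    print(f"DSBR: {dsbr_bytes} B/tile, immediate mode: {imr_bytes} B/tile "
          f"({imr_bytes / dsbr_bytes:.0f}x)")
    ```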

    I've long posited that L2 would be in the base logic die beside the ROPs in the "lightweight" PIM scenario...

    D3D12 has programmable blending. But, ignoring that, DSBR (fine-grained portion) can actually be done in PIM too. Shader export vertex data for coarse-grained binning, followed by sub-streams of vertices (and their attributes) despatched to each chiplet that owns the tile and does BR (i.e. fine-grained) and ROP.

    So each chiplet with its stacked memory has, in its logic die, L2 that supports various tile sizes. The tile size for both BR and ROP is affected in the same way by pixel format: fatter pixels mean smaller tiles for both binning and colour/Z buffers.
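    A small sketch of that trade-off, assuming a fixed per-tile buffer in the logic die (the buffer size is made up):

    ```python
    # With fixed per-tile storage, the tile's pixel dimensions shrink as
    # bytes-per-pixel grow.
    from math import isqrt

    TILE_BUFFER_BYTES = 16 * 1024   # assumed per-tile storage

    for bpp in (4, 8, 16):          # RGBA8, RGBA16F, RGBA32F
        pixels = TILE_BUFFER_BYTES // bpp
        side = isqrt(pixels)        # square tile edge, rounded down
        print(f"{bpp:2d} B/pixel -> ~{side}x{side} tile")
    ```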

    EDIT: Transmitting vertices and their attributes from the PIM chiplets to the shader chiplets is work. Which is why it would be good to defer attribute computation until you have decided to shade any of that triangle's fragments. Which is one of the puzzle pieces of primitive shader.
     
    #283 Jawed, Dec 12, 2017
    Last edited: Dec 12, 2017
    Gubbi likes this.
  4. itsmydamnation

    Veteran Regular

    Joined:
    Apr 29, 2007
    Messages:
    1,298
    Likes Received:
    396
    Location:
    Australia
    Initially, for a first iteration, I would expect something like 2x 300mm² on 7nm, with 2 stacks of HBM per chip. I think the redundant-silicon concern is overstated (initially); ultimately you need to build a desirable product, and being able to hit a performance tier others can't early in 7nm's life would make for one of those desirable products.

    I would expect it to treat them both the same. For that to work, we would need to understand how much locality already exists or could easily exist, and what control there is over scheduling, memory address allocation, etc. If locality is really 50% remote / 50% local and there is nothing you can do about it, then my idea has a big problem. But I find that hard to believe.
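    A minimal sketch of why that split matters, with assumed local-stack and inter-chiplet link bandwidths:

    ```python
    # A chiplet pulling its full local-stack bandwidth's worth of data,
    # with some fraction of that traffic actually living on the other
    # chiplet. All numbers are assumptions.
    demand_bps = 512e9      # 2 local HBM stacks' worth of demand
    link_bps = 256e9        # assumed inter-chiplet link

    for local_fraction in (1.0, 0.75, 0.5):
        crossing = demand_bps * (1.0 - local_fraction)
        print(f"{local_fraction:.0%} local -> {crossing / 1e9:.0f} GB/s over "
              f"the link ({crossing / link_bps:.0%} of its capacity)")
    # At 50/50 the assumed link is already saturated, which is the
    # "big problem" case above.
    ```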


    None of this sounds particularly horrendous; if they are already incoherent, sharing them across 2 chiplets doesn't sound that bad. I assume that all coherent data is ultimately the responsibility of the L2.

    There are two things to care about, right: average and peak average. I assume the 2-3x for the internal structure is because it all runs on a lock-step clock and has to handle that peak average. Going across chips, there will have to be some form of boundary between the PHY/its buffers and the internal crossbar. At that point you could decouple and run a dynamic external frequency (I believe EPYC does this already); then that bandwidth becomes less of an issue and it's more about power consumption when it sits at peak bandwidth for sustained periods.

    To me, going for 2 chiplets seems significantly easier than >2, and if this is the path AMD is taking, it should be exploited until it is no longer an advantage. The other thing that would be very cool (not that I think they would do it), if they did go down this path, is if the RR follow-on had some form of optional HBM (1 stack, something like Samsung's low-cost HBM proposal) plus an optional inter-chiplet link. Then AMD could do an APU, an APU + HBM, and an APU + chiplet + 2x HBM. I think AMD is trying to drive the Si interposer (via HBM) because they are very actively trying to drive a direction.
     
  5. Geeforcer

    Geeforcer Harmlessly Evil
    Veteran

    Joined:
    Feb 6, 2002
    Messages:
    2,297
    Likes Received:
    464
    The Vega “sold out” narrative would be impressive if there was any indication that it was actually available in volume. The fact that AMD has been losing market share ever since its introduction underscores just how much of a “success” it has really been. It’s Fury all over again: “sold out” for months after launch, with very few units actually moving when all was said and done.
     
    DrYesterday and xpea like this.
  6. 3dilettante

    Legend Alpha

    Joined:
    Sep 15, 2003
    Messages:
    8,122
    Likes Received:
    2,873
    Location:
    Well within 3d
    The RAW hazard handling for resources that have associated metadata includes flushing the affected caches. Vega can avoid flushing its L2 if its alignment is respected, but the metadata cache would still be flushed. This is a heavier operation, since it is a pipeline flush that includes, among other things, a microcode engine stall to avoid a race condition within the command processor. The operation and its front-end stall are not quite transparent in the single-GPU case to start with.

    To clarify, is the chiplet's stacked memory part of the GPU's general memory pool, or is there more than one path to the stack? Xenos had a dedicated path for its GPU output, separate from and higher-bandwidth than the DDR bus.
    My interpretation of the stacked solution is that there would be one signal path to the stack, which the other proposals matched with the HBM's bandwidth.

    The EDRAM was sized to be able to support a worst-case amplification of that 32GB/s payload to 256GB/s internally, although that included the choice to not use the compression techniques available at the time. That 8x amplification of export bus to ROP traffic might not hold as strongly once basic color sample compression or now DCC come into play.
    Not using all channels or not having all lanes marked valid could also be an opportunity for an external bus to compact or compress data, although it would make less sense as long as the export bus is on-die.
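    Putting those Xenos figures together, plus what an assumed compression ratio would do to the amplification the bus has to absorb (the ratios are illustrative, not measured):

    ```python
    # Export-bus-to-EDRAM amplification, uncompressed and with assumed
    # colour-compression ratios applied to the internal ROP traffic.
    export_bus_bps = 32e9       # Xenos export bus
    edram_internal_bps = 256e9  # worst-case internal ROP traffic

    print(f"uncompressed amplification: {edram_internal_bps / export_bus_bps:.0f}x")

    for ratio in (2.0, 4.0):    # assumed compression ratios
        amplified = edram_internal_bps / ratio
        print(f"{ratio:.0f}:1 compression -> {amplified / export_bus_bps:.0f}x")
    ```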

    Something like the 32-pixel tiles mentioned in the Vega ISA for the shader engine coverage instruction?

    Is there some other pool of L2 on the other side of the connection to the PIM, in order to maintain some amount of bandwidth amplification and coherence?

    That seems a touch large if we are to believe AMD's projection of cost/mm² for a 250mm² die on 7nm, assuming they aren't being overly pessimistic. If this is a Fiji-like arrangement (~600mm² GPU, 4 stacks), it presumably includes some additional improvements like a larger interposer or some other means of making the larger HBM2 stacks fit, or some kind of multi-exposure interposer like that used for GP100.
    The bandwidth between chips would likely be proportional to the HBM interfaces, since it would favor a wide and slow connection, and the HBM stacks dominate the length of a die edge.

    Why the elements are incoherent, and whether something else can use a given piece of data, can determine how expensive they are. A fair number of them assume they won't be shared, so tidying up the architecture would mean making sure the data stays private. Other cases, like the caches and queues involved in geometry setup, are global and visible, just with hardwired assumptions, such as the limits on shader engine counts, that we don't know how AMD intends to deal with in a scalable fashion.

    The L2's coherence is based on a pretty simple assumption that a given entry can only exist in one specific L2 slice. Caching another chip's location would break that assumption, but addresses are usually interleaved between channels to get more even bandwidth utilization. One or the other would need adjustment.
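    A sketch of that invariant, with illustrative line size and slice count:

    ```python
    # A physical address hashes to exactly one L2 slice, with cache lines
    # interleaved across slices for even bandwidth utilization.
    LINE_BYTES = 64
    NUM_SLICES = 16             # assumed slice count

    def l2_slice(addr: int) -> int:
        """The one slice allowed to hold this line; no other slice may cache it."""
        return (addr // LINE_BYTES) % NUM_SLICES

    # Consecutive lines spread across slices (the bandwidth interleave)...
    print([l2_slice(a) for a in range(0, 4 * LINE_BYTES, LINE_BYTES)])
    # ...and the invariant breaks as soon as a second chip's L2 is also
    # allowed to hold the same line, which is the adjustment mentioned above.
    ```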

    The L2's bandwidth is such because it is supposed to provide bandwidth amplification over what the memory bus can provide, and the many L1s are write-through and very small. The L1s are there, at most, to compensate for L2 latency and to coalesce the most recent vector loads, rather than to shield the L2 from bandwidth consumption.
    Vega now has the L2 servicing the ROPs as well, which is a new client with an unclear impact at this point.

    The fabric is pretty consistent about memory and link bandwidth being equivalent with nice ratios of clock and link width. xGMI and its more stringent error margins are the outlier.
    Lower activity might scale down some elements, but GPU loads are miss-prone and the caches themselves are not as good about reducing misses.

    A specific solution that slots into a Fiji-like package would at least be somewhat plausible as a product. Just by opting for an interposer, it rules anything greater out and leaves the one-chip solution saddled with an interposer. AMD's estimates might point to this being somewhat Vega-like in cost for a single-chip implementation. It's not EPYC-like at this point, which I think we agree upon.
    I've commented on this before, but AMD uses "chiplet" for a specific form of chip that cannot function on its own without a specialized interposer and secondary silicon either mounted alongside or operating in the interposer itself. It's even less like EPYC and even further out in terms of when it might be feasible. The more success the alternatives to interposers find, the greater the chance that AMD is speculating down a wasted path.
     
    ImSpartacus likes this.
  7. Ethatron

    Regular Subscriber

    Joined:
    Jan 24, 2010
    Messages:
    859
    Likes Received:
    262
    Can you explain the cross-chiplet RAW hazard you have in mind? Transitioning a render target/depth-stencil from being written to being read by texture filtering has such massive overhead - decompression - that I cannot imagine anything else competing with that. I'm not including the order-guaranteeing DX11 paradigms of past ages in my thoughts about how this could work, but rather DX12-style multi-chiplet barrier support for shared resources and direct concurrent graphics-queue usage patterns.
     
  8. 3dilettante

    Legend Alpha

    Joined:
    Sep 15, 2003
    Messages:
    8,122
    Likes Received:
    2,873
    Location:
    Well within 3d
    The inciting hazard is not specific to multi-chip setups; it's just intra-frame readback of resources that have metadata.
    The flush of affected caches is not transparent to the driver, and includes queuing a command for the command processor to stall, in order to prevent a race condition with one of several sub-components running ahead through the command stream.
    Managing it isn't transparent to the driver or software, whereas having Navi's chips behave as if they were one unit was one direction proposed for an EPYC-like solution.
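    As hedged pseudocode (the names are hypothetical, and this mirrors only the ordering described above, not any real driver API):

    ```python
    # Illustrative driver-side sequence for an intra-frame RAW hazard on
    # a metadata-backed resource, single-GPU case.
    class CmdStream:
        def emit(self, op: str) -> None:
            print("queue:", op)

    def readback_barrier(cmd: CmdStream, l2_aligned: bool) -> None:
        cmd.emit("flush metadata cache")        # always flushed
        if not l2_aligned:
            cmd.emit("flush L2")                # Vega can skip this if aligned
        # Stall the command processor's microcode engine so no sub-component
        # runs ahead through the command stream during the flush.
        cmd.emit("stall command processor")
        cmd.emit("decompress resource")         # e.g. RT -> texture readback

    readback_barrier(CmdStream(), l2_aligned=True)
    ```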
     
  9. Ethatron

    Regular Subscriber

    Joined:
    Jan 24, 2010
    Messages:
    859
    Likes Received:
    262
    Oh, I see. I think that's a weird thing to want, because you lose all the flexibility it could give.
    On the other hand, EPYC doesn't look like one CPU core made transparently wider. Maybe that "complete" degree of transparency wasn't exactly what was asked of chiplets (or threadlets, corelets, modulets), but rather something more realistic. :) Aren't we halfway into Lego silicon anyway?
     
  10. Jawed

    Legend

    Joined:
    Oct 2, 2004
    Messages:
    10,873
    Likes Received:
    767
    Location:
    London
    In Infinity Fabric, compute nodes are multi-ported. The question is whether the base logic die under a stack of memory is a compute node or a memory node in an IF system.

    If we treat it as PIM, it's a compute node, perhaps implying it's multi-ported. The port count might be low (2, say) just because the only peers are CU chiplet(s). I'm assuming there's no reason for PIMs to talk directly to each other. There's likely a maximum of 8 CU chiplets, and we might not see more than 4 CU chiplets until whatever comes after Navi. For a package consisting of 4 CU chiplets and 4 PIMs, 2 ports on each PIM would make all CUs a maximum of two hops from memory.

    If the CUs are 3-ported then each CU has two neighbouring CUs and one PIM as its peers.
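    A small breadth-first-search sketch of one wiring consistent with 2-ported PIMs and no PIM-to-PIM links; the wiring itself is an assumption, so edit `links` to explore the 3-ported-CU variant or other layouts:

    ```python
    # 4 CU chiplets in a ring, each 2-ported PIM attached to two
    # non-adjacent CUs; BFS reports the worst CU-to-memory distance.
    from collections import deque

    links = []
    for i in range(4):
        links.append((f"CU{i}", f"CU{(i + 1) % 4}"))      # CU ring
        links.append((f"PIM{i}", f"CU{i}"))               # PIM port 1
        links.append((f"PIM{i}", f"CU{(i + 2) % 4}"))     # PIM port 2

    adj = {}
    for a, b in links:
        adj.setdefault(a, set()).add(b)
        adj.setdefault(b, set()).add(a)

    def hops(src: str, dst: str) -> int:
        """Shortest link count between two nodes (breadth-first search)."""
        seen, q = {src}, deque([(src, 0)])
        while q:
            node, d = q.popleft()
            if node == dst:
                return d
            for n in adj[node] - seen:
                seen.add(n)
                q.append((n, d + 1))
        raise ValueError("unreachable")

    worst = max(hops(f"CU{c}", f"PIM{p}") for c in range(4) for p in range(4))
    print(f"worst-case CU-to-memory distance: {worst} hops")   # 2 for this wiring
    ```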

    So there's a latency/bandwidth trade-off in sizing those hops. When a node is acting as a relay for another node, that implies an overhead on each port for peer traffic, which in itself is substantial.

    Obviously GPUs don't mind latency in general. So then bandwidth is the real question. In graphics, ROPs are the primary bandwidth hog, but too much imbalance in favour of intra-PIM bandwidth is going to hurt compute.

    This is the fundamental question that we're really struggling for data on, so I'm not trying to suggest it's easy. Obviously an interposer is seen as a solution to bandwidth amongst chips. I don't share your utter disdain for interposers, for what it's worth. They, or something very similar, are the target of lots of research simply because the pay-off is so great.

    I agree with all that. There will be some residual amplification. In my proposal with binned rasterisation actually occurring in the PIMs, this also implies a particular kind of vertex traffic taking a "long trip" whereas in current GPUs the trip is a bit shorter (not much, though, it's still chip-wide traffic - in Vega all that traffic uses the single L2 as a staging point). So the vertex traffic adds some pressure.

    I suppose so.

    There will be some kind of memory, even if solely to assemble packets.

    I think it would be useful to think in terms of bandwidth amplification and coherence separately. ROPs and BR rely entirely upon both of these things. I think it gets quite hard to classify how the other clients depend on either of these factors. Vega now provides L2 as backing for "all" clients (obviously TEX/Compute have L1, so they're partially isolated). e.g. Texels are fairly likely to only ever see a single CU in any reasonably short (intra-frame) period of time. So that's not amplification. It's barely coherence, too. And so on.

    I'm feeling too ill to spend an hour or two thinking about the ins-and-outs of cache usage factors on a client by client basis. With a full classification of cache clients I'm not even sure we'd have something meaningful for the next step of speculation.

    I'm now wondering what fixed-function units should be in PIM, beyond ROPs and BR. With AMD saying that all graphics blocks in Vega are clients of L2, and assuming that L2 is in PIM, it would seem there needs to be a really good reason to place a block in a CU chiplet instead.

    Remember these are fairly coarse tilings of texture and render targets. Large bursts to/from memory imply substantial tiles.
     
  11. Nemo

    Newcomer

    Joined:
    Sep 15, 2012
    Messages:
    125
    Likes Received:
    23
  12. Picao84

    Veteran Regular

    Joined:
    Feb 15, 2010
    Messages:
    1,551
    Likes Received:
    695
  13. Nemo

    Newcomer

    Joined:
    Sep 15, 2012
    Messages:
    125
    Likes Received:
    23
    Vega is Vega. Vega 20 is gfx9.
     
    Jawed and Picao84 like this.
  14. Nemo

    Newcomer

    Joined:
    Sep 15, 2012
    Messages:
    125
    Likes Received:
    23
  15. Malo

    Malo Yak Mechanicum
    Legend Veteran Subscriber

    Joined:
    Feb 9, 2002
    Messages:
    7,029
    Likes Received:
    3,101
    Location:
    Pennsylvania
    Mind. Blown.
     
  16. yuri

    Newcomer

    Joined:
    Jun 2, 2010
    Messages:
    178
    Likes Received:
    147
  17. sheepdogexpress

    Newcomer

    Joined:
    Mar 10, 2012
    Messages:
    86
    Likes Received:
    11
  18. Bondrewd

    Regular Newcomer

    Joined:
    Sep 16, 2017
    Messages:
    520
    Likes Received:
    239
    Does RTG have any other teams left?
     
  19. CarstenS

    Veteran Subscriber

    Joined:
    May 31, 2002
    Messages:
    4,798
    Likes Received:
    2,056
    Location:
    Germany
    The Orlando guys?
     
  20. Kaotik

    Kaotik Drunk Member
    Legend

    Joined:
    Apr 16, 2003
    Messages:
    8,183
    Likes Received:
    1,840
    Location:
    Finland
    Has any RTG team ever been cut?
     