AMD: Navi Speculation, Rumours and Discussion [2017-2018]

Discussion in 'Architecture and Products' started by Jawed, Mar 23, 2016.

Thread Status:
Not open for further replies.
  1. silent_guy

    Veteran Subscriber

    Joined:
    Mar 7, 2006
    Messages:
    3,754
    Likes Received:
    1,379
    Que???

    Of course they're studying it: they're already bumping against the reticle limit for their jumbo chips. But there's no good reason yet to do so for lower performance versions.

    Last time I checked a GP102 was less than 500mm2. They have room for a next generation even while staying with 16/12nm! So for 7nm, they have plenty of room without multi-die shenanigans for one or maybe even two generations.

If AMD chooses the multi-die model for 7nm anyway, it'd be very similar to HBM: using a technology with future potential way before it makes sense to do so. It may earn them brownie points with the press and some fans for being courageous and innovative, but we all know who ran away with the real brownies.
     
    xpea likes this.
  2. giannhs

    Newcomer

    Joined:
    Sep 4, 2015
    Messages:
    37
    Likes Received:
    40
I think this can clarify a bit what AMD is currently doing:
    https://www.nextplatform.com/2017/07/12/heart-amds-epyc-comeback-infinity-fabric/
     
  3. Gubbi

    Veteran

    Joined:
    Feb 8, 2002
    Messages:
    3,528
    Likes Received:
    862
    EPYC uses PCIe lanes for inter-die communications. It runs at slightly higher speed than standard PCIe because signals don't go off substrate. I don't think that's what AMD envisions for their future multi-die GPUs.

Traditionally, multi-GPU setups need to replicate data in each GPU's private memory. That's not the case with AMD's HBCC though: each GPU's chunk of HBM2 only holds data from that GPU's working set.

I could imagine a GPU consisting of multiple dies, each die connected to a single stack of HBM2, with additional PHYs for connecting to neighbouring dies. Since this is on a silicon interposer, the links could be very wide, very high bandwidth and very low energy per bit.

    Cheers
     
  4. Rootax

    Veteran Newcomer

    Joined:
    Jan 2, 2006
    Messages:
    1,173
    Likes Received:
    576
    Location:
    France
Do we know how much of Vega we will find in Navi? Or, with the multi small dies thing, can we assume that it's a brand new chip?
     
  5. ToTTenTranz

    Legend Veteran Subscriber

    Joined:
    Jul 7, 2008
    Messages:
    9,971
    Likes Received:
    4,565
    I don't think AMD has the resources to make two big ISA jumps in a row. It'll definitely be a new chip, but I bet it'll be GFX9 or GFX9.x.
     
  6. Rootax

    Veteran Newcomer

    Joined:
    Jan 2, 2006
    Messages:
    1,173
    Likes Received:
    576
    Location:
    France
Didn't they have like 2 teams working in "cycle" at one point? Like, while one team is working on X, the other team is already working on Y? I guess I'm wrong, or it was years and years ago.
The sad thing is Vega doesn't look like a big ISA jump performance-wise... Anyway, it's another story for another topic.

    Thx for your answer ToTTenTranz.
     
  7. sebbbi

    Veteran

    Joined:
    Nov 14, 2007
    Messages:
    2,924
    Likes Received:
    5,288
    Location:
    Helsinki, Finland
There are only 40 new instructions in the GCN5 ISA (http://diit.cz/clanek/architektura-radeon-rx-vega). Vega isn't a huge ISA jump. Vega seems to focus more on the graphics side (tiled rasterizer, ROP caches, geometry pipes, etc.) and on optimizing power usage and raising the clocks. GCN3 was a much bigger ISA change.
     
  8. Rys

    Rys PowerVR
    Moderator Veteran Alpha

    Joined:
    Oct 9, 2003
    Messages:
    4,156
    Likes Received:
    1,433
    Location:
    Beyond3D HQ
There is very little about that graph and the links that marries up with reality. It simply does not cost that much, either in total or in some of the sub-costs identified, to produce even complex processor designs.
     
  9. Entropy

    Veteran

    Joined:
    Feb 8, 2002
    Messages:
    3,059
    Likes Received:
    1,021
It is very difficult as a layman to get a grasp on this.
How much did Vega cost? Then, having designed Vega, how much would it cost to design a smaller variant with 32 CUs and a single memory channel?
The graph doesn't make any distinction between these two cases, but the overall cost should differ greatly. But how much is "greatly" really, and what are the numbers involved?
    I'm frustrated by my lack of knowledge.
     
  10. ToTTenTranz

    Legend Veteran Subscriber

    Joined:
    Jul 7, 2008
    Messages:
    9,971
    Likes Received:
    4,565
    If only our collective frustrations could make Rys break his NDAs...
    ;)
     
    Cat Merc, Malo and AlBran like this.
  11. Rys

    Rys PowerVR
    Moderator Veteran Alpha

    Joined:
    Oct 9, 2003
    Messages:
    4,156
    Likes Received:
    1,433
    Location:
    Beyond3D HQ
    I've honestly got no idea how much Vega cost to get to this point (one ASIC shipped). But I do have a very good idea how much modern consumer SoCs cost (and especially the GPUs therein).

    The cost in producing the scaled variants of a processor design like a GPU or CPU is almost 100% verification, after you've designed and verified the base. Scaling it up or down has very little design cost and lots of verification cost.

    In terms of the actual dollar cost, lots of the above in the graph is wildly out. It simply does not cost $100M+ to verify a SoC (nowhere near!), and it does not shoot up like that as the node gets smaller.
     
    Prophecy2k, tinokun, T1beriu and 11 others like this.
  12. 3dilettante

    Legend Alpha

    Joined:
    Sep 15, 2003
    Messages:
    8,122
    Likes Received:
    2,873
    Location:
    Well within 3d
    Did you mean inter-socket communications? On-package links use different PHY and a lower speed. Regardless, the quoted power of 2 pJ/bit is high relative to links assumed to be used for something like an MCM GPU.

    AMD's chiplet scheme has no numbers, and at least for HPC a single interposer has two GPUs. Intra-interposer communication is described as using some kind of short-reach high-speed link, which might take things back up to an undesirable power range.

    HBM's pJ/bit is rather high, compared to some papers using interposers for communication. I'm not sure if that's accounting for other parts of the access process, however.
    AMD hasn't demonstrated or given projected power numbers for its project, and at least in terms of bump density the necessary improvements have not materialized. Interposer lines may be dense, but the ubump pitch has not improved much despite interposer proponents' promises. HBM's pitch is coarser than AMD's NOC paper hoped for, and the bandwidth numbers for that are relatively modest.
     
  13. Gubbi

    Veteran

    Joined:
    Feb 8, 2002
    Messages:
    3,528
    Likes Received:
    862
    You're right, I mixed them up. The inter-die links are 42GB/s (bi-directional), single ended instead of differential signalling. 2pJ is pretty good though, that's 4 watts for 250GB/s bandwidth.
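As a quick sanity check on that arithmetic (taking the quoted 2 pJ/bit figure at face value), power is just energy per bit times bits per second:

```python
# Sanity check on the 2 pJ/bit figure quoted above:
# power = energy_per_bit * bits_per_second
energy_per_bit_j = 2e-12           # 2 pJ/bit
bandwidth_bytes_s = 250e9          # 250 GB/s
bits_per_second = bandwidth_bytes_s * 8

power_watts = energy_per_bit_j * bits_per_second
print(power_watts)  # ~4 W for 250 GB/s at 2 pJ/bit
```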

    Cheers
     
  14. xEx

    xEx
    Regular Newcomer

    Joined:
    Feb 2, 2012
    Messages:
    939
    Likes Received:
    398
Btw, why has multi-GPU always been a failure while multi-die GPUs can succeed?
     
  15. BacBeyond

    Newcomer

    Joined:
    Jun 29, 2017
    Messages:
    73
    Likes Received:
    43
    mGPU requires developer and driver support. Same with the existing multiple separate dies on a GPU board like 295x2 or Fiji Pro Duo.

What NV and AMD are going to do in the future is have a single die, like Ryzen, that you can "glue" (thanks Intel!) together to make a bigger chip. These will communicate internally and not require per-game support from the devs / driver teams. They will work and function as a single GPU.
     
  16. silent_guy

    Veteran Subscriber

    Joined:
    Mar 7, 2006
    Messages:
    3,754
    Likes Received:
    1,379
    I don't think that answers the original question. ;-)

Motherboards with multiple CPUs have existed for decades and worked quite well. Adding multiple CPU dies on the same substrate is almost the same thing: the interface between them just has higher BW and there's some cache coherency protocol (I think).

    It's not at all clear to me that the same can be done for GPUs without a massive BW interface between GPUs, and what the cost of that would be.
     
    Cat Merc likes this.
  17. BacBeyond

    Newcomer

    Joined:
    Jun 29, 2017
    Messages:
    73
    Likes Received:
    43
    How does it not answer his question?

His question was "How is it different from what we have now for multiple GPU support?" The answer is: instead of requiring developer / extra driver hacks, it will work as a single GPU and not multiple. Say it's 1024 cores per "GPU"; the system would see one 2048-core GPU instead of two 1024-core ones.
     
    ToTTenTranz likes this.
  18. CarstenS

    Veteran Subscriber

    Joined:
    May 31, 2002
    Messages:
    4,798
    Likes Received:
    2,056
    Location:
    Germany
    It will work, because it already does - we have multiple µGPUs called SMs or CUs already working together on the final picture. One problem was moving this off-card (and no, dual-GPU cards using a PCIe-switch were not inherently better at this). With the discussed solution, we're getting one step closer to on-die integration. If that'll be enough for all applications? Who knows.

In fact, even multiple graphics cards used for mining or rendering (Blender etc.) or the accelerators in supercomputers do work together very well already. The culprit is gaming: vendors insisted on maximum length of benchmark bars for gaming and focused on AFR, which in turn introduces a whole load of troubles of its own.

In the early days, screen partitioning in one way or another was the method of choice, and it worked rather well, at least compared to AFR-style mGPU. The problem is/was: how do you market all the hassle of two or more GPUs [cost (2 cards, mainboards with 2x PEG, PSU, electricity) and noise] when you won't get 2× performance, while your competitor might actually do that by accepting all that is bad in mGPU (aka AFR)? That's what broke MGPU's neck in gaming, IMHO.
     
    Cat Merc, Kej and Lightman like this.
  19. sebbbi

    Veteran

    Joined:
    Nov 14, 2007
    Messages:
    2,924
    Likes Received:
    5,288
    Location:
    Helsinki, Finland
    I believe it could be possible, assuming of course that there's shared memory (like in multi-socket CPU configs)...

    Let's talk about traditional vertex shader + pixel shader pipeline first. In this case your inputs are commonly RO (buffers and textures). GPU can simply cache them separately. No coherence is needed. Output goes from pixel shader to ROP which does the combine. There's no programmable way to read the render target while you are rendering to it. Tiled rasterizer splits triangles to tiles and renders tiles separately. You need to have more tiles in flight to saturate a wider GPU. This should work also seamlessly for two GPUs with shared memory. If they are processing different set of tiles, there's no hazards. Tile buffers obviously need to be flushed to memory after finishing them, but I would assume that this is the common case in single GPU implementation as well (if the same tile is rendered again twice, why is it split in the first place?).
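The tile-distribution idea above can be sketched as follows (a toy model with made-up tile/GPU naming, not any actual driver or hardware logic): screen-space tiles are dealt out so that no two dies ever own the same tile, so there are no write hazards.

```python
# Hypothetical sketch: distributing screen tiles across two GPU dies.
# Each tile is owned by exactly one die, so ROP writes never conflict.
def assign_tiles(tiles_x, tiles_y, num_gpus=2):
    """Round-robin tile assignment; any disjoint partition works."""
    assignment = {}
    for ty in range(tiles_y):
        for tx in range(tiles_x):
            assignment[(tx, ty)] = (ty * tiles_x + tx) % num_gpus
    return assignment

tiles = assign_tiles(4, 4)
# No tile belongs to two dies, and both dies get an equal share of work:
per_gpu = [sum(1 for g in tiles.values() if g == i) for i in range(2)]
print(per_gpu)  # [8, 8]
```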

Now let's move to the UAVs. This is obviously more difficult. However, it is worth noticing that DirectX (and other APIs) by default only mandate that writes by a thread group are visible to that same thread group. This is what allows the GCN CU L1 cache to be incoherent with other CU L1 caches. You need to combine the writes at some point, if a cache line was partially written, but you can simply use a dirty bitmask for that. There's no need for a complex coherency protocol. It's undefined behavior if two thread groups (potentially executing on different CUs) write to the same memory location (group execution order isn't guaranteed and memory order isn't guaranteed = race condition). If we forget that atomics and the globallycoherent UAV attribute exist, we simply need to combine a partially dirty cache line with existing data (using a bit mask) when it is written to memory.
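The partial-line combine described above can be modelled like this (a toy sketch, not real hardware): each cache line carries a per-byte dirty mask, and on writeback only the dirty bytes overwrite memory, so no read-for-ownership or coherency traffic is needed.

```python
# Toy model of writing back a partially dirty cache line using a
# per-byte dirty bitmask, as described above (not real hardware).
LINE_SIZE = 64

def write_back(memory_line, cache_line, dirty_mask):
    """Merge only the bytes the CU actually wrote; keep the rest."""
    return bytes(
        cache_line[i] if (dirty_mask >> i) & 1 else memory_line[i]
        for i in range(LINE_SIZE)
    )

mem   = bytes(range(64))        # existing data in memory
cache = bytes([0xFF] * 64)      # the CU wrote 0xFF to some bytes
mask  = 0b1111                  # only the first 4 bytes are dirty

merged = write_back(mem, cache, mask)
print(merged[:6].hex())  # 'ffffffff0405'
```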

    Globallycoherent attribute for UAV is a difficult case. It means that UAV writes must be visible for other CUs after doing DeviceMemoryBarrierWithGroupSync. Groups can use it in combination with atomics to ensure data visibility between groups. However this isn't a common use case in current rendering code. For example Unreal Engine code base shows zero hits for "globallycoherent". Atomics however are used quite commonly in modern rendering code (without combining it with globallycoherent UAV). DirectX mandates that atomics are visible to other groups (even without a barrier). The most common use case is one global counter (atomic add), but you could do random access writes with atomics to a buffer or even a texture (both 2d and 3d texture atomics exist). But I would argue that the bandwidth used for atomics and globallycoherent UAVs is tiny compared to other memory accesses, meaning that we don't need full width bus between the GPUs (for transferring cache lines touched by these operations requiring coherency).
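To put the "tiny bandwidth" argument above into numbers (all workload figures here are made up, purely for illustration): compare the cache lines touched by global atomic counters against a frame's ordinary texture/RT/UAV traffic.

```python
# Back-of-envelope estimate (made-up workload numbers): what fraction of
# a frame's memory traffic actually needs inter-die coherency?
CACHE_LINE = 64

dispatches_per_frame = 200
atomic_lines_per_dispatch = 1   # e.g. one global counter per dispatch
coherent_bytes = dispatches_per_frame * atomic_lines_per_dispatch * CACHE_LINE

ordinary_bytes = 2 * 1024**3    # assume ~2 GB of ordinary traffic per frame

ratio = coherent_bytes / ordinary_bytes
print(f"{coherent_bytes} coherent bytes, {ratio:.2e} of total traffic")
```

Under these assumptions, the coherent traffic is on the order of millionths of the total, which is the intuition behind not needing a full-width coherent bus between the dies.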

    But these operations still exist and must be supported relatively efficiently. So it definitely isn't a trivial thing to scale to 2P GPU system with memory coherence and automatic load balancing (automatically split single dispatch or draw call to both).

However, if we compare CPU and GPU, I would argue that the GPU seems much simpler to scale up. The CPU is constantly accessing memory. Stack = memory. Compilers write registers to memory very frequently to pass them to function calls, to index them and to spill them. There are potential coherency implications on each read and write. GPU code on the other hand is designed to do far fewer memory operations. Most operations are done in registers and in groupshared memory. Writing a result to memory and immediately reading it back afterwards is not a common case. Most memory regions (resources) that are randomly accessed are marked as read only. Most resources that are written are marked as only needing group coherency (group = all threads executing on the same CU). Resources needing full real-time coherency between CUs and between multiple GPUs are rare, and most of these accesses are simple atomic counters (one cache line bouncing between GPUs). This is a much simpler system to optimize than CPUs.
     
    #99 sebbbi, Aug 4, 2017
    Last edited: Aug 4, 2017
    tinokun, Cat Merc, T1beriu and 8 others like this.
  20. Samwell

    Newcomer

    Joined:
    Dec 23, 2011
    Messages:
    112
    Likes Received:
    129
As I don't have insight, I can only take the public data, so it might be much too high.

100M might really be too high, but are you really sure that it does not shoot up per node? Your ex-company also showed numbers in which verification costs skyrocket, though admittedly at a much lower level. From 28nm to 16nm we have a doubling, and looking at the trend from 65nm this happened on every node jump. Looking at the 25M in this graph, I would expect 50M at 7nm. Also, I would expect verification cost to be bigger for a bigger chip, or am I wrong? So maybe in big chips of Vega's size you could even reach 100M. At least that would've been my layman's thought, that a chip much bigger than SoCs would cost way more to verify. Correct me if I'm wrong :-D
[Image: graph of design/verification cost per process node]

    https://www.imgtec.com/blog/imagination-tsmc-collaborate-on-iot-subsystems/
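The per-node doubling described above can be extrapolated directly (taking the graph's ~25M figure at 16nm at face value; this is just the poster's reasoning made explicit, not a verified cost model):

```python
# Extrapolating the per-node doubling trend described above,
# starting from the graph's ~25M figure at 16nm (assumption).
def extrapolate(cost_now, node_jumps, factor=2.0):
    """Project cost forward by the given number of node jumps."""
    return cost_now * factor ** node_jumps

cost_16nm = 25e6
cost_7nm = extrapolate(cost_16nm, 1)  # one node jump: 16nm -> 7nm
print(cost_7nm / 1e6)  # 50.0 (million)
```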
     
    #100 Samwell, Aug 4, 2017
    Last edited: Aug 4, 2017
    Kej likes this.