AMD: Navi Speculation, Rumours and Discussion [2017-2018]

Intel says hi.
What???

Not all memory, but regardless no one suggested otherwise.
Even Nvidia is studying ways to make multi-die GPUs. Several smaller chips connected through a very high-bandwidth fabric does seem like a strong direction for the future.
Of course they're studying it: they're already bumping against the reticle limit for their jumbo chips. But there's no good reason yet to do so for lower performance versions.

Last time I checked a GP102 was less than 500mm2. They have room for a next generation even while staying with 16/12nm! So for 7nm, they have plenty of room without multi-die shenanigans for one or maybe even two generations.

If AMD chose the multi-die model for 7nm anyway, it'd be very similar to HBM: using a technology with future potential way before it makes sense to do so. It may earn them brownie points with the press and some fans for being courageous and innovative, but we all know who ran away with the real brownies.
 
Epyc uses 4 links because it's what AMD deemed necessary for this specific use case. It doesn't mean a future multi-die GPU launching in 2019 in a different process would be limited to the same number of links.
It also doesn't mean Navi will have to use the same IF version that is available today. Hypertransport for example doubled its bandwidth between the 1.1 version in 2002 and the 2.0 in 2004.

Besides, isn't IF more flexible than just the GMI implementation? Isn't Vega using a mesh-like implementation that reaches about 512GB/s?


It could very well still be worth it. How much would they save by having all foundries manufacture a single GPU die, and by having their whole process optimization team work on the production of that sole die instead of splitting up across 4+ dies? How much would they save on yields thanks to that joint effort?
Maybe this 880mm^2 "total" combined GPU costs about as much to manufacture as, and performs as well as, a 600mm^2 monolithic GPU once all these factors are combined.
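
To put some rough numbers on that intuition, here's a minimal sketch using the classic Poisson defect-density yield model; the defect density and die areas below are assumptions for illustration only, and packaging/interposer costs are ignored.

```cpp
// Minimal sketch of the Poisson defect-density yield model often used for
// back-of-the-envelope die cost comparisons. Defect density and die areas
// are illustrative assumptions, not published figures.
#include <cmath>
#include <cstdio>

// Yield of a single die of `area_mm2` at defect density `d0` (defects/mm^2).
double poisson_yield(double area_mm2, double d0) {
    return std::exp(-d0 * area_mm2);
}

int main() {
    const double d0 = 0.002;            // assumed defects per mm^2 on a mature node

    const double mono_area  = 600.0;    // hypothetical monolithic die, mm^2
    const double small_area = 220.0;    // hypothetical multi-die building block, mm^2

    double y_mono  = poisson_yield(mono_area, d0);
    double y_small = poisson_yield(small_area, d0);

    // Silicon cost is roughly proportional to (total area) / yield.
    double cost_mono  = mono_area / y_mono;
    double cost_multi = 4.0 * small_area / y_small;   // 880 mm^2 total, built from 4 dies

    std::printf("monolithic yield %.1f%%, small-die yield %.1f%%\n",
                100.0 * y_mono, 100.0 * y_small);
    std::printf("relative silicon cost, 4x220mm^2 vs 1x600mm^2: %.2f\n",
                cost_multi / cost_mono);
}
```

Under these assumed numbers the 880mm^2 aggregate actually comes out cheaper in raw silicon than the 600mm^2 monolith, which is the yield argument in a nutshell; whether that survives the extra packaging and interconnect cost is the open question.
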
I think this can clarify a bit what AMD is currently doing:
https://www.nextplatform.com/2017/07/12/heart-amds-epyc-comeback-infinity-fabric/
 
To clarify, it's 42 GB/s per link with EPYC, and 768 GB/s per link with Nvidia, or 3 TB/s aggregate. Nvidia's interconnect is also 4x as power efficient per bit at 0.5 pJ/bit versus 2 pJ/bit for EPYC's on-package links.
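
For reference, converting pJ/bit into watts at a given bandwidth is simple arithmetic; here's a small sketch using the figures quoted above (taken as this thread's numbers, not vendor-confirmed specs):

```cpp
// Convert an interconnect's energy-per-bit figure into link power:
// watts = bytes_per_second * 8 * joules_per_bit.
#include <cstdio>

double link_power_w(double bandwidth_gb_s, double pj_per_bit) {
    return bandwidth_gb_s * 1e9 * 8.0 * pj_per_bit * 1e-12;
}

int main() {
    // Figures quoted in the thread: 2 pJ/bit for EPYC's on-package links at
    // 42 GB/s per link, 0.5 pJ/bit for Nvidia's interconnect at 3 TB/s aggregate.
    std::printf("EPYC, one 42 GB/s link at 2 pJ/bit:     %.2f W\n", link_power_w(42.0, 2.0));
    std::printf("Nvidia, 3 TB/s aggregate at 0.5 pJ/bit: %.1f W\n", link_power_w(3000.0, 0.5));
}
```
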

EPYC uses PCIe lanes for inter-die communications. It runs at slightly higher speed than standard PCIe because signals don't go off substrate. I don't think that's what AMD envisions for their future multi-die GPUs.

Traditionally, multi-GPU setups need to replicate data in each GPU's private memory. That's not the case with AMD's HBCC though: each GPU's chunk of HBM2 only holds data from that GPU's working set.

I could imagine a GPU consisting of multiple dies, each die connected to a single stack of HBM2, with additional PHYs for connecting to neighbouring dies. Since this is on a silicon interposer, the links could be very wide, very high bandwidth and very low energy per bit.

Cheers
 
Do we know how much of Vega we will find in Navi? Or, with the multiple-small-dies approach, can we assume that it's a brand new chip?
 
Do we know how much of Vega we will find in Navi? Or, with the multiple-small-dies approach, can we assume that it's a brand new chip?

I don't think AMD has the resources to make two big ISA jumps in a row. It'll definitely be a new chip, but I bet it'll be GFX9 or GFX9.x.
 
Didn't they have like 2 teams working in a "cycle" at one point? Like, while one team is working on X, the other team is already working on Y? I guess I'm wrong, or it was years and years ago.
The sad thing is Vega doesn't look like a big ISA jump performance-wise... Anyway, that's another story for another topic.

Thx for your answer ToTTenTranz.
 
I don't think AMD has the resources to make two big ISA jumps in a row. It'll definitely be a new chip, but I bet it'll be GFX9 or GFX9.x.
There are only 40 new instructions in the GCN5 ISA (http://diit.cz/clanek/architektura-radeon-rx-vega). Vega isn't a huge ISA jump. Vega seems to focus more on the graphics side (tiled rasterizer, ROP caches, geometry pipes, etc.) and on optimizing power usage and raising the clocks. GCN3 was a much bigger ISA change.
 
I'm not deep enough into this to know why, but it seems everything besides architecture is exploding in cost, and verification costs in particular are much higher:

Gartner is even saying more than double the man-years for 7nm vs 14nm. OK, they're talking about SoCs, but the exact numbers don't matter much. What matters is that costs are growing enormously, and I doubt AMD can manage to design so many chips in the future. As costs go up the same will happen to Nvidia as well, but they can probably stay with monolithic chips a bit longer because of their higher R&D budget.
There is very little about that graph and the links that marries up with reality. It just does not cost that much, both in total and in some of the sub-costs identified, to produce even complex processor designs.
 
There is very little about that graph and the links that marries up with reality. It just does not cost that much, both in total and in some of the sub-costs identified, to produce even complex processor designs.
It is very difficult as a layman to have a grasp on this.
How much did Vega cost? Then, having designed Vega, how much would it cost to design a smaller variant with 32CUs and a single memory channel?
The graph doesn't make any distinction between these two cases, but overall cost should differ greatly. But how much is "greatly" really, what are the numbers involved?
I'm frustrated by my lack of knowledge.
 
It is very difficult as a layman to have a grasp on this.
How much did Vega cost? Then, having designed Vega, how much would it cost to design a smaller variant with 32CUs and a single memory channel?
The graph doesn't make any distinction between these two cases, but overall cost should differ greatly. But how much is "greatly" really, what are the numbers involved?
I'm frustrated by my lack of knowledge.

If only our collective frustrations could make Rys break his NDAs...
;)
 
I've honestly got no idea how much Vega cost to get to this point (one ASIC shipped). But I do have a very good idea how much modern consumer SoCs cost (and especially the GPUs therein).

The cost in producing the scaled variants of a processor design like a GPU or CPU is almost 100% verification, after you've designed and verified the base. Scaling it up or down has very little design cost and lots of verification cost.

In terms of the actual dollar cost, lots of the above in the graph is wildly out. It simply does not cost $100M+ to verify a SoC (nowhere near!), and it does not shoot up like that as the node gets smaller.
 
EPYC uses PCIe lanes for inter-die communications. It runs at slightly higher speed than standard PCIe because signals don't go off substrate. I don't think that's what AMD envisions for their future multi-die GPUs.
Did you mean inter-socket communications? On-package links use different PHY and a lower speed. Regardless, the quoted power of 2 pJ/bit is high relative to links assumed to be used for something like an MCM GPU.

AMD's chiplet scheme has no numbers, and at least for HPC a single interposer has two GPUs. Intra-interposer communication is described as using some kind of short-reach high-speed link, which might take things back up to an undesirable power range.

I could imagine a GPU consisting of multiple dies, each die connected to a single stack of HBM2, with additional PHYs for connecting to neighbouring dies. Since this is on a silicon interposer, the links could be very wide, very high bandwidth and very low energy per bit.
HBM's pJ/bit is rather high, compared to some papers using interposers for communication. I'm not sure if that's accounting for other parts of the access process, however.
AMD hasn't demonstrated or given projected power numbers for its project, and at least in terms of bump density the necessary improvements have not materialized. Interposer lines may be dense, but the microbump pitch has not improved much despite interposer proponents' promises. HBM's pitch is coarser than AMD's NoC paper hoped for, and the bandwidth numbers for that are relatively modest.
 
Did you mean inter-socket communications? On-package links use different PHY and a lower speed. Regardless, the quoted power of 2 pJ/bit is high relative to links assumed to be used for something like an MCM GPU.

You're right, I mixed them up. The inter-die links are 42 GB/s (bi-directional), single-ended instead of differential signalling. 2 pJ/bit is pretty good though; that's 4 watts for 250 GB/s of bandwidth.

Cheers
 
Btw, why has multi-GPU always been a failure, but multi-die GPUs can succeed?

mGPU requires developer and driver support. Same with the existing boards that carry multiple separate dies, like the 295X2 or Fiji Pro Duo.

What NV and AMD are going to do in the future is have a single die, like Ryzen, that you can "glue" (thanks Intel!) together to make a bigger chip. These will communicate internally and won't require per-game support from the devs / driver teams. They will work and function as a single GPU.
 
mGPU requires developer and driver support. Same with the existing boards that carry multiple separate dies, like the 295X2 or Fiji Pro Duo.

What NV and AMD are going to do in the future is have a single die, like Ryzen, that you can "glue" (thanks Intel!) together to make a bigger chip. These will communicate internally and won't require per-game support from the devs / driver teams. They will work and function as a single GPU.
I don't think that answers the original question. ;-)

Motherboards with multiple CPUs have existed for decades and worked quite well. Adding multiple CPU dies on the same substrate is almost the same thing. The interface between them just has higher BW and there's some cache coherency protocol (I think).

It's not at all clear to me that the same can be done for GPUs without a massive BW interface between GPUs, and what the cost of that would be.
 
I don't think that answers the original question. ;-)

How does it not answer his question?

His question was "How is it different from what we have now for multi-GPU support?". The answer is: instead of requiring developer support / extra driver hacks, it will work as a single GPU and not as multiple. Say it's 1024 cores per "GPU": the system would see one 2048-core GPU instead of two 1024-core ones.
 
It will work, because it already does - we have multiple µGPUs, called SMs or CUs, already working together on the final picture. One problem was moving this off-card (and no, dual-GPU cards using a PCIe switch were not inherently better at this). With the discussed solution, we're getting one step closer to on-die integration. Whether that'll be enough for all applications? Who knows.

In fact, even multiple graphics cards used for mining or rendering (Blender etc.) or the accelerators in supercomputers already work together very well. The culprit is gaming: vendors insisted on maximum-length benchmark bars for gaming and focused on AFR, which in turn introduces a whole load of troubles of its own.

In the early days, screen partitioning in one way or another was the method of choice and it worked rather well - at least compared to AFR-style mGPU. The problem is/was: how do you market all the hassle of two or more GPUs [cost (2 cards, mainboards with 2x PEG, PSU, electricity) and noise] when you won't get 2× the performance - while your competitor might actually claim that by accepting everything that is bad about mGPU (aka AFR). That's what killed mGPU in gaming, IMHO.
 
It's not at all clear to me that the same can be done for GPUs without a massive BW interface between GPUs, and what the cost of that would be.
I believe it could be possible, assuming of course that there's shared memory (like in multi-socket CPU configs)...

Let's talk about the traditional vertex shader + pixel shader pipeline first. In this case your inputs are commonly read-only (buffers and textures). The GPU can simply cache them separately; no coherence is needed. Output goes from the pixel shader to the ROPs, which do the combine. There's no programmable way to read the render target while you are rendering to it. A tiled rasterizer splits triangles into tiles and renders the tiles separately. You need to have more tiles in flight to saturate a wider GPU. This should also work seamlessly for two GPUs with shared memory: if they are processing different sets of tiles, there are no hazards. Tile buffers obviously need to be flushed to memory after finishing them, but I would assume that this is the common case in a single-GPU implementation as well (if the same tile is rendered twice, why was it split in the first place?).
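
As a toy illustration of the "different sets of tiles, no hazards" point, here's a sketch that statically assigns screen tiles to two dies; the tile size and the checkerboard ownership rule are arbitrary assumptions, not how any real binning rasterizer distributes work.

```cpp
// Toy illustration of splitting a tiled rasterizer's work across two dies:
// each screen tile is owned by exactly one die, so pixel-shader outputs from
// the two dies never touch the same render-target memory, and ROP writes need
// no cross-die coherence. Tile size and the checkerboard mapping are
// arbitrary choices for this sketch.
#include <cstdio>

constexpr int kTileSize = 32;   // pixels per tile edge (assumed)

// Deterministic owner for a tile: checkerboard between die 0 and die 1.
int tile_owner(int tile_x, int tile_y) {
    return (tile_x + tile_y) & 1;
}

int main() {
    const int width = 256, height = 128;
    int tiles_per_die[2] = {0, 0};

    for (int ty = 0; ty < height / kTileSize; ++ty)
        for (int tx = 0; tx < width / kTileSize; ++tx)
            ++tiles_per_die[tile_owner(tx, ty)];

    std::printf("die 0: %d tiles, die 1: %d tiles\n",
                tiles_per_die[0], tiles_per_die[1]);
}
```
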

Now let's move on to UAVs. This is obviously more difficult. However it is worth noticing that DirectX (and other APIs) by default only mandate that writes by a thread group are visible to that same thread group. This is what allows a GCN CU's L1 cache to be incoherent with the other CUs' L1 caches. You need to combine the writes at some point if a cache line was partially written, but you can simply use a dirty bitmask for that. There's no need for a complex coherency protocol. It's undefined behavior if two thread groups (potentially executing on different CUs) write to the same memory location (group execution order isn't guaranteed and memory order isn't guaranteed = race condition). If we forget that atomics and the globallycoherent UAV attribute exist, we simply need to combine a partial dirty cache line with existing data (using a bit mask) when it is written to memory.
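
A minimal sketch of that write-combining idea, assuming 64-byte lines and a byte-granular dirty mask (both illustrative choices for this sketch, not GCN specifics):

```cpp
// Sketch of writing back a partially written cache line without a coherency
// protocol: only bytes marked dirty in the mask overwrite memory, so two CUs
// (or dies) that wrote disjoint bytes of the same line merge correctly
// regardless of write-back order.
#include <cstdint>
#include <cstdio>

constexpr int kLineBytes = 64;

struct CacheLine {
    uint8_t  data[kLineBytes];
    uint64_t dirty;             // bit i set => data[i] was written by this cache
};

// Merge a line into backing memory, touching only the dirty bytes.
void write_back(const CacheLine& line, uint8_t* memory_line) {
    for (int i = 0; i < kLineBytes; ++i)
        if (line.dirty & (1ull << i))
            memory_line[i] = line.data[i];
}

int main() {
    uint8_t memory[kLineBytes] = {};            // backing memory, all zeros

    CacheLine cu0{}, cu1{};
    cu0.data[0] = 0xAA; cu0.dirty = 1ull << 0;  // CU0 wrote byte 0
    cu1.data[1] = 0xBB; cu1.dirty = 1ull << 1;  // CU1 wrote byte 1

    write_back(cu0, memory);                    // order doesn't matter:
    write_back(cu1, memory);                    // disjoint dirty bytes can't conflict

    std::printf("byte0=%02X byte1=%02X byte2=%02X\n", memory[0], memory[1], memory[2]);
}
```
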

Globallycoherent attribute for UAV is a difficult case. It means that UAV writes must be visible for other CUs after doing DeviceMemoryBarrierWithGroupSync. Groups can use it in combination with atomics to ensure data visibility between groups. However this isn't a common use case in current rendering code. For example Unreal Engine code base shows zero hits for "globallycoherent". Atomics however are used quite commonly in modern rendering code (without combining it with globallycoherent UAV). DirectX mandates that atomics are visible to other groups (even without a barrier). The most common use case is one global counter (atomic add), but you could do random access writes with atomics to a buffer or even a texture (both 2d and 3d texture atomics exist). But I would argue that the bandwidth used for atomics and globallycoherent UAVs is tiny compared to other memory accesses, meaning that we don't need full width bus between the GPUs (for transferring cache lines touched by these operations requiring coherency).
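
For what the "one global counter" pattern looks like, here's a CPU-side analogy using std::atomic; on the GPU this would be an atomic add on a UAV counter, and the point is that the only line needing cross-die coherency is the one holding the counter itself.

```cpp
// CPU-side analogy for the "one global atomic counter" pattern: many threads
// bump a single shared counter (e.g. to allocate output slots), so the only
// cache line that needs cross-die coherency traffic is the counter's line.
// The std::atomic version here is just an analogy for a GPU-side atomic add.
#include <atomic>
#include <cstdint>
#include <cstdio>
#include <thread>
#include <vector>

int main() {
    std::atomic<uint32_t> counter{0};
    const int kThreads = 8, kItemsPerThread = 10000;

    std::vector<std::thread> workers;
    for (int t = 0; t < kThreads; ++t)
        workers.emplace_back([&] {
            for (int i = 0; i < kItemsPerThread; ++i)
                counter.fetch_add(1, std::memory_order_relaxed);  // append-style allocation
        });
    for (auto& w : workers) w.join();

    std::printf("allocated %u slots\n", counter.load());
}
```
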

But these operations still exist and must be supported relatively efficiently. So it definitely isn't a trivial thing to scale to a 2P GPU system with memory coherence and automatic load balancing (automatically splitting a single dispatch or draw call across both).

However, if we compare CPU and GPU, I would argue that the GPU seems much simpler to scale up. A CPU is constantly accessing memory. Stack = memory. Compilers write registers to memory very frequently to pass them to function calls, to index them and to spill them. There's a potential coherency implication on each read and write. GPU code, on the other hand, is designed to do far fewer memory operations. Most operations are done in registers and in groupshared memory. Writing a result to memory and immediately reading it back afterwards is not a common case. Most memory regions (resources) that are accessed randomly are marked as read-only. Most resources that are written are marked as only needing group coherency (group = all threads executing on the same CU). Resources needing full real-time coherency between CUs and between multiple GPUs are rare, and most of these accesses are simple atomic counters (one cache line bouncing between GPUs). This is a much simpler system to optimize than CPUs.
 
There is very little about that graph and the links that marries up with reality. It just does not cost that much, both in total and in some of the sub-costs identified, to produce even complex processor designs.

As I don't have any inside knowledge I can only go by the public data, so it might be much too high.

I've honestly got no idea how much Vega cost to get to this point (one ASIC shipped). But I do have a very good idea how much modern consumer SoCs cost (and especially the GPUs therein).

The cost in producing the scaled variants of a processor design like a GPU or CPU is almost 100% verification, after you've designed and verified the base. Scaling it up or down has very little design cost and lots of verification cost.

In terms of the actual dollar cost, lots of the above in the graph is wildly out. It simply does not cost $100M+ to verify a SoC (nowhere near!), and it does not shoot up like that as the node gets smaller.

100M might really be too high, but are you sure that it does not shoot up per node? Your ex-company also showed numbers in which verification costs skyrocket, though admittedly at a much lower level. But from 28nm to 16nm we see a doubling, and looking at the trend from 65nm this has happened at every node jump. Looking at the 25M in this graph I would expect 50M at 7nm. Also, I would expect verification cost to be bigger for a bigger chip, or am I wrong? So maybe for big chips of Vega's size you could even reach 100M. At least that would have been my layman's thought: that a chip much bigger than these SoCs would cost way more to verify. Correct me if I'm wrong :D
[Image: Imagination-TSMC-IP-platforms-SoC-IP-5.png - graph of SoC design cost per process node]


https://www.imgtec.com/blog/imagination-tsmc-collaborate-on-iot-subsystems/
 