AMD: Navi Speculation, Rumours and Discussion [2019-2020]

Since the data scraped from firmware points to both HBM and GDDR6, would it be possible to connect HBM and GDDR6 in one memory system? Memory coherency has advanced a lot, and AMD has already done the SSG, with NAND connected so it could be used as VRAM. Could they make a memory system with both HBM and GDDR6 as one coherent pool?
 
Wow, I'm so much out of the loop on this stuff. New respect for TSMC, too.

So, the rumoured ~500mm² Navi 21, if using InFO_MS to work with HBM, could have substantially less than ~500mm² of active GPU area...

If I understand things correctly, this was supposed to be in "pre-qualification" earlier this month, so it would be quite a shock to see it in an actual product this year.
 
InFO_MS does not count RAM area as logic area, and by the way, the (few) pictures and card renders we have show a package that is quite a bit bigger than Radeon VII's (which had 4 HBM dies) and on par with the Vega 10 package (a 495 mm² chip with 2 HBM stacks).
I'd misunderstood fan-out. This article helps:

https://semiengineering.com/momentum-builds-for-advanced-packaging/
[...] is a wafer-level packaging technology, where dies are packaged in a wafer.

[...]

To make fan-out packages, dies are placed in a wafer-like structure using an epoxy mold compound. The RDLs are formed. The individual dies are cut, forming a package.

Fan-out has some challenges. When the dies are placed in the compound, they can move during the process. This effect, called die shift, can impact yield.
So this would appear to have no substantial impact on a GPU die's active area.
 
I got bored:

b3da036.png


The analysis excludes the PHY, MC, IO and media blocks in the XSX die shot, i.e. to come up with a die size for Navi 21 you would need to add those to whatever numbers you derive from the above.

Total GPU does include command processor, L2 (5MB) and RBEs. Navi's RBE area would be "correct" if we assume it has 64 RBEs, but if zixel rate is doubled, I dunno how much extra space that would be. Adjust for 128 RBEs if you like - if you can find them on the die (I have a good idea, but got lazy).

Navi's 4 shader engines are going to add more area and I don't know how to account for that.

I've not done this to support the 128MB Infinity Cache rumour, just felt like having some fun.
 
I've not done this to support the 128MB Infinity Cache rumour, just felt like having some fun.
GPU L2 has plenty of logic though. Atomics, dirty byte tracking, compressor, etc.

The Zen 2 L3 block (vanilla 7nm) is another way to estimate the size of a hypothetical 128MB SRAM cache. 16MB is ~17 mm² (tags included, though). Eight of these puts the tally at ~136 mm². Now, depending on whether you think Navi 2X will use N7+ (EUV), and whether you think it is a memory pool or extra capacity for the L2 cache, the number could go further down.
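As a sanity check on that arithmetic (the ~17 mm² per 16MB figure is the only input; everything else follows from it):

```python
# Back-of-envelope: size a hypothetical 128MB SRAM cache from the
# Zen 2 L3 block on vanilla 7nm (the 17 mm^2 figure includes tags).
ZEN2_L3_MB = 16
ZEN2_L3_MM2 = 17.0

mm2_per_mb = ZEN2_L3_MM2 / ZEN2_L3_MB    # ~1.06 mm^2 per MB
print(f"{128 * mm2_per_mb:.0f} mm^2")    # ~136 mm^2, before any N7+ density gains
```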
 
pTmdfx, I've taken your hint about CPU L3 cache, so using the L3 cache that's on the same die shot:

b3da037.png


it turns out that this L3 is very compact. The change in size for the "128MB Infinity Cache" is crazy.

In case you're wondering, the die image I'm using is from this slide:

die_shot.png


Which is from:

https://www.eurogamer.net/articles/digitalfoundry-2020-xbox-series-x-silicon-hot-chips-analysis

I'm still sceptical about this concept, but cache hit rate in the many rendering passes used per frame does appear to be a concern. As the rendering pass count rises and the count of target GPUs rises, it gets harder to justify to developers: "optimise your memory access patterns like this for this GPU".

Ray tracing might be the killer app?
 
This is how I've defined L3:

b3da039.png


As I understand it, this is 4MB, and it takes up 5.4mm².
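Scaling that linearly to 128MB (a naive assumption, since tags, control logic and routing won't all scale the same way) is where the ~173mm² figure quoted below comes from:

```python
# Naive linear scaling of the XSX CPU L3 block to 128MB.
XSX_L3_MB = 4
XSX_L3_MM2 = 5.4

print(f"{128 / XSX_L3_MB * XSX_L3_MM2:.1f} mm^2")  # 172.8 mm^2
```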

So wait, the claim would then be ~173mm² of die space for 128MB of L3. Which would need to deliver some ridiculous IPC somewhere, as another 256 bits of bus width would only take up what, about 80mm²(ish?) according to the below, and that would be all that's needed to get the big one working with a standard memory configuration. AFAIK, with all the latency hiding and serialization, there's no way that would be worth the tradeoff. Even for raytracing, the slowdown should come from partially occupied wavefronts due to poor locality, even if you strip the latency down a lot. At least, I don't see how some standard L3 cache alone would still be worth it.

jxniefh8pvb41.jpg
 
L3 has the added benefit of consuming much less energy per byte fetched, which can lead to higher clocks for a given power budget.

Still, my money's on off-die, but on-package cache with something denser than SRAM—if there really is a very large cache, that is.
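To put a very rough number on the energy argument, here's a toy calc; every per-bit and traffic figure in it is an assumption picked for illustration, not a measurement of any real part:

```python
# Toy energy comparison: power spent moving data with and without a big
# on-die cache. All numbers are made-up ballparks for illustration.
GDDR6_PJ_PER_BIT = 8.0    # assumed energy per bit for an off-chip fetch
SRAM_PJ_PER_BIT  = 1.0    # assumed energy per bit for a large on-die SRAM hit
BYTES_PER_FRAME  = 4e9    # assumed 4 GB of memory traffic per frame
HIT_RATE         = 0.5    # assume half the traffic hits the big cache

def fetch_watts(pj_per_bit: float, bytes_moved: float, fps: int = 60) -> float:
    """Average power spent moving `bytes_moved` per frame at `fps`."""
    return pj_per_bit * 1e-12 * bytes_moved * 8 * fps

without_cache = fetch_watts(GDDR6_PJ_PER_BIT, BYTES_PER_FRAME)
with_cache = (fetch_watts(GDDR6_PJ_PER_BIT, BYTES_PER_FRAME * (1 - HIT_RATE))
              + fetch_watts(SRAM_PJ_PER_BIT, BYTES_PER_FRAME * HIT_RATE))
print(f"{without_cache:.1f} W vs {with_cache:.1f} W")  # ~15.4 W vs ~8.6 W
```

Whatever the real figures, the watts saved on fetches are watts that can go into clocks.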
 
The problem we have is that the rumoured ~500mm² Navi 21 is ludicrously over-sized for a "double Navi 10".

So wait, the claim would then be ~173mm² of die space for 128MB of L3. Which would need to deliver some ridiculous IPC somewhere, as another 256 bits of bus width would only take up what, about 80mm²(ish?) according to the below, and that would be all that's needed to get the big one working with a standard memory configuration.
Using this really nice die shot (thanks for this, I was just about to go hunting for it), PHY + MCs look like they take ~64mm².

My analysis (note L2 is tricky: there are two variants, and I think the "small block" variant, based on 4 repeated slices, is more likely correct - MB/area is similar to XSX, too):

b3da040.png


Even a 512-bit die has a lot of "missing" area, ~29mm², and that's with a naive doubling of uncore (GPU logic outside of shader engines) area.
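For anyone who wants to play the same accounting game, it boils down to something like this. Only the ~64mm² PHY + MC estimate comes from the die shot; the logic figure is a placeholder I've back-solved so the 512-bit case lands on the ~29mm² gap:

```python
# Sketch of the die-size accounting: sum component areas, compare with
# the rumoured ~500 mm^2 for Navi 21.
RUMOURED_NAVI21_MM2 = 500.0
PHY_MC_MM2_PER_256BIT = 64.0     # estimated from the Navi 10 die shot above

def missing_area(logic_mm2: float, bus_bits: int) -> float:
    accounted = logic_mm2 + PHY_MC_MM2_PER_256BIT * bus_bits / 256
    return RUMOURED_NAVI21_MM2 - accounted

# With ~343 mm^2 of doubled-Navi-10 logic (placeholder figure):
print(missing_area(343.0, 512))  # ~29 mm^2 still unaccounted for
print(missing_area(343.0, 256))  # ~93 mm^2 on a 256-bit die
```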

AFAIK, with all the latency hiding and serialization, there's no way that would be worth the tradeoff. Even for raytracing, the slowdown should come from partially occupied wavefronts due to poor locality, even if you strip the latency down a lot. At least, I don't see how some standard L3 cache alone would still be worth it.
It seems that ALU utilisation suffers way more from memory latency than we would have expected (even though GPUs "hide" it). This appears to be because there are so many rendering passes in modern games, and it can only be partially accounted for by the spin-up/spin-down of hardware threads:

b3da041.png


see slide 34:

https://gpuopen.com/wp-content/uploads/2018/05/gdc_2018_sponsored_engine_optimization_hot_lap.pptx

RDNA attempts to improve ALU utilisation by scheduling hardware threads for minimal duration, theoretically to maximise coherent use of memory (cache and LDS). What we don't have, as far as I can tell, is an analysis of ALU utilisation in RDNA.
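The spin-up/spin-down point can be made with a crude Little's-law model: a wave computes for a while, then stalls on memory, and other resident waves must cover the stall. At pass boundaries there aren't enough waves resident yet, and the shorter the pass, the bigger that fraction of its runtime. A toy sketch with hypothetical cycle counts:

```python
# Little's law, crudely applied to one SIMD: how many waves are needed
# so that someone always has ready ALU work while others wait on memory.
def waves_to_stay_busy(mem_latency_cycles: int, alu_cycles_between_loads: int) -> float:
    return 1 + mem_latency_cycles / alu_cycles_between_loads

print(waves_to_stay_busy(400, 40))  # 11.0 waves to hide a DRAM-ish latency
print(waves_to_stay_busy(100, 40))  # 3.5 waves if a big cache cuts the latency
```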
 
Isn't dark silicon, to improve thermals and reduce signal interference at higher clocks, a much simpler explanation for the large die size?
 
Isn't dark silicon, to improve thermals and reduce signal interference at higher clocks, a much simpler explanation for the large die size?

And the most sensible thing to fill that dark silicon is cache.

It's important to note that dark silicon will basically never mean areas of the die left literally blank. It just means you have to design your system so that not all of it can be switching at the same time. Pretty much the archetypical not-often-switching large structure is a block of cache.
 
It's not just N21 that's ridiculously oversized. N22, with 40 CUs and a 192-bit bus, is rumored to be 340mm².
In my analysis, a "256-bit double Navi 10" comes out at ~363mm². Wouldn't it be funny if the rumoured die sizes were all for the "next chip up in size"...
 
Also worth mentioning: GPU kernels have more explicit say in cache policies than typical CPU cores.

Say RDNA/GCN allows you — for every request — to alter L0, L1 and L2 policies by choosing different combos of GLC, DLC and SLC bits.

So presumably, if the cache hierarchy is getting a big capacity boost, the shader compiler would likely get a big complementary upgrade too. One could go as far as JITing shaders with live profiling data to, say, make resource accesses with a high L2 read-miss rate skip L2 for all reads in the future.
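As a toy illustration of what that compiler knob looks like (nothing here is real ISA behaviour, just the shape of the idea):

```python
# Toy model of per-request cache policy bits, in the spirit of RDNA's
# GLC/DLC/SLC modifiers. Not real hardware semantics.
from dataclasses import dataclass, field

@dataclass
class CacheLevel:
    name: str
    lines: set = field(default_factory=set)

    def read(self, addr: int, allocate: bool = True) -> bool:
        hit = addr in self.lines
        if not hit and allocate:
            self.lines.add(addr)   # fill on miss, unless bypassed
        return hit

L2 = CacheLevel("L2")

def shader_load(addr: int, skip_l2: bool = False) -> str:
    """A load whose (compiler-chosen) policy bit decides L2 behaviour."""
    if skip_l2:
        return "bypassed L2"       # straight to memory, no L2 pollution
    return "L2 hit" if L2.read(addr) else "L2 miss, filled"

# A compiler with live profiling data could mark streaming resources
# (high L2 miss rate) to bypass, keeping L2 for reusable data:
print(shader_load(0x1000))                 # L2 miss, filled
print(shader_load(0x1000))                 # L2 hit
print(shader_load(0x2000, skip_l2=True))   # bypassed L2
```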

Then of course, the option of large SRAMs being a backing memory pool ("eSRAM" in Xbox One) is also on the table. Such a pool could be controlled either by an "HBCC" kind of thing (a hardware-assisted page cache, supposedly) or 100% by software (the driver). This also shifts the issue from the hardware-caching realm to virtual-memory management.
 
So wait, the claim would then be ~173mm² of die space for 128MB of L3. Which would need to deliver some ridiculous IPC somewhere, as another 256 bits of bus width would only take up what, about 80mm²(ish?) according to the below, and that would be all that's needed to get the big one working with a standard memory configuration.

More GDDR6 PHYs alone don't give the chip more effective bandwidth. They'd need to be paired with more memory chips, which come at a (very unpredictable, lately) cost. Plus, it seems that GDDR6 is especially picky with regard to signaling and PCB placement, which is probably why 384-bit arrangements are now reserved for >$1400 graphics cards (Nvidia had 384-bit GDDR5 cards in the $650 range).

A bigger chip with a narrower memory bus seems like a safer bet (if it's effective). IHVs can usually decrease and control the cost of those bigger chips as yields improve and wafers get cheaper, but when the time comes to renew DRAM supply contracts they have no such control over external pricing.
 
And the most sensible thing to fill that dark silicon is cache.

Aren't caches fairly large energy consumers?

It's important to note that dark silicon will basically never mean areas of the die left literally blank. It just means you have to design your system so that not all of it can be switching at the same time. Pretty much the archetypical not-often-switching large structure is a block of cache.

I've been trying to familiarize myself with the creative ideas that have been brought forward regarding dark silicon. This one's an interesting overview:
A landscape of the new dark silicon design regime

Spatial and temporal switching seems a straightforward idea. I assume there are severe placement-solving problems for spatially active regions, so some local optimum may still be (say) 10% larger than a globally optimal solution, which in turn is still 25% larger than an impossible ideal solution. 3D would allow further exploitation, of course.

What I find super interesting is the suggestion of more fixed-function blocks (or c-cores) to fill up the space: they are much easier to lay out spatially, temporal exclusivity is almost a given, as GPUs are so far not truly super-scalar, and spatially these regions are neighbours as they share the data paths.

Because you mentioned caches, I also tried to understand their energy profile (I didn't really find anything besides a 6% D-cache and 21% I-cache energy contribution to instruction execution, which sounds like a lot, but I guess the alternative is much worse). There are these switchable cache configs:
Switchable cache: Utilising dark silicon for application specific cache optimisations

I only read the abstract, but I find this tempting for a GPGPU, as the different workloads and utilization types certainly have different characteristics with regard to data access (especially the difference between a BVH-data request and a swizzled texture-data access).
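As I read the abstract, the mechanism is something like this (both configurations below are invented by me for illustration):

```python
# Schematic sketch of a switchable cache: several configurations exist
# in (dark) silicon, only one is powered up at a time, chosen to suit
# the running workload. Both configs are invented examples.
CACHE_CONFIGS = {
    "bvh_traversal":  {"line_bytes": 32,  "ways": 16},  # short lines for pointer-chasing
    "texture_stream": {"line_bytes": 128, "ways": 4},   # long lines for swizzled streaming
}

def power_up(workload: str) -> None:
    cfg = CACHE_CONFIGS[workload]
    print(f"{workload}: {cfg['ways']}-way, {cfg['line_bytes']}B lines")

power_up("bvh_traversal")   # bvh_traversal: 16-way, 32B lines
power_up("texture_stream")  # texture_stream: 4-way, 128B lines
```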

But then, intuitively, I believe all these proposals describe ideas/solutions that are too risky and too complex. I personally believe the answer is a very simple one.
 