AMD: Navi Speculation, Rumours and Discussion [2019-2020]

Still haven't seen the cooling required for either one, so we'll have to wait a while longer to see how it really plays out.
 
There is a guy digging through AMD's firmware blobs. He got some nice info from the descriptions and power tables (the TFLOPS figures follow from CU count and clock, as sketched after the list):

* Navi 21 - 80CUs, maxing out at 2.05-2.2GHz = 21-22.5TFLOPS
* Navi 22 - 40CUs, maxing out at a whopping 2.5GHz = 12.8TFLOPS
* Navi 23 - 32CUs, with no power-table-driven frequency yet; apparently this one comes later, after Navi 21 and 22.
* Navi 31 - 80CUs with an identical configuration to Navi 21; possibly just placeholder data?
* Van Gogh APU - 1 SA with 8 RDNA CUs = 8 CUs
* Rembrandt APU - 2 SAs with 6 RDNA CUs each = 12 CUs
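A quick sanity check on those TFLOPS figures, assuming the usual 64 shader ALUs per CU and 2 FLOPs per ALU per clock (FMA); the Navi 23 clock below is just a guess since its power table is missing:

# FP32 TFLOPS = CUs * 64 ALUs * 2 FLOPs (FMA) * clock (GHz) / 1000
def tflops(cus, ghz):
    return cus * 64 * 2 * ghz / 1000

print(tflops(80, 2.05), tflops(80, 2.2))   # Navi 21: ~21.0 / ~22.5
print(tflops(40, 2.5))                     # Navi 22: ~12.8
print(tflops(32, 2.2))                     # Navi 23 at a guessed 2.2GHz: ~9.0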


// ok, beaten :]
 
Jesus H Christ. I expected at least 2.2GHz, like the PS5, but 2.5GHz?

If I understand things correctly, the card will basically never actually run at 2.5GHz. That's the absolute maximum boost limit for when all stars align and power consumption is low and thermals are excellent. That number is not marketed or printed on any box, mainly because realistically no card ever reaches it in any real load. Best possible real-world clocks have typically been ~100MHz below that.

OFC, 2.4GHz is still pretty monstrous...
 

While the clocks are high, memory bandwidth still makes absolutely no sense to me.
 
If those specs are true then Navi 22 is pretty disappointing. I'd have hoped the second-tier GPU would be a comfortable step above the XSX, but this would be barely faster at all.

EDIT: missed the memory bandwidth, so it would actually be slower than the XSX :/
 
I'd assume this will be a cut-down Navi 22. Otherwise there would be a big empty desert between 40CU/192-bit and 80CU/384-bit, or whatever the configurations end up being. I expect to see 256-bit and 320-bit full/cut-down versions of Navi 22 and Navi 21.
 
If anything, my takeaway is that memory efficiency has improved, which would make XSS look a bit better.

But yeah, if that bandwidth is true it's very surprising.
I guess their view is that it's not targeting 4K?
 
If I understand things correctly, the card will basically never actually run at 2.5GHz. That's the absolute maximum boost limit for when all stars align and power consumption is low and thermals are excellent. That number is not marketed or printed on any box, mainly because realistically no card ever reaches it in any real load. Best possible real-world clocks have typically been ~100MHz below that.

OFC, 2.4GHz is still pretty monstrous...

Well, if you look at the power tables for Navi 10, the power and clocks are those of the 5700 (the base clocks, moreover), not even the 5700 XT, and the 5700 certainly reaches those frequencies.
 
Navi 21 at 3080 performance and prices would be a good grab and a good change of pace for AMD. If the information above is correct, we could see such a thing happen.
 
If Navi 22 does have an inherent clock advantage over Navi 21 by the amount listed, then the gap is roughly the same as that between GA104 and GA102.

There is no full consumer GA104 or GA102, but the RTX 3070 / RTX 3090 TFLOPS ratio is 57.1%.

Using the 12.8/22.5 numbers for Navi 22/21 gives 56.9%.
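For reference, using the commonly quoted shader counts and boost clocks (treat these as approximate, they are not from the firmware dump above), the two ratios work out like this:

# GA104 (RTX 3070) vs GA102 (RTX 3090): FP32 TFLOPS = shaders * 2 * GHz / 1000
tf_3070 = 5888 * 2 * 1.725 / 1000    # ~20.3 TF
tf_3090 = 10496 * 2 * 1.695 / 1000   # ~35.6 TF
print(tf_3070 / tf_3090)             # ~0.571

# Navi 22 vs Navi 21, using the leaked peak numbers
print(12.8 / 22.5)                   # ~0.569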
 
So this is an old and perhaps relevant thread, from 2015:

https://forum.beyond3d.com/threads/gpu-cache-sizes-and-architectures.56731/

And a relevant post by sebbbi from 2013:


Tahiti saw benefits from this "fully-tiled" benchmark and we now have GPUs with ~6x more fillrate, but only ~3x more bandwidth (RTX 3090, of course, has tiled rasterisation - but that wouldn't be relevant to sebbbi's HDR particles test).
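Roughly, using commonly cited reference specs (HD 7970: 32 ROPs at 925MHz and 264GB/s; RTX 3090: 112 ROPs at ~1.7GHz and 936GB/s, all ballpark assumptions rather than anything from this thread), the scaling looks like this:

# peak pixel fillrate = ROPs * clock (GHz), in Gpixels/s
tahiti_fill, tahiti_bw = 32 * 0.925, 264    # ~29.6 Gpix/s, 264 GB/s
ga102_fill, ga102_bw = 112 * 1.70, 936      # ~190 Gpix/s, 936 GB/s
print(ga102_fill / tahiti_fill)             # ~6.4x fillrate
print(ga102_bw / tahiti_bw)                 # ~3.5x bandwidth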

Still, we don't know how RDNA works and whether it uses the ROP depth/colour cache that we saw in GCN.

You know, B3D forum is epic:


128MB L4 :)
I dimly recall that there was discussion, either here or elsewhere around that time or earlier, about what sizes of on-die cache would allow texturing to be treated as a well-behaved working set rather than as a streaming, low-temporal-locality workload. I think the values given for the working set were then-impossible numbers in the hundreds of MB.
Perhaps with some relaxation on conditions, such as in-stack versus on-die and getting within an order of magnitude, it's at least theoretically possible.

Efficient L2 Cache Management to Boost GPGPU Performance

This 2019 paper is a direct study of GCN; it improves an existing simulator (bringing simulated performance substantially closer to actual chip performance) and then goes on to propose a new cache architecture:
I may have to do more than skim this. There are some interesting questions about the non-disclosed elements of these architectures being probed, such as the number of misses from each cache level and the behavior of the cache protocol.
I'm curious about some of the assumptions made, and how they may affect the conclusions about simulator accuracy or the calculated values for things like miss-handling capacity.
The assumed hardware model has cache sizes that match SI, but it is CPU-like in some ways. The L1 latency is listed as 1 cycle, the L2 as 10 cycles, and memory as 90 cycles.
At GDC 2018, AMD indicated GCN's figures at each level were ~114/190/350 cycles, or 114/76/160 when only the additional latency contributed by each level is counted.
Could the loads be that different and not affect the conclusion of how accurate a given simulation is?
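For clarity, those two sets of GDC numbers are the same data expressed two ways, cumulative latency to each level versus the extra cost of going one level further out:

# AMD's GDC 2018 GCN latency figures (cycles), cumulative per level
cumulative = {"L1": 114, "L2": 190, "mem": 350}

# additional latency contributed by each level beyond the previous one
additive = {"L1": 114, "L2": 190 - 114, "mem": 350 - 190}
print(additive)   # {'L1': 114, 'L2': 76, 'mem': 160}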

I haven't thought enough about whether the IPC metric chosen could be influenced by this.

The proposed cache management scheme seems like a combination of cache pipelining and the small victim buffer seen on some CPUs, with an additional amount of cache line swapping.
At least in the less-flooded coherent systems for CPUs, I think some level of this behavior already goes on as part of the interaction between the home nodes, caches, and memory controllers. Otherwise there would be hundred-cycle periods where a significant number of cache lines would be useless.
The swap function may or may not be done there. The extra data movement in at least one direction might add complexity and power consumption, at least for CPU-like scenarios.
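As a rough illustration of the generic CPU-style victim-buffer idea being referred to (not the paper's actual scheme), here is a minimal sketch: evicted lines drop into a small fully associative buffer, and a later miss that hits there swaps the line back in instead of going to memory.

from collections import OrderedDict

class VictimBuffer:
    """Tiny fully associative buffer holding recently evicted lines."""
    def __init__(self, entries=4):
        self.entries = entries
        self.lines = OrderedDict()          # block address -> data

    def insert(self, addr, data):
        if len(self.lines) >= self.entries:
            self.lines.popitem(last=False)  # drop the oldest victim
        self.lines[addr] = data

    def take(self, addr):
        return self.lines.pop(addr, None)   # hit returns data, miss returns None

class DirectMappedCache:
    def __init__(self, num_sets, victim):
        self.sets = [None] * num_sets       # each entry: (block address, data) or None
        self.victim = victim

    def access(self, addr):
        idx = addr % len(self.sets)
        entry = self.sets[idx]
        if entry and entry[0] == addr:
            return "hit"
        data = self.victim.take(addr)       # check the victim buffer first
        if entry:                           # the current occupant becomes a victim
            self.victim.insert(entry[0], entry[1])
        if data is not None:
            self.sets[idx] = (addr, data)   # swap the line back in
            return "victim hit"
        self.sets[idx] = (addr, f"mem[{addr}]")  # otherwise fill from memory
        return "miss"

cache = DirectMappedCache(num_sets=8, victim=VictimBuffer(entries=4))
print([cache.access(a) for a in (3, 11, 3)])  # ['miss', 'miss', 'victim hit']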

I didn't see the reference for where the GPU's cache pipeline is documented, or rather, for how limited the "standard" pipelining of the GPU L2 is based on their claims. Perhaps the pipeline needs to be less aggressive due to the volume of traffic or other constraints, but if my reading of the timing diagram is right, their timeline of cache requests and invalidations leaves a cache line invalidated and idle for 100 cycles, because the fetch isn't started until the current line is fully gone. Never mind that, if AMD's numbers are even partly representative, the memory controller latency may be 2x worse.



It's possible Intel found that the Haswell GPU's ROP throughput was modest enough that it could dedicate the eDRAM to texturing. I'm not clear on whether the ROPs could leverage the GPU's L3 or the main L3, which could buffer their traffic further.
As for Crystalwell at 7nm, I think that scaling estimate didn't account for Crystalwell using Intel's eDRAM, which TSMC's 7nm process doesn't offer. Then there's the ballpark equivalence of foundry nodes being roughly the same density as Intel's N-1 node (Intel's current troubles aside).

An SRAM-based example might be Zen 2's L3 cache, which I've seen estimated to be ~34mm2 for 32 MB. Granted, there are tags, reliability measures, connectivity, and bandwidth measures that could hurt density, depending on the parameters of this hypothetical 128MB GPU cache. I don't think those factors are enough to counter the lack of eDRAM or node name shenanigans.
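Taking that ~34mm2 / 32MB Zen 2 estimate at face value and scaling linearly (ignoring the tag/wiring caveats above), a 128MB SRAM block on a similar 7nm process would land somewhere around:

zen2_l3_mm2, zen2_l3_mb = 34, 32
target_mb = 128
print(zen2_l3_mm2 * target_mb / zen2_l3_mb)  # ~136 mm^2 of raw SRAM, before overheads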

(edit: lost the word shenanigans)
 
"num_rb_per_se" refers to Render Backends per Shader Engine? If it's still 4 ROPs per RBE, we're looking at 64 ROPs on Navi 21?
 