AMD: Navi Speculation, Rumours and Discussion [2019-2020]

SimBy · Sep 26, 2020

Jesus H Christ. I expected at least 2.2 like PS5, but 2.5GHz?

BRiT · Sep 26, 2020

Still haven't seen the cooling required for either one, so will have to wait a while more to see how it really plays out.

PSman1700 · Sep 26, 2020

80cu @2.2 damn. Nv has some serious competition then, well over 20TF I assume.

yuri · Sep 26, 2020

There is a guy digging through AMD's firmware blobs. He got some nice info from the description and power tables.

* Navi 21 - 80CUs, maxing out at 2.05-2.2GHz = 21-22.5TFLOPS
* Navi 22 - 40CUs, maxing out at a whopping 2.5GHz = 12.8TFLOPS
* Navi 23 - 32CUs, no power-table-derived frequency yet; apparently this one comes later, after 21 and 22.
* Navi 31 - 80CUs with identical configuration as Navi 21; possibly just placeholder data?
* Van Gogh APU - 1 SA with 8 RDNA CUs = 8 CUs
* Rembrandt APU - 2 SAs with 6 RDNA CUs each = 12 CUs

https://www.reddit.com/r/Amd/comments/j06xcd
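
For anyone who wants to check the TFLOPS arithmetic in the list above, here is a minimal sketch. It assumes 64 FP32 lanes per CU and 2 FLOPs per lane per clock (one fused multiply-add); those two constants and the function name are my assumptions, not something taken from the firmware tables.

```python
# Theoretical peak FP32 throughput from CU count and clock.
# ASSUMPTIONS (not from the firmware data): 64 FP32 lanes per CU,
# 2 FLOPs per lane per clock (one fused multiply-add).
LANES_PER_CU = 64
FLOPS_PER_LANE_PER_CLOCK = 2


def peak_tflops(cus: int, clock_ghz: float) -> float:
    """Return theoretical peak FP32 TFLOPS for a CU count and clock in GHz."""
    gflops = cus * LANES_PER_CU * FLOPS_PER_LANE_PER_CLOCK * clock_ghz
    return gflops / 1000.0


if __name__ == "__main__":
    for name, cus, clock in [
        ("Navi 21 @ 2.05GHz", 80, 2.05),
        ("Navi 21 @ 2.2GHz", 80, 2.2),
        ("Navi 22 @ 2.5GHz", 40, 2.5),
    ]:
        print(f"{name}: {peak_tflops(cus, clock):.1f} TFLOPS")
```

These are peak numbers at the listed maximum clock, so sustained throughput would be lower if the card rarely holds that clock.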

// ok, beaten :]

tunafish · Sep 26, 2020

SimBy said:
Jesus H Christ. I expected at least 2.2 like PS5, but 2.5GHz?

If I understand things correctly, the card will basically never actually run at 2.5GHz. That's the absolute maximum boost limit for when all stars align and power consumption is low and thermals are excellent. That number is not marketed or printed on any box, mainly because realistically no card ever reaches it in any real load. Best possible real-world clocks have typically been ~100MHz below that.

OFC, 2.4GHz is still pretty monstrous...

PSman1700 · Sep 26, 2020

A Navi 21/31 looks very tempting: 22+ TF, some OC on top of that perhaps, and some lightning-fast HBM RAM.

SimBy · Sep 26, 2020

tunafish said:
If I understand things correctly, the card will basically never actually run at 2.5GHz. That's the absolute maximum boost limit for when all stars align and power consumption is low and thermals are excellent. That number is not marketed or printed on any box, mainly because realistically no card ever reaches it in any real load. Best possible real-world clocks have typically been ~100MHz below that.

OFC, 2.4GHz is still pretty monstrous...

While the clocks are high, memory bandwidth still makes absolutely no sense to me.

pjbliverpool · Sep 26, 2020

If those specs are true then Navi 22 is pretty disappointing. I'd have hoped the second tier GPU would be a comfortable step above the XSX but this would be barely faster at all.

EDIT: missed the memory bandwidth, so it would actually be slower than the XSX :/

SimBy · Sep 26, 2020

pjbliverpool said:
If those specs are true then Navi 22 is pretty disappointing. I'd have hoped the second tier GPU would be a comfortable step above the XSX but this would be barely faster at all.

EDIT: missed the memory bandwidth, so it would actually be slower than the XSX :/

I don't know. That 40-80 gap seems way too big. Something has to slot in there.

AbsoluteBeginner · Sep 26, 2020

pjbliverpool said:
If those specs are true then Navi 22 is pretty disappointing. I'd have hoped the second tier GPU would be a comfortable step above the XSX but this would be barely faster at all.

EDIT: missed the memory bandwidth, so it would actually be slower than the XSX :/

I'd assume this will be a cut-down Navi 22. There will be a big empty desert between 40CU 192-bit and 80CU 384-bit or whatever. I expect to see 256/320-bit full/cut-down versions of Navi 22 and Navi 21.

Jay · Sep 26, 2020

pjbliverpool said:
EDIT: missed the memory bandwidth, so it would actually be slower than the XSX :/

If anything, my takeaway is that memory efficiency has improved, which would make XSS look a bit better.

But yeah, if that bandwidth is true it's very surprising.
I guess their view is it's not targeting 4K?

Leoneazzurro5 · Sep 26, 2020

tunafish said:
If I understand things correctly, the card will basically never actually run at 2.5GHz. That's the absolute maximum boost limit for when all stars align and power consumption is low and thermals are excellent. That number is not marketed or printed on any box, mainly because realistically no card ever reaches it in any real load. Best possible real-world clocks have typically been ~100MHz below that.

OFC, 2.4GHz is still pretty monstrous...

Well, if you look at the power tables for the Navi 10, the power and clocks are those for the 5700 (base clocks, moreover), not even the 5700XT, and the 5700 reaches these frequencies for sure.

eastmen · Sep 26, 2020

The Navi 21 at 3080 performance and prices would be a good grab and a good change of pace for AMD. If the information above is correct, we could see such a thing happen.

SpaceBeer · Sep 26, 2020

SimBy said:
I don't know. That 40-80 gap seems way too big. Something has to slot in there.

Agreed. But if Navi 22, with improved clocks and IPC, is ~20-25% faster than 5700XT, it would be at 2080 Super level. So a cut-down Navi 21 would be at 2080 Ti / 3070, and that makes sense.

arandomguy · Sep 26, 2020

If Navi 22 does have an inherent clock advantage over Navi 21 by the amount listed then the gap is really roughly the same as that of GA104 against GA102.

There is no full consumer GA104 or GA102, but the RTX 3070/RTX 3090 TF ratio is 57.1%.

Using the 12.8/22.5 numbers for Navi 22/21 gives 56.8%.
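
A quick sketch of that comparison, in case anyone wants to reproduce it. Since the FLOPs-per-unit factor cancels within one architecture, only unit count times clock matters. The Navi figures are the rumoured ones from this thread; the Ampere core counts and boost clocks (5888 at 1.725GHz, 10496 at 1.695GHz) are the published spec-sheet values as I remember them, so treat them as assumptions.

```python
def throughput_ratio(units_small: int, clock_small_ghz: float,
                     units_big: int, clock_big_ghz: float) -> float:
    """Ratio of peak throughput between two chips of the same architecture.

    The per-unit FLOPs factor cancels, leaving (units * clock) for each chip.
    """
    return (units_small * clock_small_ghz) / (units_big * clock_big_ghz)


# Rumoured figures from the thread: 40 CUs @ 2.5GHz vs 80 CUs @ 2.2GHz.
navi_ratio = throughput_ratio(40, 2.5, 80, 2.2)

# ASSUMED spec-sheet values: RTX 3070 (5888 cores, 1.725GHz boost)
# vs RTX 3090 (10496 cores, 1.695GHz boost).
ampere_ratio = throughput_ratio(5888, 1.725, 10496, 1.695)

if __name__ == "__main__":
    print(f"Navi 22 / Navi 21:   {navi_ratio:.1%}")
    print(f"RTX 3070 / RTX 3090: {ampere_ratio:.1%}")
```

Note the Navi ratio uses the unrounded 80 CU @ 2.2GHz figure (22.528 TF), which is where 56.8% comes from.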

3dilettante · Sep 26, 2020

Jawed said:
So this is an old and perhaps relevant thread, from 2015:

https://forum.beyond3d.com/threads/gpu-cache-sizes-and-architectures.56731/

And relevant posting by sebbbi in 2013:

Tahiti saw benefits from this "fully-tiled" benchmark and we now have GPUs with ~6x more fillrate, but only ~3x more bandwidth (RTX 3090, of course, has tiled rasterisation - but that wouldn't be relevant to sebbbi's HDR particles test).

Still, we don't know how RDNA works and whether it uses the ROP depth/colour cache that we saw in GCN.

You know, B3D forum is epic:

128MB L4

I dimly recall that there was discussion either here or elsewhere about what possible sizes of on-die cache would allow texturing behavior to be treated more as a well-behaved working set instead of a streaming, low-temporal-locality workload in that time period or earlier. I think the values at the time for the working set were then-impossible values in the hundreds of MB.
Perhaps with some relaxation on conditions, such as in-stack versus on-die and getting within an order of magnitude, it's at least theoretically possible.

Jawed said:
Efficient L2 Cache Management to Boost GPGPU Performance

This paper, from 2019, is a direct study of GCN, which improves an existing simulator (to achieve substantially closer simulated performance versus actual chip performance) and then goes on to propose a new cache architecture:

I may have to do more than skim this. There are some interesting questions about the non-disclosed elements of these architectures being probed, such as the number of misses from each cache level and the behavior of the cache protocol.
I'm curious about some of the assumptions made, and how they may affect the conclusions about simulator accuracy or the calculated values for things like miss-handling capacity.
The assumed hardware model has cache sizes that match SI, but it is CPU-like in some ways. The L1 latency is listed as 1-cycle, L2 is 10-cycle, and memory is 90.
At GDC 2018, AMD indicated GCN's figures at each level were ~114/190/350 cycles, or 114/76/160 when taking additive latency into account.
Could the loads be that different and not affect the conclusion of how accurate a given simulation is?
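
To make the two ways of quoting those latencies explicit, here is a small sketch that converts the cumulative GCN figures into per-level increments and sets them next to the simulator's 1/10/90. Treating the paper's numbers as per-level figures is my assumption, since I have not confirmed how they are defined, and the function name is just for illustration.

```python
def per_level_latency(cumulative: list[int]) -> list[int]:
    """Convert cumulative hit latencies (L1, L2, memory) to per-level increments."""
    steps = [cumulative[0]]
    for prev, cur in zip(cumulative, cumulative[1:]):
        steps.append(cur - prev)
    return steps


# Cumulative GCN figures quoted from the GDC 2018 material (cycles).
gcn_cumulative = [114, 190, 350]
gcn_steps = per_level_latency(gcn_cumulative)

# Simulator model figures; ASSUMED here to be per-level latencies.
paper_steps = [1, 10, 90]

if __name__ == "__main__":
    for level, gcn, sim in zip(["L1", "L2", "memory"], gcn_steps, paper_steps):
        print(f"{level}: GCN ~{gcn} cycles vs simulator {sim} ({gcn / sim:.1f}x)")
```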

I haven't thought enough about whether the IPC metric chosen could be influenced by this.

The proposed cache management scheme seems like a combination of cache pipelining and a small victim buffer seen on some CPUs, with an additional amount of cache line swapping.
At least in the less-flooded coherent systems for CPUs, I think some level of this behavior is going on as part of the interaction of the home nodes, caches, and memory controllers. There would be hundred-cycle periods where a significant number of cache lines would be useless, if not.
The swap function may or may not be done. The extra data movement in at least one direction might be an additional amount of complexity and consumption, at least for CPU-like scenarios.

I didn't see the reference for where the cache pipeline for the GPU was documented, or rather, how limited the "standard" pipelining is for the GPU L2 based on their claims. Perhaps it's possible that the pipeline needs to be less intensive due to the volume of traffic or other constraints, but the timeline of cache requests and invalidations has a cache line invalidated and idle for 100 cycles because the fetch isn't started until after the current line is fully gone, if my reading of the timing diagram is right. Never mind that if AMD's numbers are at least partly representative, the memory controller latency may be 2x worse.

Jawed said:
https://www.anandtech.com/show/6993/intel-iris-pro-5200-graphics-review-core-i74950hq-tested/3

The focus here, for graphics, appears to be purely textures.

On 22nm:

https://www.anandtech.com/show/6993/intel-iris-pro-5200-graphics-review-core-i74950hq-tested/4

20mm² on TSMC 7nm for 128MB?

It's possible Intel found that the Haswell GPU's ROP throughput was modest enough that it could dedicate the eDRAM to texturing. I'm not clear on whether the ROPs could leverage the GPU's L3 or the main L3, which could buffer their traffic further.
As far as Crystalwell at 7nm, I think the scaling didn't address that Crystalwell used Intel's eDRAM, which TSMC's 7nm does not have. Then there's the ballpark equivalence of foundry nodes being roughly the same density as Intel's N-1 node (current troubles aside).

An SRAM-based example might be Zen 2's L3 cache, which I've seen estimated to be ~34mm² for 32 MB. Granted, there are tags, reliability measures, connectivity, and bandwidth measures that could hurt density, depending on the parameters of this hypothetical 128MB GPU cache. I don't think those factors are enough to counter the lack of eDRAM or node name shenanigans.
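
As a rough sanity check on that estimate, here is a linear scaling sketch. It assumes area scales proportionally with capacity from the ~34mm² per 32 MB figure, which ignores any change in tags, banking, or wiring overhead, so it is a ballpark only.

```python
def scaled_sram_area_mm2(capacity_mb: float,
                         ref_area_mm2: float = 34.0,
                         ref_capacity_mb: float = 32.0) -> float:
    """Linearly scale SRAM area from a reference point.

    ASSUMPTION: area is proportional to capacity; the default reference
    is the ~34mm² per 32 MB Zen 2 L3 estimate quoted above.
    """
    return capacity_mb * ref_area_mm2 / ref_capacity_mb


if __name__ == "__main__":
    area = scaled_sram_area_mm2(128)
    print(f"128 MB SRAM at Zen 2 L3 density: ~{area:.0f} mm²")
    print(f"That is {area / 20:.1f}x the 20 mm² guess")
```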

(edit: lost the word shenanigans)

gamervivek · Sep 26, 2020

Looks like AMD can't do 3SEs after all,

https://forum.beyond3d.com/posts/2073537/

Cutting down to 72/64CUs might be in store for bridging the gap between N21 and N22.

Deleted member 13524 · Sep 26, 2020

"num_rb_per_se" refers to Render Backends per Shader Engine? If it's still 4 ROPs per RBE, we're looking at 64 ROPs on Navi 21?

BRiT · Sep 26, 2020

Most likely limited by memory bandwidth anyways?

trinibwoy · Sep 26, 2020

PSman1700 said:
80cu @2.2 damn. Nv has some serious competition then, well over 20TF I assume.

That fillrate though. 128 rops @ 2.2.

Maybe AMD should market their 8K gaming chops too.

Edit: Oh it's 64 rops? Why did I think it was 128.
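
For reference, a minimal sketch of what the two ROP counts would mean for peak pixel fillrate at 2.2GHz. It assumes one pixel per ROP per clock, and the 4 shader engines with 4 render backends each are illustrative values I picked to reach 64 ROPs at 4 ROPs per RBE, not figures confirmed in this thread.

```python
def rop_count(shader_engines: int, rb_per_se: int, rops_per_rb: int = 4) -> int:
    """Total ROPs from shader engine count and render backends per engine."""
    return shader_engines * rb_per_se * rops_per_rb


def peak_fillrate_gpix(rops: int, clock_ghz: float) -> float:
    """Peak pixel fillrate in Gpixels/s, ASSUMING one pixel per ROP per clock."""
    return rops * clock_ghz


if __name__ == "__main__":
    # ASSUMED configuration for illustration: 4 SEs x 4 RBs x 4 ROPs.
    rops = rop_count(shader_engines=4, rb_per_se=4)
    print(f"{rops} ROPs @ 2.2GHz: {peak_fillrate_gpix(rops, 2.2):.1f} Gpix/s")
    print(f"128 ROPs @ 2.2GHz: {peak_fillrate_gpix(128, 2.2):.1f} Gpix/s")
```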
