AMD: Navi Speculation, Rumours and Discussion [2019-2020]

Jay · Oct 18, 2020

yuri said:
RT perf is irrelevant. AMD simply has to be cheaper considering their weaker brand recognition and mainly their lacking SW - utterly horrid last gen driver experience, possibly broken OpenCL, lack of DLSS, CUDA, "driver utils" like Ansel, etc.

You think just being cheaper is enough?
It hasn't really helped them in the past.

SimBy · Oct 18, 2020

How about just being actually able to buy one.

Rootax · Oct 18, 2020

yuri said:
RT perf is irrelevant. AMD simply has to be cheaper considering their weaker brand recognition and mainly their lacking SW - utterly horrid last gen driver experience, possibly broken OpenCL, lack of DLSS, CUDA, "driver utils" like Ansel, etc.

I disagree here, not in 2020-2021... And with consoles having RT too.

tEd · Oct 18, 2020

trinibwoy said:
How would you even begin to calculate that? AMD's patent suggests each intersection engine can do 4 boxes or 1 triangle per clock. Even if you assume that’s true, what numbers would you use for Turing and Ampere?

Ampere 2 triangles/clock and shading concurrently.
Turing 1 triangle/clock
Navi 1 triangle/clock and also shading concurrently

...but there is much more to RT performance than just those numbers, I guess.

Deleted member 13524 · Oct 18, 2020

Jawed said:
Why would 6900XTX with ~41TF (2GHz) be on a completely different level from 3090 at 36TF? Only NVidia is now allowed to have huge amounts of compute?

Then your suggestion would be that Navi 21 quadruples the compute units over Navi 10, but not all the other execution units in the GPU? That would make RDNA2 a compute-centric architecture, but that's most probably not going to happen.
AMD isn't going to focus RDNA2 on compute throughput because they already have CDNA / Arcturus for that.

The way I see it, RDNA2 has a lot of on-chip cache to compensate for a lower bandwidth towards the VRAM, whereas CDNA focuses more die area on compute units with less on-chip cache because it uses HBM2.

DegustatoR · Oct 18, 2020

So about 650W for 160 CUs at 2GHz then? C'mon, guys, die area hasn't been the main limiter of GPU performance for years now.

Jawed · Oct 18, 2020

DegustatoR said:
So about 650W for 160 CUs at 2GHz then? C'mon, guys, die area hasn't been the main limiter of GPU performance for years now.

NVidia can build a GPU with 10752 FP32 ALU lanes but AMD can't build a GPU with 10240 ALU lanes on a better node?

Scott_Arm · Oct 18, 2020

5700XT w/ 1755 MHz game clock
9 TFLOPS
112 GPixel/s

72 CU w/ 2100 MHz game clock
19.4 TFLOPS (2.2x 5700XT)
268 GPixel/s (128 rops, 2.4x 5700XT)

80 CU w/ 2100 MHz game clock
21.5 TFLOPS (2.4x 5700XT)
268 GPixel/s (128 rops, 2.4x 5700XT)
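
For anyone who wants to check the figures above, here is a minimal sketch of the usual peak-rate arithmetic. It assumes 64 FP32 lanes per CU, 2 FLOPs per lane per clock (one FMA), and one pixel per ROP per clock. The 40 CU / 64 ROP 5700XT configuration is the shipping spec; the 72 and 80 CU parts with 128 ROPs at 2100 MHz are the rumoured configurations from this post. The function names are only illustrative.

```python
def peak_tflops(cus, clock_mhz, lanes_per_cu=64, flops_per_lane=2):
    """Peak FP32 TFLOPS = CUs * lanes per CU * FLOPs per lane per clock * clock."""
    return cus * lanes_per_cu * flops_per_lane * clock_mhz * 1e6 / 1e12


def pixel_fill_gpix(rops, clock_mhz):
    """Peak colour fill rate in GPixel/s, assuming 1 pixel per ROP per clock."""
    return rops * clock_mhz * 1e6 / 1e9


# (CUs, ROPs, game clock in MHz); the last two rows are rumoured, not confirmed.
configs = {
    "5700XT": (40, 64, 1755),
    "72 CU (rumoured)": (72, 128, 2100),
    "80 CU (rumoured)": (80, 128, 2100),
}

base_tf = peak_tflops(40, 1755)
base_fill = pixel_fill_gpix(64, 1755)

for name, (cus, rops, mhz) in configs.items():
    tf = peak_tflops(cus, mhz)
    fill = pixel_fill_gpix(rops, mhz)
    print(f"{name}: {tf:.1f} TFLOPS ({tf / base_tf:.1f}x), "
          f"{fill:.1f} GPixel/s ({fill / base_fill:.1f}x)")
```

These are peak rates only; they say nothing about how well a game actually scales with them.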

5700XT benchmarks:
borderlands 3 4k ultra ~33 fps
gears 5 4k ultra ~39 fps

Scaling from AMD's sample benchmarks:
borderlands 3 4k badass 61 (greater than 1.85x scaling, because this is badass and not ultra)
Gears 5 4k ultra 73 (1.88x scaling)

If the samples are the 72CU unit then 80CU extrapolates to the following assuming perfect scaling:
borderlands 3 4k badass 61 -> ~68 fps
Modern Warfare 4k ultra 88 -> ~98 fps
Gears 5 4k ultra 73 -> ~81 fps

If the samples are the 80CU unit then the 72CU extrapolates to the following assuming perfect scaling:
borderlands 3 4k badass 61 -> ~55 fps
Modern Warfare 4k ultra 88 -> ~79 fps
Gears 5 4k ultra 73 -> ~66 fps

The numbers they showed match up pretty closely with a 3080, so if they're from the 72CU part then the 80CU will match pretty closely with a 3090. If the numbers are for the 80CU, then the 72CU will be well behind the 3080.
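
The extrapolations above are just linear scaling by CU count. A rough sketch, assuming perfect scaling with CUs at identical clocks (which real games rarely manage); the fps values are the ones quoted from AMD's samples and the helper name is made up for illustration.

```python
def extrapolate_fps(fps, from_cus, to_cus):
    """Scale a frame rate linearly with CU count (perfect scaling, same clocks)."""
    return fps * to_cus / from_cus


# fps figures quoted from AMD's sample benchmarks in this post
samples = {
    "Borderlands 3 4K badass": 61,
    "Modern Warfare 4K ultra": 88,
    "Gears 5 4K ultra": 73,
}

for title, fps in samples.items():
    if_sample_is_72 = extrapolate_fps(fps, 72, 80)  # what an 80CU part would do
    if_sample_is_80 = extrapolate_fps(fps, 80, 72)  # what a 72CU part would do
    print(f"{title}: {fps} fps -> 80CU ~{if_sample_is_72:.0f} fps "
          f"or 72CU ~{if_sample_is_80:.0f} fps")
```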

trinibwoy · Oct 18, 2020

Jawed said:
NVidia can build a GPU with 10752 FP32 ALU lanes but AMD can't build a GPU with 10240 ALU lanes on a better node?

I don’t think ALU lanes are the issue. It’s 160 CUs vs 84 SMs. Not even in the same ballpark.

Deleted member 13524 · Oct 18, 2020

Jawed said:
NVidia can build a GPU with 10752 FP32 ALU lanes but AMD can't build a GPU with 10240 ALU lanes on a better node?

Arcturus has 8192 FP32 ALU lanes, probably with 1:2 FP64 throughput. And if RDNA2's clocks are any indication, it too should clock at around 2GHz for >32 TFLOPs.

I honestly don't get why gaming Ampere has so many ALUs - much more than the GA100 that is compute-oriented. It definitely doesn't translate into gaming performance.
That said, I don't know why AMD would follow suit. They tried their hand at using chips with lots of compute units to compete in the gaming market, and the result was a chip with comparatively low power efficiency (Vega 10).

RDNA is gaming-centric, so it shouldn't have more compute resources than what the other execution units and effective memory bandwidth can keep up with, for rasterization.

nAo · Oct 18, 2020

trinibwoy said:
If it’s the image I think you’re talking about I thought that was made up nonsense. Nvidia hasn’t shared any details about its RT units.

Yep, I’ve seen that image posted on twitter and some of the numbers were completely wrong.

SimBy · Oct 18, 2020

https://twitter.com/x/status/1317856460253089795

Like what? More like Zen? Boost as high as possible as long as you're within temp/power envelope?

Jawed · Oct 18, 2020

trinibwoy said:
I don’t think ALU lanes are the issue. It’s 160 CUs vs 84 SMs. Not even in the same ballpark.

80 WGPs with 4x SIMD-32s vs 84 SMs with 8x SIMD-16s. Hmm...

Are you saying that 4 TMU lanes in an SM versus 8 TMU lanes in a WGP is a major factor here? Is there something else? I can't read your mind.
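
To spell out the lane counts being compared, a quick sketch. The 80 WGP figure is the speculative configuration argued for in this thread, not a confirmed spec, and the helper name is illustrative.

```python
def fp32_lanes(units, simds_per_unit, simd_width):
    """Total FP32 ALU lanes = units * SIMDs per unit * lanes per SIMD."""
    return units * simds_per_unit * simd_width


# Speculative: 80 WGPs (160 CUs), each WGP with 4x SIMD-32
navi21_guess = fp32_lanes(80, 4, 32)

# Full GA102: 84 SMs, counted here as 8x SIMD-16 per SM
ga102_full = fp32_lanes(84, 8, 16)

print(navi21_guess, ga102_full)
```

That gives the 10240 and 10752 lane figures quoted earlier in the thread.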

ToTTenTranz said:
Arcturus has 8192 FP32 ALU lanes, probably with 1:2 FP64 throughput. And if RDNA2's clocks are any indication, it too should clock at around 2GHz for >32 TFLOPs.

There's a spec? There's a die size?

ToTTenTranz said:
I honestly don't get why gaming Ampere has so many ALUs - much more than the GA100 that is compute-oriented. It definitely doesn't translate into gaming performance.
That said, I don't know why AMD would follow suit. They tried their hand at using chips with lots of compute units to compete in the gaming market, and the result was a chip with comparatively low power efficiency (Vega 10).

RDNA looks like it is focussed on being bandwidth-efficient (in the CUs the focus is on minimised cache-thrashing). RDNA 2 looks like a major iteration on that concept, though perhaps the L1 and L2 papers and patent stuff is all already in RDNA.

ToTTenTranz said:
RDNA is gaming-centric, so it shouldn't have more compute resources than what the other execution units and effective memory bandwidth can keep up with, for rasterization.

What other execution units? ALU:TEX in my proposal is unchanged. My hypothesis includes ALU:colour-fillrate ratio being either doubled or quadrupled. We don't know if it's 128 or 64 ROPs. ALU:zixel-rate is either the same or doubled, because my theory is that AMD will double zixel rate per ROP.

Sounds like a gaming GPU to me. 80 WGPs is a lot of ray tracing, too.

If you disagree with 80 WGPs, then you have to explain a massive die that's been seen in a gaming card with GDDR6. 128MB last level cache could be the answer, but it seems really unlikely to me, simply because a monster cache that is almost the same size as all of the CUs (which are about 112mm²) hasn't been seen in XSX (and PS5's die size corresponds to that). If RDNA 2 has a monster cache then that would appear to imply that the consoles have no RDNA 2 features except for ray tracing. That would make them even more horrible than I thought they were...

Earlier I said XSX CUs are 15% of the die - that's wrong, they're about 31%. Also, I said that there's one L0 per WGP, but in the RDNA whitepaper it shows that 4x TMUs (per CU) have a dedicated L0 ("vector L0", which also caches non-textured memory reads). The instruction and scalar data caches are shared by both CUs in addition to LDS - because a workgroup is the general concept of a shared-state of computation, consisting of multiple wave64s (one to four) or wave32s (one to eight) of work-items.

I don't know why this guy has credibility for AMD leaks, but here's a fresh one:

https://twitter.com/x/status/1317851087777443841

236mm² is 79mm² larger than Navi 14:

https://flic.kr/p/49523073911

which is 157mm². Just in case you've not been paying attention, 40CUs in Navi 10 take 80mm², or if you prefer, 20 CUs in Navi 14 take 40mm² (there's actually 24).

So, how the hell does a "rumoured 32 CU" Navi 23 spend 79mm², when 8 more CUs should be about 16mm²?:

It has half the huge last level cache size of Navi 21?
It has RDNA 2 CUs (not seen in XSX or PS5) which are twice the area of RDNA 1 CUs? Damn, that ray tracing had better be godlike.
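
The area mismatch can be put into numbers. A back-of-envelope sketch using only the figures in this post: 2mm² per RDNA 1 CU (80mm² for Navi 10's 40 CUs), Navi 14 at 157mm² with 24 CUs, and the rumoured 236mm² / 32 CU Navi 23. The rumoured values are unconfirmed and the function name is made up for illustration.

```python
def unexplained_area_mm2(die_mm2, base_die_mm2, cus, base_cus, mm2_per_cu):
    """Return (extra die area, area the extra CUs should need, the remainder)."""
    extra_die = die_mm2 - base_die_mm2
    extra_cu_area = (cus - base_cus) * mm2_per_cu
    return extra_die, extra_cu_area, extra_die - extra_cu_area


MM2_PER_RDNA1_CU = 80 / 40  # Navi 10: 40 CUs in roughly 80 mm²

extra_die, extra_cu_area, leftover = unexplained_area_mm2(
    die_mm2=236,       # rumoured Navi 23 die size
    base_die_mm2=157,  # Navi 14 die size
    cus=32,            # rumoured Navi 23 CU count
    base_cus=24,       # Navi 14 CU count
    mm2_per_cu=MM2_PER_RDNA1_CU,
)
print(f"extra die: {extra_die} mm², extra CUs: {extra_cu_area} mm², "
      f"unaccounted: {leftover} mm²")
```

So roughly 63mm² of the increase is not accounted for by RDNA 1 sized CUs, which is the gap the two options above are trying to fill.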

PSman1700 · Oct 18, 2020

Jawed said:
If you disagree with 80 WGPs, then you have to explain a massive die that's been seen in a gaming card with GDDR6. 128MB last level cache could be the answer, but it seems really unlikely to me, simply because a monster cache that is almost the same size as all of the CUs (which are about 112mm²) hasn't been seen in XSX (and PS5's die size corresponds to that). If RDNA 2 has a monster cache then that would appear to imply that the consoles have no RDNA 2 features except for ray tracing. That would make them even more horrible than I thought they were...

Indeed, they're at roughly half the raw power of AMD's own dGPUs at launch. Even less compared to NV's stuff.

Jawed · Oct 18, 2020

I did a Navi 14 analysis:

It's interesting that 5mm² on such a small die is "edges" (perimeter of die). Capacitor ring? What else?

Some of the "unknown interior elements" (3mm²) appear to be blank die, where no small rectangles of functionality can be squeezed in. Not sure...

I've decided that "global control" is a better description than "uncore": graphics command processor, geometry processor, ACEs, HWS, DMA. I am suspicious that the real global control area is non-rectangular, "leaking" into the centre of the area that I think of as "(shader) engine common". I haven't found a Navi 10 die shot that has the stunning clarity of the Navi 14 die shot, so I can't make comparisons of function blocks...

I haven't worked out a way to say what's MC and what's L2 in the die shot. I made some assumptions for Navi 10, but I haven't thought of a way to improve those.

DegustatoR · Oct 18, 2020

Jawed said:
NVidia can build a GPU with 10752 FP32 ALU lanes but AMD can't build a GPU with 10240 ALU lanes on a better node?

It's not about how many lanes you have, it's about your power consumption. NV moved from 16 to 10nm and even with that they are well above 300W now. AMD doesn't even move anywhere with Navi2, it's the same process, and we already know their ballpark perf/watt gain. Let's be realistic here.

PSman1700 · Oct 18, 2020

DegustatoR said:
Let's be realistic here.

Yes, but consider that we want, or better said need, AMD to at least somewhat compete, otherwise we will see NV upping their prices again. At least the 20+TF rumor seems very realistic. Which is great, I think: before the NV Ampere unveil we were guessing 18TF tops for next-gen graphics processing units.

Kaotik · Oct 18, 2020

DegustatoR said:
It's not about how many lanes you have, it's about your power consumption. NV moved from 16 to 10nm and even with that they are well above 300W now. AMD doesn't even move anywhere with Navi2, it's the same process, and we already know their ballpark perf/watt gain. Let's be realistic here.

Actually we don't know which process it is. We know it's 7nm, but we don't know whether it's enhanced N7P or N7+. It's a different process compared to at least the Xbox SoC, and probably the PS5 SoC too (at least for Xbox it seems quite clear the "AMD enhanced 7nm" means the same node as Zen2 Refresh & Zen 3, which is "enhanced N7"). And I'm thinking it should be, given it's "N7P or better", since Navi1x were N7P already.

Silent_Buddha · Oct 18, 2020

ToTTenTranz said:
That said, I don't know why AMD would follow suit. They tried their hand at using chips with lots of compute units to compete in the gaming market, and the result was a chip with comparatively low power efficiency (Vega 10).

RDNA is gaming-centric, so it shouldn't have more compute resources than what the other execution units and effective memory bandwidth can keep up with, for rasterization.

If RDNA2 adopts the changes that MS requested (4x int8 and 8x int4) for Anaconda and Lockhart in their CUs then that would provide more flexibility for ML workloads.

Regards,
SB

Jawed · Oct 18, 2020

DegustatoR said:
Let's be realistic here.

I'm trying to be realistic about the use of die area. The only alternative being rumoured for the huge missing area is "massive cache". You have something better? Or do you think it's a monster cache?

Even a 4096-bit HBM bus in addition to 256-bit GDDR6 leaves a gaping mismatch.

All of this seems crazy. Clutching at straws, because there's no obviously "realistic" option.

A 900MHz range in clocks (for non-idle scenarios!) tells us that the GPU will clock down massively when given a particular kind of workload. That seems very likely to be sustained compute, which games are extremely bad at, therefore game clocks will be high.
