AMD Execution Thread [2023]

About the same as the 6950 XT on average.
Hardly surprising, considering the exact same CU/TMU count and memory bandwidth, and very similar boost clock and texture fillrate (GT/s).

Even in raytracing-heavy synthetic benchmarks like 3DMark Speedway, RX 6950 XT and RX 7900 GRE score about the same, while RX 7900 XT / XTX score 1.3 to 1.7 times higher, scaling with CU/TMU count, boost clocks, and memory bandwidth.

https://www.3dmark.com/compare/sw/122528/sw/758838/sw/300168/sw/550558/sw/209573/sw/306966
https://www.3dmark.com/compare/pr/2198884/pr/2485975/pr/2002278/pr/2044919/pr/2465595/pr/1948665
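As a quick sanity check on the fillrate point above, here's a sketch computing peak texture fillrate as simply TMUs × boost clock; the clocks are the official boost figures, so treat the output as approximate:

```python
# Peak texture fillrate = TMUs x boost clock, a rough proxy for the
# GT/s similarity noted above. Official boost clocks, so approximate.
cards = {
    "RX 6950 XT":  {"tmus": 320, "boost_ghz": 2.310},
    "RX 7900 GRE": {"tmus": 320, "boost_ghz": 2.245},
}

base = cards["RX 6950 XT"]["tmus"] * cards["RX 6950 XT"]["boost_ghz"]
for name, c in cards.items():
    gts = c["tmus"] * c["boost_ghz"]
    print(f"{name}: {gts:.0f} GT/s ({gts / base:.2f}x the RX 6950 XT)")
# -> ~739 vs ~718 GT/s, within about 3% of each other
```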

That would mean having 3 different models within $150, which seems a bit tight? While the RX 7900 GRE isn't available at retail outside China, it does have an official MSRP of $649
Exactly. RX 7900 GRE is available to OEMs though - even Computerbase.de took their review card from a prebuilt PC - and typically you can find OEM parts at large retailers.

So RX 7800 XT would have little chance against RX 7900 GRE if it's only $50 cheaper, and RX 7700 XT would likewise struggle at double the price of RX 7600 for only 50% more performance.
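Taking the ratios from that last point at face value, the implied performance per dollar actually regresses; a quick sketch (both ratios are the ones stated above, not measured figures):

```python
# Relative performance per dollar of a hypothetical RX 7700 XT vs RX 7600,
# using only the ratios stated above: ~50% more performance at ~2x the price.
perf_ratio = 1.5   # assumed, per the post above
price_ratio = 2.0  # assumed, per the post above

value_ratio = perf_ratio / price_ratio
print(f"perf/$ vs RX 7600: {value_ratio:.2f}x")  # 0.75x, i.e. 25% worse
```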

With 7800 XT naming, I expect around a $600 price tag, which would represent only a very minor performance-per-dollar improvement over the 6800 XT. $550 could make it slightly more appealing, but I don't think AMD will do that.

I don't think RX 6800 XT pricing would apply to a much smaller die, when RX 7600 is already cheaper than RX 6600 (and actually the same as RX 5600 XT was). I'd rather expect mid-range RX 6750 XT / RX 6650 XT price levels for Navi 32.

I expect a ~52 CU 7700 XT to perform similarly to a 6800 for about $500, again representing a minimal improvement in performance per dollar over the previous gen.

My bet is the RX 7800 XT could be $499 to $549 if it's a 60 CU part, while the RX 7700 XT could be $449, and probably as low as $399, if it's a 48 CU part.

[edit] A leak suggesting the same:
 
I don't think RX 6800 XT pricing would apply to a much smaller die, when RX 7600 is already cheaper than RX 6600 (and actually the same as RX 5600 XT was). I'd rather expect mid-range RX 6750 XT / RX 6650 XT price levels for Navi 32.
N32 is a chiplet design with an N5 GCD though. I doubt that you can expect it to have the same cost structure as N33.
 
The point was that some workloads are memory-bound and some compute-bound; you can't just pick whichever suits your personal preference as ground truth.
You are losing context here; this blog post claims sweeping improvements based on a single memory-bound test and unoptimized code.

And who made these sweeping conclusions based on a single memory-bound workload that you are talking about?

From the blog post:

""In this post, we are taking a deep look at how well AMD GPUs can do compared to a performant CUDA solution on NVIDIA GPUs as of now""

""Our study focuses on consumer-grade GPUs as of now. Based on our past experiences, MLC optimizations for consumer GPU models usually are generalizable to cloud GPUs (e.g. from RTX 4090 to A100 and A10g)""

""We are confident that the solution generalizes across cloud and consumer-class AMD and NVIDIA GPUs, and will also update our study once we have access to more GPUs""
 
From the blog post:

""In this post, we are taking a deep look at how well AMD GPUs can do compared to a performant CUDA solution on NVIDIA GPUs as of now""

""Our study focuses on consumer-grade GPUs as of now. Based on our past experiences, MLC optimizations for consumer GPU models usually are generalizable to cloud GPUs (e.g. from RTX 4090 to A100 and A10g)""

""We are confident that the solution generalizes across cloud and consumer-class AMD and NVIDIA GPUs, and will also update our study once we have access to more GPUs""
Whether their tests are deep or not is debatable, sure, but in none of those quotes are they generalizing their results to other workloads. The title of the blog post clearly specifies "LLM inference" and I don't see them step outside of that topic.

Are they wrong in assuming generalizability to cloud GPUs, or how are those quotes relevant?
 
Are they wrong in assuming generalizability to cloud GPUs, or how are those quotes relevant?
Depending on what people intend to do with the cloud GPU, setting up a server without static or dynamic batch inference would be bad due to poor amortization (or the lack thereof). Usually, people opt for single-batch inference only when low-latency inference is an absolute requirement; otherwise, static or dynamic batching is pretty much a requirement for a production environment. Making any generalizations about 'LLM inference' based on a single-batch test of a single model is also quite misleading, since this is not what is used in real setups.
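To illustrate the amortization point, here's a toy model of memory-bound decode throughput versus batch size; all constants are made up for illustration, and real serving stacks (static/dynamic/continuous batching) behave more subtly:

```python
# Toy model: in the memory-bandwidth-bound decode phase, each step streams
# the full weight set once regardless of batch size, so that cost is
# amortized across all sequences in the batch. Constants are illustrative.
WEIGHT_READ_MS = 20.0  # hypothetical time to stream weights once per step
PER_SEQ_MS = 0.5       # hypothetical extra per-sequence cost per step

for batch in (1, 4, 16, 64):
    step_ms = WEIGHT_READ_MS + PER_SEQ_MS * batch
    tokens_per_s = batch * 1000.0 / step_ms  # one token per sequence per step
    print(f"batch={batch:3d}: {tokens_per_s:7.1f} tok/s aggregate")
# aggregate throughput rises steeply with batch size until the
# per-sequence term starts to dominate
```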
 
Making any generalizations about 'LLM inference' based on a single-batch test of a single model is also quite misleading, since this is not what is used in real setups.
I'm not sure the authors ever generalized their findings to all types of LLM inference workloads. Looks to me like the quotes are about generalizing their specific method (MLC) to other GPUs.
 
N32 is a chiplet design with an N5 GCD though. I doubt that you can expect it to have the same cost structure as N33.
To recap a previous post, the main GCD would be ~200 mm2, the same as Navi 33, and 3 to 4 MCDs (for a 192 to 256-bit GDDR6 interface), at ~37 mm2 each, would be the same dies as packaged in Navi 31 - so yields should be better for these smaller dies, and the 'high-performance fan-out' interconnect on an organic substrate is the same technology as used in Ryzen processors. Packaging costs would be higher though, as data bus interfaces are much wider compared to CPU packages and require more die area dedicated to fan-out contacts.
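The yield argument can be sketched with a simple Poisson defect model, Y = exp(-D*A); the defect density below is a placeholder, not a published figure, and real yields also depend on redundancy and binning:

```python
import math

# Poisson yield model: Y = exp(-D * A), D in defects/mm^2, A in mm^2.
# D is an assumed placeholder, not a published TSMC number.
D = 0.001

def die_yield(area_mm2: float) -> float:
    return math.exp(-D * area_mm2)

gcd_mm2, mcd_mm2 = 200, 37        # per the estimates above
mono_mm2 = gcd_mm2 + 4 * mcd_mm2  # hypothetical monolithic equivalent

print(f"monolithic ~{mono_mm2} mm2: {die_yield(mono_mm2):.1%}")
print(f"GCD ~{gcd_mm2} mm2:         {die_yield(gcd_mm2):.1%}")
print(f"MCD ~{mcd_mm2} mm2:          {die_yield(mcd_mm2):.1%}")
# With known-good-die assembly, only a failing chiplet is scrapped,
# not the whole package's worth of silicon.
```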

BTW there was a leak alleging $549 for RX 7800 and $449 for RX 7700.

NAVI4X
NAVI4M
NAVI4C

MI400 Mercury/Venus/Earth
I would say these designate different configurations and interconnects of different parts of the physical chips
'Navi4X' looks like a general designation that encompasses all dies, rather than a specific die design. 'Navi4M' could be a 'monolithic' or 'mobile' die (i.e. Navi 44 according to previous leaks), and 'Navi4C' could be 'chiplet' (i.e. Navi 41/42/43), or maybe 'custom' (i.e. Sony/Microsoft).


MI400 names look like APU, GPU-only and CPU-only variants, just like MI300 would come in A, X, C, and P versions.
 
Moore's Law is Dead just leaked a Navi4C presentation slide from several months ago, and it's basically a GCD die split into a few dozen vertically stacked chiplets! o_O

Three Shader Engine dies (SED) sit on top of each of the three Active Interposer dies (AID), where AIDs are interconnected through horizontal silicon bridges to their neighbours and the Multimedia I/O die (MID).

This design is based on AMD patent US20220320042A1 Die stacking for modular parallel processors.

 
Moore's Law is Dead just leaked a Navi4C presentation slide from several months ago, and it's basically a GCD die split into a few dozen vertically stacked chiplets! o_O

Up to three Shader Engine dies (SED) sit on top of each Active Interposer die (AID), where AIDs are interconnected through horizontal silicon bridges to neighbours and the Multimedia I/O die (MID); Memory-Cache dies (MCD) are not pictured, but supposedly present.

This design is based on AMD patent US20220320042A1 Die stacking for modular parallel processors.


Yes, the thing was to have CU counts in the upper 200s.
And no, there are no MCDs there.
It's also a tiny little peek into EPYC Venice.
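For scale, that 'upper 200s' figure drops straight out of the leaked topology; the CUs-per-SED value below is purely an assumed round number:

```python
# Hypothetical Navi4C CU-count math. The AID and SED counts come from the
# leaked slide; CUs per SED is an assumption for illustration only.
aids = 3
seds_per_aid = 3  # 'up to three' per the slide
cus_per_sed = 32  # assumed round number

total_cus = aids * seds_per_aid * cus_per_sed
print(f"{aids} AIDs x {seds_per_aid} SEDs x {cus_per_sed} CUs = {total_cus} CUs")
# -> 288 CUs, consistent with 'CU counts in the upper 200s'
```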
 
'Upper 200s' CUs, on a 500 mm2 die? o_O

200+ CUs running at 3.1+ GHz should deliver ~3000 GT/s and ~200 TFLOPS, or 3x the RX 7900 XTX (if counting 4 ops per clock, as in the dual-issue FP32 ALUs of RDNA3); at those clocks, a ~100 CU part would be ~25% faster than the XTX, and a ~72 CU part about the same (see the sketch below the table). That's not counting the impact of memory bandwidth and caches.

Model | ALU:TMU:ROP | CU | Die size mm2 | Clock GHz | FP32 TFLOPS | Texture GT/s | Pixel GP/s
RX 6950 XT (RDNA2) | 5120:320:128 | 80 | 520 | 2.3 | 23.65 | 739 | 296
RX 7900 XTX (RDNA3) | 6144:384:192 | 96 | 531 | 2.5 | 61.44 | 960 | 480
Instinct MI250X (CDNA2) | 14080:880:0 | 220 | 1540 | 1.7 | 47.87 | 1496 | 0
Instinct MI300X (CDNA3) | 19456:1216:0 | 304 | 1018 | 2.5 | 97.28 | 3040 | 0
NAVI4M (RDNA4) 'RX 8400' | 2304:144:96 | 36 | 108 | 3.125 | 28.8 | 450 | 225
NAVI4M (RDNA4) 'RX 8500 XT' | 3072:192:96 | 48 | 140 | 3.125 | 38.4 | 600 | 300
NAVI4M (RDNA4) 'RX 8600 XT' | 4608:288:144 | 72 | 207 | 3.125 | 57.6 | 900 | 450
NAVI4M (RDNA4) 'RX 8700 XT' | 6144:384:192 | 96 | 270 | 3.125 | 76.8 | 1200 | 600
NAVI4C (RDNA4) 'RX 8800' | 9216:576:288 | 144 | 390 | 3.125 | 115.2 | 1800 | 900
NAVI4C (RDNA4) 'RX 8900 XT' | 11264:704:352 | 176 | 470 | 3.125 | 140.8 | 2200 | 1100
NAVI4C (RDNA4) 'RX 8950' | 12288:768:384 | 192 | 515 | 3.125 | 153.6 | 2400 | 1200
NAVI4C (RDNA4) 'RX 8970' | 14336:896:448 | 224 | 595 | 3.125 | 179.2 | 2800 | 1400
NAVI4C (RDNA4) 'RX 8990' | 16384:1024:512 | 256 | 675 | 3.125 | 204.8 | 3200 | 1600


[Edit] Navi4x model numbers / die sizes are tentative and based on reported TSMC N3P transistor densities, which are probably too optimistic...
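The derived columns in the table follow directly from the raw configs; here's a sketch reproducing two of the rows (FP32 assumes 2 ops per clock per ALU via FMA, doubled again for the RDNA3/RDNA4 dual-issue ALUs):

```python
# FP32 TFLOPS = ALUs x 2 (FMA) x dual_issue x GHz / 1000
# Texture GT/s = TMUs x GHz; Pixel GP/s = ROPs x GHz
rows = [
    # name,                  alus, tmus, rops, ghz,   dual-issue factor
    ("RX 7900 XTX",          6144, 384,  192,  2.5,   2),
    ("NAVI4M 'RX 8700 XT'",  6144, 384,  192,  3.125, 2),
]

for name, alus, tmus, rops, ghz, dual in rows:
    tflops = alus * 2 * dual * ghz / 1000
    print(f"{name}: {tflops:.2f} TFLOPS, {tmus * ghz:.0f} GT/s, {rops * ghz:.0f} GP/s")
# -> 61.44 TFLOPS / 960 GT/s / 480 GP/s and 76.80 TFLOPS / 1200 GT/s / 600 GP/s,
#    matching the table rows above
```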
 