AMD CDNA Discussion Thread

If MI200 is so bad, why did they buy it? MI100 was assessed in the lab as a proof of concept, right?
One possible area that will need to be resolved is software. Existing HPC applications and software will see MI200 as two GPUs rather than one like MI100 (or A100), so from the standpoint of extracting the highest performance and efficiency there may be additional work beyond using what is already available to get the best-optimized workloads on MI200.
 
One possible area that will need to be resolved is software. Existing HPC applications and software will see MI200 as two GPUs rather than one like MI100 (or A100), so from the standpoint of extracting the highest performance and efficiency there may be additional work beyond using what is already available to get the best-optimized workloads on MI200.

That's not a problem for HPC. That software is already designed to target many CPUs and GPUs. The only real issue is that MI200 isn't actually the "first multi-die GPU" that AMD promised. It's actually 2 GPUs on a stick with faster I/O between them, which isn't that exciting.

[Image: AMD press deck slide describing MI200 as a multi-die GPU]
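To make the "two GPUs" point concrete, here is a minimal HIP sketch of what existing software actually sees on such a node (a sketch, assuming a ROCm install; on a node with a single MI250X it should report two devices, one per GCD, each with its own HBM2E pool):

```cpp
// Minimal HIP sketch: enumerate devices the way existing HPC software does.
// On a node with one MI250X module this is expected to report two devices
// (one per GCD), not one.
#include <hip/hip_runtime.h>
#include <cstdio>

int main() {
    int count = 0;
    if (hipGetDeviceCount(&count) != hipSuccess || count == 0) {
        std::fprintf(stderr, "no HIP devices found\n");
        return 1;
    }
    for (int i = 0; i < count; ++i) {
        hipDeviceProp_t prop;
        hipGetDeviceProperties(&prop, i);
        // Each GCD appears with its own name and its own memory pool.
        std::printf("device %d: %s, %.1f GB\n",
                    i, prop.name, prop.totalGlobalMem / 1e9);
    }
    return 0;
}
```

So any code that already distributes work per device should pick up both GCDs without changes; the open question is how much tuning it takes to make good use of the fast links between them.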
 
"It's actually 2 GPUs on a stick with faster I/O between them which isn't that exciting."
And this is somehow unexpected? Do you expect Nvidia's or Intel's solution to have some magic sauce connecting them together so that it doesn't boil down to 2 GPUs with a faster interconnect between them?
 
With no Nvidia exascale supercomputers being built, how is anyone even going to validate any of the claims being made? There's no data.
At Argonne National Laboratory you should have Polaris coming online in 2022, and should be able to validate AI workload performance.
The installation currently underway spans 280 HPE Apollo Gen10 Plus systems across 40 racks, aggregating a total of 560 AMD Epyc Rome CPUs and 2,240 Nvidia 40GB A100 GPUs with HPE’s Slingshot networking. As part of a planned upgrade, the second-gen Epyc Rome CPUs (32-core 7532 SKU) will be swapped out in March 2022 for third-gen Epyc Milans (the 32-core 7543 part).
...
At 44-petaflops (double-precision, peak) Polaris would rank among the world’s top 15-or-so fastest computers. The system’s theoretical AI performance tops out at nearly 1.4 exaflops, based on mixed-precision compute capabilities, according to HPE and Nvidia.
Argonne’s 44-Petaflops ‘Polaris’ Supercomputer Will Be Testbed for Aurora, Exascale Era (hpcwire.com)
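Those headline figures are easy to sanity-check from per-GPU A100 peaks; a quick back-of-envelope in C++ (assuming, on my part, that they are built from the 19.5 TFLOPS FP64 tensor-core peak and the 624 TFLOPS sparse FP16 tensor peak per A100):

```cpp
// Back-of-envelope check of Polaris's quoted peaks from per-A100 specs.
// Assumed per-GPU peaks (my assumption of what the quote counts):
// 19.5 TFLOPS FP64 (tensor core), 624 TFLOPS FP16 with sparsity.
#include <cstdio>

int main() {
    const int gpus = 2240;
    const double fp64_tflops = 19.5;
    const double fp16_sparse_tflops = 624.0;

    std::printf("FP64 peak: %.1f PF\n", gpus * fp64_tflops / 1e3);        // ~43.7 PF ("44 PF")
    std::printf("AI peak:   %.2f EF\n", gpus * fp16_sparse_tflops / 1e6); // ~1.40 EF ("nearly 1.4 EF")
    return 0;
}
```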

The testing should be interesting from the standpoint of comparing performance based on the number of GPUs each system requires (Polaris's 2,240 vs. Aurora's 9,000) to attain similar results.

Edit: I doubt it will be used as a testbed, but Leonardo will also be available, with 10 exaflops of AI performance.
 
And this is somehow unexpected? Do you expect Nvidia's or Intel's solution to have some magic sauce connecting them together so that it doesn't boil down to 2 GPUs with a faster interconnect between them?
Yes. AMD's upcoming RDNA3 GPUs will use MCM, and they will have to behave as one big homogeneous GPU or else performance will suffer considerably. Putting two GPUs on one stick is no different from putting two GPUs in SLI/Crossfire on one PCB; that was worthless, and the whole multi-GPU configuration trend died prematurely. We were led to believe CDNA2 was a true multi-die GPU acting as one, a true breakthrough, not some crossfired GPUs on a single PCB.
(like 50% lower bandwidth than the actual figure, ~25% higher power consumption than the actual figure, purposely confusing theoretical and measured performance, etc.)
Bandwidth is actually still 200 GB/s in each direction. Power consumption is 560 W on water cooling.
 
AMD's own benchmarks were made with the 560 W version. So MI250X is around 36% more efficient than the A100 SXM while using twice the footprint, and only on par with the 300 W PCIe version. And I guess the 20% increase in FP16 is only sustained at 560 W, too. So it is less efficient while being around twice as big...

"It's actually 2 GPUs on a stick with faster I/O between them which isn't that exciting."
And this is somehow unexpected? Do you expect Nvidia's or Intel's solution to have some magic sauce connecting them together so that it doesn't boil down to 2 GPUs with a faster interconnect between them?

For Nvidia? Anything else doesn't make any sense. At the very least, their interconnect has to be faster than NVLink.
 
At Argonne National Laboratory you should have Polaris coming online in 2022, and should be able to validate AI workload performance.

Argonne’s 44-Petaflops ‘Polaris’ Supercomputer Will Be Testbed for Aurora, Exascale Era (hpcwire.com)

The testing should be interesting from the standpoint of comparing performance based on the number of GPUs each system requires (Polaris's 2,240 vs. Aurora's 9,000) to attain similar results.

Edit: I doubt it will be used as a testbed, but Leonardo will also be available, with 10 exaflops of AI performance.
I want whatever you were smoking when you figured Polaris would be "attaining similar results" to Aurora.
Polaris is nothing but a quick'n'dirty "testbed" put together because Aurora is late. And Aurora won't have 9,000 GPUs; it'll be over 9,000 nodes, each of which has six Ponte Vecchios, which could be counted as two GPUs each.
 
"It's actually 2 GPUs on a stick with faster I/O between them which isn't that exciting."
And this is somehow unexpected? Do you expect Nvidia's or Intel's solution to have some magic sauce connecting them together so that it doesn't boil down to 2 GPUs with a faster interconnect between them?

You’re asking this even after I shared AMD’s slide calling MI200 a multi-die GPU? Singular. As in one GPU.

If Nvidia or Intel make the same claim we can talk about it then.
 
Polaris is nothing but a quick'n'dirty "testbed" put together because Aurora is late. And Aurora won't have 9,000 GPUs; it'll be over 9,000 nodes, each of which has six Ponte Vecchios, which could be counted as two GPUs each.
My bad, typo on my part. I meant they could use Polaris as a Frontier testbed for AI workloads in 2022. The point is there will be options to test Frontier workloads against when it comes up to speed.
 
For Nvidia? Anything else doesn't make any sense. At the very least, their interconnect has to be faster than NVLink.

NVLink already provides up to 600 GB/s bi-directional between GPUs on different PCBs, either through a switch or direct peer-to-peer connections. So 400 GB/s between GPUs on the same interposer isn't earth-shattering. Given current NVLink speeds, you would expect much bigger numbers between Hopper dies on the same substrate.

Someone mentioned it earlier in the thread, but the big benefit of MI200 could be compute density: you can fit more MI200 dies than A100 dies in the same space.
 
The only real issue is that MI200 isn't actually the "first multi-die GPU" that AMD promised. It's actually 2 GPUs on a stick with faster I/O between them, which isn't that exciting.
Actually, the first commercial rasterization GPU was a multi-die GPU.


I'm also surprised and a bit disappointed that AMD is still calling MI200 a GPU.
 
NVLink already provides up to 600 GB/s bi-directional between GPUs on different PCBs, either through a switch or direct peer-to-peer connections. So 400 GB/s between GPUs on the same interposer isn't earth-shattering. Given current NVLink speeds, you would expect much bigger numbers between Hopper dies on the same substrate.
Inter-die links are 400 GB/s, while the inter-GPU links are 800 GB/s, all coherent links.
That's direct from the CDNA2 whitepaper.
But a lot of it looks like repurposed tech from EPYC.
Anyway, the Trento CPU is a one-off project, and possibly MI200 is too.

The EFB packaging tech is interesting though; it will probably be useful in democratizing HBM by cutting silicon interposer costs drastically.
The real next gen is MI300.
But I bet the system architects already evaluated everything and went with MI200 for this gen.
They have Spock, they have the previous machine, and they have the exascale readiness task force from the ECP project evaluating all the architectures.
I bet they are at least as smart as the forum members here, if not smarter; just about everyone there carries a PhD in their name tag.

The eventual goal is to ensure software for Frontier and Aurora can run on both.
You can Google around; there is lots of work done by government agencies on software portability, and they awarded contracts to Codeplay (by Argonne) and Mentor Graphics (by ORNL) for software work on the Intel and AMD systems.
Their goal is basically to drop all non-open, non-industry-standard frameworks; it's public info, I won't bother linking, it's just a few minutes of Google search.
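To make the portability goal concrete, here is a minimal sketch of the kind of standards-based offload code that work pushes toward: plain OpenMP target directives that both the AMD (ROCm) and Intel (oneAPI) toolchains can compile, instead of a vendor-specific framework. The DAXPY kernel is just an illustrative stand-in, not anything from the actual Frontier or Aurora codes:

```cpp
// Minimal sketch of vendor-neutral GPU offload using OpenMP target
// directives; the same source can be built for AMD or Intel GPUs,
// which is the portability goal described above.
#include <cstdio>
#include <vector>

int main() {
    const int n = 1 << 20;
    std::vector<double> x(n, 1.0), y(n, 2.0);
    const double a = 0.5;
    double* xp = x.data();
    double* yp = y.data();

    // DAXPY offloaded to whichever GPU the compiler targets.
    #pragma omp target teams distribute parallel for \
        map(to: xp[0:n]) map(tofrom: yp[0:n])
    for (int i = 0; i < n; ++i)
        yp[i] += a * xp[i];

    std::printf("y[0] = %.2f\n", yp[0]); // expect 2.50
    return 0;
}
```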
 
All of these IF links are 16 bits wide and operate at 25Gbps/pin in a dual simplex fashion. This means there’s 50GB/second of bandwidth up and another 50GB/second of bandwidth down along each link. Or, as AMD likes to put it, each IF link is 100GB/second of bi-directional bandwidth, for a total aggregate bandwidth of 800GB/second. Notably, this gives the two GCDs within an MI250(X) 200GB/second of bandwidth in each direction to communicate among themselves. This is an immense amount of bandwidth, but for remote memory accesses it’s still going to be a fraction of the 1.6TB/second available to each GCD from its own HBM2E memory pool.

Otherwise, these links run at the same 25Gbps speed when going off-chip to other MI250s or an IF-equipped EPYC CPU. Besides the big benefit of coherency support when using IF, this is also 58% more bandwidth than what PCIe 4.0 would otherwise be capable of offering.
AMD Announces Instinct MI200 Accelerator Family: Taking Servers to Exascale and Beyond (anandtech.com)
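The per-link math in that quote checks out; here is a quick plain-C++ sanity check using only the figures quoted above (the PCIe comparison assumes, on my part, a PCIe 4.0 x16 link at roughly 31.5 GB/s per direction):

```cpp
// Back-of-envelope check of the Infinity Fabric link math quoted above.
#include <cstdio>

int main() {
    const double gbps_per_pin = 25.0; // 25 Gbps/pin
    const int width_bits = 16;        // each IF link is 16 bits wide

    // Dual simplex: 16 pins each way at 25 Gbps.
    const double per_dir   = width_bits * gbps_per_pin / 8.0; // 50 GB/s per direction
    const double per_link  = 2.0 * per_dir;                   // 100 GB/s "bi-directional"
    const double aggregate = 8 * per_link;                    // 8 links -> 800 GB/s
    const double gcd_link  = 4 * per_dir;                     // 4 links between GCDs -> 200 GB/s each way

    // Assumed PCIe 4.0 x16 figure, used to reproduce the "58% more" claim.
    const double pcie4_x16 = 31.5; // GB/s per direction

    std::printf("per link, per direction:   %.0f GB/s\n", per_dir);
    std::printf("per link, bi-directional:  %.0f GB/s\n", per_link);
    std::printf("aggregate over 8 links:    %.0f GB/s\n", aggregate);
    std::printf("GCD<->GCD, each direction: %.0f GB/s\n", gcd_link);
    std::printf("vs PCIe 4.0 x16:           %.1f%% more\n",
                (per_dir / pcie4_x16 - 1.0) * 100.0); // ~58.7%
    return 0;
}
```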
 
Page 8: https://www.amd.com/system/files/documents/amd-cdna2-white-paper.pdf
Inter-die IF: 4x links, 400 GB/s
GPU P2P: 8x links, 800 GB/s (long-range MTK SerDes, I think)
Host interface: 16x lanes, in IF mode or PCIe mode
Downstream: 1x 25 Gbps to NIC/Slingshot

It doesn't say whether the host interface is dedicated or uses one of the eight external links. The latter certainly seems to be the case from the picture: two links to the CPU in PCIe mode and six links to other GPUs.
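On the software side, those coherent links surface as ordinary peer-to-peer access between the two enumerated devices. A minimal HIP sketch (assuming, on my part, that the two GCDs of one module enumerate as devices 0 and 1):

```cpp
// Minimal HIP sketch: check and enable peer-to-peer access between the
// two GCDs of one MI250X, assumed here to be devices 0 and 1.
#include <hip/hip_runtime.h>
#include <cstdio>

int main() {
    int can01 = 0, can10 = 0;
    hipDeviceCanAccessPeer(&can01, 0, 1);
    hipDeviceCanAccessPeer(&can10, 1, 0);
    std::printf("P2P 0->1: %d, 1->0: %d\n", can01, can10);

    if (can01 && can10) {
        // Map each GCD's memory into the other's address space; after this,
        // kernels on device 0 can dereference device 1 pointers, with the
        // traffic carried over the in-package IF links.
        hipSetDevice(0);
        hipDeviceEnablePeerAccess(1, 0); // second argument is flags, must be 0
        hipSetDevice(1);
        hipDeviceEnablePeerAccess(0, 0);
    }
    return 0;
}
```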
 