Bondrewd
Veteran
MI100 as a POC has been assessed in lab, right?
Of course. Those procurements stretch for years.
One possible area that will need to be resolved is software. Existing HPC applications and software will see MI200 as two GPUs, not one like MI100 (or A100), so from the standpoint of extracting the highest performance and efficiency there may be additional work, beyond using what is already available, to get the best-optimized workloads for MI200.
If MI200 is so bad, why did they buy it? MI100 as a POC has been assessed in lab, right?
One possible area that will need to be resolved is software. Existing HPC applications and software will see MI200 as two GPUs, not one like MI100 (or A100), so from the standpoint of extracting the highest performance and efficiency there may be additional work, beyond using what is already available, to get the best-optimized workloads for MI200.
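To make the "two GPUs, not one" point concrete, here is a minimal sketch (not from any of the posts) using the HIP runtime API; on an MI250(X) node each GCD is expected to enumerate as its own device with its own HBM2E pool, so existing code sees it much like a dual-GPU server:

```cpp
// Minimal HIP sketch: enumerate visible accelerators.
// On an MI250(X) node each GCD is expected to show up as a separate
// device here, so one MI250 card appears as two entries.
#include <hip/hip_runtime.h>
#include <cstdio>

int main() {
    int count = 0;
    if (hipGetDeviceCount(&count) != hipSuccess) {
        std::fprintf(stderr, "no HIP devices visible\n");
        return 1;
    }
    for (int i = 0; i < count; ++i) {
        hipDeviceProp_t prop{};
        if (hipGetDeviceProperties(&prop, i) == hipSuccess) {
            // Each GCD reports its own name and its own HBM2E capacity.
            std::printf("device %d: %s, %.1f GiB\n", i, prop.name,
                        prop.totalGlobalMem / (1024.0 * 1024.0 * 1024.0));
        }
    }
    return 0;
}
```

Work distribution and GCD-to-GCD traffic therefore have to be handled explicitly (multi-device-aware code, peer-to-peer copies), which is the extra software work the post is pointing at.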
Without any Nvidia exascale supercomputers being built, how is anyone even going to validate any of the things being said without any data?
At Argonne National Laboratory you should have Polaris coming online in 2022, and should be able to validate AI workload performance.
Argonne’s 44-Petaflops ‘Polaris’ Supercomputer Will Be Testbed for Aurora, Exascale Era (hpcwire.com)
The installation currently underway spans 280 HPE Apollo Gen10 Plus systems across 40 racks, aggregating a total of 560 AMD Epyc Rome CPUs and 2,240 Nvidia 40GB A100 GPUs with HPE’s Slingshot networking. As part of a planned upgrade, the second-gen Epyc Rome CPUs (32-core 7532 SKU) will be swapped out in March 2022 for third-gen Epyc Milans (the 32-core 7543 part).
...
At 44-petaflops (double-precision, peak) Polaris would rank among the world’s top 15-or-so fastest computers. The system’s theoretical AI performance tops out at nearly 1.4 exaflops, based on mixed-precision compute capabilities, according to HPE and Nvidia.
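As a sanity check on those headline figures (my arithmetic, not the article's): assuming per-A100 peaks of 19.5 TFLOPS FP64 (Tensor Core) and 624 TFLOPS FP16/BF16 with sparsity, 2,240 GPUs land almost exactly on the quoted numbers:

```cpp
// Back-of-the-envelope check of the Polaris figures quoted above.
// Assumed per-A100 peaks: 19.5 TFLOPS FP64 (Tensor Core) and
// 624 TFLOPS FP16/BF16 with sparsity. These are assumptions, not quotes.
#include <cstdio>

int main() {
    constexpr double gpus        = 2240.0;
    constexpr double fp64_tflops = 19.5;   // per GPU, FP64 Tensor Core
    constexpr double ai_tflops   = 624.0;  // per GPU, sparse FP16/BF16

    std::printf("FP64 peak: %.1f PFLOPS\n", gpus * fp64_tflops / 1000.0);  // ~43.7
    std::printf("AI peak:   %.2f EFLOPS\n", gpus * ai_tflops / 1.0e6);     // ~1.40
    return 0;
}
```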
And this is somehow unexpected? Do you expect Nvidia's or Intel's solution to have some magic sauce to connect them together as to not have it boil down to 2 GPUs with a faster interconnect between them?
Yes, AMD's upcoming RDNA3 GPUs will use MCM, and they will have to become one homogeneous big GPU or else performance will suffer considerably. Putting two GPUs on one stick is no different from putting two GPUs in SLI/Crossfire on one PCB; that was worthless, and the whole multi-GPU configuration trend died prematurely. We were led to believe CDNA2 is a true multi-die GPU acting as one, a true breakthrough, not some crossfired GPUs on a single PCB.
Bandwidth is still actually 200GB/s directional. Power consumption is 560w on water cooling.
(Like 50% lower bandwidth than the actual one, ~25% higher power consumption than the actual one, purposely confusing theoretical and measured performance, etc.)
"It's actually 2 GPUs on a stick with faster I/O between them which isn't that exciting."
And this is somehow unexpected? Do you expect Nvidia's or Intel's solution to have some magic sauce to connect them together as to not have it boil down to 2 GPUs with a faster interconnect between them?
I want whatever you were smoking when figuring out Polaris "attaining similar results" to Aurora.
At Argonne National Laboratory you should have Polaris coming online in 2022, and should be able to validate AI workload performance.
Argonne’s 44-Petaflops ‘Polaris’ Supercomputer Will Be Testbed for Aurora, Exascale Era (hpcwire.com)
The testing should be interesting from the aspect of comparing performance based on the number of GPUs required from each system (Polaris 2,240 vs Aurora 9,000) to attain similar results.
Edit: I doubt it will be used as a testbed but Leonardo will also be available with 10 exaflops of AI performance.
"It's actually 2 GPUs on a stick with faster I/O between them which isn't that exciting."
And this is somehow unexpected? Do you expect Nvidia's or Intel's solution to have some magic sauce to connect them together as to not have it boil down to 2 GPUs with a faster interconnect between them?
My bad, typo on my part. I meant they could use Polaris as a Frontier testbed for AI workloads in 2022. The point is there will be options to test Frontier workloads against when it comes up to speed.
Polaris is nothing but a quick'n'dirty "testbed" put together because Aurora is late. And Aurora won't have 9,000 GPUs; it'll be over 9,000 nodes, each of which has 6 Ponte Vecchios, which could be counted as 2 GPUs each.
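For what it's worth, the device counts the two posts are arguing about work out as follows (my arithmetic, using the node and GPU-per-node figures claimed in the thread):

```cpp
// Rough device-count comparison implied by the posts above.
// Aurora node count and GPUs-per-node are the figures claimed in the thread.
#include <cstdio>

int main() {
    constexpr int polaris_a100s  = 2240;  // from the hpcwire excerpt
    constexpr int aurora_nodes   = 9000;  // "over 9000 nodes", lower bound
    constexpr int pvc_per_node   = 6;
    constexpr int stacks_per_pvc = 2;     // if each Ponte Vecchio counts as 2 GPUs

    std::printf("Polaris: %d A100s\n", polaris_a100s);
    std::printf("Aurora:  >= %d Ponte Vecchios (>= %d if counted as 2 GPUs each)\n",
                aurora_nodes * pvc_per_node,
                aurora_nodes * pvc_per_node * stacks_per_pvc);
    return 0;
}
```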
For Nvidia? Everything else doesn't make any sense. At least their interconnect has to be faster than NVLink.
The only real issue is that MI200 isn't actually the "first multi-die GPU" that AMD promised. It's actually 2 GPUs on a stick with faster I/O between them which isn't that exciting.
Actually... the first commercial rasterization GPU was a multi-die GPU.
Inter die links are 400GB/s while the inter GPU links are 800 GB/s. All coherent links.
NVLink already provides up to 600GB/s bi-directional between GPUs on different PCBs. That's either using a switch or direct peer-to-peer connections. So 400GB/s between GPUs on the same interposer isn't earth shattering. Given current NVLink speeds you would expect much bigger numbers between Hopper dies on the same substrate.
Inter die links are 400GB/s while the inter GPU links are 800 GB/s. All coherent links.
NVLink already provides up to 600GB/s bi-directional
NVIDIA's NVLink3 is 300GB/s per direction (600GB/s bi-directional), which is still faster than AMD's 200GB/s per direction (400GB/s bi-directional) between dies.
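For reference, the 600 GB/s NVLink figure decomposes the same way as the Infinity Fabric numbers; the per-link assumptions below (12 NVLink 3 links per A100 at 25 GB/s per direction each) are mine, not from the posts:

```cpp
// Where the 600 GB/s NVLink aggregate comes from (assumed A100 link config).
#include <cstdio>

int main() {
    constexpr int    links         = 12;    // NVLink 3 links per A100 (assumption)
    constexpr double gbps_per_link = 25.0;  // GB/s per direction per link (assumption)

    const double per_direction = links * gbps_per_link;  // 300 GB/s
    std::printf("NVLink 3 aggregate: %.0f GB/s per direction, %.0f GB/s bi-directional\n",
                per_direction, 2 * per_direction);
    // Compare with the 200 GB/s per direction (400 GB/s bi-directional)
    // available between the two GCDs of an MI250(X), per the posts above.
    return 0;
}
```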
Page 8
That 800 GB/s number isn't real though, as you need to reserve some links for GPU<->CPU communication. Unless there's still an option to use PCIe.
Packaging tech EFB is interesting though, will probably be useful in democratizing HBM.
It's just ASE FOEB.
AMD Announces Instinct MI200 Accelerator Family: Taking Servers to Exascale and Beyond (anandtech.com)
All of these IF links are 16 bits wide and operate at 25Gbps/pin in a dual simplex fashion. This means there’s 50GB/second of bandwidth up and another 50GB/second of bandwidth down along each link. Or, as AMD likes to put it, each IF link is 100GB/second of bi-directional bandwidth, for a total aggregate bandwidth of 800GB/second. Notably, this gives the two GCDs within an MI250(X) 200GB/second of bandwidth in each direction to communicate among themselves. This is an immense amount of bandwidth, but for remote memory accesses it’s still going to be a fraction of the 1.6TB/second available to each GCD from its own HBM2E memory pool.
Otherwise, these links run at the same 25Gbps speed when going off-chip to other MI250s or an IF-equipped EPYC CPU. Besides the big benefit of coherency support when using IF, this is also 58% more bandwidth than what PCIe 4.0 would otherwise be capable of offering.
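The link arithmetic in that excerpt can be reproduced directly; the PCIe 4.0 x16 figure used for the ~58% comparison below (about 31.5 GB/s per direction after 128b/130b encoding) is my assumption, not something the article spells out:

```cpp
// Reproducing the Infinity Fabric link math from the excerpt above.
#include <cstdio>

int main() {
    constexpr double link_width_bits = 16.0;  // pins per IF link, per direction
    constexpr double gbps_per_pin    = 25.0;  // signalling rate

    // 16 pins * 25 Gbps = 400 Gbps = 50 GB/s per direction per link.
    constexpr double gb_per_s_per_dir = link_width_bits * gbps_per_pin / 8.0;

    // 4 links join the two GCDs; 8 links total per MI250(X) package.
    std::printf("GCD<->GCD: %.0f GB/s each way (%.0f GB/s bi-directional)\n",
                4 * gb_per_s_per_dir, 2 * 4 * gb_per_s_per_dir);
    std::printf("All 8 links: %.0f GB/s aggregate bi-directional\n",
                2 * 8 * gb_per_s_per_dir);

    // PCIe 4.0 x16: 16 GT/s * 16 lanes * 128/130 encoding ~= 31.5 GB/s per direction.
    constexpr double pcie4_x16 = 16.0 * 16.0 * 128.0 / 130.0 / 8.0;
    std::printf("One IF link vs PCIe 4.0 x16: %.1f%% more bandwidth\n",
                (gb_per_s_per_dir / pcie4_x16 - 1.0) * 100.0);  // ~58%, as the article says
    return 0;
}
```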
Page 8
https://www.amd.com/system/files/documents/amd-cdna2-white-paper.pdf
Inter-die IF: 4 links, 400 GB/s
GPU P2P: 8 links, 800 GB/s (long-range MTK SerDes, I think)
Host interface: 16 lanes, in IF mode or PCIe mode
Downstream: 1x 25Gbps to NIC/Slingshot