AMD CDNA Discussion Thread

Intel is putting HBM on a CPU, AMD is putting a CPU on HBM. Eighteen months from now, isolated use cases in data centers will be only a small part of the business. Accelerators exist because x86 CPUs are not fast enough. Combining a slow x86 CPU with an accelerator is a paradox.
Then why do so many clusters have CPU-only nodes?

Your view is stupidly simplistic.
Remember: executing on data is easy, moving data is hard. CPUs are orders of magnitude better at moving data that you don't know you need yet, or where you don't yet know which data you need (that's what the out-of-order execution machinery is for).
GPUs are orders of magnitude wider than CPUs and have far more concurrency.
CPU SIMD units are stupidly fast in latency terms (a 4-5 cycle FMA at 2.5-5 GHz).
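(Putting rough numbers on that, straight from the figures above: a 4-5 cycle FMA at 2.5-5 GHz completes in roughly 0.8-2 ns from issue to result.)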

HBM limits how large your working set can be, but has high throughput (though it isn't really "fast" per se).
DIMMs have low throughput but can offer TBs of working space.
DIMMs can also hold persistent memory.
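To put rough numbers on that tradeoff (peak, vendor-quoted figures, so treat them as order-of-magnitude only): an MI250X carries 128 GB of HBM2e at about 3.2 TB/s aggregate, while a two-socket DDR4-3200 server peaks at roughly 400 GB/s of memory bandwidth but can be populated with several TB of DRAM, plus Optane-style persistent DIMMs if you want persistence.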

By putting all of these on package, and by having coherency end to end, you can:
reduce unnecessary copying of data between "engines" (remember: moving is hard, executing is easy; see the sketch after this list)
remove slow, power-hungry off-package I/O interfaces
increase the working set size
accelerate the workload in either direction more easily
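To make the "reduce unnecessary copying" point concrete, here is a minimal sketch using CUDA's managed memory as a stand-in for end-to-end coherency (the equivalent HIP calls exist on the AMD side); the kernel, names, and sizes are mine, purely for illustration:

#include <cstdio>
#include <cuda_runtime.h>

// GPU touches the data in place; no staging buffer involved.
__global__ void scale(float *x, float s, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] *= s;
}

int main() {
    const int n = 1 << 20;          // illustration only: size is arbitrary
    float *x = nullptr;

    // Discrete-GPU style would be: malloc a host buffer, cudaMalloc a device
    // buffer, cudaMemcpy host->device, run the kernel, cudaMemcpy device->host.
    // That is the "move is hard" tax, paid twice over a narrow off-package link.

    // Coherent / unified style: one allocation both sides can touch directly.
    cudaMallocManaged(&x, n * sizeof(float));
    for (int i = 0; i < n; ++i) x[i] = 1.0f;        // CPU initializes

    scale<<<(n + 255) / 256, 256>>>(x, 2.0f, n);    // GPU updates the same memory
    cudaDeviceSynchronize();

    printf("x[0] = %f\n", x[0]);                    // CPU reads the result directly
    cudaFree(x);
    return 0;
}

On a discrete card the runtime still migrates pages behind the scenes to make this work; the point of putting CPU, GPU, and HBM on one coherent package is that even that migration largely goes away.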


This forum's Team Nvidia's beloved matrix engines are nothing more than exactly the above at play, but for a very specific workload: a very large number of reads against a very low number of writes.
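A quick way to see that read-to-write imbalance: in an N x N x N matrix multiply, each of the N^2 outputs accumulates N multiply-adds, each reading one element of A and one of B, so on the order of 2*N^3 element reads feed only N^2 writes. At N = 4096 that is about 8,192 reads per write, before the operand reuse that the matrix engines' register files and caches are built to exploit.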
 
That's a lot of "stupid"s for one post; it's hard to have a civil discussion when you're reducing people and processes to the idiotic and extreme.
 
32 A100s still consumed significantly less power than 16 MI250X (32 logical GPUs) while achieving similar performance. The situation becomes inverted at 128 logical GPUs, though, with the 128 A100s using slightly more power.
I got the impression that was their point? On paper (at least if the only metrics you look at are peak TFLOPS and TDP), the MI250X should be ~3.5x more efficient than the A100, but in real life it's very similar for this problem.
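For what it's worth, that ~3.5x paper figure falls out of peak vector FP64 per watt, using the commonly quoted board numbers (so treat this as a sketch, not the exact SKUs in the study): MI250X ~47.9 TFLOPS / ~560 W is about 86 GFLOPS/W, versus A100 SXM ~9.7 TFLOPS / 400 W at about 24 GFLOPS/W, i.e. roughly 3.5x on paper.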
 
That's a lot of "stupid"s for one post; it's hard to have a civil discussion when you're reducing people and processes to the idiotic and extreme.
No it isn't; these have been the fundamentals of computation for the last 70 years, and the problem is that it's really hard and there is no free lunch. As technology (process, packaging, etc.) improves, it's just the wheel of reincarnation turning again.

But feel free to actually put forward some sort of meaningful objection with some kind of actual substance...
 
I think the meaningful objection is clear: we cannot have productive conversations when you're being hyperbolic and personally insulting people.
 
"Single GCD is theoretical 1.6TB/s, in ideal synthetic benchmark case <1.4TB/s. They market it as a combined 3.2TB/s on MI250. Clearer would be "2x 1.6TB/s"."

Bingo. MI250X is just two CDNA2 chips glued together without an advcanced interconnection (and only 400GB/s...).
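Using the numbers in that quote: a GCD reads its own HBM at up to ~1.6 TB/s but reaches the other GCD's memory over a ~400 GB/s in-package link, roughly a 4x step down, which is why the part presents itself (and has to be programmed) as two logical GPUs rather than one.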
 
August 2, 2022
This time, I had the opportunity to benchmark a machine (Supermicro 4124GQ-TNMI) equipped with four AMD Instinct MI250s, a lower-tier model of the AMD Instinct MI250X used in Frontier. When it comes to GPUs, NVIDIA has been the incumbent for a long time, so I compared the machine against the NVIDIA A100 to see whether AMD can compete.
 
Looking at our broader Data Center portfolio, as expected, Data Center GPU sales were down significantly from a year ago when we had substantial shipments supporting the build-out of the Frontier exascale supercomputer.

The HPC world isn't just Linpack.
 

That's what I said a long time ago. Outside of these government-subsidized deals, the MI250 is nowhere to be seen. But kudos to AMD for having the right solution for a stupid exascale race that the US wanted to win.
More interesting now is what happens next with CDNA, since it's so far behind in most commercial AI/ML workloads. I don't see CDNA being profitable with the tiny revenue it generates. Maybe that's the reason they are going to big hybrid CPU+GPGPU HPC APU chiplets...
 