AMD CDNA Discussion Thread

But it's not slower!
Frontier part has ~1700MHz fmax.

I know that, i.e. it is certainly not slower than MI100. But I have the feeling that, from a process point of view, it could clock even higher (we have seen the latest Vega derivatives going above 2 GHz in a low power budget, CDNA1 was a Vega derivative, and CDNA2 is an evolution of that). But of course when designing a product you need to balance performance, power, cost, and so on.
 
Can someone remind me why GCN is better for HPC than RDNA? Better flops per mm^2?

Quite probably that, and the much better FP64 support since Radeon VII. When Vega was out, there was a long debate about it being more a compute oriented architecture rather than a gaming architecture.
Edit: I've found this interesting article about the differences between these two architectures.

https://www.hardwaretimes.com/difference-between-amd-rdna-vs-gcn-gpu-architectures/
 

Thanks. GCN has more compute resources per scheduler, which explains the density advantage in terms of peak FLOPS. But doesn’t the underutilization problem in games also affect HPC workloads?

I also don’t understand this part of the article. What’s an example of a short work queue that wouldn’t saturate GCN but would saturate RDNA with the same total number of ALUs? Don't most draw calls generate millions of work items?

“Each Compute Unit would also work on four 64-item waves. The reason why this wasn’t very effective is that most games use shorter work queues due to which only one or two out of the four wavefronts were saturated per execution cycle.”
 
I am not a programming expert, but if I understand correctly: while both HPC and gaming workloads involve massively parallel calculations, games involve frequent context switching and branching, whereas compute workloads behave much more simply. Compute therefore achieves better utilization than gaming on the GCN/Vega architecture (which was designed to maximize peak throughput), while RDNA (which was optimized for latency) is less sensitive to the workload; FLOPS/mm^2, though, is quite a bit higher on Vega.
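A back-of-envelope sketch of the wavefront math (not a cycle-accurate model; it only uses the documented wave64 / 4x SIMD16 layout of a GCN CU and the wave32 / 2x SIMD32 layout of an RDNA CU, and the dispatch sizes in the loop are made-up examples):

```python
# How many resident waves does a small dispatch put on each SIMD of one CU?
# GCN CU:  4x SIMD16, wave64; a SIMD issues one instruction per wave
#          every 4 cycles, so it needs several resident waves to stay busy.
# RDNA CU: 2x SIMD32, wave32; one instruction per wave per cycle.

def waves_per_simd(work_items: int, wave_size: int, simds_per_cu: int) -> float:
    waves = -(-work_items // wave_size)  # ceiling division
    return waves / simds_per_cu

for work_items in (64, 128, 256, 1024):
    gcn = waves_per_simd(work_items, wave_size=64, simds_per_cu=4)
    rdna = waves_per_simd(work_items, wave_size=32, simds_per_cu=2)
    print(f"{work_items:5d} items -> GCN {gcn:5.2f} waves/SIMD16, "
          f"RDNA {rdna:5.2f} waves/SIMD32")

# 64 items:  GCN 0.25 (3 of 4 SIMDs idle)  vs RDNA 1.00
# 256 items: GCN 1.00 (no latency hiding)  vs RDNA 4.00
```

On "millions of work items" per draw call: the dispatch is spread over the whole chip, so the ~4096 lanes of a Vega 64 (64 CUs x 4 SIMD16s, each wanting several resident wave64s to hide latency) need well over 100,000 items in flight before every SIMD is properly occupied; small draws or short-lived passes fall well short of that.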

EDIT: I summarized the data about FLOPS, area and TDP in the following table:

Area for MI250X is estimated at 2xMI100 as there are no official data.
 

Attachments

  • CDNA_RDNA.jpg (table of FLOPS, area, and TDP)
Can someone remind me why GCN is better for HPC than RDNA?
HPC/ML workloads are essentially cache/shmem/TLP benchmarks, which GCN is/was really good at overall (rough numbers at the end of this post).
the latest Vega derivatives going at >2GHz speed in a low power budget
Those are very stretched clocks that net very little real-world(tm) performance.
And that's at ~30W in Time Spy.
But doesn’t the underutilization problem in games also affect HPC workloads?
Easier to dance around that in HPC.
Well, to be honest, some modern problems throw a silly small number of waves per SIMD/SM partition even in CUDA land.
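To put rough numbers on the cache/shmem point: a roofline-style calculation, using MI100's public peak figures (11.5 FP64 TFLOPS, ~1.2 TB/s HBM2) and textbook arithmetic-intensity estimates for a streaming kernel and a tiled DGEMM; all of this is illustrative, not measured:

```python
# Machine balance: FLOPs a kernel must do per byte of HBM traffic
# before the ALUs, not memory bandwidth, become the bottleneck.
peak_fp64_tflops = 11.5   # MI100 peak FP64 (public spec)
hbm_bw_tbps = 1.2         # MI100 HBM2 bandwidth (public spec, ~1.23)

balance = peak_fp64_tflops / hbm_bw_tbps   # ~9.6 FLOP/byte

# Streaming kernel (y += a*x): 2 FLOPs per 24 bytes moved -> ~0.08 FLOP/byte.
print(f"balance: {balance:.1f} FLOP/byte, axpy: {2/24:.2f} FLOP/byte")

# DGEMM blocked into LDS/cache with b x b tiles: ~2*b^3 FLOPs per
# ~3*b^2 eight-byte accesses, i.e. arithmetic intensity ~ b/12.
for b in (8, 32, 128):
    ai = b / 12
    side = "compute" if ai > balance else "bandwidth"
    print(f"tile {b:3d}: ~{ai:5.1f} FLOP/byte ({side}-bound)")
```

With big enough tiles kept in LDS/caches, HPC kernels sit on the compute side of the roofline, which is where GCN's dense ALU layout pays off.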
 
Flops per W, mainly, as this is one of the main metrics for HPC. So the comparison is 42.2 @ 600 W vs 15.5 @ 400 W (as the referenced A100 is not the PCIe version, I'll use the nominal TDP for both since we have no data about the real power draw).
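For reference, what those two data points work out to (using the nominal TDPs quoted above, not measured draw):

```python
# Perf/W from the figures quoted above (nominal TDPs, not measured power).
mi250x_gflops_per_w = 42.2e3 / 600   # ~70.3
a100_gflops_per_w = 15.5e3 / 400     # ~38.8

print(f"MI250X: {mi250x_gflops_per_w:.1f} GFLOPS/W")
print(f"A100:   {a100_gflops_per_w:.1f} GFLOPS/W")
print(f"ratio:  {mi250x_gflops_per_w / a100_gflops_per_w:.2f}x")  # ~1.82x
```

That ratio is where the ">1.8x" a couple of posts down comes from.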
It's likely more complicated. 560 W for MI250X (not 600 W) is the water-cooled version used in Frontier; the standard version is 500 W. The 400 W value is correct for the standard version of A100, but according to AnandTech some custom deployments use up-to-600 W (probably water-cooled) versions. So in some circumstances it can be 560 W for MI250X and 600 W for A100.
 

Yes, I know, as there is also the PCI-E version of A100, but that's quite probably power limited. I only wanted to show why MI250X was chosen in that case: >1.8x in perf/W in that benchmark was simply too big to ignore.
 
Yes, I know, as there is also the PCI-E version of A100 but that's quite probably power limited
The 250W PCI-E A100 definitely has a frequency/voltage curve tuned towards Max-Q; it's not that far away from the A100 SXM.
The 500W MI250 is not pushed on clocks either, since there are 2 GPUs. I wonder why transistor density per area is so low in MI250 (aside from the lack of SRAM); it looks like a tradeoff in physical design towards higher clocks.
The 500W MI250 should be around 1.5x the FP64 flops/watt in the DP HPL benchmark compared with 500W of 2x PCI-E A100 with the latest SW, and there are still 18% on the table with the full GA100.
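The "18% on the table" follows directly from the SM counts (the full GA100 die has 128 SMs; the shipping A100 enables 108, both public figures):

```python
# Headroom in GA100: the shipping A100 enables 108 of the die's 128 SMs.
full_ga100_sms = 128
a100_sms = 108

print(f"{full_ga100_sms / a100_sms - 1:.1%}")   # ~18.5%
```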
 