AMD CDNA Discussion Thread

Discussion in 'Architecture and Products' started by Frenetic Pony, Nov 16, 2020.

  1. Bondrewd

    Veteran Newcomer

    Joined:
    Sep 16, 2017
    Messages:
    1,644
    Likes Received:
    812
    But it's not slower!
    Frontier part has ~1700MHz fmax.
     
  2. Leoneazzurro5

    Regular Newcomer

    Joined:
    Aug 18, 2020
    Messages:
    324
    Likes Received:
    335
    I know that, i.e. it is certainly not slower than MI100. But I have the feeling that, from a process point of view, it could clock even higher (we have seen the latest Vega derivatives running at >2GHz in a low power budget, CDNA1 was a Vega derivative, and CDNA2 is an evolution of that). But of course when designing a product you need to balance performance, power, cost, and so on.
     
  3. trinibwoy

    trinibwoy Meh
    Legend

    Joined:
    Mar 17, 2004
    Messages:
    11,823
    Likes Received:
    2,790
    Location:
    New York
    Can someone remind me why GCN is better for HPC than RDNA? Better flops per mm^2?
     
  4. Leoneazzurro5

    Regular Newcomer

    Joined:
    Aug 18, 2020
    Messages:
    324
    Likes Received:
    335
    Quite probably that, and the much better FP64 support since Radeon VII. When Vega was out, there was a long debate about it being more a compute oriented architecture rather than a gaming architecture.
    Edit: I've found this interesting article about the differences between these two architectures.

    https://www.hardwaretimes.com/difference-between-amd-rdna-vs-gcn-gpu-architectures/
     
    #364 Leoneazzurro5, Nov 25, 2021
    Last edited: Nov 25, 2021
    Krteq and trinibwoy like this.
  5. trinibwoy

    trinibwoy Meh
    Legend

    Joined:
    Mar 17, 2004
    Messages:
    11,823
    Likes Received:
    2,790
    Location:
    New York
    Thanks. GCN has more compute resources per scheduler which explains the density advantage in terms of peak flops. But doesn’t the underutilization problem in games also affect HPC workloads?

    I also don’t understand this part of the article. What’s an example of a short work queue that wouldn’t saturate GCN but would saturate RDNA with the same total number of ALUs? Don't most draw calls generate millions of work items?

    “Each Compute Unit would also work on four 64-item waves. The reason why this wasn’t very effective is that most games use shorter work queues due to which only one or two out of the four wavefronts were saturated per execution cycle.”
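    One way to picture the article's claim is a toy lane-occupancy model. This is only a sketch under simplifying assumptions, not how either architecture actually schedules work: it assumes a GCN CU with four SIMD16 units running wave64 (one instruction issued over four cycles) and an RDNA CU with two SIMD32 units running wave32 (one instruction per cycle), and it ignores latency hiding, multiple waves in flight per SIMD, and everything outside a single CU. The function name and parameters are made up for illustration.

```python
from math import ceil

def lane_utilization(work_items, simds, simd_width, wave_size):
    """Fraction of ALU lane-cycles doing useful work for one instruction
    executed by every work item on a single CU (toy model)."""
    waves = ceil(work_items / wave_size)       # wavefronts needed
    cycles_per_wave = wave_size // simd_width  # issue cycles per wave instruction
    rounds = ceil(waves / simds)               # waves spread evenly across SIMDs
    cycles = rounds * cycles_per_wave
    lane_cycles_available = cycles * simds * simd_width
    return work_items / lane_cycles_available

# Both CUs have 64 ALU lanes in total.
for n in (32, 64, 128, 256):
    gcn = lane_utilization(n, simds=4, simd_width=16, wave_size=64)
    rdna = lane_utilization(n, simds=2, simd_width=32, wave_size=32)
    print(f"{n:4d} work items: GCN {gcn:.2f}, RDNA {rdna:.2f}")
```

    In this model a CU that only receives 64 work items keeps a quarter of its lanes busy on GCN but all of them on RDNA, and GCN needs four full waves per CU to catch up. Whether real draw calls ever leave a CU that short of work is exactly the open question here.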
     
  6. Leoneazzurro5

    Regular Newcomer

    Joined:
    Aug 18, 2020
    Messages:
    324
    Likes Received:
    335
    I am not a programming expert, but if I understand correctly, both HPC and gaming workloads involve massively parallel calculations. Games, however, involve frequent context switching and branching, while the behavior of compute workloads is much less complicated. Compute workloads therefore achieve better utilization than games on the GCN/Vega architecture, which was designed to maximize peak throughput. RDNA, which was optimized for latency, is less sensitive to the workload, but FLOPS/mm^2 is considerably higher on Vega.

    EDIT: I summarized the data about FLOPS, area and TDP in the following table:

    Area for MI250X is estimated at 2xMI100 as there are no official data.
     

    Attached Files:

    #366 Leoneazzurro5, Nov 25, 2021
    Last edited: Nov 25, 2021
    nnunn and Lightman like this.
  7. Bondrewd

    Veteran Newcomer

    Joined:
    Sep 16, 2017
    Messages:
    1,644
    Likes Received:
    812
    HPC/ML workloads are cache/shmem/TLP benchmarks which GCN is/was really good at overall.
    Those are stretchy as balls clocks that net not much real world(tm) performance.
    And that's like at ~30W in timespy.
    Easier to dance around that.
    Well to be honest some modern problems throw a silly small number of waves per SIMD/SM partition overall in CUDA land.
     
  8. no-X

    Veteran

    Joined:
    May 28, 2005
    Messages:
    2,422
    Likes Received:
    445
    It's likely more complicated. The 560 W figure for MI250X (not 600 W) relates to the water-cooled version used in Frontier; the standard version is 500 W. The 400 W value is correct for the standard version of A100, but according to Anandtech some custom deployments use versions of up to 600 W (probably water-cooled). So in some circumstances it can be 560 W for MI250X and 600 W for A100.
     
    #368 no-X, Nov 25, 2021
    Last edited: Nov 25, 2021
    Lightman likes this.
  9. Leoneazzurro5

    Regular Newcomer

    Joined:
    Aug 18, 2020
    Messages:
    324
    Likes Received:
    335
    Yes, I know, and there is also the PCI-E version of A100, but that's quite probably power limited. I only wanted to show why MI250X was chosen in that case: >1.8x in perf/W in that benchmark was simply too big to ignore.
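    To see how much the 600 W vs 560 W point raised above moves the perf/W comparison, here is a small sketch. It assumes the 1.8x lower bound was computed with 600 W for MI250X (as the correction implies) and that the measured performance itself does not change when the power figure is corrected. The helper name is hypothetical.

```python
def rescale_perf_per_watt(ratio, old_watts, new_watts):
    """Rescale a perf/W ratio when the power figure of the part in the
    numerator changes and its measured performance is assumed unchanged."""
    return ratio * old_watts / new_watts

baseline = 1.8  # lower bound quoted in the thread, assumed to use 600 W
corrected = rescale_perf_per_watt(baseline, old_watts=600, new_watts=560)
print(f"perf/W advantage with 560 W: {corrected:.2f}x")
```

    Under those assumptions the advantage rises to about 1.93x. The 600 W A100 case cannot be rescaled the same way, since a higher power limit would also change its performance.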
     
  10. Bondrewd

    Veteran Newcomer

    Joined:
    Sep 16, 2017
    Messages:
    1,644
    Likes Received:
    812
    You can get an aircooled 500W A100 80GB but caveats inbound (same as 500W MI250(X) anyhow).
     
  11. OlegSH

    Regular Newcomer

    Joined:
    Jan 10, 2010
    Messages:
    762
    Likes Received:
    1,498
    250W PCI-E A100 definitely has frequency/voltage curve tuned towards Max-Q, it's not that far away from A100 SXM.
    500W MI250 is not pushed on clocks either since there are 2 GPUs. Wonder why transistor density per area is so low in MI250 (aside from lack of sram), looks like a tradeoff in physical design towards higher clocks.
    500W MI250 should be around 1.5x FP64 flops/watt in the DP HPL benchmark in comparison with 500W 2x PCI-E A100 with latest SW and there are still 18% on the table with full GA100.
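    The 18% figure presumably comes from SM counts. A quick check, assuming the publicly listed 108 enabled SMs on A100 against 128 on a full GA100 die, with clocks and everything else held equal:

```python
A100_ENABLED_SMS = 108  # assumption: shipping A100 configuration
GA100_FULL_SMS = 128    # assumption: fully enabled GA100 die

def headroom_percent(enabled, full):
    """Extra throughput from enabling all units, in percent, at equal clocks."""
    return (full / enabled - 1.0) * 100.0

print(f"{headroom_percent(A100_ENABLED_SMS, GA100_FULL_SMS):.1f}% more SMs on full GA100")
```

    That gives roughly 18.5%, in line with the number above.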
     
    #371 OlegSH, Nov 26, 2021
    Last edited: Nov 26, 2021
  12. Granath

    Newcomer

    Joined:
    Jul 26, 2021
    Messages:
    47
    Likes Received:
    37
  13. Bondrewd

    Veteran Newcomer

    Joined:
    Sep 16, 2017
    Messages:
    1,644
    Likes Received:
    812
     
    Krteq and Lightman like this.
  14. Samwell

    Newcomer

    Joined:
    Dec 23, 2011
    Messages:
    141
    Likes Received:
    169
    Nothing special, just what you'd expect. BabelStream measures memory performance: MI100 has 1.2 TB/s, MI210 has 1.6 TB/s. So you'd expect roughly a third more speed for MI210.
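    As a sanity check, taking the two peak bandwidth figures quoted here at face value and assuming the benchmark is purely bandwidth-bound (so runtime scales as bytes moved divided by bandwidth), the expected uplift works out to about a third:

```python
def expected_uplift(old_bw_tb_s, new_bw_tb_s):
    """Relative speedup of a purely bandwidth-bound kernel."""
    return new_bw_tb_s / old_bw_tb_s - 1.0

mi100_bw = 1.2  # TB/s, as quoted in the thread
mi210_bw = 1.6  # TB/s, as quoted in the thread
print(f"expected uplift: {expected_uplift(mi100_bw, mi210_bw) * 100:.0f}%")
```

    Achieved bandwidth is normally below peak on both parts, so measured ratios can land a little either side of this.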
     
    DegustatoR likes this.
  15. CarstenS

    Legend Veteran Subscriber

    Joined:
    May 31, 2002
    Messages:
    5,701
    Likes Received:
    3,746
    Location:
    Germany
    #375 CarstenS, Dec 4, 2021 at 12:39 PM
    Last edited: Dec 4, 2021 at 4:49 PM
    Lightman, Jensen Krage and Krteq like this.