AMD CDNA Discussion Thread

Discussion in 'Architecture and Products' started by Frenetic Pony, Nov 16, 2020.

  1. Bondrewd

    Veteran

    Joined:
    Sep 16, 2017
    Messages:
    1,682
    Likes Received:
    846
    Yes.
    It depends.
    128 gigs or so.
     
  2. Krteq

    Newcomer

    Joined:
    May 5, 2020
    Messages:
    148
    Likes Received:
    261
Some Aldebaran bits from a recent Linux kernel patchset => 2 dies + 128GB HBM2(e) confirmed

    Code:
    On newer heterogeneous systems from AMD with GPU nodes connected via
    xGMI links to the CPUs, the GPU dies are interfaced with HBM2 memory.
    
    This patchset applies on top of the following series by Yazen Ghannam
    AMD MCA Address Translation Updates
   [https://patchwork.kernel.org/project/linux-edac/list/?series=505989]
    
    This patchset does the following
    1. Add support for northbridges on Aldebaran
       * x86/amd_nb: Add Aldebaran device to PCI IDs
       * x86/amd_nb: Add support for northbridges on Aldebaran
    2. Add HBM memory type in EDAC
       * EDAC/mc: Add new HBM2 memory type
    3. Modifies the amd64_edac module to
       a. Handle the UMCs on the noncpu nodes,
       * EDAC/mce_amd: extract node id from InstanceHi in IPID
       b. Enumerate HBM memory and add address translation
       * EDAC/amd64: Enumerate memory on noncpu nodes
       c. Address translation on Data Fabric version 3.5.
       * EDAC/amd64: Add address translation support for DF3.5
       * EDAC/amd64: Add fixed UMC to CS mapping
    
    
    Aldebaran has 2 Dies (enumerated as a MCx, x= 8 ~ 15)
      Each Die has 4 UMCs (enumerated as csrowx, x=0~3)
      Each die has 2 root ports, with 4 misc port for each root.
      Each UMC manages 8 UMC channels each connected to 2GB of HBM memory.
    
    Muralidhara M K (3):
      x86/amd_nb: Add Aldebaran device to PCI IDs
      x86/amd_nb: Add support for northbridges on Aldebaran
      EDAC/amd64: Add address translation support for DF3.5
    
    Naveen Krishna Chatradhi (3):
      EDAC/mc: Add new HBM2 memory type
      EDAC/mce_amd: extract node id from InstanceHi in IPID
      EDAC/amd64: Enumerate memory on noncpu nodes
    
    Yazen Ghannam (1):
      EDAC/amd64: Add fixed UMC to CS mapping
    
    So... yes, @Bondrewd was right: 2 dies × 4 UMCs per die × 8 channels per UMC × 2GB per channel = 128GB HBM2(e)
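    The topology quoted from the patchset multiplies out to exactly the rumoured capacity; a quick sanity check (constants taken from the quoted commit text, names are illustrative):

    ```python
    # Aldebaran memory topology as described in the kernel patchset.
    DIES = 2
    UMCS_PER_DIE = 4
    CHANNELS_PER_UMC = 8
    GB_PER_CHANNEL = 2

    total_gb = DIES * UMCS_PER_DIE * CHANNELS_PER_UMC * GB_PER_CHANNEL
    print(total_gb)  # 128
    ```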
     
    #62 Krteq, Jul 1, 2021
    Last edited: Jul 1, 2021
  3. So they never disable any stacks, out of the 8 stacks on the PCB?
    Nvidia always disables one stack out of six on every A100 GPU. Could AMD be that much more confident in their HBM2e yields?

    Perhaps what @Bondrewd meant with the "more or less" part is that some SKUs may come with one or more stacks disabled.
     
    Lightman likes this.
  4. Bondrewd

    Veteran

    Joined:
    Sep 16, 2017
    Messages:
    1,682
    Likes Received:
    846
    Oh yeah and LUMI (the pre-exascale Euro system) is a Frontier copy with 1/3rd the node count.
    For 1/3rd the money that is.
    Nope.
    Volumes.
    NV also bins relatively more % for A100 versus MI100.
    Tbh it's a question of AMD selling MI200 with half-sized stacks.
    128GB of HBM is a very pricey endeavour to casually yeet into the wider market.
     
  5. Krteq

    Newcomer

    Joined:
    May 5, 2020
    Messages:
    148
    Likes Received:
    261
    Some more info from Locuza


     
  6. So it's a total of 256 CUs capable of full-rate FP64 / packed FP32?

    We're looking at 256 CUs * 64 ALUs each * 2 (multiply + add) * 2 (FP32 RPM) = ~100 TFLOPs FP32, or ~200 TFLOPs FP16, if clocked at 1.5GHz.

    I wonder how this would fare in a software renderer.
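    Spelling out the arithmetic from the post above (all the inputs are the poster's speculation, not confirmed specs):

    ```python
    # Theoretical throughput from the post's assumptions: 256 CUs,
    # 64 ALUs/CU, FMA counted as 2 ops, 2x packed FP32, 1.5 GHz clock.
    cus = 256
    alus_per_cu = 64
    fma_ops = 2          # one fused multiply-add counts as two FLOPs
    packed_fp32 = 2      # dual-issue packed FP32 ("RPM")
    clock_hz = 1.5e9

    fp32_tflops = cus * alus_per_cu * fma_ops * packed_fp32 * clock_hz / 1e12
    fp16_tflops = fp32_tflops * 2  # FP16 at twice the packed-FP32 rate
    print(round(fp32_tflops, 1), round(fp16_tflops, 1))  # 98.3 196.6
    ```

    So "~100 TFLOPs" is a slight round-up of 98.3, which matches Bondrewd's "a bit less but yea" reply below.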
     
    #66 Deleted member 13524, Jul 1, 2021
    Last edited by a moderator: Jul 1, 2021
    Lightman likes this.
  7. Bondrewd

    Veteran

    Joined:
    Sep 16, 2017
    Messages:
    1,682
    Likes Received:
    846
    A bit less but yea.
     
    Lightman likes this.
  8. Newguy

    Regular

    Joined:
    Nov 10, 2014
    Messages:
    263
    Likes Received:
    122
    They also say BF16 rates are doubled to match FP16, which is nice (roughly 4x faster in total with the doubled CUs). However, there's no faster int8/int4 and no sparsity, which is interesting; I presume both will stick with FP16 rates as before. If they're going all-in on FP64, I guess it makes sense: HPC-focused, not AI/ML:




    No chance of a 1731MHz boost-clock "AMD edition" for 113.4 TFLOPs? A = 1, M = 13, D = 4. Maybe that's reserved for the top-tier RDNA3 GPU, assuming that has double-rate FP32 too: 160 CUs, so 2769MHz.
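    The numerology joke checks out for both hypothetical configurations (neither is a real product; both use the same per-CU assumptions as the earlier TFLOPs estimate):

    ```python
    # Verify the "113.4 TFLOPs" (A=1, M=13, D=4) clock speeds in the post.
    def tflops(cus, clock_mhz, alus=64, fma=2, packed=2):
        """Peak TFLOPs for a given CU count and clock, assuming
        64 ALUs/CU, FMA = 2 ops, and 2x packed/dual-issue FP32."""
        return cus * alus * fma * packed * clock_mhz * 1e6 / 1e12

    print(round(tflops(256, 1731), 1))  # 113.4  (MI200-style, 256 CUs)
    print(round(tflops(160, 2769), 1))  # 113.4  (RDNA3-style, 160 CUs)
    ```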
     
    Lightman and BRiT like this.
  9. If the Xilinx acquisition goes through (it looks like it may; only China's approval is pending), they might end up not doing exotic ML ops at all in the near future.
    They have a bunch of patents for what look like far more advanced implementations of Xilinx's ACAP concept.
    Several of these patents were, as far as I can see, funded by the DoE's PathForward/FastForward 1/2 programmes.
    Some new startups have been using Xilinx Versal for dramatic speedups in ML:
    https://www.zdnet.com/article/xilin...c-speed-up-of-neural-nets-versus-nvidia-gpus/

    A ROCm unified programming model for FPGAs/GPUs/CPUs was already demonstrated last year:
    https://forums.xilinx.com/t5/Xilinx...onverged-ROCm-Runtime-Technology/ba-p/1175091

    I think there is some possibility they will use this approach.
    Lisa's messaging also seems to point in this direction, and has been consistent lately.
    [image attachment: upload_2021-7-1_22-7-8.png]

     
    Ethatron likes this.
  10. Bondrewd

    Veteran

    Joined:
    Sep 16, 2017
    Messages:
    1,682
    Likes Received:
    846
  11. Bondrewd

    Veteran

    Joined:
    Sep 16, 2017
    Messages:
    1,682
    Likes Received:
    846
    One can also count the FP64 TF number per board for lulz.
    Okay, should.
     
  12. [image attachment: upload_2021-7-5_23-20-21.png]
    200,000 Milan cores @ 2.5TF per 64 cores = 7.8PF of DP FP from the 200K+ Milan cores, plus 42.2PF from 750 MI200s.
    Each MI200 can do ~56TF DP FP; that is 5x MI100.
    MI100 does half-rate FP64 ops, and LLVM indicates MI200 can do full-rate FP64 ops.
    With a 1.2x clock boost over MI100, full-rate FP64 and 2x dies, an MI200 at ~1.8GHz can put out 56TF DP FP.
    Sounds formidable.

    def FeatureISAVersion9_0_8 : FeatureSet<
    [FeatureGFX9,
    HalfRate64Ops,
    FeatureFmaMixInsts,
    FeatureLDSBankCount32,
    FeatureDsSrc2Insts,
    FeatureExtendedImageInsts,
    FeatureMadMacF32Insts,
    FeatureDLInsts,
    FeatureDot1Insts,
    FeatureDot2Insts,
    FeatureDot3Insts,
    FeatureDot4Insts,
    FeatureDot5Insts,
    FeatureDot6Insts,
    FeatureDot7Insts,
    FeatureMAIInsts,
    FeaturePkFmacF16Inst,
    FeatureAtomicFaddInsts,
    FeatureSupportsSRAMECC,
    FeatureMFMAInlineLiteralBug,
    FeatureImageGather4D16Bug]>;

    def FeatureISAVersion9_0_A : FeatureSet<
    [FeatureGFX9,
    FeatureGFX90AInsts,
    FeatureFmaMixInsts,
    FeatureLDSBankCount32,
    FeatureDLInsts,
    FeatureDot1Insts,
    FeatureDot2Insts,
    FeatureDot3Insts,
    FeatureDot4Insts,
    FeatureDot5Insts,
    FeatureDot6Insts,
    FeatureDot7Insts,
    Feature64BitDPP,
    FeaturePackedFP32Ops,
    FeatureMAIInsts,
    FeaturePkFmacF16Inst,
    FeatureAtomicFaddInsts,
    FeatureMadMacF32Insts,
    FeatureSupportsSRAMECC,
    FeaturePackedTID,
    FullRate64Ops]>;
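    The back-of-envelope derivation above can be reproduced from MI100's published 11.5 TF FP64 peak; the 1.2x clock, full-rate FP64 and 2-die multipliers are the post's assumptions, not confirmed specs:

    ```python
    # MI200 FP64 estimate: MI100 peak, scaled by the post's assumptions.
    mi100_fp64_tf = 11.5                          # published MI100 FP64 peak

    mi200_fp64_tf = mi100_fp64_tf * 1.2 * 2 * 2   # clock x full-rate x 2 dies
    print(round(mi200_fp64_tf, 1))                # 55.2, i.e. ~56 TF

    # Cross-check against the system-level figure: 42.2 PF over 750 boards.
    print(round(42.2e3 / 750, 1))                 # 56.3 TF per MI200
    ```

    Both routes land within a TFLOP of each other, which is why the ~56TF figure looks self-consistent.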
     
  13. Bondrewd

    Veteran

    Joined:
    Sep 16, 2017
    Messages:
    1,682
    Likes Received:
    846
    A bit less than that, but yes.
    That's what won AMD Frontier and then some.
    It's an abhorrently mean stick as far as HPC classical prowess is concerned.
    Which is why MI100 was more or less relegated to the world's most expensive devkit role.
    Was basically a toy versus what went prod only 3Q after the thing.
     
    Lightman likes this.
  14. CarstenS

    Legend Subscriber

    Joined:
    May 31, 2002
    Messages:
    5,800
    Likes Received:
    3,920
    Location:
    Germany
    There's going to be much funniness around year's end. Also in Finland.
     
    Lightman likes this.
  15. del42sa

    Newcomer

    Joined:
    Jun 29, 2017
    Messages:
    208
    Likes Received:
    137
    Any info about Aldebaran specs?
     
    Lightman likes this.
  16. Bondrewd

    Veteran

    Joined:
    Sep 16, 2017
    Messages:
    1,682
    Likes Received:
    846
    Read ze thread
     
  17. xpea

    Regular

    Joined:
    Jun 4, 2013
    Messages:
    551
    Likes Received:
    780
    Location:
    EU-China
    It's nice to see AMD coming back and pushing performance forward. Since Lisa Su took over, they have had a very pragmatic approach, always going for the market segments where they can get good ROI. We see it on CPUs, where EPYC gets priority over APUs and Ryzen. On GPUs, they've gone for the high end since RDNA2.
    Because they are so supply-constrained at TSMC, it's the right approach to quickly get cash flow back and increase their R&D expenditure, as their last financial report shows. Ironically, TSMC is both the biggest reason for their success (as Intel struggles with its nodes) and their biggest limiting factor.
    Regarding CDNA2, they are clearly betting on the last generation of the old-school HPC market, where FP64 is still relevant, before everything moves to lower-precision AI/ML workflows. The timing is good, with CDNA2 coming a few quarters before Hopper and grabbing as many supercomputer deals as they can. Obviously, the key to this market opportunity is that CDNA2 doesn't need any software or ecosystem investment; it's a basic FP64 brute-force approach.
    The problem is that when Grace-Hopper-BlueField-3 systems arrive, the party will be over, and AMD won't have any choice other than getting their AI/ML stack up to the task. Even traditional FP64 markets like weather and fluid simulation are transitioning to AI/ML (it's f#cking time; this brute-force FP64 approach is a never-ending story, as you never have enough compute power to simulate the quintillions of particles in our atmosphere...).
    The interesting part of the story is what happens next. CDNA2 is a one-time shot. Nvidia will have MCM solutions too, and a much stronger AI/ML hardware and software stack...
     
    DavidGraham and Lightman like this.
  18. Bondrewd

    Veteran

    Joined:
    Sep 16, 2017
    Messages:
    1,682
    Likes Received:
    846
    Cringe again.
    Even nV hweng org dudes shudder when they hear about MI200 then MI300.
    Sad!
    Not really.
    But FP32?
    Hell yeah.
    Of course it does.
    Thank you Intel!
    Cringe again.
    Nothing outside of hyperscale gives a shit about SmartNICs, and hyperscale builds their own!
    See EFA.
     
    #78 Bondrewd, Jul 31, 2021
    Last edited: Jul 31, 2021
    Tarkin1977 likes this.
  19. xpea

    Regular

    Joined:
    Jun 4, 2013
    Messages:
    551
    Likes Received:
    780
    Location:
    EU-China
    You clearly have no idea what you are talking about.
    1. No, Intel's AI/ML stack won't benefit AMD. At least not fully, as their AI/ML hardware is vastly different. The only (small) saving grace is the government investment in ROCm as part of the Frontier deal. It was done to avoid total CUDA dominance in the field. But frankly, a few million is not enough to make a dent in CUDA, especially considering the pathetic current state of ROCm...

    2. You should open your eyes and see how Nvidia operates when launching a new AI accelerator. They don't sell a GPU; they sell exclusively DGX systems for the first 6 months, and then their HGX reference platform for another few months. Finally, one year later, they open up the market with SXM/PCIe cards. In other words, all customers will have to buy Grace-Hopper-BF3 systems to get their hands on next-gen AI/ML performance. And the thing is, customers are already lining up for the new NVDA platform and are already porting their hypervisors to BlueField-3 (Google, MS, FB, Baidu and Tencent are already at work). In fact, NVDA's datacenter business is expected to be more than twice as big with the Grace-Hopper-BF3 generation as it was with Ampere, when Nvidia was only selling the GPU with a bit of Mellanox SmartNIC (it was not a DPU yet) and relying on Intel/AMD for the CPU.

    And that's the interesting point. It's no longer GPU vs GPU or CPU vs CPU. It's one [hardware + software] platform against another. With the Grace CPU and BlueField DPU, NVDA finally has a complete platform, and an extremely performant one, against AMD and Intel. In fact, Grace-Hopper-BF3 + Nvidia's software ecosystem (hypervisor + CUDA) has no equivalent, and it's a huge selling point, something that AMD can only dream about...
     
  20. Bondrewd

    Veteran

    Joined:
    Sep 16, 2017
    Messages:
    1,682
    Likes Received:
    846
    It's SYCL
    They're just barely programmable matrix engines; all stuff relevant is abstracted away.
    Oh jeez Codeplay is literally writing a Level Zero backend for gfx9/you name it.
    Tell me something I don't know.
    The opposite.
    HGX ships first to super 8, then DGX, then ODM HGX units.
    ?
    They will continue to use what they want.
    Super 8 has infinite leverage over any IHV house.
    Irrelevant.
    They'll cook their own much like AMZN did with Nitro + EFA.
    OH YES IT IS.
    Just in a bitta different way.
    Oh jeez that's peak LARP.
     