Recent content by Xmas

  1. NVIDIA discussion [2024]

    Surely that means both have more than 30k, otherwise X would be 4th.
  2. Hardware implementation of threading models in contemporary GPUs

    Is this not true pre-wgmma? mma/wmma still take several cycles to complete. Or is the difference here that wgmma is explicitly async?
  3. GTC 2024

    I'll be going in person this year. Anyone else?
  4. NVIDIA discussion [2024]

    Thanks for all the detail Arun, I really appreciate it! It's an extra cost, sure, but the cost of the operation itself is necessary complexity. And if the prologue is more complex the cost of moving data gets amortised. One extra move from shared mem to registers and back doesn't strike me as...
  5. NVIDIA discussion [2024]

    Could you expand on this a bit? Are those prologues for decompression, including things like sparse matrix expansion, non-native datatypes, shared exponents/scale factors? I'd imagine that if a prologue just performed element-wise ops, like epilogues usually do (which can be frustratingly...
  6. Apple is an existential threat to the PC

    I can believe that in some cases, but I can absolutely max out the FP16 ALU on M1/M2 without being limited by occupancy, memory, or power. Given the unification of register file and threadgroup memory in M3, occupancy should be much less of a problem, and M series GPUs have way more memory...
  7. Apple is an existential threat to the PC

    What is curious is that they're saying on the one hand to prefer using FP16 wherever possible (to reduce bandwidth/space requirements) but there is no mention of the FP32 pipeline being able to perform FP16 operations, only that "conversion is free". Thus you might get peak throughput only with...
  8. Questioning the 8.6 Tflops figure for the ASUS Rog Ally Z1 Extreme version

    https://www.amd.com/system/files/TechDocs/rdna3-shader-instruction-set-architecture-feb-2023_0.pdf This lists various packed 2x16bit instructions, including V_PK_FMA_F16 (fused multiply-add), which would suggest the usual support for double-rate (packed) FP16.
  9. AMD: RDNA 3 Speculation, Rumours and Discussion

    At 4K resolution, 1 GiB is ~129 B per pixel. Considering most textures will be compressed, that strikes me as a lot of texture data. Even with relatively high average anisotropy and overdraw factored in, that's a fair few layers to store various material properties. But only a fraction of texture...
  10. AMD: RDNA 3 Speculation, Rumours and Discussion

    I'm not sure I understand that last sentence. A single kernel running across the entire chip is common for compute workloads, yet from the perspective of the cores it doesn't matter as they usually don't have to communicate with each other. The problem with graphics workloads is the graphics...
  11. Electric Vehicle Thread!

    But if they charge for EV charging it's ok? So if they charge $0.10 per car using an honesty box mounted on the charger ...
  12. AMD: Zen 4, Speculation, Rumours and Discussion

    So the "ISA extensions for AI" are specifically VNNI and BF16, or is there more?
  13. AMD: RDNA 3 Speculation, Rumours and Discussion

    On a block level, yes. That's not necessarily the same as the register bandwidth an equivalent sequence of FMA instructions would use. I'm not sure many people are aware here, but Apple's M1 GPU does have simdgroup based matrix multiply_accumulate/load/store instructions (using 8x8 tiles for...
  14. XenForo 2 Problems and Feature Requests

    This will take some getting used to, but I'm sure it won't take long. Only the dark theme text strikes me as painfully bright/contrasty, and I don't think it was like that before. Can we have that toned down a little, please?
  15. Intel ARC GPUs, Xe Architecture for dGPUs [2018-2022]

    I'm sure every IHV shader compiler is already capable of eliminating code that doesn't contribute to any output, that's a basic optimisation. So if you disable all the other outputs (just as if you commented them out in the source) the compiler will do the right thing, no further complexity...
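
The per-pixel figure quoted in item 9 (1 GiB spread over a 4k frame) can be checked with a short calculation. This is only a sketch: it assumes "4k" means 3840 x 2160 (UHD), which the post does not state, and the function name `bytes_per_pixel` is made up for illustration.

```python
# Assumption: "4k" is taken to be UHD, 3840 x 2160; the post does not specify.
WIDTH, HEIGHT = 3840, 2160
GIB = 1 << 30  # number of bytes in one GiB (2**30)


def bytes_per_pixel(total_bytes: int, width: int, height: int) -> float:
    """Return how many bytes of a memory budget fall on each screen pixel."""
    return total_bytes / (width * height)


bpp = bytes_per_pixel(GIB, WIDTH, HEIGHT)
print(f"{bpp:.1f} bytes per pixel")
```

Under that assumption the result lands between 129 and 130 bytes per pixel, in line with the "~129" in the post. A DCI 4K frame (4096 x 2160) would give a somewhat smaller figure.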