Recent content by Xmas

NVIDIA discussion [2024]

Surely that means both have more than 30k, otherwise X would be 4th.
- Xmas
- Post #285
- Apr 9, 2024
- Forum: Graphics and Semiconductor Industry
Hardware implementation of threading models in contemporary GPUs

Is this not true pre-wgmma? mma/wmma still take several cycles to complete. Or is the difference here that wgmma is explicitly async?
- Xmas
- Post #15
- Apr 2, 2024
- Forum: Architecture and Products
GTC 2024

I'll be going in person this year. Anyone else?
- Xmas
- Post #3
- Mar 16, 2024
- Forum: Graphics and Semiconductor Industry
NVIDIA discussion [2024]

Thanks for all the detail Arun, I really appreciate it! It's an extra cost, sure, but the cost of the operation itself is necessary complexity. And if the prologue is more complex the cost of moving data gets amortised. One extra move from shared mem to registers and back doesn't strike me as...
- Xmas
- Post #132
- Feb 26, 2024
- Forum: Graphics and Semiconductor Industry
NVIDIA discussion [2024]

Could you expand on this a bit? Are those prologues for decompression, including things like sparse matrix expansion, non-native datatypes, shared exponents/scale factors? I'd imagine that if a prologue just performed element-wise ops, like epilogues usually do (which can be frustratingly...
- Xmas
- Post #102
- Feb 20, 2024
- Forum: Graphics and Semiconductor Industry
Apple is an existential threat to the PC

I can believe that in some cases, but I can absolutely max out the FP16 ALU on M1/M2 without being limited by occupancy, memory, or power. Given the unification of register file and threadgroup memory in M3, occupancy should be much less of a problem, and M series GPUs have way more memory...
- Xmas
- Post #1,298
- Nov 23, 2023
- Forum: PC Industry
Apple is an existential threat to the PC

What is curious is that they're saying on the one hand to prefer using FP16 wherever possible (to reduce bandwidth/space requirements) but there is no mention of the FP32 pipeline being able to perform FP16 operations, only that "conversion is free". Thus you might get peak throughput only with...
- Xmas
- Post #1,296
- Nov 22, 2023
- Forum: PC Industry
Questioning the 8.6 Tflops figure for the ASUS Rog Ally Z1 Extreme version

https://www.amd.com/system/files/TechDocs/rdna3-shader-instruction-set-architecture-feb-2023_0.pdf This lists various packed 2x16bit instructions, including V_PK_FMA_F16 (fused multiply-add), which would suggest the usual support for double-rate (packed) FP16.
- Xmas
- Post #8
- May 15, 2023
- Forum: Mobile Graphics Architectures and IP
AMD: RDNA 3 Speculation, Rumours and Discussion

At 4k resolution 1 GiB is ~129B per pixel. Considering most textures will be compressed that strikes me as a lot of texture data. Even with relatively high average anisotropy and overdraw factored in, that's a fair few layers to store various material properties. But only a fraction of texture...
- Xmas
- Post #2,052
- Oct 27, 2022
- Forum: Architecture and Products
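
The "~129B per pixel" figure in the post above can be checked with a quick calculation. This is only a sketch: it assumes "4k" means 3840x2160 (UHD), which the post does not state, and the function name is made up for illustration.

```python
def bytes_per_pixel(total_bytes: int, width: int, height: int) -> float:
    """Return how many bytes of a memory budget are available per screen pixel."""
    return total_bytes / (width * height)


GIB = 2 ** 30  # 1 GiB in bytes

# Assumption: "4k" is taken to be UHD, 3840x2160 = 8,294,400 pixels.
uhd = bytes_per_pixel(GIB, 3840, 2160)
print(f"{uhd:.1f} bytes per pixel")
```

With DCI 4K (4096x2160) instead, the same budget works out to roughly 121 bytes per pixel, so the quoted figure is consistent with the UHD assumption.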
AMD: RDNA 3 Speculation, Rumours and Discussion

I'm not sure I understand that last sentence. A single kernel running across the entire chip is common for compute workloads, yet from the perspective of the cores it doesn't matter as they usually don't have to communicate with each other. The problem with graphics workloads is the graphics...
- Xmas
- Post #1,445
- Jul 25, 2022
- Forum: Architecture and Products
Electric Vehicle Thread!

But if they charge for EV charging it's ok? So if they charge $0.10 per car using an honesty box mounted on the charger ...
- Xmas
- Post #205
- Jul 18, 2022
- Forum: General Discussion
AMD: Zen 4, Speculation, Rumours and Discussion

So the "ISA extensions for AI" are specifically VNNI and BF16, or is there more?
- Xmas
- Post #15
- Jul 12, 2022
- Forum: PC Industry
AMD: RDNA 3 Speculation, Rumours and Discussion

On a block level, yes. That's not necessarily the same as the register bandwidth an equivalent sequence of FMA instructions would use. I'm not sure many people are aware here, but Apple's M1 GPU does have simdgroup based matrix multiply_accumulate/load/store instructions (using 8x8 tiles for...
- Xmas
- Post #1,385
- Jul 1, 2022
- Forum: Architecture and Products
XenForo 2 Problems and Feature Requests

This will take some getting used to, but I'm sure it won't take long. Only the dark theme text strikes me as painfully bright/contrasty, and I don't think it was like that before. Can we have that toned down a little, please?
- Xmas
- Post #160
- Jun 9, 2022
- Forum: Site Feedback
Intel ARC GPUs, Xe Architecture for dGPUs [2018-2022]

I'm sure every IHV shader compiler is already capable of eliminating code that doesn't contribute to any output, that's a basic optimisation. So if you disable all the other outputs (just as if you commented them out in the source) the compiler will do the right thing, no further complexity...
- Xmas
- Post #714
- Feb 10, 2022
- Forum: Architecture and Products