Recent content by Arun

  1. Switch 2 Speculation

    How many SMs per GPC if it’s 2 GPCs? 8 like AD107 so 16 total? That seems like a lot of SMs to me, while 6 SMs per GPC feels unbalanced with probably a significant area penalty… Whether 32 ROPs even makes sense perf-wise depends on the expected memory bandwidth and L2 cache size. I would expect...
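
    A minimal back-of-the-envelope sketch of the two candidate layouts above, assuming Ada-style ratios (128 FP32 lanes per SM, 16 ROPs per GPC):

      # Rough arithmetic for the two layouts discussed above
      # (assumes Ada ratios: 128 FP32 lanes per SM, 16 ROPs per GPC).
      for sms_per_gpc in (8, 6):
          gpcs = 2
          sms = gpcs * sms_per_gpc
          print(f"{gpcs} GPCs x {sms_per_gpc} SMs/GPC = {sms} SMs, "
                f"{sms * 128} FP32 lanes, {gpcs * 16} ROPs")
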
  2. Switch 2 Speculation

    What's the current consensus on GPU specs? I was randomly thinking about this, and it feels like NVIDIA would want to optimise for "minimum work" on their end, especially physical design, so that would point towards exactly 1/3 of AD106 on TSMC N4 (1 GPC, 12 SMs, 1536 cores, 48 texture units, 16...
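
    As a sanity check on the "exactly 1/3 of AD106" numbers, a quick sketch assuming the usual Ada per-unit ratios (full AD106 being 3 GPCs x 12 SMs, with 128 FP32 lanes and 4 texture units per SM and 16 ROPs per GPC):

      # One third of AD106: 1 GPC x 12 SMs, scaled by Ada per-SM/per-GPC ratios.
      gpcs, sms_per_gpc = 1, 12
      sms = gpcs * sms_per_gpc
      print(sms, "SMs:", sms * 128, "FP32 lanes,", sms * 4, "texture units,", gpcs * 16, "ROPs")
      # -> 12 SMs: 1536 FP32 lanes, 48 texture units, 16 ROPs
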
  3. Speculation and Rumors: Nvidia Blackwell ...

    Do you mean on the CPU/Grace side? I don’t know much about that tbh, but on the GPU side it’s not like anyone else has a single address space and the bandwidth to make it very useful AFAIK. The NVL72 architecture seems strongly optimised towards training huge models: go for...
  4. Speculation and Rumors: Nvidia Blackwell ...

    Don’t think that’s fair to MI300X; any real workload will be randomly distributed across the HBM stacks, and there’s more than enough bandwidth to hit 100% HBM utilisation. The theoretical bottleneck is the L3, where you can’t use 100% of the L3 bandwidth for similar cases, but it’s still...
  5. Speculation and Rumors: Nvidia Blackwell ...

    1.5TB/s for the fastest direction on MI300, and Hopper has a larger perimeter for the relevant side, but yeah, if you're looking at "bandwidth per mm" then it looks like it might be ~2x MI300 and other competitors? Arguably what matters more here is power efficiency though, which is sadly nearly...
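
    To make the "bandwidth per mm" metric concrete, a hypothetical sketch; only the 1.5 TB/s figure comes from the post, and the edge lengths are placeholders rather than measured die dimensions:

      # Bandwidth per mm of die edge ("beachfront"); edge_mm values are placeholders.
      def gb_per_s_per_mm(link_tb_s: float, edge_mm: float) -> float:
          return link_tb_s * 1000.0 / edge_mm

      print(gb_per_s_per_mm(1.5, 25.0))  # 1.5 TB/s over a ~25 mm edge -> 60 GB/s per mm
      print(gb_per_s_per_mm(3.0, 25.0))  # a ~2x denser link over the same edge -> 120 GB/s per mm
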
  6. Speculation and Rumors: Nvidia Blackwell ...

    It makes a lot of sense if they're still going for reticle-sized dies with a large amount of SRAM; N3E yields and SRAM density are probably not ready yet, and if they're back to yearly refreshes as rumoured then no one is likely to beat them to N3E mass production by any noticeable margin either...
  7. GTC 2024

    Yep, I’m in San Jose now, happy to meet up with anyone here at some point :) Not planning to publish anything beforehand anymore, scope creep got ahead of me, could have posted a braindump but not sure that'd have been very useful (does anyone really want to hear me rant about why CUTLASS only...
  8. RDNA3 Efficiency [Spinoff from RDNA4]

    Doing 2 FMAs in only a 64-bit instruction is a bit silly and clearly a sign it was hacked on top of what was already there - still, it was presumably a decent PPA improvement given how little time they had to do it (and not as silly as SGX-XT doing up to 13 flops in 64 bits [Vec4 FMA + 3-way...
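
    The flop-density arithmetic behind that comparison, counting one FMA as 2 flops:

      # Flops packed per instruction bit, with one FMA counted as 2 flops.
      dual_fma_flops = 2 * 2      # two FMAs in one 64-bit RDNA3 encoding
      print(dual_fma_flops / 64)  # 0.0625 flops per instruction bit
      print(13 / 64)              # ~0.20 flops per bit for the SGX-XT case quoted above
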
  9. Speculation and Rumors: Nvidia Blackwell ...

    Genuinely curious about those core vs memory underclock results. I would have expected a 7B+ LLM with a batch size of 1 to be massively bandwidth limited for most kernels. The only thing I can think of is maybe the (de-)quantisation step is a lot more expensive than I'm assuming it is? Or this...
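
    A rough sketch of that expectation, with illustrative numbers (7B parameters, 4-bit weights, ~1 TB/s of DRAM bandwidth) rather than measurements:

      # Batch-1 decoding streams essentially the whole weight set once per token, so the
      # token rate is bounded by DRAM bandwidth / weight footprint; core clocks barely matter.
      params = 7e9
      bytes_per_weight = 0.5                    # 4-bit quantised weights (illustrative)
      weight_bytes = params * bytes_per_weight  # ~3.5 GB read per generated token
      dram_bw = 1.0e12                          # ~1 TB/s (illustrative)
      print("bandwidth-bound ceiling:", round(dram_bw / weight_bytes), "tokens/s")  # ~286
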
  10. NVIDIA discussion [2024]

    Not sure I agree: he's right that training is a systems problem rather than just a chips problem, which makes it more difficult technically (i.e. you need to solve 2 separate-but-related problems at the same time). And he's right that their install base is an advantage to get new models deployed...
  11. Speculation and Rumors: Nvidia Blackwell ...

    The problem across multiple threads is that while you might be right about a lot of things, you effectively claim 100% confidence for practically everything. There is absolutely no way you (or anyone, ever) can be right 100% of the time. The only way anyone can claim to be right 100% of the time is to...
  12. RDNA3 Efficiency [Spinoff from RDNA4]

    Every single voltage that has been mentioned here is wayyyyyyyy above TSMC N4 Vmin! The exact number probably depends on how you implement/power your SRAM arrays + power circuitry (and associated cost trade-offs), and I'm not really expecting AMD/NVIDIA desktop cards to run at e.g. 0.65V, but...
  13. RDNA3 Efficiency [Spinoff from RDNA4]

    That can’t possibly be true for desktop parts; I couldn't find very precise information, but online discussions on undervolting make me think the voltage is typically way above 1.0V, which is already wayyyyy above 4nm Vmin (e.g. H100 goes down to 0.675V for low frequencies while AD102 is limited to...
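
    For context on why the voltage floor matters, a first-order sketch (dynamic power scaling roughly with V^2 at a fixed clock, using only the voltages quoted above and ignoring leakage):

      # First-order CMOS dynamic power: P ~ C * f * V^2; same clock assumed, leakage ignored.
      def relative_dynamic_power(v: float, v_ref: float) -> float:
          return (v / v_ref) ** 2

      print(relative_dynamic_power(0.675, 1.0))  # 0.675V vs a ~1.0V desktop point -> ~0.46x
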
  14. AMD CDNA: MI300 & MI400 (Analysis, Speculation and Rumors in 2024)

    They claim they'll also have H100s soon, so presumably they have their own 8xH100 already. But that doesn't matter; the whole setup is weird and doesn't tell us much about other environments, unfortunately. Fundamentally though: if this article is right and AMD is selling MI300X for $15K (or...
  15. RDNA3 Efficiency [Spinoff from RDNA4]

    [Albuquerquefx] Mod Mode: Arun's post seemed like the best place to start the spin-off RDNA3 efficiency thread. I've done a little bit of creative editing of some posts to keep the flow mostly linked to RDNA3 efficiency talk and less about "outlier" things. Please keep it civil... I think it's...