Rumour is it could be a B200-like MCM of 2 x 256-bit GB203s, so they wouldn't need to design and tape out a 512-bit limited-volume chip. If NVIDIA has managed to make it work for B100, perhaps it could work for consumer graphics as well?
That's possible, but I'm not 100% sure whether NVIDIA ever actually solved making graphics work across their split-L2 GPUs. On H100, we know graphics performance is intentionally terrible (it's not clear whether they genuinely saved most of the silicon, with graphics hardware only present in a single GPC, or whether they're just doing yield recovery so they don't lose chips when the graphics parts that are still there are defective)... But A100 is a bit more mysterious to me.
The only public graphics benchmark of the A100 I could find is... *sigh* GFXBench:
- A100:
https://gfxbench.com/device.jsp?D=NVIDIA+A100-PCIE-40GB&testgroup=overall
- RTX 3090:
https://gfxbench.com/device.jsp?D=NVIDIA+GeForce+RTX+3090&testgroup=overall
Ignore Tessellation/ALU2 because they're just broken on very fast chips afaict, but everything else that I'd half-trust on a good day is just terrible for a chip that size, e.g. the texturing test (ignore the absolute numbers - I know exactly how it works and it's a bit silly, but relative numbers should be fine within the same architecture) and Manhattan 3.1.1 1440p Offscreen.
So I did some quick calculations... The A100 40GiB PCIe has 7 out of 8 GPCs active with 108 SMs in total. Let's assume they *don't* support graphics across both sides of the GPU and only use the side with 4 GPCs active, so performance should be ~4/7 of peak, i.e. ~61 SMs out of 108. That compares to 82 SMs on the RTX 3090 (each with 2xFP32, though). The claimed boost clocks are ~1.4GHz for the A100 and ~1.7GHz for the RTX 3090. Now let's look at the texturing tests...
- A100: 263GTexel/s
- RTX 3090: 452GTexel/s
Scaling the A100 result up to the RTX 3090's SM count and clock: 263 * (82 SMs / 61 SMs) * (1.7GHz / 1.4GHz) = 429GTexel/s... which is only ~5% less than what the RTX 3090 actually gets.
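If you want to play with the numbers yourself, here's a tiny Python sketch of the same back-of-the-envelope scaling, using the rounded figures above (the "only the 4-GPC side does graphics" split is my assumption, and the GFXBench results are unverified):

```python
# Sanity check of the scaling estimate above, using the same rounded numbers.

a100_texels    = 263.0  # GTexel/s, A100 40GiB PCIe, GFXBench texturing (unverified)
rtx3090_texels = 452.0  # GTexel/s, RTX 3090, GFXBench texturing

a100_graphics_sms = 61  # assumption: only the 4-GPC side is used, ~4/7 of 108 SMs
rtx3090_sms       = 82

a100_boost_ghz    = 1.4  # claimed boost clocks (rounded)
rtx3090_boost_ghz = 1.7

# Scale the A100 result up to the 3090's SM count and boost clock.
scaled = a100_texels * (rtx3090_sms / a100_graphics_sms) \
                     * (rtx3090_boost_ghz / a100_boost_ghz)

print(f"A100 scaled to 3090 SMs/clock: {scaled:.0f} GTexel/s")   # ~429
print(f"RTX 3090 measured:             {rtx3090_texels:.0f} GTexel/s")
print(f"Shortfall: {100 * (1 - scaled / rtx3090_texels):.1f}%")  # ~5%
```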
So yeah, it's hard to tell, and relying on random unverified GFXBench numbers is even worse than relying on GFXBench in the first place (ugh)... but my suspicion is that NVIDIA didn't solve that problem in the A100 timeframe and they're only using half the GPU in graphics mode. Of course, even if that's true, they could have solved it for Blackwell... it's not an easy problem to solve efficiently, though. That's a LOT more (kinds of) inter-chip communication than just L2 cache coherency.