Intel ARC GPUs, Xe Architecture for dGPUs [2022-]

I could see the INT SIMD unit being extended with FP math for Celestial. Turing's 1:1 FP:INT split was unbalanced.
I think Volta/Turing's architecture has been partly misunderstood, tbh. The 32x32+32 integer multiply-add ran down the same pipeline as the FP32 FMA, reusing some of the hardware, so the "integer" path was mostly just the cheap stuff: integer add, bitwise ops, etc. A key benefit of "underpowering" the FP path was that it left spare instruction fetch/decode throughput, not just for INT but also for control ops like branches, which used to be effectively free on Volta/Turing. Now those control ops prevent the second FMA (or a special-function op like rcp/log/etc.) from being co-issued, so each control op is quite expensive. The extra FP32 FMA is cheap, so it's a clear PPA win, but the previous design made a lot of sense, and now it's fetch/decode that's the bottleneck (which should be cheap area-wise, but is hard to scale architecturally given how everything else works).
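To make that trade-off concrete, here is a minimal CUDA sketch (my own illustration, not anything from the post): the integer address math and the loop branch in a grid-stride loop are exactly the kind of work Volta/Turing could issue alongside the FMA essentially for free, whereas on the dual-FP32 designs each of those ops competes with the second FMA for an issue slot.

```cuda
// Illustrative only: an inner loop whose instruction mix shows the issue-port
// trade-off described above (names and the scale constant are arbitrary).
__global__ void fma_int_mix(const float* __restrict__ in,
                            float* __restrict__ out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;   // integer lane math ("cheap" INT path)
    float acc = 0.0f;
    // Grid-stride loop: the compare/branch are the control ops discussed above,
    // and the index update plus address calculation are plain integer adds.
    for (int k = i; k < n; k += gridDim.x * blockDim.x) {
        acc = fmaf(in[k], 1.0009765625f, acc);       // FP32 FMA on the main pipe
    }
    if (i < n) out[i] = acc;
}
```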

To go back to Intel GPUs, I've never seen much information at this level of detail on how their execution ports work, but based on Figure 4 on Page 5 here, it looks to me like they don't have FP/INT co-issue at all on the "Xe-LP" parts, and Port 1 for "EM" is just for special functions (log/exp/rcp/rsq/etc.), but they do allow co-issue of branch instructions (and SEND; I'm not sure what those are, fragment/vertex output and blending?). So that *might* just be a simple dual-issue with INT/FP on one side and EM/BRANCH/SEND co-issuable on the other, or something more complex.
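Purely to pin down what that simpler reading would mean, here is a toy two-port issue model (host-side code, entirely my speculation from that diagram, not anything Intel documents; the instruction classes and the co-issue rule are assumptions):

```cuda
// Speculative toy model of the "Xe-LP" guess above: FP and INT share one ALU
// port, while EM (rcp/rsq/log/exp), branches and SEND sit on a second port,
// and one instruction from each port may co-issue per clock.
#include <cstdio>

enum class Op { FpFma, IntAdd, EmRcp, Branch, Send };

static bool onAluPort(Op op) { return op == Op::FpFma || op == Op::IntAdd; }

static bool canCoIssue(Op a, Op b)
{
    // Co-issue only when the two instructions land on different ports.
    return onAluPort(a) != onAluPort(b);
}

int main()
{
    printf("FMA  + IADD   -> %d (both on the ALU port, no co-issue)\n", canCoIssue(Op::FpFma, Op::IntAdd));
    printf("FMA  + RCP    -> %d (ALU port + EM/branch/SEND port)\n",    canCoIssue(Op::FpFma, Op::EmRcp));
    printf("IADD + BRANCH -> %d (ALU port + EM/branch/SEND port)\n",    canCoIssue(Op::IntAdd, Op::Branch));
}
```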

But then they have 3 ports on Xe-HPG: FP + INT/EM + XMX... I am a bit skeptical (from my own personal GPU HW architecture experience) of merging the INT & EM ports, because EM is really quite special: these are all "single source, single destination" operations with higher internal latency than most other things (although potentially similar to an INT32 multiply-add if you optimise them a certain way). As with everything in Intel GPUs, AFAICT they seem to gravitate towards more and more complexity, with dubious high-level architectural choices like SIMD8/SIMD16 hurting their PPA more than all of their clever tricks put together help *shrugs*

Also I wouldn't say NVIDIA is SIMD16, because their register files are physically SIMD32 (2 banks at SIMD32 means an effective 4 reads + 4 writes at SIMD16 per cycle), and you cannot get more than 50% throughput for half-filled warps under *any* circumstance (unlike on G80, where such a mode did exist for vertex shaders, but even one of their lead compute architects I spoke with in person back in 2008 didn't know about it!). My assumption is that Intel can achieve 100% throughput for SIMD16 on Battlemage, which is a very different set of trade-offs.

The dominant expectation Petersen talks about is a sort of graphics-programming zeitgeist: it's whatever the done thing is. As Petersen later explains, the done thing in graphics is whatever "the dominant architecture" does, i.e. Nvidia.
This hits home hard though - I've had to argue that exact same thing at previous GPU employers repeatedly over the years, sadly. Too many GPU architects try to be too clever in their own unique way, and it rarely works out in practice for exactly that reason. Because AMD has both consoles, if you're "AMD-like" rather than "NVIDIA-like" then you're possibly okay as well, but doing something completely different, with its own unique and unexpected performance cliffs that developers won't anticipate and certainly won't optimise for, is asking for trouble...
 
Also I wouldn't say NVIDIA is SIMD16, because their register files are physically SIMD32 (2 banks at SIMD32 means an effective 4 reads + 4 writes at SIMD16 per cycle), and you cannot get more than 50% throughput for half-filled warps under *any* circumstance (unlike on G80, where such a mode did exist for vertex shaders, but even one of their lead compute architects I spoke with in person back in 2008 didn't know about it!). My assumption is that Intel can achieve 100% throughput for SIMD16 on Battlemage, which is a very different set of trade-offs.

Good point. Operand fetch and instruction issue are probably both SIMD32 issuing to each hardware SIMD16 block in alternate clocks.
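To put rough numbers on the 50% vs 100% point, here is a tiny back-of-envelope model (host-side code, my own sketch; it just assumes a partially filled warp/wave still occupies whole issue slots of the physical datapath, which is the crux of the argument above):

```cuda
// Datapath utilisation for a partially filled warp/wave on datapaths of
// different physical widths. A half-filled warp on SIMD32 hardware wastes
// half the lanes; a SIMD16 wave on SIMD16 hardware wastes none.
#include <cstdio>

static double utilisation(int activeLanes, int physicalWidth)
{
    // Issue slots consumed, rounding up: a partially filled slot still counts.
    const int slots = (activeLanes + physicalWidth - 1) / physicalWidth;
    return static_cast<double>(activeLanes) / (slots * physicalWidth);
}

int main()
{
    printf("16 active lanes on SIMD32 hardware: %.0f%%\n", 100.0 * utilisation(16, 32)); // 50%
    printf("16 active lanes on SIMD16 hardware: %.0f%%\n", 100.0 * utilisation(16, 16)); // 100%
    printf("32 active lanes on SIMD16 hardware: %.0f%%\n", 100.0 * utilisation(32, 16)); // 100%, over two clocks
}
```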
 
This hits home hard though - I've had to argue that exact same thing at previous GPU employers repeatedly over the years, sadly. Too many GPU architects try to be too clever in their own unique way, and it rarely works out in practice for exactly that reason. Because AMD has both consoles, if you're "AMD-like" rather than "NVIDIA-like" then you're possibly okay as well, but doing something completely different, with its own unique and unexpected performance cliffs that developers won't anticipate and certainly won't optimise for, is asking for trouble...
I know it hurts your ego. I was in the same position. But the only rational decision is to follow the standard, unless you have something that gives a substantial performance advantage (not in a subroutine test but in overall fps) and that the majority of developers will implement (which is the hardest part when you are not the dominant player). Kudos, Arun, for your honesty on this topic; I feel it's kind of taboo in the engineering world, yet it's the reason for so many failures...
 
Game developers don't write/test their applications against some theoretical gfx API specs! They've ALWAYS developed their applications against concrete driver implementations of said gfx APIs, so what Petersen articulated shouldn't really come as some astonishing revelation. There are going to be developers out there who'll continue to brush aside Intel gfx HW indefinitely...
 
As Intel revealed back in June 2023, the Falcon Shores chips will take the massively parallel Ethernet fabric and matrix math units of the Gaudi line and merge them with the Xe GPU engines created for Ponte Vecchio. This way, Falcon Shores can have 64-bit floating point processing and matrix math processing at the same time. Ponte Vecchio does not have 64-bit matrix processing, just 64-bit vector processing, which was done intentionally to meet the FP64 needs of Argonne. That’s great, but it means Ponte Vecchio is not necessarily a good fit for AI workloads, which would limit its appeal. Hence the merger of the Gaudi and Xe compute units into a single Falcon Shores engine.

We don’t know much about Falcon Shores, but we do know that it will weigh in at 1,500 watts, which is 25 percent more power consumption and heat dissipation than the top-end “Blackwell” B200 GPU expected to be shipping in volume early next year, which is rated at 1,200 watts and delivers 20 petaflops of compute at FP4 precision. With 25 percent more electricity burned, Falcon Shores had better have at least 25 percent more performance than Blackwell at the same floating point precision and on roughly the same chip manufacturing process. Better still, Intel had better be using its Intel 18A manufacturing process, expected to be in production in 2025, to make Falcon Shores, and it had better have even more floating point oomph than that. And Falcon Shores 2 had better be on the even smaller Intel 14A process, which is expected in 2026.
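As a quick sanity check on that arithmetic (host-side code using only the figures quoted above; the break-even number is my own extrapolation, not any vendor's spec):

```cuda
// Break-even estimate: how much FP4 throughput Falcon Shores would need just
// to match the B200's performance per watt, given the figures quoted above.
#include <cstdio>

int main()
{
    const double falconShoresWatts  = 1500.0;  // stated Falcon Shores power
    const double blackwellWatts     = 1200.0;  // stated B200 power
    const double blackwellFp4Pflops = 20.0;    // stated B200 FP4 throughput

    const double powerRatio      = falconShoresWatts / blackwellWatts;   // 1.25x
    const double breakEvenPflops = blackwellFp4Pflops * powerRatio;      // 25 PF

    printf("Power ratio: %.2fx\n", powerRatio);
    printf("FP4 petaflops needed for perf/watt parity: %.0f\n", breakEvenPflops);
}
```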

It is past time for Intel to stop screwing around in both its foundry and chip design businesses. TSMC has a ruthless drumbeat of innovation, and Nvidia’s GPU roadmap is relentless. There is an HBM memory bump and possibly a GPU compute bump coming with “Blackwell Ultra” in 2025, and the “Rubin” GPU comes in 2026 with the “Rubin Ultra” follow-on in 2027.
 