Intel ARC GPUs, Xe Architecture for dGPUs [2018-2022]

That XMX throughput assumes the same capabilities as the Matrix Engines in Ponte Vecchio. When asked, Intel reps would not give more details on Alchemist than what was presented in the slides. That may have been different for the US press, but if so, they should be explicit about it.

It is possible, if unlikely, that for consumer-grade GPUs Intel chose to put pure inference engines there, while Ponte Vecchio's HPC-style XMX units are half as many and twice as wide, churning out 2048 ops/clk on TF32 (!), 4096 on FP16 & BF16, and 8192 on INT8. Its Vector Engines have been refactored too (8 x 512-bit vs. 16 x 256-bit), if that's any indication.
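To make the precision scaling explicit, here is a minimal Python sketch of that relationship, using the Ponte Vecchio figures quoted above as the baseline; nothing in it is a confirmed Alchemist spec.

def ops_per_clk(tf32_baseline: int, element_bits: int) -> int:
    """Scale ops/clk from the TF32 baseline, assuming throughput is inversely
    proportional to element width (TF32 treated as a 32-bit container)."""
    return tf32_baseline * 32 // element_bits

for name, bits in (("TF32", 32), ("FP16/BF16", 16), ("INT8", 8)):
    print(f"{name}: {ops_per_clk(2048, bits)} ops/clk")  # 2048, 4096, 8192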
 

This part of the interview made me think they're planning on running XeSS completely off the Xe iGPUs present in the current Tiger Lake and future Alder Lake CPUs.


Digital Foundry: Here's an interesting thought that I had during the Architecture Day, which is that essentially, you have machine learning silicon not just in the GPU, but also in the CPU. Let's say, I own an older GeForce or Radeon card, and I want to tap into XeSS. Can I do that via the CPU?

Tom Petersen: Well, you know that there's an integrated GPU on most of our CPUs. And so, the question is, what would it look like? And I'm sure, you're aware of like how hybrid works for most notebooks where there's a discrete GPU render, and then there's a copy to an integrated GPU, that today does nothing other than act as a display controller really. But now that we have technologies that are really cool, could we do something interesting on the GPU? I think that entire space, we called it Deep Link. And what happens in terms of Deep Link right now, we're still learning so much here. And there are so many opportunities. Today, it's just Intel products working together, but you can think about Deep Link as just like, what can we do in a two GPU environment or a CPU/GPU environment that's better than otherwise? So, I don't want to answer that question directly, but let's just say there's lots of opportunity in that space.
 
I've not seen any quote from Intel folks about a "deep-learning upscaling method running efficiently on their integrated Iris Xe using RPM DP4A"; can you share the quotes from the Intel guys?
Sorry, but I don't buy it; someone's retellings (which can be based on wrong assumptions) are not Intel's claims.
Is this good enough?
We want the benefits of XᵉSS to be available to a broad audience, so we developed an additional version based on the DP4a instruction, which is supported by competing GPUs and Intel Xᵉ LP-based integrated and discrete graphics.
Straight from Intel blog @ Medium
https://medium.com/intel-tech/the-n...el-arc-high-performance-graphics-f68e7d2dc068
 
Is this good enough?
No, there really is no word on a "deep-learning upscaling method running efficiently on their integrated Iris Xe using RPM DP4A".

The part "supported by Intel Xᵉ LP-based integrated and discrete graphics" means that XeSS runs on Iris Xe, which tells us nothing new, since we all know that Xe LP supports Shader Model 6.4 with the DP4A instruction. There is no word on whether XeSS is performant enough even for reconstruction to 1080p on Iris Xe in games.
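As a side note on what DP4A actually is: a packed four-wide INT8 dot product accumulated into 32 bits. A rough Python emulation of the semantics (not the actual Shader Model 6.4 intrinsic, just an illustration) looks like this:

import numpy as np

def pack_s8x4(values) -> int:
    """Pack four signed 8-bit values into one 32-bit word (little-endian)."""
    return int.from_bytes(np.array(values, dtype=np.int8).tobytes(),
                          "little", signed=True)

def dp4a(a_packed: int, b_packed: int, acc: int) -> int:
    """Emulate a DP4A-style op: unpack four signed 8-bit lanes from each
    32-bit operand, multiply pairwise, and accumulate into 32 bits. The
    real instruction does all of this in a single operation, which is
    where the INT8 throughput advantage comes from."""
    a = np.frombuffer(np.int32(a_packed).tobytes(), dtype=np.int8).astype(np.int32)
    b = np.frombuffer(np.int32(b_packed).tobytes(), dtype=np.int8).astype(np.int32)
    return int(acc + np.dot(a, b))

print(dp4a(pack_s8x4([1, 2, 3, 4]), pack_s8x4([5, 6, 7, 8]), 0))  # 1*5+2*6+3*7+4*8 = 70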
What we really have is Intel's figure of 1.5x the frequency of the desktop DG1 part, perf modelling with that figure in mind (though, in order to compete with RTX 3070-grade GPUs, Xe Max has to be ~12x the discrete Xe LP), and Intel's XMX/DP4A graphs, which are based on real testing.
Based on these numbers from Intel, I estimated Xe LP runtimes here and here. Do you remember how restrictive DLSS 1.0 was on far more performant GPUs? Would you consider something like 5.5 ms for 1080p reconstruction efficient on the low-performance Xe LP?
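For what it's worth, the scaling arithmetic behind such an estimate is simple; here is a sketch with placeholder inputs (none of the specific numbers below are Intel's, they are merely chosen to land near the ~5.5 ms figure above):

def estimate_cost_ms(measured_ms: float, throughput_ratio: float,
                     clock_ratio: float) -> float:
    """Scale a measured upscaling-pass cost to a slower part, assuming the
    pass is throughput-bound: throughput_ratio is (INT8 ops/clk of the
    measured GPU) / (DP4A ops/clk of the target), clock_ratio is
    (measured clock) / (target clock)."""
    return measured_ms * throughput_ratio * clock_ratio

# Hypothetical example: a 1.2 ms XMX pass, a target with one third of the
# INT8 throughput and two thirds of the clock -> about 5.4 ms.
print(estimate_cost_ms(1.2, 3.0, 1.5))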
 
Probably because Lumen does a whole bunch of work that has a constant cost across different resolutions.
From Epic's docs, it seems Lumen's shading cache is decoupled from screen resolution, the world-space probes are decoupled, and the geometry work is at least partially decoupled.
I did wonder about that, but it would be a really bad example to use for promotional purposes.
Do we know if the demo was using UE5?

It just seems very strange that running at 1080p only gets you up to a 2x speedup, which implies that's the most you should really expect.
I would have thought that figure applied to 1440p.
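That cap is exactly what you get when part of the frame does not scale with resolution, as suggested above for Lumen. A quick sanity check with illustrative numbers only:

def upscale_speedup(full_res_ms: float, fixed_ms: float, pixel_ratio: float) -> float:
    """Speedup from rendering at a reduced internal resolution when fixed_ms
    of the frame (e.g. resolution-independent GI/probe work) does not shrink.
    pixel_ratio is internal pixels / output pixels, e.g. 0.25 for 1080p -> 2160p."""
    variable_ms = full_res_ms - fixed_ms
    return full_res_ms / (fixed_ms + variable_ms * pixel_ratio)

# Illustrative: a 33.3 ms native-4K frame with 10 ms of resolution-independent
# work only reaches ~2.1x at a 1080p internal resolution, even though just a
# quarter of the pixels are shaded.
print(round(upscale_speedup(33.3, 10.0, 0.25), 2))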

I need to check what the quality settings and internal resolutions are for DLSS 2.1.
 
Man, talk about being anal about semantics.
Are you trying to justify GEMM tumors everywhere now?
They don't do anything, duh.


During the Digital Foundry interview, it really seemed like Richard Leadbetter was trying to figure out whether DG2's dedicated tensor units are instrumental for XeSS to run effectively and efficiently, and IMO the answer was "no", again because Tom Petersen pointed to the Xe LP iGPUs in their current and future CPUs as candidates to run XeSS.

Here's the timestamped video:

 
It seems to me that the answer to that question depends significantly on the target framerate. If DLSS/XeSS took, say, an extra 2ms per frame on tensor units and 4ms using DP4A, then the latter might be fine if you aim for 30-60fps, but not for 120fps.
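Putting that in frame-budget terms (the 2 ms and 4 ms figures are the hypothetical costs above, not measurements):

# Share of the frame budget a hypothetical upscaling pass would consume.
for fps in (30, 60, 120):
    budget_ms = 1000.0 / fps
    for pass_ms in (2.0, 4.0):
        print(f"{fps:>3} fps: {budget_ms:5.1f} ms budget, "
              f"{pass_ms:.0f} ms pass = {pass_ms / budget_ms:.0%} of the frame")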
 
TAAU in UE4 costs less than 1ms for 1080p -> 2160p on a 3090. The difference between a 4ms ML approach and a 1ms TAAU approach will never be worth the huge performance impact.
 
TAAU in UE4 costs less than 1ms for 1080p -> 2160p on a 3090. The difference between a 4ms ML approach and a 1ms TAAU approach will never be worth the huge performance impact.
That 4ms approach becomes the new 1ms approach as the hardware improves, though, until we stop getting returns from method improvements. If we assume that current methods are already there, then yes, a 3090 might well be beyond the point where tensor cores add much for temporal sampling/upscaling, while it's the 3060/3070 range that benefits most. Personally, I believe there's a huge unexplored space of ML-based methods for 3D rendering that would benefit from lower-precision matmul acceleration, yet it's rather difficult to predict what might come out of it.
 
I'm mostly wondering about the roadmap with Alchemist > Battlemage > Celestial > Druid: annual releases, perhaps? Considering the also-rumoured roadmap of Alder Lake through Nova Lake in 2025, I wonder if Druid and Nova Lake (on the Intel 20A/18A process?) mark the start of Intel doing annual CPU+GPU releases, where Intel doesn't simply want to beat AMD in performance per watt but also aims at Apple.
 