Intel ARC GPUs, Xe Architecture for dGPUs [2018-2022]

Bondrewd · Aug 14, 2020

DavidGraham said:
We have a long history of GPU architectures to judge and forecast performance from

This isn't how you do it this year buddy.

DavidGraham said:
it lacks all of the features from DX12U, except hardware RT.

A-ha.
Gen12LP has no hardware RT.
Gen12HPG is ?????? for the featureset.

Lurkmass · Aug 14, 2020

Rootax said:
Maybe wait to see real performance before being so affirmative?

Also, I'm not sure what the big deal is about a VLIW architecture when there's a strong possibility that Kepler and Maxwell/Pascal were technically VLIW architectures as well, since they required explicit encoding to dual-issue instructions ...

Jawed · Aug 14, 2020

DavidGraham said:
I am not expecting the gaming chips (Xe-HPG) to provide any stellar or competitive performance with AMD or NVIDIA, since it relies on the same scalability scheme as Xe-HPC, ie: relies on racking up several tiles of graphics to scale up core count, this will be a mess for drivers and games in general.

I strongly believe this is bullshit. I expect NVidia to prove this by the end of 2022, perhaps 2021.

It's a mess when your architecture is wrong. Similar to how asynchronous compute is a mess when your architecture is wrong.

DavidGraham said:
The architecture relies purely on software scoreboarding (software schedulers), which means Intel will have its hands full writing good drivers for it to achieve good utilization (VLIW5 days anyone?). On top of that, they are scaling it up through tiling (a multi-core/multi-die approach), which means it's going to be a nightmare to write drivers for, and to extract good performance from.

Doesn't NVidia do multi-instruction issue? Doesn't the compiler produce multi-instruction bundles for max throughput?

Bondrewd · Aug 14, 2020

Jawed said:
I expect NVidia to prove this by the end of 2022, perhaps 2021.

No, AMD will do it, next year, for both client and DC because they feel like it.

Jawed said:
It's a mess when your architecture is wrong

You also need some very very delicate and nice packaging there.

Jawed said:
Doesn't NVidia do multi-instruction issue? Doesn't the compiler produce multi-instruction bundles for max throughput?

Yes and yes.

techuse · Aug 14, 2020

DavidGraham said:
We have a long history of GPU architectures to judge and forecast performance from, nothing is affirmed of course, but it's worth going through the motions to predict where performance will lie given what we already know from past experiences.

Furthermore, Xe-LP still retains the abysmal max 1 primitive per clock rate, and worse yet, it lacks all of the features from DX12U, except hardware RT.

Intel removed hardware scoreboarding from Gen11, which wasn't really that effective there to begin with. Gen11 had one Thread Control unit handling 2 ALUs, each ALU had control over 4 FP32 instructions, so in total each Thread Control unit had access to 8 FP32 instructions, which I would call a pretty weak arrangement to begin with. Intel didn't change this arrangement in Xe-LP; instead it allowed each Thread Control unit to supervise 16 FP32 instructions now, further weakening their already weak position.

Navi is 4, yes? What is Turing/Pascal?

Rootax · Aug 14, 2020

DavidGraham said:
We have a long history of GPU architectures to judge and forecast performance from, nothing is affirmed of course, but it's worth going through the motions to predict where performance will lie given what we already know from past experiences.

Furthermore, Xe-LP still retains the abysmal max 1 primitive per clock rate, and worse yet, it lacks all of the features from DX12U, except hardware RT.

Intel removed hardware scoreboarding from Gen11, which wasn't really that effective there to begin with. Gen11 had one Thread Control unit handling 2 ALUs, each ALU had control over 4 FP32 instructions, so in total each Thread Control unit had access to 8 FP32 instructions, which I would call a pretty weak arrangement to begin with. Intel didn't change this arrangement in Xe-LP; instead it allowed each Thread Control unit to supervise 16 FP32 instructions now, further weakening their already weak position.

So maybe, all in all, they are OK with software scoreboarding based on their experience with Gen11?

As for the DX12U features, yes, it's not cool, but I don't believe it will matter a lot for a first design. And they still support VRS Tier 1.

For me, they don't need to be perfect yet; they need to release the product, with good performance and good drivers. If they do that, it's already a big achievement imo. Then, of course, they need to improve, like AMD and NVIDIA...

trinibwoy · Aug 14, 2020

Lurkmass said:
Also I'm not sure what the big deal is behind a VLIW architecture when there's strong possibility that Kepler and Maxwell/Pascal were technically VLIW architectures as well since they required explicit encoding to dual-issue instructions ...

Was Pascal dual-issue managed by the compiler?

DavidGraham · Aug 14, 2020

techuse said:
Navi is 4 yes? What is Turing/Pascal?

Turing is 6, same as Pascal. Also, Navi/Vega are 4, but only in special circumstances, if I recall correctly.

techuse · Aug 14, 2020

DavidGraham said:
Turing is 6, same as Pascal. Also, Navi/Vega are 4, but only in special circumstances, if I recall correctly.

What do they fall back to? Do you have any info on the circumstances?

Lurkmass · Aug 14, 2020

trinibwoy said:
Was Pascal dual-issue managed by the compiler?

It depends mostly on how clever the programmer is with the hardware. If they understand the conditions and restrictions behind dual-issue instruction scheduling, then it's possible for the compiler to generate code that uses these dual-issue instructions ...

Malo · Aug 14, 2020

I thought Vega was 4 and "up to 17" based on the use of their primitive shader? (which obviously never came about)

DavidC · Aug 14, 2020

DavidGraham said:
Xe-LP still retains the abysmal max 1 primitive per clock rate, and worse yet, it lacks all of the features from DX12U, except hardware RT.

That's from the AnandTech article's analysis?

Well they updated it:

Update: Intel has since shot me a note stating that they have in fact upgraded their geometry front-end, so this is not the same 1 triangle/clock hardware as on earlier Intel GPUs. Xe-LP's geometry frontend can now spit out two backface culled triangles per clock, doubling Intel's peak geometry performance on top of Xe-LP's clockspeed improvements.

Digidi · Aug 15, 2020

2 rasterizers for 768 shaders? That is a lot.

The question is how many polygons the front end can accept. We know that after backface culling it can rasterize 2 polygons, but how many polygons is it able to cull?

Ryan Smith · Aug 15, 2020

DavidC said:
That's from the Anandtech article analysis?

Well they updated it:

Yep. Leave it to Intel to completely overhaul their geometry front end for parallel execution, and then not bother telling anyone.

The upshot, at least, is that their technical team is paying attention to what's being written. So if we get something wrong, they've been giving us the correct data.

trinibwoy · Aug 15, 2020

Does anyone even benchmark raw triangle throughput anymore? I think the last was hardware.fr. Damien is sorely missed.

Ryan Smith · Aug 15, 2020

trinibwoy said:
Does anyone even benchmark raw triangle throughput anymore? I think the last was hardware.fr. Damien is sorely missed.

Most of the tools for this are quite old these days, as low-level benchmarks don't garner the interest they once did. Unfortunately, I'm not sure what Damien was using to begin with.

trinibwoy · Aug 15, 2020

Ryan Smith said:
Most of the tools for this are quite old these days, as low-level benchmarks don't garner the interest they once did. Unfortunately, I'm not sure what Damien was using to begin with.

Yeah that’s understandable. Times have changed. It’s not even clear whether geometry throughput has a material impact on overall game performance these days. Maybe someone will write a mesh shader bench.

techuse · Aug 15, 2020

trinibwoy said:
Yeah that’s understandable. Times have changed. It’s not even clear whether geometry throughput has a material impact on overall game performance these days. Maybe someone will write a mesh shader bench.

Doesn't seem to be too important, given how well GCN compares to NV in most AAA games. This could be way too simplistic a view, though.

Digidi · Aug 16, 2020

Ryan Smith said:
Most of the tools for this are quite old these days, as low-level benchmarks don't garner the interest they once did. Unfortunately, I'm not sure what Damien was using to begin with.

Maybe it's really time for @Rys to make a new Beyond3D suite.

I asked him to do that here: https://forum.beyond3d.com/threads/benchmark-tool-like-beyond3d-suite.61013/

3dcgi · Aug 16, 2020

Digidi said:
The question is how many polygons the front end can accept. We know that after backface culling it can rasterize 2 polygons, but how many polygons is it able to cull?

I interpreted the quote to say they can cull two primitives per clock, meaning they likely rasterize one that survives culling.
