I'm curious what form the documentation for the GPU architecture will take now that the goal is to be approachable as a compute architecture as well. Intel's GPU documentation has been available in very dense form in the past, but it didn't seem to catch on as a topic of discussion to the extent that the other GPUs have.
edit: This is probably wandering too far afield from Xe, and I bear much of the responsibility for that.
To continue:
NVidia's original scoreboarding was costly partly because it was handling varying operand fetch latencies, not just instruction-dependency chains (register read-after-write).
Operand fetch latencies were complicated because the register file was very loosely organised: register allocations did not map simply onto the banks (the banks-versus-register-allocation problem, "vertical and horizontal allocations"), so register fetch latencies varied. The effect was that instruction-level scoreboarding had to track many more data points than we see with contemporary GPUs, making it more of an operand scoreboard.
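To make the banking point concrete, here's a toy model of how operand fetch time can vary with where the sources happen to land. The bank count and the register-to-bank mapping are my own assumptions for illustration, not the actual G80/Fermi layout:

```python
NUM_BANKS = 4

def bank_of(reg_id):
    # Assumed simple modulo mapping of registers to banks.
    return reg_id % NUM_BANKS

def operand_fetch_cycles(src_regs):
    # Each bank can serve one read per cycle, so sources that collide in the
    # same bank have to be read on successive cycles.
    reads_per_bank = {}
    for r in src_regs:
        b = bank_of(r)
        reads_per_bank[b] = reads_per_bank.get(b, 0) + 1
    return max(reads_per_bank.values())

print(operand_fetch_cycles([0, 1, 2]))   # banks 0,1,2 -> 1 cycle
print(operand_fetch_cycles([0, 4, 8]))   # all land in bank 0 -> 3 cycles
```

With that kind of variability, the scoreboard ends up tracking per-operand progress rather than just per-register readiness.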
Another factor mentioned in passing after the transition to the explicit encoding was that Fermi's architecture had reg/mem operations, where an instruction's source operands could come from either a register or a memory location, so the more intensive tracking needed for the latter may have bled into the tracking for the former. Later ISAs adopted a load-store model that would have made this unnecessary.
Even with the explicit wait cycles, I was under the impression that variable register fetch latency could still occur with bank conflicts. I thought the explicit delay tracking mainly dispensed with tracking the execution status of instructions whose outputs would be sourced by later operations, leaving a simpler set of status flags for register IDs with a short lifetime. The hardware pipeline could also drop stages devoted to the run-time analysis of checking multiple reads against multiple sources, if the explicit tracking bits can direct references to a specific location.
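As a rough sketch of the distinction I mean, assuming a fixed ALU result latency the compiler knows up front (the instruction names, the 5-cycle figure, and the stall encoding are made up for illustration, not any vendor's actual scheme):

```python
def run_with_scoreboard(program, raw_latency=5):
    # Style 1: hardware records when each destination register becomes readable,
    # and every issue scans its sources against that table at run time.
    ready_at, cycle = {}, 0
    for instr in program:
        while any(ready_at.get(s, 0) > cycle for s in instr["srcs"]):
            cycle += 1
        print(f"cycle {cycle}: issue {instr['op']}")
        ready_at[instr["dst"]] = cycle + raw_latency
        cycle += 1

def run_with_encoded_stalls(program):
    # Style 2: the instruction carries a compiler-computed stall count, so the
    # hardware only needs a countdown, not a per-register lookup.
    cycle = 0
    for instr in program:
        print(f"cycle {cycle}: issue {instr['op']}")
        cycle += 1 + instr["stall"]

# RAW dependence through v0; stall=4 reproduces the same 5-cycle producer/consumer
# gap the scoreboard version discovers at run time.
program = [
    {"op": "v_mul v0, v1, v2", "dst": "v0", "srcs": ["v1", "v2"], "stall": 4},
    {"op": "v_add v3, v0, v4", "dst": "v3", "srcs": ["v0", "v4"], "stall": 0},
]
run_with_scoreboard(program)
run_with_encoded_stalls(program)
```

Both versions issue on the same cycles; the difference is how much state and comparison logic the issue stage has to carry to get there.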
GFX10 was flagged by LLVM as having a banked vector register file, but I haven't seen it mentioned as such by AMD's presentations.
"Interlock" in RDNA sounds simply like a new case of waitcnt.
An interlock would be capable of automatically detecting a specific dependence. AMD's method allows for a much broader time window and number of outstanding operations, particularly in light of how many wavefronts would each need their own tracking, but at the cost of precision. A non-zero counter indicates something hasn't resolved yet, but which one of potentially dozens of operations that is wouldn't be known.
Cases where the count must be zero would likely mark architecturally visible points where the lack of interlocking is critical, such as how often scalar memory operations are recommended to be used with a waitcnt of 0.
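A toy model of how I read the counter scheme (the names mirror vmcnt/s_waitcnt, but the single counter and strictly in-order completion are my simplifications):

```python
class Wave:
    def __init__(self):
        self.vmcnt = 0        # outstanding vector-memory ops: all the hardware keeps
        self.in_flight = []   # which dst registers those ops will write; shown here
                              # for illustration, the counter itself records none of this

    def issue_load(self, dst):
        self.vmcnt += 1
        self.in_flight.append(dst)

    def retire_one(self):
        # Memory returns one result; the wave just sees the counter tick down.
        self.vmcnt -= 1
        return self.in_flight.pop(0)

    def waitcnt_satisfied(self, n):
        # s_waitcnt vmcnt(n): proceed once at most n ops remain outstanding.
        # n == 0 is the blunt "everything has resolved" case; any other value
        # tells you how many are left, not which specific ones.
        return self.vmcnt <= n

w = Wave()
for dst in ("v0", "v1", "v2"):
    w.issue_load(dst)
print(w.waitcnt_satisfied(0))   # False: three loads still in flight
w.retire_one(); w.retire_one()
print(w.waitcnt_satisfied(1))   # True: one remains, its identity not tracked by the counter
```

That trade is cheap per wavefront and scales to many outstanding operations, which seems to be the point, at the cost of only ever knowing a count.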
Some of the hazards/bugs from the following show microarchitectural gaps that RDNA still has when it comes to detecting dependences, frequently at points similar to those the GCN ISA listed as needing NOPs or wait states (EXEC dependences, scalar/vector sourcing, etc.):
https://gitlab.freedesktop.org/mesa...18f4a3c8abc86814143bf/src/amd/compiler/README
One interesting mention is the instruction s_waitcnt_depctr, which does not appear in the ISA doc.
It's indicated as a fix for an unhandled WAR hazard on the EXEC mask, something a more cohesive pipeline would have physical checks for.
This may be a check on a more global internal counter of the number of instructions that have left the instruction buffer and still have outstanding operand fetches. Using it would amount to relying on it as a sort of catch-all for missing interlocks or hazard checks.
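If that guess is right, the behaviour would be something like the following. This is purely a model of my speculation, not documented semantics for s_waitcnt_depctr:

```python
class DepCtrModel:
    # Hypothetical single counter of instructions whose operand fetches are
    # still in flight, drained as a catch-all barrier.
    def __init__(self):
        self.outstanding = 0

    def issue(self):
        # An instruction leaves the instruction buffer with reads pending.
        self.outstanding += 1

    def operand_fetch_done(self):
        self.outstanding -= 1

    def wait_depctr(self):
        # Waiting for zero guarantees no earlier instruction can still read a
        # stale EXEC or register value, papering over the missing WAR interlock
        # at the cost of draining everything in flight.
        return self.outstanding == 0

d = DepCtrModel()
d.issue(); d.issue()              # two instructions with reads outstanding
print(d.wait_depctr())            # False: not yet safe to overwrite their sources
d.operand_fetch_done(); d.operand_fetch_done()
print(d.wait_depctr())            # True: safe to write EXEC / the source registers
```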
Coincidentally, I ran across that band-aid showing up on AMD's GPUOpen site, in an article discussing the porting of the PS4 game Detroit to the PC on a 5700XT:
https://gpuopen.com/learn/porting-detroit-2/.
It also shows the compiler using s_inst_prefetch, which was noted as being capable of hanging shaders and wasn't documented in the original release of the RDNA ISA doc. It has apparently existed officially at least since the Feb 2020 version.
When the RDNA shader compiler emits a clause, the implication is that a stall caused by a register read-after-write must not be filled by switching threads. If a clause is not used, the GPU can switch threads to fill the stall.
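A minimal sketch of that scheduling implication, with made-up stall counts and a two-wave scheduler that doesn't pretend to be RDNA's actual arbitration:

```python
def issue_timeline(stall_cycles, in_clause):
    # Which wave gets the issue slot each cycle, for a producer/consumer pair
    # in wave0 with a RAW stall between them.
    timeline = ["wave0"]                       # producer issues
    for _ in range(stall_cycles):
        # Inside a clause wave0 keeps the slot and the cycles go idle;
        # otherwise the scheduler switches to another wave to fill the gap.
        timeline.append("idle" if in_clause else "wave1")
    timeline.append("wave0")                   # consumer issues
    return timeline

print(issue_timeline(3, in_clause=True))    # ['wave0', 'idle', 'idle', 'idle', 'wave0']
print(issue_timeline(3, in_clause=False))   # ['wave0', 'wave1', 'wave1', 'wave1', 'wave0']
```

The clause buys back-to-back access to a resource for one wave at the price of leaving those bubbles unfilled.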
Clause terminology showed up in Vega, briefly, in relation to scalar memory ops. The other AMD reference would be its VLIW scheduling days.
The weaker implicit version in Vega (now called groups?) suggests that a major motivation for bringing back an optional form of the VLIW-era method is to get better behavior from the memory pipeline. It may also feed into the introduction of the ordered memory mode. I'm less clear on the benefit of the VALU clauses, but perhaps there is some wavefront bring-up that is ALU-heavy and worth optimizing for at the expense of SIMD utilization.