Intel Xe Architecture for dGPUs

Discussion in 'Architecture and Products' started by DavidGraham, Dec 12, 2018.

  1. Digidi

    Regular Newcomer

    Joined:
    Sep 1, 2015
    Messages:
    309
    Likes Received:
    152
    #281 Digidi, Sep 15, 2020
    Last edited: Sep 15, 2020
  2. pcchen

    Moderator Veteran Subscriber

    Joined:
    Feb 6, 2002
    Messages:
    2,862
    Likes Received:
    313
    Location:
    Taiwan
    According to this review, Intel Xe seems to be doing pretty well for an integrated graphics unit. At least in 3DMark.
     
    jlippo likes this.
  3. Panino Manino

    Newcomer

    Joined:
    Nov 27, 2017
    Messages:
    69
    Likes Received:
    62
  4. cheapchips

    Veteran Newcomer

    Joined:
    Feb 23, 2013
    Messages:
    1,463
    Likes Received:
    1,415
    Isn't the real challenge how well it scales to discrete GPU sizes?
     
  5. tekyfo

    Newcomer

    Joined:
    Apr 12, 2012
    Messages:
    6
    Likes Received:
    1
    From what I know, Kepler still had the failsafe mechanisms in case the compiler-inserted control codes were wrong. Code with wrong hints would be slow, but not incorrect. Maxwell and subsequent archs removed the failsafes; apparently they were confident in their compilers by then.

    RDNA1 seems to have hardware scoreboarding, but I think that is a prime subject for removal in RDNA2, since software scheduling is just more power efficient. The most elegant approach would be GCN's, where arithmetic dependencies just don't exist.

    Since Volta, NVIDIA has moved from dual issue to single issue. The two units (FP + INT or FP/INT + FP) can still be fully utilized, as they are only half rate and NVIDIA also doubled the number of thread dispatch units.
     
    DavidGraham likes this.
  6. 3dilettante

    Legend Alpha

    Joined:
    Sep 15, 2003
    Messages:
    8,469
    Likes Received:
    4,398
    Location:
    Well within 3d
    Which aspects appear to be using hardware scoreboarding?
    Generally, RDNA's single-issue per wavefront capability would not require much in the way of tracking. Until the current instruction has released a pipeline interlock upon completion, no possible dependent instructions can issue.
    Some of GCN's quirks in this regard came with the interplay between vector and scalar domain functions, which may have been redesigned with RDNA or made less of an issue because the scalar path is no longer shared between vector units, which might have made an interlock more feasible.
    There are still errata for some of the corner cases, or when dealing with specific instruction combinations that interact with separate or shared domains such as LDS or VMEM.
    The more general way of handling separate domains is the wait count functionality, which isn't scoreboarding, nor is it any different from GCN.
     
  7. tekyfo

    Newcomer

    Joined:
    Apr 12, 2012
    Messages:
    6
    Likes Received:
    1
    It was my understanding that these pipeline interlocks would have to be encoded by the compiler in the instruction as control codes, which is what would be termed "software scheduling". Not having these interlocks as control codes in the instruction stream requires hardware to track the dependencies. As I see it, single or dual issue doesn't fundamentally change that; dual issue just makes the issue (ha!) more pressing, since the issue pace would be higher and probably more logic would be required: not just one but two instructions have to be checked for whether they can issue, and the second instruction is influenced by the first.

    GCN didn't need pipeline interlocks, so there are no control codes in the encoding. I assumed that RDNA's encoding is not that much different and does not contain control codes, so as to keep the software view as similar as possible to GCN.
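    The control-code idea can be caricatured as a toy compiler pass: given a fixed result latency, the compiler itself computes the stall each instruction needs and encodes it alongside the instruction, so the hardware never has to track dependencies. This is a minimal sketch, not any real ISA; the 4-cycle latency, instruction format, and register names are all invented for illustration.

```python
# Toy sketch of compiler-side stall insertion ("software scheduling"),
# assuming a hypothetical fixed 4-cycle ALU result latency.
ALU_LATENCY = 4

def insert_stalls(instrs):
    """instrs: list of (dest_reg, [src_regs]). Returns (stall, instr) pairs."""
    ready_at = {}            # register -> cycle its value becomes readable
    scheduled = []
    cycle = 0
    for dest, srcs in instrs:
        # Stall until every source operand's producer has completed.
        earliest = max([ready_at.get(s, 0) for s in srcs], default=0)
        stall = max(0, earliest - cycle)
        cycle += stall + 1                  # issue itself takes one cycle
        ready_at[dest] = cycle + ALU_LATENCY - 1
        scheduled.append((stall, (dest, srcs)))
    return scheduled

program = [("r0", []), ("r1", ["r0"]), ("r2", []), ("r3", ["r1", "r2"])]
for stall, (dest, srcs) in insert_stalls(program):
    print(f"stall={stall}  {dest} <- {srcs}")
```

    A wrong hint in this model would only misplace a stall (slow but correct on Kepler, per the failsafes; potentially incorrect on Maxwell onward once the failsafes were removed).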
     
  8. Jawed

    Legend

    Joined:
    Oct 2, 2004
    Messages:
    11,100
    Likes Received:
    1,186
    Location:
    London
    RDNA has clause as a "control code":

     
  9. 3dilettante

    Legend Alpha

    Joined:
    Sep 15, 2003
    Messages:
    8,469
    Likes Received:
    4,398
    Location:
    Well within 3d
    A pipeline interlock is a hardware check in the pipeline that stalls the next instruction if there is a dependence on an instruction currently in-progress. It's a check near the top of the pipeline that can be relatively simple if the execution model is straightforward. With AMD's GPUs, a single wavefront can only issue one instruction at a time, so a relatively small set of flags would need to be checked. Scoreboarding usually tracks a more significant amount of data, often for the purposes of some kind of re-ordering or handling multiple instructions from a thread in-flight, which wouldn't happen here.

    The lack of interlocks is where the compiler/developer is expected to have control codes or insert NOPs or independent instructions.
    GCN had a simpler way of handling hazards for vector instructions by not issuing anything until the current instruction's result was able to be bypassed, hence the longer cadence.
    It did have areas where the lack of interlocks was evident, either in the table of wait states for various fall-through cases, or where waitcnt instructions are needed.
    RDNA removes most of the wait states, but keeps waitcnt instructions and has a number of "bugs" that don't look that different from some of the wait states GCN had in its table.
    RDNA's longer execution latency versus its issue latency means there is now a specific check whether there is a dependence in a small window of register IDs, although I don't know if that counts as full scoreboarding in the classical sense.
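    That small-window check could look something like the following toy model: a short queue of in-flight destination register IDs, with an incoming instruction stalling at issue if any of its sources match. The 4-entry window and the eviction-as-retirement simplification are invented for illustration, not taken from any RDNA documentation.

```python
from collections import deque

# Minimal sketch of an issue-stage dependence check against a small
# window of in-flight destination register IDs. The deque's maxlen
# evicts the oldest entry as new writes enter - a crude stand-in for
# results retiring in order.
WINDOW = 4

def must_stall(in_flight, srcs):
    """in_flight: deque of dest register IDs not yet written back."""
    return any(s in in_flight for s in srcs)

in_flight = deque(maxlen=WINDOW)
trace = []
for dest, srcs in [(0, []), (1, [0]), (2, [5]), (3, [1, 2])]:
    trace.append(must_stall(in_flight, srcs))
    in_flight.append(dest)
print(trace)    # which instructions would have stalled at issue
```

    Whether a check this narrow counts as "scoreboarding" is exactly the terminology question above: it tracks far less state than a classical scoreboard.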
     
  10. Jawed

    Legend

    Joined:
    Oct 2, 2004
    Messages:
    11,100
    Likes Received:
    1,186
    Location:
    London
    NVidia's original scoreboarding was costly partly because it was handling varying operand fetch latencies, not just instruction-dependency chains (register read-after-write).

    Operand fetch latencies were complicated because the register file was very loosely organised (or rather, register allocations were not simply mapped, the banks versus register allocation problem, "vertical and horizontal allocations"). This meant that register fetch latencies varied. The effect was that instruction-level scoreboarding had to track many more datapoints than we see with contemporary GPUs, making it an operand scoreboarder.

    "Interlock" in RDNA sounds simply like a new case of waitcnt.

    When the RDNA shader compiler emits a clause, the implication is that a stall caused by a register read-after-write must not be filled by switching thread. If a clause was not used, then the GPU can switch thread to fill the stall.
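    That clause semantic can be sketched as an issue picker: a wavefront inside a clause holds the issue slot even while stalled, while outside a clause a stalled wavefront lets the scheduler switch to another ready one. Purely illustrative; the flags and structure are invented, not from the ISA docs.

```python
# Toy issue picker illustrating clause semantics. Returns the index of
# the wavefront that gets the issue slot, or None for a pipeline bubble.
def pick(waves):
    """waves: list of dicts with 'stalled' and 'in_clause' flags."""
    # A clause owner keeps the slot whether or not it can issue.
    for i, w in enumerate(waves):
        if w["in_clause"]:
            return i if not w["stalled"] else None    # None = bubble
    # Otherwise issue from any ready wavefront.
    for i, w in enumerate(waves):
        if not w["stalled"]:
            return i
    return None

waves = [{"stalled": True, "in_clause": True},
         {"stalled": False, "in_clause": False}]
print(pick(waves))    # None: the clause forces a bubble despite wave 1 being ready
waves[0]["in_clause"] = False
print(pick(waves))    # 1: free to switch to the ready wavefront
```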

    Obviously indexed registers and local data share gather operations (or lane-to-lane broadcasts that aren't 1:1 mapped) cause all GPUs to engage in some kind of operand scoreboarding.
     
    Lightman and BRiT like this.
  11. 3dilettante

    Legend Alpha

    Joined:
    Sep 15, 2003
    Messages:
    8,469
    Likes Received:
    4,398
    Location:
    Well within 3d
    I'm curious what form the documentation for the GPU architecture will take now that the goal is to be approachable as a compute architecture as well. Intel's GPU documentation has been available in very dense form in the past, but it didn't seem to catch on as a topic of discussion to the extent that the other GPUs have.

    edit: This is probably wandering too far afield from Xe, which I have much of the responsibility for.
    To continue:

    Another factor that was mentioned in passing after the transition to the explicit encoding was that Fermi's architecture had reg/mem operations, where an instruction's source operands could be from a register or memory location, meaning the more intensive tracking methods for the latter may have bled into the tracking for the former. Later ISAs took on a load-store model that would have made this unnecessary.
    Even with the explicit wait cycles, I was under the impression that variable register fetch latency could still occur with bank conflicts. I thought the explicit delay tracking helped with dispensing with tracking the execution status of instructions whose outputs would be sourced by later operations. It would be a simpler set of status flags for register IDs, with a short lifetime. The hardware pipeline would also dispense with stages devoted to the run-time analysis of having to check multiple reads with multiple sources, if the explicit tracking bits can direct references to a specific location.

    GFX10 was flagged by LLVM as having a banked vector register file, but I haven't seen it mentioned as such by AMD's presentations.

    An interlock would be capable of automatically detecting a specific dependence. AMD's method allows for a much broader time window and number of outstanding operations, particularly in light of how many wavefronts would have their individual tracking, but at the cost of precision. A non-zero counter indicates something hasn't resolved yet, but which one of potentially dozens of operations that is wouldn't be known.
    Examples where the count must be zero would likely be classified as an architecturally visible point where there is a critical lack of interlocking, such as how often scalar memory operations are recommended to be used with a waitcnt of 0.

    Some of the hazards/bugs from the following show microarchitectural gaps that RDNA still has when it comes to detecting dependences, frequently at similar points that the GCN ISA listed as needing NOPs or wait states (EXEC dependences, scalar/vector sourcing, etc.):
    https://gitlab.freedesktop.org/mesa...18f4a3c8abc86814143bf/src/amd/compiler/README

    An interesting reference is to the instruction s_waitcnt_depctr, which is not referenced in the ISA doc.
    It's indicated as a fix for an unhandled WAR hazard on the EXEC mask, something a more cohesive pipeline would have physical checks for.
    This may be a check on a more global internal counter of the number of instructions that have left the instruction buffer and whose operand fetches are outstanding. Using it would be relying on it as a sort of catch-all for missing interlocks or hazard checks.

    Coincidentally, I ran across that band-aid showing up on AMD's GPUOpen site discussing the porting of the PS4 game Detroit to the PC on a 5700XT:
    https://gpuopen.com/learn/porting-detroit-2/.
    It also shows the compiler using s_inst_prefetch, which was documented as being capable of hanging shaders and wasn't even documented in the original release of the RDNA ISA doc. It has apparently existed officially since at least the Feb 2020 version.

    Clause terminology showed up in Vega, briefly, in relation to scalar memory ops. The other AMD reference would be its VLIW scheduling days.
    It seems like the weaker implicit version in Vega (now called groups?) shows that a major motivation for bringing back an optional form of the VLIW method is to get better behavior from the memory pipeline. It may also feed into the introduction of the ordered memory mode. I'm less clear on the benefit of the VALU clauses, but perhaps there is some wavefront bring-up that is ALU-heavy and worth optimizing for at the expense of SIMD utilization.
     
    #291 3dilettante, Sep 24, 2020
    Last edited: Sep 24, 2020
  12. Jawed

    Legend

    Joined:
    Oct 2, 2004
    Messages:
    11,100
    Likes Received:
    1,186
    Location:
    London
    Burnt by Larrabee? I still love that concept, I wanna see it rise again. 1024x1024 at 72fps real-time ray tracing more than 10 years ago. ~280 million ray-triangle intersection tests per second. That was without BVH traversal, it seems, founded upon a novel triangle structure:

    https://www.graphicon.ru/html/2007/proceedings/Papers/Paper_46.pdf

    Ah yes, I'd forgotten about operands coming from memory.

    Are you thinking of indexed register fetch or something else? I'm trying to understand the scenario for variable register fetch latency.

    This looks pretty tasty:

    https://homes.cs.washington.edu/~wysem/publications/quals-gpgpu-vrf.pdf

    So the problem here is to distinguish between known latencies and variable latencies. I'm not convinced that the "wait" counters proceed in an obviously deterministic way. This is because the number is statically defined, but the machine progresses in a way that cannot be determined statically at compile time (e.g. LDS broadcast - the count of iterations that will be required cannot be known at compile time).

    Which makes me contemplate that the number may be a mode signifier rather than a cycle count.

    I've been reading that series of articles. Vulkan on PC still leaves a lot to be desired in terms of basic performance constraints caused by graphics engine design decisions when porting from console... e.g. draw calls on console are practically "free" compared with PC...

    s_inst_prefetch 0x3 may be setting the SIMDs to expect a 4-cycle instruction-dependency latency, I suppose.

    s_waitcnt_depctr 0xffe3 could be interpreted as -29, sort of 7 sets of 4-cycles?...
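    For what it's worth, the -29 reading is just the 16-bit two's-complement interpretation of the immediate; looking at the bit pattern instead may be more suggestive if the field is really a set of per-counter mask bits:

```python
# The immediate 0xffe3, read as a 16-bit two's-complement value,
# and as a raw bit pattern.
imm = 0xFFE3
signed = imm - 0x10000 if imm & 0x8000 else imm
print(signed)            # -29
print(f"{imm:016b}")     # 1111111111100011 - only bits 2..4 are clear
```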

    To be honest, my eyes have always glazed over on these "wait" instructions. "Just pray there are other hardware threads around to fill the stalls that would otherwise happen."

    RDNA, which can use 32-wide and 64-wide hardware threads, complicates the "latency" question even more, which makes me think that some of these might be modes, not statically defined periods of time.

    Another tenet of RDNA is that hardware threads complete as fast as possible (in wall-clock time), which helps with coherent use of memory (caches) and also frees up register allocations (register allocations live for as short a time as possible). Shorter lifetimes mean less thrashing and make the response times for instantiation of new hardware threads shorter (the queue for them proceeds in a less "clumpy" fashion).

    The use of clauses in VALU code would tend to trade pipeline stalls for enhanced coherency and overall reduced duration of a hardware thread. But it seems very much "optional".

    The whitepaper also suggests that VALU clauses help to concentrate the computation of results that will then be used in further instructions:

    It seems that wave64 mode can also help with shorter lifetimes, so there's new compiler complexity here. The whitepaper seems to indicate that graphics (geometry? pixel?) prefers wave64, with wave32 preferred for compute.
     
    Rootax likes this.
  13. 3dilettante

    Legend Alpha

    Joined:
    Sep 15, 2003
    Messages:
    8,469
    Likes Received:
    4,398
    Location:
    Well within 3d
    I recall Larrabee getting a fair amount of attention on this forum and elsewhere. The evangelism for it seemed more approachable.
    There were several Gen iGPU architectures that had architectural documents released for them, but those didn't get as much attention.
    The documents themselves had details, but perhaps the lack of general focus on that type of product, plus the overly dense format, made them less approachable.


    Register bank address conflicts, primarily. Usually the addresses are assigned to single-ported banks modulo-4, which would mean an instruction could take anywhere from 1 to 3 cycles to fetch operands, depending on whether they hit the same bank.
    It seems that the most recent GPUs may have gone for two dual-ported register banks, which would make conflicts rarer.
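    The modulo-4 assignment makes the timing easy to sketch: with one read port per bank, an instruction's fetch time is the worst-case number of its operands that land in the same bank. The bank count and the register numbers below are illustrative, not any specific GPU's layout.

```python
from collections import Counter

# Sketch of operand fetch timing with four single-ported register
# banks, bank = register_id % 4. Fetch time equals the largest number
# of operands that collide in one bank.
def fetch_cycles(operand_regs, num_banks=4):
    banks = Counter(r % num_banks for r in operand_regs)
    return max(banks.values(), default=0)

print(fetch_cycles([0, 1, 2]))   # three different banks -> 1 cycle
print(fetch_cycles([0, 4, 1]))   # r0 and r4 share bank 0 -> 2 cycles
print(fetch_cycles([0, 4, 8]))   # all three in bank 0 -> 3 cycles
```

    Dual-ported banks, as speculated above for more recent GPUs, would halve the collision penalty and make the 3-cycle worst case much rarer.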


    s_waitcnt or its related waitcnt instructions for GCN/RDNA would be instruction issue barriers waiting for a counter of outstanding operations to match or fall below a given value. It would be simpler for the hardware to track the incrementing and decrementing of a set of counters, versus tracking register dependence lists and unit statuses. The lack of precision would come from the non-deterministic way the hardware's execution would proceed. Issuing a dozen vector reads and then waiting for vmcnt to drop back to zero would most often stall the pipeline more than if the hardware could track the individual readiness of operands, since one outlier's latency could be hidden by the issuing of subsequent instructions whose execution could cover the delay.
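    The counter mechanism and its imprecision can be modelled in a few lines: issuing a load bumps a counter, completions decrement it, and the wait instruction stalls until the counter is at or below a threshold. The latencies are made-up numbers purely for illustration; the point is that waiting for zero means the single slowest load dominates.

```python
# Toy model of a GCN/RDNA-style wait counter for outstanding vector
# memory operations (vmcnt). s_waitcnt(n) blocks until at most n loads
# remain in flight.
class WaitcntModel:
    def __init__(self):
        self.vmcnt = 0
        self.pending = []        # completion times of in-flight loads
        self.cycle = 0

    def issue_load(self, latency):
        self.vmcnt += 1
        self.pending.append(self.cycle + latency)
        self.cycle += 1          # issue takes one cycle
        self._retire()

    def s_waitcnt(self, n):
        # Stall (advance time) until at most n loads are outstanding.
        while self.vmcnt > n:
            self.cycle += 1
            self._retire()

    def _retire(self):
        done = [t for t in self.pending if t <= self.cycle]
        self.pending = [t for t in self.pending if t > self.cycle]
        self.vmcnt -= len(done)

m = WaitcntModel()
for lat in (100, 40, 60):        # one slow outlier, two fast loads
    m.issue_load(lat)
m.s_waitcnt(0)                   # wait for all three
print(m.cycle)                   # 100: the outlier dominates the wait
```

    Per-operand readiness tracking could have released the two fast loads' consumers much earlier; the counter only knows "something is still outstanding", not which operation it is.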

    Some LLVM or Mesa notes indicated the idea was to control the instruction fetch behavior of the wavefront. I'm hazy on whether this governs the L1 instruction cache or the wavefront's instruction buffer, although it seems like the numbers involved would be more appropriate for a 4-cache-line buffer.
    The hardware assumes that it should fetch instructions that are further down the instruction stream, which makes sense in straight-line code. This would be less efficient in the case of a tight loop, since the hardware would prefetch instructions past the end of the loop and thrash the buffer at each iteration. The prefetch instruction could be used to dial back how aggressive the prefetch is so that the loop can stay in the buffer instead of forcing repeat traffic to the shared instruction cache.
    However, per various bug lists, the instruction could somehow lock up the shader.
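    The loop-thrashing effect is easy to demonstrate with a toy buffer model: a small instruction buffer that always prefetches ahead evicts the loop body every iteration, while dialing the prefetch back lets the loop stay resident. The 4-line buffer, prefetch depth, and fetch accounting are all hypothetical.

```python
# Sketch of why aggressive straight-line prefetch hurts a tight loop.
# Counts fetches from the shared instruction cache into a small
# FIFO-evicting buffer of cache lines.
def icache_line_fetches(trace, buffer_lines=4, prefetch_ahead=3):
    buffered, fetches = [], 0
    for line in trace:
        # Always want the current line plus `prefetch_ahead` lines beyond it.
        for w in range(line, line + prefetch_ahead + 1):
            if w not in buffered:
                fetches += 1
                buffered.append(w)
                if len(buffered) > buffer_lines:
                    buffered.pop(0)      # evict the oldest line
    return fetches

loop = [0, 1] * 8    # a 2-line loop body executed for 8 iterations
print(icache_line_fetches(loop, prefetch_ahead=3))  # thrashes every iteration
print(icache_line_fetches(loop, prefetch_ahead=0))  # loop body stays resident
```

    In this model the aggressive prefetch repeatedly drags in lines past the loop's end, pushing the loop body out of the buffer, which is the repeat traffic to the shared instruction cache described above.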

    It's an undocumented instruction, so it's not certain if the parsing by the code view is correct, or how an immediate with those values would be parsed. It's possible it monitors various internal counters, and some arbitrary bit ranges are assigned to them. If this is being used as a barrier for a WAR hazard between adjacent instructions, it would seem like the best value would be 0 for whatever counter is of primary concern, and all 1s for everything else.

    Perhaps that makes more sense in an inter-wavefront basis than it does for within a wavefront. Within a wavefront, just having VALU instructions placed together would accomplish the same result without a separate instruction that has a finite capacity for clause size. Perhaps it matters more for a producer shader whose execution latency would delay a larger number of consumers.

    Some of graphics pipeline may still be sized for 64-wide wavefronts, and graphics data can be more uniform in terms of batch sizing versus arbitrary compute. Wave64 can also permit a form of shared register between the 32-wide halves of the wave. This, coupled with the subvector execution mode, seems reminiscent in some ways of shared registers from the VLIW days.
     
  14. DavidGraham

    Veteran

    Joined:
    Dec 22, 2009
    Messages:
    3,372
    Likes Received:
    3,754