GCN and mixed wavefronts

Discussion in 'Architecture and Products' started by ieldra, Aug 31, 2016.

  1. ieldra

    Newcomer

    Joined:
    Feb 27, 2016
    Messages:
    149
    Likes Received:
    116
    Hi guys, I'll get right to the point. I'm wondering if you can have compute and graphics wavefronts in execution at the same time on different SIMDs within a CU.

    This diagram from AnandTech really threw me off because I thought the waves/warps were the columns, so I thought you had one wave completing per cycle. Instead you have 4 quarter-waves completing per cycle, i.e. one full wave every four cycles.

    [Image: AnandTech GCN compute unit diagram showing the four SIMDs and wavefront instruction issue over four cycles]

    I'm assuming instr 1..4 are actually all the same; no idea who made this damned thing.

    Anyway, AMD is pretty clear in their documentation that their 'async shaders' scheme is enabled by fast context switching (a dedicated cache within each ACE, afaik), but some people are telling me that you don't need a context switch at all and that each SIMD can be working on a wavefront from a different kernel.

    Can someone shed some light on this and tell me where they get their info?

    thanks in advance :)
     
  2. CarstenS

    Veteran Subscriber

    Joined:
    May 31, 2002
    Messages:
    4,796
    Likes Received:
    2,054
    Location:
    Germany
    Is a zero-cost/zero-overhead context switch still a context switch as per the classical definition? IOW, does it make sense to talk of context switches when all that's changing is some sort of flag indicating whether a wavefront originates from an ACE or from the Graphics Command Processor?
     
    ieldra likes this.
  3. ieldra

    Newcomer

    Joined:
    Feb 27, 2016
    Messages:
    149
    Likes Received:
    116
    Why can't anything be nice and simple :p

    I understand what you're getting at, but the wavefront must be cached (L2) if there's no context swap, and if it's cached then cache was spent on it - right? I hope you get what I mean. This is what I'm trying to wrap my head around. Having read the async shaders whitepaper a few times, I was pretty confident there is a context swap and the whole CU is assigned to a compute kernel from the ACEs. If that isn't the case, and you have mixed compute/graphics wavefronts on different SIMDs within the CU, then there wouldn't be context swaps unless the wavefronts in question weren't previously dispatched to the CU.

    A context swap is a context swap if data from one kernel is being dumped and that of another kernel retrieved, at least by my humble standards.
     
    CarstenS likes this.
  4. CarstenS

    Veteran Subscriber

    Joined:
    May 31, 2002
    Messages:
    4,796
    Likes Received:
    2,054
    Location:
    Germany
    Yes, the wavefront's data needs to be stored somewhere. But it would need that storage anyway when it's ready for execution - no matter whether it's seamlessly interleaved with other wavefronts or not. Maybe, just maybe, contention could be higher when intermixing additional wavefronts, leaving less wiggle room for the others. But I'm guessing that the ACEs have some kind of knowledge of the machine's filled-upness (is that a legit word?) and stall their issue when the rest of the machine has no free resources anyway.
     
  5. sebbbi

    Veteran

    Joined:
    Nov 14, 2007
    Messages:
    2,924
    Likes Received:
    5,288
    Location:
    Helsinki, Finland
    One GCN CU can have up to 40 waves running concurrently (10 per SIMD). It doesn't matter where each wave originated: there can be any mix of pixel/vertex/geometry/hull/domain/compute shader waves in flight at the same time (from any number of queues). Instructions from these 40 waves are scheduled to the CU's SIMDs in a round-robin manner. If one of these 40 waves is waiting on memory, the GPU simply skips over it in the round-robin scheduling.

    There is no need to store a wave's data in off-chip memory. Each CU has enough on-chip storage for the metadata of these waves. Waves are grouped into thread groups. A single thread group needs to execute on a single CU (this is true for all GPUs, including Intel and Nvidia), because threads in the same thread group can use barrier synchronization and share data through LDS (a 64 KB on-chip buffer on each CU). All GPU architectures use static register allocation: the maximum register count used during the shader's lifetime (even if some branch is never taken) needs to be allocated for the wave.

    Simplified: the GPU scheduler keeps track of the available resources (free registers, free LDS, free wave slots) on each CU. When some CU has enough free resources for a new thread group (thread group = 1 to 16 waves), the scheduler spawns the thread group on that CU. Each GCN CU has 256 KB of registers, 40 wave slots and 64 KB of LDS. There is no need to context swap kernels (*). Each thread group is guaranteed to finish execution once started, and the GPU programming model doesn't support thread groups waiting for other thread groups (atomics are supported, but the programmer is not allowed to write spinning locks). Nvidia GPUs work similarly. Intel has per-wave register files (they call waves "threads"), and their register allocation works completely differently (the shader compiler outputs different SIMD widths based on register count).

    (*) Exceptional cases might need context switch (= store GPU state to memory and restore later). Normal flow of execution doesn't.
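
    To make the bookkeeping concrete, here is a rough sketch of the per-CU resource check described above (C++; names and structure made up for illustration - real GCN allocates registers per SIMD and in granules):

```cpp
// Hypothetical model of the per-CU resource tracking described above.
// Numbers are GCN-like; the structure is illustrative, not AMD's hardware.
struct CuResources {
    int freeWaveSlots = 40;          // 10 wave slots per SIMD * 4 SIMDs
    int freeVgprBytes = 256 * 1024;  // 256 KB of vector registers per CU
    int freeLdsBytes  = 64 * 1024;   // 64 KB of LDS per CU
};

struct ThreadGroup {
    int waves;          // 1..16 (group thread count / 64, rounded up)
    int vgprsPerThread; // static allocation: worst case over all branches
    int ldsBytes;       // LDS requested by the kernel
};

// A group is spawned only if *all* of its resources fit on the CU; once
// started it runs to completion, so nothing is swapped out mid-flight.
bool trySpawn(CuResources& cu, const ThreadGroup& g) {
    const int vgprBytes = g.waves * 64 * g.vgprsPerThread * 4;
    if (g.waves    > cu.freeWaveSlots) return false;
    if (vgprBytes  > cu.freeVgprBytes) return false;
    if (g.ldsBytes > cu.freeLdsBytes)  return false;
    cu.freeWaveSlots -= g.waves;
    cu.freeVgprBytes -= vgprBytes;
    cu.freeLdsBytes  -= g.ldsBytes;
    return true;  // the new waves now join the round-robin issue rotation
}
```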
     
    fellix, Lightman, Kej and 7 others like this.
  6. ieldra

    Newcomer

    Joined:
    Feb 27, 2016
    Messages:
    149
    Likes Received:
    116
    So let me get this straight: you can have wavefronts from different origins in flight on a single CU. A threadblock/threadgroup runs on only one unit due to LDS and sync, dispatch is at the thread group level, and the part about register allocation etc. is fine. You specifically mentioned there's no need to context swap *kernels*, and I possibly used the wrong term there - it doesn't need to be a kernel, just a context swap of sets of waves. Generally speaking, will you have graphics and compute waves executing in parallel on a single CU?
    Threadblocks are guaranteed to finish once started, but can more than one be in execution at any given time? Obviously if the wavefronts in question are already in flight there's no context swapping involved (at least not when they're dispatched for execution), but this still doesn't satisfy my curiosity about the context switching being fast thanks to the ACEs. If the "context swap cache" I mentioned is actually LDS, then what do the ACEs have to do with it?

    Thanks for your replies guys, appreciate it :)
     
  7. Jawed

    Legend

    Joined:
    Oct 2, 2004
    Messages:
    10,873
    Likes Received:
    767
    Location:
    London
    Sebbbi, in what context does GCN support 16 wavefronts in a thread group (work group)?
     
  8. sebbbi

    Veteran

    Joined:
    Nov 14, 2007
    Messages:
    2,924
    Likes Received:
    5,288
    Location:
    Helsinki, Finland
    16 waves * 64 threads/wave = 1024 threads. That's the biggest thread group size supported (required) by DirectX. 256 KB register file / 1024 threads = 256 bytes = 64 registers/thread. If you need more than 64 registers per thread, the compiler is going to spill to memory (or LDS). A CU can run two 1024-thread (16-wave) thread groups at the same time if your shader needs 32 or fewer VGPRs.
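
    A quick sanity check of that arithmetic (illustrative C++ only; ignores the hardware's register allocation granularity):

```cpp
#include <cstdio>

int main() {
    const int cuRegFileBytes = 256 * 1024; // 256 KB of VGPRs per CU
    const int groupThreads   = 1024;       // 16 waves * 64 threads
    const int bytesPerVgpr   = 4;

    // One 1024-thread group resident on the CU: up to 64 VGPRs/thread.
    printf("%d\n", cuRegFileBytes / (groupThreads * bytesPerVgpr));     // 64
    // Two 1024-thread groups resident at once: 32 VGPRs/thread each.
    printf("%d\n", cuRegFileBytes / (2 * groupThreads * bytesPerVgpr)); // 32
}
```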

    The image printed on the wall of every console graphics programmer:
    [Image: GCN occupancy table - waves per SIMD as a function of VGPRs used per thread]
     
    Heinrich4, Lightman, function and 5 others like this.
  9. Jawed

    Legend

    Joined:
    Oct 2, 2004
    Messages:
    10,873
    Likes Received:
    767
    Location:
    London
    ieldra likes this.
  10. 3dilettante

    Legend Alpha

    Joined:
    Sep 15, 2003
    Messages:
    8,120
    Likes Received:
    2,867
    Location:
    Well within 3d
    The GCN ISA document also defines work-group size as being up to 16 wavefronts. Each CU has 16 buffers for wavefront barriers as well.
     
    ieldra, Razor1 and sebbbi like this.
  11. OpenGL guy

    Veteran

    Joined:
    Feb 6, 2002
    Messages:
    2,357
    Likes Received:
    28
    Note that those "16"s are not related to each other at all since if you had a workgroup with 16 wavefronts, you couldn't store 16 workgroups in a CU :)
     
    ieldra and sebbbi like this.
  12. sebbbi

    Veteran

    Joined:
    Nov 14, 2007
    Messages:
    2,924
    Likes Received:
    5,288
    Location:
    Helsinki, Finland
    1024 threads (16 waves) is not an optimal thread group size for GCN. It doesn't evenly divide the CU's maximum concurrent thread count of 2560 threads (40 waves), so a 1024-thread group cannot achieve max occupancy (max latency hiding).

    Good GCN thread group sizes (max occupancy): 64, 128, 256, 320, 512, 640.

    But maximum occupancy is only achievable in shaders with no more than 256 KB / (2560 threads * 4 bytes/register) = 25 registers/thread.

    Sometimes thread groups larger than 640 are the best choice, as they allow sharing data between larger groups of threads. Max occupancy is not always equal to best performance.
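
    To make the occupancy arithmetic concrete, a small helper along these lines reproduces the numbers above (illustrative C++; real hardware also rounds register allocations up to granule sizes):

```cpp
#include <algorithm>
#include <cstdio>

// Waves resident on one GCN CU for a given kernel, limited by wave
// slots, the 256 KB register file and 64 KB of LDS.
int wavesPerCu(int groupThreads, int vgprsPerThread, int ldsBytesPerGroup) {
    const int maxWaves = 40, waveSize = 64;
    const int regBytes = 256 * 1024, ldsBytes = 64 * 1024;
    int wavesPerGroup  = (groupThreads + waveSize - 1) / waveSize;
    int groupVgprBytes = wavesPerGroup * waveSize * vgprsPerThread * 4;
    int byWaves = maxWaves / wavesPerGroup;
    int byRegs  = regBytes / groupVgprBytes;
    int byLds   = ldsBytesPerGroup ? ldsBytes / ldsBytesPerGroup : byWaves;
    return std::min({byWaves, byRegs, byLds}) * wavesPerGroup;
}

int main() {
    printf("%d\n", wavesPerCu(256, 25, 0));  // 40: full occupancy
    printf("%d\n", wavesPerCu(1024, 32, 0)); // 32: 1024-wide groups cap out
    printf("%d\n", wavesPerCu(64, 64, 0));   // 16: register limited
}
```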
     
    Kej, Heinrich04 and ieldra like this.
  13. 3dilettante

    Legend Alpha

    Joined:
    Sep 15, 2003
    Messages:
    8,120
    Likes Received:
    2,867
    Location:
    Well within 3d
    Mistaken numeric association on my part. The GCN architectural slides are more explicit in describing 16 work group barriers, assuming those are the same as described in the whitepaper.
     
  14. ieldra

    Newcomer

    Joined:
    Feb 27, 2016
    Messages:
    149
    Likes Received:
    116
    I've looked at the whitepaper before, but it didn't answer my question; I'll take another look anyway.

    I already knew each SIMD can have up to 10 waves in flight; now it's clear each SIMD executes a quarter-wave per cycle rather than one wave per cycle across the 4 SIMDs, and I know it's *possible* for the waves to be a mix of compute and graphics.

    But... as sebbbi said, each thread group is guaranteed to execute on a single CU, each thread group is guaranteed to finish once started, and the scheduler assigns one thread group at a time to a CU in round-robin fashion. Presumably this applies to the ACEs just as well as to the GCP.

    Will the CU be executing waves from two different threadblocks simultaneously at any point, on different SIMDs?

    As for the context switching, it obviously wouldn't be necessary if the scheduler identifies a dependency causing latency for some threadblocks in a task beforehand and dispatches groups from a compute shader instead.

    In case it hadn't been accounted for beforehand, context switching is fast.

    ACEs are microcode-programmable, right? AMD has preprogrammed scheduling modes/routines, so I'm assuming the context switching is a sort of last resort in general cases, otherwise used when preempting for time-critical kernels?
     
  15. lanek

    Veteran

    Joined:
    Mar 7, 2012
    Messages:
    2,469
    Likes Received:
    315
    Location:
    Switzerland
    Ieldra? One quarter-wave per cycle? Occupancy? GCN is an "overthreaded" architecture, perfect for parallel compute (with split kernels, if we're talking about OpenCL)... but of course this has required some optimizations in the graphics case (in the engines that were, and still are, used).

    I don't know, but it seems most of your answers are already in your question: with GCN, never consider a single approach - GCN is a multi-approach resource...
     
    #15 lanek, Sep 1, 2016
    Last edited: Sep 1, 2016
  16. 3dcgi

    Veteran Subscriber

    Joined:
    Feb 7, 2002
    Messages:
    2,432
    Likes Received:
    261
    Potentially. It depends on the priority of the waves on the SIMD and their wait status.
     
    Lightman and ieldra like this.
  17. keldor314

    Newcomer

    Joined:
    Feb 23, 2010
    Messages:
    132
    Likes Received:
    13
    If AMD hardware is anything like Nvidia's (and I rather think it is), you can absolutely have multiple blocks resident on the same CU. I use this to allow wave-sized blocks, which avoid the need for syncthreads in some kernels.
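
    In CUDA terms the trick looks roughly like this (a hypothetical reduction kernel with a warp-sized block; on the pre-Volta hardware of this era the lockstep execution alone made it safe, while current CUDA additionally wants __syncwarp()):

```cuda
// Launched as blockSum32<<<numBlocks, 32>>>(in, out): the block is exactly
// one warp, so the full-block barrier __syncthreads() is never needed.
__global__ void blockSum32(const float* in, float* out) {
    __shared__ float s[32];
    int tid = threadIdx.x;
    s[tid] = in[blockIdx.x * 32 + tid];
    for (int stride = 16; stride > 0; stride >>= 1) {
        __syncwarp();  // warp-level sync; a multi-warp block would need
                       // a full __syncthreads() barrier here instead
        if (tid < stride) s[tid] += s[tid + stride];
    }
    if (tid == 0) out[blockIdx.x] = s[0];
}
```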

    Allowing blocks from different kernels on one CU at the same time is a much stickier issue. One has to remember that if this happens, you'll end up with a lot of contention over the instruction cache, which means scheduling can be very tricky. Do you favor homogeneous loads on each CU, thereby saving instruction cache space (remembering that the compiler may be looking at the size of that cache to determine how far it can unroll loops and so forth)? Or do you favor heterogeneous loads, which can help when one task's footprint is conveniently sized to "fill in the gaps", or when one kernel can offset another that bottlenecks on some per-CU resource, like local memory bandwidth or texture cache?

    Dynamic profiling, anyone?:twisted:
     
    ieldra likes this.
  18. sebbbi

    Veteran

    Joined:
    Nov 14, 2007
    Messages:
    2,924
    Likes Received:
    5,288
    Location:
    Helsinki, Finland
    This is definitely true. You can have as many thread groups executing simultaneously on a single CU as you have resources for - resources in this case being wave slots (40 -> 2560 threads), registers (256 KB) and LDS (64 KB). Scalar register count can also be a limitation, but in practice it never is. If a group's thread count is not divisible by the wave size (64), it is rounded up. This gives a maximum of 40 thread groups per CU (assuming single-wave groups).

    DirectX programmers can't count on a fixed wave/warp size, since Nvidia = 32, AMD = 64 and Intel = {8, 16, 32}. A good compiler should remove syncthreads calls in kernels whose thread group size is no larger than the GPU's native wave/warp size.
    Heterogeneous loads are a big performance win in many cases on GCN. There are many real-world examples available (look at recent GDC/SIGGRAPH presentations).

    Common examples are:
    - One kernel is sampler bound and the other isn't. Examples: parallax mapping (N trilinear/aniso taps for root finding), anisotropic filtering in general (multiple textures), bicubic kernels (3x3 bilinear taps per pixel), blur filters (N bilinear taps), etc. A pure compute task (no texture filtering instructions) can fill the execution gaps nicely.
    - One kernel uses lots of LDS (thread group shared memory) and this limits occupancy. Another kernel with no LDS usage increases occupancy.
    - One kernel is heavily memory (and L1 cache) bound, while the other is mostly math crunching (ALU instructions using registers and/or LDS).
    - The tail of another kernel. Work finishes at wave granularity (after the last barrier), and waiting for the last wave to finish means processing cycles are lost on each CU (every time the kernel changes).
    - Resource bottlenecks (waves/registers/LDS). One kernel allocates some CU resource heavily while other resources are left unused. It is better to schedule thread groups from multiple kernels (different shaders) to utilize the CU resources better.
    - Uneven split of resources. There are cases where a single kernel doesn't evenly divide all CU resources (waves/registers/LDS) between thread groups (see my post above); the remainder is left unused. Better to schedule multiple kernels (with different resource counts) to use all the resources.

    The most common heterogeneous pairing is executing a vertex and a pixel shader on the same CU. Vertex + hull + domain shaders are another example. GCN is not limited to running multiple compute shaders concurrently on the same CU; you can have a mix of different graphics kernels and compute kernels.
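
    As a toy illustration of the LDS bullet above (made-up numbers, same resource model as sketched earlier in the thread):

```cpp
#include <cstdio>

int main() {
    const int maxWaves = 40, ldsPerCu = 64 * 1024;

    // Kernel A: 4-wave (256-thread) groups using 32 KB of LDS each.
    // LDS caps the CU at 2 resident groups = 8 of the 40 wave slots.
    int groupsA = ldsPerCu / (32 * 1024); // 2
    int wavesA  = groupsA * 4;            // 8

    // Kernel B uses no LDS, so (registers permitting) its groups can
    // fill the 32 wave slots that kernel A leaves stranded.
    int wavesB = maxWaves - wavesA;       // 32

    printf("A alone: %d/40 waves, A+B mixed: %d/40 waves\n",
           wavesA, wavesA + wavesB);
}
```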
     
    #18 sebbbi, Sep 1, 2016
    Last edited: Sep 1, 2016
  19. ieldra

    Newcomer

    Joined:
    Feb 27, 2016
    Messages:
    149
    Likes Received:
    116
    Vertex + pixel is what I always consider when trying to wrap my head around how the scheduling would work. Under what conditions would the SIMDs likely end up shared between different kernels executing in parallel? If a kernel is hitting some bottleneck that hinders utilization (be it register count, LDS, L1, or a fixed-function hardware dependency), what determines whether another kernel will execute concurrently or in parallel with it? Is control over this exposed in any way?
    Sorry if I sound like a broken record; I'm just really curious about this parallel mixed-wavefronts idea.
     
  20. Anarchist4000

    Veteran Regular

    Joined:
    May 8, 2004
    Messages:
    1,439
    Likes Received:
    359
    ACEs won't really perform a context switch, to my understanding. They simply stall and a different ACE takes priority - that's what the HWS would handle. They should be able to switch, but you'd need more streams than ACEs before it's worth considering.

    So here's a fun question: on GCN with an HWS, does the programmer explicitly have to handle complementary pairings? I'm assuming Pascal could do something similar in software, but relies on accurately predicting the results?

    SIMDs aren't partitioned. They simply have a number of waves assigned, which are processed round-robin, selecting the next "ready" wave. If one wave stalls it's simply skipped; if they all stall you have some downtime. I'm not 100% sure whether they account for prioritization.
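
    A sketch of that selection loop, just to make the round-robin-with-skip idea concrete (hypothetical structure, not AMD's actual issue logic):

```cpp
#include <array>

struct Wave { bool resident; bool stalled; };

// Rotate through the (up to 10) waves resident on one SIMD and issue
// from the first ready one; if every wave is stalled, the SIMD idles.
int pickNextWave(const std::array<Wave, 10>& waves, int lastIssued) {
    for (int i = 1; i <= 10; ++i) {
        int slot = (lastIssued + i) % 10;
        if (waves[slot].resident && !waves[slot].stalled)
            return slot; // issue one instruction from this wave
    }
    return -1;           // all stalled: downtime this cycle
}
```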
     