GCN, GCN2/3, Vega, and Navi Instruction Cache limitations

Discussion in 'Architecture and Products' started by Infinisearch, Sep 28, 2020.

  1. Infinisearch

    Veteran Regular

    Joined:
    Jul 22, 2004
    Messages:
    776
    Likes Received:
    145
    Location:
    USA
    I was reading the document GCN_Architecture_whitepaper.pdf and was surprised when I read the following:

    A cluster of up to 4 Compute Units share a single 32KB L1 instruction cache that is 4-way associative and backed by the L2 cache. Cache lines are 64B long and typically hold 8 instructions...

    The shared L1 instruction cache has 4 banks, and can sustain 32B instruction fetch per cycle to all 4 Compute Units. Instruction fetching is arbitrated between SIMDs within a CU based on age, scheduling priority and utilization of the wavefront instruction buffers.

    So by the whitepaper's own numbers: 4 SIMDs per CU, 4 CUs per instruction cache, and up to 10 wavefronts per SIMD, which works out to up to 4 linear instructions fetched per CU per clock cycle across the 4 CUs.

    Doesn't that seem like a gigantic bottleneck for a lot of workloads? If my memory and math are right, 8 wavefronts per SIMD means 32 registers per thread, and 4 wavefronts per SIMD means 64 registers per thread. So let's say 4 CUs = 16 SIMDs, with 8 waves each on 8 SIMDs and 4 waves each on the other 8: 8x8 + 4x8 = 96 wavefronts for a 'good' mixed workload.
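    The occupancy mix above can be sketched as a quick calculation. This is a hedged back-of-envelope sketch, not AMD's tooling: it assumes the usual GCN figures of a 256-entry VGPR budget per SIMD lane and a hard cap of 10 wavefronts per SIMD, and `waves_per_simd` is a made-up helper name.

```python
# Sketch of the occupancy math, assuming a 256-VGPR-per-lane budget
# and a 10-wavefront cap per GCN SIMD (hypothetical helper, not an AMD tool).
def waves_per_simd(vgprs_per_thread, vgpr_budget=256, max_waves=10):
    # VGPR usage limits how many wavefronts fit on one SIMD.
    return min(max_waves, vgpr_budget // vgprs_per_thread)

# 4 CUs x 4 SIMDs = 16 SIMDs; half run the 32-VGPR shader, half the 64-VGPR one.
total = 8 * waves_per_simd(32) + 8 * waves_per_simd(64)
print(waves_per_simd(32), waves_per_simd(64), total)  # 8 4 96
```

    Which reproduces the 8x8 + 4x8 = 96 wavefronts in the post.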

    Now this is where I really should use some pen and paper, so correct me if I'm wrong...

    384 cycles to get through 96 wavefronts: 16 SIMDs, 16 instructions per cycle, 4 cycles per wavefront instruction. So at this point it seems reasonable.
    But the 32KB instruction cache is backed only by the standard L2 cache.
    32KB / 64 wavefronts (instead of 96, to keep the math easy) = 512B per wavefront, i.e. 64 instructions per wavefront across the 4 CUs, and there's no instruction prefetch.

    So: backed only by the standard L2 cache, plus just 64 instructions per wavefront per 4 CUs at good occupancy, plus no instruction prefetch. === Gigantic bottleneck? Why not a separate instruction L2?
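    The capacity math above, as a sketch. Note the big assumption baked in (and questioned in the replies below): it treats every wavefront as a distinct instruction stream, and it uses the worst-case 8B GCN instruction size.

```python
# Sketch of the post's I$ capacity math, assuming every wavefront is a
# unique instruction stream (a pessimistic assumption) and 8B instructions.
ICACHE_BYTES = 32 * 1024   # 32KB shared L1 I$ per 4-CU cluster
BYTES_PER_INSN = 8         # worst-case GCN instruction size
wavefronts = 64            # the post rounds 96 down to 64 for easy math

bytes_per_wf = ICACHE_BYTES // wavefronts       # 512B of I$ per wavefront
insns_per_wf = bytes_per_wf // BYTES_PER_INSN   # 64 instructions per wavefront
print(bytes_per_wf, insns_per_wf)  # 512 64
```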

    Does Vega or GCN 2 or 3 do any better? What about Navi or Navi 2?

    Sorry for the long post, thanks for any advice or input in advance.

    edit - forgot to emphasize and put a question mark for the following:
    ***Why not a separate instruction L2?***
     
    #1 Infinisearch, Sep 28, 2020
    Last edited: Sep 29, 2020
  2. pTmdfx

    Regular Newcomer

    Joined:
    May 27, 2014
    Messages:
    340
    Likes Received:
    278
    A GPU usually runs just a couple of shader programs at a time over a very large dispatch size. Wavefronts are scheduled round robin with fair progress (oldest first), at least in GCN 1.0. So wavefronts in the same and neighbouring CUs progress through the shader program at a similar pace, and in turn are very likely to hit on the same I$ lines, or on new lines brought in by others.

    The only exception is programs with lots of large conditional uniform branches (instead of vector lane predication), which can potentially thrash the I$. But conditional branches on GPUs are in the “use with care” realm to begin with, and most are lowered to lane predication.

    RDNA improved things anyway: now only 2 CUs share one I$.
     
    #2 pTmdfx, Sep 28, 2020
    Last edited: Sep 28, 2020
    iamw likes this.
  3. rSkip

    Newcomer

    Joined:
    Jan 10, 2012
    Messages:
    15
    Likes Received:
    29
    Location:
    Shanghai
    In your example there are only two instruction streams, one with 32 registers/thread and the other with 64 registers/thread. So the L1 instruction cache holds [32KB / 2 (streams) / 8B (per instruction) = 2048 instructions] per instruction stream.
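    The corrected division, as a one-liner sketch: with wavefronts of the same shader sharing one instruction stream, I$ capacity divides by streams rather than by wavefronts (same 8B worst-case instruction size as before).

```python
# Capacity per instruction stream when wavefronts of the same shader
# share one stream, per the reply above.
ICACHE_BYTES = 32 * 1024
BYTES_PER_INSN = 8   # worst-case 8B GCN instruction
streams = 2          # two distinct shaders in the example

insns_per_stream = ICACHE_BYTES // streams // BYTES_PER_INSN
print(insns_per_stream)  # 2048
```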
     
    iamw likes this.
  4. Infinisearch

    Veteran Regular

    Joined:
    Jul 22, 2004
    Messages:
    776
    Likes Received:
    145
    Location:
    USA
In my example there are 96 wavefronts, therefore 96 instruction streams. That's not including branch divergence, because I'm not certain how it's handled.

    edit - therefore up to 96 different unique program counters.

    edit - now that I think about it, branch divergence should make no difference.
     
    #4 Infinisearch, Sep 28, 2020
    Last edited: Sep 29, 2020