GCN, GCN2.3.., Vega, and Navi Instruction Cache limitations

Infinisearch · Sep 28, 2020

I was reading the document GCN_Architecture_whitepaper.pdf and was surprised when I read the following:

A cluster of up to 4 Compute Units share a single 32KB L1 instruction cache that is 4-way associative and backed by the L2 cache. Cache lines are 64B long and typically hold 8 instructions.........

The shared L1 instruction cache has 4 banks, and can sustain 32B instruction fetch per cycle to all 4 Compute Units. Instruction fetching is arbitrated between SIMDs within a CU based on age, scheduling priority and utilization of the wavefront instruction buffers.

So by definition: 4 SIMDs per CU, 4 CUs per instruction cache, and up to 10 wavefronts per SIMD. That means up to 4 linear instructions (32B at 8B each) per CU per clock cycle, out of a cache shared by 4 CUs.

Doesn't that seem like a gigantic bottleneck for a lot of workloads? If my memory and math are right, 8 wavefronts per SIMD would be 32 registers per thread, and 4 wavefronts per SIMD would be 64 registers per thread. So let's say 4 CUs = 16 SIMDs, and 8x8 + 4x8 = 96 wavefronts (8 SIMDs at 8 wavefronts each plus 8 SIMDs at 4 wavefronts each) for a 'good' mixed workload.
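For anyone who wants to check the occupancy numbers, here is a rough sketch of the arithmetic. It assumes 256 vector registers per lane per SIMD, which is not stated in the quoted text, together with the 10-wavefront cap mentioned above. The names `waves_per_simd` and `resident_wavefronts` are made up for illustration.

```python
# Assumption (not from the quoted whitepaper text): each SIMD lane has
# 256 vector registers to divide among its resident wavefronts.
VGPRS_PER_LANE = 256
MAX_WAVES_PER_SIMD = 10  # "up to 10 wavefronts per SIMD"


def waves_per_simd(vgprs_per_thread):
    """Wavefronts that fit on one SIMD for a given register count per thread."""
    return min(MAX_WAVES_PER_SIMD, VGPRS_PER_LANE // vgprs_per_thread)


def resident_wavefronts(simd_groups):
    """Total wavefronts for a list of (num_simds, vgprs_per_thread) groups."""
    return sum(n * waves_per_simd(v) for n, v in simd_groups)


# The 'good' mixed workload from the post: 16 SIMDs across 4 CUs,
# half running a 32-register shader and half a 64-register shader.
mixed_workload = [(8, 32), (8, 64)]
total_waves = resident_wavefronts(mixed_workload)
print(total_waves)
```

Under those assumptions the 8x8 + 4x8 split comes out to the 96 wavefronts used in the rest of the post.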

Now this is where I really should use some pen and paper, so correct me if I'm wrong...

384 cycles per 96 wavefronts: 16 SIMDs, 16 instructions per cycle, 4 cycles per wavefront. So at this point it seems reasonable.
But the 32KB instruction cache is backed by the standard L2 cache.
32KB / 8B per instruction = 4096 instructions, and 4096 / 64 wavefronts (64 instead of 96, too much math) = 64 instructions per wavefront per 4 CUs. And there's no instruction prefetch.
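The capacity numbers above can be worked through in a few lines. This is only a sketch using the figures quoted from the whitepaper (32KB cache, 64B lines, 8 instructions per line, 32B fetch per cycle); `instrs_per_stream` is a hypothetical helper, and it assumes every wavefront is its own instruction stream, which is the worst case.

```python
CACHE_BYTES = 32 * 1024
LINE_BYTES = 64
INSTR_BYTES = 8            # "typically" 8 instructions per 64B line
FETCH_BYTES_PER_CYCLE = 32

total_instrs = CACHE_BYTES // INSTR_BYTES                      # whole cache
total_lines = CACHE_BYTES // LINE_BYTES
fetch_instrs_per_cycle = FETCH_BYTES_PER_CYCLE // INSTR_BYTES  # per fetch


def instrs_per_stream(num_streams):
    """Cache capacity per stream if every stream has its own code (worst case)."""
    return total_instrs // num_streams


print(total_instrs, total_lines, fetch_instrs_per_cycle)
print(instrs_per_stream(64), instrs_per_stream(96))
```

With 64 streams this gives the 64 instructions per wavefront from the post; using the full 96 wavefronts it drops to 42.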

So: backed by the standard L2 cache, 64 instructions per wavefront per 4 CUs at a good occupancy, and no instruction prefetch. Gigantic bottleneck? Why not a separate instruction L2?

Does Vega or GCN 2 or 3 do any better? What about Navi or Navi 2?

Sorry for the long post, thanks for any advice or input in advance.

edit - forgot to emphasize and put a question mark for the following:
***Why not a separate instruction L2?***

pTmdfx · Sep 28, 2020

A GPU usually runs just a couple of shader programs at a time, each over a very large dispatch. Wavefronts are scheduled round robin with fair progress (oldest first), at least in GCN 1.0. So wavefronts in the same and neighbouring CUs make progress through the shader program at a similar pace, and in turn are very likely to hit on the same I$ lines, or on new lines brought in by others.

The only exception is programs with lots of large conditional uniform branches (instead of vector lane predication), which can potentially thrash the I$. But conditional branches on GPUs are in the “use with care” realm to begin with, and most are lowered to lane predication.

RDNA improved on this anyway, with now only 2 CUs sharing one I$.
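To illustrate the point about wavefronts hitting on lines brought in by others, here is a toy model of a 32KB, 4-way, 64B-line cache with LRU replacement. Everything beyond those three figures is an assumption: strict round-robin fetch of one 8B instruction at a time, straight-line code, and, for the contrasting case, a separate program per wavefront laid out so that all wavefronts collide in the same cache set. Real hardware fetches into per-wavefront instruction buffers and will behave less extremely; `simulate` is a hypothetical name.

```python
from collections import OrderedDict

LINE_BYTES = 64
INSTR_BYTES = 8
CACHE_BYTES = 32 * 1024
WAYS = 4
NUM_SETS = CACHE_BYTES // LINE_BYTES // WAYS  # 128 sets


def simulate(num_waves, program_instrs, shared_program):
    """Return (hits, misses) for wavefronts stepping through code in lockstep."""
    sets = [OrderedDict() for _ in range(NUM_SETS)]  # per-set LRU order
    hits = misses = 0
    program_bytes = program_instrs * INSTR_BYTES
    for pc in range(program_instrs):       # all wavefronts progress together
        for wave in range(num_waves):      # round-robin fetch order
            # Shared: everyone runs the same code. Otherwise each wavefront
            # gets its own copy at a different address (worst-case layout).
            base = 0 if shared_program else wave * program_bytes
            line = (base + pc * INSTR_BYTES) // LINE_BYTES
            ways = sets[line % NUM_SETS]
            if line in ways:
                ways.move_to_end(line)     # mark as most recently used
                hits += 1
            else:
                if len(ways) == WAYS:
                    ways.popitem(last=False)  # evict least recently used
                ways[line] = True
                misses += 1
    return hits, misses


print(simulate(96, 1024, shared_program=True))
print(simulate(96, 1024, shared_program=False))
```

In this model, 96 wavefronts sharing one 1024-instruction program miss only once per cache line (128 misses in total), because the first wavefront to reach a line brings it in for everyone else. The deliberately bad case of 96 unrelated programs misses on every fetch, which is the thrashing scenario, but it is not what a typical large dispatch looks like.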

rSkip · Sep 28, 2020

In your example, there are only two instruction streams, one with 32 registers/thread and the other with 64 registers/thread. So the L1 instruction cache has [32KB / 2 (streams) / 8B (per instruction) = 2048 instructions] per instruction stream.

Infinisearch · Sep 28, 2020

In my example there are 96 wavefronts, therefore 96 instruction streams. That's not including branch divergence, because I'm not certain how it's handled.

edit - therefore up to 96 different unique program counters.

edit - now that I think about it branch divergence should make no difference.
