Infinisearch
Veteran
I was reading the document GCN_Architecture_whitepaper.pdf and was surprised when I read the following:
A cluster of up to 4 Compute Units share a single 32KB L1 instruction cache that is 4-way associative and backed by the L2 cache. Cache lines are 64B long and typically hold 8 instructions.........
The shared L1 instruction cache has 4 banks, and can sustain 32B instruction fetch per cycle to all 4 Compute Units. Instruction fetching is arbitrated between SIMDs within a CU based on age, scheduling priority and utilization of the wavefront instruction buffers.
So by definition: 4 SIMDs per CU, 4 CUs per instruction cache, and up to 10 wavefronts per SIMD ==>
up to 4 sequential instructions (32B at 8B each) per CU per clock cycle, per group of 4 CUs.
Doesn't that seem like a gigantic bottleneck for a lot of workloads? If my memory and math are right, 8 wavefronts per SIMD means 32 registers per thread, and 4 wavefronts per SIMD means 64 registers per thread. So let's say 4 CUs = 16 SIMDs: 8 SIMDs x 8 wavefronts + 8 SIMDs x 4 wavefronts = 96 wavefronts for a 'good' mixed workload.
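To sanity-check the occupancy numbers above, here is a rough sketch in Python. It assumes 256 vector registers available per thread on each SIMD and a cap of 10 wavefronts per SIMD; the 256 figure is my inference from the 32-register/8-wavefront pairing, and it ignores register allocation granularity. The function names are made up for illustration.

```python
# Assumption: each SIMD can hand out 256 vector registers per thread in total,
# shared among its resident wavefronts (inferred from 32 regs -> 8 wavefronts).
VGPRS_PER_THREAD_PER_SIMD = 256
# From the post: up to 10 wavefronts per SIMD.
MAX_WAVES_PER_SIMD = 10


def waves_per_simd(vgprs_per_thread: int) -> int:
    """Wavefronts that fit on one SIMD, limited only by vector registers."""
    if vgprs_per_thread <= 0:
        raise ValueError("vgprs_per_thread must be positive")
    return min(MAX_WAVES_PER_SIMD, VGPRS_PER_THREAD_PER_SIMD // vgprs_per_thread)


def total_waves(simd_groups) -> int:
    """simd_groups is a list of (number_of_simds, vgprs_per_thread) pairs."""
    return sum(count * waves_per_simd(vgprs) for count, vgprs in simd_groups)


if __name__ == "__main__":
    print(waves_per_simd(32))                 # 8 wavefronts per SIMD
    print(waves_per_simd(64))                 # 4 wavefronts per SIMD
    # 16 SIMDs in 4 CUs: half running 32-register code, half 64-register code.
    print(total_waves([(8, 32), (8, 64)]))    # 96 wavefronts
```

With these assumptions the 96-wavefront figure for the mixed workload checks out.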
Now this is where I really should use some pen and paper, so correct me if I'm wrong...
384 cycles per 96 wavefronts: 16 SIMDs, 16 instructions per cycle, 4 cycles per wavefront. So at this point it seems reasonable.
But the 32KB instruction cache is backed by the standard L2 cache.
32KB / 64 wavefronts (instead of 96, too much math) = 512B, or 64 instructions per wavefront per 4 CUs at 8B each, and there's no instruction prefetch.
So: backed by the standard L2 cache + 64 instructions per wavefront per 4 CUs at good occupancy + no instruction prefetch = gigantic bottleneck? Why not a separate instruction L2?
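The cache arithmetic above, as a quick Python sketch. It takes the whitepaper's "64B lines typically hold 8 instructions" as 8B per instruction, and it assumes the worst case where every resident wavefront runs different code, so the cache is split evenly between them. Wavefronts running the same shader would share lines, so treat this as a pessimistic bound. The names are made up for illustration.

```python
CACHE_BYTES = 32 * 1024          # shared L1 instruction cache per 4 CUs
LINE_BYTES = 64                  # cache line size from the whitepaper
INSTRUCTIONS_PER_LINE = 8        # "typically hold 8 instructions"
FETCH_BYTES_PER_CYCLE = 32       # sustained fetch figure from the whitepaper

BYTES_PER_INSTRUCTION = LINE_BYTES // INSTRUCTIONS_PER_LINE  # 8B, an average


def cached_instructions_per_wave(resident_waves: int) -> int:
    """Instructions of cache capacity per wavefront if none share code."""
    if resident_waves <= 0:
        raise ValueError("resident_waves must be positive")
    total_instructions = CACHE_BYTES // BYTES_PER_INSTRUCTION
    return total_instructions // resident_waves


def fetched_instructions_per_cycle() -> int:
    """Instructions delivered per cycle at the sustained fetch rate."""
    return FETCH_BYTES_PER_CYCLE // BYTES_PER_INSTRUCTION


if __name__ == "__main__":
    print(cached_instructions_per_wave(64))   # 64 (the rounded figure above)
    print(cached_instructions_per_wave(96))   # 42 (the full mixed workload)
    print(fetched_instructions_per_cycle())   # 4
```

Using the full 96 wavefronts instead of 64 drops the share to about 42 instructions per wavefront under the same assumptions.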
Does Vega or GCN 2 or 3 do any better? What about Navi or Navi 2?
Sorry for the long post, thanks for any advice or input in advance.
Edit: forgot to emphasize and put a question mark on the following:
***Why not a separate instruction L2?***