... (2 cycles per wavefront, I believe) ...
The posted slides state ACEs can create one workgroup and dispatch one wavefront per cycle. They don't say anything about the graphics engines, maybe each engine can do the same, or maybe just the big CP can do this.
It's going to be very rare that NDRanges that consist of only a single wavefront are issued. Usually there'll be 10s of wavefronts all the way up to 100s of thousands. So even with only a small set of NDRanges live on the GPU at any one time, the GPU will have plenty of choice about what work to issue to CUs to maximise their utilisation.
Thanks for the explanation. I'm not sure it gets me any nearer to an explicit wavefront life-cycle "picture" for the whole graphics pipeline for one drawcall (be it 1 triangle or more).
There must be a way to balance the whole thing. A way to rank and list all local optimization minima for some pipeline configuration and stall probability.
Something like, 1 triangle with [1,1,1,3] tessellation will lead to the following (where the percentages come from is spelled out right after the list):
- 3 vertex shader threads: 1 SIMD, 18.75% utilization, if it stalls, bad luck; any non-multiple of 4 vertices means underutilization
- 1 hull shader const thread: 1 SIMD, 6.25% utilization, ...
- 1 hull shader cp thread: 1 SIMD, 6.25% utilization, ...
- 6 domain shader threads: 1 SIMD, 37.5% utilization, ...
- say 120 pixel shader threads: 1 SIMD, 750% utilization, ...
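To be explicit about where those percentages come from: I'm treating one GCN SIMD as 16 lanes wide, so utilization is simply the thread count divided by 16, e.g. 3/16 = 18.75%, 1/16 = 6.25%, 6/16 = 37.5% and 120/16 = 750%.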
The problem is they get executed 4x (why? what's the technical reason?), for a wavefront size of 64, so divide them all by 4. Let's try to tune this:
- 64 vertex shader threads: 1 SIMD, 400% utilization, if it stalls, bad luck
- 62 hull shader const threads: 1 SIMD, 387.5% utilization, ...
- 186 hull shader cp threads: 1 SIMD, 1162.5% utilization, not divisible by 64 (padding math after this list)
- 372 domain shader threads: 1 SIMD, 2325% utilization, arrg, not divisible by 64
- say 120 pixel shader threads: 1 SIMD, 750% utilization, ...
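Spelling out the divisibility problem: 186 threads round up to ceil(186/64) = 3 wavefronts, i.e. 192 lane-slots with 6 of them idle in the last wavefront, and 372 threads round up to 6 wavefronts, i.e. 384 slots with 12 idle.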
Is it all on the same SIMD? It seems so, except if I put the HS into mem-mode, then the DSs might be created somewhere else.
Are the pixel shader threads also all piped through the same SIMD?
What is the real maximum utilization one could actually construct?
It appears possible to me to design a slider for tessellation which, when used, makes the rendering speed oscillate up and down, 4x being faster than 2x and such. I'm certain this oscillation appears in other configurations without tessellation as well. Subdivide your NURBS, oh, 2512 vertices is faster than 2468 (the reason is in the vertex shader), oh wait, 2546 is even faster (the reason is in the rasterizer, no thin triangles).
I hate being unable to carefully tune it.
I was really happy with the old version of CodeAnalyst which had the pipeline analysis; it helped me so much in getting everything I could out of x86 SIMD.
This assumes the algorithm cannot be refactored and potentially split into multiple phases, each requiring a lower maximum register count.
The phases could distribute normally within the confines of the existing implementation. It would have overhead, but it also would not disadvantage the breadth of programs out there that do not need to break GCN's resource and execution split.
I don't see a possibility to create a custom graphics pipeline, like 2x domain shader stages, or 6 pixel shader stages. I see it as conceptually possible, but I don't know if the hardware could support it. In a compute shader I could fake a pipeline even in one monolithic shader by separating scopes {} and communicating data across scopes through LDS. I don't have an LDS in the pixel shader; I only have the possibility to use a UAV as a scratchpad.
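To make that concrete, here is a rough sketch of what I mean by faking a pipeline in one monolithic compute shader (everything here, buffer names, register slots and the math, is made up for illustration):

```hlsl
// Rough sketch only: one monolithic compute shader faking a two-stage
// "pipeline" by separating the stages with scopes {} and handing the data
// from one stage to the next through LDS (groupshared).
StructuredBuffer<float>   inputBuf  : register(t0);
RWStructuredBuffer<float> outputBuf : register(u0);

groupshared float stageData[64];   // the LDS "pipe" between the two stages

[numthreads(64, 1, 1)]
void CSMain(uint3 dtid : SV_DispatchThreadID, uint gi : SV_GroupIndex)
{
    {   // "stage 1": produce something and park it in LDS
        float v = inputBuf[dtid.x];
        stageData[gi] = v * v;
    }
    GroupMemoryBarrierWithGroupSync();   // stage 1 results visible to the whole group
    {   // "stage 2": consume stage-1 results, including a neighbour's
        float left = stageData[(gi + 63) & 63];
        outputBuf[dtid.x] = stageData[gi] + left;
    }
}
```

The scopes plus the barrier give me distinct "stages" with LDS as the hand-off, which is exactly what the pixel shader can't do.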
But on the other hand, I've seen that it works out better more often than not to split a compute shader into smaller separate pipeline pieces, even if it means going through UAVs. I believe that's because it was easier to load-balance the hardware with more but tinier fragments of work (fewer points of stall as well). I don't think GPR pressure reduction was causing it, as the data was "cached" in the LDS instead of the UAV anyway.
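For comparison, the split version looks roughly like this (again just a sketch with made-up names): the same work as two small dispatches, with a UAV as the scratchpad between them instead of LDS.

```hlsl
// Rough sketch only: the same two stages as two separate small compute
// shaders, with a UAV acting as the scratchpad between the dispatches.
StructuredBuffer<float>   inputBuf : register(t0);
RWStructuredBuffer<float> scratch  : register(u0);   // intermediate results
RWStructuredBuffer<float> result   : register(u1);

[numthreads(64, 1, 1)]
void CSStage1(uint3 dtid : SV_DispatchThreadID)
{
    float v = inputBuf[dtid.x];
    scratch[dtid.x] = v * v;
}

[numthreads(64, 1, 1)]
void CSStage2(uint3 dtid : SV_DispatchThreadID)
{
    // Runs in a second Dispatch(); D3D11 serialises the UAV hazard between
    // the two dispatches, on D3D12 a UAV barrier would sit in between.
    result[dtid.x] = scratch[dtid.x] * 0.5f;
}
```

Each piece is small and gets its own dispatch, so the hardware gets many bite-sized chunks of work to juggle, which is what I suspect makes the load-balancing easier.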
Edit: corrected HS/DS numbers