RoOoBo said:
I wouldn't bother with the queue sizes in the table as I may be changing them with each new experiment.
Ah - I was leading up to trying to see whether the green/red triangle problem was, essentially, incapable of arising in your architecture due solely to queue sizes, or whether it was more subtle than that (i.e. not treating the queues as strictly FIFO).
Until late July (the original paper was submitted in May-June or so) there wasn't a fragment distribution policy implemented for the shader units. Fragments were generated on an 8x8 tile basis and quads would be removed before shading by HZ and ZST. The quads would then be assigned to a free shader unit on a round-robin basis. It wasn't very texture cache friendly ... Now, after July, there still isn't a proper distribution mechanism implemented, but the assignment is made N fragments at a time per shader unit (N being large; in the experiments I think it was set at 128), and when one becomes full the fragments go to the next shader unit with free resources. Very weird things happen with different configured Ns. A proper and configurable distribution mechanism is what I should be working on right now (likely to be tile based).
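Roughly how I picture that interim policy, as a toy sketch rather than the simulator's actual code (the unit count, capacity, and fragment count are invented for the example):

```cpp
#include <cstdio>
#include <vector>

// Toy model: send N fragments to one shader unit, then move on to the next
// unit that still has free resources. Numbers here are arbitrary.
struct ShaderUnit {
    int queued = 0;      // fragments currently assigned to this unit
    int capacity = 256;  // "free resources" limit, arbitrary here
};

int main() {
    const int numUnits = 4;
    const int N = 128;   // fragments per unit before switching, as in the experiments
    std::vector<ShaderUnit> units(numUnits);
    int current = 0;
    int sentToCurrent = 0;

    for (int frag = 0; frag < 1000; ++frag) {
        // Switch after N fragments, or earlier if the current unit filled up.
        if (sentToCurrent == N || units[current].queued == units[current].capacity) {
            do {
                current = (current + 1) % numUnits;
            } while (units[current].queued == units[current].capacity);
            sentToCurrent = 0;
        }
        units[current].queued++;
        sentToCurrent++;
    }
    for (int i = 0; i < numUnits; ++i)
        std::printf("unit %d: %d fragments\n", i, units[i].queued);
}
```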
In a tile-based rasteriser, presumably you could multi-thread the rasteriser. I know in the other paper you've done a shader-implemented rasteriser - which in itself suggests that multi-threading is possible (if there's more than one shader pipeline to run the rasteriser program).
Could you also multi-thread the primitive assembly and triangle setup engines? I presume you could because those steps are being removed as fixed-function in DX10, and implemented as shader programs too, aren't they?
We don't have that concept of a batch yet, and I'm unlikely to call it that; it's too confusing with the other batches. Maybe a shader work assignment group or unit or something ...
I dare say you're in a good position to set the standard here, since the IHVs seem so coy about this subject.
The only scheduling that is done outside the shader unit is to send vertex inputs to the shader before sending fragment inputs (vertex-first scheduling). As the shader unit doesn't have a penalty for fetching instructions from either kind of input each cycle, they just get mixed. And the number of vertex inputs is limited by the queues in the geometry pipeline.
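Just a toy picture of that vertex-first issue policy, not the simulator's code (the input counts are made up):

```cpp
#include <cstdio>
#include <queue>

int main() {
    std::queue<int> vertexInputs, fragmentInputs;
    for (int i = 0; i < 3; ++i) vertexInputs.push(i);
    for (int i = 0; i < 5; ++i) fragmentInputs.push(i);

    // Each cycle, a pending vertex input is issued before any fragment input.
    for (int cycle = 0; !vertexInputs.empty() || !fragmentInputs.empty(); ++cycle) {
        if (!vertexInputs.empty()) {          // vertices take priority
            std::printf("cycle %d: issue vertex %d\n", cycle, vertexInputs.front());
            vertexInputs.pop();
        } else {
            std::printf("cycle %d: issue fragment %d\n", cycle, fragmentInputs.front());
            fragmentInputs.pop();
        }
    }
}
```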
One thing that puzzles me about the unified pipeline is whether running multiple vertices in a work unit (e.g. in Xenos it's 16 vertices) will cause problems with vertex batch granularity. Put simply, if you've got 18 vertices to shade with a specific program, before the next batch uses a slightly different program, then in a traditional MIMD pipeline GPU, each vertex progresses individually through a pipe, quite happily. There's no issue of granularity as there is no work unit, as such.
But in a unified architecture, you have two work units: 16 vertices and 2. The second work unit wastes 14 threads' worth of resources. It just seems to me that vertex shading prefers finer-grained parallelism than fragment shading. Is that fair?
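Just to put numbers on it (a trivial sketch, with the batch size and vertex count taken from the example above):

```cpp
#include <cstdio>

int main() {
    const int batchSize = 16;   // threads per work unit, Xenos-style
    const int vertices  = 18;   // vertices sharing one program

    int batches = (vertices + batchSize - 1) / batchSize;  // ceiling division
    int wasted  = batches * batchSize - vertices;          // idle thread slots

    std::printf("%d vertices -> %d work units, %d idle slots\n",
                vertices, batches, wasted);   // 18 -> 2 work units, 14 idle
    return 0;
}
```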
Another 'to be done' is downgrading the quite idealized shader unit to work in a SIMD way (so a whole batch must execute the same fetched instruction before starting with the next). But I don't think that fetching an instruction (or a group of instructions) every cycle or every few cycles is that problematic. CPUs implement higher fetch bandwidth at higher frequencies.
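What I mean by that, as a toy model only (batch size, program length, and the operation are invented):

```cpp
#include <cstdio>

int main() {
    const int batchSize = 4;
    const int programLength = 3;
    float regs[batchSize] = {1, 2, 3, 4};   // one register per batch element

    // One instruction is fetched per step; every element of the batch
    // executes it before the next fetch happens (lock-step SIMD).
    for (int pc = 0; pc < programLength; ++pc) {
        for (int lane = 0; lane < batchSize; ++lane) {
            regs[lane] *= 2.0f;              // stand-in for the fetched op
        }
        std::printf("instruction %d executed by all %d elements\n", pc, batchSize);
    }
}
```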
Since current GPUs are repeatedly executing a single instruction, it seems that they can "cut out" instruction decode from the primary pipeline (e.g. make it a separate task that runs "every so often" in a dedicated decode unit, delivering the decoded instruction and register file indices "just in time"). But in terms of the main pipeline organisation, does this actually amount to anything useful?
Jawed