Vertex Shaders: MIMD versus SIMD

Jawed

In R420 and NV40/G70, the vertex shaders are arranged as 6-way or 8-way MIMD.

In Xenos the vertex shaders are (effectively) arranged as 16-way SIMD, with a possibility of making a 3-way super-MIMD grouping, i.e. each of the three banks of ALUs could be executing different instructions.

Keeping it simple for a second, what's the effect of an SIMD architecture in Xenos versus the MIMD architecture of more conventional GPUs in vertex shading?

Why do conventional GPUs use an MIMD architecture as opposed to SIMD?

Will Xenos's SIMD architecture be a disadvantage for vertex shading?

Jawed
 
Jawed said:
Keeping it simple for a second, what's the effect of an SIMD architecture in Xenos versus the MIMD architecture of more conventional GPUs in vertex shading?
Potentially worse dynamic branching performance (on vertices) than current GPUs.
Why do conventional GPUs use an MIMD architecture as opposed to SIMD?
Because there isn't any coherency to exploit between vertices, imho.

Will Xenos's SIMD architecture be a disadvantage for vertex shading?
Maybe... but it's not a big deal.
 
R420 has MIMD vertex shaders? What good would that be if there's no dynamic branching?

MIMD basically means that each unit has its own instruction pointer and decode units. So they need more (internal) instruction fetch bandwidth. But it also means each unit can branch independently and they don't have to output vertices at the same time. And I think scaling the number of vertex units is easier this way.

With SIMD, all units work in lockstep on the same instruction. Decode logic is required only once and you have to fetch only one instruction per cycle. Outputting vertices is probably a bit more difficult since all units finish at once. Branching is more complex. If only one of the vertices you're working on takes the other path, you have to calculate both (or do some kind of re-grouping). Just like NV40 pixel shaders.
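To put rough numbers on the "calculate both paths" point, here is a minimal Python sketch. It is a toy model, not a description of any real chip: `simd_cycles`, `mimd_cycles`, the cycle costs, and the greedy hand-out of vertices to units are all made up for illustration, and it ignores instruction fetch, latency hiding, and any re-grouping.

```python
def simd_cycles(takes_branch, width, then_cost, else_cost):
    """Cycles for a lockstep SIMD array that processes vertices in groups of `width`.

    Hypothetical model: a group pays for every path that at least one of
    its vertices needs, because all lanes share one instruction stream.
    """
    total = 0
    for start in range(0, len(takes_branch), width):
        group = takes_branch[start:start + width]
        if any(group):
            total += then_cost  # at least one lane needs the 'then' path
        if not all(group):
            total += else_cost  # at least one lane needs the 'else' path
    return total


def mimd_cycles(takes_branch, units, then_cost, else_cost):
    """Cycles for `units` independent processors (hypothetical greedy scheduling)."""
    busy_until = [0] * units
    for taken in takes_branch:
        # The next vertex goes to whichever unit frees up first.
        unit = busy_until.index(min(busy_until))
        busy_until[unit] += then_cost if taken else else_cost
    return max(busy_until)


# One vertex out of eight takes the expensive path (10 cycles vs 2).
outcomes = [True] + [False] * 7
print(simd_cycles(outcomes, 4, 10, 2))  # 14: the first group pays for both paths
print(mimd_cycles(outcomes, 4, 10, 2))  # 10: only one unit runs the long path
```

When every vertex takes the same path the two models cost the same in this sketch, so the gap only shows up with divergent branches.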
 
Jawed said:
In R420 and NV40/G70, the vertex shaders are arranged as 6-way or 8-way MIMD.
MIMD? Really? Given the emphasis on things like the constant-based boolean branches and constant-based integer loops, I'd have thought they'd be SIMD.

Do you have any references?
 
Xmas said:
With SIMD, all units work in lockstep on the same instruction. Decode logic is required only once and you have to fetch only one instruction per cycle. Outputting vertices is probably a bit more difficult since all units finish at once.
Double buffering the output would solve that, but surely it'd be no worse a problem than full MIMD (with branching) where results could arrive out of order. Urk.
 
NV40/G70 are easy:

http://www.beyond3d.com/previews/nvidia/nv40/index.php?p=8

R420 is more difficult. I'd assumed it's MIMD from the per pipe flow-control:

http://www.beyond3d.com/reviews/ati/r420_x800/index.php?p=6

But I can't find anything that categorically states that it's either SIMD or MIMD.

But, anyway, with dynamic branching, we get back to the "batch-size" question that plagues dynamic branching in NV40 fragment shading - where the batches are so large that dynamic branching is a bust.
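As a rough feel for why batch size matters, here is a small Python sketch. It assumes every element in a batch takes the branch independently with the same probability, which is a simplifying assumption of mine (real vertices and fragments are usually correlated), and `divergence_probability` is just an illustrative name.

```python
def divergence_probability(p_taken, batch_size):
    """Chance that a batch holds both taken and not-taken elements.

    Assumes independent, identically distributed branch outcomes,
    which is an illustrative simplification.
    """
    all_taken = p_taken ** batch_size
    none_taken = (1.0 - p_taken) ** batch_size
    return 1.0 - all_taken - none_taken


for size in (1, 4, 16, 256):
    print(size, divergence_probability(0.1, size))
```

Under that assumption a batch of one never diverges, and the probability climbs towards 1 as the batch grows, at which point nearly every batch pays for both sides of the branch.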

Is there any data on the efficacy of dynamic branching in NV40/G70's vertex shaders?

Does anyone know how vertices are split across the vertex shaders? Vertex shader code that I've seen was operating upon multiple vertices in one piece of code, one vertex after the other. So I don't understand how batching of vertices works...

But if dynamic branching can work at the vertex or triangle level, then that's presumably more efficient than if it worked at a much more granular level.

Really, what I'm curious about is whether Xenos's SIMD architecture (16-way) is an impediment to vertex shading, in itself, particularly with complex shaders and dynamic branching.

Jawed
 
Simon F said:
Double buffering the output would solve that, but surely it'd be no worse a problem than full MIMD (with branching) where results could arrive out of order. Urk.
Maybe that's the reason for G70's large post transform cache?
http://www.ixbt.com/video2/images/g70/post-tnl.png
http://www.digit-life.com/articles2/video/g70-part2.html

Jawed said:
But, anyway, with dynamic branching, we get back to the "batch-size" question that plagues dynamic branching in NV40 fragment shading - where the batches are so large that dynamic branching is a bust.

Is there any data on the efficacy of dynamic branching in NV40/G70's vertex shaders?
NVidia states "no-penalty branching", which should be possible if the whole shader program is stored on-chip. There are no batches - each vertex is processed independently. But each of the units works on multiple threads (vertices) at once to hide latency. When one thread finishes, the next vertex is fetched from the pre-transform cache.
 
Simon F said:
Xmas said:
With SIMD, all units work in lockstep on the same instruction. Decode logic is required only once and you have to fetch only one instruction per cycle. Outputting vertices is probably a bit more difficult since all units finish at once.
Double buffering the output would solve that, but surely it'd be no worse a problem than full MIMD (with branching) where results could arrive out of order. Urk.

The article about the GeForce 6800 in the March-April issue of Micro explicitly says that the vertex shaders are MIMD. It also says that 'data dependent branches are free of the penalty normally accompanying single-instruction, multiple data implementations'.
 
Jawed said:
Really, what I'm curious about is whether Xenos's SIMD architecture (16-way) is an impediment to vertex shading, in itself, particularly with complex shaders and dynamic branching.
Good question. I guess it depends on how "lock step" each thread has to be. Xenos switching threads per instruction should allow the actual branch instruction cost to be mitigated somewhat, but you still end up with co-threaded vertices in different places in the shader.

How ATi handles that is a very good question.
 
I think a SIMD implementation can easily deal with the "all vertices completing at once" problem by introducing a 1-clock instruction propagation delay between ALUs, i.e.

Code:
I-cache ----
            |
            \/
         ALU 1
            |
            \/
         ALU 2
            |
            \/
            .
            .
            .
         ALU n

clocks
(time): 1    2    3    ...  n
ALU 1   i_1  i_2  i_3  ...  i_n
ALU 2        i_1  i_2  ...  i_n-1
...
ALU n                       i_1

Dunno if that's a good idea or not, but it should alleviate the problem so long as n isn't huge (and the vertex shader isn't incredibly short).

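The staggered issue idea above can be sketched in a few lines of Python. This is only an illustration of the timing table: `staggered_schedule` and `finish_clocks` are hypothetical helpers, and the model assumes one instruction per clock per ALU with exactly one clock of delay between neighbouring ALUs.

```python
def staggered_schedule(n_alus, n_instructions):
    """rows[a][c] is the 1-based instruction ALU a runs at clock c, or None if idle.

    Assumes each ALU receives the instruction stream one clock after
    the ALU before it (an illustrative model, not a real design).
    """
    n_clocks = n_alus + n_instructions - 1
    rows = []
    for alu in range(n_alus):
        row = [None] * n_clocks
        for instr in range(n_instructions):
            row[alu + instr] = instr + 1
        rows.append(row)
    return rows


def finish_clocks(rows):
    """1-based clock on which each ALU executes its last instruction."""
    return [max(c for c, i in enumerate(row) if i is not None) + 1 for row in rows]


rows = staggered_schedule(3, 4)
for row in rows:
    print(row)
print(finish_clocks(rows))  # [4, 5, 6]: one vertex completes per clock
```

In this model the price of spreading the outputs out is `n_alus - 1` extra clocks of fill and drain per shader run, which is why it only looks attractive when n is small relative to the shader length.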
 