Unified Shaders: With traditional pipes?

Dave Baumann said:
Well, look, Eric certainly describes R420 as MIMD across the pixel processors.

Across the pixel processors it is MIMD if we define DATA as "a vector of pixels". But inside a single pixel processor it is SIMD, as it executes the same instruction for at least 4 pixels. This makes it SIMD for the whole system, as it is impossible to execute a different instruction for each pixel. Anyway, as long as it is faster than a MIMD CPU we still gain more than we give up.
 
We're not talking about R420 (but I see what you mean...), I was addressing NV40/G70.
That's what nvidia is saying about NV40 (from GPUGEMS2):
Furthermore, the fragment processor works on groups of hundreds of
pixels at a time in single-instruction, multiple-data (SIMD) fashion (with each fragment
processor engine working on one fragment concurrently), hiding the latency of texture
fetch from the computational performance of the fragment processor.
 
So are there any real-world examples of per-vertex dynamic branching vertex shaders in games?

Jawed
 
Jawed said:
So are there any real-world examples of per-vertex dynamic branching vertex shaders in games?

Jawed

Yes, if you want and give me a little time I can show you such shaders from current games.
 
:cool: I'm just curious what will happen to the performance of such vertex shaders under a unified architecture where the shaders have to execute "batches" of vertices, rather than being MIMD at the vertex level.

Jawed
 
Demirug said:
Across the pixel processors it is MIMD if we define DATA as "a vector of pixels". But inside a single pixel processor it is SIMD, as it executes the same instruction for at least 4 pixels.
Demirug, my initial point was that NV40 (although there is some question about G70) appears to work on a single batch across all the quads, but R300+ doesn't: it works on multiple batches across multiple quads. Hence across the quads NV40 would appear to be SIMD, while across the quads R300+ is MIMD. As far as the processing units are concerned, they are all designed as quads at the moment; that is the lowest unit of processing. Ask an engineer what they think a "pipeline" is and they'll start talking about the quads, not individual fragment processors.

nAo said:
We're not talking about R420 (but I see what you mean...), I was addressing NV40/G70.
That's what nvidia is saying about NV40 (from GPUGEMS2):
nAo, I know, I was talking about G70, if it does indeed operate different batches simultaneously (which NV40 appears not to do).
 
Dave Baumann said:
nAo, I know, I was talking about G70, if it does indeed operate different batches simultaneously (which NV40 appears not to do).
Dave, let's say that:
G70 is 1024 times more SIMDier than MIMDier
Xenos is 64 times more SIMDier than MIMDier
R520 is 16 times more SIMDier than MIMDier

Well... I'm kidding ;) (sorry, I don't even know if this makes any sense in English...)
 
nAo said:
Dave, let's say that:
G70 is 1024 times more SIMDier than MIMDier
Xenos is 64 times more SIMDier than MIMDier
R520 is 16 times more SIMDier than MIMDier

Well... I'm kidding ;) (sorry, I don't even know if this makes any sense in English...)
I do see what you are saying, but it's not really the right way of looking at it, because those "16" on R520 are not operating across the quads in a single cycle; there are 4 completely separate ones running across all the quads in 4 cycles (similarly with Xenos, 3 separate ones operating across each of the 3 SIMD arrays over 4 cycles).

As a side note, "Xenos is 64 times more SIMDier than MIMDier" and "R520 is 16 times more SIMDier than MIMDier" is going to be the interesting point for R600.
 
On a side-note, the X1 series seems to retain the screen-tiling that was introduced with R300, so R520's four quads are also MIMD at the quad level - i.e. four distinct batches are able to run concurrently.

I'm curious whether this organisation can be retained in a Xenos-like unified architecture. This question seems to be centred on the organisation of texturing. Xenos appears to use a single-threaded texturing engine (i.e. all 16 pipes are running the same batch).

But there's no reason to infer that a PC version of Xenos would have a single-threaded texture engine. If it was multi-threaded, like R520's (or R580's), then the base-unit for scheduling would become 4 objects (vertices or fragments). This would ameliorate the dynamic branching overhead of going to SIMD vertex shading, with the reduced size of vertex batches (compared to Xenos's vertex batches of 16 x 4 phases - i.e. 64).
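
To put rough numbers on that, here is a minimal sketch of the batch-size effect. It assumes a toy cost model that is not taken from any real hardware: a batch that contains objects on both sides of a branch pays for both sides, and every slot in the batch is occupied for the whole time. The function name `shading_cost`, the per-path costs and the 64-vertex workload are all made up for illustration.

```python
def shading_cost(takes_branch, batch_size, cost_taken, cost_not_taken):
    """Return total work (object-slots x instruction cost) for one draw.

    takes_branch: list of bools, one per vertex/fragment, True if that
                  object takes the expensive side of the branch.
    batch_size:   objects that must share one instruction stream
                  (1 = MIMD per object; 4 and 64 are the sizes discussed).
    """
    total = 0
    for start in range(0, len(takes_branch), batch_size):
        batch = takes_branch[start:start + batch_size]
        cost = 0
        if any(batch):        # at least one object needs the taken path
            cost += cost_taken
        if not all(batch):    # at least one object needs the other path
            cost += cost_not_taken
        # Every slot in the batch is occupied for the whole time,
        # even the ones whose results are masked out.
        total += cost * len(batch)
    return total


# Hypothetical workload: 64 vertices, only the first takes the costly path.
vertices = [True] + [False] * 63
for size in (1, 4, 64):
    print(size, shading_cost(vertices, size, cost_taken=10, cost_not_taken=1))
```

With these made-up costs the totals come out as 73 for per-vertex MIMD, 104 for batches of 4 and 704 for batches of 64, which is the kind of gap I'm wondering about for branching vertex shaders.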

The final question, for me, lies in the apparent ability of Xenos to perform proactive texture scheduling, so that the texture engine is never idle. I'm still unclear whether R5xx can do this - I suspect not. And I wonder if that inability, if true, will be retained in a PC unified architecture - i.e. R600.

Jawed
 
Dave Baumann said:
I do see what you are saying, but it's not really the right way of looking at it, because those "16" on R520 are not operating across the quads in a single cycle; there are 4 completely separate ones running across all the quads in 4 cycles (similarly with Xenos, 3 separate ones operating across each of the 3 SIMD arrays over 4 cycles).
We can say the same about G70 too, since the number of pixels in a batch is much higher than the number of fragment shader pipelines.
Anyway, it's not a big deal what you call it. IMHO at this time it makes more sense to call R3x/4x/5x/NV40/G70 SIMD processors (regarding their fragment shading abilities); if this MIMD thing were down at the quad level then I could agree with your nomenclature.

As a side note, "Xenos is 64 times more SIMDier than MIMDier" and "R520 is 16 times more SIMDier than MIMDier" is going to be the interesting point for R600.
One thing is sure Dave, you really love to tease us :)

ciao,
Marco
 
3dcgi said:
It's actually the other way around. Vertex processing happens before setup and rasterization.

Possible reasons to keep vertex and pixel shading separate might be MIMD vs. SIMD and Vec4 + scalar vs. Vec3 + scalar. In other words Nvidia currently has MIMD vertex shaders and SIMD pixel shaders.

Look to 3dlabs for another possible reason to avoid hardware unification. IIRC, their vertex shaders are 36 bit and the pixel shaders are 32 bit.

I'm sure there are many more potential reasons.

Thanks :)
 
Dave Baumann said:
Well, look, Eric certainly describes R420 as MIMD across the pixel processors.
It's weird that sireric posted that a year ago today. When I first looked at it I thought it was posted today.
 