What type of VLIW pixel processing format does NV30 use?

The R300 has a very powerful VLIW pixel processing implementation and is able to execute a 4-component vector op, a complex scalar op and a texture addressing op in parallel. I wonder if the NV30 will be capable of such feats, only with twice the output when using the FP16 format.

What processing format do you guys think the NV30 is using? It seems it could issue 2 integer instructions for every floating-point instruction (per the digit-life article), so it may be able to issue 2 floating-point ops when not issuing integer ops. The question is whether these are vector, scalar, or any arbitrary combination of vector and scalar.
 
What do you mean by "complex scalar op"?


(speculation mode on)
The NV30 can probably execute two 64-bit ops or one 128-bit op per cycle, and can probably fetch a texture simultaneously. Logically, the texture unit is separate from the PLU, just as it was in DX8's shader pipeline, so I bet most DX9 designs will be able to overlap them and have a separate port for updating s# registers.

It's possible that the NV30 doesn't even have a separate scalar unit, but just reuses the FP registers.

Does the R300 really have a 4-component vector unit and a separate scalar unit, or can it merely issue a 3-component op and a scalar op in parallel?

For example, can a dp4 and scalar add operate in parallel?
 
Luminescent said:
The R300 has a very powerful VLIW pixel processing implementation and is able to execute a 4-component vector op, a complex scalar op and a texture addressing op in parallel. I wonder if the NV30 will be capable of such feats, only with twice the output when using the FP16 format.

What processing format do you guys think the NV30 is using? It seems it could issue 2 integer instructions for every floating-point instruction (per the digit-life article), so it may be able to issue 2 floating-point ops when not issuing integer ops. The question is whether these are vector, scalar, or any arbitrary combination of vector and scalar.

So you think that the R300 has more "power" clock for clock?

Something like :

R300 = 3 96bit ops x 325 = 975 Mops
NV30 = 2 64bit ops x 500 = 1000 Mops


At least this would explain the high clockspeed of the NV30.
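
The comparison above is just ops per clock multiplied by clock speed. A minimal Python sketch of that arithmetic follows. It assumes, as the post does, 3 ops per clock at 325 MHz for the R300 and 2 ops per clock at 500 MHz for the NV30; the function name is made up for illustration, and none of the inputs are confirmed hardware figures.

```python
def mops(ops_per_clock: int, clock_mhz: int) -> int:
    """Millions of ops per second: ops issued per clock times clock in MHz."""
    return ops_per_clock * clock_mhz


# Inputs are the post's guesses, not measured or documented values.
r300 = mops(ops_per_clock=3, clock_mhz=325)  # three 96-bit ops per clock
nv30 = mops(ops_per_clock=2, clock_mhz=500)  # two 64-bit ops per clock

print(f"R300: {r300} Mops")
print(f"NV30: {nv30} Mops")
```

Note that the two totals count ops of different widths (96-bit versus 64-bit), so equal Mops would not mean equal work done.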
 
DemoCoder said:
The NV30 can probably execute two 64-bit ops or one 128-bit op per cycle, and can probably fetch a texture simultaneously.

Ram stated this:

ram said:
One press guy asked about this at the launch and the answer was that a 32bit integer op only would need half of a cycle (so two 32bit ops per clock per pipeline), while FP16 needs 1.

Which indicates to me that it takes two cycles to execute a 128-bit op.
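
That inference amounts to a linear extrapolation from the two quoted data points. A rough Python sketch, assuming op cost scales linearly with operand width and is anchored at 64 bits (4 x FP16) = 1 cycle; the linear model and the function name are assumptions made here for illustration, not something NVidia stated.

```python
from fractions import Fraction


def cycles_if_linear(bits: int) -> Fraction:
    """Cycles per op, ASSUMING cost is proportional to operand width
    and that a 64-bit (4 x FP16) op takes exactly one cycle."""
    return Fraction(bits, 64)


for label, bits in [("32-bit integer", 32), ("64-bit FP16", 64), ("128-bit FP32", 128)]:
    print(f"{label}: {cycles_if_linear(bits)} cycle(s)")
```

The first two rows reproduce the quoted figures (half a cycle and one cycle); the third row is only the extrapolation and would be wrong if the hardware does not scale this way.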
 
It doesn't necessarily follow. It could be that FP16 and FP32 execute at the same speed, but that would go against the idea of half precision running at 2x the speed. I have trouble buying that the NV30 has more transistors than the R300, fewer transistors dedicated to bus width and AA modes, and now has half the FP units of the R300 as well. So what are all the extra transistors for?

The only way this issue is going to be settled for certain is by running shader benchmarks on the card to time pixel shaders.


Anyway, Dave, you were at the launch, so can you confirm that NVidia publicly said this? They certainly didn't say this at the COMDEX launch. There's also the possibility that the press guy didn't know what he was talking about. I talked to some PR people at COMDEX who didn't seem to have a clue, so it doesn't bode well.
 
They didn't talk about shader speed at all - I only got what I specifically asked about in the interview, but there's no mention of actual numbers of cycles.
 
Didn't we just have a discussion where we eventually agreed that the NV30 can do two 4-component MACs per pipe per cycle?

Inferred from the 51GFLOPS figure that was stated in some marketing material:

400MHz*8 pipes*2 4D macs/pipe*8flop/4D mac = 51GFLOPS.

So the NV30 clearly has more computational pixel shading power than the R300.
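
For anyone who wants to re-run that multiplication with a different clock or pipe count, here is a small Python sketch. It assumes a 4D MAC counts as 8 flops (4 multiplies plus 4 adds) and takes the pipe and MAC counts from the line above; the function name is hypothetical and the inputs are inferred from marketing material, not confirmed.

```python
def peak_gflops(clock_mhz: float, pipes: int, macs_per_pipe: int,
                flops_per_mac: int = 8) -> float:
    """Peak GFLOPS = clock * pipes * MACs per pipe * flops per MAC.

    flops_per_mac defaults to 8: a 4-component MAC is 4 multiplies + 4 adds.
    """
    return clock_mhz * 1e6 * pipes * macs_per_pipe * flops_per_mac / 1e9


# Inputs below are the thread's inferred numbers, not official specs.
print(peak_gflops(clock_mhz=400, pipes=8, macs_per_pipe=2))
# Same formula at a different clock, purely to show the sensitivity.
print(peak_gflops(clock_mhz=500, pipes=8, macs_per_pipe=2))
```

At 400 MHz the formula gives 51.2, which matches the marketing figure once rounded down; that match is the whole basis for the 2-MACs-per-pipe inference.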

Cheers
Gubbi
 
Ummm, two points there: first off, it's 500MHz, not 400, and second, how does that cater for the vertex shader?
 
At the time there was talk of a 400MHz part, I believe. Also, it was specifically stated that the 51GFLOPS were in the pixel shaders.

Also, as DemoCoder asks, where else have they used all the transistors if not in the shaders?

You can only use so much for caches, FIFOs and bandwidth-saving gizmos. The NV30 also tests fewer AA subsamples per pipe (4) than the R300 (6).

Cheers
Gubbi
 
Gubbi said:
Also, as DemoCoder asks, where else have they used all the transistors if not in the shaders?

What about the legacy parts (register combiners etc.) that the NV30 allegedly has, as opposed to the R300? Could they be (at least partly) responsible?
 
I would expect the new shaders to be a functional superset of register combiners. So the driver will simply turn register combiner extension calls into a corresponding shader program that is then executed.

Cheers
Gubbi
 
Gubbi said:
I would expect the new shaders to be a functional superset of register combiners. So the driver will simply turn register combiner extension calls into a corresponding shader program that is then executed.

Cheers
Gubbi

NVidia has explicitly said that the old register combiners are in hardware and can still be used AFTER the pixel shader stage. So, yes, there is a pixel shader and a register combiner per pipe that can be switched on/off as the programmer wants.
 
From what I was told there are probably some very interesting things about pixel shader performance on GeForceFX:
-fixed point is faster than fp16, which is faster than fp32
-the fewer temp registers you use, the faster it will go
And I guess NVidia hasn't told us everything yet...
 
RoOoBo said:
Gubbi said:
I would expect the new shaders to be a functional superset of register combiners. So the driver will simply turn register combiner extension calls into a corresponding shader program that is then executed.

Cheers
Gubbi

NVidia has explicitly said that the old register combiners are in hardware and can still be used AFTER the pixel shader stage. So, yes, there is a pixel shader and a register combiner per pipe that can be switched on/off as the programmer wants.

That's what I meant. And if I'm not mistaken, it was said that the NV30 has dedicated hardware for integer shaders whereas the R300 does everything in FP. What I'm wondering about is how many transistors these parts eat up.
 
From what I have read, it seems consistent that the NV30 can execute at least one 128-bit op per cycle, or two 64-bit ops, or it can issue integer ops directly, with legacy integer support in the form of register combiners (2 sets of 8 combiners). I don't know how many integer ops one combiner can perform per cycle, but I'm guessing around two. So this would mean the NV30 could execute 4 int ops per cycle (2 sets of eight combiners, assuming 2 int ops/cycle per combiner), which would accord with Nvidia's claim of ints running twice as fast as 64-bit floating point and four times as fast as 128-bit.
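
Those ratios can be checked with a few lines of Python. This is only a sketch of the guesswork above: every rate is an assumption from the post, and the 2 int ops per cycle is counted once per combiner set, since that is what the post's total of 4 implies. The variable names are made up for illustration.

```python
# All rates are per pipe per cycle and are guesses from the post, not specs.
COMBINER_SETS = 2      # "2 sets of 8 combiners"
INT_OPS_PER_SET = 2    # guessed; counted per set to arrive at the post's total of 4

int_rate = COMBINER_SETS * INT_OPS_PER_SET  # integer ops per cycle
fp64_rate = 2                               # 64-bit (4 x FP16) ops per cycle
fp128_rate = 1                              # 128-bit (4 x FP32) ops per cycle

print(f"int ops per cycle: {int_rate}")
print(f"int vs 64-bit FP:  {int_rate / fp64_rate:.0f}x")
print(f"int vs 128-bit FP: {int_rate / fp128_rate:.0f}x")
```

If the guess were applied to each of the 16 combiners rather than to each set, the integer rate would come out far higher than the claimed 2x and 4x, so the per-set reading is the one that fits.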
 
Only I doubt we'll ever see ints running four times as fast in PS-intensive games (read: future games). I don't even think register combiners are used often in today's games.

But, since plain old 32-bit color will still be used for a number of things for some time, the NV30 could be very impressive in performance for a while to come.
 
On one block diagram a while back I could have sworn I saw that the NV30 register combiners can combine more than one pipeline fragment together, so that you could use them for 2D post-processing without using multipass.

Maybe I'm misremembering, but it seems like the NV30 register combiners could exist for more than just backwards compatibility. It would be nice if they could be used as a programmable convolution filter, although the edge cases might cause some issues.
 
A few thoughts:

-If the register combiners are separate from the pixel shader (as RoOoBo said), could NVidia have included those in their GFLOPS figure? Or would a register combiner only operate on integers, and thus not qualify as a floating point op?

-Did someone say that the integer format on NV30 is only 12-bit? That's a minor disappointment, as even ATI's paper said that their car demo needs 16-bit normals and 11-12 bits weren't enough. Seeing how neither ATI nor NVidia support filtering of floating point textures, high precision integers can be important, such as Z-related calcs in the PS.

-DaveB: Vertex shaders don't have the processing power of pixel shaders, so they don't really advertise that figure. NV30 has the equivalent of 3 or 4 vertex shader units (they're a bit vague on this), each capable of one vector op per cycle. Each of the 8 pixel pipelines can do two vector ops per cycle (again, AFAIK). So each cycle, you get 16 fp ops from the VS, but 64 fp ops from the PS. My numbers may be a bit off, but you get the point. That's why that "ray-tracing in realtime" demo was done with R300's pixel shader.
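
The VS versus PS numbers in the last point work out as follows. A quick Python sketch, assuming every vector op is 4 components wide and counts as 4 fp ops; the unit counts are the rough ones given above (the 16 figure corresponds to 4 vertex units) and the function name is only illustrative.

```python
COMPONENTS_PER_VECTOR_OP = 4  # assumption: each vector op is 4 components wide


def fp_ops_per_cycle(units: int, vector_ops_per_unit: int) -> int:
    """fp ops per cycle = units * vector ops per unit * components per op."""
    return units * vector_ops_per_unit * COMPONENTS_PER_VECTOR_OP


vs_low = fp_ops_per_cycle(units=3, vector_ops_per_unit=1)   # if there are 3 VS units
vs_high = fp_ops_per_cycle(units=4, vector_ops_per_unit=1)  # if there are 4 VS units
ps = fp_ops_per_cycle(units=8, vector_ops_per_unit=2)       # 8 pixel pipes, 2 ops each

print(f"VS: {vs_low} to {vs_high} fp ops per cycle")
print(f"PS: {ps} fp ops per cycle")
```

Either way the pixel shader side comes out at least 4x ahead per cycle under these assumptions.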

Hmm. Looking back at the information we have, it seems like there are a lot of holes. As DemoCoder said, it looks like we'll just have to look at specific shader benchmarks when the part arrives.
 