A Blast from the past: speculation on the NV3x architecture

KimB

Legend
We all know that the NV30 is described as a 4x2 architecture, and the NV31 is often described as a 2x2 architecture, with something odd going on in the NV34.

I'd like to suggest something different. What if instead of making the NV31 and NV34 chips have fewer pipelines, they just removed math units so that these architectures take longer to do a single SIMD instruction?

This would explain a number of irregularities that a reduced number of pipelines does not explain:

1. Sometimes the chips act like they have more pipelines than they seem to have from first analysis. This just says that they couldn't reduce the size of all parts of the functional units, so for some functions, the NV31 and NV34 must have the same amount of power as the NV30 (most likely operations not directly linked to math, such as that in shaders, filtering, or blending).

2. The lower-cost chips still support the DDX/DDY instructions, which require the state information for four pixels at once.

3. nVidia opted to not support a vec3 + scalar architecture. Choosing to achieve scalability by increasing the number of clock cycles needed to complete a single SIMD instruction may make such a thing much harder to implement.
 
I have considered this before (on NV34). However, it didn't turn out very well. Of course, both are still possible: either the NV34 has four FP32 units, each with two-cycle throughput on all instructions, or it has only two FP32 units with one-cycle throughput.

However, I like the latter theory better because of some experimental results (on a 250MHz NV34):

1. The simplest shader gives about 500Mpix/s (495 Mpix/s), suggesting two pixels output per clock. Or, if there are four pixel pipelines, they have two-cycle throughput.

Code:
texld r0, t0, s0
mov oC0, r0

2. Adding a simple instruction to the above shader gives about 250Mpix/s (235 Mpix/s).

Code:
texld r0, t0, s0
add r0, r0, c0
mov oC0, r0

3. Adding another texld to the above shader gives almost the same result (220 Mpix/s).

Code:
texld r0, t0, s0
texld r1, t1, s1
add r0, r0, r1
mov oC0, r0

4. Dependent texture access is slower, 157 Mpix/s, suggesting that texture units are not serialized in a pixel pipeline.

Code:
texld r0, t0, s0
texld r1, r0, s1
add r0, r0, r1
mov oC0, r0

From these results, I have to say that a 2x2 configuration when using pixel shader 2.0 is the most probable explanation.
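
As a rough check on the numbers above, here is a minimal Python sketch that converts each measured fill rate into average pixels completed per clock. The 250 MHz clock and the four measurements come from the experiments in this post; the function and variable names are made up for illustration.

```python
# Measured fill rates (Mpix/s) from the four experiments, on a 250 MHz NV34.
CLOCK_MHZ = 250.0

measured_mpix = {
    "texld": 495.0,
    "texld + add": 235.0,
    "2 texld + add": 220.0,
    "dependent texld + add": 157.0,
}


def pixels_per_clock(mpix_per_s, clock_mhz=CLOCK_MHZ):
    """Average number of pixels completed per clock cycle."""
    return mpix_per_s / clock_mhz


if __name__ == "__main__":
    for name, rate in measured_mpix.items():
        print(f"{name:24s} {pixels_per_clock(rate):.2f} pixels/clock")
```

The simplest shader lands just under two pixels per clock, which fits either two pipelines at one-cycle throughput or four at two-cycle throughput; the arithmetic alone cannot tell those apart.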
 
pcchen said:
4. Dependent texture access is slower, 157 Mpix/s, suggesting that texture units are not serialized in a pixel pipeline.

Code:
texld r0, t0, s0
texld r1, r0, s1
add r0, r0, r1
mov oC0, r0

From these results, I have to say that a 2x2 configuration when using pixel shader 2.0 is the most probable explanation.
It won't be able to do two texture reads anyway in a single cycle if they're not both independent. So that's part of your performance hit.

Anyway, making the pipelines take longer to execute a single SIMD instruction won't necessarily improve dependent texture performance, not if that information is needed the very next instruction.
 
Chalnoth said:
It won't be able to do two texture reads anyway in a single cycle if they're not both independent. So that's part of your performance hit.

No, it can. Compare my second and third experiment results.

Anyway, making the pipelines take longer to execute a single SIMD instruction won't necessarily improve dependent texture performance, not if that information is needed the very next instruction.

It can be pipelined. There is no apparent advantage if you try to reorder arithmetic instructions.

I forgot another experiment:

Code:
texld r0, t0, s0
texld r1, t1, s1
add r0, r0, c0
add r0, r0, r1

gives 155 Mpix/s. This suggests that the NV34 does not have any "mini-FP" units (just like the NV30).
 
pcchen said:
Chalnoth said:
It won't be able to do two texture reads anyway in a single cycle if they're not both independent. So that's part of your performance hit.
No, it can. Compare my second and third experiment results.
Those didn't use dependent textures.

Here's my examination of your results:
Code:
texld r0, t0, s0
mov oC0, r0
This executing at about 500Mpix/sec sets the fillrate of the chip. This is effectively just one instruction (the move apparently doesn't take any time).

Code:
texld r0, t0, s0
add r0, r0, c0
mov oC0, r0
As you stated, this had a fillrate of about 250Mpix/sec. This makes sense for a DX9 shader, as the texld and the add would have to be executed in subsequent cycles, making for a 2-cycle shader.

Code:
texld r0, t0, s0
texld r1, t1, s1
add r0, r0, r1
mov oC0, r0
This one didn't add anything the hardware couldn't handle, and so executed in the same two cycles (the tex/FP unit can handle two textures). There was some added performance hit that was likely only due to the added memory bandwidth.

Code:
texld r0, t0, s0
texld r1, r0, s1
add r0, r0, r1
mov oC0, r0
Now the texture reads were no longer independent, and thus had to be executed in separate cycles. That makes for three cycles for this shader. This makes perfect sense if the maximum throughput is 500Mpix/sec, as the execution of this shader was pretty close to 166Mpix/sec.

(please note that my definition of a cycle is very loose here, and doesn't necessarily take on the definition one cycle=one clock cycle)
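
The cycle counting above can be written out as a small Python sketch. It assumes a 500 Mpix/s peak (taken from the texld-only shader) and the loose per-shader cycle counts given in this post; the names are hypothetical and the model is only the simple "peak divided by cycles" estimate, not a description of the hardware.

```python
PEAK_MPIX = 500.0  # assumed peak rate, taken from the texld-only shader


def predicted_mpix(cycles, peak=PEAK_MPIX):
    """Predicted fill rate if a shader needs `cycles` loose cycles per pixel."""
    return peak / cycles


# (shader, assumed cycles, measured Mpix/s as reported by pcchen)
cases = [
    ("texld", 1, 495.0),
    ("texld + add", 2, 235.0),
    ("2 texld + add", 2, 220.0),
    ("dependent texld + add", 3, 157.0),
]

if __name__ == "__main__":
    for name, cycles, measured in cases:
        pred = predicted_mpix(cycles)
        print(f"{name}: predicted {pred:.0f}, measured {measured:.0f}, "
              f"ratio {measured / pred:.2f}")
```

Each measurement falls somewhat below its prediction, which is consistent with the memory bandwidth cost mentioned above but does not prove it.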

This all points to an architecture that is functionally identical to the NV30, but doesn't have as much processing power per clock. I don't think your tests state anything against my own assertion.
 
Just this morning I saw an unofficial announcement that much of the NV3x internal architecture will be revealed relatively soon. It was also claimed that David Kirk gave his OK.
 
Chalnoth said:
Now the texture reads were no longer independent, and thus had to be executed in separate cycles. That makes for three cycles for this shader. This makes perfect sense if the maximum throughput is 500Mpix/sec, as the execution of this shader was pretty close to 166Mpix/sec.

It is definitely possible to design a two TMU pipeline which can process dependent texture access with one cycle throughput (they can be pipelined). GF3/GF4 Ti all can do this with pixel shader 1.1.
 
Here's a snippet from The Register:-

"The 130nm 5950 - also known by its codename, 'NV38' - and is being pitched as the leading 5900-series chip. As we reported earlier today, its 256-bit, eight-pipeline core operates with 950MHz DDR graphics memory across a 256-bit bus yielding a bandwidth of 30.4GBps

The 5700 family - aka NV36 - replace the 5600 line of mainstream chips. Like the 5950, the 5700 Ultra is based on the CineFX 2 engine, but provides only four pixel pipelines rather than eight. It also supports a 128-bit memory bus, which yields a bandwidth of 14.4GBps when the DDR SDRAM operates at 900MHz"

http://www.theregister.co.uk/content/3/33557.html
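
The bandwidth figures in the quote follow from bus width times effective memory clock. A minimal Python check, assuming GB means 10^9 bytes and that the quoted 950MHz and 900MHz are effective DDR rates (the function name is made up for illustration):

```python
def bandwidth_gb_per_s(bus_bits, effective_mhz):
    """Peak memory bandwidth in GB/s, with 1 GB taken as 10^9 bytes."""
    bytes_per_transfer = bus_bits / 8
    return bytes_per_transfer * effective_mhz * 1e6 / 1e9


if __name__ == "__main__":
    print(f"5950 Ultra: {bandwidth_gb_per_s(256, 950):.1f} GB/s")
    print(f"5700 Ultra: {bandwidth_gb_per_s(128, 900):.1f} GB/s")
```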

I must say, after reading through a year's worth of discussions on this, I thought it was pretty well set in stone that the NV35 was a 4x2 architecture.
 
pcchen said:
It is definitely possible to design a two TMU pipeline which can process dependent texture access with one cycle throughput (they can be pipelined). GF3/GF4 Ti all can do this with pixel shader 1.1.
But the NV3x apparently does not.
 
THe_KELRaTH said:
Here's a snippet from The Register:-
Which states nothing we haven't been told before.

We all know that the NV3x is a complex architecture.

We also know that its pixel pipelines do not conform to the traditional definition of a pixel pipeline.

Particularly due to the remaining support of the partial derivative instructions, which require four pixels to be rendered simultaneously, I still think that each NV3x chip, when running pixel shaders, processes four pixels simultaneously. It's just that different chips have differing amounts of processing power per clock per pixel pipeline.
 
The NV30 or NV35 can probably do that in pixel shader 1.1. However, I don't have an NV30 or NV35 to run that test.
 
Chalnoth said:
Particularly due to the remaining support of the partial derivative instructions, which require four pixels to be rendered simultaneously, I still think that each NV3x chip, when running pixel shaders, processes four pixels simultaneously. It's just that different chips have differing amounts of processing power per clock per pixel pipeline.

You can do it with an internal 2x2 pixel buffer, even if you have only one pixel pipeline.
 
pcchen said:
You can do it with an internal 2x2 pixel buffer, even if you have only one pixel pipeline.
Well, that does depend on what sorts of things you can use as an argument for the instruction. Can you only use an input register as an argument, or also a value calculated in the shader?
 
If you have only one pixel pipeline (assuming it's pipelined to at least four stages), you can use the first stage to compute the first pixel, then the second pixel, the third pixel, and the fourth pixel. Thus, at the end of each 2x2 block of pixels you have the same data you would have with four pixel pipelines. If you have two pixel pipelines it's even easier to do that.
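
To make the idea concrete, here is a rough Python sketch of a single pipeline filling a 2x2 buffer one pixel at a time and then taking differences between neighbours. This only illustrates the buffering argument; it says nothing about how any NV3x chip actually does it, and `shade` is a made-up stand-in for a value calculated in the shader.

```python
def shade(x, y):
    """Hypothetical per-pixel value; stands in for anything computed in a shader."""
    return 3.0 * x + 5.0 * y


def quad_derivatives(x0, y0, value_fn=shade):
    """Return (ddx, ddy) for the 2x2 quad whose top-left pixel is (x0, y0)."""
    # One pipeline evaluates the four pixels of the quad one after another
    # and stores the results in a small 2x2 buffer, indexed buf[dy][dx].
    buf = [[value_fn(x0 + dx, y0 + dy) for dx in (0, 1)] for dy in (0, 1)]
    # Once the buffer is full, the derivatives are finite differences
    # between horizontally and vertically adjacent pixels.
    ddx = buf[0][1] - buf[0][0]
    ddy = buf[1][0] - buf[0][0]
    return ddx, ddy


if __name__ == "__main__":
    print(quad_derivatives(4, 6))
```

Because the buffer holds computed values rather than input registers, the same approach works for derivatives of intermediate results, at the cost of holding all four pixels until the quad is complete.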
 