I'd just like to post an (originally short, now long) summary of the current information, reliable or not, that we have regarding the NV4x's "pipeline" technology. None of this information is guaranteed at all, and it's likely that at least some of it, if not all of it, is wrong. IMO, though, it's the most logical and up-to-date info you'll find. Most of it has already been posted at nV News in the NV40 thread.
To start simple, the NV41 will be marketed as a 6-pipeline design; as the NV4x technology assumes usage of the FP units for part of the texture addressing, each "pipeline" will have to be able to access 2 texture lookup units; otherwise, half of this FP unit would be wasted.
Also, considering NVIDIA has the technology to "double" "pipelines" (2x2->4x1 for the NV31/NV34, for example) by what some people call "double-pumping", and considering NVIDIA's marketing practices, it seems logical it is only a "6 pipelines" design in that specific peak case; the logical conclusion is that the NV41 is a 3x2/6x1, just like the NV31 was a 2x2/4x1.
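To make the "AxB" shorthand concrete, here's a tiny Python sketch of the arithmetic. It is only a toy model of the notation (A pixels per clock, B texture lookups per pixel); the `Config` class and the `double_pump` function are names I made up for illustration, not anything from NVIDIA, and the model says nothing about how the hardware actually does it.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Config:
    pixels: int    # pixels worked on per clock (the "A" in AxB)
    textures: int  # texture lookups per pixel per clock (the "B" in AxB)

    def __str__(self):
        return f"{self.pixels}x{self.textures}"

    @property
    def lookups_per_clock(self):
        # Total texture lookups per clock stays constant under double-pumping.
        return self.pixels * self.textures


def double_pump(cfg):
    """Hypothetical model: trade the second lookup per pixel for a second pixel."""
    if cfg.textures != 2:
        raise ValueError("this toy model only double-pumps Nx2 configurations")
    return Config(cfg.pixels * 2, 1)


nv31 = Config(2, 2)  # 2x2, as stated in the post
nv41 = Config(3, 2)  # 3x2, the speculated NV41 layout

print(nv31, "->", double_pump(nv31))
print(nv41, "->", double_pump(nv41))
```

The point is just that the number of texture lookups per clock doesn't change between the two modes; only the way they're spread over pixels does.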
Another possible factor which might make the "double-pumping" mode possible in the NV4x is a "no texturing" case. That would mean that even if there are 100 arithmetic instructions, the NV41 could operate as a 6x0 if the texturing units are not used, at all, in the shading program. Switching halfway in the program is absolutely out of the question, however, IMO. This part, however, is mostly speculation on my part regarding how the NV40 is expected to operate.
Just like the NV3x, the NV4x wouldn't really have "physical pipelines" (you've insisted well enough on that, Ail, hehe); those 3 "pipelines" would thus just be one "pixel processor" as NVIDIA likes to call it (in the NV3x, and to a lesser extent in the NV4x, it seems abusive to call that a processor, but whatever).
The bypass paths can then be explained by some specific logic being used to operate this "pool of units" in a specific way and order (although I believe it's not really a pool in practice; NVIDIA's marketing loves to call it that, though). For example, in the case of the 4x1 path of the NV31, this path operates as if there were two textures for one pixel, while in fact it then interprets this information as if it were one texture per pixel for two pixels. In the case of the NV30/NV35/NV38, the "bypass path" logic would be much simpler; it'd simply order not to send any information to any arithmetic or texturing unit (and probably not to some other stuff too, as it's not capable of 8 pixels/clock for untextured solid color triangles!)
Regarding register usage, I would tend to believe an architecture similar to the NV3x is being used, but it is likely that: a) more registers are available and b) certain operations are done in fewer cycles, so fewer registers are required. It is also possible some registers could be freed once they're never going to be used again, in order to send new pixels into the pipelines even though none are out of it yet (certain pixels would thus be reserving more registers than others); whether that's the case, I've got no idea, as that part is just speculation.
If you've got no idea why there's any sort of register usage penalty in the NV3x, I suggest reading 3DCenter's "NV30 Inside" article.
I've got no idea how many VS units the NV41 has; I assume that number to be between 2 and 4, however, and 2 is the most likely one IMO. I'm also assuming the NV42 to be a 2x2/4x1 architecture with one VS unit, but that's just me, and I'd in fact be surprised if it was as simple as that, hehe! Also, it is expected each VS unit has its own dedicated texture lookup unit.
Getting back to the NV40. Basically speaking, it's a double NV41. That means its one pixel pipeline is operating on 6 pixels, or 12 under certain conditions. Actually, that's not certain; it could be two pixel pipelines, each operating on 3 pixels, or three, each operating on 2 pixels; but I personally find that significantly less logical.
Problem is, though, that the original and reliable rumors told us it's an 8x2/16x1; and nothing made sense at that point. And then, people realized there were 4 VS units, and that 12+4 = 16. Seems to make sense, don't you think?
To make myself clearer, here's a very schematic view of the NV2x and NV3x, with C = Cache, F = Fixed Function logic and P = Programmable logic.
INPUT->C->P,VS->C->F->C->P,PS->C->OUTPUT
As you see, there are two caches between the VS and PS, the only two programmable parts of the GPU's pipeline. One is before rasterization and triangle setup, one is after; the first and most important one stores transformed vertices in a FIFO way (First-In-First-Out). If the PS programs are extremely complex, and the VS programs are not, this cache will be full and all VS units will be idled. Tens of millions of transistors will be wasted every single passing clock. The opposite is also possible, with the PS units being idled.
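For anyone who wants to see the stall happen, here's a rough Python sketch of a bounded FIFO sitting between a fast producer (the VS side) and a slow consumer (the PS side). Everything in it is an assumption for illustration: the `simulate` function, the rates, and the FIFO capacity are made-up numbers, and it ignores that vertices and pixels aren't the same unit of work.

```python
from collections import deque


def simulate(clocks, vs_rate, ps_rate, fifo_capacity):
    """Toy model of a bounded FIFO between VS and PS.

    Each clock the PS side drains up to ps_rate entries, then the VS side
    pushes up to vs_rate entries if there is room. Returns how many VS
    "unit-clocks" were spent idle because the FIFO was full.
    """
    fifo = deque()
    idle_unit_clocks = 0
    for clock in range(clocks):
        # PS side: consume whatever it can this clock.
        drained = min(ps_rate, len(fifo))
        for _ in range(drained):
            fifo.popleft()
        # VS side: produce only as much as the FIFO can still hold.
        free = fifo_capacity - len(fifo)
        produced = min(vs_rate, free)
        fifo.extend([clock] * produced)
        idle_unit_clocks += vs_rate - produced
    return idle_unit_clocks


# Complex PS program (slow consumer): the VS units end up mostly idle.
print("idle VS unit-clocks, slow PS:", simulate(100, 4, 1, 16))
# Balanced load: nothing is wasted.
print("idle VS unit-clocks, balanced:", simulate(100, 4, 4, 16))
```

Once the FIFO fills up, the producer can only run at the consumer's speed, no matter how many VS units there are; that's the waste the paragraph above is talking about.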
From my understanding, the NV40 (but not the NV41/NV42) is likely to fix the first problem (VS idled), but not the second one. The idea is it can send "pixels" into the "traditionally VS" pipeline (and, yes, keep in mind that just as on the NV3x AFAIK, it's just ONE pipeline for the VS, with each unit in the pipeline working on X vertices, resulting in an "effective" X VS pipelines). It would operate in this manner only when the "post-VS" vertex cache is full, or near-full, obviously.
Another consideration is how, in all NV4x products, the VS pipelines will use their texture lookup units. As I said before, in the NV3x and most likely NV4x, you need two texture lookup units to use the FP unit's potential to its fullest; perhaps it's possible not to have these restrictions, if the NV4x is much more of a scalar-based chip (which I doubt, really, and even then it seems better to use Vec4 when possible).
But if that's not the case, those 4 vertex units arranged in a 4x1 fashion would have to be rearranged in a 2x2 in order to make best usage of its texturing abilities.
That's where my "even if there are 100 arithmetic instructions, the NV41 could operate as a 6x0 if the texturing units are not used, at all, in the shading program" speculation comes from; it seems logical the VS pipelines would work as 4x0 most of the time, and having to use 2x2/2x0 whenever you've got loopback seems, well, strange (and stupid IMO). Also, it seems obvious the vertex pipeline is required to be able to work in a 4x1 fashion without loopback, as the pixel pipeline is, and that's the only way you can get to the 16x1 number.
It is however possible that this "Xx0 with arithmetic" mode would only be usable with the VS, and it could also be possible it only exists for the VS pipeline, or it might not exist at all. If it doesn't, then the NV40 would fundamentally be a "2 VS" design, but perhaps with bypass "T&L" paths in order to "emulate" 4 VS units there.
BTW, this brings us back to the NV30/NV35/NV38, which are capable of getting *several times* the FF lighting performance of all other parts on the market. I do not personally believe added FF units are likely, although they're possible. My explanation for this is that the NV30 A) has "FF bypass" modes and B) might be capable of (ab)using the PS arithmetic units if they aren't used at all (texturing only). Has anyone even ever tested FF lighting performance when using a very short PS program? I don't think so... I probably should, but you all know how lazy I am by now, I assume.
---
In conclusion, my current belief basically is the NV40 is a 6x2 design, which can be "double-pumped" into a 12x1 design, just like the NV31/NV34 could go from 2x2 to 4x1. It can however (ab)use the VS (which is a 4x1 pipeline) to become an 8x2, or, when "double-pumped", a 16x1.
The NV41 is "half an NV40", but does not inherit its VS (ab)using abilities. It is thus simply a 3x2/6x1.
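If you want to check the numbers in that conclusion, here's the tally as a few lines of Python. Again, this is just bookkeeping on the "AxB" shorthand under my own speculation; `double_pump`, `borrow_vs`, and `fmt` are hypothetical helper names, and treating the 4x1 VS as a 2x2 in the two-texture mode is the assumption discussed earlier, not a known fact.

```python
def double_pump(cfg):
    """Toy model: an Nx2 configuration becomes 2Nx1."""
    pixels, textures = cfg
    if textures != 2:
        raise ValueError("this toy model only double-pumps Nx2 configurations")
    return (pixels * 2, 1)


def borrow_vs(ps, vs):
    """Add the VS units to the pixel side; both must be in the same mode."""
    if ps[1] != vs[1]:
        raise ValueError("PS and VS must use the same lookups-per-pixel mode")
    return (ps[0] + vs[0], ps[1])


def fmt(cfg):
    return f"{cfg[0]}x{cfg[1]}"


nv40_ps = (6, 2)  # speculated NV40 pixel side
nv40_vs = (2, 2)  # the 4x1 VS rearranged as 2x2 (assumption)
nv41_ps = (3, 2)  # "half an NV40", no VS borrowing

print("NV40 alone:  ", fmt(nv40_ps), "/", fmt(double_pump(nv40_ps)))
print("NV40 with VS:", fmt(borrow_vs(nv40_ps, nv40_vs)), "/",
      fmt(borrow_vs(double_pump(nv40_ps), double_pump(nv40_vs))))
print("NV41:        ", fmt(nv41_ps), "/", fmt(double_pump(nv41_ps)))
```

That gives 6x2/12x1 for the NV40 on its own, 8x2/16x1 once the VS is counted in, and 3x2/6x1 for the NV41, which is where the rumored 8x2/16x1 figure would come from if this theory holds.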
And considering just how messy this is getting, and just how much messier it'll be in the NV5x generation, couldn't we just stop talking about this pipeline shit? I'm hardly the only, or first, person to think that, obviously. Even NVIDIA should be moving away from this notion in the (near) future.
Uttar