NV40: 6x2/12x1/8x2/16x1? Meh. Summary of what I believe

Arun

I'd just like to post an (originally short, now long) summary of the current information, reliable or not, we have regarding the NV4x's "pipeline" technology. None of this information is guaranteed at all, and it's likely at least some of it, if not all of it, is wrong. IMO, though, it's the most logical and up-to-date info you'll find :) Most of it has already been posted at nV News in the NV40 thread.

To start simple, the NV41 will be marketed as a 6-pipeline design; as the NV4x technology assumes usage of the FP units for part of the texture addressing, each "pipeline" will have to be able to access 2 texture lookup units; otherwise, half of this FP unit would be wasted.
Also, considering NVIDIA has the technology to "double" "pipelines" (2x2->4x1 for the NV31/NV34, for example) by what some people call "double-pumping", and considering NVIDIA's marketing practices, it seems logical it is only a "6-pipeline" design in that specific peak case; the logical conclusion is that the NV41 is a 3x2/6x1, just like the NV31 was a 2x2/4x1.
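For anyone who finds the "AxB" shorthand confusing, here is a minimal Python sketch of the bookkeeping. It assumes that "AxB" means A pixels per clock with B texture lookups per pixel, and that "double-pumping" simply trades half the lookups per pixel for twice the pixels. The class `PipeConfig` and its method names are made up for illustration; nothing here describes real hardware.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PipeConfig:
    """An "AxB" figure: A pixels per clock, B texture lookups per pixel."""
    pixels: int
    textures: int

    @property
    def texture_units(self) -> int:
        # Total lookup units implied by the figure; unchanged by double-pumping.
        return self.pixels * self.textures

    def double_pumped(self) -> "PipeConfig":
        # Hypothetical model: half the lookups per pixel, twice the pixels.
        if self.textures % 2 != 0:
            raise ValueError("need an even number of lookups per pixel")
        return PipeConfig(self.pixels * 2, self.textures // 2)

    def __str__(self) -> str:
        return f"{self.pixels}x{self.textures}"


nv31 = PipeConfig(2, 2)
nv41 = PipeConfig(3, 2)  # the speculated NV41 base mode
for name, cfg in (("NV31", nv31), ("NV41", nv41)):
    print(name, cfg, "->", cfg.double_pumped(), "with", cfg.texture_units, "lookup units")
```

The point is only that the number of lookup units stays the same in both modes, which is why the "6 pipelines" figure would be a peak case.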

Another possible factor which might make the "double-pumping" mode possible in the NV4x is a "no texturing" case. That would mean that even if there are 100 arithmetic instructions, the NV41 could operate as a 6x0 if the texturing units are not used, at all, in the shading program. Switching halfway in the program is absolutely out of the question, however, IMO. This part, however, is mostly speculation on my part regarding how the NV40 is expected to operate.

Just like the NV3x, the NV4x wouldn't really have "physical pipelines" (you've insisted well enough on that, Ail, hehe); those 3 "pipelines" would thus just be one "pixel processor" as NVIDIA likes to call it (in the NV3x, and to a lesser extent in the NV4x, it seems abusive to call that a processor, but whatever).

The bypass paths can then be explained by some specific logic being used to operate this "pool of units" (not really a pool in practice, I believe, although NVIDIA's marketing loves to call it that) in a specific way and order; for example, in the case of the 4x1 path of the NV31, the hardware operates as if there were two textures for one pixel, and then interprets this information as one texture per pixel for two pixels. In the case of the NV30/NV35/NV38, the "bypass path" logic would be much simpler; it'd simply order not to send any information to any arithmetic or texturing unit (and probably not to some other stuff either, as it's not capable of 8 pixels/clock for untextured solid color triangles!)
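To make the "two textures for one pixel, reinterpreted as one texture for two pixels" idea concrete, here is a rough Python sketch. It assumes an NV31-like arrangement of 2 lanes with 2 texture fetch slots each; the function `schedule` and its parameters are hypothetical and only model which (pixel, texture) pair lands in which slot.

```python
def schedule(pixels, textures_per_pixel, lanes=2, slots_per_lane=2):
    """Pack (pixel, texture_index) fetches into clocks of lanes x slots."""
    # Every pixel needs one fetch per texture it samples.
    work = [(p, t) for p in pixels for t in range(textures_per_pixel)]
    per_clock = lanes * slots_per_lane
    clocks = []
    for start in range(0, len(work), per_clock):
        chunk = work[start:start + per_clock]
        # Split the clock's fetches across lanes, slots_per_lane at a time.
        clocks.append([chunk[i:i + slots_per_lane]
                       for i in range(0, len(chunk), slots_per_lane)])
    return clocks


dual_texture = schedule([0, 1, 2, 3], textures_per_pixel=2)    # "2x2" behaviour
single_texture = schedule([0, 1, 2, 3], textures_per_pixel=1)  # "4x1" behaviour
print(len(dual_texture), "clocks for 4 dual-textured pixels")
print(len(single_texture), "clock for 4 single-textured pixels")
print(single_texture[0])
```

In the single-texture case each lane's two slots carry two different pixels, which is the reinterpretation described above.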

Regarding register usage, I would tend to believe an architecture similar to the NV3x is being used, but it is likely that: a) more registers are available and b) certain operations are done in fewer cycles, so fewer registers are required. It is also possible some registers could be freed once they're never going to be used again, in order to send new pixels into the pipelines even though none are out of it yet (certain pixels would thus be reserving more registers than others); whether that's the case, I've got no idea, as that part is just speculation.
If you've got no idea why there's any sort of register usage penalty in the NV3x, I suggest reading 3DCenter's "NV30 Inside" article.

I've got no idea how many VS units the NV41 has; I assume that number to be between 2 and 4, however, and 2 is the most likely one IMO. I'm also assuming the NV42 to be a 2x2/4x1 architecture with one VS unit, but that's just me, and I'd in fact be surprised if it was as simple as that, hehe! Also, it is expected each VS unit has its own dedicated texture lookup unit.

Getting back to the NV40. Basically speaking, it's a double NV41. That means its one pixel pipeline is operating on 6 pixels, or 12 under certain conditions. Actually, that's not certain; it could be two pixel pipelines each operating on 3 pixels, or three each operating on 2 pixels, but I personally find that significantly less logical.

Problem is, though, that the original and reliable rumors told us it's an 8x2/16x1, and nothing made sense at that point. And then people realized there were 4 VS units, and that 12+4 = 16. Seems to make sense, don't you think?

To make myself clearer, here's a very schematic view of the NV2x and NV3x, with C = Cache, F = Fixed Function logic and P = Programmable logic.
INPUT->C->P,VS->C->F->C->P,PS->C->OUTPUT

As you see, there are two caches between the VS and PS, the only two programmable parts of the GPU's pipeline. One is before rasterization and triangle setup, one is after; the first, most important one stores transformed vertices in a FIFO way (First-In-First-Out). If the PS programs are extremely complex and the VS programs are not, this cache will fill up and all VS units will sit idle. Tens of millions of transistors will be wasted every single clock. The opposite is also possible, with the PS units sitting idle.
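The stall can be illustrated with a toy simulation in Python. This is only a sketch under loud assumptions: one FIFO, the VS producing a fixed number of vertices per clock, and the downstream side (setup plus PS) taking a fixed number of clocks per vertex. The function `simulate` and all of its parameters are invented for illustration and say nothing about real cache sizes or rates.

```python
from collections import deque


def simulate(clocks, fifo_capacity, vs_per_clock, ps_clocks_per_vertex):
    """Return (vs_idle_clocks, ps_idle_clocks) for a toy VS -> FIFO -> PS model."""
    fifo = deque()
    vs_idle = ps_idle = 0
    ps_busy = 0      # clocks left on the item the consumer is working on
    next_vertex = 0
    for _clock in range(clocks):
        # Consumer side: start a new item if free, else keep working.
        if ps_busy == 0 and fifo:
            fifo.popleft()
            ps_busy = ps_clocks_per_vertex
        if ps_busy > 0:
            ps_busy -= 1
        else:
            ps_idle += 1
        # Producer side: the VS can only emit into free FIFO entries.
        free = fifo_capacity - len(fifo)
        if free == 0:
            vs_idle += 1  # FIFO full: the whole VS stage stalls this clock
        else:
            for _slot in range(min(vs_per_clock, free)):
                fifo.append(next_vertex)
                next_vertex += 1
    return vs_idle, ps_idle


print("heavy PS:", simulate(100, fifo_capacity=8, vs_per_clock=1, ps_clocks_per_vertex=10))
print("balanced:", simulate(100, fifo_capacity=8, vs_per_clock=1, ps_clocks_per_vertex=1))
```

With a slow consumer the VS idle count dominates the run; with matched rates the VS never stalls in this model.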

From my understanding, the NV40 (but not the NV41/NV42) is likely to fix the first problem (VS idled), but not the second one. The idea is it can send "pixels" into the "traditionally VS" pipeline (and, yes, keep in mind that just as on the NV3x AFAIK, it's just ONE pipeline for the VS, with each unit in the pipeline working on X vertices, resulting in an "effective" X VS pipelines). It would operate in this manner only when the "post-VS" vertex cache is full, or near-full, obviously.
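As a sketch of that policy (and nothing more), here is how the "send pixels to the VS only when the post-VS cache is full or near-full" rule could be written down in Python. The 12 and 4 pipe counts come from the double-pumped figures discussed in this post; the 0.9 threshold and the name `pixel_width` are pure assumptions for illustration.

```python
def pixel_width(fifo_fill, fifo_capacity, ps_pipes=12, vs_pipes=4, threshold=0.9):
    """Pixels per clock available under the speculated VS re-tasking rule."""
    # Hypothetical rule: the VS only takes pixel work when it would stall anyway.
    if fifo_fill >= threshold * fifo_capacity:
        return ps_pipes + vs_pipes
    return ps_pipes


print(pixel_width(fifo_fill=2, fifo_capacity=8))  # VS still busy with vertices
print(pixel_width(fifo_fill=8, fifo_capacity=8))  # cache full, VS helps with pixels
```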

Another consideration is how, in all NV4x products, the VS pipelines will use their texture lookup units. As I said before, in the NV3x and most likely NV4x, you need two texture lookup units to use the FP unit's potential to its fullest; perhaps it's possible not to have these restrictions, if the NV4x is much more of a scalar-based chip (which I doubt, really, and even then it seems better to use Vec4 when possible).
But if that's not the case, those 4 vertex units arranged in a 4x1 fashion would have to be rearranged in a 2x2 in order to make best usage of its texturing abilities.

That's where my "even if there are 100 arithmetic instructions, the NV41 could operate as a 6x0 if the texturing units are not used, at all, in the shading program" speculation comes from; it seems logical the VS pipelines would work as 4x0 most of the time, and having to use 2x2/2x0 whenever you've got loopback seems, well, strange (and stupid IMO). Also, it seems obvious the vertex pipeline is required to be able to work in a 4x1 fashion without loopback, as the pixel pipeline is, and that's the only way you can get to the 16x1 number.
It is however possible that this "Xx0 with arithmetic" mode would only be usable with the VS, and it could also be possible it only exists for the VS pipeline, or it might not exist at all. If it doesn't, then the NV40 would fundamentally be a "2 VS" design, but perhaps with bypass "T&L" paths in order to "emulate" 4 VS units there.

BTW, this brings us back to the NV30/NV35/NV38 which are capable of getting *several times* the FF lighting of all other parts on the market. I do not personally believe added FF units are likely, although they're possible. My explanation to this is that the NV30 has A) "FF bypass" modes and B) might be capable of (ab)using the PS arithmetic units if they aren't used at all (texturing only). Has anyone even ever tested FF lighting performance when using a very short PS program? I don't think so... I probably should, but you all know how lazy I am by now, I assume ;)

---

In conclusion, my current belief basically is the NV40 is a 6x2 design, which can be "double-pumped" into a 12x1 design, just like the NV31/NV34 could go from 2x2 to 4x1. It can however (ab)use the VS (which is a 4x1 pipeline) to become an 8x2, or, when "double-pumped", a 16x1.
The NV41 is "half a NV40", but does not inherit its VS (ab)using abilities. It is thus simply a 3x2/6x1.
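For those who want to check the arithmetic, here is a small Python sketch. It treats each block as a (pixels, textures per pixel) tuple and assumes, as speculated earlier in this post, that the 4x1 VS pipeline can be rearranged as 2x2. The helper names `combine` and `double_pump` are made up for illustration.

```python
def combine(*blocks):
    """Add up pixels/clock over blocks with the same textures-per-pixel figure."""
    textures = {t for _, t in blocks}
    if len(textures) != 1:
        raise ValueError("blocks must share the same textures-per-pixel figure")
    return (sum(p for p, _ in blocks), textures.pop())


def double_pump(block):
    """Hypothetical model: half the lookups per pixel, twice the pixels."""
    pixels, textures = block
    if textures % 2 != 0:
        raise ValueError("need an even number of lookups per pixel")
    return (pixels * 2, textures // 2)


nv40_ps = (6, 2)
nv40_vs = (2, 2)  # assumption: the 4x1 VS pipeline rearranged as 2x2
nv41_ps = (3, 2)  # no VS re-tasking on the NV41 in this theory

print("NV40 PS only:", nv40_ps, "->", double_pump(nv40_ps))
print("NV40 PS + VS:", combine(nv40_ps, nv40_vs), "->",
      combine(double_pump(nv40_ps), double_pump(nv40_vs)))
print("NV41:", nv41_ps, "->", double_pump(nv41_ps))
```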

And regarding just how messy this is getting, and just how much more messy it'll be in the NV5x generation, couldn't we just stop talking of this pipeline shit? I'm hardly the only, or first, person to think that, obviously. Even NVIDIA should be moving away from this notion in the (near) future.


Uttar
 
Uttar said:
In conclusion, my current belief basically is the NV40 is a 6x2 design, which can be "double-pumped" into a 12x1 design, just like the NV31/NV34 could go from 2x2 to 4x1.

Now, who was it that I suggested that to recently...?
 
Yah - a lot of people seem very sure about the 16-"somethings"/clk specification, even at this rather late stage where NDAs have been signed, there have been presentations/demos etc.

I must admit, the reported transistor count of 175 million seems to favour the 6x2 + pixel processing in VS idea more than the one involving 4 NV31-esque blocks that can each operate on a quad at a time. 16x1 would be remarkable - I'm sure it's just the same, non-textured "zixel" mode we're used to.

MuFu.
 
Well, I'll just point out, concerning one part of the speculation, that 8(p)+4(v) = 12, and 4(p)+2(v) = 6.

This seems to naturally lead to a viable alternative explanation, where 16 still means 16x0/8x2, but a "12" is the result when the vertex processing is re-tasked for 8x2 type usage. One factor that makes this seem more reasonable (to me) is transistor budget: the presumed cost of register file improvements would be significant, and would be magnified by pipelining ("parallelism") capability. Even if the "x0" capability isn't affected, this is a valuable benefit.

Also, I've been wondering if the primary usefulness of such odd and cumbersome looking parallelism grouping might be aimed (in terms of the most efficiency, not as their only function) at resolving conditional execution rather than general case execution speed. This was a thought around the "12 pipeline" R420 rumors, wondering how the R3xx would be adapted for new functionality, but it seems applicable to the NV4x as well while it is trying to solve a register file problem at the same time.
 
Uttar, I read your post at nvNews and here, and was wondering: if the PS/VS units are so similar (identical), why do the VS units not have register problems (forgive me if register usage in the VS isn't a problem)? Also, why wouldn't nVidia create 4? blocks of PS/VS units (16 shaders/pipes assuming each block works on a quad) that could each be used for PS/VS depending on the workload?
 
But 6 is not a 'natural' digital number.
It's not in the 2 4 8 16 32 64 series.
Therefore it's impossible & the NV40 will not exist, just like the R300 with its unnatural fp24 shaders :rolleyes:
 
Uttar,

I still think that architectures have become so complex these days that it's very likely you'll never hit the correct spot while speculating.

I am opposed to the physical pipeline issue this much because (a) it just adds more confusion to the mix and (b) IHVs abuse the terminology for exaggerated marketing hype that bears little to no resemblance to reality. A perfect example for (b) is the V8 Duo claiming 16 pipelines, whereby pipelines are most likely TMUs, whereby they're useless anyway due to bandwidth constraints etc etc.

I care what comes out at the other end; in fact to be honest whether 4*2 or 8*1 in the recent past it doesn't make much difference to me and I don't think it's the real spot where in any case advantages or disadvantages could be detected. In fact I'd dare to say that it would be entirely possible to have a 4*2-alike design and not carry the same weaknesses as NV30 f.e. did.

I can see all kinds of speculations regarding NV40 and R420, and while from the stuff that circulates in the rumour mill, tidbits like 3*NV35 or 3*RV360 could actually make some sense in the end, outside of those every piece of "information" is actually more misleading than of any real value.

***edit: I would suggest to not underestimate the NV41 in terms of VS capabilities that early. NVIDIA might have learned a significant lesson with the 5700U. Just a simple suggestion....
 
Please don't think I'm being pretentious or whatever by replying on a per-person basis, just makes it easier for me :) Anyway...

Demalion: Possible, but that implies, then, that the NV41 wouldn't be "half a NV40" (which is what nV claims it is - they could be oversimplifying though) and the NV40 could be marketed as more than a 16-pipeline design (which it isn't AFAIK). Also, either VS and PS would be different functionality-wise, or the VS could then be described as 4x2; I hardly see how that makes sense, really. That many dedicated texture lookup units for the VS seems like complete and utter waste to me :(

MuFu: Why do you say it should be 16x0? The NV34 is 2x2/4x1, and has only 45M transistors. I doubt this "double-pumping" functionality is all that expensive, really. But 12x0/16x0 is still very possible indeed, sadly...

Lost: I doubt they can do that, although it's certainly theoretically possible. But besides the workstation market, where T&L is king, it seems like a waste to me, once again. Keep in mind this is still only "Pipeline Level Distributed Processing", if I can permit myself to invent yet another boring term. ILDP would still be reserved for the NV50, or later, depending on whether things go right and whether my info is right. ILDP being such an imprecise term, this might very well be what NV is referring to, who knows...

Ail: The problem is many sources are NOT technical. That means if you try to tell them there are anything more than textures and physical pipelines in GPUs, they'll probably just tell ya it's way over their head and they can't tell you. Of course, if you're lucky enough to have an engineer source, that's different, but it's rarer from my experience.
IMO, it is thus required to try to make "technical sense" out of those 'physical pipelines' in a XxX way, as well as from marketing speech.

In the case of XGI, their marketing had little precedent (well, they did, but we had thought maybe they would change their ways thanks to their new name; how naive we've been, there also!). It was thus extremely hard to guess what they meant by "pipelines"; although, had we assumed they were the same people who marketed the 2x4 Xabre as a 4x2, we could have assumed it, most likely.

With NVIDIA, their definition of pipelines roughly seems to be "whatever our (?)ixel peak output is". Whether that is pixels or zixels, they don't seem to care though, so you can never be perfectly sure of that part :?

Personally, I think most of the crucial information is missing anyway, as we don't have any arithmetic unit information. At all. I agree with you that a 4x2 could be pretty much as good as an 8x1, if not better if it had more arithmetic units. But this way, at least, we know the number of texturing units as well as the peak outputs. Better than nothing IMO.


Uttar
 
Uttar, you are suggesting that making an 8x1/4x2 is no harder than making a 4x1/2x2, but I think that's not true - at all.
The thing is that all GPUs work on quads.

That means that an 8x1 design has to work on two quads in parallel, but these two quads might have different LOD, angle of anisotropy, etc., and therefore could take a different number of cycles to execute.

That's why the R300 has two completely parallel 4x1 units, and I think that's pretty much the only solution to go above the 4 pixel/clock throughput.
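A toy comparison in Python shows why quads with different costs hurt a design that has to keep two quads in lockstep. The per-quad cycle counts are invented, and both functions are hypothetical models for illustration, not a description of how R300 or any NVIDIA part actually schedules work.

```python
def lockstep_cycles(quad_costs):
    """Two quads issued together; the pair finishes when the slower one does."""
    total = 0
    for i in range(0, len(quad_costs), 2):
        total += max(quad_costs[i:i + 2])
    return total


def independent_cycles(quad_costs):
    """Two independent 4x1 units; each takes the next quad as soon as it is free."""
    units = [0, 0]  # clock at which each unit becomes free
    for cost in quad_costs:
        idx = units.index(min(units))
        units[idx] += cost
    return max(units)


# Made-up costs: some quads need 4 cycles (say, heavy anisotropic filtering).
costs = [1, 4, 1, 4, 1, 1]
print("lockstep:   ", lockstep_cycles(costs), "cycles")
print("independent:", independent_cycles(costs), "cycles")
```

For these costs the lockstep model needs 9 cycles and the independent model 6, because the cheap quads no longer wait for the expensive ones.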

Of course it is possible that NV40 has a similar design to the R300 in this...
 
Hyp-X said:
Of course it is possible that NV40 has a similar design to the R300 in this...

It most likely does, although I must admit I've got no idea how they're going to do quad operations. After people pointed out some more facts to me, it seems like (fundamentally, keep in mind there's most likely double pumping) 3x2(PS)+3x2(PS)+2x2(VS) makes sense. This would also imply that if NVIDIA intends to use some broken NV40s (and yes, you've suggested that before :p) as NV41s, then it would need to have as much VS power/clock.


Uttar

P.S.: Keep in mind a smarter design is required in that POV for Shaders 3.0 if you want good performance, due to Dynamic Branching.
 
R300 / R350 / R360 already operate on two quads permanently.

NV35 does in its Z/stencil mode. However, if NV40's pipelines bear more similarities to NV31 (/ NV36) than NV30 (/ NV35) then I would guess on more.
 
Uttar, I think you're tagging on some assumptions to my comment that don't seem necessary.

For one thing, I don't propose that it changes the 16x0. I don't expect beyond a 256-bit bus, and I think 8x>0 / 16x0 goes with that. Also, I don't see much benefit because the "16x0" would seem likely to be vertex processing limited.

For another thing, there need not be extra texture unit demands beyond what is proposed for shader model 3.0 functionality for vertex shaders.

Finally, the pixel pipelines could be "less stacked" than before...i.e., floating point processing "array 'microunits'" dedicated to texture capable "unit arrays" (parallelism) that go with TMUs, like the 4 "root" (or "gatekeeper", I think?) units for NV3x.


Overall transistor budget increase:

Less array "depth" (pipelining), which should reduce register transistor cost per parallel stream.
Some of these array resources allocated to go with additional TMUs, perhaps simplified to deal with one texture sample alone if necessary (I don't know the overall cost for TMUs, or savings for such a reduction, but looking at NV25->NV30, it doesn't seem to be the most significant cost...so I don't think it is necessary).
The vertex shader units are (the best case) just another "quad" like the above. The benefit is efficient PS/VS 3.0 level functionality for the general case usage of the featureset, and the ability to be allocated interchangeably upon the completion of a program for a quad, depending on remaining workload (looking at the "caches" as you mentioned). If it is worse than this, and they need to "borrow" pixel processing texture units, they could still provide a computational benefit along the lines of my nebulous "conditional" musings, but this approach would be a problem with the NV40's functionality in any case.

With the first step, this looks feasible within the transistor budget. It doesn't seem to require more processing capability than a full 8 pixel pipeline card (or, "2 pixel quad" if you prefer), which looks to be necessary to be competitive. What it requires is more cleverness and engineering finesse, but of a fairly narrow scope (all seems to relate to their register/pipelining management solution to the NV3x problems, which would seem to be their focus of necessity).
 
Ailuros, what would Nvidia learn from the 5700U? Is it weak in a sense or does it not have enough VS units?
 
I thought it had the same vertex power, as both are clocked the same and use the same three vertex shaders...?
 