NV35 pipeline organization

LeStoffer · May 14, 2003

Luminescent said:
Wow, could it be there are Nvidia lurkers here?

Absolutely. The company just doesn't seem too keen on letting them post here.

Xmas · May 14, 2003

Uttar said:
8 FX12 MUL/ADD units
8 FX12 MUL units

The FX12 MUL units have the use of enabling 8 LRP ops/clock in FX12 mode instead of 4. They are also less sophisticated than the other parts of the pipeline: heck, they can't even be used when the op is dependent, contrary to the rest of the pipeline. They're obviously different, but I'm wondering in which way they are...

They are most likely register combiner units as described in NV_register_combiner extension specs.
They take four inputs (each with modifiers: negate, bias, invert, etc.) and provide up to 3 outputs:
- (A * B), (C * D), ((A * B) + (C * D))
- (A * B), (C * D), (A * B) or (C * D) conditional
- (A dot B), (C dot D)
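To make the three output modes listed above concrete, here is a minimal sketch of one combiner stage. It is only an illustration of the list in this post, not the real hardware or the extension's API: the function name `combiner`, the `mode` strings, and the `select_cd` flag are made up for the example, and input modifiers (negate, bias, invert) and clamping are left out.

```python
# Hypothetical model of a single register-combiner stage with four inputs.
# Names and the "mode"/"select_cd" parameters are assumptions for illustration.

def _mul(a, b):
    """Component-wise product of two equal-length tuples."""
    return tuple(x * y for x, y in zip(a, b))


def _dot(a, b):
    """Dot product of two equal-length tuples."""
    return sum(x * y for x, y in zip(a, b))


def combiner(a, b, c, d, mode="sum", select_cd=False):
    """Return the (up to three) outputs of one combiner stage.

    mode="sum": (A*B, C*D, A*B + C*D)
    mode="mux": (A*B, C*D, either A*B or C*D depending on select_cd)
    mode="dot": (A dot B, C dot D, None)
    """
    if mode == "dot":
        return _dot(a, b), _dot(c, d), None
    ab, cd = _mul(a, b), _mul(c, d)
    if mode == "sum":
        third = tuple(x + y for x, y in zip(ab, cd))
    elif mode == "mux":
        third = cd if select_cd else ab
    else:
        raise ValueError("unknown mode: " + mode)
    return ab, cd, third
```

A LRP (a*t + b*(1-t)) maps onto the "sum" mode with A=a, B=t, C=b, D=1-t, which is presumably why extra units of this kind would help LRP throughput.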

Arun · May 14, 2003

Hmm, yes, you're probably right Xmas. Although it wouldn't be the same register combiners as in the NV25, since it's FX12 and not FX9.

Uttar

Luminescent · May 14, 2003

If it is true that in NV35, only the fp32/tex units have ddx/ddy filtering capability, what about all other special functions in the remaining fp32 shaders? They won't be capable of cos/sin/rsq, etc.?

Arun · May 14, 2003

Luminescent said:
If it is true that only the fp/tex unit has ddx/ddy filtering capability, what about all the other special functions? Does this mean the other fp units will not be capable of cos/sin/rsq, etc.?

No idea. I'd be surprised if it couldn't do most of them, though. I guess we'll need more info on all those things.

Uttar

Basic · May 14, 2003

One thing about ddx/ddy.
It seems dirt-simple to do, one of the arithmetically simplest instructions in the set. It's just a subtraction, and the result is shared between pipes. The only odd thing about it is that it needs data transferred between the pipes.

psurge · May 14, 2003

Basic: remember that interview with David Kirk stating that the NV30 performed most ops 4 pixels at a time (and could have up to 32 pixels in flight in the shader units)?

I strongly suspect that shader ops (including ddx/ddy) are executed on 2x2 pixel blocks - i.e. each "pipe" operates not on pixels, but on pixel stamps.

This makes obtaining the inputs for ddx/ddy instructions "easy", plus allows you to hide some latency of the shader ops (4 cycles sounds about right for a pipelined fmac).
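As a rough illustration of why a 2x2 stamp makes ddx/ddy "just a subtraction", here is a small sketch. The layout of the quad and the choice to difference each row and each column separately are assumptions for the example; nothing here is confirmed about how NV30/NV35 actually pair the pixels.

```python
# Sketch: screen-space derivatives of one value across a 2x2 pixel stamp.
# quad is [[top_left, top_right], [bottom_left, bottom_right]] (assumed layout).

def quad_derivatives(quad):
    """Return (ddx, ddy), each a 2x2 list, computed by neighbour subtraction."""
    (tl, tr), (bl, br) = quad
    # Horizontal difference: both pixels in a row share the same result.
    ddx = [[tr - tl, tr - tl], [br - bl, br - bl]]
    # Vertical difference: both pixels in a column share the same result.
    ddy = [[bl - tl, br - tr], [bl - tl, br - tr]]
    return ddx, ddy


if __name__ == "__main__":
    ddx, ddy = quad_derivatives([[1.0, 3.0], [2.0, 6.0]])
    print(ddx)
    print(ddy)
```

Each result is shared by two pixels of the stamp, which matches the earlier remark that the result is shared between pipes.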

Luminescent · May 14, 2003

In accordance with thepkrl's NV30 model and MDolnec's information, which claimed Nvidia basically scrapped the integer hardware for NV35, a single pixel pipeline of NV35 probably resembles this:

temporary registers (R0,R1,..,H0,H1,..)
|
FLOAT (perhaps does DDX/DDY for dependent fetches)
| \
| TEXTURE <-- f[TEX0],f[TEX1],.. (DDX/DDY is free)
| /
FLOAT <-- temp registers
|
FLOAT <-- temp registers
|
(loopback to temporary registers or output)

Where the bolded items represent the new additions and subtractions to the NV30 pipeline. Accordingly, an NV30 pipeline resembles this:

(thepkrl)
temporary registers (R0,R1,..,H0,H1,..)
|
FLOAT (perhaps does DDX/DDY for dependent fetches)
| \
| TEXTURE <-- f[TEX0],f[TEX1],.. (DDX/DDY is free)
| /
INTEGER <-- f[COL0],f[COL1]
|
INTEGER <-- f[COL0],f[COL1]
|
(loopback to temporary registers or output)

These models are in accordance with thepkrl's architectural findings and are consistent with the data we have to date.

Alongside that, from the referenced NV30 thread, I found this interesting:

(thepkrl)
When textures are fetched, nothing else can be done in the FLOAT/TEXTURE unit. The following INTEGER unit can use the fetched texture result freely. FLOAT/TEXTURE unit can perform one fetch if the coordinates come from a previous calculation (meaning 4 textures/cycle). If the coordinates come directly from inputs as in PS1.1, the unit can perform a pair of texture fetches (meaning 8 textures/cycle).

Basically, the texture unit can only make two fetches when the texture coordinates are given directly as inputs and there are no pending results from a previous operation (dependency). In NV30, one fp shader unit seems to share its resources with the texture unit (maybe ddx/ddy, as previously surmised), perhaps explaining NV35's 2fp+2tex vs. 3fp tradeoff.
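The 4 vs. 8 textures/cycle figures in thepkrl's quote follow from simple arithmetic, sketched below. The pipe count of 4 is the figure used elsewhere in this thread; the function name is made up for the example.

```python
# Sketch of the fetch-rate arithmetic from thepkrl's description.
PIPES = 4  # number of pixel pipelines assumed in this thread


def textures_per_cycle(coords_from_calculation, pipes=PIPES):
    """One fetch per pipe for dependent coordinates, a pair for direct inputs."""
    fetches_per_pipe = 1 if coords_from_calculation else 2
    return pipes * fetches_per_pipe


if __name__ == "__main__":
    print(textures_per_cycle(True))   # coordinates come from a previous calculation
    print(textures_per_cycle(False))  # coordinates come straight from inputs
```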

Luminescent · May 14, 2003

psurge:
I strongly suspect that shader ops (including ddx/ddy) are executed on 2x2 pixel blocks - i.e. each "pipe" operates not on pixels, but on pixel stamps.

Would this explain why shader benchmarks such as Rightmark (see pixel shader results here) run suboptimally on NV35 (considering its 12 fp shader units)? It seems either the drivers or the applications are not yet ready to properly distribute the fp ops to the 4 pipelines of 3 pipelined fp units each. The R3xx's pipeline is already configured to distribute all operations evenly, with 8 parallel fp shaders. Given its parallel nature, a pixel block is less necessary to make full use of resources (there is one shader assigned to at least 1 pixel every clock), while NV35, on the other hand, should require the pixel blocks to exploit the serial nature of its pipelined fp units.

If no more than 1 pixel can be fed to each pipeline for shading, 2 shaders will remain idle per pipeline. If a larger pixel block can be fed, all the units can be busily chewing on the supplied pixels. This is all, of course, just my reasoning; anyone with knowledge on the subject should feel free to correct me.
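The idle-unit argument can be written down as a toy model. This is only a sketch of the reasoning in this post, assuming each pipeline is 3 fp units in series and that a unit is busy only if there is a distinct pixel for it to work on; real scheduling is certainly more complicated.

```python
# Toy model: fp units arranged in series within one pipeline.
# A unit is assumed to be busy only when a separate pixel occupies it.

def idle_units(units_in_series, pixels_in_flight):
    """Number of fp units with nothing to do in a given clock."""
    busy = min(units_in_series, pixels_in_flight)
    return units_in_series - busy


def utilization(units_in_series, pixels_in_flight):
    """Fraction of the pipeline's fp units doing useful work."""
    return 1 - idle_units(units_in_series, pixels_in_flight) / units_in_series


if __name__ == "__main__":
    print(idle_units(3, 1))   # one pixel per pipeline
    print(idle_units(3, 4))   # a 2x2 block per pipeline
```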

Luminescent · May 15, 2003

Wow, I just ran into Uttar's NV35 leaked spec page, which pretty much confirms the 12 fp ops per clock number as the NV35's possible shader execution rate. Check it out:

Accelerated pixel shaders allow for up to 12 pixel shader operations/clock

Luminescent · May 15, 2003

I've been trying to justify Digit-life's low pixel shader 2.0 (Rightmark) scores all day and I think I've arrived at a conclusion. I arrived at it by contrasting the R350 and NV35 architectures through analogy (the following information should be a given to the techheads at B3D, but being that 3D hardware is only a hobby of mine, as of now, I've only given the subject serious thought recently).

Here is the didactic, mock scenario (in a world of conventional/stereotypical pipelines):

Let us say two vpu's (A & B) have equal clockspeeds. Vpu A has a fillrate of 3 while B has a fillrate of 1.5. Assuming a simple world, we can conclude that if vpu B has n pipelines, vpu A has 2n pipelines. Vpu A, however, has 2 texture units per pipeline while vpu B has 1 texture unit per pipeline (both are capable of loopback). Vpu A and B attain the same Gtexel rate; vpu A by pipelining its three units/pipe and vpu B through more extensive superscalar (8-way) execution with one texture unit per pipeline.

Now the benchmarks come rolling in. On benchmarks with extensive use of multitexturing, vpu A is put to use most efficiently: its texture units can be used simultaneously, and its resources are well employed. Vpu B will have to loop back to multitexture, although it will produce twice as many pixels when it outputs its final results, so the performance of vpu's A and B is relatively equal. When the heavy single-texturing benchmarks begin to arrive, vpu A struggles because it can only use one of its texture units per pipe and the extra logic is put to no use. Vpu B excels compared to A, because all its resources are employed efficiently and it has twice the amount of crucial resources that A does.
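A quick sketch of the benchmark argument, taking vpu A as the design with two texture units per pipe and vpu B as the one with twice as many single-texture-unit pipes, as this paragraph describes them. The concrete 4x2 and 8x1 configurations and the simple loopback model (one extra pass whenever a pixel needs more textures than the pipe has units) are assumptions for illustration.

```python
from math import ceil

# Assumed configurations: (pipelines, texture units per pipeline).
VPU_A = (4, 2)
VPU_B = (8, 1)


def pixels_per_clock(config, textures_per_pixel):
    """Pixel output rate when each pixel needs a given number of textures."""
    pipes, tmus_per_pipe = config
    # Loopback: one pass through the pipe per batch of tmus_per_pipe textures.
    passes = ceil(textures_per_pixel / tmus_per_pipe)
    return pipes / passes


if __name__ == "__main__":
    for n in (1, 2, 3, 4):
        print(n, pixels_per_clock(VPU_A, n), pixels_per_clock(VPU_B, n))
```

Under this model the two designs tie whenever the texture count is even, and the 8x1 layout pulls ahead on odd counts, single texturing most of all.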

By equating textures to instructions and texture units to fp shader units:

We can observe NV35 suffering the same fate endured by vpu A. NV35 may contain 2 fp shaders per pipeline, but if the benchmark's pixel shaders consist of short instruction counts, the vpu will incur a disadvantage. Each pixel will need less shading (fewer instructions), and unless those instructions come in multiples of two, NV35 will not effectively use its 2 fp units per pipeline. Each pipeline's resources will have a greater likelihood of being unemployed, as opposed to the other leading-brand vpu, which can address more pixels simultaneously with only 1 fp shader per pixel, in shaders whose instruction counts are multiples of 1 per pixel.

This may have seemed obvious, but it sure offered me some speculation venting therapy. Please correct me if any of the comparisons or concepts are not valid.

Luminescent · May 15, 2003

I just read the following from Anandtech's review of the NV35, which seems to confirm the given NV35 conjectures and information about the pixel shader pipeline:

If you correctly pack the instructions that are dispatched to the execution units in this stage you can yield significantly more than 8 pixel shader operations per clock. For example, in NVIDIA's architecture a multiply/add can be done extremely fast and efficiently in these units, which would be one scenario in which you'd yield more than 8 pixel shader ops per clock.

It all depends on what sort of parallelism can be extracted from the instructions and data coming into this stage of the pipeline.

antlers · May 15, 2003

Anyone want to change their opinion on whether extensive and blatant cheating, rather than architectural changes, is responsible for the observed performance improvements in NV35 now that NVidia has been caught, well, blatantly and extensively cheating?

It's enough to make you think that 50% of a video card driver's CPU load is due to benchmark-detection schemes.

Luminescent · May 15, 2003

Dangit, you just had to suck the fun out of it, didn't you!!

Being that this architectural thread has focused more on the NV35's pixel shading arrangement, alongside the fact that it seems harder to cheat in the pixel shader 2.0 test of 3DMark03 (due to the immediate appearance of rendering artifacts), the Detonator FX ordeal would not seem to be a major concern (with respect to changes in architecture). I do believe that fp32 precision is being used for the 5900 FX this time around, but the cheating (if any) would seem to stem from hand-tailoring the amount rendered rather than the rendering quality.

demalion · May 15, 2003

Your analogy works for the TMU example, but the shader example, when applied to the R3xx versus NV3x, fails to recognize that the R3xx can do more than 1 op per cycle as well.
The shader doesn't have to be short, it just has to be limited in applicability to what the architecture can perform in one clock cycle: the instructions have to meet specific criteria. With the R3xx, when that fails, you can still process 8 pixel ops per clock...with the NV3x (presumably all of them, we need some testing) you can only do 4.

NV30: Best case, 12. Worst case, 4.
R300: Best case, 24. Worst case, 8.

The NV30 opportunities depend on restriction, even if instruction count increases (use a tex op OR use a complex fp op...also, limit register usage even if it increases instruction count, as long as the instruction count increase fits the restricted template of additional ops that can be performed). This seems to remain true for the NV35, except its template allows fp processing where the NV30 required integer.

The R300 opportunities depend only on having the opportunities made visible (it needs to know that there is a tex op that it could be doing in the same clock cycle, it needs to know that some ops are scalar and some are vec 3 if that is all that is required).

The R300's specific opportunities allow the NV30 opportunities to be similarly visible to its template (the scalar op would allow optimizing register usage, and it can also try to reschedule tex ops if the instructions fit its further template requirements), but not vice versa (extra instruction count to reduce register usage could decrease its performance for cases where it would not for the NV30, and the low level optimizer would have to spend significant resources to analyze for that, which would introduce more CPU speed and software dependency). This can't be expressed in just counting textures or instructions per clock alone.

The day for hasty examples;
It is more like adding some hypothetical characteristics to your TMU comparison besides the textures per clock...have a 4 by 2 TMU architecture where each TMU can bilinear sample in 2 clocks or can be used together to produce one trilinear filtered sample in 2 clocks, and an 8 by 1 TMU architecture where that TMU can trilinear or bilinear sample in 2 clocks. It's not just the 1 texture case where the 1 TMU architecture might show advantage, but any number of texture applications requiring trilinear filtering.

As far as visibility to each other's template: You could write an application to manually perform trilinear filtering by asking for 2 specific bilinear filtered samples, instead of asking for one trilinear filtered sample.

I'm not proposing that this TMU example necessarily reflect any actual architecture, or even that it completely maps to NV30 versus R300 shader functionality, just that it is more like it than just counting TMUs and texture application...please forgive the sloppiness of it in that regard.

Luminescent · May 15, 2003

For clarification, when (or if) I said operations in the past, I meant instructions (sorry about that; instructions is the term MDolnec used and the one which logically fits).

Then, hypothetically, if a control unit issues 1 fmad instruction to a 128-bit fpu, and the fpu can operate on 4 components simultaneously with that 1 command (let us say, in one cycle), we can affirm that it is effectively capable of 4 muls plus 4 adds in one cycle, giving 8 opc (8 operations per cycle, or 8 flopc to be more specific). Note: the term "opc" is not conventionally recognized.
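For anyone who wants the instruction vs. operation distinction spelled out, here is a minimal sketch: one fmad instruction over a 4-component (128-bit, 4 x fp32) vector performs 4 multiplies and 4 adds. The function names are made up for the example.

```python
# One "instruction" (fmad) applied to a 4-component vector.

def fmad(a, b, c):
    """Component-wise a*b + c: one multiply and one add per component."""
    return [x * y + z for x, y, z in zip(a, b, c)]


def flops_per_fmad(components=4):
    """Floating-point operations performed by a single fmad instruction."""
    muls = components
    adds = components
    return muls + adds


if __name__ == "__main__":
    print(fmad([1.0, 2.0, 3.0, 4.0], [2.0, 2.0, 2.0, 2.0], [1.0, 1.0, 1.0, 1.0]))
    print(flops_per_fmad())
```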

With this in mind, demalion, when I mentioned 12 ops per clock for the NV35, I meant the ability to issue 12 instructions per clock. That is to say, the processor is capable of 12 shader-only (fp) instructions per clock in its pixel shader (I believe you compared NV35's total shader ability with R300's total texturing and shading ability), if NV35 has 12 fpus and the control unit sends out 12 128-bit instructions, each executing in 1 cycle.

In NV35, these instructions are vector (according to thepkrl's NV30 analysis), so 4 operands are operated on for every type of fp shader instruction (although there probably is a special functions unit for functions which require table lookups and such). R350, on the other hand, can issue two instructions per fp shader (a 4-component vector op and a scalar op, with 8 units of each), so it has a maximum potential throughput of 16 instructions per clock (fp shader only).

Then, we are looking at one architecture capable of 12 (4-component) vector ops and one capable of 8 vector ops + 8 scalar ops. This is, however, beside the main point, which is to resolve the following: what are the differences (pros and cons) between a 4-pipeline processor with 2 fp shaders in each pipeline and one with 8 pipelines containing 1 fp shader in each (assuming both have equal vector and scalar execution abilities)? This is what I was intending to show with the analogy, but as demalion pointed out, the comparison was not as accurate as it could be.
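The peak figures being compared can be tallied as follows. This is only a sketch built from the unit counts stated in this post (12 vector units for NV35; 8 vector plus 8 scalar units for R350, each vector op 4 components wide); those counts are the thread's working assumptions, not confirmed specifications.

```python
# Unit counts as assumed in this post (not confirmed hardware specs).
NV35 = {"vector": 12, "scalar": 0}
R350 = {"vector": 8, "scalar": 8}


def instructions_per_clock(units):
    """Peak fp shader instructions issued per clock, one per unit."""
    return units["vector"] + units["scalar"]


def component_ops_per_clock(units, vector_width=4):
    """Peak per-component operations per clock (scalar ops count as 1)."""
    return units["vector"] * vector_width + units["scalar"]


if __name__ == "__main__":
    for name, units in (("NV35", NV35), ("R350", R350)):
        print(name, instructions_per_clock(units), component_ops_per_clock(units))
```

Counted as instructions the second design leads, while counted per component the first one does, which is part of why a single "ops per clock" number is hard to compare across the two.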

I believe answering this would facilitate our understanding of the NV35's performance in pixel shader benchmarks, alongside this important tidbit which explains the possible penalties of register usage with shaders (in NV30, which contains 4 128-bit fp shaders):

thepkrl:
Register usage is the key to performance, as has been mentioned earlier. For maximum performance, it seems you can only use 2 FP32-registers or 4 FP16-registers. Every two new registers slow down things, and going over 8 regs slows even more:

4.2 cyc/pix: 1reg (2 movs, 16 adds)
4.5 cyc/pix: 2reg (2 movs, 16 adds)
5.8 cyc/pix: 3reg (2 movs, 16 adds)
5.5 cyc/pix: 4reg (2 movs, 16 adds)
7.5 cyc/pix: 5reg (2 movs, 16 adds)
7.1 cyc/pix: 6reg (2 movs, 16 adds)
9.9 cyc/pix: 7reg (2 movs, 16 adds)
9.9 cyc/pix: 8reg (2 movs, 16 adds)
15.0 cyc/pix: 9reg (2 movs, 16 adds)

In the above test the N registers are used in order. If the register usage order is very mixed, performance seems to drop even more. This suggests there are about 2-4 real registers for each pixel in flight (depending if output register is counted or if extra temporaries are reserved). If more registers are used, data is moved between active registers and some slower memory buffer, which adds extra instructions.
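To see the "every two new registers" pattern in thepkrl's numbers more easily, here is a small sketch that reuses the quoted measurements. The helper names are made up; the data is copied from the table above and nothing else is measured.

```python
# cycles/pixel by number of registers used, copied from thepkrl's table.
CYCLES_PER_PIXEL = {
    1: 4.2, 2: 4.5, 3: 5.8, 4: 5.5, 5: 7.5,
    6: 7.1, 7: 9.9, 8: 9.9, 9: 15.0,
}


def slowdown(registers):
    """Cost relative to the 1-register case."""
    return CYCLES_PER_PIXEL[registers] / CYCLES_PER_PIXEL[1]


def pair_averages():
    """Average cycles/pixel for register counts (1,2), (3,4), (5,6), (7,8)."""
    return [
        (CYCLES_PER_PIXEL[n] + CYCLES_PER_PIXEL[n + 1]) / 2
        for n in (1, 3, 5, 7)
    ]


if __name__ == "__main__":
    print([round(avg, 2) for avg in pair_averages()])
    print(round(slowdown(9), 2))
```

Grouping by pairs smooths out the small dips at 4 and 6 registers and shows the cost stepping up with each additional pair, before the larger jump at 9 registers.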

demalion · May 17, 2003

Hmm, I was considering texture ops as instructions, like "texld". If the improvements consist of the register combiners being upgraded to floating point, as I understood that comment to indicate, keep in mind that the NV3x architecture seems to be a "PS 2.0+" OR texture op architecture (the other ops are a restricted set). Anyways, if it can do 8 texture loads per clock usefully for the shader execution, my peak for it should have been 16.

If you mean to just consider arithmetic ops, it seems you need to introduce the qualifier of "no texture ops" in your peak figure of twelve to reflect contrast properly. I think that with floating point register combiners, the NV35's special case is far more competitive to that of the R3xx, but depends on limited texture usage. Unlike the NV30 with its integer dependency, that dependency is realistic and useful (IMO) on the NV35. To me, FX12 just directly killed the point of longer shaders that allowed this dependency on limited texture usage and calculations for details to be useful at all.

Of course, someone (perhaps with plenty of caffeine and a frustrated penchant for teasing) needs to investigate to see all the things that are fixed, but my current concept of the NV35 is an actual delivery of "better quality pixels, not faster" (or however it goes) with regards to the NV30.

My own thinking on the register usage is an issue of a stack storage system for values, and/or of access limited to the beginning and end of the pipeline, rather than to all of the "in flight" stages (i.e., register usage is simply exposing latency). This seems to make sense with the idea of constants for the architecture, and comments from thepkrl about how some register MOVs are free and some aren't. You should probably consider this unsupported and wild speculation, though.

Actually, typing that gives me deja vu, like there was a hypothetical discussion about that very idea. I checked Zephyr's R300/NV30 article discussion briefly, but did not see something that seemed to be what I'm thinking of. I really could swear there was something related to that type of idea that someone else proposed last year, but my searching success rate here is pretty poor. :-?

Luminescent · May 18, 2003

With texture ops included, the NV35 is only capable of 8 (128-bit) fp ops per clock. Remember, MDolnec stated NV35 was capable of 12 ops if only fp shaders are used and 8 fp ops, plus two textures per pipeline, if texture fetches are included. So the peak, full precision fp, arithmetic-shader op performance of NV35 should be 12 ops per clock, not 16.

Analyzing the pdf documents of the NV2x architecture's vertex shader and its internal diagrams leads me to speculate it is very similar in functionality to the NV3x's fp pixel shader pipelines. I remember Nvidia stating in a CineFX document that the pixel shader would receive the abilities of the previous generation's vertex shader. Taking a look at the NV2x's vertex shader should give us some insight about the NV3x's fp pixel shader and the information below should show why.

According to the pdf document:
-The vertex pipelines consist of a pipelined vector core composed of a simd vector unit and a special function fp unit. Each component is computed with 32-bit fp precision.
-All instructions have a perceived single-cycle execution rate (and only 1 instruction per clock is sent to the shader unit); scalar instructions are replicated across all four vector components.

These facts seem to hold true for NV3x's pixel pipelines; you can verify them here and here.
(note: I believe rsq instructions and lrp require a bit more than one instruction (3 or 4), so they should have more than a 1 cycle latency within the architecture).

That is why I have decided to take a look at these vertex pipelines a little more closely, so we'll have an idea of what exactly may compose these elusive "floating-point pixel shaders".

Here is a diagram of the NV2x's vertex pipeline internals, which should be very similar to NV3x's pixel pipeline internals:

In an extremetech interview (found here), we find the data which gives meaning to the diagram above and reason as to why each instruction (consisting of 1 or 2 operations) has a perceived single cycle execution time:
"Each vertex engine can simultaneously process three vertices, and the workload is divvied up such that the free vertex processor takes the next incoming chunk of vertex processing work. So there can be six vertices in flight within the two vertex pipelines, and during every clock cycle, each vertex engine performs one instruction on each vertex. According to Kirk, "The pipeline stages are as deep as the slowest operation. The architecture is designed to deliver single-cycle performance for all of the instructions, so latency is effectively hidden. A divide takes more than three instructions, but the latency is hidden, so it appears to take one cycle." In fact, every vertex shader instruction now has a perceived single-cycle execution time. Several pipeline stages were added to hide latency of more complex operations, and one example Kirk gave was when doing a divide, reading back the result and doing another operation."

By looking at the diagram of a single vertex shader (which may closely resemble a pixel shader of the NV3x) we see why the shader can have 3 vertices in flight. It has three units, which are probably pipelined. Since the pipelines are deep enough to facilitate single cycle execution of something like a divide, it makes sense that there is an inverse logic unit, an Alu, and an Mlu (I believe the Mlu and Alu compose the simd vector core and the Ilu composes the special floating-point core). In a hypothetical scenario, such as the divide example David Kirk gave, it makes sense that the Ilu would take the reciprocal of one operand in an operation and the Mlu and Alu would take care of the madding (if there is such a term to express the pipeline's multiply-add of an fp multiplication) of the Ilu's result and the other operand. All this would be done ("effectively") in a single cycle, using the concept of pipelining. The NV3x pixel shading unit most probably consists of the same units, but offers extra abilities such as ddx/ddy, free conditional updates, and variable precision.
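The divide scenario described here can be sketched as a reciprocal followed by a multiply-add. The split into an "Ilu" step and an "Mlu + Alu" step mirrors the speculation in this post; the function names are made up, and nothing here is confirmed about the actual NV2x or NV3x datapath.

```python
# Sketch of a divide decomposed across the speculated units.

def ilu_reciprocal(x):
    """Special-function step: the inverse logic unit produces 1/x."""
    return 1.0 / x


def mlu_alu_mad(a, b, c):
    """Vector-core step, shown for one component: multiply then add."""
    return a * b + c


def divide(a, b):
    """a / b computed as a * (1/b) + 0, one stage feeding the next."""
    return mlu_alu_mad(a, ilu_reciprocal(b), 0.0)


if __name__ == "__main__":
    print(divide(6.0, 2.0))
    print(divide(1.0, 4.0))
```

With several vertices (or pixels) in flight, each stage can be working on a different one every clock, which is how a multi-step operation can appear to take a single cycle.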

Now we know what the pipeline might have in store for the CineFX pixel shader. I'm wondering if all the fp shading pipelines of the NV35 are fully loaded with the simd vector core (Mlu+Alu) and the special floating-point ops core (Ilu) (to have comparable performance with R3xx, they should be). Unlike R3xx's pixel simd and scalar cores, I do not think the NV2x's vector simd core and special fp ops core are able to function simultaneously in the same clock cycle.

Dave Baumann · May 18, 2003

At the moment, given the Rightmark scores, the low difference in transistor count and the tour of the NVIDIA offices I had you can say I'm a little skeptical as to NV35's.

One of the areas they showed us was the actual silicon verification labs, where they go around finding the issues and potential resolutions with new silicon. The guy who runs this lab was saying that obviously NV30 was a very difficult bring-up and they spent lots of time with it. However, he also stated that NV35, conversely, was easy as it was so close to NV30 in the first place.

I still can't see that if you have 3 times the float power then that wouldn't somehow translate into more performance in a new shader benchmark. I'd like to see a few more new shader benchmarks for the NV35 preview here

LeStoffer · May 18, 2003

DaveBaumann said:
I still can't see that if you have 3 times the float power then that wouldn't somehow translate into more performance in a new shader benchmark. I'd like to see a few more new shader benchmarks for the NV35 preview here

Yes, the differences between NV30 and NV35 are less than stellar except in Pixel Shader 7:

http://www.digit-life.com/articles2/gffx/5900u.html

Pixel Shader 7 uses many more texture samples and samples some data out of 3D textures, according to ixbt. They argue that the difference has something to do with the greater bandwidth of the NV35. While that may well be true, I would point out that the difference could be due to a change in the NV35 so that the FP shaders aren't sharing logic with the FP texturing unit.

BTW: It is interesting to note that the jump in performance in ShaderMark and 3dMark03 (PS2 test) isn't really reflected in Rightmark. Better drivers as promised?

Dunno, so I'm looking forward to your review with a host of shader investigations! 8)
