NV35 pipeline organization

demalion · May 18, 2003

Luminescent said:
With texture ops included, the NV35 is only capable of 8 (128-bit) fp ops per clock. Remember, MDolenc stated that NV35 was capable of 12 ops if only fp shaders are used, and 8 fp ops plus two textures per pipeline if texture fetches are included. So the peak, full precision fp, arithmetic-shader op performance of NV35 should be 12 ops per clock, not 16.

Ok, I'm missing how that contradicts, but maybe because I've forgotten what he originally said.
I understand 12 instructions when counting arithmetic ops, and the 8 texture ops are precluded.
I am assuming that the 8 tex ops (even if they are restricted to PS 1.3 texture load usage) do not preclude the 8 register combiner ops (2 per pipe when 4 pipe) newly allowed to be floating point (for my current understanding of NV35). That's why I was offering the correction of "16 ops", inclusive of texture and arithmetic ops to match the R3xx's peak that I quoted, as an alternative to saying 12 arithmetic ops and necessitating saying that texture ops were precluded for that to occur to contrast it with the R3xx. Is it just a matter of my viewing it as (2 tex ops / 1 fp op) + (2 fp ops) per pipe when 4 pipes, when I should be viewing the nv35 as (2 tex ops / 2 fp ops) + (1 op, maybe)? I'd thought MDolenc's comment had been edited, but if that info is indicated in something remaining I'll try to find it.

For instance, I'm not currently under the impression that it is established that the NV35 can't be 8x1 for PS 1.3 shading (but maybe at floating point precision), which would be 16 ops per clock peak if you count texture load as an op as well.

The vertex pipeline picture presents interesting information, but I think branching and register control functionality are the key to the performance characteristics and aren't represented in any detail (that I can discern) in it. However, I haven't followed the detail in the information you've linked to yet, and it looks like there will be a wealth of information there on that. Perhaps that's where I'll find the reason you propose my 16 op correction is incorrect.

And speak of the caffeine addict who likes teasing, he seems to be up and around...

Arun · May 18, 2003

Remember the register usage problems. Wouldn't be surprised if nVidia optimized ShaderMark to inflate their scores there. Although if all they did was change the shaders to give the same result but with less register usage, I'd hardly call that cheating. I guess without clear facts, it's hard to say what they did though...

Uttar

demalion · May 18, 2003

LeStoffer said:
DaveBaumann said:

I still can't see that if you have 3 times the float power then that wouldn't somehow translate into more performance in a new shader benchmark. I'd like to see a few more new shader benchmarks for the NV35 preview here

Yes, the differences between NV30 and NV35 are less than stellar except in Pixel Shader 7:

http://www.digit-life.com/articles2/gffx/5900u.html

Pixel Shader 7 uses many more texture samples and samples some data out of 3D textures, according to ixbt. They argue that the difference has something to do with the greater bandwidth of the NV35. While that may well be true, I would point out that the difference could be due to a change to the NV35 so that the FP shaders aren't sharing logic with the FP texturing unit.

The type of lead for the NV35 does seem to be explained by the NV30 simply using FX12 and both architectures stalling on texture dependency for further calculations...the 6 and 7 are frontloaded with texture loads to which calculations are then applied. With the NV35's bandwidth allowing it to overcome the processing clock speed deficit, I'd expect a larger lead if they were truly decoupled...it should offer a significant performance boost.

6 and 7 seem to be the procedural marble and fire shaders...if so, this should be the applicable code:

Code:

				texld r0, t0, s0
				mul r7.w, c0.x, r0.x
				texld r2, t1, s0
				mad r4.w, c0.y, r2.x, r7.w
				texld r11, t2, s0
				mad r1.w, c0.z, r11.x, r4.w
				texld r8, t3, s0
				mad r10.w, c0.w, r8.x, r1.w
				mul r5.w, c2.x, r10.w
				mad r7.w, c1.x, t0.x, r5.w
				mad r9.w, r7.w, c4.x, c4.w
				frc r4.w, r9.w
				mad r6.w, r4.w, c4.y, c4.z
				mul r1.w, r6.w, r6.w
				mad r3.w, r1.w, c5.x, c5.w
				mad r5.w, r1.w, r3.w, c5.y
				mad r7.w, r1.w, r5.w, c5.z
				mad r9.w, r1.w, r7.w, c3.x
				mad r11.w, r1.w, r9.w, c3.w
				mov r3.xy, r11.w
				texld r6, r3, s1
				mov oC0, r6

(Declarations omitted for brevity).

I think the difference would be significantly greater if 1) the nv30 actually did restrict itself to floating point processing for all instructions 2) the nv35 did have texture ops completely decoupled, given the code. With 2), I think the nv35 would be closer to the 9800 than it was, as well (119.6 versus 197.4 fps), but possible register usage issues do make that a bit harder to evaluate.

BTW: It is interesting to note that the jump in performance in ShaderMark and 3dMark03 (PS2 test) isn't really reflected in Rightmark. Better drivers as promised?

It is unclear whether Rightmark 3D actually fully depends on floating point precision for the pixel filling test output quality requirements, and therefore how visible the NV30 using FX 12 would be, and no screenshots were provided for comparison.

We also have no screenshots for comparison for any of the results in question, that I know of, for the new drivers.

Dunno, so I'm looking forward to your review with a host of shader investigations! 8)

Who foots the bill for Wavey's hot beverages?

LeStoffer · May 18, 2003

demalion said:
The type of lead for the NV35 does seem to be explained by the NV30 simply using FX12 and both architectures stalling on texture dependency for further calculations...the 6 and 7 are frontloaded with texture loads to which calculations are then applied. With the NV35's bandwidth allowing it to overcome the processing clock speed deficit, I'd expect a larger lead if they were truly decoupled...it should offer a significant performance boost.

So you suspect/suggest that NV35 is only using pure FP calculations while the NV30 is using FX12, but that NV35 is still being handicapped by having to share some of the FP logic with its FP texture units?

Maybe, just note that the readme only states that you can change between FP16/FP32 (not FX12). Anyway, I feel we need more evidence to solve this case.

demalion said:
I think the difference would be significantly greater if 1) the nv30 actually did restrict itself to floating point processing for all instructions 2) the nv35 did have texture ops completely decoupled, given the code. With 2), I think the nv35 would be closer to the 9800 than it was, as well (119.6 versus 197.4 fps), but possible register usage issues do make that a bit harder to evaluate.

Good point.

BTW: Yes, we should start to collect some money to pay for Wavey's hot beverages! 8)

Luminescent · May 19, 2003

Demalion, I see where I misunderstood you. I thought you were evaluating maximum fp shader throughput, which is not attainable when textures are involved. If you include texture ops, you lose 1 fp shader op per pipeline but you gain 2 texture ops, so you are right when you say 16 possible ops per clock.
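
The two ways of counting can be tallied in a few lines. This is only a minimal Python sketch using the figures quoted in this thread (4 pipelines, up to 3 fp ops per pipe with no texturing, or 2 fp ops plus 2 texture ops per pipe); none of these are confirmed hardware specifications, and the function and variable names are made up for illustration.

```python
# Per-clock op tally for NV35 using the figures quoted in this thread.
# All numbers are the thread's working assumptions, not confirmed specs.
PIPES = 4


def ops_per_clock(fp_per_pipe, tex_per_pipe, pipes=PIPES):
    """Return (fp ops, texture ops, total ops) issued per clock."""
    fp = fp_per_pipe * pipes
    tex = tex_per_pipe * pipes
    return fp, tex, fp + tex


# Arithmetic only: 3 fp ops per pipe, no texture fetches.
arithmetic_only = ops_per_clock(fp_per_pipe=3, tex_per_pipe=0)
# With texturing: one fp op per pipe is given up, two texture ops are gained.
with_texturing = ops_per_clock(fp_per_pipe=2, tex_per_pipe=2)

print("arithmetic only:", arithmetic_only)
print("with texturing: ", with_texturing)
```

Under those assumptions the arithmetic-only case gives 12 ops per clock and the mixed case gives 8 fp plus 8 texture ops, which is where the "16" comes from.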

Now, as for that pdf, it only elaborates upon some of the internals of the NV2x vertex shader, its restrictions, and its capabilities. Reading that pdf pointed out, to me at least, many similarities in functionality and performance between the NV2x vertex shader and the NV3x fp pixel shader (which is why I posted it). Here are some close similarities which led me to my conclusion:

-The vertex shader cannot branch, only evaluate conditional calls
-It executes most instructions in one cycle, with scalar and vector ops requiring roughly the same time.
-The vertex pipeline executes only 1 instruction per cycle (per pipeline), no vector + scalar pairing.

All in all, I saw many similarities between the two, and I thought including a detailed pdf would prove this point. Knowing this, I wanted to post the microcode diagram to give insight to those who have never really seen what goes on in these units. As you can tell from other posts in a variety of threads, hardware architecture is one of my favorite facets of 3D tech.

Hope that cleared things up.

Luminescent · May 19, 2003

As to whether the shader benchmarks in Digit-life's review seem reflective of the NV35's supposed fp shader performance - at first glance, they seem to refute the 12 ops per clock capability. However, considering the penalty NV35 pays for more than 2 registers at fp32 precision, it wouldn't surprise me if registers were a big reason for Digit Life's Rightmark results.

If we can find a pattern between the different tests, like a correlation between the results and the amount of registers used or the amount of texture ops compared to arithmetic ops, we should be able to effectively evaluate NV35's performance with respect to R350, which would help Wavey to asses (if that's how you spell it) this in his review, unlike any of the other reviewers out there.

For example, let us say in test 1 there are a series of light attenuation instructions and one of them is a dp3 (which should have a 1 cycle execution latency, per pipeline, with a max of 2 registers on NV35). If 3 registers are used instead of 2 then, according to thepkrl's numbers, the instruction will take 1.45 cycles as opposed to 1 cycle.
Note: For those who are skeptical, this is how the numbers added up:
Here it says that for 16 adds (or 16 1 cycle ops, for the NV3x) and the use of 3 registers, the NV30 takes 5.8 cycles. Since the NV30 has 4 pipelines and each add instruction takes 1 cycle, the ideal time should be 4 cycles. This means that per pipeline each instruction is taking 5.8/4 cycles, yielding 1.45 cycles (almost 50% more time for using 3 versus 2 or 1 registers).
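
The same arithmetic as a small Python sketch, for anyone who wants to plug in other measurements. The inputs are just the figures cited above (16 one-cycle adds, 4 pipelines, 5.8 cycles measured with 3 registers); they are not independently verified, and the names are hypothetical.

```python
# Per-instruction cost implied by the figures quoted above (thepkrl's
# measurements as cited in this thread; not independently verified).
INSTRUCTIONS = 16       # 16 one-cycle adds
PIPELINES = 4           # NV30 pipeline count assumed in the post
MEASURED_CYCLES = 5.8   # reported time with 3 registers in use


def cycles_per_instruction(measured_cycles, instructions, pipelines):
    """Measured time divided by the ideal time at 1 cycle per instruction."""
    ideal_cycles = instructions / pipelines  # 16 / 4 = 4 cycles
    return measured_cycles / ideal_cycles


cost = cycles_per_instruction(MEASURED_CYCLES, INSTRUCTIONS, PIPELINES)
slowdown_pct = (cost - 1.0) * 100.0
print(f"{cost:.2f} cycles per instruction ({slowdown_pct:.0f}% slower than ideal)")
```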

Can someone count the registers used per instruction and see if, on average, they exceed 2 (which is the mark at which performance penalties are supposed to kick in, compared to 1 or 2 registers)? Most arithmetic can be done in 1 cycle (with NV3x), so losing out to R350 that badly on instructions like dp3's, mads, etc. (not rsq), which should be cake for NV35 and 12 fp units, indicates to me register usage might be to blame (even with textures enabled, NV35's raw fp shader performance should be on par if not above R350's, considering clockspeed).

A careful assessment of the shader benchmarks vs. the results should give us a better idea of the detrimental performance impact which registers can have on NV3x, as opposed to R3xx. I say NV3x because MDolenc affirmed NV35 also has the register performance drawbacks of the NV30.
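
One rough way to do that count is to script it. The Python sketch below runs over the shader listing demalion posted earlier in this thread and counts texture versus arithmetic instructions and the temp registers (r#) each instruction references. Treat it as an approximation under stated assumptions: counting distinct r# names in the assembly is only a proxy for what the driver actually allocates, and the helper names are made up for illustration.

```python
# Count texture vs. arithmetic instructions and temp registers in the
# shader listing posted earlier in this thread. Distinct r# names are only
# a rough proxy: the driver may allocate hardware registers differently.
import re

SHADER = """
texld r0, t0, s0
mul r7.w, c0.x, r0.x
texld r2, t1, s0
mad r4.w, c0.y, r2.x, r7.w
texld r11, t2, s0
mad r1.w, c0.z, r11.x, r4.w
texld r8, t3, s0
mad r10.w, c0.w, r8.x, r1.w
mul r5.w, c2.x, r10.w
mad r7.w, c1.x, t0.x, r5.w
mad r9.w, r7.w, c4.x, c4.w
frc r4.w, r9.w
mad r6.w, r4.w, c4.y, c4.z
mul r1.w, r6.w, r6.w
mad r3.w, r1.w, c5.x, c5.w
mad r5.w, r1.w, r3.w, c5.y
mad r7.w, r1.w, r5.w, c5.z
mad r9.w, r1.w, r7.w, c3.x
mad r11.w, r1.w, r9.w, c3.w
mov r3.xy, r11.w
texld r6, r3, s1
mov oC0, r6
"""

# Matches temp registers such as r7 or r11 (not c0, t0, s0 or oC0).
TEMP_REG = re.compile(r"\br(\d+)\b")


def analyze(listing):
    """Return (texture ops, arithmetic ops, distinct temps, temps per instruction)."""
    tex_ops = arith_ops = 0
    temps = set()
    per_instruction = []
    for line in listing.strip().splitlines():
        opcode, _, operands = line.strip().partition(" ")
        if opcode == "texld":
            tex_ops += 1
        else:
            arith_ops += 1
        regs = set(TEMP_REG.findall(operands))
        per_instruction.append(len(regs))
        temps |= regs
    return tex_ops, arith_ops, len(temps), per_instruction


tex_ops, arith_ops, temp_count, per_instruction = analyze(SHADER)
over_two = sum(1 for n in per_instruction if n > 2)
print("texture ops:", tex_ops, "arithmetic ops:", arith_ops)
print("distinct temp registers:", temp_count)
print("instructions touching more than 2 temps:", over_two)
print("average temps per instruction:", sum(per_instruction) / len(per_instruction))
```

Running the same function over the other benchmark shaders would give the per-test numbers needed to look for a correlation with the results.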

Dave Baumann · May 19, 2003

I ran both Ilfirin's and MDolenc's Shader benchmarks last night on both NV30 and NV35, and NV35 did display improved performance in both; however, this was not a 2X performance increase by any means but a performance delta (+25% or so at a guess, don't have the numbers in front of me now). Now, I'm wondering about the temp registers as well. NV30 reported a very odd number of registers through DX Caps and I wonder if the drivers were using a number of them to assist in a number of workarounds for some buggy hardware - I'll have a check tonight to see if NV35 reports a different number than NV30 did.

LeStoffer · May 19, 2003

Dave, interesting. Keep up the investigations, they are highly appreciated. 8)

Luminescent · May 19, 2003

Very nice, Dave; keep up the good work.

LeStoffer · May 19, 2003

Regarding Rightmark's shader test, here's what I got with a Radeon 9700 Pro:

Benchmark "RightMark 3D: Pixel Shading"
Test Time: 10.00
Width: 1024
Height: 768
Window: OFF

Shader: 1
Shader Profile: Pixel Shader 2.0
FPS: 231.93

Shader: 4
Shader Profile: Pixel Shader 2.X (16fp)
FPS: 102.94

Shader: 2
Shader Profile: Pixel Shader 2.X (32fp)
FPS: 131.39

Shader: 5
Shader Profile: Pixel Shader 2.0
FPS: 55.11

Shader: 3
Shader Profile: Pixel Shader 2.X (16fp)
FPS: 90.76

Shader: 7
Shader Profile: Pixel Shader 2.X (32fp)
FPS: 162.25

Dave Baumann · May 20, 2003

Nope, same number of PS temps in both NV30 and NV35: 28.

LeStoffer, where is the download for Rightmark? Cheers.

Luminescent · May 20, 2003

The download links for all Rightmark tests are found here, at Digit-Life.

Specifically they are:
FillingRate
GeometryProcessing Speed
HSR
PixelShaders
PointSprites

demalion · May 20, 2003

That's out of date. www.rightmark.org has the latest Direct3D benchmark somewhere, and "UncleSam" posted a link recently (a search on his user name should turn it up).

I think all links, to the "Cg Rightmark3D" and "Direct3D Rightmark3D" both, can be found in the recent Rightmark 3D thread. I'm not sure if the one on the rightmark.org download page is the most recent, but the most recent one that I know of can be found through one of the aforementioned links.
