Dawn FP16/FX12 VS FP32 performance - numbers inside

MDolenc: Hmm, are you sure I made that mistake?
Looking at my code, I see:

Code:
#MADX o[COLH], R3, R2.w, R0;
MAD R3, R3, R2.w, R0;
ADD o[COLR], R3, R1.w;

And the MADX using COLH is *commented* - that's why I didn't change it.

I pretty much supposed using both COLH & COLR like that is illegal - but I don't see where I'm doing it... :)

I'd like to see if your shader works better - Maybe it would, maybe it wouldn't. Don't know...


Uttar
 
And I see:
TEX R0, R2, TEX4, CUBE;
TEX R2, f[TEX0], TEX2, 2D;
MOV o[COLH].w, R2.x;
MUL R2.y, R2.y, 0.5;
MAD R0, R1, R2.y, R0;

TEX R3, R3, TEX5, CUBE;
TEX R2, f[TEX0], TEX0, 2D;
MUL R0, R2, R0;
MUL R3, R3, 0.1;

#MADX o[COLH], R3, R2.w, R0;
MAD R3, R3, R2.w, R0;
ADD o[COLR], R3, R1.w;
END
 
Oopsy :oops:

I guess that'll teach me to use replace instead of wanting to look like a madman modifying code by hand :p
Thanks for the fix, though :)


Uttar
 
Ante P said:
Uttar said:
Okay, so let's try this AGAIN...

www.notforidiots.com/Dawn3.zip

Psst, AnteP, don't insist too much on that or nVidia will realize you got a *beta* version and kill you for sharing that image ;)


Uttar

nah I just spoke to them and posting pics and videos is fine as long as I don't post the actual demo =)

Yeah, I'll bet they don't want to see it running on ATI hardware with better quality and 15% faster. ;)
 
5800 Ultra:
Normal: 30 fps

UTTARS VERSIONS:
Old results
FP32: 17 fps
FP16: 20 fps
FullFP16: 21 fps

New results
FP32: 17 fps
16IN32: 17 fps
FP16: 20 fps
FullFP16: 21 fps
FX12: 30 fps

MDOLENCS VERSION
FP32: 17 fps
 
Neeyik & mrbill-

Thanks for the replies.

So ARB_fragment_program seems to follow the same precision rules as PS 2.0: all architectures must be capable of supporting at least FP24, but lower precisions can be allowed where available via a per-shader hint.

Given our current understanding of the NV35 shader pipeline, this means it should run the ARB2 path at the same performance (other potential issues notwithstanding) as the NV30 path, so long as there aren't any shaders where:
1) Some variables require FP24+ precision for the image to display properly (i.e. FP32 on NV3x)
2) Other variables are fine with FP16
3) Using FP16 registers (i.e. 64 bits per value) for such variables will make it so that the total variable space divided by 256 bits has a lower integer part than if they used FP32 registers (128 bits).
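
To make condition 3 a bit more concrete, here is a rough sketch in Python. It assumes the register sizes from the list above (128 bits per FP32 value, 64 bits per FP16 value) and the 256-bit step mentioned in condition 3. The function names are made up for illustration and nothing here is measured on hardware.

```python
FP32_BITS = 128   # one 4-component FP32 register
FP16_BITS = 64    # one 4-component FP16 register
STEP_BITS = 256   # register space per performance step, as stated in condition 3


def register_steps(n_need_fp32, n_fp16_ok, use_fp16):
    """Integer part of (total variable space / 256 bits)."""
    bits_for_flexible = FP16_BITS if use_fp16 else FP32_BITS
    total_bits = n_need_fp32 * FP32_BITS + n_fp16_ok * bits_for_flexible
    return total_bits // STEP_BITS


def mixed_precision_helps(n_need_fp32, n_fp16_ok):
    """True when storing the FP16-tolerant variables as FP16 lowers the step count."""
    mixed = register_steps(n_need_fp32, n_fp16_ok, use_fp16=True)
    all_fp32 = register_steps(n_need_fp32, n_fp16_ok, use_fp16=False)
    return mixed < all_fp32


if __name__ == "__main__":
    for need32, ok16 in [(2, 2), (2, 1), (1, 1)]:
        print(need32, ok16, mixed_precision_helps(need32, ok16))
```

With 2 variables that need FP32 and 2 that tolerate FP16, the mixed layout drops from 512 to 384 bits and crosses a 256-bit boundary. With 2 and 1 it drops from 384 to 320 bits and stays on the same step, so such a shader would not separate the two paths.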
 
Since many of us are still trying to explain/justify the performance of the NV35 with words or theories, I have set out to do this numerically. Maybe some of you will get a better idea of the performance NV35 and R350 have relative to each other this way (from my understanding ;)).

To do this, I will compare the "effective" number of fragment shader fp flops each architecture can achieve, in an attempt to generate life-like, real-world performance expectations. Consequently, we need to take shader code into account; only then will we have "effective" results.

With the following statement in mind, we may proceed:
"...the more non-Vec4 operations are used the more efficiencies will be gained – 3Dlabs have suggested that up to 30% of instructions even in a standard OpenGL Transformation pipeline may not be Vec4 instructions."

This is how it pans out:

To determine the number of "effective" flops, I sum the maximum vec4 flop rate, multiplied by its weight (70%), with the maximum scalar rate, multiplied by its weight (30%), and divide by the total weight.

Scalar=.3 or 30%; Vector=.7 or 70% of the average fragment instruction, so the total number of scalar flops available in each processor is multiplied by .3 and the total number of vector flops available by .7.

Since R350 has 8 fp fragment shader pipelines (each capable of simultaneously processing a vec4 and scalar op), we obtain a maximum flop capability of 380 (clockspeed in MHz)* 8 (number of fp fragment shader units)*8(number of ops with vec4 instructions i.e. mad), which yields 24.320 gflops. On scalar ops, the number becomes 380*8*1, which is 3.040 gflops.

NV35 contains 12 fp fragment shader pipelines, each capable of either processing a vec4 or a scalar op. At a clockspeed of 450, the maximum flop capability of NV35 with vec4 ops is 450*12*8 (on mad instructions) or 43.200 gflops. With a scalar op, the NV35 is capable of 450*12*2 or 10.800 gflops. This holds true for both fp32 and fp16 precision.

Because NV35 can execute either scalars or vectors, and not the two concurrently, the possible "effective" flop number is derived from a straight average of the weighted scalar and vector flop performances.

NV35: (.7*43.200+.3*10.800)/2=(30.240+3.240)/2=16.740 gflops
R350: (.7*24.320+.3*3.040)/1=(17.024+0.912)/1=17.936 gflops
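
For anyone who wants to replay the arithmetic, here is a small Python sketch of the weighting above. The clock speeds, unit counts, flops per op and the 70/30 split are the ones from this post; the function names and the "divisor" parameter (2 for NV35, 1 for R350) are just my way of writing the two formulas down.

```python
def peak_gflops(clock_mhz, units, flops_per_unit):
    """Peak rate in gflops: MHz * units * flops per unit per clock / 1000."""
    return clock_mhz * units * flops_per_unit / 1000.0


def effective_gflops(vec_peak, scalar_peak, divisor,
                     vec_weight=0.7, scalar_weight=0.3):
    """Weighted sum of vector and scalar peaks, divided as in the post."""
    return (vec_weight * vec_peak + scalar_weight * scalar_peak) / divisor


# R350: 8 pipelines at 380 MHz, vec4 mad = 8 flops, scalar = 1 flop
r350_vec = peak_gflops(380, 8, 8)
r350_scalar = peak_gflops(380, 8, 1)

# NV35: 12 units at 450 MHz, vec4 mad = 8 flops, scalar = 2 flops
nv35_vec = peak_gflops(450, 12, 8)
nv35_scalar = peak_gflops(450, 12, 2)

# NV35 runs vector OR scalar, so the post halves the weighted sum;
# R350 co-issues vector and scalar, so its divisor is 1.
nv35_effective = effective_gflops(nv35_vec, nv35_scalar, divisor=2)
r350_effective = effective_gflops(r350_vec, r350_scalar, divisor=1)

if __name__ == "__main__":
    print(f"NV35: {nv35_effective:.3f} gflops")
    print(f"R350: {r350_effective:.3f} gflops")
```

Changing the weights is a quick way to see how sensitive the comparison is to the 30% scalar assumption.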

With 6 registers in use for 3 fp units per pipeline (an average of 2 registers per fp fragment shader, assuming the performance penalties of NV30 and fp32 precision) NV35's number of effective flops becomes:
16.740 (maximum available gflops) /1.52 (clock cycles per instruction with 6 registers enabled)=11.01 gflops
Note: 2/3's comes from thepkrl's NV30 pipeline results thread and their performance analysis here.

Since R350 suffers no performance degradation when less than 32 registers are in use, the number of effective flops available on the NV35 in comparison to R350 is:
11.01 vs. 17.936 (gflops).

Thus the gflop ratio of NV35 relative to R350 is: ~0.614
In the real world this could translate into a 38% performance difference between NV35 and R350 when running fp fragment shaders at full precision. The NV35's performance, however, is only available when no texturing is required, otherwise, it loses 4 fp fragment shader units and the performance difference would, probably, increase another 20-30 percent (translates into a 60-70% difference).
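
The register penalty and the final ratio can be checked the same way. This sketch takes the 16.740 and 17.936 gflops figures and the 1.52 cycles per instruction as given in the post; the helper name is hypothetical, and the texturing penalty is left out because the post only gives a rough 20-30 percent guess for it.

```python
def with_register_penalty(gflops, cycles_per_instruction):
    """Scale a peak figure by the average cycles each instruction takes."""
    return gflops / cycles_per_instruction


NV35_EFFECTIVE = 16.740   # gflops, before the register penalty (from the post)
R350_EFFECTIVE = 17.936   # gflops, no penalty below 32 registers (from the post)
CPI_6_REGS_FP32 = 1.52    # cycles per instruction with 6 registers (from the post)

nv35_fp32 = with_register_penalty(NV35_EFFECTIVE, CPI_6_REGS_FP32)
ratio = nv35_fp32 / R350_EFFECTIVE
deficit_percent = (1.0 - ratio) * 100.0

if __name__ == "__main__":
    print(f"NV35 fp32: {nv35_fp32:.2f} gflops")
    print(f"NV35/R350: {ratio:.3f}")
    print(f"deficit:   {deficit_percent:.1f}%")
```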

References:
-"...the more non-Vec4 operations are used the more efficiencies will be gained – 3Dlabs have suggested that up to 30% of instructions even in a standard OpenGL Transformation pipeline may not be Vec4 instructions."
http://www.beyond3d.com/articles/p10tech/index.php?page=page2.inc
-Thepkrl's research
http://www.beyond3d.com/forum/viewtopic.php?p=100394#100394
-NV35 fragment pipeline fp fragment shader details
http://www.beyond3d.com/forum/viewtopic.php?p=121958#121958
http://www.xbitlabs.com/articles/video/display/geforcefx-5900ultra.html

Take what you will from this tedious explanation, but it seems the R350's fragment architecture holds some definite advantages over the NV35's, precision aside (fp24 vs. fp32); instruction count of R350 relative to NV35 is also higher. By no means, though, is the NV35 at a great disadvantage. Its performance is definitely admirable (we would never have thought this many flops at this precision were possible in the past) and is even somewhat comparable to R350 @ fp24, but it seems to really shine and, possibly, outperform the R350 with fp16 precision. The CineFX fragment shader architecture also offers a little bit more flexibility and a couple of extra instructions.

All in all, it is just good competition amidst some nasty corruption (which needs to be abolished).
 
Yes, but you don't include consideration of texture ops in your final evaluation, nor that there might be a limitation to which operations can be done for peak operation (dp3, mul and some simple ops can be done, but what about the consideration of the rest of the PS instruction set?...many might be disadvantageous for the NV35, and there is the possibility that MOV might be advantageous for it in comparison to the R350).

The first doesn't relate to the peak calculation determination, but does to the final evaluation you offer and the applicability of it to realistic shader usage.
The second directly pertains to the root of your assumptions, but doesn't negate the idea as a good starting point.

All AFAICS, of course.
 
I appreciate the criticism demalion, but what about this:
In the real world this could translate into a 38% performance difference between NV35 and R350 when running fp fragment shaders at full precision. The NV35's performance, however, is only available when no texturing is required, otherwise, it loses 4 fp fragment shader units and the performance difference would, probably, increase another 20-30 percent (translates into a 60-70% difference).

I'll even throw this in for instruction latency comparisons:
I have tested performance for all instructions with FP32, FP16, FX12 with both dependent operations and parallel independent operations. There is no difference between FP32/FP16 (but see about registers). FX12 operations are significantly faster (3-4x).

Operations/cycle
FP    FX      Instructions
4     12-16   mul/dp3/dp4
4     12      mad/add/sub/max/min/flr/frc
4     12      seq/sge/sgt/sle/slt/sne/str/sfl
2     8       lrp
4     -       sin/cos/ex2/lg2/dst/rcp/x2d/ddx/ddy
2     -       rsq/lit/pow
1     -       rfl
4     -       pack/unpack/kil
8     -       tex/txp
0.8   -       txd
It seems that most instructions, on both architectures, require the same amount of latency at max precision.

The R350 and other R3xx's execute (per pipeline) mul/dp3/dp4/mad/add/sub/max/min/frc/ex2/lg2/dst/rcp at 1 op per cycle, like NV3x (remember that NV30, unlike NV35, has only 4 fp units, which one has to keep in mind when reading thepkrl's performance results). The differences lie in lrp, which the R3xx can (sometimes) execute in 1 cycle, and in rsq (& pow, I believe), which R3xx executes in 1 cycle and NV3x in 2. The NV3x does support ddx/ddy (in a single cycle) and sin/cos (in a single cycle), which R3xx does not support.

The two architectures are fairly close when it comes to individual instruction execution latencies, with a few pros and cons inherent to each system. The R3xx executes a couple of instructions quicker, while NV3x has native support for others.

Hope this makes things clearer.
 
Luminescent said:
I appreciate the criticism demalion, but what about this:
In the real world this could translate into a 38% performance difference between NV35 and R350 when running fp fragment shaders at full precision. The NV35's performance, however, is only available when no texturing is required, otherwise, it loses 4 fp fragment shader units and the performance difference would, probably, increase another 20-30 percent (translates into a 60-70% difference).

It's called "me missing your statement". :oops: I guess I was prompted by it being missing from inclusion in the "Take what you will..." paragraph at the end, which is what I meant by your conclusion.

My only excuse is that it is the only mention of texturing in your post, even though the actual role of texture access in your discussion is more significant...can I be forgiven on that basis? :-?
 
No problem, demalion, guess I misplaced the statement and did not give it the emphasis it needed. As a matter of fact, I added the texturing a while after beginning the post, because I had initially missed it.

P.S. I edited my followup post with some more detailed information.
 
Yes, but note that some instructions can only be executed once per clock cycle per pipe, at best, at peak execution speed:

sin/cos/ex2/lg2/dst/rcp/x2d/ ddx/ddy, rsq/lit/pow/rfl/pack/unpack/kil

Their peak speed would then be, at best, 4 per clock, not 12.

We have indication that the possible precision of the other ops might be improved, but to my knowledge we don't have indication that how often these ops can occur has been improved as well. In any case, recognition of either possibility wasn't apparent to me, but maybe I just missed that too. :oops:

:p
 
Luminescent said:
The NV3x does support ddx/ddy (in a single cycle) and sin/cos (in a single cycle), which R3xx does not support.
I thought that the NV3x only supported sin/cos in the vertex shader not the pixel shader. (When I say support I mean as a single instruction not as a Taylor series expansion.)

Please correct me if I am wrong.
 
It seems NV3x supports sin/cos, in both the vertex shader and pixel shader, natively, in 1 cycle. Take a look at thepkrl's test results, found in my previous post (a reference is given as well as a quote). Basically, he designed a program to determine NV30's architecture and inherent performance and found sin/cos, not only supported, but available in a single cycle (per pipeline).

For more info on CineFX and Nvidia's supported fragment instructions and implementations, see this link here. The document states the following in the fragment shader section:
The SIN instruction approximates the sine of the angle specified by the
scalar operand and replicates it to all four components of the result
vector. The angle is specified in radians and does not have to be in the
range [0,2*PI].

tmp = ScalarLoad(op0);
result.x = ApproxSine(tmp);
result.y = ApproxSine(tmp);
result.z = ApproxSine(tmp);
result.w = ApproxSine(tmp);

The approximation function is accurate to at least 22 bits with an angle
in the range [0,2*PI].

| ApproxSine(x) - sin(x) | < 1.0 / 2^22, if 0.0 <= x < 2.0 * PI.

The error in the approximation will typically increase with the absolute
value of the angle when the angle falls outside the range [0,2*PI].

The following special-case rules apply to cosine approximation:

1. ApproxSine(NaN) = NaN.
2. ApproxSine(+/-INF) = NaN.
3. ApproxSine(+/-0.0) = +/-0.0. The sign of the result is equal to the
sign of the single operand.
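
As a rough illustration of the quoted rules (not Nvidia's implementation), here is a Python sketch. The approximation itself is a stand-in: it just rounds the exact sine to single precision, which is an assumption made for the example. The names approx_sine and sin_instruction are hypothetical. The special cases have to be handled by hand because Python's math.sin raises an error on infinities.

```python
import math
import struct


def to_float32(x):
    """Round a Python float to the nearest single-precision value."""
    return struct.unpack("f", struct.pack("f", x))[0]


def approx_sine(angle):
    """Stand-in for ApproxSine that follows the three special-case rules."""
    if math.isnan(angle) or math.isinf(angle):
        return math.nan          # rules 1 and 2: NaN and +/-INF give NaN
    if angle == 0.0:
        return angle             # rule 3: +/-0.0 keeps its sign
    return to_float32(math.sin(angle))


def sin_instruction(op0):
    """Replicate the scalar result to all four components (x, y, z, w)."""
    s = approx_sine(op0)
    return (s, s, s, s)


if __name__ == "__main__":
    bound = 1.0 / 2 ** 22
    samples = [i * 2.0 * math.pi / 1000 for i in range(1000)]
    worst = max(abs(approx_sine(x) - math.sin(x)) for x in samples)
    print("worst error:", worst, "within bound:", worst < bound)
    print(sin_instruction(math.pi / 6))
```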

Wavey's NV30-R300 feature comparison also contains some information in the instruction set chart.
 
Yes, but note that some instructions can only be executed once per clock cycle per pipe, at best, at peak execution speed:

sin/cos/ex2/lg2/dst/rcp/x2d/ ddx/ddy, rsq/lit/pow/rfl/pack/unpack/kil

Their peak speed would then be, at best, 4 per clock, not 12.
I understand demalion, but these results are for NV30, which contains 8 fewer fp fragment shader units than NV35. Thepkrl's results are only good to show that the fp units are capable of 1 shader op per clock (on average) in NV3x. Thepkrl's results came from a 16 instruction shader, which took ~4 cycles on a 4 pipeline processor. This indicates to me that, on average, each pipeline completed one 128-bit op per clock (performance holds true with 1 or 2 registers in fp32 and 1-4 registers in fp16 mode).

Let me break it down (also helps me, mentally, sometimes):
Average number of instructions executed per clock = total number of instructions/total instruction latency (in clock cycles).

To obtain the latency of execution, per pipeline, divide the average number of instructions executed per clock by the total number of fragment fp pipelines.

For NV30, the latency of an fp fragment shader program composed of 16 adds, per pipeline, translates into (16/4)/4, resulting in 1 cycle.

Accordingly, if we have 12 fp fragment pipelines (NV35) and a 16 add instruction shader (movs are free), we obtain (16/12)/12, which results in ~0.111 cycle per pipeline.
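
Written out as a Python sketch, the two steps above look like this. It only replays the arithmetic of this post: the NV30 case uses the ~4 cycles thepkrl measured, while the NV35 case plugs in 12 for both the cycle count and the pipeline count exactly as in the line above, which is an assumption and not a measurement. The function name is made up.

```python
def per_pipeline_figure(num_instructions, total_cycles, num_pipelines):
    """(instructions / total cycles) / number of fp fragment pipelines."""
    instructions_per_clock = num_instructions / total_cycles
    return instructions_per_clock / num_pipelines


# NV30: 16 adds, ~4 cycles measured, 4 fp pipelines
nv30 = per_pipeline_figure(16, 4, 4)

# NV35: same shader, with 12 used in both places as in the post (assumed)
nv35 = per_pipeline_figure(16, 12, 12)

if __name__ == "__main__":
    print(f"NV30: {nv30:.3f}")
    print(f"NV35: {nv35:.3f}")
```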
 
I think maybe I'm misunderstanding something you are stating, then.

I'm specifically not making the assumption that all the fragment shader units in the NV35 have the full functionality of the floating point fragment shader units common between the NV30 and NV35. I'm making the assumption that there are processing units that were limited to FX12 and executed the extra operations per clock per pipeline before, and those units were upgraded in the precision of their operations and output, but not the scope of operations they can perform.

This assumption could of course be wrong, and all it would require is that all of these functions in this list are fully replicated in the pipelines as much as the dot3, mul, add, etc. units are. I just don't know that it has been established that this is the case.

My vision of the NV30 is the same number of processing units and associated functionality as the NV35, with the significant problem being limited register set replication (and presumably precision of calculation, depending on how much of a mistake it was) beyond FX12 precision. Now, it is possible that those units were fully functional for all instructions, and only precluded from expression by FX12 output not being useful for the listed functions...but I also think it is possible that the processing unit functionality replication was streamlined due to how common the operations in the other instruction list are likely to be, and how expensive some of these operations might be transistor-wise.

It comes down to how much transistor space was wasted on the NV30, since the transistor count increase from NV30->NV35 doesn't seem (to me) to support the idea of upgrading both the ability of 8 processing units to perform this list of operations and adding extra floating point register replication.

My current middle ground of disaster/mistake balance for the NV30 leads me to believe that precision processing and register pipelining was upgraded, and that the NV30 did not have the ability to perform the above list of operations at 12 ops per clock while making the mistake of depending on outputting them at FX12 precision. I think that except for the deficiencies in texture processing, and except for comparison to the R350 in particular, that this decision makes a great deal of sense for transistor-efficient performance.

I also think this correlates with Microsoft's choice of description for the ps_2_a profile slide in their GDC presentation to specify intermixing texture loads and "arithmetic" operations specifically, but I may just be interpreting their meaning of "arithmetic" incorrectly.

If I've missed something, just point it out (like you did above ;) )...I was primarily waiting for Wavey's testing to evaluate what was being proposed in this regard more thoroughly.
 
The thing is, demalion, that all of NV35's int hardware is confirmed to be gone. Yes, all 12 fx units. Think about it, 12 48-bit (12-bit per color component) int units out: it makes plenty of room for 8 new fp pipelines along with 5-10 million more transistors relative to NV30. This is primarily why I assumed the new fp fragment shader additions are not more streamlined than the original 4 in NV30. The only way I could see them being streamlined is in ddx/ddy capability, where only 4 of the twelve units, in CineFX, are needed for texturing. Read the following from Wavey's NV30-R300 article which I linked in a previous post:
Texturing instructions: As we can see, NV30 provides a TXD instruction (Texture Lookup with derivatives) instead of TEXLDB instruction (Texture Lookup with LOD bias). However, TEXLDB in PS2.0/3.0 can be emulated by TEXLDD PS3.0 / TXD NV30 instruction:

dSdX = ddx (texture coordinate)
dSdY = ddy (texture coordinate)
multFactor = 2^bias
dSdX = multFactor*ddx
dSdY = multFactor*ddy
color = txd(texture coordinate, dSdX, dSdY)

So for the cases where all the shader author wants is to add a mip LOD bias, TEXB would be faster; however, this functionality is available if desired using TXD. For anisotropic filtering, TXD also provides control over the direction of anisotropy which allows some pretty nice effects.

Partial derivative instructions: These can calculate partial derivatives with respect to screen-space x or y and are useful for anti-aliasing, height-field bump mapping and computing parameters for the TXD texture lookup with partial derivatives.
Then again, psurge said ddx/ddy only requires a subtraction and data to be shared across the pipelines.
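
To see why scaling the derivatives by 2^bias acts like a LOD bias in the TXD emulation quoted above, here is a small Python sketch. It assumes the usual simplified mip selection rule, LOD = log2 of the largest derivative magnitude measured in texels. Real hardware uses a more involved footprint calculation, and the function names are made up for the example.

```python
import math


def mip_lod(ds_dx, ds_dy):
    """Simplified mip LOD: log2 of the largest derivative magnitude (in texels)."""
    rho = max(abs(ds_dx), abs(ds_dy))
    return math.log2(rho)


def biased_derivatives(ds_dx, ds_dy, bias):
    """Scale both derivatives by 2^bias, as in the quoted TXD emulation."""
    mult_factor = 2.0 ** bias
    return mult_factor * ds_dx, mult_factor * ds_dy


if __name__ == "__main__":
    ds_dx, ds_dy = 4.0, 2.0          # example screen-space derivatives
    bias = 1.5
    base = mip_lod(ds_dx, ds_dy)
    shifted = mip_lod(*biased_derivatives(ds_dx, ds_dy, bias))
    print(f"base LOD {base:.2f}, biased LOD {shifted:.2f}")
```

Under this model the biased LOD is the base LOD plus the bias, since log2(2^bias * rho) = bias + log2(rho).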

After re-reading your post, demalion, I understand what you mean a little more. You are skeptical as to whether the 8 added fp units are full fp units (like the original 4 from NV30) or upgraded combiners. It seems to me that combiners are more limited in their functionality and that it would take more time to custom design new and improved fp register combiners than to copy and paste the good old fp units of NV30. This is my reasoning, of course. Until we get Wavey's benchmarks, though, it is nothing but glorified speculation.

P.S. I've noticed a recurring pattern in which demalion and I finish off interesting architectural threads because of our endless rambling. Are we that boring? ;)
 
Luminescent-

Interesting stuff, but I wonder how much you're missing by failing to account for instruction dependencies and other such issues. Looking at your R350 numbers, you appear to be assuming that the compiler/scheduler will match up vec and scalar ops for parallel execution; ATI's 3dMark cheat would seem to indicate that their current compiler is not so good at recognizing these opportunities on its own, even when they do exist (which is far from always).

On the other hand, I'm not sure how you figure that NV35 will outperform R350 with FP16; the calculation you used shows its FP16 performance as still being a bit worse than R350, and IMO 6 registers doesn't seem like a terribly large amount, especially as shaders increase in complexity. (Increasing the used register count would of course hurt NV35's performance under both FP32 and FP16; only FP16 uses half the register count, obviously.)

Oh well; of course it's impossible to reduce an architecture's performance to a single number in any case, but there are some more things to worry about... :)
 
I guess the term "effective" will never be truly effective, Dave. I did assume the R3xx's compiler could ideally recognize parallelism in vector+scalar code (which is kind of overshadowed by the fact that the 30% to 70% scalar/vector weighting is not always the same either).

By the way, I never actually did the "effective" flops calculation of fp16 for NV35, I just made an assumption :oops: . If it will make you happy:
Just change the denominator, obtained from the 1.52 (~3/2) cycle latency inherent to 6 registers (Note: the penalty for 6 registers in fp16 equals the penalty of 3 in fp32 mode) and replace it with a division by 1.45, which makes performance equal (16.740/1.45) 11.54 gflops.

Now that I see the numbers, there doesn't seem to be such a large difference between fp32 and fp16, at least when using 6 registers. It is 11.01 gflops (fp32) vs. 11.540 gflops (fp16) or roughly a 5% performance increase between fp32 and fp16. Now, if the number of registers with fp32 starts jumping to 8 and greater, the performance of fp16 will be even greater.
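
A quick Python check of the fp16 vs. fp32 comparison, using only the numbers from this post (16.740 gflops, 1.52 and 1.45 cycles per instruction); the variable and function names are mine. Since both figures share the same numerator, the gain is simply the ratio of the two denominators.

```python
NV35_EFFECTIVE = 16.740   # gflops before any register penalty (from the post)
CPI_FP32 = 1.52           # cycles per instruction, 6 registers in fp32
CPI_FP16 = 1.45           # cycles per instruction, 6 registers in fp16


def relative_gain(cpi_slow, cpi_fast):
    """Fractional speedup when the per-instruction cost drops from slow to fast."""
    return cpi_slow / cpi_fast - 1.0


nv35_fp32 = NV35_EFFECTIVE / CPI_FP32
nv35_fp16 = NV35_EFFECTIVE / CPI_FP16
gain = relative_gain(CPI_FP32, CPI_FP16)

if __name__ == "__main__":
    print(f"fp32: {nv35_fp32:.2f} gflops")
    print(f"fp16: {nv35_fp16:.2f} gflops")
    print(f"gain: {gain * 100:.1f}%")
```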
 
Ante P said:
5800 Ultra:
Normal: 30 fps

UTTARS VERSIONS:
Old results
FP32: 17 fps
FP16: 20 fps
FullFP16: 21 fps

New results
FP32: 17 fps
16IN32: 17 fps
FP16: 20 fps
FullFP16: 21 fps
FX12: 30 fps

MDOLENCS VERSION
FP32: 17 fps
Now, I respectfully request a screenshot from each version!
To see if there is a noticeable quality difference.

And is this the "ultra" version, or the normal?
 