NV30/31/34/35 Fragment Processor Diagram (Speculation)

I'm wondering if an older set of NV30 aware drivers could be compared to see if there has been an improvement of performance for the "FX12 additions" case.

Could the GF 4 execute 2 ops per clock per pixel pipeline for simple workloads? Things seem to fit the description of floating point processing + ps 1.1-ps1.3 processing + register combiners AFAICS.

When integer processing, all 3 would be available as pipeline steps, but to maintain fp precision, the fp unit seems only to be able to loop back to itself or to texture memory. Otherwise, I'd think the two add fp16/fp32 case wouldn't slow down.
 
demalion said:
Could the GF 4 execute 2 ops per clock per pixel pipeline for simple workloads? Things seem to fit the description of floating point processing + ps 1.1-ps1.3 processing + register combiners AFAICS.

Actually GF4 can do 4 multiply/dot or 2 add/mux ops/clock (for both RGB and alpha, equivalent to one RGBA vector). It can also do scale/bias/expand etc. operations on inputs/outputs, which are not supported in fragment programs.

Now that you mention it, this is actually very close to the FX12 performance, possibly suggesting that the FX12-units might be similar to combiners and perhaps even perform their functions. Or it could be just by design (NVidia wanted similar per clock performance).

PS 1.1-1.3 is really equivalent to register combiners.

demalion said:
When integer processing, all 3 would be available as pipeline steps, but to maintain fp precision, the fp unit seems only to be able to loop back to itself or to texture memory. Otherwise, I'd think the two add fp16/fp32 case wouldn't slow down.

Fixed point can be used between texture fetches and it does improve performance significantly. This means texture/fp cannot be a totally separate pipeline stage.

For example, here are three tests, each with a texture fetch, followed by 50 arithmetic ops, and finally a dependent fetch based on the result:

0.21 fragm/cycle 4.80 cycle/fragm: tex+50*addFX12+dep.tex -> 10.4 add/cyc
0.07 fragm/cycle 13.77 cycle/fragm: tex+50*addFP16+dep.tex -> 3.6 add/cyc
0.07 fragm/cycle 13.82 cycle/fragm: tex+50*addFP32+dep.tex -> 3.6 add/cyc
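
For anyone wanting to check the arithmetic, here is a minimal sketch of how the add/cyc and fragm/cycle columns appear to follow from the cycle/fragm figures. It assumes add/cyc is simply 50 adds divided by the measured cycles per fragment, so the time spent on the two texture fetches is not subtracted; the function names are made up for illustration.

```python
# Cycles per fragment copied from the three test lines above.
ADDS_PER_FRAGMENT = 50
tests = [
    ("FX12", 4.80),
    ("FP16", 13.77),
    ("FP32", 13.82),
]


def adds_per_cycle(cycles_per_fragment, adds=ADDS_PER_FRAGMENT):
    """Adds issued per cycle, ignoring the cost of the texture fetches."""
    return adds / cycles_per_fragment


def fragments_per_cycle(cycles_per_fragment):
    """Reciprocal of the measured cycles per fragment."""
    return 1.0 / cycles_per_fragment


for label, cyc in tests:
    print(f"{label}: {fragments_per_cycle(cyc):.2f} fragm/cycle, "
          f"{adds_per_cycle(cyc):.1f} add/cyc")
```

On these numbers the FX12 run finishes a fragment in roughly a third of the cycles the FP16/FP32 runs need (13.77 / 4.80 is about 2.9).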
 
Hmm...so the GF 4 functionality I thought of as "ps 1.1-1.3 + register combiners" is really just "register combiners"? Only my nv35 guess depended on them being separable, I think.

Also, I thought that DX modifiers reflected some of those input/output capabilities? Doesn't the ARB fragment extension allow similar?

Oh, and when I mentioned integer processing, I assumed from the details you provided that the "FX12 additions" wasn't using texture ops, so the fp unit could be used for an additional 4 ops (1 op per pipeline) in the same clock.

For my nv35 guess, it wouldn't be replicating the register combiners, but just part of their functionality (fixed function seems likely) given the supposed transistor budget. It should allow the integer shader processing power offered by the fp and combiners to be applied with more versatility at least.

If more registers were offered, the fp32 processing would be improved as well, and I'd think the fp16 processing could become true 8 pipeline if the expanded register space were as versatile as the registers are in nv30 EDIT: and the 8 texture op functionality could then be fully utilized for fp ops. Very significant performance gains for non texture op computation workloads for both fp16 and fp32, for barely any new processing units (if you want to count the fixed function color outputting as processing units)!

Seems quite achievable, likely, and exciting competition if they can clock it at 500 MHz...intermixed fp processing would still "hobble" it compared to the IPC for the R3x0 family, but integer computation heavy workloads and fixed function output should lead according to clock speed, and true fp16 and fp32 (as opposed to the unspecified shortcuts implemented in some places now) computations, with or without texture ops should be twice as fast compared to the nv30. Fits into their marketing (yay, a change for the better), and the observations pretty well, I think.

Looks like it will be a good card, and I'm curious if they'll be able to offer wide availability for the (5900?) "Ultra" part this time. I'd guess ATI is wondering the same thing as far as the rumored "9900" goes.
 
thepkrl wrote:
I've been testing NV30 (5800 Ultra) fragment program performance with driver 43.45 (results are the same as for 42.92 with which I started).

On another note, if the NV30 has 8 fp pixel shader processors, like the diagram indicates, why is it only capable of 4 fp ops per clock? Is this a storage limitation (framebuffer/color), or a computational one?
 
Luminescent said:
On another note, if the NV30 has 8 fp pixel shader processors, like the diagram indicates, why is it only capable of 4 fp ops per clock? Is this a storage limitation (framebuffer/color), or a computational one?

NV30 has 4 fp/fx color shading units in my speculative diagram.
 
Sorry about that, guess I confused Nvidia's assessment with yours. Supposedly, they claim, NV30 is capable of 8 shader ops per clock. I guess they mean FX12/int ops :rolleyes:.

By the way, Zephyr, don't know if I missed it, but where did you obtain the NV35 info? Is it just speculation, or did you consult resources? It seems to have a good possibility of holding true (addresses all of NV30's "flaws"), but I'm not sure whether or not to take it with a grain of salt.

When should we expect a paper launch?
 
So apparently the NV30 has:
(nothing new in this post - just summarizing)

4 units (1/pipe) that can each do one of the following in one cycle:
- an FP instruction (4 per clock)
- most of FX12 instructions (4 per clock)
- two texture fetches (8 per clock)

8 register combiners (2/pipe): NV25 functionality with FX12 precision.

Some interesting points that can be seen from the tests:

- When doing FX12 instructions the "FP" units participate.
- The compiler can optimize multiple mul/dp3 instructions into a single register combiner with some limitations (not surprising)
- FP lrp uses 2 instructions as an FMAD unit is unable to execute it in one cycle (this is the same in the R300)
- FX12 lrp can be executed in a single register combiner (not surprising)
- rsq/lit/pow/rfl are macro instructions (on R300 rsq is a basic instruction)
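
To illustrate the FP lrp point, here is a minimal sketch of why a linear interpolation costs two multiply-add instructions. It assumes the usual definition lrp(t, a, b) = t*a + (1 - t)*b; how the NV30 driver or hardware actually splits the instruction isn't stated in the tests, so the decomposition below is only one plausible way to do it, and the function names are made up.

```python
import math


def mad(a, b, c):
    """One multiply-add: a * b + c."""
    return a * b + c


def lrp_reference(t, a, b):
    """Linear interpolation as usually defined: t*a + (1 - t)*b."""
    return t * a + (1.0 - t) * b


def lrp_two_mads(t, a, b):
    """The same interpolation built from two dependent multiply-adds."""
    tmp = mad(-t, b, b)    # b - t*b, i.e. (1 - t) * b
    return mad(t, a, tmp)  # t*a + (1 - t)*b


# Spot check on a few inputs.
for t, a, b in [(0.0, 8.0, 4.0), (0.25, 8.0, 4.0), (1.0, 8.0, 4.0)]:
    assert math.isclose(lrp_two_mads(t, a, b), lrp_reference(t, a, b))
```

The second multiply-add needs the result of the first, so a unit that issues one multiply-add per cycle can't finish the interpolation in a single cycle.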
 
Hyp-X said:
So apparently the NV30 has:
...

You forgot to mention the NV30's ability to do COS & SIN in one cycle.
Oh, sure, it's what most had expected, but I remember some debate about it some time ago.


Uttar
 
Uttar said:
You forgot to mention the NV30's ability to do COS & SIN in one cycle.
Oh, sure, it's what most had expected, but I remember some debate about it some time ago.

Uttar

Yes, SINCOS is only available as a macro in PS2.0 (R300), and it takes 8 instruction slots.

Somehow I'm not excited about SIN/COS in the PS, I'd like ATAN much better. ;)
 