NV30 executes 2 integers alongside an fp instruction/clock?

Take this info with a grain of salt, but the Digit-Life article on the NV30 states:

"Number of pixel shader instructions executed per clock cycle: 2 integer and 1 floating-point or 2 (!) texture access instructions. The latter option is possible as during preceding shader's computational operations the texture units could sample texture values with known coordinates beforehand and save them in special temporary registers, which are 16 in all. I.e. the texture units can single out not more than 8 textures per clock but the pixel shader can get up to 16 results per clock."

Does this statement indicate the NV30 could execute 2 integer instructions alongside a floating-point instruction, or only one of the two at a time (the wording seems to include both, but I'm not sure)? And what would be the advantage of executing 2 integer ops alongside a floating-point op (in the case of color blending and color ops, assuming the integer units are combiners)?
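If Digit-Life's reading is right, the advantage would be straight cycle counting: the combiner ops ride along instead of queueing behind the fp op. A toy issue model (my own speculation; the issue rules below are just my interpretation of the quote, nothing confirmed about NV30):

```python
# Toy per-pipe issue model for the Digit-Life claim (speculation):
# each clock retires either (2 integer combiner ops + 1 fp op) or
# 2 texture fetches.
def cycles(n_int, n_fp, n_tex):
    """Lower bound on clocks for a shader with the given op mix."""
    tex_clocks = (n_tex + 1) // 2             # 2 fetches per clock
    math_clocks = max((n_int + 1) // 2,       # 2 int ops per clock...
                      n_fp)                   # ...co-issued with 1 fp op
    return tex_clocks + math_clocks

# A legacy-flavoured shader: 4 combiner ops, 2 fp ops, 2 fetches.
print(cycles(n_int=4, n_fp=2, n_tex=2))       # 3 clocks with co-issue
# Issued serially the same mix would need 2 + 2 + 1 = 5 clocks.
```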
 
JohnH said:
I'd heard that the old reg combiner pipe was still in there, sitting after the new FP stuff.

That is what the NV30 OpenGL extensions specification document (released last summer) says.
 
I just wanted to know if Digit-Life meant the FX could execute 2 integer instructions alongside an fp instruction concurrently, or only 1 of the two at a time. Maybe it is similar to a modern CPU, such as the Athlon, which can issue 3 integer instructions concurrently with floating-point instructions.
 
Luminescent said:
I just wanted to know if Digit-Life meant the FX could execute 2 integer instructions alongside an fp instruction concurrently, or only 1 of the two at a time. Maybe it is similar to a modern CPU, such as the Athlon, which can issue 3 integer instructions concurrently with floating-point instructions.

Though I don't know a whole lot about how shaders would be programmed, this doesn't seem like a very efficient way of doing things. That is, it seems that most shaders would use mostly FP or mostly int, rarely both at the same time. I'd still be willing to bet that the entire pixel shader pipeline in the GeForce FX is floating-point, with integer calcs handled simply because the floating-point units have a large enough mantissa to carry them exactly.

If the entire pipeline can indeed do two int ops in addition to a floating-point op, then it seems pretty obvious that those two int ops are just the legacy register combiners.
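Chalnoth's mantissa point is easy to check numerically. A quick sketch, using numpy's float32 as a stand-in for the hardware's FP32:

```python
import numpy as np

# FP32 has a 24-bit effective mantissa, so 8-bit colour values (0..255)
# come through integer-style arithmetic exactly.
a, b = np.float32(200), np.float32(55)
print(a + b == np.float32(255))        # True: no precision loss

# The exactness runs out at 2^24, far beyond any 8-bit colour math:
print(np.float32(2**24) + np.float32(1) == np.float32(2**24))  # True
```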
 
In the Beyond3D interview with NVIDIA covering the GeForce FX, David Kirk responds in the following manner to this question:

" 'I assume that anything available currently using the the 32-bit format will be run in FP16 mode?'

'Actually, no. We have native support for 32-bit integer, which is how we get the performance on the older apps. If we were to run them as FP16 then they wouldn't run as fast. So we have dedicated hardware with native support for 32-bit per pixel integer, 64-bit per pixel floating and 128-bit per pixel floating.' ''

Doesn't this confirm that the NV30 does indeed contain the register combiners for legacy support? I surmise there is a smaller latency overhead when using this dedicated integer hardware.
 
Well, it seems to say, contrary to what I had thought, that the GeForce FX has a completely separate execution unit for integer calculations from the one for floating-point calculations. This, to me, seems like a tremendous waste of transistors...but I guess I don't know the whole story about the design of this sucker. The real question, then, is whether there is a way to make use of the integer and floating-point parts in parallel in the pixel shader. I think this will need some investigating...though it would seem odd for a single shader to deal with both data types throughout.
 
Yes Chalnoth, whether the integer pipeline can be used alongside the fp pipes is what I would like to know. The wording in the Digit-Life article seems to indicate this, but it might just be a misreading or improper wording.
 
Well, if there is a separate and complete integer pipeline, then it is probably a DX8 pipeline identical to the GeForce4's. This is the only way that I could see it as being beneficial, in terms of performance.
 
Chalnoth said:
Well, if there is a separate and complete integer pipeline, then it is probably a DX8 pipeline identical to the GeForce4's. This is the only way that I could see it as being beneficial, in terms of performance.

and...

this could very well explain the high transistor count of the NV30 despite its (as it now seems) lower per-cycle processing power compared to the R300.
 
Mboeller, I wouldn't jump to conclusions about the NV30's IPC, as it probably executes just as many pixel ops (address, texture lookup, and color op) per clock as the R300, at full 32-bit precision (all the pixel processors are 32-bit by nature, and I doubt the latency is any higher than that of the R300's color processor). The NV30 has a lower IPC for vertex ops, but it is much more flexible and capable with vertices than the R300. The extra transistors (aside from the combiners) probably went to support the high-precision functions and instruction sets of the pixel and vertex pipes, the pixel color units' higher data precision, and the control logic that lets the pixel program processors (color computation units) handle half-precision and lower-precision numbers at increased rates.

Remember that the R300 spends logic on TruForm 2 support, which the NV30 lacks, as well as on data encoding for floating-point cube maps, 3D textures, etc., and on more Z sampling units per pipeline for anti-aliasing. Thus, it is quite incorrect to credit the NV30's greater transistor count to the register combiners alone, as there are clearly other forms of more advanced logic in the processor.
 
How do you get lower processing power per cycle? There is no confirmed knowledge about whether or not the NV30 can issue texture address and texture lookups in parallel with color ops, but reason would dictate it can. First, NVidia claims that their dependent lookups run very fast, which means NVidia spent time trying to lower pipeline latency and stalls on shader fetches. Second, even the DX8/OGL texture_shader could issue texture lookups "in parallel"; that was one of the reasons for the whole separate-phase stuff. Not having this capability would be a step back from the GF4. Third, the texture unit IS a separate unit and the NV30 is a VLIW architecture, so why leave silicon idle if you've got a few extra bits to signal it to do something?

We've seen speculation that perhaps the 128-bit ops run in 2 cycles, suggesting half the number of FP color units, but we have no hard evidence either way.
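One way to picture the VLIW argument is a wide instruction word with independent slots, so the texture unit stays busy while the fp unit works. This encoding is entirely hypothetical (the slot names and ops are made up, not NV30's real format):

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class VLIWWord:
    tex_slot: Optional[str] = None   # texture fetch, e.g. "tex r2, t1"
    fp_slot: Optional[str] = None    # floating-point op, e.g. a MAD
    int_slot: Optional[str] = None   # legacy combiner op

# Texture fetch issued alongside this clock's fp math, so neither
# unit waits on the other:
word = VLIWWord(tex_slot="tex r2, t1", fp_slot="mad r1, r0, c0, c1")
print(word)
```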
 
Democoder, I think it was in the ExtremeTech interview with David Kirk where he states that the entire pipeline (all the units) is natively 32-bit. Why would they create 32-bit units that were unable to process in 1 effective cycle?

If you go back to the beginning of the thread, you will notice that the instructions-per-clock performance referenced (from the Digit-Life article) was only for the color processor. It does not include the texture interpolators or address processors.
 
Even if the FP units can run 32-bit ops in one cycle, you still need enough of them to do a full 128-bit FP op in one cycle. What if the NV30 pixel pipeline only has 2 32-bit FP units per pipe? Then a 64-bit op runs in 1 cycle, but a 128-bit op takes 2. This is similar to the way vectorized instructions execute on SSE2.

However, that's only speculation, and that's my point. MBoeller spoke as if we have some kind of factual information on this issue, vs speculation based on random PR statements.
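The half-width scenario is easy to state as arithmetic. A sketch of the cycle count under that assumption (the 2-lane figure is the speculative part):

```python
import math

# If a pipe has only two 32-bit FP lanes, a 4x32f vector op splits
# into two passes, while a 4x16f op packs into a single pass.
def op_cycles(components, bits_per_component, fp32_lanes=2):
    lanes = components * bits_per_component / 32   # width in 32-bit lanes
    return max(1, math.ceil(lanes / fp32_lanes))

print(op_cycles(4, 16))   # 4x16f -> 2 lanes -> 1 cycle
print(op_cycles(4, 32))   # 4x32f -> 4 lanes -> 2 cycles
```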
 
Hmm democoder, never thought of that. But according to the 51 or 52 GFLOPS number NVidia was touting for the pixel shader, it is probable that there are 4 FMACs and 1 complex math unit per pixel color pipeline, alongside the address and interpolation units, which should also contain 4 32-bit units each. I remember coming up with a number around 51 GFLOPS taking this into account. I may be wrong, but it seems very plausible. If the NV30 were half as fast when computing 32-bit, it would not be comparable to the R300 at full precision.
 
I just have a strong feeling that the 51 GFLOPS number is incorrect (or rather that it's not for the fastest GeFX), and that it's calculated from a 400MHz core frequency.

400MHz * 8 pipes * 4 scalars/pipe * 2 FLOPs/(scalar*op) * 2 ops = 51.2 GFLOPS

2 FLOPs/(scalar*op): for a MAD (multiply plus add)
2 ops: two instructions per cycle when doing 4x16f math

Halve that number when doing 4x32f math.
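Written out as plain arithmetic (the 400MHz clock is my assumption here, not a confirmed spec):

```python
# Back-of-envelope check of the 51.2 GFLOPS figure.
clock   = 400e6   # assumed core clock in Hz
pipes   = 8
scalars = 4       # scalar lanes per pipe
flops   = 2       # FLOPs per scalar per op (a MAD is mul + add)
ops     = 2       # co-issued instructions per cycle at 4x16f

print(clock * pipes * scalars * flops * ops / 1e9)   # 51.2
# At 4x32f, ops drops to 1 and the figure halves to 25.6 GFLOPS.
```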
 
I believe NVidia was counting full floats to obtain that number. I remember in the presentation they quoted 200 GFLOPS for the entire processor and implicitly stated it was counting 32-bit floats, not half floats.
 
mboeller said:
and...

this could very well explain the high transistor count of the NV30 despite its (as it now seems) lower per-cycle processing power compared to the R300.

Talking about triangle output?

What is it, um...

FX @ 500MHz = 350 million p/sec

R300 @325MHz = 325 million p/sec

....?
 
That part is known, given the 3 vs 4 VS unit difference. However, those are peak numbers, so the actual throughput isn't known: how busy all the units can be kept, how efficiently some of the vertex shader ops execute, and how much the driver bottlenecks the VS all chop those numbers way down.
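For what it's worth, here are the per-clock rates implied by those quoted peaks (marketing numbers, not measured throughput):

```python
# Per-clock triangle rates implied by the quoted peak figures.
nv30_clock, nv30_tris = 500e6, 350e6
r300_clock, r300_tris = 325e6, 325e6

print(nv30_tris / nv30_clock)   # 0.7 triangles per clock (FX)
print(r300_tris / r300_clock)   # 1.0 triangles per clock (R300)
# Real throughput lands well below these peaks, depending on how busy
# the VS units stay and how much the driver gets in the way.
```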
 