NV30 arbitrarily executes 2 ALU instructions per cycle?

In the analysis of Nvidia's FLOP number, it was observed that the pixel pipeline of the NV30 is capable of executing 2 FMADs, or 2 ALU instructions, per cycle. I was curious about the precision of each FMAD. Are the units working on full floats or on half-float data to obtain this number (51 GFLOPS)? Whether it is working on half-float data or full-float data to achieve the stated FLOP number, is the NV30 issuing two distinct instructions to the ALUs, or is it just one ALU working on two groups of data? For example, suppose the NV30 can only execute 2 ALU instructions on half-float data, one for each of the 2 sets of half-floats (16 bits per component): would the pixel unit be able to issue a sin/cos instruction to one ALU while issuing a dp4 instruction to the other?

We know that the R300 can execute both a scalar and a vector operation in one cycle (plus a texture op). Although the vector unit in the pixel program processor handles just 3 components (RGB), there is also a scalar unit (I'm guessing for the A channel) which works in parallel with the vec3 processor. Would the R300 hold any advantage over the NV30 because it can execute a scalar and a vector operation simultaneously, or would the NV30 be able to use one vector unit for scalar processing (math functions) and the other vector unit for RGBA calculations (assuming it can execute 2 arbitrary half-float instructions per cycle in 64-bit mode)?

Thank you if you can be of any help.
 
I don't know for sure.

If it were an integer unit, I'd be certain you could split them apart with little effort (or couple them together, depending on how you look at it).

But with FP, I'm not sure the complexity of the floating-point operations would lend itself to this sort of thing. Then again, I'm ignorant of the operations that go on inside a logical floating-point device.
 
Well, what I was referring to with the term ALU was generally the FPU. I assumed the term ALU could describe either a floating-point or an integer device. Sorry if I misled anyone.
 
RussSchultz said:
But with FP, I'm not sure the complexity of the floating-point operations would lend itself to this sort of thing. Then again, I'm ignorant of the operations that go on inside a logical floating-point device.

No, CPUs must do this for support of doubles and long doubles on 32-bit processors (64-bit and 128-bit floats, respectively). As a side note, I recently did a quick little performance test on my Athlon, using a program that is easily small enough to fit entirely within cache, and I found that irrespective of memory bandwidth, doubles run at half the speed of floats, meaning that the 64-bit float is actually comprised of two 32-bit floats for processing.

Now, the big question is: how many transistors does such a thing take?
 
The vec4 FMAD unit of the NV30 pixel program processor is probably a SIMD engine which works on 32-bit float components and 16-bit half-floats. If it can execute 2 half-float operations in the time it executes one float operation, then I am guessing the unit is comprised of either 2 64-bit vec4 processors or a single 128-bit vec4 unit which can process 8 half-float inputs or 4 float inputs. Do you believe the pixel program processor is comprised of one 128-bit vec4 unit (1 instruction issued, 2 operations executed per cycle) or 2 64-bit vec4 units, allowing separate instructions to be issued while still performing 2 operations per clock (in half-float mode)?

The only problem I see with 2 64-bit vec4 units lies in the fact that the pipeline must be able to carry out a 4-component float operation in 1 cycle. I do not know how 2 smaller 64-bit units would be joined to compose a 128-bit vec4 unit.
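As a quick illustration of the bit budgets involved (using Python's struct module, which supports IEEE half-floats via the 'e' format; this only shows the storage widths, and says nothing about the actual hardware):

```python
import struct

# Two IEEE 754 half-floats (16 bits each) occupy the same 32 bits as one
# single-precision float, which is why a 128-bit vec4 datapath could in
# principle carry eight fp16 values instead of four fp32 values.
two_halves = struct.pack('<2e', 1.5, -2.25)   # 4 bytes
one_float  = struct.pack('<f', 3.14159)       # 4 bytes
assert len(two_halves) == len(one_float) == 4

# A full fp32 vec4 needs 128 bits; the same width holds eight halves.
vec4_fp32 = struct.pack('<4f', 1.0, 2.0, 3.0, 4.0)
vec8_fp16 = struct.pack('<8e', *range(8))
assert len(vec4_fp32) == len(vec8_fp16) == 16
```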
 
Hmm, if this is true, it is a pretty nifty setup. However, where in the article did you find that the NV30 uses 64-bit word lengths? Maybe from the following paragraph:

"Output Register and temp register share the same space in NV30. That is to say a fragment program fails to load if its total temporary and output register count exceeds 64. Each FP32 temporary or output register used by the program counts as two registers, and each FP16 temporary or output register used by the program count as a single register. R300 has a similar limitation."
 
Chalnoth said:
No, CPUs must do this for support of doubles and long doubles on 32-bit processors (64-bit and 128-bit floats, respectively). As a side note, I recently did a quick little performance test on my Athlon, using a program that is easily small enough to fit entirely within cache, and I found that irrespective of memory bandwidth, doubles run at half the speed of floats, meaning that the 64-bit float is actually comprised of two 32-bit floats for processing.

Now, the big question is: how many transistors does such a thing take?

First of all, floating-point numbers are either 32-bit, 64-bit or 80-bit on x86 platforms.
Second, it's a much more complex thing to make a piece of arithmetic hardware which can operate either on one 64-bit float or on two 32-bit floats than it would be for integer units. For integer units it's pretty much straightforward for most operations; not so for floating point. Cutting precision, though, is straightforward, and it's what's done on CPUs. Even though all floats, regardless of size, are stored in 80-bit stack entries, you can set the operating precision through the precision-control field of the x87 control word. If you measured exactly double the speed with floats over doubles, then it's very likely coincidental that it's exactly double. There are too many things that prohibit such architectures in CPUs, especially on x86. What you are seeing is that operating in lower-precision mode takes fewer cycles per instruction.

For a GPU, though, I think the easiest way to do it would be to have one full unit that can operate on a full-precision float but can cut precision for lower-precision operands, plus a smaller floating-point unit that operates only on half-floats.
 
For floating-point multiplies, you can subdivide the multiplier into 2 smaller multipliers that either operate on half-precision data or operate combined on full-precision data. There is no correspondingly easy trick for splitting FP adds, though, which should be clear from how an FP add is done:
  • First, you align the two inputs so that they have the radix point in the same position
  • Then, you perform the actual addition
  • Then, you detect and discard any leading zeros in the result (renormalization)
  • Finally, you apply rounding.
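The steps above can be sketched as a toy add. This is a deliberately simplified illustration (no signs, so the leading-zero step that matters for subtraction is vacuous here; no IEEE special cases; MANT_BITS and the (mantissa, exponent) pair representation are chosen purely for demonstration):

```python
# Toy base-2 floating-point add following the steps listed above, on
# (mantissa, exponent) pairs with value = mantissa * 2**exponent.
MANT_BITS = 8  # toy precision

def fp_add(a, b):
    (ma, ea), (mb, eb) = a, b
    # Step 1: align the radix points by shifting the smaller input.
    if ea < eb:
        (ma, ea), (mb, eb) = (mb, eb), (ma, ea)
    mb >>= (ea - eb)
    # Step 2: the actual integer addition of the aligned mantissas.
    m, e = ma + mb, ea
    # Step 3: renormalize if the sum overflowed the mantissa width
    # (with signed operands this is where leading-zero detection goes).
    while m >= (1 << MANT_BITS):
        # Step 4: rounding (round-to-nearest on the bit shifted out).
        m = (m + 1) >> 1
        e += 1
    return (m, e)

# 1.0 is mantissa 128, exponent -7, since 128 * 2**-7 == 1.0
one, half = (128, -7), (128, -8)
m, e = fp_add(one, half)
assert m * 2.0**e == 1.5
```

The alignment and renormalization shifts are exactly what prevents two narrow adders from being glued into one wide one the way multiplier arrays can be.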
And, Chalnoth, did you make sure to align all the doubles to memory addresses divisible by 8, as well as making sure the entire data set fit into L1 cache? Misaligned memory accesses could well produce the 50% speed hit you saw. Also, did the compiler use SSE/3DNow! instructions for the single-precision data (which would speed up single-precision operations tremendously)?

AFAIK, a single-precision FP multiplier takes about 20K to 35K transistors, depending on the level of pipelining; a single-precision FP adder takes somewhat less, strongly dependent on whether it is optimized for speed, power usage or area usage.
 
Arjan, is it possible to have two double-precision (64-bit) FPUs which work SIMD-fashion on 4 16-bit values each and which can be combined into a 128-bit unit for 4x32-bit operation?
 
Luminescent said:
Arjan, is it possible to have two double-precision (64-bit) FPUs which work SIMD-fashion on 4 16-bit values each and which can be combined into a 128-bit unit for 4x32-bit operation?
Only partially. You would need separate adders for each precision level, possibly reusing some adder circuitry as Humus indicated. Reusing the same multiplier circuits at each precision level should be doable, though.

Although I still consider such a circuit to be a bit of a chimera.
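The reason the multiplier half combines so readily is ordinary schoolbook splitting: a wide multiply decomposes exactly into four narrow partial products, so the same narrow multiplier cells can serve one wide or two narrow multiplies. A sketch (mul32_from_mul16 is an illustrative name, not any real hardware interface):

```python
# A 32x32 multiply built from four 16x16 partial products:
#   a*b = (ah*2^16 + al)(bh*2^16 + bl)
#       = ah*bh*2^32 + (ah*bl + al*bh)*2^16 + al*bl
def mul32_from_mul16(a, b):
    ah, al = a >> 16, a & 0xFFFF
    bh, bl = b >> 16, b & 0xFFFF
    return (ah * bh << 32) + ((ah * bl + al * bh) << 16) + al * bl

a, b = 0x12345678, 0x9ABCDEF0
assert mul32_from_mul16(a, b) == a * b
```

No analogous exact decomposition exists for the FP adder, because of the alignment and renormalization shifts arjan listed.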
 
Would it be possible that the NV30 pixel program processor FPU uses a setup similar to the Hitachi SH-4's, which contains 4 32-bit fmuls and a 128-bit fadd unit? The following picture is more descriptive:
http://www.segatech.com/technical/cpu/sh4fpudiagram.gif

The author of the segatech article writes:
"The SH-4 architecture includes impressive 3D floating point hardware. Each of the four floating point multipliers (fmuls) can receive two 32-bit values and produce a multiplied result that is passed to a four-input floating point adder. This hardware reads two 128-bit vectors (two sets of four 32-bit values) out of register files, multiplies the four 32-bit pairs at the same time, adds the four products together, and puts the 32-bit result back into the register file. This provides the equivalent of 288-bit data crunching (2 x 128 + 32 = 288).

A typical application for this processing power would be to perform the following transformation instruction, which involves seven operations:

f0*f4 + f1*f5 + f2*f6 + f3*f7 → f7

The SH-4 can execute this seven-operation instruction in three clock cycles. Yet, because the architecture is fully pipelined, it can issue one of these instructions every cycle.

The figure (above, right) shows a better example of what the SH-4's floating point hardware can accomplish. Here the back register file is loaded with 16 values and the hardware performs the following matrix operation in seven clock cycles:

f0*b0 + f1*b1 + f2*b2 + f3*b3 → f0
f0*b4 + f1*b5 + f2*b6 + f3*b7 → f1
f0*b8 + f1*b9 + f2*b10 + f3*b11 → f2
f0*b12 + f1*b13 + f2*b14 + f3*b15 → f3

The SH-4 is fully pipelined, and the RISC architecture can repeat these 16 fmuls and 12 fadds (28 operations) every four clock cycles, for an average of seven floating point operations per cycle. The superscalar CPU and double-precision fmov allow registers to be loaded from, and stored to cache during these four cycles, so the operations are sustainable. At its 200-MHz clock speed, the SH-4 achieves 1.4-GFlops performance, sustained."

Does this seem like a possible FPU approach for the NV30? I don't mean exactly this (there could be alterations, such as 64-bit fmuls which can be split), but does the general gist of it seem possible?
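As a sanity check on the quoted operation counts, the SH-4 matrix transform written out in plain Python (just the arithmetic, making no claim about how the hardware schedules it):

```python
# A 4x4 matrix (b0..b15) times a vec4 (f0..f3) costs 16 multiplies and
# 12 adds per result vector: 28 FLOPs, matching the article's count of
# 28 operations per 4 cycles. Each row is one 4-way dot product.
def mat4_vec4(b, f):
    assert len(b) == 16 and len(f) == 4
    out = []
    for row in range(4):
        # one 4-way dot product: 4 muls + 3 adds
        out.append(sum(f[i] * b[4 * row + i] for i in range(4)))
    return out

identity = [1, 0, 0, 0,
            0, 1, 0, 0,
            0, 0, 1, 0,
            0, 0, 0, 1]
assert mat4_vec4(identity, [5, 6, 7, 8]) == [5, 6, 7, 8]
```

Note that all four dot products read the original f0..f3, which is why the hardware description has the results written back only after the whole operation.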
 
Sounds like a plain dot-product to me - I would expect each of the FP vector units in the NV30 (in both vertex and pixel shaders), however many they are in each mode, to be capable of doing one such dot-product per clock cycle (throughput, at least).
 
I just figured out that the entire pixel shader, or pixel unit, is a vector processor in itself that can issue instructions in VLIW format (which can contain both scalar and SIMD commands) to any of its logical units (pixel program processor, texture address unit, TMUs), similar to the VUs in the PS2 Emotion Engine.
The situation is explained in the following article:
http://www.arstechnica.com/reviews/1q00/playstation2/ee-6.html

If a similar situation holds true for the NV30, then the pixel program processor is probably composed of something like 4 32-bit FMACs which can be arranged in a SIMD execution format, plus an fdiv unit for the more complex math instructions. Does this seem realistic? Hopefully Nvidia's implementation is similar.
 
Does anyone believe the pixel shader unit will be similar to the vertex shader in the NV2x series of processors, with 4 FMACs and a special-purpose processor?

I am very interested in processing architectures and hardware (learning them is a hobby of mine), so please do not let my questions annoy anyone surfing this thread or forum. That said, there is one more inquiry. Suppose that for some arbitrary operation we only need a 2-way SIMD operation to take place within a multithreaded vector core. The core contains 4 FMADs which are normally utilized in a vec4 SIMD setup. If the vector core were able to issue 1 vector and 1 scalar operation in parallel, would we be able to execute the 2-way SIMD instruction on 2 of the FMADs and the scalar instruction on one of the other 2 FMADs?
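The co-issue question can be phrased as a toy lane-allocation model. This is purely hypothetical, not how the NV30 is documented to work; it just makes the question concrete:

```python
# Four FMAD lanes; a "cycle" can pair ops only if they claim disjoint lanes.
def issue_cycle(ops):
    """ops: list of (name, lanes) tuples; True if all co-issue in one cycle."""
    used = set()
    for name, lanes in ops:
        if used & set(lanes):      # lane conflict: cannot dual-issue
            return False
        used |= set(lanes)
    return len(used) <= 4          # only four FMAD lanes exist

# 2-way SIMD on lanes 0-1 plus a scalar on lane 2 fits in one cycle.
assert issue_cycle([("vec2_mad", (0, 1)), ("scalar_op", (2,))])
# Two vec4 ops both need all four lanes: conflict, cannot co-issue.
assert not issue_cycle([("vec4_mad", (0, 1, 2, 3)), ("vec4_mad", (0, 1, 2, 3))])
```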
 
arjan de lumens said:
Sounds like a plain dot-product to me - I would expect each of the FP vector units in the NV30 (in both vertex and pixel shaders), however many they are in each mode, to be capable of doing one such dot-product per clock cycle (throughput, at least).

Naturally. Otherwise a DOT3 op would incur a serious performance hit in the pixel pipeline.
 
Luminescent said:
In the analysis of Nvidia's FLOP number, it was observed that the pixel pipeline of the NV30 is capable of executing 2 FMADs, or 2 ALU instructions, per cycle.

Just a point on your original statement: I think you'll find that the 2 floating-point ops per cycle are the multiply and the add of an FMAD. 1 instruction... 2 FLOPs.
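In other words (the clock and pipe counts below are placeholders for illustration, not confirmed NV30 figures):

```python
# One FMAD is one instruction but two floating-point operations (a
# multiply and an add), which is how marketing FLOP counts are derived.
def fmad(a, b, c):
    return a * b + c          # 2 FLOPs, 1 instruction

FLOPS_PER_FMAD = 2

def gflops(clock_hz, pipes, fmads_per_pipe):
    return clock_hz * pipes * fmads_per_pipe * FLOPS_PER_FMAD / 1e9

assert fmad(2.0, 3.0, 4.0) == 10.0
# Hypothetical example: 500 MHz, 8 pipes, 2 FMADs per pipe.
assert gflops(500e6, 8, 2) == 16.0
```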
 