NVIDIA's multipurpose ALU

nAo

One ALU to rule them all ;)

[SIZE=+1]Multipurpose multiply-add functional unit[/SIZE]
A multipurpose functional unit is configurable to support a number of operations including multiply-add and comparison testing operations, as well as other integer and/or floating-point arithmetic operations, Boolean operations, and format conversion operations.
[SIZE=+1]Multipurpose functional unit with multiply-add and logical test pipeline[/SIZE]
A multipurpose functional unit is configurable to support a number of operations including multiply-add and comparison testing operations, as well as other integer and/or floating-point arithmetic operations, Boolean operations, and format conversion operations.
[SIZE=+1]Multipurpose functional unit with combined integer and floating-point multiply-add pipeline[/SIZE]
A multipurpose functional unit is configurable to support a number of operations including floating-point and integer multiply-add operations, as well as other integer and/or floating-point arithmetic operations, Boolean operations, comparison testing operations, and format conversion operations.
Marco
 
Not at all..it's a wonderful sunny day in Italy now and I'm about to go out..byebye :)
 
Hmmm .. while patent applications are interesting from a technical point of view, as an indicator of what to expect in near-future technology, I would be rather surprised if these patents get granted - the claims are, after all, a fairly accurate description of execution units found in a ton of CPU designs, such as e.g. the Altivec unit in PowerPC G4.
 
arjan de lumens said:
Hmmm .. while patent applications are interesting from a technical point of view, as an indicator of what to expect in near-future technology, I would be rather surprised if these patents get granted - the claims are, after all, a fairly accurate description of execution units found in a ton of CPU designs, such as e.g. the Altivec unit in PowerPC G4.

Not saying that it isn't found in tons of CPU designs, but according to Ars, the Altivec unit in the G4 is a combination of two integer units, an FP unit and a permute unit.
http://arstechnica.com/cpu/03q1/ppc970/ppc970-7.html

It would be interesting to see how the complexity of the proposed combined unit compares to simpler units.
It seems like it could be an interesting thing to implement in D3D10 where one kind of ALU could handle both integer and FP instructions, since it should lead to higher utilization of the ALU units.
 
Interesting, thx for the leg work nAo,

I wonder, though: multifunctional units in CPUs tend to have a performance disadvantage over specialized units - the old adage in computing is that you can have flexibility or performance, not both. What would the implications be of a multifunctional unit like this? I know about the stance that nV has taken with unification, but is this the same concept of unification? It seems a bit different.
 
Razor1 said:
Interesting, thx for the leg work nAo,

I wonder, though: multifunctional units in CPUs tend to have a performance disadvantage over specialized units - the old adage in computing is that you can have flexibility or performance, not both. What would the implications be of a multifunctional unit like this? I know about the stance that nV has taken with unification, but is this the same concept of unification? It seems a bit different.

You can have flexibility and performance, but not for free - it still costs extra transistors.
The adage should read: "flexible, performant and small - pick two."
 
Hey, cool. Thanks, nAo. Green love (how appropriate in this case!) left in the usual place.

At a macro level, it's interesting that these are from late 2004, while some others we saw that seemed USC-aimed were from mid-2003. Otoh, "2 years" seems to be a decent benchmark for turning a new design into a marketable reality, and these would still point you at late 2006. . .lessee, what are we expecting from NV in late 2006? :smile:
 
The main disadvantage of having an excessively multi-function unit is that it generally has somewhat higher latency than separate units by themselves would have (which is less of a problem in a GPU than in a CPU, though). Also, for power consumption, you would probably want to take some care to not accidentally activate parts of the unit that you don't need, which is easier with smaller separated units than a monster-unified unit.

In the example with Altivec, it does appear that G4 indeed has multiple units, as does G5 (970); however at least G5 performs instruction scheduling as if the group of units (except for the permute) was just one single massive unit, leading me to suspect that there is a substantial amount of resource sharing going on (in particular, it would appear rather natural in such a situation to physically share multipliers between the integer and FP functions.)
 
geo said:
At a macro level, it's interesting that these are from late 2004, while some others we saw that seemed USC-aimed were from mid-2003. Otoh, "2 years" seems to be a decent benchmark for turning a new design into a marketable reality, and these would still point you at late 2006. . .lessee, what are we expecting from NV in late 2006? :smile:

Aren't the two sets of patents complementary though? I haven't read this second set yet but I would think ALU design is at a lower level than the USC described earlier. As in, this ALU is a component of the processor in the USC. It would be interesting if none of this makes it into G80 - since they're going to be hanging on that for at least two years.
 
Megadrive1988 said:
so this ALU technology is likely for the Nv6X generation, not the Nv5X / G8x generation ?
Actually, it's nearly completely obvious to me that it's for the G8x, because it just wouldn't make sense for it to handle integers any other way. And yes, pretty much everything in this patent is required for DX10, except the way they handle lookup tables, but that's already been done to one extent or another since NV3x, afaik.

Here's a DX10 SDK excerpt I found on GPGPU.org a while ago that clearly indicates how high the integer requirements of DX10 are:
Integer and Bitwise Support
The common shader core provides a full set of IEEE-compliant 32-bit integer and bitwise operations. These operations enable a new class of algorithms in graphics hardware - examples include compression and packing techniques, FFT's, and bitfield program-flow control.

The int and uint data types in Direct3D 10 HLSL map to 32 bit integers in hardware.

Differences between Direct3D 9 and Direct3D 10:

In Direct3D 9 stream inputs marked as integer in HLSL were interpreted as floating-point. In Direct3D 10, stream inputs marked as integer are interpreted as a 32 bit integer.

In addition, boolean values are now all bits set or all bits unset. Data converted to bool will be interpreted as TRUE if the value is not equal to 0.0f (both positive and negative zero are allowed to be FALSE) and FALSE otherwise.

Bitwise operators
The common shader core supports the following bitwise operators:

Operator  Function
~         Logical Not
<<        Left Shift
>>        Right Shift
&         Logical And
|         Logical Or
^         Logical Xor
<<=       Left Shift Equal
>>=       Right Shift Equal
&=        And Equal
|=        Or Equal
^=        Xor Equal

Bitwise operators are defined to operate only on Int and UInt data types. Attempting to use bitwise operators on float, or struct data types will result in an error. Bitwise operators follow the same precedence as C with regard to other operators.

Binary Casts
Casting between an int and a float type converts the numeric value following C rules for truncation of int data types. Casting a value from a float, to an int, and back to a float results in a lossy conversion according to the defined precision of the target.

Binary casts may also be performed using HLSL intrinsic functions. These cause the compiler to reinterpret the bit representation of a number as the target data type. Here are a few examples:

asfloat() // Input data is aliased to float
asint()   // Input data is aliased to int
asuint()  // Input data is aliased to uint
Obviously, this is mostly what I'd call a "good performance/mm²" compromise - the goal here is not to pay for any extra units to make it fast, while still staying within the spec and having very reasonable, scalable performance (because it isn't just one mini-unit you added that's massively insufficient). The obvious disadvantage here is that it will likely take 2 cycles for int32:
[0089] In one embodiment, the multiplier supports up to 24-bit times 24-bit multiplications. Products of larger operands (e.g., 32-bit integers) can be synthesized using multiple multiplication operations (e.g., multiple 16-bit times 16-bit multiplication operations) as is known in the art. In other embodiments, the multiplier may have a different size and may support, e.g., up to 32-bit time, 32-bit multiplication. Such design choices are not critical to the present invention and may be based on considerations such as chip area and performance.
And I also wonder how it's going to handle vectorization (Are all units going to be this way? Or is it going to be Vec3/Vec4 FP32+Scalar Multipurpose?), among other things, although this is mostly a design decision they'll have to make depending on what they expect programmers to do with it. It is also questionable whether we'd be able to queue another specific kind of op during the second cycle of an int32 addition or multiplication, as well as whether int32 bitmasks are also going to take 2 cycles.
Another possibility, of course, is to make those units have a 39-bit or 40-bit mantissa from the get-go, but that does feel like overkill...


Uttar
 
Uttar said:
Actually, it's nearly completely obvious to me that it's for the G8x, because it just wouldn't make sense for it to handle integers any other way. And yes, pretty much everything in this patent is required for DX10, except the way they handle lookup tables, but that's already been done to one extent or another since NV3x, afaik.

Here's a DX10 SDK excerpt I found on GPGPU.org a while ago that clearly indicates how high the integer requirements of DX10 are:
Obviously, this is mostly what I'd call a "good performance/mm²" compromise - the goal here is not to pay for any extra units to make it fast, while still staying within the spec and having very reasonable, scalable performance (because it isn't just one mini-unit you added that's massively insufficient). The obvious disadvantage here is that it will likely take 2 cycles for int32:
And I also wonder how it's going to handle vectorization (Are all units going to be this way? Or is it going to be Vec3/Vec4 FP32+Scalar Multipurpose?), among other things, although this is mostly a design decision they'll have to make depending on what they expect programmers to do with it. It is also questionable whether we'd be able to queue another specific kind of op during the second cycle of an int32 addition or multiplication, as well as whether int32 bitmasks are also going to take 2 cycles.
Another possibility, of course, is to make those units have a 39-bit or 40-bit mantissa from the get-go, but that does feel like overkill...


Uttar


I can go along with that, no prob. After all, the NV5x generation is gonna be around for 2+ years anyway, even if that includes the "G9x" series.

assuming NV6x starts out as the "G100" series.
 