One problem here for the OP is that a unit considered an "ALU" in the context of GPUs is actually a much more complicated functional unit than what is usually called an "ALU" in the context of CPUs. (This is made worse by the fact that, while there is plenty of decent-quality literature and university coursework available on CPU architecture and design, there is practically nothing of the sort available for GPU architecture and design.)
In CPU terminology, a unit called an "ALU" is generally only capable of performing a single (scalar) operation at a time, and the operations it can perform are limited to very simple integer operations (add, subtract, compare, and/or/xor, shift); other common operations such as load/store, multiply, branch, as well as anything that is vector or floating-point, are normally handled by other units that are NOT normally referred to as ALUs.
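To put the CPU side in concrete terms, here is a rough Python sketch of what a CPU-style ALU amounts to: one simple integer operation on two scalar operands per call. The 32-bit width, the operation names, and the `scalar_alu` function itself are assumptions made up for illustration; they don't describe any particular CPU.

```python
MASK32 = 0xFFFFFFFF  # assumed 32-bit datapath, purely for illustration

def scalar_alu(op, a, b):
    """Perform exactly one simple integer operation on two scalar operands."""
    a &= MASK32
    b &= MASK32
    if op == "add":
        return (a + b) & MASK32
    if op == "sub":
        return (a - b) & MASK32
    if op == "and":
        return a & b
    if op == "or":
        return a | b
    if op == "xor":
        return a ^ b
    if op == "shl":
        return (a << (b & 31)) & MASK32
    if op == "shr":
        return a >> (b & 31)
    if op == "ltu":
        # unsigned compare: 1 if a < b, else 0
        return int(a < b)
    # No multiply, load/store, branch, vector or floating-point here:
    # in CPU terms those belong to other functional units.
    raise ValueError(f"not an ALU operation: {op}")
```

Note that asking this model for something like "mul" or a floating-point add simply fails, which is the point: in CPU terms those are not ALU operations.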
In a GPU, however, an "ALU" has come to mean something rather different: it is a unit capable of performing at least a full four-element-vector floating-point operation all at once, implying the existence of at least four floating-point units. Many GPU "ALU" designs also offer extra options: if you don't need a full length-4 vector calculation, you can split the four units so that some of them work on one instruction and the others on a different instruction at the same time (in the case of the R520 pixel shader ALUs as listed above, the only split available is length-3 vector + scalar).
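To make the split concrete, here is a rough Python sketch of a four-lane unit that can either run one length-4 vector operation, or a length-3 vector operation plus an independent scalar operation, in a single issue. It is only a toy model of the idea, not a description of how the R520 or any other real hardware is built; the function names `issue_vec4` and `issue_vec3_plus_scalar` are made up for the example.

```python
import operator

def issue_vec4(op, a, b):
    """One issue slot: the same operation applied across all four lanes."""
    assert len(a) == len(b) == 4
    return [op(x, y) for x, y in zip(a, b)]

def issue_vec3_plus_scalar(vec_op, a3, b3, scalar_op, s, t):
    """One issue slot, split 3+1: three lanes run vec_op, the fourth runs scalar_op."""
    assert len(a3) == len(b3) == 3
    vec_result = [vec_op(x, y) for x, y in zip(a3, b3)]
    scalar_result = scalar_op(s, t)
    return vec_result, scalar_result

# Full width: a 4-component add (e.g. an RGBA color).
rgba = issue_vec4(operator.add, [1.0, 2.0, 3.0, 4.0], [0.5, 0.5, 0.5, 0.5])

# Split: multiply an RGB triple while independently adding two alpha values.
rgb, alpha = issue_vec3_plus_scalar(
    operator.mul, [1.0, 2.0, 3.0], [2.0, 2.0, 2.0],
    operator.add, 0.25, 0.5,
)
```

Either way, all four floating-point lanes are kept busy in the same issue slot; the split just lets two different instructions share them.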