According to an interview Andy Keane gave to PCGH (PC Games Hardware), each CUDA core consists of:
a DP-FMA, an SP-FMA and an integer ALU
... and they say that in some cases the DP-FMA can be used for SP tasks.
So are we talking about up to 4 FLOPs per clock per CUDA core?
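As a sanity check on what the two readings would imply, here is the peak single-precision arithmetic for the whitepaper's 512-core chip. The 1.5 GHz shader clock is purely a placeholder of mine, since final clocks haven't been confirmed:

512 cores × 2 FLOPs/clock × 1.5 GHz = 1536 GFLOPS (one SP FMA per core per clock)
512 cores × 4 FLOPs/clock × 1.5 GHz = 3072 GFLOPS (if the DP-FMA could co-issue a second SP FMA)

A doubled peak like the second line would be hard to miss in Nvidia's marketing, which is part of why the 4-FLOPs reading seems doubtful to me.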
Nvidia's whitepaper said:

"Each SM features 32 CUDA processors—a fourfold increase over prior SM designs. Each CUDA processor has a fully pipelined integer arithmetic logic unit (ALU) and floating point unit (FPU). Prior GPUs used IEEE 754-1985 floating point arithmetic. The Fermi architecture implements the new IEEE 754-2008 floating-point standard, providing the fused multiply-add (FMA) instruction for both single and double precision arithmetic.

A frequently used sequence of operations in computer graphics, linear algebra, and scientific applications is to multiply two numbers, adding the product to a third number, for example, D = A × B + C. Prior generation GPUs accelerated this function with the multiply-add (MAD) instruction that allowed both operations to be performed in a single clock. The MAD instruction performs a multiplication with truncation, followed by an addition with round-to-nearest even. Fermi implements the new fused multiply-add (FMA) instruction for both 32-bit single-precision and 64-bit double-precision floating point numbers (GT200 supported FMA only in double precision) that improves upon multiply-add by retaining full precision in the intermediate stage. The increase in precision benefits a number of algorithms, such as rendering fine intersecting geometry, greater precision in iterative mathematical calculations, and fast, exactly-rounded division and square root operations."
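The difference between MAD's two roundings and FMA's single rounding is easy to show in code. Below is a minimal sketch, assuming a CUDA-capable device; it uses the documented rounding intrinsics __fmul_rn, __fadd_rn and __fmaf_rn (round-to-nearest for both steps, rather than the legacy MAD's truncating multiply, but the point is the same: the separately rounded product loses its low-order bits before the add, while FMA feeds the full-precision product into the add). Error checking is omitted for brevity.

#include <cmath>
#include <cstdio>
#include <cuda_runtime.h>

__global__ void mad_vs_fma(float a, float b, float c,
                           float *out_mad, float *out_fma)
{
    // Separately rounded multiply, then add: two roundings, like the MAD path.
    out_mad[0] = __fadd_rn(__fmul_rn(a, b), c);
    // Fused multiply-add: one rounding at the very end.
    out_fma[0] = __fmaf_rn(a, b, c);
}

int main()
{
    // a * b = 1 + 2^-11 + 2^-24 exactly; the 2^-24 term sits below the
    // single-precision ulp near 1, so a rounded product drops it.
    // c cancels everything else, leaving only that low-order term.
    float a = 1.0f + ldexpf(1.0f, -12);
    float b = a;
    float c = -(1.0f + ldexpf(1.0f, -11));

    float *d;  // d[0] = MAD-style result, d[1] = FMA result
    cudaMalloc(&d, 2 * sizeof(float));
    mad_vs_fma<<<1, 1>>>(a, b, c, &d[0], &d[1]);

    float h[2];
    cudaMemcpy(h, d, 2 * sizeof(float), cudaMemcpyDeviceToHost);
    printf("mul+add: %g\n", h[0]);  // 0        -- 2^-24 term rounded away
    printf("fma:     %g\n", h[1]);  // 5.96e-08 -- FMA keeps it (= 2^-24)
    cudaFree(d);
    return 0;
}

The inputs are picked so the exact product needs 25 significand bits: the two-rounding path returns exactly 0, while the FMA path recovers the 2^-24 term, which is the "full precision in the intermediate stage" the whitepaper is talking about.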
They aren't talking about two separate execution units for DP and SP operations anywhere in their whitepaper. I still believe it's 2 FLOPs/clock for each core.