This is incorrect. Check the SIMD Video Instructions in the PTX manual; these operate on both 8-bit and 16-bit integers.
Chapter 5.2.2 of the NVIDIA PTX ISA documentation states:
"The .u8 and .s8 types are restricted to ld, st, and cvt instructions. The ld and st
instructions also accept .b8 type. Byte-size integer load instructions zero- or signextended the value to the size of the destination register."
You cannot perform any ALU calculations on 8-bit integer types (only load/store/convert is available). There are no native 8-bit registers; 8-bit values are loaded into larger registers (zero- or sign-extended).
The Type Conversion chapter in the CUDA programming guide also states:
"Sometimes, the compiler must insert conversion instructions, introducing additional execution cycles. This is the case for: Functions operating on variables of type char or short whose operands generally need to be converted to int". This pretty much means that CUDA generally processes most integer math in 32-bit registers.
The newest CUDA C Programming Guide (http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#maximize-instruction-throughput) now covers compute capability 3.5 (Kepler K20). K20 64-bit float multiplication is 1/3 rate, but 32-bit integer multiplication is still only 1/6 rate. 32-bit integer shifts are faster (now 1/3 rate).
Not that it changes things too much, but GCN can actually do 24-bit integer ops (including integer MADD) at full speed, although I don't think they're directly accessible through HLSL or GLSL. They also added a 4x1 SAD instruction that works with 32-bit colors.
Did they mention the throughput of the SAD instruction? K10/K20 have only 1/6 SAD rate (same as 32-bit integer multiply).
24-bit integer operations are usually good enough for graphics processing. Sometimes the 16M maximum integer value is a limitation, for example when calculating addresses into big linear (1D) memory resources, but often you can use 2D resources on GPUs instead and do the addressing with a pair of 24-bit integers (of course that requires twice as much integer math for the address calculation).
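For what it's worth, CUDA exposes the 24-bit multiply directly through the __mul24/__umul24 intrinsics, so a rough sketch of such address math could look like this (the kernel and parameter names are just illustrative):

__global__ void copyRow(const float* src, float* dst,
                        unsigned int pitchInElems, unsigned int row, unsigned int width)
{
    unsigned int x = blockIdx.x * blockDim.x + threadIdx.x;
    if (x < width)
    {
        // __umul24 multiplies the low 24 bits of its operands and returns the
        // low 32 bits of the product, so this is exact as long as row and
        // pitchInElems fit in 24 bits and the final offset fits in 32 bits.
        unsigned int offset = __umul24(row, pitchInElems) + x;
        dst[offset] = src[offset];
    }
}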
You can also do a limited set of 24-bit integer operations in 32-bit floating point registers (as 32-bit floats have a 24-bit mantissa). This is a trick often used on current generation consoles (pre-DX10, no native integer processing). As long as you ensure that there's no overflow or underflow, you can add/subtract/multiply 24-bit integers in the floating point registers. Shifting is of course also possible (multiply by 2^n or 1/(2^n)). You need an additional floor instruction in the down-shift case (this is safe, since float renormalization kicks out the same bits as the shift would zero). For division, underflow is a bigger problem (you lose the N highest bits, where N is the number of bits in the divisor), so it's only practical for small divisors (often constants); divide requires a floor instruction as well. Bit masking and logical operations are not possible with the floating point register hack, but if you know that some bits are clear, add can be used instead of or, so you can still do bit packing.
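Here's a rough sketch of those rules as CUDA device code (my own helper names; each operation is only exact under the no-overflow conditions described above):

// All inputs are assumed to be non-negative integers stored in floats,
// small enough that every intermediate result stays below 2^24.
__device__ float addI24(float a, float b) { return a + b; } // exact while a + b < 2^24
__device__ float mulI24(float a, float b) { return a * b; } // exact while a * b < 2^24
__device__ float shlI24(float a, int n) { return a * (float)(1 << n); } // a << n
__device__ float shrI24(float a, int n) { return floorf(a / (float)(1 << n)); } // a >> n; floor drops the shifted-out bits
// No and/or/xor, but when the bit fields don't overlap, add works as or:
__device__ float pack565(float r, float g, float b) { return r * 2048.0f + g * 32.0f + b; } // r<<11 | g<<5 | b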
On older (VLIW / vector+scalar) GPUs bit packing is actually pretty fast with a 3d dot product. For example packing a 565 integer color (three float registers containing 24-bit integers) into one register can be done with a single dot product: float result = dot(float3(2048, 32, 1), colorInt.rgb); Now the result contains a 16-bit integer that can for example be stored to a 16-bit integer render target (without any loss of precision). A similar trick can be used to bit pack the 2/3-bit indices of DXT blocks into 16-bit outputs (in GPU DXT compressors).
It's worth noting that both GK110 and Tahiti have *global* memory bandwidth in the 280 GB/s range, which is roughly equal to the combined L2 bandwidth of all cores on a quad-core Sandy Bridge, and far higher than its L3 bandwidth. Hence, the lack of large high-level caches doesn't hurt GPUs nearly as much as you'd think.
It's not exactly fair to compare a 4-core Sandy/Ivy Bridge to a GPU, since half of the CPU die is reserved for an integrated GPU. You should be comparing the 8-core parts instead, as they are pure CPUs with no integrated GPU sharing the space/power/heat budget.
Haswell doubles the cache bandwidths. L2 bandwidth per core is 64 bytes per cycle, so 8 cores at 4 GHz gives 64 bytes/cycle * 4 GHz * 8 cores = 2048 GB/s of aggregate L2 bandwidth. That's over 10x the GDDR5 bandwidth of Kepler (GeForce 680), and 7.3x that of Tahiti.
It's interesting to note how much faster GDDR5 is compared to DDR3. Both the current-gen 6-core i7 (LGA2011) and the GTX 680 have 4 memory channels, yet somehow the 680 manages around 180 GB/s while the i7 gets 40 GB/s, a 4.5x difference. Does the latency optimization of DDR3 really hurt it that much, or is there more in play? If it really *is* latency-optimization related, this is very bad for latency-sensitive processors like CPUs, since they won't be able to compete in memory bandwidth, and L2 caches are a poor substitute for equally fast RAM.
I have tried to find more concrete information about this topic as well, but GDDR hasn't been used in commercially available PC/server CPUs, so there's not much information about GDDR latencies in CPUs. The Xbox 360 uses GDDR3 as its main memory, and it has 500+ cycle memory latency (source: http://forum.beyond3d.com/showthread.php?t=20930). Memory latencies of similarly clocked PC CPUs with DDR2/DDR3 are in the 150-250 cycle range. However, the PS3 PPU also has similar (400+ cycle) memory latency, despite using XDR memory (http://research.scee.net/files/pres...ls_of_Object_Oriented_Programming_GCAP_09.pdf). It could be that the memory subsystems of these simple speed-demon PPC cores just weren't that well optimized for latency. Xeon Phi is the most recent CPU that uses GDDR5. Is there any official Xeon Phi documentation out yet (that would describe the expected memory latencies)?