With just one or two threads per core, a CPU has a lot of cache space per thread. There are many cache hits even when the data accesses diverge for a while and old data is reused later.
The difference may not be as large as you think, as GPUs can hold a lot more data in the registers.
But let's make up some numbers and say a GCN CU runs 12 heavy threads. That is 3 threads per 512-bit-wide SIMD unit (the logical width is 2048 bit in ATI's case). A CPU with SMT also already runs 2 threads for, let's say, two 256-bit SIMD units. Each thread (aka wavefront) on a CU has significantly more vector registers available (in this example 85 instead of 16) to store its active data before there is any need to resort to the caches for data reuse. Additionally, it has access to the local memory array (64 kB per CU), usable for instance as a user-controlled cache. This results in a significantly less loaded cache system for the same throughput to begin with.
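The register arithmetic behind the "85 instead of 16 floats" claim can be checked with a quick back-of-envelope script (the register and thread counts are just the made-up numbers from this post, not measurements):

```python
# Per-data-element register capacity, using the numbers assumed above.

# GCN CU: 256 kB of vector registers, 12 wavefronts of 64 work-items each
gcn_vregs_bytes = 256 * 1024
gcn_data_elements = 12 * 64                 # 768 data elements in flight
gcn_floats_per_element = gcn_vregs_bytes // (gcn_data_elements * 4)

# Sandybridge core: 16 YMM registers x 32 bytes, times 2 SMT threads = 1 kB
snb_vregs_bytes = 2 * 16 * 32
snb_data_elements = 2 * 8                   # 2 threads x 8 fp32 lanes
snb_floats_per_element = snb_vregs_bytes // (snb_data_elements * 4)

print(gcn_floats_per_element)  # 85
print(snb_floats_per_element)  # 16
```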
Each CU can read from 4 independent addresses per clock (4x16 byte/clock) from its L1 (that's what current GPUs already do). In addition, each of the 32 banks of the 64 kB shared memory can deliver 4 byte/clock (coming from up to 32 different locations, obviously).
For comparison, Sandybridge can load up to 2x16 byte/clock (from 2 addresses). The interface between the L1 and L2 caches in GCN is 64 (or even 128) bytes/clock wide; Sandybridge has 32 byte/clock. One has to normalize that to the clock speed and the intended throughput, of course. So let us do it:
GCN, single CU @ 850 MHz:
54.4 (108.8) GFlop/s (with fma)
256 kB vector register space for 12*64=768 simultaneous data elements, i.e. 85 floats per data element
8 kB scalar register space for the 12 threads
64 kB local memory
16 kB 64-way associative L1
108.8 GB/s local memory bandwidth
54.4 GB/s L1 cache bandwidth (L1<->L2 is the same)
Sandybridge core @ 3.4 GHz:
54.4 GFlop/s (no fma available)
1 kB vector register space for 2*8 = 16 data elements with SMT (512 byte for 8 data elements without), i.e. 16 floats per data element
256 Byte (0.25 kB) integer register space for 2 threads
no explicit local memory
32 kB 8-way associative L1
108.8 GB/s L1 cache bandwidth (L1<->L2 is the same)
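The two lists above are just the per-clock widths multiplied by the clock speeds. Spelled out as a small script (clocks and widths taken from this post; the bandwidth-per-flop ratios at the end are the basis of the "very comparable" argument below):

```python
# Normalizing per-clock widths to absolute throughput.
gcn_clock = 850e6            # Hz, single GCN CU
snb_clock = 3.4e9            # Hz, Sandybridge core

# GCN CU
gcn_flops  = 64 * 2 * gcn_clock     # 64 lanes x 2 flops (fma) -> 108.8 GFlop/s
gcn_lds_bw = 32 * 4 * gcn_clock     # 32 banks x 4 B          -> 108.8 GB/s
gcn_l1_bw  = 4 * 16 * gcn_clock     # 4 addresses x 16 B      ->  54.4 GB/s

# Sandybridge core
snb_flops = 2 * 8 * snb_clock       # add + mul port x 8 fp32 ->  54.4 GFlop/s
snb_l1_bw = 2 * 16 * snb_clock      # 2 loads x 16 B          -> 108.8 GB/s

# Bytes of on-chip bandwidth per flop
print(gcn_l1_bw / gcn_flops)                 # 0.5  (L1 only)
print((gcn_l1_bw + gcn_lds_bw) / gcn_flops)  # 1.5  (L1 + local memory)
print(snb_l1_bw / snb_flops)                 # 2.0
```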
I still fail to see the distinct advantage the CPU is supposed to have. While it is true that one can imagine problems where the CPU's L1 can hold an extended working set that doesn't fit in the registers of a GPU, it is very likely that the reduction in required L1 traffic easily outweighs that. After all, the available bandwidth per flop is very comparable.
GPUs on the other hand offer hardly any wiggle room. There are too many threads for each of them to get a decent amount of cache space. Only incredibly correlated accesses benefit from having a cache at all, such as constants shared between threads, and overlapping texture filter kernels. That's fine for rasterization graphics, but not for much else.
That is not true. First, for most problems one does indeed fetch data from close locations for data elements lying close together (and if not, reorganize your problem accordingly; it will also help the CPU). It is definitely true for graphics, which is still your main concern, if I haven't misunderstood you.
And it is enough that another (or the same) thread needs data from the same cache line to get a benefit. Actually, the conditions for seeing a positive effect from the caches are basically the same as on a CPU. The only real problem case is completely random memory accesses into arrays far larger than any cache (a size threshold that is higher for CPUs), combined with too little arithmetic to hide the latency (and GPUs are often better at hiding it).
I actually found it quite amazing how much bandwidth you can realize on a GPU with random indexing (with basically no coherence between neighboring data elements) into a 16 MB buffer (far larger than any cache) when each access reads 16 bytes; it's more than you can expect from a CPU reading from its L3 cache.
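Why random-access bandwidth favors the GPU here is essentially Little's law: sustained bandwidth equals bytes in flight divided by latency. A rough sketch, where the latencies and outstanding-request counts are illustrative assumptions on my part, not measurements of any specific chip:

```python
# Little's law sketch: sustained bandwidth = bytes in flight / latency.
# All latencies and request counts below are illustrative assumptions.
request_size = 16                    # bytes per random access, as above

# CPU core: a handful of outstanding misses (line fill buffers), short latency
cpu_outstanding = 10
cpu_latency = 60e-9                  # seconds, assumed miss latency
cpu_bw = cpu_outstanding * request_size / cpu_latency

# GPU: many wavefronts each keep misses in flight, tolerating longer latency
gpu_outstanding = 400
gpu_latency = 400e-9                 # seconds, assumed miss latency
gpu_bw = gpu_outstanding * request_size / gpu_latency

print(cpu_bw / 1e9, gpu_bw / 1e9)    # GB/s, GPU comes out well ahead
```

The point is not the exact numbers but that the GPU's much larger pool of in-flight requests more than compensates for its higher latency.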
A high-end GPU has more storage and bandwidth in total. So instead of having just a few threads that manage to make painstakingly slow progress, you get several more of them.
Again, a high-end GPU does basically nothing for a problem with a low thread count. If you have just a single thread (wavefront/warp) per SIMD/SM on a high-end GPU, a GPU with half the number of units will execute the kernel at the same speed. Scaling the number of units doesn't scale the number of threads in a kernel.
AVX units are not humongous nor power hungry. Haswell should reach 500 GFLOPS for 1 billion transistors.
How do you count that? A quad-core Sandybridge already has more than 900 million transistors for ~200 GFlop/s in single precision. Do you think two 256-bit FMA units per core, and all the data paths necessary so they don't just sit idle, come for free?
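For reference, the ~200 GFlop/s figure follows directly from Sandybridge's execution ports (one 256-bit FP add and one 256-bit FP mul per clock, i.e. 16 single-precision flops per core per cycle):

```python
# Single-precision peak of a quad-core Sandybridge at 3.4 GHz.
cores = 4
flops_per_clock = 8 + 8      # 8-wide fp32 add port + 8-wide fp32 mul port
clock = 3.4e9                # Hz
sp_gflops = cores * flops_per_clock * clock / 1e9
print(sp_gflops)             # 217.6
```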