With a logical width of 2048 bits, GCN has only 21 registers from the code's point of view.
Nope, I factored that in already.
16K-float register file / 64 vector width / 3 threads = 85 registers per thread (4 times that per physical ALU).
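To spell that arithmetic out (taking GCN's 64 KB vector register file per SIMD as the basis, i.e. 16384 32-bit entries): 16384 registers / 64 lanes per wavefront = 256 registers per wavefront, and 256 / 3 wavefronts in flight ≈ 85 registers available to each work-item.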
With AVX-1024, the CPU would still have 16 logical registers, but quadruple the latency hiding ability.
Which would only mimic what GPUs already do, on top of their having more registers in the first place.
Besides, AVX-1024 is quite hypothetical at the moment.
But registers don't increase the cache hit rate!
But they decrease the number of memory accesses in the first place! CPUs also have high hit rates because they are often storing and reloading data that GPUs can simply keep in their registers. 20% misses out of x accesses doesn't have to be more than 10% out of y.
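To put hypothetical numbers on it: if the GPU version of a kernel makes 100 memory accesses because most intermediates stay in registers, a 20% miss rate means 20 misses; the CPU version that stores and reloads those intermediates might make 250 accesses, and even a 10% miss rate then means 25 misses. The better hit rate does not automatically translate into less traffic past the caches.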
CPUs can also store a much larger working set in memory and thus handle more complex workloads.
Are you referring to global memory?
Whereas when GCN runs out of addressable registers, performance plummets.
So it does on CPUs. Not as sharply, I agree, but you also don't feel it as much because you are already working in that regime all the time.
21 registers isn't a whole lot, especially since there isn't a viable alternative.
As said, it's actually 85 for the example given above.
CPUs gracefully spill data to cache memory.
So do Fermi and GCN. As said, maybe not as gracefully, but still. And it is less often a problem.
So the limited number of 16 registers per thread is backed by a huge amount of additional storage that can be used both for additional variables and for the actual data set.
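As a minimal sketch of what that looks like on the CPU side (hypothetical code, names made up, just for illustration): keep more values live than the 16 architectural vector registers and the compiler assigns the surplus to stack slots, which in practice stay resident in the L1 data cache.

/* Hypothetical example: 24 accumulators are more than the 16
   architectural vector registers of x86-64, so a typical compiler
   keeps the hottest ones in registers and spills the rest to the
   stack, i.e. to L1. Assumes n is a multiple of 24. */
float accumulate(const float *in, int n)
{
    float acc[24] = {0};
    for (int i = 0; i < n; i += 24)
        for (int j = 0; j < 24; j++)
            acc[j] += in[i + j];
    float sum = 0;
    for (int j = 0; j < 24; j++)
        sum += acc[j];
    return sum;
}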
When it comes down to streaming some values from memory and doing only a little arithmetic on them, CPUs as well as GPUs are limited by bandwidth anyway. The "window of opportunity" where the relatively larger caches of a CPU may be decisive is not that large.
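A stock example of that bandwidth-limited regime (plain C, a standard scaled vector add, nothing vendor-specific):

/* y[i] = a*x[i] + y[i]: per element, 2 flops against 8 bytes read and
   4 bytes written. For arrays larger than the caches this is bound by
   DRAM bandwidth on CPUs and GPUs alike. */
void saxpy(int n, float a, const float *x, float *y)
{
    for (int i = 0; i < n; i++)
        y[i] = a * x[i] + y[i];
}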
So future CPUs will be capable of handling tasks as complex as compilation (where a 1 MB stack is no luxury)
That has no relevance to throughput computing or wide vector units. They won't help for this task anyway, not even on a CPU.
, as well as high throughput tasks, and anything in between. Do not for a second underestimate just how powerful this will make every CPU.
First, we have to see the two 256-bit FMACs in Haswell and the effect on power consumption and on the transistors/area needed, not only for the units themselves but also for the load/store units, the caches and so on. All of that needs to be widened by at least a factor of 2 for the units to work without stalling. Maybe then we can talk about extending that to wider vectors again.
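For reference, this is the kind of operation each of those 256-bit FMA units would retire every cycle (illustrative snippet; the intrinsic is the documented FMA3 _mm256_fmadd_ps, the wrapper name is made up):

#include <immintrin.h>

/* One 256-bit fused multiply-add: eight single-precision a*b+c per
   instruction. Two such units per core would be 32 flops per cycle,
   but only if the load/store ports and caches are widened enough to
   keep feeding them, which is exactly the point above. */
__m256 fma8(__m256 a, __m256 b, __m256 c)
{
    return _mm256_fmadd_ps(a, b, c);
}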
By the way, have you ever thought of Intel providing a basically common instruction set across the 256-bit AVX units closely coupled to the integer cores and more throughput-oriented, Larrabee-like in-order units with wider vectors, larger register files and so on, for more power-efficient execution of those workloads? You know, there are a few projections out there that it won't be possible to use all transistors at the same time at 14 nm and below for power reasons anyway.