Is there any news about the cache-coherence design of Kepler? According to some books, Fermi does not really support it.
Source: The 768KB unified L2 cache is the sole memory agent, handling loads, stores and texture fetches - thus acting as an early global synchronization point. Like the L1 cache, it probably has 64B lines and many banks; the write policies will be discussed below. While Nvidia did not discuss any implementation details, the GT200 has a 256KB L2 texture cache, implemented as 8 slices of 32KB, one slice per memory controller. If Fermi follows that pattern, the L2 cache might be implemented as 6 slices of 128KB, one for each memory controller.
Unlike a CPU, the caches in Fermi are only semi-coherent due to the relatively weak consistency model of GPUs. The easiest way to think about the consistency is that ordering is guaranteed between kernels by default, and wherever the programmer uses synchronization primitives (e.g. atomics or barriers), but there is no ordering otherwise.
Kepler's implementation should be pretty much the same, but with doubled throughput per L2 partition.
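To make the quoted ordering rule concrete, here is a minimal CUDA sketch of the usual "last block finishes the job" pattern. Each block writes a partial result to global memory, issues `__threadfence()`, and then takes a ticket with `atomicInc`. The block that draws the last ticket reads every partial result. The names `sum_with_fence`, `run_sum` and `blocks_done` are made up for illustration, and the one-thread-per-block launch is only there to keep the example short. Without the fence and the atomic, nothing would guarantee that the last block sees the other blocks' writes.

```cuda
#include <vector>
#include <cuda_runtime.h>

// Hypothetical counter: how many blocks have published their partial sum.
__device__ unsigned int blocks_done = 0;

// Launch with one thread per block (illustration only, not tuned for speed).
__global__ void sum_with_fence(const float* in, volatile float* partial,
                               float* out, int n)
{
    if (threadIdx.x != 0) return;

    // Each block sums a strided subset of the input.
    float s = 0.0f;
    for (int i = blockIdx.x; i < n; i += gridDim.x) s += in[i];
    partial[blockIdx.x] = s;

    // Make the write above visible device-wide before taking a ticket.
    __threadfence();

    // atomicInc returns the old value; the last block sees gridDim.x - 1.
    unsigned int ticket = atomicInc(&blocks_done, gridDim.x);
    if (ticket == gridDim.x - 1) {
        // All other blocks fenced before their atomic, so their
        // partial sums are visible here (partial is volatile).
        float total = 0.0f;
        for (unsigned int b = 0; b < gridDim.x; ++b) total += partial[b];
        *out = total;
        blocks_done = 0;  // reset so the kernel can be launched again
    }
}

// Host helper: sums n ones on the device using `blocks` thread blocks.
float run_sum(int n, int blocks)
{
    std::vector<float> h_in(n, 1.0f);
    float *d_in = nullptr, *d_partial = nullptr, *d_out = nullptr;
    cudaMalloc(&d_in, n * sizeof(float));
    cudaMalloc(&d_partial, blocks * sizeof(float));
    cudaMalloc(&d_out, sizeof(float));
    cudaMemcpy(d_in, h_in.data(), n * sizeof(float), cudaMemcpyHostToDevice);

    sum_with_fence<<<blocks, 1>>>(d_in, d_partial, d_out, n);

    // cudaMemcpy waits for the kernel to finish: ordering between the
    // kernel and what follows it on the same stream comes for free.
    float result = 0.0f;
    cudaMemcpy(&result, d_out, sizeof(float), cudaMemcpyDeviceToHost);

    cudaFree(d_in);
    cudaFree(d_partial);
    cudaFree(d_out);
    return result;
}
```

A host program could call `run_sum(1024, 8)` and compare the result with a CPU sum. The same source should build for both Fermi and Kepler, since `__threadfence()` and `atomicInc` are available on both.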