Nvidia BigK GK110 Kepler Speculation Thread

Kepler's 30% advantage in flops is very tiny in the grand scheme of things.

More cache and branch prediction help all codes.

MIC does not have any branch prediction.

MIC has 2x more cache per core than Kepler, which makes a lot of difference for everything.

In terms of FP32, it's 130% more rather than merely 30%+. To be honest, I was hoping MIC could perform better, given its obviously larger cache and supposedly better branch prediction, etc.

However, in my experience so far, it doesn't. The reasons I suspect are:

1) Maybe a 512-bit SIMD is too wide a pure vector unit for even the most parallel tasks, so in most cases a large part of the SIMD remains idle.

2) I am not entirely sure whether Intel's SIMD can do FMA; if not, that's another disadvantage compared to Kepler in handling matrix maths.

3) Maybe the maths ops I tested (radix sort, solving systems of linear equations and other matrix-based operations) barely benefit from a large cache.

4) Due to the lack of out-of-order execution, Intel's MIC isn't that much better than GK110 at handling complicated tasks.

5) Maybe the software support for MIC was not mature at the time I tested it. I also only played with MIC for a very short period of time; however, I used MKL a lot, so it is unlikely that the code couldn't take advantage of MIC.

6) Intel has a reason not to make MIC as good as it can be, since it isn't that profitable compared to their other products (Xeon), and I suspect the main reason Intel pushes MIC is to prevent Nvidia from growing too big to handle.
 
In terms of FP32, it's 130% more rather than merely 30%+. To be honest, I was hoping MIC could perform better, given its obviously larger cache and supposedly better branch prediction, etc.

However, in my experience so far, it doesn't. The reasons I suspect are:

1) Maybe a 512-bit SIMD is too wide a pure vector unit for even the most parallel tasks, so in most cases a large part of the SIMD remains idle.
Nvidia is wider still at 1024 bits, so they should suffer even more from it. So I don't think this is the case.

I really don't care for float32 for HPC. For rendering etc., it's different of course.
2) I am not entirely sure whether Intel's SIMD can do FMA; if not, that's another disadvantage compared to Kepler in handling matrix maths.
It can.
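
For illustration, here is a minimal sketch of what that looks like with the 512-bit intrinsics (the _mm512_fmadd_ps form is the fused multiply-add exposed for KNC and later AVX-512; the 64-byte alignment and the multiple-of-16 trip count are assumptions on my part):

Code:
#include <immintrin.h>

// c[i] += a[i] * b[i] over 16 float lanes per iteration, using one fused
// multiply-add per vector. On MIC (KNC) this builds with icc -mmic; on
// AVX-512 hardware with -mavx512f. Assumes 64-byte-aligned arrays and a
// trip count that is a multiple of 16.
void fma_accumulate(const float* a, const float* b, float* c, int n)
{
    for (int i = 0; i < n; i += 16) {
        __m512 va = _mm512_load_ps(a + i);
        __m512 vb = _mm512_load_ps(b + i);
        __m512 vc = _mm512_load_ps(c + i);
        _mm512_store_ps(c + i, _mm512_fmadd_ps(va, vb, vc));   // (a*b)+c, one rounding
    }
}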

3) Maybe the maths ops I tested (radix sort, solving systems of linear equations and other matrix-based operations) barely benefit from a large cache.
The matrix operations should do fine with less cache. Radix sort should not have terrible memory behavior. So, that doesn't tell us much either.
 
That's interesting. Where was it disclosed?

Other than what is hinted at by Larrabee's P54C heritage and MIC's descent from that architecture, there is some brief marketing from Intel touting the low mispredict penalty of MIC's short pipeline, and there are performance counters for branch mispredicts:

https://docs.google.com/viewer?a=v&q=cache:FxEJsyK7SBAJ:software.intel.com/sites/default/files/forum/278102/intelr-xeon-phitm-pmu-rev1.01.pdf+intel+xeon+phi+branch+prediction&hl=en&gl=us&pid=bl&srcid=ADGEESiWcaQR7wMf9t33nFaOiRN77rIwiuCzTb_aoopXtsUxqgdXUXvHXUTL1OusmJH3dky2HwjOaFqFE7eRI-SZGkgZwYZLI6FjKT5mxWnGW9-FxzviOPtuxpnRR4UbZJ5ijFhcGZ1_&sig=AHIEtbTl8PVO4R-k44_cqEbT2nCgQhlNCA
 
Most HPC computing I am involved in ends up as lots of matrix ops at a low level. I think this is the case for a vast range of scientific and financial computing, be it numerical solution of systems of differential equations, nonlinear optimization or dynamic system simulation.

And yes, FP32 is very useful in scientific computing, since linear systems of equations can easily be solved by iterative methods (a very popular class of methods) in FP32. As for ill-conditioned systems, you should avoid them by redesigning your numerical methods instead of betting on your CPU's rounding behaviour.
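
To make that concrete, here is a minimal, hypothetical sketch of the idea: do the iteration itself in FP32 and compute the stopping-test residual in FP64. The Jacobi sweep is just a stand-in for whatever iterative solver you would really use (CG, GMRES, ...); the dense row-major layout and the tolerance handling are my assumptions.

Code:
#include <cmath>
#include <vector>

// Minimal sketch: solve A x = b with a Jacobi iteration carried out in
// FP32, while the convergence test (the residual) is computed in FP64.
// A is a dense, diagonally dominant n x n matrix in row-major order.
// This is only an illustration of "iterate in FP32, verify in FP64".
std::vector<float> jacobi_fp32(const std::vector<float>& A,
                               const std::vector<float>& b,
                               int n, double tol, int max_iter)
{
    std::vector<float> x(n, 0.0f), x_new(n, 0.0f);
    for (int iter = 0; iter < max_iter; ++iter) {
        // One Jacobi sweep, entirely in FP32.
        for (int i = 0; i < n; ++i) {
            float sigma = 0.0f;
            for (int j = 0; j < n; ++j)
                if (j != i) sigma += A[i * n + j] * x[j];
            x_new[i] = (b[i] - sigma) / A[i * n + i];
        }
        x.swap(x_new);

        // Residual ||b - A x|| accumulated in FP64 so the stopping
        // test is not polluted by single-precision round-off.
        double norm2 = 0.0;
        for (int i = 0; i < n; ++i) {
            double r = static_cast<double>(b[i]);
            for (int j = 0; j < n; ++j)
                r -= static_cast<double>(A[i * n + j]) * x[j];
            norm2 += r * r;
        }
        if (std::sqrt(norm2) < tol) break;
    }
    return x;
}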

As for warps, I would not equate them to a 1024-bit SIMD unit, since a warp can handle branching automatically whilst a SIMD unit cannot.

Anyway, I'm not trying to turn this into a GPU vs CPU war; I'm just relating my limited programming experience with GK110 and MIC. If your experience is distinctly different, good, then share it.

Otherwise, I think there is little point in discussing it.
 
So far the real constraints I can think of for GPUs in HPC are their rather limited memory capacity and the latency between them and the host. Stability is also a concern, but this will be the case for Intel's MIC as well, since the main source of instability for Tesla is, perhaps surprisingly, the PCIe slot.
 
Most HPC computing I am involved in ends up as lots of matrix ops at a low level. I think this is the case for a vast range of scientific and financial computing, be it numerical solution of systems of differential equations, nonlinear optimization or dynamic system simulation.
Yes. Matrix ops really dominate.

And yes, FP32 is very useful in scientific computing, since linear systems of equations can easily be solved by iterative methods (a very popular class of methods) in FP32. As for ill-conditioned systems, you should avoid them by redesigning your numerical methods instead of betting on your CPU's rounding behaviour.
This is not a defense of using ill-conditioned systems. You should always use the most stable algorithm available. The real reason, for me at least, for sticking to fp64 is that the best-case upside is 2x-3x more performance, while the downside is that I have to constantly look over my shoulder. Thanks, but 2x-3x more performance is not worth that much trouble. I have bigger (and better) things to worry about. 10x could be, though.


As for warps, I would not equate them to a 1024-bit SIMD unit, since a warp can handle branching automatically whilst a SIMD unit cannot.
I think both MIC and GPUs are vector processors, so it makes a lot of sense to look at them the same way. Just because the GPU doesn't expose the predicate registers to software doesn't make it completely different.
 
I think both MIC and GPUs are vector processors, so it makes a lot of sense to look at them the same way. Just because the GPU doesn't expose the predicate registers to software doesn't make it completely different.
Exactly. The lane masking is basically handled the same way on MIC as it is on typical GPUs. There is no difference (and at least the GCN ISA exposes the mask register in a similar way to MIC).
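
A scalar model of what both are doing for an if/else over vector lanes: both sides get evaluated and a per-lane mask selects the result. MIC/GCN expose the mask to software, while a warp does the equivalent masking in hardware. This is only a sketch of the concept, not real ISA or intrinsics:

Code:
// Model of predicated execution over an 8-wide "vector": both sides of the
// branch are evaluated and a per-lane mask picks which result each lane
// keeps. MIC/GCN expose this mask to software; an NVIDIA warp performs the
// equivalent masking in hardware when its lanes diverge.
const int WIDTH = 8;

void predicated_if(const float* x, float* out)
{
    bool mask[WIDTH];
    for (int lane = 0; lane < WIDTH; ++lane)
        mask[lane] = x[lane] > 0.0f;                    // per-lane condition

    for (int lane = 0; lane < WIDTH; ++lane) {
        float then_val = x[lane] * 2.0f;                // "taken" path
        float else_val = -x[lane];                      // "not taken" path
        out[lane] = mask[lane] ? then_val : else_val;   // lane select
    }
}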
 
I wonder if the large cache size on MIC is as big a benefit as they make it out to be. With a CPU-style architecture, a cache miss to memory means a huge stall (equivalent to around 1000 instructions, once you take superscalar issue into account) due to the nature of a serial processor, whereas a GPU architecture is more tolerant, since it usually has enough active warps to have at least one warp available to cover the stall.

The numbers I've seen put even a tiny cache (32 KB) at around 5% miss rate. A 1 MB cache is somewhere around 0.5% miss rate. Of course, with more threads, you'll get more memory divergence, though a well designed algorithm can often minimize this.

Hence, the question is, does a large cache really benefit a massively parallel architecture, or is it a misconception drawn from the huge cache miss penalty a pure serial architecture takes?
 
There is no universal answer to the question of whether the large cache is a large advantage, as the analysis depends on the workload and on the utility the cache brings to the software and the developer.

Cache itself is quite fault-tolerant and versatile. In terms of power efficiency and bandwidth, it is still a massive improvement to keep accesses on-chip versus going off-die. Off-die is slower and burns much more power.

One way to view the difference between a 5% and a 0.5% miss rate is that every off-die access is a cache miss.
In a hypothetical workload on an architecture that can use the full bandwidth of the memory bus, if we keep the given 5% to 0.5% miss-rate improvement and assume that the majority of it is capacity-related, it's a 10x reduction in bandwidth needs, or up to a 10x potential improvement in a bandwidth-limited context.
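
Rough numbers for that, assuming an arbitrary 64-byte line and 10^9 accesses per second (both figures are just picked to make the ratio concrete):

Code:
#include <cstdio>

// Off-die traffic scales directly with the miss rate: every miss pulls one
// line from memory. Holding everything else constant, going from a 5% to a
// 0.5% miss rate is a 10x cut in off-die bandwidth demand.
int main()
{
    const double line_bytes  = 64.0;    // assumed line size
    const double accesses_ps = 1.0e9;   // assumed memory accesses per second

    for (double miss : {0.05, 0.005}) {
        double gb_s = miss * accesses_ps * line_bytes / 1.0e9;
        std::printf("miss rate %.1f%% -> %.2f GB/s off-die\n", miss * 100.0, gb_s);
    }
    return 0;
}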
 
The question is whether you're actually bandwidth-limited, or whether latency is the real problem.

Imagine you have a 0.5% cache miss rate to main memory, and each miss causes a 1000-instruction stall. This means that the average cost of a memory instruction is 5 instructions of stall, which is rather substantial. The thing is, this penalty hits you even if you're nowhere near the bandwidth limit.

Now, suppose that you have a 0.5% cache miss rate, and your main memory has 20 GB/s bandwidth. Then, in order to saturate it, you would need 4 TB/s total bandwidth (20 GB/s is 0.5% of 4 TB/s), including the cache. You're not going to come anywhere close to this.

Now, let's put a load on it that actually saturates main memory (and hence has a truly awful hit rate...). Assuming a memory transaction consumes 128 bits of memory bandwidth, we get a bit over 1 GT/s (let's just keep the numbers easy). This gives us a stall penalty of 1 TOp/s, meaning that without an out-of-order memory system the CPU cannot come close to saturating memory under any circumstances.

Another way to look at it is to assume a chain of memory operations where you have to take each latency in order (think linked lists...). With a 4 GHz processor, this gives you about 16 MT/s (the real latency of a memory operation is around 250 cycles). That works out to only 256 MB/s of actual memory usage.

The bottom line of this is that latency is the real killer, rather than bandwidth. In order to even reach the bandwidth limit, you have to cover nearly 99% of the latency cost. Since a serial processor has a hard time doing this, reducing latency becomes the primary purpose of the cache system.
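
Putting the same assumed figures (0.5% miss rate, ~1000-instruction stall, 20 GB/s bus, ~250-cycle latency at 4 GHz) into one back-of-the-envelope calculation:

Code:
#include <cstdio>

// Back-of-the-envelope numbers for the argument above; all inputs are the
// assumed figures from the post, not measurements.
int main()
{
    const double miss_rate   = 0.005;    // 0.5% of accesses miss to DRAM
    const double stall_instr = 1000.0;   // cost of one miss, in instructions
    const double bus_gb_s    = 20.0;     // main memory bandwidth
    const double clock_hz    = 4.0e9;    // serial core clock
    const double mem_latency = 250.0;    // load-to-use latency, in cycles

    // Average stall folded into every memory instruction.
    std::printf("avg stall per memory op : %.1f instructions\n",
                miss_rate * stall_instr);

    // Total (cache + DRAM) traffic needed to saturate the bus when only
    // 0.5% of accesses actually go off-die.
    std::printf("traffic to saturate bus : %.1f TB/s\n",
                bus_gb_s / miss_rate / 1000.0);

    // Dependent accesses (linked-list style): one outstanding miss at a time.
    double chases = clock_hz / mem_latency;                    // ~16 MT/s
    std::printf("dependent accesses      : %.0f MT/s -> %.0f MB/s at 16 B each\n",
                chases / 1e6, chases * 16.0 / 1e6);
    return 0;
}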
 
My contention is that the answer to whether a large cache helps is "it depends": there are types of workloads and design constraints where it makes a very large difference, and I gave a general example where a change of a few percent in a single metric makes an order-of-magnitude difference.
I indicated as a precondition that the architecture be capable of consuming the bandwidth of the full memory bus, which with further elaboration could be due to things like a single core's memory parallelism or the existence of very many simpler cores.

The 4 GHz single-processor example is a counterargument based on a non-throughput chip design and a workload that doesn't provide much memory parallelism. It has the potential of being a list so long or so poorly laid out that it turns into a cache thrasher, which is a workload I didn't consider, because the existence of loads that benefit from large caches does not preclude workloads that do not.

Put 60+ of those single cores, none of which can consume the full memory bandwidth on its own, behind the same memory controller, and my point returns.
 
The bottom line of this is that latency is the real killer, rather than bandwidth. In order to even reach the bandwidth limit, you have to cover nearly 99% of the latency cost. Since a serial processor has a hard time doing this, reducing latency becomes the primary purpose of the cache system.

Correct. Hiding latency is the main thing. CPUs and GPUs take different routes towards it, but both converge on the idea of keeping lots of stuff on die.

The problem current GPUs have, and which I expect to be fixed going forward, is that on-chip storage is mostly in registers, which have to be statically addressed. For a lot of things this is fine, but there are cases where it is not, like ray tracing: you can't keep the acceleration structure in registers, but you can in cache, which is why I expect MIC to be quite a bit better than GPUs for ray tracing. There are many other algorithms for which the working set is much more easily accommodated in cache than in registers, which is why MIC is a vastly better architecture for computing.

x86 compatibility is nice, but that's it.
 
MIC presumably has only 4 hardware threads per core. If you want GPU-like threading to hide latency, you're going to have to do your own explicit threading in code to hide more of it.

Being constrained by cache capacity instead of register capacity (like with a GPU) probably makes this work really nicely.
 
It helps that the cache hierarchy provides terabytes per second of bandwidth for those switches to happen with.

The bandwidth requirements of GPU context switching are an impediment to their trying any sort of context switching, and how GPUs expect to implement preemption at some later date is interesting, since they will need to address this.
 
NVidia's already well on the way to a subtle, many-level memory hierarchy (and it's not as if CPUs have had any choice, since there was not much else to spend transistors on).

As far as D3D/WDDM models of pre-emption are concerned, I guess the early years will be a mess (as ever with any new hardware-focused feature) but it'll all come out in the wash.

That is, if there are still such things as discrete GPUs by then.
 
MIC presumably has only 4 hardware threads per core. If you want GPU-like threading to hide latency, you're going to have to do your own explicit threading in code to hide more of it.

Being constrained by cache capacity instead of register capacity (like with a GPU) probably makes this work really nicely.

4 threads are fine for hiding latency with 512 KB per core. With the addition of branch prediction, it works even better.
 
The problem current GPUs have, and which I expect to be fixed going forward, is that on-chip storage is mostly in registers, which have to be statically addressed. For a lot of things this is fine, but there are cases where it is not, like ray tracing: you can't keep the acceleration structure in registers, but you can in cache, which is why I expect MIC to be quite a bit better than GPUs for ray tracing. There are many other algorithms for which the working set is much more easily accommodated in cache than in registers, which is why MIC is a vastly better architecture for computing.

Not necessarily. With ray tracing (specifically, diffuse ray tracing, since this is the real problem of interest) you tend to have very large scenes with highly divergent access patterns, so even a very large cache will just get thrashed. Sure, it helps in toy-size scenes, but those are just that: toys. Any real production scene will conveniently expand itself to fill all available memory, since you want as much detail as you can have.

You have to realize also that ray space is 6-dimensional, so once you get to random diffuse rays, there is very little chance you will have two rays that happen to be near enough to each other to be appreciably coherent over their traversal. Hence, bundling approaches aren't likely to be of very much help in getting cache reuse.

I can't think of a problem that has memory coherence only at the tens-of-MB level; it's all either fairly tightly coherent or else completely incoherent.

Another interesting note is that a large L3 cache will actually hurt global memory latency, since you have to add up the latency of the entire cache hierarchy. Hence, for algorithms like image filters, where you stream lots of memory but reuse is tightly constrained or non-existent, the large L3 cache may actually be a loss.

As for registers being statically addressed, this is a major purpose of an L1 cache. Even an L1 cache is too slow to replace registers, but sometimes you need random access to small arrays.
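
A trivial sketch of the kind of access that forces this: a small lookup table indexed by data. The index isn't known at compile time, so it can't sit in statically addressed registers and has to live somewhere addressable (L1, or shared memory on a GPU). The table size and types here are arbitrary:

Code:
// Data-dependent indexing into a small table: fits easily in L1, but
// cannot be register-allocated because the index is only known at run time.
float lut_sum(const float* table /* e.g. 256 entries */,
              const unsigned char* idx, int n)
{
    float sum = 0.0f;
    for (int i = 0; i < n; ++i)
        sum += table[idx[i]];   // random access within a few KB
    return sum;
}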

It helps that the cache hierarchy provides terabytes per second of bandwidth for those switches to happen with.

The bandwidth requirements of GPU context switching are an impediment to their trying any sort of context switching, and how GPUs expect to implement preemption at some later date is interesting, since they will need to address this.

Not really - the state of a single SM is only a few hundred KB, so if you change the context of an SM 10,000 times per second, you're still only at about 1% of global memory bandwidth.

CPUs have much the same problem, since a context switch will pretty much end up evicting the entire L1 cache in short order.
 
Not really - the state of a single SM is only a few hundred KB, so if you change the context of an SM 10,000 times per second, you're still only at about 1% of global memory bandwidth.
I'm slightly unclear whether this addresses both of my points, but to clarify, in case it did cover the first: Larrabee's soft context switches during periods of high texturing activity would be shifting partial thread state to and from memory many times faster than that.
 
Not necessarily. With ray tracing (specifically, diffuse ray tracing, since this is the real problem of interest) you tend to have very large scenes with highly divergent access patterns, so even a very large cache will just get thrashed. Sure, it helps in toy-size scenes, but those are just that: toys. Any real production scene will conveniently expand itself to fill all available memory, since you want as much detail as you can have.
Bigger scenes mean it's even more important to have better cache utilization.
Even if the cache thrashes, it is no more costly than a straight load/store to memory, since power is mostly burnt going off-die.

You have to realize also that ray space is 6-dimensional, so once you get to random diffuse rays, there is very little chance you will have two rays that happen to be near enough to each other to be appreciably coherent over their traversal. Hence, bundling approaches aren't likely to be of very much help in getting cache reuse.
There has been a fair amount of work on reordering rays for better cache utilization with real scenes. Nvidia's paper showed they were able to save >90% of the bandwidth with their prototype design for highly incoherent scenes.

Another interesting note is that a large L3 cache will actually hurt global memory latency, since you have to add up the latency of the entire cache hierarchy. Hence, for algorithms like image filters, where you stream lots of memory but reuse is tightly constrained or non-existent, the large L3 cache may actually be a loss.
Not with directories. MIC uses them a lot.
 
But it still generates heat and adds largely useless area to the die that could otherwise be used in better ways to improve performance. Not to mention that a larger cache and more cache levels can themselves damage performance if the miss rate is high, since every level adds latency for lookup and addressing.

I remember Sandy Bridge/Ivy Bridge's L3 cache has a latency of 40-50 CPU cycles, and L1/L2 together have 20-30 cycles of latency, whilst the latency of memory is "merely" 150-200 CPU cycles. So if your cache contributes little besides heat, die area and cache misses, the situation becomes pretty ugly.
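
Those numbers are easy to sanity-check with a pointer chase over different working-set sizes (a rough sketch; the sizes in the loop are guesses at where the L1/L2/L3/DRAM boundaries fall, and there is no warm-up or repetition for stable timing):

Code:
#include <algorithm>
#include <chrono>
#include <cstdio>
#include <numeric>
#include <random>
#include <vector>

// Rough load-to-use latency probe: follow a single random cycle through
// an array so every load depends on the previous one. Working sets that
// fit in L1/L2/L3 show the cache latencies; larger ones show DRAM.
double ns_per_access(std::size_t elems, std::size_t iters)
{
    std::vector<std::size_t> order(elems);
    std::iota(order.begin(), order.end(), 0);
    std::shuffle(order.begin(), order.end(), std::mt19937_64{42});

    std::vector<std::size_t> next(elems);
    for (std::size_t i = 0; i < elems; ++i)
        next[order[i]] = order[(i + 1) % elems];    // one big cycle

    std::size_t idx = 0;
    auto t0 = std::chrono::steady_clock::now();
    for (std::size_t i = 0; i < iters; ++i)
        idx = next[idx];                            // serially dependent loads
    auto t1 = std::chrono::steady_clock::now();

    volatile std::size_t sink = idx;                // keep the chain alive
    (void)sink;
    return std::chrono::duration<double, std::nano>(t1 - t0).count() / iters;
}

int main()
{
    // Working sets chosen to land roughly in L1, L2, L3 and DRAM.
    for (std::size_t kb : {32, 256, 8192, 262144}) {
        std::size_t elems = kb * 1024 / sizeof(std::size_t);
        std::printf("%8zu KB: %.1f ns/access\n", kb, ns_per_access(elems, 20000000));
    }
    return 0;
}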

GK110 has 1.5 MB of global L2 cache and 64 KB of L1 cache per SMX, and its L1 cache is manageable and programmable (although I hope they improve L1 access). It also has a large number of registers. Coupled with its short pipeline and a design that uses parallelism to hide latency, I think it may be the better design route for HPC-scale multi-threaded applications.

Anyway, the release date of MIC is near and anyone can try it; I think some will be disappointed, just like me.
 