NVIDIA Kepler speculation thread

Is there any news about the cache-coherent design of Kepler? According to some books, Fermi does not really support it.
The 768KB unified L2 cache is the sole memory agent, handling loads, stores and texture fetches - thus acting as an early global synchronization point. Like the L1 cache, it probably has 64B lines and many banks; the write policies will be discussed below. While Nvidia did not discuss any implementation details, the GT200 has a 256KB L2 texture cache, implemented as 8 slices of 32KB, one slice per memory controller. If Fermi follows that pattern, the L2 cache might be implemented as 6 slices of 128KB, one for each memory controller.

Unlike a CPU, the caches in Fermi are only semi-coherent due to the relatively weak consistency model of GPUs. The easiest way to think about the consistency is that, by default, there is synchronization between kernels and whenever the programmer uses any synchronization primitives (e.g. atomics or barriers), but no ordering otherwise.
Source

Kepler's implementation should be pretty much the same, but with doubled throughput per L2 partition.
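
To make the "no ordering otherwise" part concrete, here's a minimal CUDA sketch (my own, not from the article) of inter-block communication within a single kernel launch; the names are illustrative and it assumes both blocks end up co-resident:

```
// Without the fence, another block could observe the flag before the payload;
// the consumer reads through volatile so it isn't served a stale cached value.
__device__ volatile int data = 0;
__device__ int flag = 0;

__global__ void producer_consumer(int *out)
{
    if (blockIdx.x == 0 && threadIdx.x == 0) {
        data = 42;              // payload write
        __threadfence();        // order the payload before the flag, device-wide
        atomicExch(&flag, 1);   // publish the flag; atomics are resolved at the L2
    }
    if (blockIdx.x == 1 && threadIdx.x == 0) {
        while (atomicAdd(&flag, 0) == 0) { }  // spin until the flag is published
        __threadfence();        // order the flag read before the payload read
        *out = data;            // now guaranteed to see 42
    }
}
```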
 
I'd like to see some actual performance investigation before claiming that GK104 sucks on compute. And I think we should be mindful that NVidia appears to be saying that an evolution of this compute architecture is its long term plan.

Minimal dynamic scoreboarding is a key concept for the future. There's no option but to track long latencies (such as off-chip) but scoreboarding every instruction, last seen in Fermi, seems to be a dead concept for NVidia.

So, then you're left with a question of multi-issue. As far as I can tell, NVidia is quite explicit in stating that multi-issue is its future. Long instruction words are where it's headed. The compiler will be in the mix.

NVidia is aiming for the knee of the "ALU utilisation efficiency" curve, not the highest point on it. The future is power efficiency, not ALU utilisation efficiency.

The stuff that gets instructions executing in Fermi is too costly to carry forward. It would be a case of tail wagging the dog if left that way for the future.

---

That's not to say that NVidia can't tweak the current SMX layout. e.g. narrower SIMDs (but the same number) reducing the count of hardware threads required to avoid pure ALU stalls. But with essentially no data out there we've got no idea what the actual bottlenecks are.
 
NVidia is aiming for the knee of the "ALU utilisation efficiency" curve, not the highest point on it. The future is power efficiency, not ALU utilisation efficiency.

The stuff that gets instructions executing in Fermi is too costly to carry forward. It would be a case of tail wagging the dog if left that way for the future.
Could we assume that NV has gained enough statistical data on compute workloads through the years that they now feel more confident in "deferring" the complex scheduling to software? That would imply their architecture design philosophy in the G80~Fermi time-frame was guided by a more cautious (but expensive) approach. Of course, this only affirms their perceived dedication to the parallel computing market.
 
Where is the evidence that it is? If there's one constant in getting DSPs to unleash their full potential, it's in getting the operands fed to the ALU. Why would GPUs be any different?

The 680 v 580 comparison is sufficient evidence. They both need to feed their ALUs and one obviously does so more effectively in compute workloads.


If you think reviewers are bang on target with all the "compiler sucks" whining, let's discuss compilers.

a) What kind of instruction selection/scheduling optimizations can a compiler do for an in-order RISC core? Just how hard do you think those are? Just what is the limit of the perf upside there? (See the sketch below.)

b) Let us not forget that the compute subset of the ISA, which would be affected by the compiler's optimizations, is almost certainly the same as Fermi's.
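
For (a), here is a toy, source-level stand-in (entirely my own illustration, not anything NVidia has shown) for the kind of reordering a static scheduler does for an in-order machine: the first kernel is a single dependent FMA chain, the second interleaves two independent chains so there is always another instruction ready to issue while the previous result is still in flight.

```
__global__ void dependent_chain(float *out, int iters)
{
    float a = threadIdx.x * 0.001f + 1.0f;
    for (int i = 0; i < iters; ++i)
        a = a * 1.0000001f + 0.5f;      // each FMA waits on the one before it
    out[blockIdx.x * blockDim.x + threadIdx.x] = a;
}

__global__ void interleaved_chains(float *out, int iters)
{
    float a = threadIdx.x * 0.001f + 1.0f;
    float b = a + 1.0f;
    for (int i = 0; i < iters; ++i) {
        a = a * 1.0000001f + 0.5f;      // independent of b
        b = b * 1.0000002f + 0.25f;     // can be issued while a is in flight
    }
    out[blockIdx.x * blockDim.x + threadIdx.x] = a + b;
}
```

The real work happens at the SASS level rather than in source, but the principle - and its limits once you run out of independent work - is the same.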


That's all well and good but again actual performance carries more weight than what we think it should be doing.


Look at the 680's latency-hiding capability. Look at its cache subsystem vs the changes to ALU organization. Dual issue not going right is pretty low on the 680's trouble list, if it's there at all.


You're right, the underlying reasons could be many. That doesn't change my point that a GK110 could be much faster at compute without being that much faster in games. Even if we discount the impact of the compiler and dual-issue, the other potential culprits (caches, registers) you're raising aren't necessarily problems when processing the uniform workloads presented by current 3D games.

I'm not sure I understand. The 680 isn't two 580s glued together. As such, its performance will vary based on the application, and you should not expect exactly 2x the performance except on artificial series of independent multiply-adds.

For example, AES uses a lot of bit shifts. But bit shifts don't have a throughput of 1/clk/core on the 680. You will find some other shaders that speed up by more than 2x.
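
A rough way to see that instruction-mix dependence (my own sketch; host-side allocation and timing code omitted) would be to time a shift-heavy loop against an FMA-heavy one on both chips and compare clock-for-clock:

```
__global__ void shift_heavy(unsigned *out, int iters)
{
    unsigned x = threadIdx.x + 1;
    for (int i = 0; i < iters; ++i) {
        x = (x << 7) | (x >> 25);   // rotate built from two shifts and an OR
        x ^= 0x9e3779b9u;           // keeps the compiler from folding the loop
    }
    out[blockIdx.x * blockDim.x + threadIdx.x] = x;
}

__global__ void fma_heavy(float *out, int iters)
{
    float x = threadIdx.x * 0.001f + 1.0f;
    for (int i = 0; i < iters; ++i)
        x = x * 1.0000001f + 0.5f;  // one FMA per iteration
    out[blockIdx.x * blockDim.x + threadIdx.x] = x;
}
```

The FMA loop should scale with the headline FLOPS numbers; the shift loop won't, because shifts run at a fraction of the FP32 rate per SMX.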

This is independent of issue width or amount of SW scheduling. Teasing out the effect of compiler scheduling from shader performance would require running the same shader on two SMs that are identical in every way except in the amount of compiler scheduling required.

What I'm saying is actually very simple. GK104 kicks GF110 in the teeth in games but falls behind significantly in certain compute workloads relative to its peak performance. It's clear that nVidia has managed (intentionally or not) to cripple compute throughput without impacting 3D performance to the same extent. It's then likely that they can address those deficiencies and restore compute performance without benefitting 3D performance to the same degree.

I could be and probably am wrong on the main culprits. Maybe it's not the compiler or the ILP requirement. Maybe it's the size of the register file or specific instructions that are responsible. Whatever it is, games aren't suffering from it and that's the gist of my point.

Since they can't hand-tune their compiler for every HPC workload imaginable, I would be really surprised if they kept the static scheduling and dual-issue in the parts targeted for that market. It's one thing to explain to Anand that they haven't optimized for LuxMark. It's a whole other ballgame trying to explain to paying customers why performance sucks and/or is dramatically inconsistent.
 
One thing to take into consideration: to my understanding, GK104's FP64 ALUs (they're separate) can't be used for anything else. How likely is it that GK110 will employ the same design?
I mean, if we assume a ~500mm^2 die, give or take a few tens of mm^2, there's only so much room left. They'd need at least 96 FP64 units per SMX to get a 1/2 ratio, and FP64 ALUs are surely bigger than FP32 ones, meaning each SMX would have to be a LOT bigger - not to mention everything else the chip needs to have over GK104 to improve GPGPU speed.
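
To put rough numbers on that (my own arithmetic, nothing official), assuming dedicated FP64 units on top of GK104's 192 FP32 lanes per SMX:

```
#include <cstdio>

int main()
{
    const int fp32_per_smx = 192;                       // GK104 SMX
    const double ratios[] = { 1.0/2, 1.0/3, 1.0/4 };    // candidate DP:SP rates

    for (double r : ratios) {
        const int fp64_per_smx = (int)(fp32_per_smx * r + 0.5);
        std::printf("DP:SP = %.3f -> %3d dedicated FP64 units per SMX (%d ALUs total vs 192 today)\n",
                    r, fp64_per_smx, fp32_per_smx + fp64_per_smx);
    }
    return 0;
}
```

So a 1/2 rate means 96 FP64 units added to every SMX, before even accounting for each FP64 unit being individually larger.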
 
@Kaotik

No there is no chance of that happening.

Surely making the FP64 ALUs a little bit fatter will allow them to have 2*FP32 throughput so it would be really strange not to use them for that.
 
It's one thing to explain to Anand that they haven't optimized for LuxMark.

Welp, maybe the idea is to see what LuxMark and games have in common (for example). Both categories seem to run pretty well on ATI's older, VLIW-tainted chips... so either we assume that NV's compiler people are a bunch of dopes who can't handle a much simpler case properly, or there's a (bunch of) hidden variable(s) that mess up our nice assumed 1:1 relation.
 
@Kaotik

No there is no chance of that happening.

Surely making the FP64 ALUs a little bit fatter will allow them to have 2*FP32 throughput so it would be really strange not to use them for that.

Then what's the reasoning for not doing so on GK104?
 
trinibwoy said:
The 680 v 580 comparison is sufficient evidence. They both need to feed their ALUs and one obviously does so more effectively in compute workloads.
It must be great to live in a world of uni-dimensional problems where everything can be proven with a single explanation.

The fact that games are doing well while compute is not points to factors other than the compiler, IMHO.
 
It must be great to live in a world of uni-dimensional problems where everything can be proven with a single explanation.

The fact that games are doing well while compute is not points to factors other than the compiler, IMHO.

Like I said, whether it's the compiler, dual-issue, a combination of both or any other hidden variables (for which we have no evidence) is tangential to my point that GK110's advantage in games will probably pale in comparison to its advantage in (SP) compute.

Life really is simple sometimes. nVidia has openly said they did not optimize for certain workloads. It's obvious that the 680 requires more hand-holding than the 580. That's the information we have.
 
I meant the fact that GK104 has 8 FP64-only CUDA cores per SMX.

To be honest, I'm not entirely sure that GK104 (or GF104, for that matter) actually has dedicated FP64-only ALUs. This may just be a convenient way for NVIDIA to represent FP64 capability on neat little diagrams.
 
To be honest, I'm not entirely sure that GK104 (or GF104, for that matter) actually has dedicated FP64-only ALUs. This may just be a convenient way for NVIDIA to represent FP64 capability on neat little diagrams.

I doubt so many reputable reviewers would be saying it has 8 FP64-only ALUs if it didn't :?:
 
Then what's the reasoning for not doing so on GK104?

Is a 32-bit MADD/FMA ALU exactly half of a 64-bit MADD/FMA ALU (i.e. are there no extra bits needed to calculate to that precision)? Does it change the way the registers and caches load and store data, etc.? If there are differences like the above and they cost more power and die space, then that's a likely explanation.
 
Is a 32-bit MADD/FMA ALU exactly half of a 64-bit MADD/FMA ALU (i.e. are there no extra bits needed to calculate to that precision)? Does it change the way the registers and caches load and store data, etc.? If there are differences like the above and they cost more power and die space, then that's a likely explanation.

There are a lot of differences between 32 bit and 64 bit floating point. For example, the mantissa in a 32-bit float is 23 bits stored, 24 bits unpacked. The mantissa in a 64-bit float is 52 bits stored, 53 bits unpacked.

So, you need more than twice the bits in the adder and multiplier for double precision. Also, remember that the hardware cost of a fast adder grows superlinearly with the number of bits (n log n) and the hardware cost of a multiplier is even worse (n^2).

Just to illustrate, if the multiplier in a unit which operates on 32-bit floats costs n transistors, the multiplier in a unit which operates on 64-bit floats costs (53/24)^2 ≈ 4.9n transistors. This is an oversimplification, but it gives you a sense of the way the problem scales.
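
Putting numbers on both scaling rules (host-side sketch; the cost models are the same rough approximations as above, not measured data):

```
#include <cstdio>
#include <cmath>

int main()
{
    const double sp_bits = 24.0;   // FP32 unpacked mantissa
    const double dp_bits = 53.0;   // FP64 unpacked mantissa

    // fast adder ~ n log n, multiplier ~ n^2 (both approximations)
    const double adder_ratio = (dp_bits * std::log2(dp_bits)) / (sp_bits * std::log2(sp_bits));
    const double mul_ratio   = (dp_bits * dp_bits) / (sp_bits * sp_bits);

    std::printf("adder cost ratio:      ~%.1fx\n", adder_ratio);   // ~2.8x
    std::printf("multiplier cost ratio: ~%.1fx\n", mul_ratio);     // ~4.9x
    return 0;
}
```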
 
I'd like to see some actual performance investigation before claiming that GK104 sucks on compute. And I think we should be mindful that NVidia appears to be saying that an evolution of this compute architecture is its long term plan.

Minimal dynamic scoreboarding is a key concept for the future. There's no option but to track long latencies (such as off-chip) but scoreboarding every instruction, last seen in Fermi, seems to be a dead concept for NVidia.

So, then you're left with a question of multi-issue. As far as I can tell, NVidia is quite explicit in stating that multi-issue is its future. Long instruction words are where it's headed. The compiler will be in the mix.

NVidia is aiming for the knee of the "ALU utilisation efficiency" curve, not the highest point on it. The future is power efficiency, not ALU utilisation efficiency.

The stuff that gets instructions executing in Fermi is too costly to carry forward. It would be a case of tail wagging the dog if left that way for the future.

---

That's not to say that NVidia can't tweak the current SMX layout. e.g. narrower SIMDs (but the same number) reducing the count of hardware threads required to avoid pure ALU stalls. But with essentially no data out there we've got no idea what the actual bottlenecks are.
Is the reduction in ray tracing performance in GK104 not enough evidence for you?
 
That's all well and good but again actual performance carries more weight than what we think it should be doing.
Dodge.

You're right, the underlying reasons could be many. That doesn't change my point that a GK110 could be much faster at compute without being that much faster in games. Even if we discount the impact of the compiler and dual-issue, the other potential culprits (caches, registers) you're raising aren't necessarily problems when processing the uniform workloads presented by current 3D games.
All I am saying is that the compiler is not to blame for GK104's sucky compute performance. It's the cache subsystem and the register file size vs ALU size trade-off made here. It has too little RF and cache to do justice to its ALUs in compute.
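
To put some rough per-ALU numbers on that (using the commonly published register-file and L1/shared sizes; treat this as my own back-of-the-envelope, not a measurement):

```
#include <cstdio>

int main()
{
    // GF110 SM:  32 ALUs, 32K 32-bit registers (128 KB), 64 KB L1/shared
    // GK104 SMX: 192 ALUs, 64K 32-bit registers (256 KB), 64 KB L1/shared
    const int gf110_alus = 32,  gf110_regs = 32 * 1024, gf110_l1_kb = 64;
    const int gk104_alus = 192, gk104_regs = 64 * 1024, gk104_l1_kb = 64;

    std::printf("registers per ALU:    GF110 %4d vs GK104 %4d\n",
                gf110_regs / gf110_alus, gk104_regs / gk104_alus);                   // 1024 vs 341
    std::printf("L1/shared KB per ALU: GF110 %.2f vs GK104 %.2f\n",
                (double)gf110_l1_kb / gf110_alus, (double)gk104_l1_kb / gk104_alus); // 2.00 vs 0.33
    return 0;
}
```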
Since they can't hand-tune their compiler for every HPC workload imaginable, I would be really surprised if they kept the static scheduling and dual-issue in the parts targeted for that market. It's one thing to explain to Anand that they haven't optimized for LuxMark. It's a whole other ballgame trying to explain to paying customers why performance sucks and/or is dramatically inconsistent.
Static scheduling is not going anywhere. It's here to stay.
 