NVIDIA Kepler speculation thread

Why wouldn't speculation suit other highly parallel machines?
The degree of speculation matters. Excessive speculative fetches mean more registers to store their results. Not something non-synthetic code would be willing to do too much of.
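To put that register cost in concrete terms, here's a toy CUDA kernel (made up purely for illustration, nothing to do with Dally's examples) that fetches one iteration ahead of use; every additional iteration of look-ahead is another live register per thread:

__global__ void scale(const float *in, float *out, int n, float k)
{
    int stride = gridDim.x * blockDim.x;
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;

    float cur = in[i];                 // one live register holding the in-flight value
    int j = i;
    for (; j + stride < n; j += stride) {
        float next = in[j + stride];   // fetch one iteration ahead: a second live register
        out[j] = cur * k;              // consume the previous fetch
        cur = next;
        // fetching 4 or 8 iterations ahead would mean 4 or 8 live registers per thread
    }
    out[j] = cur * k;                  // drain the last fetched value
}

Speculate much deeper than that across a full complement of resident warps and the register file is gone.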


If other tools are lagging in the absolute performance they produce then...
If the same tool produces lower performance on a newer hw by the same vendor then...

Kepler is the basis of several years' worth of tools development, I'm sure it'll get better.


This isn't your grandma's auto-vectorisation.


That's what he's selling: declarative abstractions that are meaningful to their tools, instead of asking programmers to peer into manuals and try to make sense of directed tests.

I don't know what language they will end up with. He's pretty clear that things have to change.
I think you are buying too much into Dally's vision, too soon. The kind of tool level wizardry you are speaking of (or seem to think that Kepler relies upon) simply isn't possible in C. The existing codes are in C (pretty much). The C compilers won't be able to do anything fundamentally different, or more than what they do today.
 
EVGAs are highly wanted. Other brands are easier to get - well at least in Germany. If you say you want a 680, I can get you one right now.
 
EVGAs are highly wanted. Other brands are easier to get - well at least in Germany. If you say you want a 680, I can get you one right now.

Oh it's not hard to find one for $100 above MSRP, which is probably still cheaper than getting it from Germany.
 
499 Euro is like $650 US. I'd rather pay $499 US. But maybe that's just me. Last I checked there were some on eBay for ~$599 US, which would still be $50 cheaper than getting it from Europe, it seems. There's no doubt Europe has better availability, as it did with the 7970 as well, and it's probably because the price winds up being much higher.
 
499€ with tax in Germany means 419€ without tax, which is almost exactly $549 at the current exchange rate.
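(That's just the 19% German VAT coming off: 499 / 1.19 ≈ 419, and 419 × ~1.31 $/€ ≈ 549.)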

Fat lot of good that would do; here you can usually get away with paying no tax from Newegg, e.g. $499 = $499.

Generally you only have to pay tax if you live in one of the few states where the online retailer has a physical distribution center.
 
My understanding of Kepler is that it relies on the compiler for scheduling only to the extent that GCN does. The compiler can figure out when to deschedule (or stall for sync). I don't think the compiler has any role in deciding which thread to switch to. Which is why I don't think tools have much to do with Kepler, as yet.

For ALU operations, Kepler is more dependent on the compiler than GCN. GCN arranges its SIMD issue so that the next instruction cannot be issued until the current one has completed.
The general restriction is that each wave is sequential issue, so the architecture avoids the complexity, and probably forgoes some performance benefits as a result.

The exception would be the memory wait commands, which can allow memory accesses to be fired off without waiting for the previous one to complete.
I wonder if that could be extended to other instruction types in the future.
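FWIW, that's essentially the same trick CUDA code exploits at the source level, just without an explicit counter. A sketch (plain CUDA for illustration, not GCN ISA):

__global__ void sum2(const float *a, const float *b, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    float x = a[i];     // load #1 goes out
    float y = b[i];     // load #2 goes out without waiting for #1 to return
    out[i] = x + y;     // the only point that actually has to wait on both
}

The wait happens at the point of first use, not between the two fetches.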
 
It looks like the rumored GTX 670 Ti is likely to actually be the 670.

Desktop GK107?:

NVIDIA_DEV.0F00 = "NVIDIA GeForce GT 630"
NVIDIA_DEV.0F01 = "NVIDIA GeForce GT 620"
NVIDIA_DEV.0FC1 = "NVIDIA GeForce GT 640 "

There is no Fermi with 0F**, but Mobile GK107 was 0FD*. Maybe GK108?
The mobile 600M lineup has a 96 CC Fermi coexisting with a 384 CC Kepler as the 640M LE. Given that the existing 620 is 48 CC, if the 620 follows the same pattern as the 640M LE, that would point to a 192 CC Kepler as the alternate 620; I'd say either a disabled GK107 or some lower chip. Then the 630/640 could be "full" GK107s.
 
EVGAs are highly wanted. Other brands are easier to get - well at least in Germany. If you say you want a 680, I can get you one right now.

499 Euro is like $650 US. I'd rather pay $499 US. But maybe that's just me. Last I checked there was some on Ebay for ~$599 US which would still be $50 cheaper than getting it from europe it seems. There's no doubt Europe has better availability as they did so with the 7970 as well, and it's probably because the price winds up being much higher.

This isn't a direct reply to either of you, just a commentary on the inane conclusions that Internet Forumites jump to.

It's the "oh no" the sky is falling or maybe things are just great.

With the 7970, stock in Europe meant demand was low, while no stock in the US meant supply was horrible and AMD barely made any cards.

With the 680, stock in Europe means Nvidia has plenty of supply, while no stock in the US means demand is great.

:D

It's amusing how the same exact situation for the two cards means completely different things for some people. A prime example of people seeing what they want to see, rather than limiting their conclusion jumping to just the limited facts available.

Again not a direct reply to either of you. Just a general comment on what you see around the net with the two launches.

Regards,
SB
 
I think we are talking at cross purposes here. You are speaking of the compiler's ability to schedule hw threads (or play a role in it). I am speaking of the compiler's code generation and instruction scheduling within a single thread.
I'm talking about both. Execution and a memory hierarchy that supports that execution. Both within the scope of the tools.

My understanding of Kepler is that it relies on the compiler for scheduling only to the extent that GCN does.
You should elucidate because I disagree. Kepler does things (or rather requires organisation) at the ALU instruction level that GCN isn't doing. Both chips are doing similar things at the core level (loads/stores and ALUs as distinct kinds of instructions that need to be scheduled) but at the ALU level GK104 benefits from compiler assistance.

The compiler can figure out when to deschedule (or stall for sync). I don't think the compiler has any role in deciding which thread to switch to. Which is why I don't think tools have much to do with Kepler, as yet.
The decision to do so has to be founded upon some model of the hardware, i.e. the likelihood that stalls are best avoided by this bundling.

I've got no idea what you think is happening if you don't think GK104 performance is dependent upon the compiler.
 
The degree of speculation matters. Excessive speculative fetches mean more registers to store their results. Not something non-synthetic code would be willing to do too much of.
Why would the data be in registers? There's a whole memory hierarchy at the beck and call of the tools, in NVidia's ultimate vision of this stuff.

If the same tool produces lower performance on a newer hw by the same vendor then...
Maybe you should present real evidence. You know where equal hand-optimisation effort goes into both. You appear to think that existing CUDA kernels are not hand-optimised for the hardware they first ran on.

I think you are buying too much into Dally's vision, too soon.
I'm merely elucidating Dally's position. And that GK104 demonstrates a strong shift towards that vision. Dally presented evidence of their tools out-performing hand-optimisation. Now one can argue that that evidence is skewed (e.g. the person was inexperienced). And that NVidia is doomed. etc.

I'm simply elucidating what I see as a marriage of the design of GK104 with the future NVidia has painted. It's the first step towards the chip we'll see in five years.

The headline is "NVidia doesn't want to spend transistors and power on out of order ALU instruction issue". Get used to it.

The kind of tool level wizardry you are speaking of (or seem to think that Kepler relies upon) simply isn't possible in C. The existing codes are in C (pretty much). The C compilers won't be able to do anything fundamentally different, or more than what they do today.
The dirty secret of HPC (i.e. what's done on clusters) is that the performance is miserable. The existing tools are pathetic.

He was pretty clear that the language has to change. He's hardly the first to say this.
 
Silent_Buddha said:
It's amusing how the same exact situation for the two cards means completely different things for some people. A prime example of people seeing what they want to see, rather than limiting their conclusion jumping to just the limited facts available.

Again not a direct reply to either of you. Just a general comment on what you see around the net with the two launches.
The one thing to keep in mind, is that the situation you've painted above as reality is one that itself has already been filtered by your own tinted glasses. ;)
 
For ALU operations, Kepler is more dependent on the compiler than GCN. GCN arranges its SIMD issue so that the next instruction cannot be issued until the current one has completed.
The general restriction is that each wave is sequential issue, so the architecture avoids the complexity, and probably forgoes some performance benefits as a result.
In what way is Kepler more dependent on the compiler, other than depending on it to dual-issue?

I wonder if that could be extended to other instruction types in the future.
It could be, but it wouldn't be worth it, since memory loads are the biggest-latency operations by far.
 
In what way is Kepler more dependent on the compiler, other than depending on it to dual-issue?
The compiler encodes the dependence information of an ALU operation into the instruction.
The hardware scheduler does not check for inter-ALU-op data hazards in order to select a warp instruction or to put a warp to sleep. The compiler tells the scheduling hardware whether a warp can be considered for issue, or when the active mask can be updated to say the warp is ready again.

GCN doesn't do this. A given wavefront cannot issue another instruction for itself until after the current instruction is done. For most SIMD ops, a wavefront issues over 4 cycles, during which the SIMD scheduler physically cannot pick another instruction for that wavefront before the current one is finished.
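To make the division of labour concrete, here's the same idea at the CUDA source level (my reading of it; Kepler's actual control encodings aren't publicly documented in detail):

__global__ void axpy_ilp(const float *x, const float *y, float *out, float a, int n)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    int i = 4 * tid;
    if (i + 3 >= n) return;

    // Four independent FMAs: the compiler can mark these as safe to issue
    // back to back (and pair some of them for dual issue) because no result
    // feeds the next instruction.
    float r0 = a * x[i + 0] + y[i + 0];
    float r1 = a * x[i + 1] + y[i + 1];
    float r2 = a * x[i + 2] + y[i + 2];
    float r3 = a * x[i + 3] + y[i + 3];

    // A dependent chain: here the compiler itself has to encode the stalls,
    // since (per the above) the scheduler isn't checking ALU result hazards.
    out[tid] = ((r0 + r1) + r2) + r3;
}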
 
The compiler encodes the dependence information of an ALU operation into the instruction.
The hardware scheduler does not check for inter-ALU-op data hazards in order to select a warp instruction or to put a warp to sleep. The compiler tells the scheduling hardware whether a warp can be considered for issue, or when the active mask can be updated to say the warp is ready again.

GCN doesn't do this. A given wavefront cannot issue another instruction for itself until after the current instruction is done. For most SIMD ops, a wavefront issues over 4 cycles, during which the SIMD scheduler physically cannot pick another instruction for that wavefront before the current one is finished.
So what you are saying boils down to the compiler inserting some kind of "don't deschedule this warp" flags in instructions to help with superscalar issue, say to avoid in-pipe registers being corrupted. GCN avoids this by not having dual issue at all.

That much is fine, and really no big deal. Generating these don't-deschedule flags correctly is pretty damn simple. Generating instructions optimally around this don't-deschedule flag is simple enough that, by now, Kepler's compilers should have mined out >98% of the potential there, even if they started from scratch for Kepler. Besides, Kepler isn't the first to dual issue. It has been around since Fermi's days. If the compute subset of the ISA hasn't undergone radical revisions, the compiler should have been pretty good at this even before Kepler, and that should have carried over.

This is fundamentally different from what (I think) Jawed is suggesting. As I understood it, he is suggesting the compiler has some say in which other warp to switch to. What you just suggested, and that is the limit of the compiler's involvement in my view, is purely dependent upon a single warp's instruction stream. Which would limit the compiler's role to deciding for an individual warp at a time.
 
I'm talking about both. Execution and a memory hierarchy that supports that execution. Both within the scope of the tools.
Not with C. The pointers and whatnot make doing anything sophisticated hell. ICC, which has probably the best optimizer around, needs a ton of #pragmas just to do auto-vectorization. The kind of program transformations you are referring to go beyond that, and IMO are just not possible without a ton of pragmas, if they are possible at all, in a pointer-based language. OCL/CUDA don't have those. Not sure if DirectCompute does.
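The closest CUDA gets is making the programmer hand over the aliasing promise explicitly; an illustrative kernel (names made up):

__global__ void blur3(const float * __restrict__ in,
                      float * __restrict__ out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < 1 || i >= n - 1) return;
    // Without the __restrict__ promise the compiler has to assume out may
    // alias in, so it can't freely reorder or batch these loads; it cannot
    // prove non-aliasing from C-style pointers on its own.
    out[i] = (in[i - 1] + in[i] + in[i + 1]) * (1.0f / 3.0f);
}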

You should elucidate because I disagree. Kepler does things (or rather requires organisation) at the ALU instruction level that GCN isn't doing. Both chips are doing similar things at the core level (loads/stores and ALUs as distinct kinds of instructions that need to be scheduled) but at the ALU level GK104 benefits from compiler assistance.
3d just described what I think is the limit of the compiler's involvement in Kepler. I consider the additions he described over GCN minor enough that I lumped them all into saying that Kepler does as much as GCN does.

I've got no idea what you think is happening if you don't think GK104 performance is dependent upon the compiler.
The performance is obviously determined by the compiler. But I think the compiler has reached a point where any non-negligible incremental performance is going to be hard to find from the compiler alone. Of course, LOTS of minor compiler changes will add up, but the 680 will prolly be EOLed before that.

What's hurting the 680? I have said it before: I think it is the reduction in the 680's latency-hiding ability due to lots more ALUs, not nearly enough RF, and no more cache/shared mem.
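Back-of-the-envelope, using the per-SM figures as I remember them (so treat the exact numbers with suspicion): a GF110 SM has 32 ALUs, 32K 32-bit registers and 64 KB of shared/L1; a GK104 SMX has 192 ALUs, 64K registers and the same 64 KB. That's roughly 1024 vs ~340 registers per ALU, and 2 KB vs ~0.34 KB of shared/L1 per ALU. Three times the ALUs fed from the same shared memory and only twice the registers leaves a lot fewer resident warps per ALU to hide latency with.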
 
Why would the data be in registers? There's a whole memory hierarchy at the beck and call of the tools, in NVidia's ultimate vision of this stuff.
Because L1/shared is even more scarce?

Maybe you should present real evidence. You know where equal hand-optimisation effort goes into both. You appear to think that existing CUDA kernels are not hand-optimised for the hardware they first ran on.
LuxMark, 580 vs 680.


I'm merely elucidating Dally's position. And that GK104 demonstrates a strong shift towards that vision. Dally presented evidence of their tools out-performing hand-optimisation. Now one can argue that that evidence is skewed (e.g. the person was inexperienced). And that NVidia is doomed. etc.

I'm simply elucidating what I see as a marriage of the design of GK104 with the future NVidia has painted. It's the first step towards the chip we'll see in five years.

The headline is "NVidia doesn't want to spend transistors and power on out of order ALU instruction issue". Get used to it.
The 680 does a lot more in the compiler than the 580. Great. Gives more of a role to the compiler vs hw, cool. The right way to go, IMO. The 680 will gain in the near future from radical compiler-driven transformations of C? Umm... no.

The dirty secret of HPC (i.e. what's done on clusters) is that the performance is miserable. The existing tools are pathetic.

The tools are not so bad considering the hand they are dealt. Using C means shooting the compiler in the foot and taking charge like a Real Programmer™. There's only so much a compiler can do when you decide to take charge of the bare metal.

What Dally seemed to be pushing for is nested data parallelism. Now that's where compilers can really do magic, if you constrain your language tightly enough. Blelloch showed in 1990, iirc, that there exists a compile-time transformation from nested data parallelism (iow, the code which you want to write) to flat data parallelism (iow, the code which you want to execute).

It's incredibly hard. Data Parallel Haskell is the closest anybody has come to a concrete implementation so far. And even they have a long way to go. You should listen to Simon Peyton Jones' talks if you want a glimpse of what Dally is looking at.
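The canonical example is Blelloch's sparse matrix-vector product: the code you want to write is a parallel loop over rows, each containing a nested reduction of a different length; the code you want to execute is one flat operation over all the nonzeros plus a segmented reduction. A rough CUDA-flavoured sketch of the flat form (hypothetical, with an atomic standing in for the segmented reduction a real flattening compiler would generate):

// Nested form (what you want to write):
//   for each row i, in parallel:
//       y[i] = sum over the nonzeros j of row i of val[j] * x[col[j]]
//
// Flat form (what you want to execute): one flat index space over all the
// nonzeros; the nesting survives only as a segment (row) id per element.
__global__ void spmv_flat(const int *row_of,    // row (segment) id per nonzero
                          const int *col,       // column index per nonzero
                          const float *val,     // value per nonzero
                          const float *x,
                          float *y,             // per-row results, zero-initialised
                          int nnz)
{
    int k = blockIdx.x * blockDim.x + threadIdx.x;
    if (k >= nnz) return;
    atomicAdd(&y[row_of[k]], val[k] * x[col[k]]);
}

Doing that transformation automatically, for arbitrary nesting depth and irregular sizes, is the hard part DPH is chewing on.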
 
The one thing to keep in mind, is that the situation you've painted above as reality is one that itself has already been filtered by your own tinted glasses. ;)

That could be, but I'm not partial to either brand at the moment.

During the first couple of launch weeks, 7970 availability was low in the US, as is the 680's now. During the first couple of weeks, availability was much better in Europe, again rather similar for both.

I do personally think demand is higher for GTX 680 than it was for 7970, but that's rather difficult to quantify in any meaningful way.

Regards,
SB
 