NVIDIA Kepler speculation thread

The important parameter is how many threads are required to hide a fetch from DDR. When the chip is starved of threads, the ALUs will idle whenever DDR becomes the bottleneck.

NVidia is aiming for a programming model where DDR fetch latency is minimised by optimising the application's use of the memory hierarchy. So even if the number of threads required to hide DDR latency goes up, it matters less if the chip does fewer fetches from DDR.
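As a rough back-of-the-envelope (every number below is an assumption for illustration, not a vendor figure), Little's law gives a feel for how many warps that takes:

```cuda
// Back-of-the-envelope, host-side only. All numbers are assumptions.
#include <stdio.h>

int main(void)
{
    int dram_latency_cycles = 400; // assumed DDR round trip
    int issue_rate          = 1;   // warp-instructions per cycle per scheduler (assumed)
    int indep_per_warp      = 4;   // independent instructions a warp can issue
                                   // before it stalls on its outstanding load (assumed)

    // Little's law: concurrency = latency x throughput. Each resident warp
    // contributes 'indep_per_warp' of that concurrency before it stalls.
    int warps_needed = (dram_latency_cycles * issue_rate + indep_per_warp - 1)
                       / indep_per_warp;

    printf("~%d resident warps per scheduler to cover a DDR fetch\n", warps_needed);
    return 0;
}
```

With those assumed numbers it's on the order of a hundred resident warps per scheduler; shrink the effective latency (cache hits, prefetch) or raise the per-warp ILP and the requirement drops.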

And if the compiler inserts "pre-fetch" instructions into the kernel, for example, so that the latency the threads actually have to hide is L2 (or L1) latency, then the number of fetches from DDR matters even less.
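A manual version of that idea might look like the sketch below. The kernel and the one-stride prefetch distance are made up for illustration; prefetch.global.L2 is a real PTX instruction, and the point is simply that the load at the bottom should mostly see on-chip latency rather than DDR latency.

```cuda
// Hedged sketch of manual prefetching: pull the next iteration's data towards
// L2 so the latency the warp hides at the real load is (mostly) on-chip.
__global__ void scale(const float *in, float *out, int n)
{
    int stride = blockDim.x * gridDim.x;
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n; i += stride)
    {
        // Prefetch the element this thread will need on its next trip round the loop.
        if (i + stride < n)
            asm volatile("prefetch.global.L2 [%0];" :: "l"(in + i + stride));

        // This load was (hopefully) prefetched on the previous iteration.
        out[i] = 2.0f * in[i];
    }
}
```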

ALU scheduling, in this case, is eased by pre-computed memory access patterns.

This is essentially what all GPGPU compute has been about: optimisation of the memory hierarchy against the latency hiding provided by hardware threads. NVidia plans to take more control of this with its tools, because it's a nightmare for programmers (the optimisation space is vast even with only a few dimensions).
 
Another open question is whether Kepler's static scheduling only supports a single in-flight instruction per warp per SIMD. All architectures since G80 can have multiple instructions in flight from the same warp. If they've dropped this capability, that's a significant source of ILP they've abandoned. I see no reason why you couldn't statically schedule multiple in-flight instructions, though.
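For concreteness, the ILP in question is the mundane kind in this purely illustrative kernel: with two independent accumulators, a scheduler that tracks two in-flight instructions per warp can overlap the two chains, while one that tracks only one has to fall back on other warps to fill the gap.

```cuda
// Purely illustrative kernel: two independent dependency chains in one warp.
__global__ void dot2(const float *a, const float *b, float *partial, int n)
{
    int tid    = blockIdx.x * blockDim.x + threadIdx.x;
    int stride = blockDim.x * gridDim.x;
    float acc0 = 0.0f, acc1 = 0.0f;            // independent accumulators -> ILP of 2

    // Tail elements ignored for brevity; we step two strides at a time.
    for (int i = tid; i + stride < n; i += 2 * stride)
    {
        acc0 += a[i] * b[i];                   // chain 0
        acc1 += a[i + stride] * b[i + stride]; // chain 1, no dependence on chain 0
    }
    partial[tid] = acc0 + acc1;                // per-thread partial result
}
```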

The description of the precompiled instruction bits for Kepler sounds like the compiler puts a value in the instruction stating how far away the next dependent instruction is. This value is used to mask the readiness state in the scheduler.

It may be a decision specific to an implementation whether the table can be updated by more than one instruction at a time.
The analysis gets more complicated with more instructions in flight. Instruction 1's distance to the next dependent instruction does not provide information on how far away Instruction 2's next dependent is.
If the next instruction actually has a closer dependence than the first, it can wind up overriding the first instruction's wait value.

There could be ways to check multiple wait counters at the same time and calculate how far ahead the scheduler can fetch, but any benefits would need to be weighed against the extra complexity.
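Here's a toy model of what I mean, with the caveat that everything in it (one wait counter per warp, the compiler-encoded field, the min-combining note) is guesswork rather than anything documented:

```cuda
// Toy host-side model of a compiler-hinted scheduler. Entirely speculative:
// each instruction carries "cycles until my first dependent may issue", and
// the warp is simply masked as not-ready for that long.
#include <stdio.h>

typedef struct { int wait; } Instr;          // compiler-encoded wait hint
typedef struct { int pc; int stall; } Warp;  // per-warp scheduler state

static void step(Warp *warps, int nwarps, const Instr *code, int ninstr)
{
    for (int w = 0; w < nwarps; ++w) {       // naive priority pick
        if (warps[w].stall == 0 && warps[w].pc < ninstr) {
            Instr i = code[warps[w].pc++];
            // With >1 instruction in flight per warp, a real design would have
            // to combine hints, e.g. keep the minimum of the outstanding waits.
            warps[w].stall = i.wait;
            printf("issued from warp %d (ready again in %d cycles)\n", w, i.wait);
            return;
        }
    }
    printf("bubble: no warp ready\n");
}

int main(void)
{
    Instr code[4]  = { {2}, {0}, {4}, {1} }; // made-up wait hints
    Warp  warps[2] = { {0, 0}, {0, 0} };

    for (int cycle = 0; cycle < 8; ++cycle) {
        step(warps, 2, code, 4);
        for (int w = 0; w < 2; ++w)          // advance time
            if (warps[w].stall > 0) warps[w].stall--;
    }
    return 0;
}
```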
 
My only observation is that 2 instructions from the same hardware thread can't be issued consecutively.

Sure they can, as long as they're independent instructions (and generate no resource conflict), but not at a sustainable rate. Due to a quirk of the architecture, you can issue a few instructions back-to-back from one warp before you take a small penalty, which can be completely covered by a second warp going through the same sequence.

This is not a big issue because you generally have more than 8 warps per SM.
 
Sure they can, as long as they're independent instructions (and generate no resource conflict), but not at a sustainable rate.
Cool.

Due to a quirk of the architecture, you can issue a few instructions back-to-back from one warp before you take a small penalty, which can be completely covered by a second warp going through the same sequence.
Is that per-thread operand-collector capacity? E.g. if 3 instructions from a thread are issued, operand collection for that thread can't keep up with the 4th or 5th successive instruction (e.g. banking or space conflicts), so a thread switch is required.

Or per-thread tracking capacity (the capability of thread scoreboarding)? E.g. the more instructions that are "bundled" for sequential issue, the fewer threads can be tracked overall.

Generally I'm thinking of scheduling logic that trades off thread count against bundle size and operand count. A simpler version of the scheduling and issue in prior architectures.

This is not a big issue because you generally have more than 8 warps per SM.
Playing with the occupancy calculator I can't find a way to get fewer than 8 warps per SM. It seems there's a hard limit at 63 registers per thread (in my world 63 registers is paltry, but that's another topic).

Any scenario with only 8 warps per SM would only exist due to the block or shared-memory configuration, e.g. with 63 registers it's possible to have 32 warps per SM.
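The register-file arithmetic behind that, assuming GK104's 65536 registers per SMX and 32-thread warps, and ignoring allocation granularity:

```cuda
// Hedged occupancy arithmetic, host-side only. Assumes 65536 32-bit registers
// per SMX (GK104) and 32 threads per warp; allocation granularity ignored.
#include <stdio.h>

int main(void)
{
    int regs_per_sm     = 65536;
    int regs_per_thread = 63;    // the hard cap discussed above
    int warp_size       = 32;

    int warps_by_regs = regs_per_sm / (regs_per_thread * warp_size);  // = 32
    printf("register-limited warps per SM: %d\n", warps_by_regs);
    return 0;
}
```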
 
Any scenario with only 8 warps per SM would only exist due to the block or shared-memory configuration, e.g. with 63 registers it's possible to have 32 warps per SM.
I could be horribly wrong here, but I think Bob meant warps that were ready to execute (i.e. not waiting on external memory accesses). The question of how many ready warps you need to achieve maximum throughput is always an interesting trade-off...

If that's what he meant, one random explanation I can think of is that multiple instructions are fetched together in advance for a single warp (and the fetch unit will only get back to that warp after several cycles). However there are billions of possible explanations here, small tricks that some clever hardware guy thought of some afternoon that impose some obscure limitation exposed to software, so I won't waste my time thinking about it too much ;)
 
When NVidia starts optimising for OpenCL I guess that will be relevant. Didn't LuxMark show a massive decline in performance with recent drivers on GTX580?

I don't think this is due to the driver, but to the updated version of LuxMark... they have changed some little things, and that is why the score decreased on the 580.


Actually I think the work was mainly done on the OpenCL (Khronos) side to improve performance on Nvidia GPUs; of course it's not in Nvidia's interest to improve performance there.

Here's a little comparison (CUDA over OpenCL; those are ratios)...

The day Nvidia starts optimising OpenCL on Tesla, I can't imagine what the result will be...


As you can see, the performance is nearly equivalent... outside some rare cases, on a Tesla M2070 running Kid. Of course I can't tell whether the result would be the same on standard graphics cards... Actually I see most of the computing industry going to OpenCL... a simple way to confirm it is to see how much job listings for OpenCL developers have increased (+98% last year, vs only +18% for CUDA). And a quick tour of computing conferences will give a second hint... It looks like the trend is in favour of OpenCL, so Nvidia can't ignore it if they want to keep their hardware supremacy in this domain. Compute system developers want flexibility, something CUDA can't offer them and, if nothing changes drastically, never will.

If Nvidia persists in not giving OpenCL attention and tries to slow the project down (they are part of the Khronos group), I think they could be making a big mistake. They have the hardware, and I don't think the adoption of OpenCL (in the current situation) will change anything for Tesla, because AMD doesn't yet have an equivalent in this specific domain. OpenCL vs CUDA is not AMD vs Nvidia... compute developers don't care about the brand; if they can and want to run OpenCL on Nvidia Tesla or Quadro, they will do it. They will not move to AMD GPUs because of this. They want hardware capability and code capability, and they strictly don't care who offers it to them. What counts is the result.
 
A Google search for "opencl" gives about 2x the matches of "nvidia cuda". "cuda" by itself gives a lot of non-Nvidia-related matches.
 
I could be horribly wrong here, but I think Bob meant warps that were ready to execute (i.e. not waiting on external memory accesses). The question of how many ready warps you need to achieve maximum throughput is always an interesting trade-off...
8 warps comes from requiring 2 warps per scheduler/issuer (GK104's SMX has four of them), as far as I can tell.

Any less than that and ALUs will go idle regularly, even without fetches from DDR.

My play with the occupancy calculator wasn't exhaustive enough (I forgot to be brutal with the shared memory allocation). I have since found ways to get fewer than 8 warps per SM, all the way down to 1.

If that's what he meant, one random explanation I can think of is that multiple instructions are fetched together in advance for a single warp (and the fetch unit will only get back to that warp after several cycles).
This kind of mechanism is what I was alluding to with my scenarios for the "penalty" Bob mentioned. In a sense it's a "large instruction word" of variable length, with a tight constraint on the number of elements being tracked (threads, operands, bundles).

However there are billions of possible explanations here, small tricks that some clever hardware guy thought of some afternoon that impose some obscure limitation exposed to software, so I won't waste my time thinking about it too much ;)
Someone needs to go patent diving...
 
I don't think this is due to the driver, but to the updated version of LuxMark... they have changed some little things, and that is why the score decreased on the 580.
LM 1.0 shows similar slowdown with NV's OCL 1.1 driver. Strangely though, I can run an old path-tracer (SmallPT), written for OCL 1.0, without performance issues on every OCL1.1-enabled driver from NV.
 
Here's a little comparison (CUDA over OpenCL; those are ratios)...
[...]
As you can see, the performance is nearly equivalent... outside some rare cases.
A lot of those tests are non-interesting synthetics, at least judging from descriptions such as "texread_bw".

Apps like ray tracing or weather simulation, real work, should be the focus of a comparison. If CUDA on Kepler requires compiler attention then there's every chance that OpenCL kernels will lag simply because CUDA is NVidia's priority. There are a few features of Kepler that are also purely CUDA without OpenCL support (I expect, unless NVidia puts them in with an extension) which will lead to yet more difference.

NVidia's tools strategy is always going to favour CUDA, while CUDA is around and being actively developed.
 
LM 1.0 shows similar slowdown with NV's OCL 1.1 driver. Strangely though, I can run an old path-tracer (SmallPT), written for OCL 1.0, without performance issues on every OCL1.1-enabled driver from NV.

Doesn't the latest version use 1.2? That can have an impact, but yes, it's strange that you don't get the same slowdown with the old one... I don't think they want to force a slowdown (still, anything is possible).
 
When NVidia starts optimising for OpenCL I guess that will be relevant. Didn't LuxMark show a massive decline in performance with recent drivers on GTX580?

The opencl specific bits are the same for 580 and 680. OCL should hurt both equally.
 
The flops/register ratio hasn't changed from GF110->GK104. So how does your theory of insufficient latency hiding explain why it is slower than GF110 in some workloads?
Reg file increased by 2x. Flops increased by 4x * (1GHz/1.4GHz) ~ 3x. Cache/shared mem didn't budge at all. Latency hiding has been reduced.
 
You mean the up to 2x improvement in ray tracing performance, right?
http://www.tml.tkk.fi/~timo/HPG2009/

Timo Aila said:
Notes: On Kepler we fetch all node and triangle data through the texture cache, and L1 is used only for low-priority traffic (ray fetches, result writes) and local-mem traversal stacks. Kepler seems to like batched tex fetches to the point that it is beneficial to issue many fetches whose results have a low probability of being needed. Also, it seems beneficial to do even more speculative traversal work, we now postpone up to 2 leaf nodes.

It's a hand-tuned kernel. Look at the kind of hacks here. Obviously not something a vendor-neutral ray tracer would do. The trick for any arch is to make generically written (but with "reasonable" care) code fast, not just code written and tuned by the vendor.
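For anyone who hasn't seen that style of kernel: routing read-only node data through the texture cache on this class of hardware classically means binding the buffer to a texture reference and fetching with tex1Dfetch. The stripped-down sketch below is purely illustrative (invented names, not the HPG kernel), just to show what that part of the tuning amounts to:

```cuda
// Illustrative sketch only: read-only BVH node data goes through the TEX
// cache via a bound texture reference, the classic pre-__ldg approach.
texture<float4, cudaTextureType1D, cudaReadModeElementType> nodeTex;

__global__ void traverseStub(const int *rootIdx, float4 *out, int nRays)
{
    int ray = blockIdx.x * blockDim.x + threadIdx.x;
    if (ray >= nRays) return;

    int node = rootIdx[0];                      // "low-priority" traffic via L1/global
    float4 bounds = tex1Dfetch(nodeTex, node);  // node data via the texture cache
    out[ray] = bounds;                          // placeholder "result write"
}

// Host side (error checking omitted): bind the node buffer to the texture.
//   cudaBindTexture(0, nodeTex, d_nodes, nNodes * sizeof(float4));
```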
 
A lot of those tests are non-interesting synthetics, at least judging from descriptions such as "texread_bw".

Apps like ray tracing or weather simulation, real work, should be the focus of a comparison. If CUDA on Kepler requires compiler attention then there's every chance that OpenCL kernels will lag simply because CUDA is NVidia's priority. There are a few features of Kepler that are also purely CUDA without OpenCL support (I expect, unless NVidia puts them in with an extension) which will lead to yet more difference.

NVidia's tools strategy is always going to favour CUDA, while CUDA is around and being actively developed.

They are interesting, as they are used to compare code and systems: a baseline, maybe drawn artificially, but these are exactly the same tests that are used to compare hardware system X to system Y, since people believe those numbers represent the efficiency of a system (hardware-wise). So using them to compare different code/libraries in a similar environment, on the same hardware system, is important. The results are important... (and to give you a little piece of info, those tests were used before to show how CUDA was superior to OpenCL, so it is not completely innocent).

Basically you will not find as much difference in real scenarios as you can find in those numbers... of course, at some point you need to fix a baseline, something that can be measured equally, and as the code is different, this complicates the task a bit.

You may find those tests not valuable, but they are the ones used in the industry for comparing compute hardware... and they were produced by one of the most fervent CUDA-supporting developers (I exaggerate a little bit: they don't support CUDA as a priority, they just support computing, and of course CUDA is a big part of it... HPC News Wire).

Personally I'm not a computing specialist... I use it every day for different tasks (3D/2D engineering with Autodesk 3ds Max, Inventor, etc., and I have worked with AutoCAD since 1991; it's my speciality, I've worked with it for 21 years now). I only marginally use compute performance, for my render tasks, and of course at work I use Quadro-based workstations (and at home I have always used ATI, now AMD, for gaming and my personal home tasks, even with AutoCAD).
 
Actually I think the work was mainly done on the OpenCL (Khronos) side to improve performance on Nvidia GPUs; of course it's not in Nvidia's interest to improve performance there.
[...]
Here's a little comparison (CUDA over OpenCL; those are ratios)...
[...]
As you can see, the performance is nearly equivalent... outside some rare cases.
Eh...I can't see the photo probably cos I'm inside the GFW of China now...

It seems they have got a nice result. I was thinking that OpenCL's higher-level nature might cause inefficiency, and that NV might tend not to spend too much effort on it. That's a nice optimization.
 
It's a hand-tuned kernel.
They're all hand-tuned kernels. That's what "reasonable care" amounts to.

Look at the kind of hacks here.
Using the TEX cache is a hack?

More intense speculative fetches than on older GPUs? That's not a hack, since they were doing that already.

In fact all of this points directly at the kind of tool-based optimisation that Dally was talking about. Speculative stuff especially so, since that's something that an SM scheduler just can't grok.

Obviously not something a vendor-neutral ray tracer would do. The trick for any arch is to make generically written (but with "reasonable" care) code fast, not just code written and tuned by the vendor.
Seems you haven't seen Dally's presentation.
 