Nvidia Pascal Announcement

http://www.theregister.co.uk/2016/04/06/nvidia_gtc_2016/
Specifically they say: "Software running on the P100 can be preempted on instruction boundaries, rather than at the end of a draw call."
Since Kepler and Maxwell already have instruction-level preemption ability, there's likely something more to it. Perhaps it's just a software change that now exposes it to the graphics driver rather than just CUDA? Or, thinking about it, it's probably the ability to reorder the kernel execution stack instead of just a push/pop like Kepler, which would explain the "real-time" example.
 
Since Kepler and Maxwell already have instruction-level preemption ability, there's likely something more to it. Perhaps it's just a software change that now exposes it to the graphics driver rather than just CUDA? Or, thinking about it, it's probably the ability to reorder the kernel execution stack instead of just a push/pop like Kepler, which would explain the "real-time" example.
Maxwell & Kepler can only switch at draw call boundaries, which is not fine-grained preemption.

This is an improvement in Pascal, if the report is correct.
Cheers
 
Since Kepler and Maxwell already have instruction-level preemption ability, there's likely something more to it. Perhaps it's just a software change that now exposes it to the graphics driver rather than just CUDA? Or, thinking about it, it's probably the ability to reorder the kernel execution stack instead of just a push/pop like Kepler, which would explain the "real-time" example.

Your previous post described preemption in terms of kernels being able to launch new kernels and then restore their own state.
Preemption in the more general sense would allow an unrelated service, or the system, to force a kernel to stop execution and yield to something else, with or without the preempted kernel's involvement.
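For what it's worth, the only knob CUDA exposes today that is even adjacent to this is stream priorities; whether Pascal's finer-grained preemption ever gets surfaced through that path is pure guesswork on my part. A minimal sketch of the idea (kernel names and sizes made up for illustration):

[code]
#include <cuda_runtime.h>

// Hypothetical kernels: a long-running batch job and a latency-sensitive
// task that we would like the hardware to preempt for.
__global__ void long_running_kernel(float* data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2.0f;
}
__global__ void latency_critical_kernel(float* out) {
    if (threadIdx.x == 0) out[0] = 1.0f;
}

int main() {
    int least, greatest;
    // Query the supported priority range (numerically lower = higher priority).
    cudaDeviceGetStreamPriorityRange(&least, &greatest);

    cudaStream_t batch, urgent;
    cudaStreamCreateWithPriority(&batch,  cudaStreamNonBlocking, least);
    cudaStreamCreateWithPriority(&urgent, cudaStreamNonBlocking, greatest);

    float *data, *out;
    cudaMalloc((void**)&data, (1 << 18) * sizeof(float));
    cudaMalloc((void**)&out, sizeof(float));

    // On pre-Pascal parts the urgent work can only jump ahead at coarse
    // boundaries; instruction-level preemption would let it cut in mid-kernel.
    // The difference is invisible at the API level.
    long_running_kernel<<<1024, 256, 0, batch>>>(data, 1 << 18);
    latency_critical_kernel<<<1, 32, 0, urgent>>>(out);

    cudaDeviceSynchronize();
    cudaFree(data); cudaFree(out);
    return 0;
}
[/code]

The point is just that the API already lets you say "this work is urgent"; the hardware decides how rudely it may interrupt whatever is already running.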
 
I find that highly unlikely.
It's not without precedent though... the 8800 GTX (and maybe the 8800 GT?) had a companion chip with the video display interface hardware on it. Not saying I'd expect it, but it would free up die space on a chip that seems primarily geared to live in data centers... :p A Quadro is priced high enough as it is that an extra display adapter chip wouldn't be a burdensome cost. *shrug*
 
You'd think that, at 16nm, the display control logic would be tiny enough not to matter much, with the IOs being the biggest area cost, and you'd need those anyway to transfer data to an external chip.
 
I can confirm the instruction-level preemption in P100; it was mentioned in a talk at GTC. Another talk mentioned a different Pascal variant, still to be released, that has a SIMD instruction taking 8-bit integers as input and doing the math in 32-bit float. The claim was that this could provide a 4x increase in throughput for some deep learning applications. That was all the detail he provided.
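Since no more detail was given, the following is purely my own illustration of the operation as described (four 8-bit integers packed into a 32-bit word, math carried out in FP32). The byte ordering, the packed layout and the function names are my assumptions, not anything Nvidia has documented:

[code]
#include <cstdint>

// Per-lane emulation of the described operation: unpack four signed 8-bit
// values from a 32-bit word, widen them, and accumulate products in FP32.
__host__ __device__ float dot4_i8_as_f32(uint32_t a, uint32_t b, float acc) {
    for (int k = 0; k < 4; ++k) {
        int8_t ai = (int8_t)((a >> (8 * k)) & 0xFF);
        int8_t bi = (int8_t)((b >> (8 * k)) & 0xFF);
        acc += (float)ai * (float)bi;   // the math is done in 32-bit float
    }
    return acc;
}

// Trivial kernel showing the intended use: each 32-bit load carries four
// int8 inputs, so one "instruction" covers four multiply-accumulates.
__global__ void int8_dot_kernel(const uint32_t* a, const uint32_t* b,
                                float* out, int n_words) {
    float acc = 0.0f;
    for (int i = threadIdx.x; i < n_words; i += blockDim.x)
        acc = dot4_i8_as_f32(a[i], b[i], acc);
    atomicAdd(out, acc);
}
[/code]

If the real instruction does something like this in one issue slot, the claimed 4x over plain FP32 multiply-adds falls out of the packing.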
 
For Pascal it was claimed to have double the FLOPS/Watt (based on SGEMM, which runs close to the peak FP32 rate).
Doing the maths, we go from 7 TFLOPS / 250 W to 10.6 TFLOPS / 300 W.
Which is an improvement from 28 GFLOPS/Watt to about 35 GFLOPS/Watt.
That is only roughly 26% better instead of 100%, far from what was predicted.
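Spelling the arithmetic out (numbers straight from the claim above, nothing else assumed):

[code]
#include <cstdio>

int main() {
    const double prev_tflops = 7.0,  prev_watts = 250.0;   // current 250 W part
    const double p100_tflops = 10.6, p100_watts = 300.0;   // P100

    const double prev_eff = prev_tflops * 1000.0 / prev_watts;  // GFLOPS/W
    const double p100_eff = p100_tflops * 1000.0 / p100_watts;  // GFLOPS/W

    std::printf("%.1f -> %.1f GFLOPS/W, gain %.0f%%\n",
                prev_eff, p100_eff, (p100_eff / prev_eff - 1.0) * 100.0);
    // Prints: 28.0 -> 35.3 GFLOPS/W, gain 26%
    return 0;
}
[/code]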
 
For Pascal it was claimed to have double the FLOPS/Watt (based on SGEMM, which runs close to the peak FP32 rate).
Doing the maths, we go from 7 TFLOPS / 250 W to 10.6 TFLOPS / 300 W.
Which is an improvement from 28 GFLOPS/Watt to about 35 GFLOPS/Watt.
That is only roughly 26% better instead of 100%, far from what was predicted.

Double precision.
 
Besides, why would you compare the efficiency at peak FP32 performance?
As @pjbliverpool already noted, efficiency at double precision improved much further. The same goes when you utilize half precision; in both of those cases the 100% improvement was achieved.
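For the half-precision case, that factor of two shows up through the packed half2 path; a quick sketch of my own (not anything from Nvidia's materials), assuming the cuda_fp16.h intrinsics and an sm_60 build:

[code]
#include <cuda_fp16.h>

// Packed-FP16 sketch: each __hfma2 performs two half-precision fused
// multiply-adds per instruction, which is where GP100's 2x FP16 rate
// relative to FP32 comes from.
__global__ void axpy_half2(__half2 a, const __half2* x, __half2* y, int n2) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n2) {
        y[i] = __hfma2(a, x[i], y[i]);   // y = a*x + y, two FP16 lanes at once
    }
}
[/code]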

And then there is also the average efficiency. Given the better ratio of ALUs to all other resources, both worst-case and average utilization, and hence efficiency, should improve greatly for many formerly constrained applications, even for FP32 loads.

A full 100% improvement for every possible use case wasn't to be expected. That exceeds the possible gains from the node change by far.
 
He could also actually be talking about the multiplier itself.

At the gate level the multiplier may well have doubled its performance per watt, but once you factor in the memories, the IO, and other components that don't scale as well, the overall performance per watt isn't as impressive.
 
Do we know any power consumption numbers other than the muddily undefined TDP number?
Additionally: is it arch vs. arch or product vs. product?

Not saying it'll all add up, but there are still quite a few variables, and if all else fails, Nvidia could still pull a 970-stunt out of its sleeve.
 
Do we know any power consumption numbers other than the muddily undefined TDP number?
Additionally: is it arch vs. arch or product vs. product?

Not saying it'll all add up, but there are still quite a few variables, and if all else fails, Nvidia could still pull a 970-stunt out of its sleeve.
I really doubt even he would piss off the HPC/exascale/hyperscale/research clients.
Case in point: back with Kepler-based Teslas, those clients got full Hyper-Q and dynamic parallelism, compared to consumer cards and, I think, even Quadro.
Cheers
 