Nvidia Pascal Announcement

http://www.theregister.co.uk/2016/04/06/nvidia_gtc_2016/
Specifically they say: "Software running on the P100 can be preempted on instruction boundaries, rather than at the end of a draw call."
Since Kepler and Maxwell already have instruction-level preemption ability, there's likely something more to it. Perhaps it's just a software change that now exposes it to the graphics driver rather than just CUDA? Or, thinking about it, it's probably the ability to reorder the kernel execution stack instead of just a push/pop like Kepler, which would explain the "real-time" example.
 
Since Kepler and Maxwell already have instruction-level preemption ability, there's likely something more to it. Perhaps it's just a software change that now exposes it to the graphics driver rather than just CUDA? Or, thinking about it, it's probably the ability to reorder the kernel execution stack instead of just a push/pop like Kepler, which would explain the "real-time" example.
Maxwell & Kepler can only switch at draw call boundaries, which is not fine-grained preemption.

This is an improvement in Pascal, if the report is correct.
Cheers
 
Since Kepler and Maxwell already have instruction-level preemption ability, there's likely something more to it. Perhaps it's just a software change that now exposes it to the graphics driver rather than just CUDA? Or, thinking about it, it's probably the ability to reorder the kernel execution stack instead of just a push/pop like Kepler, which would explain the "real-time" example.

Your previous post described preemption in terms of kernels being able to launch new kernels and then restore their own state.
Preemption in the more general sense would allow an unrelated service, or the system, to force a kernel to stop execution and yield to something else, with or without the preempted kernel's involvement.
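For what it's worth, the only knob CUDA exposes today that is even adjacent to this is stream priorities; whether Pascal's finer-grained preemption ever gets surfaced through that path is pure guesswork on my part. A minimal sketch of the idea (kernel names and sizes made up for illustration):

[code]
#include <cuda_runtime.h>

// Hypothetical kernels: a long-running batch job and a latency-sensitive
// task that we would like the hardware to preempt for.
__global__ void long_running_kernel(float* data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2.0f;
}
__global__ void latency_critical_kernel(float* out) {
    if (threadIdx.x == 0) out[0] = 1.0f;
}

int main() {
    int least, greatest;
    // Query the supported priority range (numerically lower = higher priority).
    cudaDeviceGetStreamPriorityRange(&least, &greatest);

    cudaStream_t batch, urgent;
    cudaStreamCreateWithPriority(&batch,  cudaStreamNonBlocking, least);
    cudaStreamCreateWithPriority(&urgent, cudaStreamNonBlocking, greatest);

    float *data, *out;
    cudaMalloc((void**)&data, (1 << 18) * sizeof(float));
    cudaMalloc((void**)&out, sizeof(float));

    // On pre-Pascal parts the urgent work can only jump ahead at coarse
    // boundaries; instruction-level preemption would let it cut in mid-kernel.
    // The difference is invisible at the API level.
    long_running_kernel<<<1024, 256, 0, batch>>>(data, 1 << 18);
    latency_critical_kernel<<<1, 32, 0, urgent>>>(out);

    cudaDeviceSynchronize();
    cudaFree(data); cudaFree(out);
    return 0;
}
[/code]

The point is just that the API already lets you say "this work is urgent"; the hardware decides how rudely it may interrupt whatever is already running.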
 
I find that highly unlikely.
It's not without precedent though... the 8800 GTX (and maybe the 8800 GT?) had a companion chip with the video display interface hardware on it. Not saying I'd expect it, but it would free up die space on a chip that seems primarily geared to live in data centers... :p A Quadro is priced high enough as it is that an extra display adapter chip wouldn't be a burdensome cost. *shrug*
 
You'd think that, at 16nm, the display control logic would be tiny enough not to matter much, with the IOs being the biggest area cost, and you'd need those anyway to transfer data to an external chip.
 
I can confirm the instruction-level preemption in P100; it was mentioned in a talk at GTC. Another talk mentioned a different Pascal variant, still to be released, that has a SIMD instruction taking 8-bit integers as input and doing the math in 32-bit float. The claim was that this could provide a 4x increase in throughput for some deep learning applications. That was all the detail he provided.
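Since no more detail was given, the following is purely my own illustration of the operation as described (four 8-bit integers packed into a 32-bit word, math carried out in FP32). The byte ordering, the packed layout and the function names are my assumptions, not anything Nvidia has documented:

[code]
#include <cstdint>

// Per-lane emulation of the described operation: unpack four signed 8-bit
// values from a 32-bit word, widen them, and accumulate products in FP32.
__host__ __device__ float dot4_i8_as_f32(uint32_t a, uint32_t b, float acc) {
    for (int k = 0; k < 4; ++k) {
        int8_t ai = (int8_t)((a >> (8 * k)) & 0xFF);
        int8_t bi = (int8_t)((b >> (8 * k)) & 0xFF);
        acc += (float)ai * (float)bi;   // the math is done in 32-bit float
    }
    return acc;
}

// Trivial kernel showing the intended use: each 32-bit load carries four
// int8 inputs, so one "instruction" covers four multiply-accumulates.
__global__ void int8_dot_kernel(const uint32_t* a, const uint32_t* b,
                                float* out, int n_words) {
    float acc = 0.0f;
    for (int i = threadIdx.x; i < n_words; i += blockDim.x)
        acc = dot4_i8_as_f32(a[i], b[i], acc);
    atomicAdd(out, acc);
}
[/code]

If the real instruction does something like this in one issue slot, the claimed 4x over plain FP32 multiply-adds falls out of the packing.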
 
For Pascal it was claimed to have double the FLOPS/Watt (based on SGEMM, which runs close to the peak FP32 rate).
Doing the maths, we go from 7 TFLOPS / 250 W to 10.6 TFLOPS / 300 W.
Which is an improvement from 28 GFLOPS/Watt to about 35 GFLOPS/Watt.
That is only roughly 26% better instead of 100%, far from what was predicted.
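Spelling the arithmetic out (numbers straight from the claim above, nothing else assumed):

[code]
#include <cstdio>

int main() {
    const double prev_tflops = 7.0,  prev_watts = 250.0;   // current 250 W part
    const double p100_tflops = 10.6, p100_watts = 300.0;   // P100

    const double prev_eff = prev_tflops * 1000.0 / prev_watts;  // GFLOPS/W
    const double p100_eff = p100_tflops * 1000.0 / p100_watts;  // GFLOPS/W

    std::printf("%.1f -> %.1f GFLOPS/W, gain %.0f%%\n",
                prev_eff, p100_eff, (p100_eff / prev_eff - 1.0) * 100.0);
    // Prints: 28.0 -> 35.3 GFLOPS/W, gain 26%
    return 0;
}
[/code]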
 
For Pascal it was claimed to have double the FLOPS/Watt (based on SGEMM, which runs close to the peak FP32 rate).
Doing the maths, we go from 7 TFLOPS / 250 W to 10.6 TFLOPS / 300 W.
Which is an improvement from 28 GFLOPS/Watt to about 35 GFLOPS/Watt.
That is only roughly 26% better instead of 100%, far from what was predicted.

Double precision.
 
Besides, why would you compare the efficiency at peak FP32 performance?
As @pjbliverpool already noted, efficiency at double precision improved much further. The same goes when you utilize half precision; in both of those cases the 100% improvement was achieved.
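For the half-precision case, that factor of two shows up through the packed half2 path; a quick sketch of my own (not anything from Nvidia's materials), assuming the cuda_fp16.h intrinsics and an sm_60 build:

[code]
#include <cuda_fp16.h>

// Packed-FP16 sketch: each __hfma2 performs two half-precision fused
// multiply-adds per instruction, which is where GP100's 2x FP16 rate
// relative to FP32 comes from.
__global__ void axpy_half2(__half2 a, const __half2* x, __half2* y, int n2) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n2) {
        y[i] = __hfma2(a, x[i], y[i]);   // y = a*x + y, two FP16 lanes at once
    }
}
[/code]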

And then there is also the average efficiency. Given the better ratio of ALUs to all other resources, both worst-case and average utilization, and hence efficiency, should improve greatly for many formerly constrained applications, even for FP32 loads.

A full 100% improvement for every possible use case wasn't to be expected. That exceeds the possible gains from the node change by far.
 
He could also actually be talking about the multiplier itself.

At the gate level the multiplier may well have doubled its performance per watt, but once you factor in the memories, the IO, and other components that don't scale as well, the overall performance per watt isn't as impressive.
 
Do we know any power consumption numbers other than the muddily undefined TDP number?
Additionally: is it arch vs. arch or product vs. product?

Not saying it'll all add up, but there are still quite a few variables, and if all else fails, Nvidia could still pull a 970-stunt out of its sleeve.
 
Do we know any power consumption numbers other than the muddily undefined TDP number?
Additionally: is it arch vs. arch or product vs. product?

Not saying it'll all add up, but there are still quite a few variables, and if all else fails, Nvidia could still pull a 970-stunt out of its sleeve.
I really doubt even he would piss off the HPC/exascale/hyperscale/research clients.
Case in point: back with Kepler-based Teslas, those clients got full Hyper-Q and dynamic parallelism, compared to consumer cards and, I think, even Quadro.
Cheers
 