Nvidia Pascal Announcement

McHuj · Apr 8, 2016

I'll be surprised with 8GB of GDDR5. That doesn't seem like it could meet the performance. On a 256-bit bus, you're limited to 224 GB/sec (256/8*7), exactly what the existing 970 has. You're going to increase flops by 80+% and keep the bandwidth the same? Sure the increased register file and L2 cache size will help, but not that much.

It would be more believable if it were a 6GB GDDR5 board with a 384-bit bus like GM200 which would provide 336 GB/sec. That I could believe.

However, the follow up question is: Is the GDDR5x PHY backwards compatible? Meaning, can I hook up GDDR5 chips to a GDDR5x GPU? As the 1080 and 1070 would both be coming from the same GP104 chip, I doubt we'd have two different PHY's on chip.

silent_guy · Apr 8, 2016

Xalion said:
In other words, the upgrade to M40s was actually a step back for us. The upgrade to P100s would be a significant step forward.

Xalion, are the new virtual memory and on-demand paging something that make a difference? Or it is more common to keep everything as much as possible in the GPU memory at all times?

homerdog · Apr 8, 2016

AnarchX said:
~300mm² GPU with 8Gbps GDDR5: https://www.chiphell.com/thread-1563086-1-1.html
Its also said that GPU-Z detects some 1152/864SPs, through missing Pascal detection. So it could be 18/36 SMM/SM.

Maybe the GTX 1070: 2304SPs@>1,4GHz + 8GiB 8Gbps. While GTX 1080 goes for GDDR5X and full core with 2560SPs for the benchmarks with low availability because of memory and 16FF yields.

And GP106 seems also to be ready:

http://www.hardware.fr/news/14589/gtc-200-mm-petit-gpu-pascal.html ~205mm² size estimation
http://www.hardwareluxx.de/index.ph...it-pascal-gpu-und-samsung-gddr5-speicher.html A1 chips made in 13. week of 2016.

Hasn't it been made abundantly clear that GDDR5X is too far out for these cards? Unless the 1080 will be significantly delayed vs the 1070, there's little chance it'll use anything other than regular GDDR5.

Also I think the 1070 will be a bit more castrated than 2304/2560. If your SP count is correct I would expect the 1070 to have something more like 2048 SPs. But I would be happy to be wrong

homerdog · Apr 8, 2016

McHuj said:
I'll be surprised with 8GB of GDDR5. That doesn't seem like it could meet the performance. On a 256-bit bus, you're limited to 224 GB/sec (256/8*7), exactly what the existing 970 has. You're going to increase flops by 80+% and keep the bandwidth the same? Sure the increased register file and L2 cache size will help, but not that much.

Well it's 8Gbps GDDR5 on presumably a full 256bit bus (970 effectively has 224bit bus), so there is a bit more bandwidth there. And the larger RF and cache can make quite the difference.

xpea · Apr 8, 2016

silent_guy said:
Xalion, are the new virtual memory and on-demand paging something that make a difference? Or it is more common to keep everything as much as possible in the GPU memory at all times?

talking about memory, interesting benches from Nvidia Nvlink presentation (courtesy of Damien @ http://www.hardware.fr/news/14587/gtc-tesla-p100-debits-pcie-nvlink-mesures.html )

Xalion · Apr 8, 2016

homerdog said:
Lol I ninjad on an extra sentence to my previous post while you were replying =D

, it is all good. I could have been more clear. I wasn't trying to make general statements on the architectures - mainly just point out that Pascal does seem to be a great thing from my perspective. I tend to work on systems like Lightning. So "upgrade" really usually translates into "is it time to build a new machine". We need pretty solid justification (both on paper and in demos) to do that. Also, to get some of the benefit from a new architecture we need to rewrite old code. So there is a cost in both hardware and software. To be honest, software is probably an order of magnitude higher than hardware. Maxwell had a lot of good features - but when you balance that against the cost of upgrade and the loss of DP, for the applications I work on it turned out to be a net loss on the test hardware I ran things on. We probably could have improved it with enough effort, but that whole cost thing comes into it again.

On the other hand, Pascal (at least on paper - no hardware to check with yet) seems like it may give a straight upgrade before we start tinkering with the code. That is exciting to me.

silent_guy said:
Xalion, are the new virtual memory and on-demand paging something that make a difference? Or it is more common to keep everything as much as possible in the GPU memory at all times?

I'll be honest that I have not used the unified memory features much. It sounds excellent, and should make programming much easier. But I've already written the code to handle streaming and syncing without using it, so I would probably need someone to convince me a retrofit is worth it. For new code, I can see it making a huge difference. About 90% of the time I spend debugging is tracking down segfaults in kernels because something was accessed out of order (that is probably why I am so reticent to touch it once it is working). Not having to worry about that would be huge.

That is just my perspective. It is one of those things that the programmer in me loves - but the scientist says "is it worth replacing validated code with just to make writing software a bit easier".

CSI PC · Apr 8, 2016

Xalion said:
I checked to make sure and you were correct. The machines are currently using K40s, which are GK110s - not the original M2090s which were GF110s. The upgrade to M40s never happened because you were looking at a very small increase in SP (~5000GFLOPS to ~6500GFLOPS) and a rather significant loss in DP (~1500GFLOPS to ~200GFLOPS). While we try to do everything we can in SP, there are operations that require DP for accuracy over longer simulations/scenarios. Pascal if the rumored specs are true goes to double the SP and triple the DP of our current accelerators.

In other words, the upgrade to M40s was actually a step back for us. The upgrade to P100s would be a significant step forward.

And not just you,
seems that a lot are looking to upgrade from Kepler Tesla to the Pascal generation, and they skipped the M40 (even research centers with a lot of cash).
Cheers

CSI PC · Apr 8, 2016

homerdog said:
I thought GM200 was quite a massive upgrade in most every way over GF100 for FP32 and FP16 applications. Hell it should even be a lot better than GK110 as well (except for FP64 obviously). Is that not the case?

Regarding Tesla range, well you would compare the Maxwell M40 (225w card) to the Kepler K80 (this is a 300w card though).
Here the K80 still has greater FP32 performance.
8.74 teraflops for the Kepler card and 7 teraflops for the Maxwell, and the Kepler card also has ECC.
Although it is fair to say the M40 was being heavily pushed towards deep learning.
Cheers

Voxilla · Apr 8, 2016

CSI PC said:
And not just you,
seems that a lot are looking to upgrade from Kepler Tesla to the Pascal generation, and they skipped the M40 (even research centers with a lot of cash).
Cheers

Don't discredit the 'M40 too much, as it has been very important in deep learning.

CSI PC · Apr 8, 2016

Voxilla said:
Don't discredit the 'M40 too much, as it has been very important in deep learning.

See my post above this one

Cheers

Voxilla · Apr 8, 2016

CSI PC said:
See my post above this one
Cheers

Typing on the so called PC replacement, called iPad has it's latency issues

xpea · Apr 8, 2016

First time someone writes that P100 has no video output with source from the field:
http://vrworld.com/2016/04/08/nvidia-mezzanine-nvlink-connector-pictured/

Some manufacturers simply gave up on the idea of calling Pascal GPU architecture – a GPU or GPGPU, but rather called it “CPU”, which in a way, Tesla P100 certainly qualifies (no display outputs, no video drivers, pure compute output). For example, Zoom NetCom was showing its OpenPOWER design called RedPOWER P210, featuring two POWER8 processors and four Tesla P100’s. Their naming for the mezzanine connector? JCPU.

This article also mentions that P100 mezzanine format is compatible with V100 (Volta) for easy upgrade

Given that IBM’s OpenPOWER conference is taking place at the same time as GTC, we searched for more details about the Mezzanine connector and the NVLink itself, and stumbled on quite an interesting amount of details. First and foremost, ever Pascal (Tesla P100) and Volta (Tesla V100) product that utilizes the NVLink will use the same connector, making sure that you have at least one generation of cross-compatibility.

Voxilla · Apr 8, 2016

xpea said:
First time someone writes that P100 has no video output with source from the

Like any other Tesla in other words.

Grall · Apr 8, 2016

Voxilla said:
Like any other Tesla in other words.

Perhaps they meant no video on the silicon itself, not just the PCB...

spworley · Apr 8, 2016

Pascal features "compute preemption" but the details are sparse. Kepler and Maxwell already have preemption (used by "dynamic parallelism" where kernels can launch other kernels, then restore their old state). But perhaps Pascal's extends its use. How? Some speculation: Kernel priorities so a realtime kernel can jump a queue and interrupt a long-running background kernel? To remove the watchdog timer killer? To allow kernel portability, allowing an executing kernel to migrate from one GPU to another? (requiring using the new transparent unified memory)

CSI PC · Apr 8, 2016

Voxilla said:
Like any other Tesla in other words.

Yeah it is a GPU accelerator/co-processor.

Cheers

CSI PC · Apr 8, 2016

spworley said:
Pascal features "compute preemption" but the details are sparse. Kepler and Maxwell already have preemption (used by "dynamic parallelism" where kernels can launch other kernels, then restore their old state). But perhaps Pascal's extends its use. How? Some speculation: Kernel priorities so a realtime kernel can jump a queue and interrupt a long-running background kernel? To remove the watchdog timer killer? To allow kernel portability, allowing an executing kernel to migrate from one GPU to another? (requiring using the new transparent unified memory)

http://www.theregister.co.uk/2016/04/06/nvidia_gtc_2016/
Specifically they say:

Software running on the P100 can be preempted on instruction boundaries, rather than at the end of a draw call. This means a thread can immediately give way to a higher priority thread, rather than waiting to the end of a potentially lengthy draw operation. This extra latency – the waiting for a call to end – can really mess up very time-sensitive applications, such as virtual reality headsets. A 5ms delay could lead to a missed Vsync and a visible glitch in the real-time rendering, which drives some people nuts.

By getting down to the instruction level, this latency penalty should evaporate, which is good news for VR gamers. Per-instruction preemption means programmers can also single step through GPU code to iron out bugs.

Now was that within a keynote speech at the conference or did they manage to ask some NVIDIA personnel afterwards *shrug*.
Cheers

iMacmatician · Apr 8, 2016

AnarchX said:
http://www.hardware.fr/news/14589/gtc-200-mm-petit-gpu-pascal.html ~205mm² size estimation

Coincidentally my estimate (using these pictures) was 205 mm^2 ± X mm^2 ( ≈17 mm x ≈12 mm), where X is at least 30.

CarstenS · Apr 8, 2016

Grall said:
Perhaps they meant no video on the silicon itself, not just the PCB...

I find that highly unlikely. Apart from Data centers and supercomputers, even the sparsest of display options makes a few things so much easier and I should think, that GP100 if not in a Geforce, surely will turn up in a Quadro near you at some point.

Ryan Smith · Apr 9, 2016

CarstenS said:
I find that highly unlikely. Apart from Data centers and supercomputers, even the sparsest of display options makes a few things so much easier and I should think, that GP100 if not in a Geforce, surely will turn up in a Quadro near you at some point.

GP100 is fully graphics capable. They aren't drawn on the diagrams, but it has display controllers, ROPs, etc. And I agree a Quadro is a good bet at some point.

Nvidia Pascal Announcement

McHuj

silent_guy

homerdog

donator of the year

homerdog

donator of the year

xpea

Xalion

CSI PC

CSI PC

Voxilla

CSI PC

Voxilla

xpea

Voxilla

Grall

Invisible Member

spworley

CSI PC

CSI PC

iMacmatician

CarstenS

Moderator

Ryan Smith

Similar threads