Nvidia Pascal Speculation Thread

Status
Not open for further replies.
Has this 1:3 ratio of actual TFLOPS to "DL TOPS" been confirmed a second time? I'd think 1:2 is much more realistic.
I rather think it's not GP200 in there, but more mainstream GP204s (maybe even demoted to GP206) with 6 TFLOPS each, 3,072 ALUs and 1 GHz.
 
I am not sure this says that much about what a big Pascal will look like: the press release states that Drive PX 2 contains 2 Tegra SoCs and 2 discrete Pascal GPUs. So I guess 4 SP TFLOPS each at around 100 W? Tegras usually also have GPUs, so maybe they are contributing FLOPS to the peak numbers.
 
Has this 1:3 ratio of actual TFLOPS to "DL TOPS" been confirmed a second time? I'd think 1:2 is much more realistic.
I rather think it's not GP200 in there, but more mainstream GP204s (maybe even demoted to GP206) with 6 TFLOPS each, 3,072 ALUs and 1 GHz.
It would seem almost certain that the GPUs are 2nd/3rd-tier parts, no? The power envelope, the GDDR5 interface, and what looks like a relatively modest GPU size (unless Jen-Hsun has paws the size of Manute Bol's) all point to something less than "big" Pascal. Here's a quick screencap from the presentation:

[Image: 0B5qcmE.jpg]
 
From that, it becomes clearer what the next big Pascal, i.e. GP200, might look like:
4096 SP / 8 TFLOPS SP / 4 TFLOPS DP
Regarding FP16 performance, NV has created a new metric, DL TOPS (deep learning tera operations per second), which would put GP200 at 24 DL TOPS. The new name suggests it's not the same thing as 24 TFLOPS FP16.
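Those speculated figures follow from the usual peak-throughput formula (ALUs x 2 ops per clock for a fused multiply-add x clock). A quick sketch, using the thread's guessed GP200 numbers, which are speculation rather than confirmed specs:

```python
# Peak throughput = ALUs x 2 (one fused multiply-add counts as 2 FLOPs) x clock.
# The GP200 figures below are this thread's speculation, not confirmed specs.
def peak_tflops(alus: int, clock_ghz: float, ops_per_clock: int = 2) -> float:
    return alus * ops_per_clock * clock_ghz / 1000.0

sp = peak_tflops(4096, 1.0)   # ~8.2 TFLOPS FP32, rounds to the quoted 8
dp = sp / 2                   # ~4.1 TFLOPS FP64, assuming a 1:2 DP rate
```

The same formula with the quoted 8 TFLOPS and a 1:3 "DL TOPS" ratio gives the 24 DL TOPS figure, whatever a DL OP turns out to be.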

For compute applications that need DP and DL, it will be much better than GM200.
But for gaming and SP, that would be only about a 25% improvement, which is rather small.

To clarify: this PX 2 of course has no 'big' Pascal, since it uses 2x Tegra Pascal at the front and 2x discrete Pascal GPUs at the back of the PCB. The latter are probably half of big Pascal, i.e. something like GP204.
From the aggregate performance of 8 TFLOPS and 24 DL TOPS across the PX 2's 4 GPUs, I infer what the first 'big' Pascal could be.
 
....
For compute applications that need DP and DL, it will be much better than GM200.
But for gaming and SP, that would be only about a 25% improvement, which is rather small.

One other performance boost will be higher memory bandwidth (even with their compression), which seems to hobble the current Maxwell cards above 1080p.
This could make a big difference for the cards using HBM2, and somewhat for those that may get GDDR5X.
I think one of the reasons Maxwell went with a lower-bandwidth memory interface was power consumption/efficiency.

Cheers
 
Did they show it working or was this a mockup with woodscrews?


Assuming roughly ~1 TFLOPS FP32 for each Tegra 7 (expecting twice the GPU performance of Tegra X1, following the cadence of previous iterations), that's 2 TFLOPS for both Tegras combined and 6 TFLOPS for the discrete GPUs: 3 TFLOPS per GPU.
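That split is just subtraction from the advertised 8 TFLOPS aggregate, with the per-Tegra share being this post's guess:

```python
# Drive PX 2 FP32 budget under this post's assumptions.
total_tflops = 8.0    # aggregate figure from the announcement
tegra_tflops = 1.0    # assumed per Tegra SoC (2x Tegra X1's GPU), a guess

discrete_total = total_tflops - 2 * tegra_tflops   # 6 TFLOPS for both dGPUs
per_dgpu = discrete_total / 2                      # 3 TFLOPS each
```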

Thinking in terms of mobile graphics solutions, since those are MXM cards: 3 TFLOPS is close to a GeForce GTX 980M, or twice a GM107-based GTX 960M.
I'm guessing those are two Pascal GP107 cards, if the Pascal architecture turns out to be more of a "Maxwell 3" on FinFET with twice the transistors and execution resources, as has been suggested.
GM107 doubled the performance of GK107, so it makes sense that GP107 makes that transition again.

On the desktop front, if they end up using, say, 20% higher clocks, then we are indeed looking at the compute performance of a GTX 970, though probably with significantly fewer fillrate resources (only 32 ROPs on a 128-bit bus).
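A rough sanity check of the "20% higher clocks lands near a GTX 970" claim; the 970's core count and reference boost clock below are its published specs, while the 3 TFLOPS baseline is the mobile estimate above:

```python
# Hypothetical desktop GP107: the ~3 TFLOPS mobile estimate, clocked 20% higher.
desktop_tflops = 3.0 * 1.2                # 3.6 TFLOPS

# GTX 970 peak at reference boost: 1664 cores x 2 ops (FMA) x 1.178 GHz
gtx970_tflops = 1664 * 2 * 1.178 / 1000   # ~3.9 TFLOPS, same ballpark
```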

Either way, these cards are probably far away from AMD's Polaris Mini that they showed up and running. Different performance segments at least.
 
TBH I am not sure what can be taken from the Drive PX 2 design and the TFLOPS presented; this product seems to be designed specifically for deep learning, i.e. tracking and processing a large number of objects and the real-world environment in the context of large-scale visual recognition.
In those terms, the closest comparison between the Titan X and this is the AlexNet benchmark NVIDIA presented, and even that is not ideal: the Titan X does 450 images/sec while the Drive PX 2 does 2,800 images/sec.

Cheers
 
I'm guessing those are two Pascal GP107 cards, if the Pascal architecture turns out to be more of a "Maxwell 3" on FinFET with twice the transistors and execution resources, as has been suggested.
GM107 doubled the performance of GK107, so it makes sense that GP107 makes that transition again.

Given the 250 W TDP of this PX 2 board, it points more in the direction of a GPx04. Also, looking at the die size in the photo, it's in the 3-4 cm² range. And since neural-net computation is mainly limited by memory bandwidth, I bet the bus is 256 bits wide.
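For reference, a 256-bit GDDR5 bus works out like this; the 7 Gbps per-pin rate below is a typical GDDR5 speed assumed for illustration, not a stated spec of this part:

```python
# Memory bandwidth = bus width in bytes x data rate per pin.
def bandwidth_gbs(bus_bits: int, gbps_per_pin: float) -> float:
    return bus_bits / 8 * gbps_per_pin

bw = bandwidth_gbs(256, 7.0)   # 224 GB/s on a 256-bit bus at 7 Gbps
```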
 
What's a DL TOP?

I thought I read something saying the GPU was cut down for compute only. Guess the texture units, ROPs, and all of that were ripped out.
 
But what does a DL OP do? Can it only be used for deep learning?
It helps with vehicle safety, autonomous driving, warning systems, etc., using multiple cameras mounted on the vehicle.
Hence why I mentioned the AlexNet benchmark NVIDIA presented, where the Drive PX 2 handles 6x more images/second than the Titan X.
I cannot see how this product can be compared to traditional GPU usage, as it has a specialised purpose, albeit much more powerful than what is currently available.
Considering what it is doing, the power consumption and size are pretty good.

Cheers
 
But what does a DL OP do? Can it only be used for deep learning?
Just to add, these two links help put what NVIDIA is doing in perspective and how it relates to the recent news being discussed:
"1st gen", before the latest news: http://blogs.nvidia.com/blog/2015/01/06/audi-tegra-x1/
The latest news is an evolution of this: http://www.nvidia.com/object/drive-px.html

I am not sure whether Audi's Bobby self-driving system (it beat a journalist around a race track in a timed lap, lol) was built on NVIDIA technology or was Audi's own work, although Audi is signed up for the current Drive PX 2 development, was working with NVIDIA back at the beginning of 2015, and the partnership goes back quite a lot further still.

Cheers
 
I cannot see how this product can be compared to traditional GPU usage, as it has a specialised purpose, albeit much more powerful than what is currently available.
Considering what it is doing, the power consumption and size are pretty good.

Cheers

I don't agree at all. The 2 GPUs on the backside are traditional discrete GPUs of some Pascal variant.
AnandTech thinks so too: http://www.anandtech.com/show/9903/nvidia-announces-drive-px-2-pascal-power-for-selfdriving-cars

Regarding SP FLOPS per watt, performance is actually pretty poor and not much better than Maxwell at 250 W.
But the question remains what a deep learning operation (DL OP) is; we all know what a floating-point operation is, i.e. an add or a mul.
I'm pretty well aware of what is needed to compute deep neural networks, and most of it is just multiply and accumulate...
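To illustrate the multiply-accumulate point: a dense neural-net layer is essentially a matrix-vector product, i.e. one MAC per weight. A minimal sketch in plain Python, for illustration only:

```python
# Core of a dense neural-net layer: one multiply-accumulate (MAC) per weight.
# Each MAC is conventionally counted as 2 FLOPs (one mul + one add), which is
# why FMA hardware figures prominently in quoted peak numbers.
def dense_layer(weights, inputs, biases):
    out = []
    for row, b in zip(weights, biases):
        acc = b
        for w, x in zip(row, inputs):
            acc += w * x        # the multiply-accumulate
        out.append(acc)
    return out

y = dense_layer([[1.0, 2.0], [3.0, 4.0]], [1.0, 1.0], [0.0, 0.0])  # [3.0, 7.0]
```

Whatever a "DL OP" counts, the underlying workload is dominated by exactly these MACs.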
 
I don't agree at all. The 2 GPUs on the backside are traditional discrete GPUs of some Pascal variant.
AnandTech thinks so too: http://www.anandtech.com/show/9903/nvidia-announces-drive-px-2-pascal-power-for-selfdriving-cars

Regarding SP FLOPS per watt, performance is actually pretty poor and not much better than Maxwell at 250 W.
But the question remains what a deep learning operation (DL OP) is; we all know what a floating-point operation is, i.e. an add or a mul.
I'm pretty well aware of what is needed to compute deep neural networks, and most of it is just multiply and accumulate...
The problem is that this architecture (Drive PX 2 utilising 2 Pascal GPUs) combines ARM and Denver processors specifically for deep learning.
How can you do any comparison to a traditional Pascal discrete GPU?
If you compare, the Drive PX 2 is 6x more powerful than a Titan X at the task it was built for, at the same wattage: specifically large-scale visual recognition and object/image processing, at 450 images/sec for the Titan X versus 2,800 images/sec for the Drive PX 2.
So they may have similar TFLOPS, but that is meaningless, as their scope, focus, and implementation are very different, with one benchmarked only while tightly linked to its Denver and ARM implementation.
Cheers
 