Nvidia Pascal Announcement

GTX 1080 exposes 20 compute units in OpenCL. So is CC 6.1 like Maxwell, with 128 SPs per SM?
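For anyone who wants to check on their own card, here's a minimal CUDA sketch that dumps the relevant figures (the 20 came from OpenCL's CL_DEVICE_MAX_COMPUTE_UNITS, but CUDA's multiProcessorCount reports the same SM count). On a GTX 1080 it should show CC 6.1 and 20 SMs.

Code:
// query_sm.cu - print SM count, compute capability and per-SM register file
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);   // device 0
    printf("%s: CC %d.%d, %d SMs, %d 32-bit registers per SM\n",
           prop.name, prop.major, prop.minor,
           prop.multiProcessorCount, prop.regsPerMultiprocessor);
    return 0;
}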

It probably means GP104 doesn't get the same increase in registers as GP100. Not surprising but a little disappointing if true.

Compute performance vs Maxwell isn't perfectly linear but seems to scale well enough. I borrowed the other card numbers from Anandtech.

[charts: Face Detection, Optical Flow, and Particle Simulation throughput for the GTX 1080 vs Titan X and GTX 980]
 
So a portion of the better efficiency isn't due to on-die optimizations, but to an optimized and horizontally scaled-up voltage regulator setup with lower impedance, and hence smaller voltage swings and less overall noise.
(~10% relative gain in efficiency? Possibly more if the stabilized power supply has knock-on effects on-die, as the plotted efficiency appears to cover only the voltage regulator itself.)
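To put rough numbers on that (a toy sketch, every input below is assumed rather than taken from the actual card): conduction loss in the regulator scales as I^2*R, and spreading the same current over more phases with lower per-phase resistance cuts it roughly in proportion.

Code:
// vrm_toy.cu (host-only) - toy model of VRM conduction loss vs phase count; all inputs assumed
#include <cstdio>

int main() {
    const double core_power_w = 150.0;  // assumed GPU core power
    const double v_core       = 1.05;   // assumed core voltage
    const double i_total      = core_power_w / v_core;  // ~143 A delivered to the GPU
    const double r_phase      = 0.002;  // assumed effective resistance per phase, ohms

    const int phase_counts[] = {4, 8};
    for (int i = 0; i < 2; ++i) {
        int n = phase_counts[i];
        // each phase carries i_total/n, so the summed I^2*R loss falls as 1/n
        double loss_w = i_total * i_total * r_phase / n;
        printf("%d phases: %.1f W conduction loss (%.1f%% of core power)\n",
               n, loss_w, 100.0 * loss_w / core_power_w);
    }
    return 0;
}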

The rest is just praise for the optimized airflow of the FE's radiator design, at the expense of shielding everything, plus some reasoning about the price premium.
Wouldn't they lose some efficiency by resolving some of the issues with Maxwell?
Cheers
 
We'll have to dig into detailed benchmarks to see the µarch changes between Maxwell and Pascal, but it looks like Nvidia went the brute-force / high-clocks route this round. It reminds me of Intel's tick-tock tactic: bring nearly the same µarch to a new node, and wait for the node to mature before the larger architectural improvements. If that strategy is true, then Volta should be very interesting...
 
Results from my 980 Ti @ 1400 MHz GPU clock, memory clock at default:

Face Detection -- 205 MPix/s
Optical Flow -- 39 MPix/s
Particle Simulation -- 1978 MInter/s
Is the particle simulation highly memory-bandwidth dependent? Try downclocking your memory on that test, given the 980 Ti has a wider memory bus than the GTX 1080.

Perhaps HBM2 can't come soon enough, especially for high-end parts equipped with only a 256-bit-wide bus.
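For reference, the raw figures behind that (bus width and data rate are the published specs; the helper just multiplies them out): the 980 Ti's 384-bit / 7 Gbps GDDR5 gives 336 GB/s, while the 1080's 256-bit / 10 Gbps GDDR5X gives 320 GB/s, so the 1080 actually has slightly less raw bandwidth despite the faster memory.

Code:
// peak_bw.cu (host-only) - peak DRAM bandwidth = bus width / 8 * data rate
#include <cstdio>

static double peak_gbps(int bus_width_bits, double data_rate_gbps) {
    return bus_width_bits / 8.0 * data_rate_gbps;   // GB/s
}

int main() {
    printf("GTX 980 Ti: %.0f GB/s\n", peak_gbps(384, 7.0));   // 336 GB/s
    printf("GTX 1080 : %.0f GB/s\n", peak_gbps(256, 10.0));   // 320 GB/s
    return 0;
}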
 
That's weird. Any chance the API is enumerating a single TPC as a multiprocessor, since Pascal pairs two SMs per TPC?

Nope. As mentioned before, for GP100 NV tried to increase the registers and the size of the cache. This comes in very handy, as deep learning networks can then fit onto a single GPU.

Each TPC consists of a PolyMorph Engine and a number of SMs. G80 had two SMs per TPC, GT200 three. GF100 switched to a 1:1 ratio, and GK100 and GM1/2 kept that ratio. For Pascal, only the HPC chip will use a different ratio.
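Which also lines up with the headline numbers: GP104 is quoted at 2560 CUDA cores, so 20 enumerated multiprocessors works out to the Maxwell-style 128 SPs per SM (trivial check below).

Code:
// sm_check.cu (host-only) - cores per enumerated multiprocessor
#include <cstdio>

int main() {
    const int cuda_cores = 2560;   // GTX 1080 (GP104) spec
    const int sm_count   = 20;     // as enumerated via OpenCL / CUDA
    printf("%d SPs per SM\n", cuda_cores / sm_count);   // 128, same layout as Maxwell
    return 0;
}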
 

The math, from the charts above:

GTX 1080 vs (Titan X, GTX 980)

Face Detection:

+31.7% vs Titan X
+79.7% vs GTX 980

Optical Flow:

+32.4% vs Titan X
+82.1% vs GTX 980

Particle Simulation:

+23.3% vs Titan X
+43.6% vs GTX 980
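Side note: since the raw scores cancel out, those two ratios also pin down the Titan X's own lead over the GTX 980 in each test; quick check below using only the face-detection percentages already posted.

Code:
// ratio_check.cu (host-only) - derive Titan X vs GTX 980 from the two posted ratios
#include <cstdio>

int main() {
    const double gain_vs_titanx = 0.317;   // GTX 1080 vs Titan X, face detection
    const double gain_vs_980    = 0.797;   // GTX 1080 vs GTX 980, face detection
    // if 1080 = (1+a) * TitanX and 1080 = (1+b) * 980, then TitanX / 980 = (1+b) / (1+a)
    printf("Titan X vs GTX 980: +%.1f%%\n",
           ((1.0 + gain_vs_980) / (1.0 + gain_vs_titanx) - 1.0) * 100.0);   // ~ +36%
    return 0;
}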
 

Second attempt at overclocking :oops:
Titan X @ 1.3 GHz boost (real clock still 1.4 GHz for some reason)

Face Detection -- 250 MPix/s
Optical Flow -- 44 MPix/s
Particle Simulation -- 2015 MInter/s

 
Tons of new info leaked here:

http://wccftech.com/nvidia-geforce-gtx-1080-official-slides-leak/

- 314mm2
- 64 ROPs confirmed
- HDR capability confirmed
- More detail on preemption and async compute
- Hardware changes to allow multi-projection confirmed
- Improved memory compression over Maxwell, to the tune of 1.2x (explains how the 1080 can beat the Titan X with less raw memory bandwidth; rough arithmetic below)
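Rough arithmetic on that last point, treating the 1.2x strictly as an average effective-bandwidth multiplier over Maxwell's own compression (raw figures from the published specs):

Code:
// eff_bw.cu (host-only) - raw vs compression-adjusted bandwidth
#include <cstdio>

int main() {
    const double bw_1080   = 256 / 8.0 * 10.0;   // 320 GB/s raw (256-bit, 10 Gbps GDDR5X)
    const double bw_titanx = 384 / 8.0 * 7.0;    // 336 GB/s raw (384-bit, 7 Gbps GDDR5)
    const double compression_vs_maxwell = 1.2;   // figure from the leaked slides
    printf("GTX 1080 effective (in Maxwell terms): %.0f GB/s vs Titan X %.0f GB/s raw\n",
           bw_1080 * compression_vs_maxwell, bw_titanx);   // 384 vs 336
    return 0;
}

So in Maxwell-equivalent terms the 1080 ends up with roughly 384 vs 336 GB/s, which would be enough to explain it on paper.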
 
Begs the question of what exactly NV spent the $3 billion on, or whatever figure Jen-Hsun mentioned... :)
 
The dynamic load balancing between compute and graphics is interesting; the lack of this feature is probably what we saw when Maxwell started using a context switch with the small async compute program MDolenc created.
 
Not only when using the compute queue in general (that's a completely different beast), but mostly also when combining dispatch and drawcalls in the same command buffer.
That would result in the driver partitioning the SMMs into either compute or graphics based on heuristics, and if the heuristic messed up, half the SMMs would end up idling for the duration of the command buffer.

For some applications it did work - but apparently only those for which the driver had a suitable profile for the static partitioning scheme. Combining draw and dispatch calls with significantly divergent run times was guaranteed to trip this.

PS: I was told that Maxwell / Pascal can only execute a single kernel per SMM, i.e. that all warps resident on an SMM must belong to the same kernel. Does anyone know whether there is anything to it? I can find resources neither confirming nor denying it. It would mean the architecture also has a problem with mixed workloads in general, not just with the coarse compute/graphics mix. Even if it should be true, preemption and the dynamic balancing still guarantee that - resource conflicts aside - mixing compute and graphics is no longer bound to fail horribly.
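Not an answer, but it is something one can probe directly. A crude sketch of my own (not MDolenc's program), assuming only the CUDA runtime: kernel A parks one tiny, long-running block on every SM (the work distributor spreads one block per SM in practice), then kernel B is launched in a second stream. If the pair finishes in roughly 1x the single-kernel time, the SMs must have hosted warps from both kernels at once; roughly 2x means the second kernel had to wait for the first to drain.

Code:
// coresidency.cu - crude probe: can one SM host blocks from two different kernels?
#include <cstdio>
#include <cuda_runtime.h>

__global__ void spin_a(long long cycles) {
    long long start = clock64();
    while (clock64() - start < cycles) { }   // busy-wait for ~cycles GPU clocks
}

__global__ void spin_b(long long cycles) {   // genuinely distinct kernel, same work
    long long start = clock64();
    while (clock64() - start < cycles) { }
}

static float elapsed_ms(cudaEvent_t a, cudaEvent_t b) {
    float ms = 0.f;
    cudaEventElapsedTime(&ms, a, b);
    return ms;
}

int main() {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);
    const int blocks = prop.multiProcessorCount;   // one small block per SM
    const long long cycles = 200LL * 1000 * 1000;  // long enough to time reliably

    cudaStream_t s1, s2;
    cudaStreamCreate(&s1);
    cudaStreamCreate(&s2);
    cudaEvent_t t0, t1, t2;
    cudaEventCreate(&t0); cudaEventCreate(&t1); cudaEventCreate(&t2);

    cudaEventRecord(t0);
    spin_a<<<blocks, 32, 0, s1>>>(cycles);   // baseline: one kernel alone
    cudaEventRecord(t1);                     // legacy default stream waits for it

    spin_a<<<blocks, 32, 0, s1>>>(cycles);   // kernel A occupies every SM
    spin_b<<<blocks, 32, 0, s2>>>(cycles);   // kernel B overlaps only if SMs co-host it
    cudaEventRecord(t2);
    cudaEventSynchronize(t2);

    printf("single kernel: %.1f ms, two kernels: %.1f ms\n",
           elapsed_ms(t0, t1), elapsed_ms(t1, t2));
    return 0;
}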
 