Nvidia Pascal Announcement

GTX 1080 exposes 20 compute units in OpenCL. So is CC 6.1 like Maxwell, with 128 SPs per SM?
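For anyone who wants to check on their own card, here's a minimal CUDA sketch that dumps the relevant figures (the 20 came from OpenCL's CL_DEVICE_MAX_COMPUTE_UNITS, but CUDA's multiProcessorCount reports the same SM count). On a GTX 1080 it should show CC 6.1 and 20 SMs.

Code:
// query_sm.cu - print SM count, compute capability and per-SM register file
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);   // device 0
    printf("%s: CC %d.%d, %d SMs, %d 32-bit registers per SM\n",
           prop.name, prop.major, prop.minor,
           prop.multiProcessorCount, prop.regsPerMultiprocessor);
    return 0;
}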

It probably means GP104 doesn't get the same increase in registers as GP100. Not surprising but a little disappointing if true.

Compute performance vs Maxwell isn't perfectly linear but seems to scale well enough. I borrowed the other card numbers from Anandtech.

[charts: Face Detection, Optical Flow, and Particle Simulation throughput for the GTX 1080 vs Titan X and GTX 980]
 
So a portion of the better efficiency isn't due to on-die optimizations, but to an optimized and horizontally scaled-up voltage regulator setup with lower impedance, and hence smaller voltage swings and less overall noise.
(~10% relative gain in efficiency? Possibly more if the stabilized power supply has knock-on effects on-die, as the plotted efficiency appears to cover only the voltage regulator itself.)
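To put rough numbers on that (a toy sketch, every input below is assumed rather than taken from the actual card): conduction loss in the regulator scales as I^2*R, and spreading the same current over more phases with lower per-phase resistance cuts it roughly in proportion.

Code:
// vrm_toy.cu (host-only) - toy model of VRM conduction loss vs phase count; all inputs assumed
#include <cstdio>

int main() {
    const double core_power_w = 150.0;  // assumed GPU core power
    const double v_core       = 1.05;   // assumed core voltage
    const double i_total      = core_power_w / v_core;  // ~143 A delivered to the GPU
    const double r_phase      = 0.002;  // assumed effective resistance per phase, ohms

    const int phase_counts[] = {4, 8};
    for (int i = 0; i < 2; ++i) {
        int n = phase_counts[i];
        // each phase carries i_total/n, so the summed I^2*R loss falls as 1/n
        double loss_w = i_total * i_total * r_phase / n;
        printf("%d phases: %.1f W conduction loss (%.1f%% of core power)\n",
               n, loss_w, 100.0 * loss_w / core_power_w);
    }
    return 0;
}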

The rest is just praise for the optimized airflow of the FE's radiator design, at the expense of shielding everything, plus some reasoning about the price premium.
Wouldn't they lose some efficiency by resolving some of the issues with Maxwell?
Cheers
 
We'll have to dig into detailed benchmarks to see the µarch changes between Maxwell and Pascal, but it looks like Nvidia went the brute-force / high-clocks route this round. It reminds me of Intel's tick-tock tactic: bring nearly the same µarch to a new node, and wait for the node to mature before the larger architectural improvements. If that strategy is true, then Volta should be very interesting...
 
Results from my 980 Ti @ 1400 MHz GPU clock, memory clock at default:

Face Detection -- 205 MPix/s
Optical Flow -- 39 MPix/s
Particle Simulation -- 1978 MInter/s
Is the particle simulation highly memory-bandwidth dependent? Try downclocking your memory on that test, given the 980 Ti has a wider memory bus than the GTX 1080.

Perhaps HBM2 can't come soon enough, especially for high-end parts equipped with only a 256-bit-wide bus.
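For reference, the raw figures behind that (bus width and data rate are the published specs; the helper just multiplies them out): the 980 Ti's 384-bit / 7 Gbps GDDR5 gives 336 GB/s, while the 1080's 256-bit / 10 Gbps GDDR5X gives 320 GB/s, so the 1080 actually has slightly less raw bandwidth despite the faster memory.

Code:
// peak_bw.cu (host-only) - peak DRAM bandwidth = bus width / 8 * data rate
#include <cstdio>

static double peak_gbps(int bus_width_bits, double data_rate_gbps) {
    return bus_width_bits / 8.0 * data_rate_gbps;   // GB/s
}

int main() {
    printf("GTX 980 Ti: %.0f GB/s\n", peak_gbps(384, 7.0));   // 336 GB/s
    printf("GTX 1080 : %.0f GB/s\n", peak_gbps(256, 10.0));   // 320 GB/s
    return 0;
}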
 
That's weird. Any chance the API is enumerating a single TPC as a multiprocessor, since Pascal pairs two SMs per TPC?

Nope. As mentioned before, for GP100 NV tried to increase the registers and the size of the cache. This comes in very handy, as deep learning networks can then fit onto a single GPU.

Each TPC consists of a PolyMorph Engine and a number of SMs. G80 had two SMs per TPC, GT200 three. GF100 switched to a 1:1 ratio, and GK100 and GM1/2 kept that ratio. For Pascal, only the HPC chip will use a different ratio.
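Which also lines up with the headline numbers: GP104 is quoted at 2560 CUDA cores, so 20 enumerated multiprocessors works out to the Maxwell-style 128 SPs per SM (trivial check below).

Code:
// sm_check.cu (host-only) - cores per enumerated multiprocessor
#include <cstdio>

int main() {
    const int cuda_cores = 2560;   // GTX 1080 (GP104) spec
    const int sm_count   = 20;     // as enumerated via OpenCL / CUDA
    printf("%d SPs per SM\n", cuda_cores / sm_count);   // 128, same layout as Maxwell
    return 0;
}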
 

The math, from the charts above:

GTX 1080 vs (Titan X, GTX 980)

Face Detection:

+31.7% vs Titan X
+79.7% vs GTX 980

Optical Flow:

+32.4% vs Titan X
+82.1% vs GTX 980

Particle Simulation:

+23.3% vs Titan X
+43.6% vs GTX 980
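Side note: since the raw scores cancel out, those two ratios also pin down the Titan X's own lead over the GTX 980 in each test; quick check below using only the face-detection percentages already posted.

Code:
// ratio_check.cu (host-only) - derive Titan X vs GTX 980 from the two posted ratios
#include <cstdio>

int main() {
    const double gain_vs_titanx = 0.317;   // GTX 1080 vs Titan X, face detection
    const double gain_vs_980    = 0.797;   // GTX 1080 vs GTX 980, face detection
    // if 1080 = (1+a) * TitanX and 1080 = (1+b) * 980, then TitanX / 980 = (1+b) / (1+a)
    printf("Titan X vs GTX 980: +%.1f%%\n",
           ((1.0 + gain_vs_980) / (1.0 + gain_vs_titanx) - 1.0) * 100.0);   // ~ +36%
    return 0;
}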
 

Second attempt at overclocking :oops:
Titan X @ 1.3 GHz boost (real clock still 1.4 GHz for some reason)

Face Detection -- 250 MPix/s
Optical Flow -- 44 MPix/s
Particle Simulation -- 2015 MInter/s

 
Tons of new info leaked here:

http://wccftech.com/nvidia-geforce-gtx-1080-official-slides-leak/

- 314mm2
- 64 ROPs confirmed
- HDR capability confirmed
- More detail on preemption and async compute
- Hardware changes to allow multi-projection confirmed
- Improved memory compression over Maxwell, to the tune of 1.2x (explains how the 1080 can beat the Titan X with less raw memory bandwidth; rough arithmetic below)
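Rough arithmetic on that last point, treating the 1.2x strictly as an average effective-bandwidth multiplier over Maxwell's own compression (raw figures from the published specs):

Code:
// eff_bw.cu (host-only) - raw vs compression-adjusted bandwidth
#include <cstdio>

int main() {
    const double bw_1080   = 256 / 8.0 * 10.0;   // 320 GB/s raw (256-bit, 10 Gbps GDDR5X)
    const double bw_titanx = 384 / 8.0 * 7.0;    // 336 GB/s raw (384-bit, 7 Gbps GDDR5)
    const double compression_vs_maxwell = 1.2;   // figure from the leaked slides
    printf("GTX 1080 effective (in Maxwell terms): %.0f GB/s vs Titan X %.0f GB/s raw\n",
           bw_1080 * compression_vs_maxwell, bw_titanx);   // 384 vs 336
    return 0;
}

So in Maxwell-equivalent terms the 1080 ends up with roughly 384 vs 336 GB/s, which would be enough to explain it on paper.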
 
Begs the question of what exactly NV spent the $3 billion on, or whatever figure Jen-Hsun mentioned... :)
 
The dynamic load balancing between compute and graphics is interesting; the lack of this feature is probably what we saw when Maxwell started using a context switch with the small async compute program MDolenc created.
 
Not only when using the compute queue in general (that's a completely different beast), but mostly also when combining dispatch and drawcalls in the same command buffer.
That would result in the driver partitioning the SMMs into either compute or graphics based on heuristics, and if the heuristic messed up, half the SMMs would end up idling for the duration of the command buffer.

For some applications it did work - but apparently only those for which the driver had a suitable profile for the static partitioning scheme. Combining draw and dispatch calls with significantly divergent run times was guaranteed to trip this.

PS: I was told that Maxwell / Pascal can only execute a single kernel per SMM, i.e. that all warps resident on an SMM must belong to the same kernel. Does anyone know whether there is anything to it? I can find resources neither confirming nor denying it. It would mean the architecture also has a problem with mixed workloads in general, not just with the coarse compute/graphics mix. Even if it should be true, preemption and the dynamic balancing still guarantee that - resource conflicts aside - mixing compute and graphics is no longer bound to fail horribly.
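Not an answer, but it is something one can probe directly. A crude sketch of my own (not MDolenc's program), assuming only the CUDA runtime: kernel A parks one tiny, long-running block on every SM (the work distributor spreads one block per SM in practice), then kernel B is launched in a second stream. If the pair finishes in roughly 1x the single-kernel time, the SMs must have hosted warps from both kernels at once; roughly 2x means the second kernel had to wait for the first to drain.

Code:
// coresidency.cu - crude probe: can one SM host blocks from two different kernels?
#include <cstdio>
#include <cuda_runtime.h>

__global__ void spin_a(long long cycles) {
    long long start = clock64();
    while (clock64() - start < cycles) { }   // busy-wait for ~cycles GPU clocks
}

__global__ void spin_b(long long cycles) {   // genuinely distinct kernel, same work
    long long start = clock64();
    while (clock64() - start < cycles) { }
}

static float elapsed_ms(cudaEvent_t a, cudaEvent_t b) {
    float ms = 0.f;
    cudaEventElapsedTime(&ms, a, b);
    return ms;
}

int main() {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);
    const int blocks = prop.multiProcessorCount;   // one small block per SM
    const long long cycles = 200LL * 1000 * 1000;  // long enough to time reliably

    cudaStream_t s1, s2;
    cudaStreamCreate(&s1);
    cudaStreamCreate(&s2);
    cudaEvent_t t0, t1, t2;
    cudaEventCreate(&t0); cudaEventCreate(&t1); cudaEventCreate(&t2);

    cudaEventRecord(t0);
    spin_a<<<blocks, 32, 0, s1>>>(cycles);   // baseline: one kernel alone
    cudaEventRecord(t1);                     // legacy default stream waits for it

    spin_a<<<blocks, 32, 0, s1>>>(cycles);   // kernel A occupies every SM
    spin_b<<<blocks, 32, 0, s2>>>(cycles);   // kernel B overlaps only if SMs co-host it
    cudaEventRecord(t2);
    cudaEventSynchronize(t2);

    printf("single kernel: %.1f ms, two kernels: %.1f ms\n",
           elapsed_ms(t0, t1), elapsed_ms(t1, t2));
    return 0;
}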
 