Nvidia Pascal Announcement

There's a "told you so" someone owes to the above gentleman ;)
Ha! But those words have been trademarked by Charlie.

TSMC says 16FF is 65% faster than 28nm.

If the HPC version is conservatively clocked at 1.4GHz, the consumer version should easily reach 1.6GHz. Even without an architecture change, GCN should go pretty high as well.
 
To be fair, the GeForces will probably beat 1400 MHz, but overall, compared OC vs. OC against Maxwell, I don't expect Pascal to be that far ahead on clocks.
 
Great news. ALUs are great for marketing, but big+fast register files and LDS (including fast LDS atomics since Maxwell) are more important for actual compute performance.
According to OlegSH's link https://devblogs.nvidia.com/parallelforall/inside-pascal/
the max registers per thread is still 255, but now there are 14336 KB of register file divided among 3584 CUDA cores, instead of 6144 KB divided among 3072 CUDA cores in Maxwell.
It also states 64 cores per SM instead of 128 cores per SM in Maxwell.

Since the max registers per thread is the same as Maxwell's, does the register increase per core allow the per-thread allocation to hit the 255 maximum in more situations (with more warps in flight)? Or does this have something to do with async compute?
edit - now that I think about it, I guess it could be both.
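Rough per-core arithmetic with just the numbers quoted above (purely illustrative, nothing beyond what the article states):

Code:
// Register file per CUDA core, GM200 (Maxwell) vs GP100, from the totals
// quoted above. The per-SM register file is 256 KB in both cases; GP100
// halves the cores per SM, so each core has twice the registers behind it.
#include <cstdio>

int main() {
    const double gm200_rf_kb = 6144.0,  gm200_cores = 3072.0; // 24 SMs x 256 KB
    const double gp100_rf_kb = 14336.0, gp100_cores = 3584.0; // 56 SMs x 256 KB

    printf("GM200: %.1f KB of registers per core\n", gm200_rf_kb / gm200_cores);
    printf("GP100: %.1f KB of registers per core\n", gp100_rf_kb / gp100_cores);
    return 0;
}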
 
So nVidia spent nearly the entire transistor budget on memory and DP. Not much in there for the gaming crowd.

Really high clocks for such a beastly chip though. GP104/6 should be interesting.
 
According to OlegSH's link https://devblogs.nvidia.com/parallelforall/inside-pascal/
the max registers per thread is still 255, but now there are 14336 KB of register file divided among 3584 CUDA cores, instead of 6144 KB divided among 3072 CUDA cores in Maxwell.
It also states 64 cores per SM instead of 128 cores per SM in Maxwell.

Since the max registers per thread is the same as Maxwell's, does the register increase per core allow the per-thread allocation to hit the 255 maximum in more situations (with more warps in flight)? Or does this have something to do with async compute?
Registers per work item tells you nothing about the number of hardware threads per SIMD.
 
I wonder what effective instruction latencies on Pascal look like, for FP32 work.

Not many more ALUs, but still an increase in throughput? Sounds like a deeper pipeline to me. Potentially decreased register pressure (and hence more warps in flight on average) through intentionally reduced instruction-level parallelism per thread?

Or just fewer ALUs, but the same effective register usage, hence reducing the impact of register shortage?
 
Registers per work item tells you nothing about the number of hardware threads per SIMD.
What I'm trying to process is that there is more than a doubling of registers for the chip, but nowhere near a doubling of cores. Simple logic would lead you to more registers per core... but if the max registers per thread is the same, then what is the increase in registers for? I think that's my confusion in a nutshell.
 
What I'm trying to process is that there is more than a doubling of registers for the chip, but nowhere near a doubling of cores. Simple logic would lead you to more registers per core... but if the max registers per thread is the same, then what is the increase in registers for? I think that's my confusion in a nutshell.
Didn't they do the same for GK210? Same number of cores as GK110, much larger register files. Significantly higher performance for a lot of HPC workloads.
 
What I'm trying to process is that there is more than a doubling of registers for the chip, but nowhere near a doubling of cores. Simple logic would lead you to more registers per core... but if the max registers per thread is the same, then what is the increase in registers for? I think that's my confusion in a nutshell.

Classic occupancy problem. For complex workloads (i.e. lots of registers required per thread), the register increase means more threads in flight, more latency hiding, higher throughput, etc.
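If anyone wants to see that effect on their own card, the CUDA runtime's occupancy query reports exactly this; a minimal sketch with a made-up stand-in kernel (nothing from the announcement):

Code:
#include <cstdio>
#include <cuda_runtime.h>

// Trivial placeholder kernel; in practice you would query your real,
// register-hungry kernel here.
__global__ void dummy(float* out, const float* in, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i] * 2.0f;
}

int main() {
    int blocksPerSM = 0;
    const int blockSize = 256;
    // Ask the runtime how many blocks of this kernel stay resident on one SM,
    // given its register/shared-memory usage on the installed GPU.
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(&blocksPerSM, dummy,
                                                  blockSize, 0);
    printf("%d resident blocks per SM = %d warps in flight\n",
           blocksPerSM, blocksPerSM * blockSize / 32);
    return 0;
}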
 
Since the max registers per thread is the same as Maxwell's, does the register increase per core allow the per-thread allocation to hit the 255 maximum in more situations (with more warps in flight)? Or does this have something to do with async compute?
edit - now that I think about it, I guess it could be both.
No, it has to do with the number of warps you can keep in flight. At the maximum of 255 registers you can only have 4 warps in flight on Maxwell, which is not enough to hide the latency of even the simplest arithmetic instructions.
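As an aside, the usual compile-time lever for exactly that trade-off is CUDA's __launch_bounds__ qualifier (or nvcc's --maxrregcount flag); a minimal sketch, not anything specific to Pascal:

Code:
#include <cuda_runtime.h>

// __launch_bounds__(maxThreadsPerBlock, minBlocksPerMultiprocessor):
// asking for at least 8 resident blocks of 256 threads forces the compiler
// to stay under the matching per-thread register budget (spilling to local
// memory if it has to), so more warps can be in flight.
__global__ void __launch_bounds__(256, 8)
scale(int n, float a, const float* x, float* y) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) y[i] = a * x[i];
}
// The blunter alternative is `nvcc --maxrregcount=N`, which caps registers
// per thread for the whole compilation unit.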
 
Tesla products tend to have lower clock speeds for the memory too, don't they?
They might do. But GDDR5X will do that bandwidth. Why aim so low?

Also, it's surprising it's not 32GB of HBM2, since memory per node is supposedly a big deal for deep learning. And the rest of HPC wants lots of memory too, in case we forget about them. And Knights Landing will have 400GB (though at mixed, inferior bandwidths).

Maybe the 8GB modules aren't coming this year.
 
What I'm trying to process is that there is more than a doubling of registers for the chip, but nowhere near a doubling of cores. Simple logic would lead you to more registers per core... but if the max registers per thread is the same, then what is the increase in registers for? I think that's my confusion in a nutshell.
Just do the math: the register file is 256 KB per SM, i.e. 64K 32-bit registers.
Max 2048 threads per SM, and at most 255 registers per thread. If each of those threads used the theoretical maximum of 255 32-bit registers, you would end up with a theoretical demand of roughly 2 MB of register file.

But that's not relevant. What you care about is what happens when you run low on registers, and that mostly means you can no longer saturate all cores. So just decrease the number of cores, keep the number of threads and the size of the register file the same, and scale horizontally instead.

So even if you max out at only 4 warps due to register pressure, you've only got as many cores inside a single SM as you can still saturate.
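Same math in runnable form, using only the figures from this post (granularity caveats noted in the comments):

Code:
#include <cstdio>

int main() {
    const int regs_per_sm     = 64 * 1024; // 65,536 x 32-bit = 256 KB per SM
    const int max_threads_sm  = 2048;
    const int max_regs_thread = 255;

    // Theoretical demand if every resident thread used the 255-register max:
    double demand_kb = max_threads_sm * max_regs_thread * 4.0 / 1024.0;
    printf("Demand at 2048 threads x 255 regs: %.0f KB (vs 256 KB on hand)\n",
           demand_kb);

    // Conversely, at 255 regs/thread only this many warps fit per SM
    // (ignoring allocation granularity and per-scheduler partitioning,
    // which push the usable number lower still):
    int warps = regs_per_sm / (max_regs_thread * 32);
    printf("Warps resident at 255 regs/thread: %d of 64 possible\n", warps);
    return 0;
}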
 
Also it's surprising it's not 32GB of HBM2
8GB stacks haven't started production yet, AFAIK. So the first batch of P100 has to make do with 4GB stacks if they want to ship to OEMs in Q4 and to customers in Q1.
 