Nvidia Pascal Announcement

silent_guy · Apr 5, 2016

Ailuros said:
There's a "told you so" someone owes to the above gentleman

Ha! But those words have been trademarked by Charlie.

TSMC says 16FF is 65% faster than 28nm.

If the HPC version is conservatively clocked at 1.4GHz, the consumer version should easily reach 1.6GHz. Even without an architecture change, GCN should go pretty high as well.

Adored · Apr 5, 2016

To be fair the GeForces will probably beat 1400MHz, but overall when compared oc vs oc to Maxwell I don't expect Pascal to be that far ahead on clocks.

Deleted member 2197 · Apr 5, 2016

http://www.guru3d.com/news-story/nvidia-announces-tesla-p100-data-center-gpu.html

Infinisearch · Apr 5, 2016

sebbbi said:
Great news. ALUs are great for marketing, but big+fast register files and LDS (including fast LDS atomics since Maxwell) are more important for actual compute performance.

According to OlegSH's link https://devblogs.nvidia.com/parallelforall/inside-pascal/
the max registers per thread is still 255, but now there is 14336 KB of register file divided among 3584 CUDA cores instead of 6144 KB divided among 3072 CUDA cores in Maxwell.
It also states 64 cores per SM instead of 128 cores per SM in Maxwell.

Since the max registers per thread is the same as Maxwell, does the register increase per core allow the per-thread allocation to hit the 255 maximum in more situations (with more warps in flight)? Or does this have something to do with async compute?
edit - now that I think about it I guess it could be both.
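
For what it's worth, the per-core and per-SM split of those figures can be worked out directly. A minimal sketch in Python, using only the numbers quoted above; the function and key names are made up for illustration, and it assumes 4-byte (32-bit) registers:

```python
# Figures as quoted in the post above (register file in KB, CUDA cores, cores per SM).
CHIPS = {
    "Maxwell": {"regfile_kb": 6144, "cores": 3072, "cores_per_sm": 128},
    "Pascal (P100)": {"regfile_kb": 14336, "cores": 3584, "cores_per_sm": 64},
}

BYTES_PER_REGISTER = 4  # assumption: 32-bit registers


def register_budget(regfile_kb, cores, cores_per_sm):
    """Split the chip-wide register file per SM and per core."""
    sms = cores // cores_per_sm
    kb_per_core = regfile_kb / cores
    return {
        "sms": sms,
        "kb_per_sm": regfile_kb / sms,
        "kb_per_core": kb_per_core,
        "regs_per_core": kb_per_core * 1024 / BYTES_PER_REGISTER,
    }


for name, chip in CHIPS.items():
    print(name, register_budget(**chip))
```

On these figures both chips come out at 256 KB of register file per SM. The Pascal SM just has half as many cores sharing it, so the per-core share doubles from 2 KB to 4 KB.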

silent_guy · Apr 5, 2016

Adored said:
To be fair the GeForces will probably beat 1400MHz, but overall when compared oc vs oc to Maxwell I don't expect Pascal to be that far ahead on clocks.

It wouldn't surprise me one bit to see Pascals overclocked to 1.8GHz.

Jawed · Apr 5, 2016

iMacmatician said:
From the dev blog:
So the memory speed would be 1.4 Gbps.

That seems extraordinarily slow. HBM2 fail?

Razor1 · Apr 5, 2016

Tesla products tend to have lower clock speeds for the memory too, don't they?

trinibwoy · Apr 5, 2016

So nVidia spent nearly the entire transistor budget on memory and DP. Not much in there for the gaming crowd.

Really high clocks for such a beastly chip though. GP104/6 should be interesting.

Jawed · Apr 5, 2016

Infinisearch said:
According to OlegSH's link https://devblogs.nvidia.com/parallelforall/inside-pascal/
the max registers per thread is still 255, but now there is 14336 KB of register file divided among 3584 CUDA cores instead of 6144 KB divided among 3072 CUDA cores in Maxwell.
It also states 64 cores per SM instead of 128 cores per SM in Maxwell.

Since the max registers per thread is the same as Maxwell, does the register increase per core allow the per-thread allocation to hit the 255 maximum in more situations (with more warps in flight)? Or does this have something to do with async compute?

Registers per work item tells you nothing about the number of hardware threads per SIMD.

Ext3h · Apr 5, 2016

I wonder what effective instruction latencies on Pascal look like, for FP32 work.

Not many more ALUs, but still an increase in throughput? Sounds like a deeper pipeline to me. Potentially decreased register pressure (and hence more warps on average) through intentionally reduced instruction-level parallelism per thread?

Or just fewer ALUs per SM, but the same effective register usage, hence reducing the impact of register shortage?

Infinisearch · Apr 5, 2016

Jawed said:
Registers per work item tells you nothing about the number of hardware threads per SIMD.

What I'm trying to process is that there is more than a doubling of registers for the chip but nowhere near a doubling of cores. Simple logic would lead you to more registers per core... but if the max registers per thread is the same, then what is the increase in registers for? I think that's my confusion in a nutshell.

silent_guy · Apr 5, 2016

Infinisearch said:
What I'm trying to process is that there is more than a doubling of registers for the chip but nowhere near a doubling of cores. Simple logic would lead you to more registers per core... but if the max registers per thread is the same, then what is the increase in registers for? I think that's my confusion in a nutshell.

Didn't they do the same for GK210? Same number of cores as GK110, much larger register files, and significantly higher performance for a lot of HPC workloads.

trinibwoy · Apr 5, 2016

Infinisearch said:
What I'm trying to process is that there is more than a doubling of registers for the chip but nowhere near a doubling of cores. Simple logic would lead you to more registers per core... but if the max registers per thread is the same, then what is the increase in registers for? I think that's my confusion in a nutshell.

Classic occupancy problem. For complex workloads (i.e. lots of registers required per thread) the register increase means more threads in flight, more latency hiding, higher throughput etc etc.

MDolenc · Apr 5, 2016

Infinisearch said:
Since the max registers per thread is the same as Maxwell, does the register increase per core allow the per-thread allocation to hit the 255 maximum in more situations (with more warps in flight)? Or does this have something to do with async compute?
edit - now that I think about it I guess it could be both.

No, it has to do with the number of warps you can keep in flight. At the maximum of 255 registers per thread you can only have 4 warps in flight on Maxwell, which is not enough to hide the latency of even the simplest arithmetic instructions.

Jawed · Apr 5, 2016

Razor1 said:
Tesla products tend to have lower clock speeds for the memory too, don't they?

They might do. But GDDR5X will do that bandwidth. Why aim so low?

Also it's surprising it's not 32GB of HBM2, since memory per node is supposedly a big deal for deep learning. And the rest of HPC wants lots of memory, too, in case we forget about them. And Knights Landing will have 400GB (though at mixed, inferior, bandwidths).

Maybe the 8GB modules aren't coming this year.

Ext3h · Apr 5, 2016

Infinisearch said:
What I'm trying to process is that there is more than a doubling of registers for the chip but nowhere near a doubling of cores. Simple logic would lead you to more registers per core... but if the max registers per thread is the same, then what is the increase in registers for? I think that's my confusion in a nutshell.

Just do the math. The register file is 256KB per SM, i.e. 64K (65,536) 32-bit registers.
Max 2048 threads per SM, and at most 255 registers per thread. If every one of those threads used the maximum of 255 32-bit registers, you would end up with a theoretical demand of about 2MB of register file per SM.

But that's not relevant. What you care about is what happens when you run low on registers. And that would mostly mean that you can no longer saturate all cores. So just decrease the number of cores per SM, keep the number of threads and the size of the register file the same, and scale horizontally instead.

So even if you max out at only 4 warps due to register pressure, you only have as many cores inside a single SM as you can saturate.
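
To put rough numbers on the math above, here is a minimal sketch in Python. It uses only the per-SM figures from this post (64K 32-bit registers, 2048 threads, 255 registers per thread max) plus a warp size of 32. The function names are made up for illustration, and it deliberately ignores register allocation granularity and every other occupancy limit (shared memory, block count), so treat the results as upper bounds rather than what the hardware actually schedules:

```python
REGS_PER_SM = 64 * 1024        # 256 KB of 32-bit registers per SM
MAX_THREADS_PER_SM = 2048
MAX_REGS_PER_THREAD = 255
WARP_SIZE = 32
BYTES_PER_REGISTER = 4


def worst_case_demand_bytes():
    """Register file needed if every resident thread used the maximum."""
    return MAX_THREADS_PER_SM * MAX_REGS_PER_THREAD * BYTES_PER_REGISTER


def resident_warps(regs_per_thread):
    """Upper bound on warps per SM when only registers limit occupancy.

    Simplification: no allocation granularity, no other resource limits.
    """
    max_warps = MAX_THREADS_PER_SM // WARP_SIZE
    by_registers = REGS_PER_SM // (regs_per_thread * WARP_SIZE)
    return min(max_warps, by_registers)


print(f"worst-case demand: {worst_case_demand_bytes() / 2**20:.2f} MiB per SM")
for regs in (32, 64, 128, 255):
    print(f"{regs:3d} regs/thread -> at most {resident_warps(regs)} warps per SM")
```

Under these assumptions, 255 registers per thread leaves room for 8 warps per SM, against the full 64 at 32 registers per thread. How those warps are split across the schedulers inside an SM is not modeled here.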

silent_guy · Apr 5, 2016

Jawed said:
That seems extraordinarily slow. HBM2 fail?

This may be specific to the HPC version. See Kepler running at 5GHz for HPC vs 7GHz for consumer. It's the same ratio, actually.
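
Taking the "same ratio" remark literally, here is a quick sketch of the consumer memory speed it would imply for the 1.4 Gbps figure mentioned earlier in the thread. The helper name is hypothetical and the result is plain arithmetic on the numbers in this thread, not an announced spec:

```python
def scale_by_ratio(hpc_rate, ref_hpc, ref_consumer):
    """Scale an HPC memory rate by a reference consumer/HPC ratio."""
    return hpc_rate * ref_consumer / ref_hpc


# Kepler reference from the post: 5 (HPC) vs 7 (consumer).
# P100 figure from earlier in the thread: 1.4 Gbps.
implied_consumer_gbps = scale_by_ratio(1.4, ref_hpc=5.0, ref_consumer=7.0)
print(f"implied consumer rate: {implied_consumer_gbps:.2f} Gbps")
```

That works out to roughly 1.96 Gbps.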

Ext3h · Apr 5, 2016

Jawed said:
Also it's surprising it's not 32GB of HBM2

8GB stacks haven't started production yet, AFAIK. So the first batch of P100 has to use 4GB stacks if they want to ship to OEMs in Q4 and to customers in Q1.

Infinisearch · Apr 5, 2016

trinibwoy said:
Classic occupancy problem. For complex workloads (i.e. lots of registers required per thread) the register increase means more threads in flight, more latency hiding, higher throughput etc etc.

Isn't that what I said when I said quote

Infinisearch said:
allow the per thread allocation to hit the 255 maximum in more situations (with more warps in flight)?

or am I messing something up?

MDolenc · Apr 5, 2016

It's not just a problem at maximum registers... It starts way earlier than that.
