Nvidia Pascal Announcement

Discussion in 'Architecture and Products' started by huebie, Apr 5, 2016.

  1. silent_guy

    Veteran Subscriber

    Joined:
    Mar 7, 2006
    Messages:
    3,754
    Likes Received:
    1,379
    Ha! But those words have been trademarked by Charlie.

    TSMC says 16FF is 65% faster than 28nm.

    If the HPC version is conservatively clocked at 1.4GHz, the consumer version should easily reach 1.6GHz. Even without an architecture change, GCN should go pretty high as well.
     
    Lightman and Razor1 like this.
  2. Adored

    Newcomer

    Joined:
    Mar 1, 2016
    Messages:
    67
    Likes Received:
    4
    To be fair the GeForces will probably beat 1400MHz, but overall when compared oc vs oc to Maxwell I don't expect Pascal to be that far ahead on clocks.
     
  3. pharma

    Veteran Regular

    Joined:
    Mar 29, 2004
    Messages:
    2,927
    Likes Received:
    1,626
  4. Infinisearch

    Veteran Regular

    Joined:
    Jul 22, 2004
    Messages:
    739
    Likes Received:
    139
    Location:
    USA
    According to OlegSH's link https://devblogs.nvidia.com/parallelforall/inside-pascal/
    the max registers per thread is still 255, but there is now 14336 KB of register file divided among 3584 CUDA cores, instead of 6144 KB divided among 3072 CUDA cores in Maxwell.
    It also states 64 cores per SM instead of 128 cores per SM in Maxwell.

    Since the max registers per thread is the same as on Maxwell, does the register increase per core allow the per-thread allocation to hit the 255 maximum in more situations (with more warps in flight)? Or does this have something to do with async compute?
    Edit: now that I think about it, I guess it could be both.
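
As a quick check of the figures quoted above, the register file per CUDA core can be worked out directly. This is only a sketch: the totals and core counts are the ones cited in the post, and the helper names `kb_per_core` and `kb_per_sm` are made up for illustration.

```python
# Register file totals and core counts as quoted in the post above.
CHIPS = {
    "Maxwell": {"regfile_kb": 6144, "cores": 3072, "cores_per_sm": 128},
    "Pascal": {"regfile_kb": 14336, "cores": 3584, "cores_per_sm": 64},
}


def kb_per_core(chip):
    """Register file (KB) available per CUDA core (hypothetical helper)."""
    return chip["regfile_kb"] / chip["cores"]


def kb_per_sm(chip):
    """Register file (KB) per SM, derived from the cores-per-SM figure."""
    return kb_per_core(chip) * chip["cores_per_sm"]


for name, chip in CHIPS.items():
    print(f"{name}: {kb_per_core(chip):.0f} KB/core, {kb_per_sm(chip):.0f} KB/SM")
```

On these numbers the register file per SM stays at 256 KB while the cores per SM halve, so the per-core share doubles from 2 KB to 4 KB.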
     
  5. silent_guy

    Veteran Subscriber

    Joined:
    Mar 7, 2006
    Messages:
    3,754
    Likes Received:
    1,379
    It wouldn't surprise me one bit to see Pascals overclocked to 1.8GHz.
     
    Razor1 and pharma like this.
  6. Jawed

    Legend

    Joined:
    Oct 2, 2004
    Messages:
    10,873
    Likes Received:
    767
    Location:
    London
    That seems extraordinarily slow. HBM2 fail?
     
  7. Razor1

    Veteran

    Joined:
    Jul 24, 2004
    Messages:
    4,232
    Likes Received:
    749
    Location:
    NY, NY
    Tesla products tend to have lower memory clock speeds too, don't they?
     
    spworley likes this.
  8. trinibwoy

    trinibwoy Meh
    Legend

    Joined:
    Mar 17, 2004
    Messages:
    10,430
    Likes Received:
    432
    Location:
    New York
    So nVidia spent nearly the entire transistor budget on memory and DP. Not much in there for the gaming crowd.

    Really high clocks for such a beastly chip though. GP104/6 should be interesting.
     
  9. Jawed

    Legend

    Joined:
    Oct 2, 2004
    Messages:
    10,873
    Likes Received:
    767
    Location:
    London
    Registers per work item tells you nothing about the number of hardware threads per SIMD.
     
    Razor1 likes this.
  10. Ext3h

    Regular Newcomer

    Joined:
    Sep 4, 2015
    Messages:
    337
    Likes Received:
    294
    I wonder what effective instruction latencies on Pascal look like, for FP32 work.

    Not many more ALUs, but still an increase in throughput? Sounds like a deeper pipeline to me. Potentially decreased register pressure (and hence more warps on average) by intentionally reducing instruction-level parallelism per thread?

    Or just fewer ALUs per SM, but the same effective register usage, hence reducing the impact of register shortage?
     
  11. Infinisearch

    Veteran Regular

    Joined:
    Jul 22, 2004
    Messages:
    739
    Likes Received:
    139
    Location:
    USA
    What I'm trying to process is that there is more than a doubling of registers for the chip but nowhere near a doubling of cores. Simple logic would lead you to more registers per core... but if the max registers per thread is the same, then what is the increase in registers for? I think that's my confusion in a nutshell.
     
  12. silent_guy

    Veteran Subscriber

    Joined:
    Mar 7, 2006
    Messages:
    3,754
    Likes Received:
    1,379
    Didn't they do the same for GK210: same number of cores as GK110, much larger register files. Significantly higher performance for a lot of HPC workloads.
     
    nnunn likes this.
  13. trinibwoy

    trinibwoy Meh
    Legend

    Joined:
    Mar 17, 2004
    Messages:
    10,430
    Likes Received:
    432
    Location:
    New York
    Classic occupancy problem. For complex workloads (i.e. lots of registers required per thread) the register increase means more threads in flight, more latency hiding, higher throughput etc etc.
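
A rough sketch of that occupancy argument, under loudly stated assumptions: a register file of 65,536 32-bit registers, a cap of 64 resident warps, and 32 threads per warp are illustrative defaults rather than figures from this post, and the function `resident_warps` is hypothetical. Allocation granularity and other limits (shared memory, block counts) are ignored.

```python
def resident_warps(regs_available, regs_per_thread, warp_size=32, max_warps=64):
    """Warps that fit in the register file, capped at an assumed warp limit.

    Simplified model: ignores register allocation granularity and every
    other occupancy limiter.
    """
    regs_per_warp = regs_per_thread * warp_size
    return min(max_warps, regs_available // regs_per_warp)


BASE_REGS = 65536  # assumed register file size, in 32-bit registers

for regs in (32, 64, 128):
    base = resident_warps(BASE_REGS, regs)
    doubled = resident_warps(2 * BASE_REGS, regs)
    print(f"{regs} regs/thread: {base} warps, {doubled} with twice the registers")
```

In this model a kernel using 32 registers per thread is already at the warp cap, so extra registers buy nothing, while kernels at 64 or 128 registers per thread see their resident warps double.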
     
    milk, nnunn, CSI PC and 2 others like this.
  14. MDolenc

    Regular

    Joined:
    May 26, 2002
    Messages:
    690
    Likes Received:
    425
    Location:
    Slovenia
    No, it has to do with number of warps you can keep in flight. At maximum of 255 registers you can only have 4 warps in flight on Maxwell, which is not enough to hide the latency of even the simplest arithmetic instructions.
     
    nnunn and Jawed like this.
  15. Jawed

    Legend

    Joined:
    Oct 2, 2004
    Messages:
    10,873
    Likes Received:
    767
    Location:
    London
    They might do. But GDDR5X will do that bandwidth. Why aim so low?

    Also it's surprising it's not 32GB of HBM2, since memory per node is supposedly a big deal for deep learning. And the rest of HPC wants lots of memory, too, in case we forget about them. And Knights Landing will have 400GB (though at mixed, inferior, bandwidths).

    Maybe the 8GB modules aren't coming this year.
     
  16. Ext3h

    Regular Newcomer

    Joined:
    Sep 4, 2015
    Messages:
    337
    Likes Received:
    294
    Just do the math: a 256 KB register file per SM, i.e. 64k 32-bit registers.
    Max 2048 threads per SM, and at most 255 registers per thread. If every one of those threads used the theoretical maximum of 255 32-bit registers, you would end up with a theoretical demand of about 2 MB of register file per SM.

    But that's not relevant. What you care about is what happens when you run low on registers. That would mostly mean that you can no longer saturate all cores. So just decrease the number of cores, keep the number of threads and the size of the register file the same, and scale horizontally instead.

    So even if you max out at only 4 warps due to register pressure, you only have as many cores inside a single SM as you can saturate.
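
The arithmetic in this post can be reproduced in a few lines. This is a sketch using only the figures stated above; the variable names are made up for illustration.

```python
REGFILE_BYTES = 256 * 1024   # 256 KB register file per SM (from the post)
REG_BYTES = 4                # one 32-bit register
MAX_THREADS = 2048           # max threads per SM (from the post)
MAX_REGS_PER_THREAD = 255    # per-thread register limit (from the post)

# Number of 32-bit registers in one SM's register file.
total_regs = REGFILE_BYTES // REG_BYTES

# Register file needed if every thread used the per-thread maximum.
worst_case_bytes = MAX_THREADS * MAX_REGS_PER_THREAD * REG_BYTES

# Registers each thread can actually get with all threads resident.
regs_at_full_occupancy = total_regs // MAX_THREADS

print(total_regs)
print(worst_case_bytes / (1024 * 1024))
print(regs_at_full_occupancy)
```

So the worst case needs roughly eight times the register file that exists, and a fully occupied SM leaves each thread 32 registers.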
     
    nnunn likes this.
  17. silent_guy

    Veteran Subscriber

    Joined:
    Mar 7, 2006
    Messages:
    3,754
    Likes Received:
    1,379
    This may be specific to the HPC version. See Kepler running at 5GHz for HPC vs 7GHz for consumer. It's the same ratio, actually.
     
    pharma, Razor1 and pjbliverpool like this.
  18. Ext3h

    Regular Newcomer

    Joined:
    Sep 4, 2015
    Messages:
    337
    Likes Received:
    294
    8GB stacks haven't started production yet, AFAIK. So the first batch of P100 has to use 4GB stacks if they want to ship in Q4 to OEMs and in Q1 to customers.
     
  19. Infinisearch

    Veteran Regular

    Joined:
    Jul 22, 2004
    Messages:
    739
    Likes Received:
    139
    Location:
    USA
    Isn't that what I said when I said quote
    or am I messing something up?
     
  20. MDolenc

    Regular

    Joined:
    May 26, 2002
    Messages:
    690
    Likes Received:
    425
    Location:
    Slovenia
    It's not just a problem at maximum registers... It starts way earlier than that.
     