Nvidia Pascal Speculation Thread

Discussion in 'Architecture and Products' started by DSC, Mar 25, 2014.

Thread Status:
Not open for further replies.
  1. CarstenS

    Veteran Subscriber

    Joined:
    May 31, 2002
    Messages:
    4,796
    Likes Received:
    2,054
    Location:
    Germany
    I smell the return of Geforce 6 ;)
     
  2. Scott_Arm

    Legend

    Joined:
    Jun 16, 2004
    Messages:
    13,167
    Likes Received:
    3,570
    Half-precision (16-bit float), once undesirable, is now a selling point. Why'd they get rid of it in the first place?
     
  3. Albuquerque

    Albuquerque Red-headed step child
    Veteran

    Joined:
    Jun 17, 2004
    Messages:
    3,845
    Likes Received:
    329
    Location:
    35.1415,-90.056
    Back when GPUs were Graphics Processing Units, FP16 wasn't "sufficient" for blending after enough passes in a lot of cases. Now that GPUs are General Purpose Compute Units, the lower precision can be useful once again for specific cases.

    That statement isn't exactly true, but it's truly the compute side that makes it useful, and mostly the graphics side that had previously decided that it wasn't.
     
    Scott_Arm likes this.
  4. fellix

    fellix Hey, You!
    Veteran

    Joined:
    Dec 4, 2004
    Messages:
    3,484
    Likes Received:
    396
    Location:
    Varna, Bulgaria
    FP16 was used back in the days when the cost of the ALU parts on the die was still a decisive design factor and going full-time with an FP32 shader pipeline wasn't feasible. Today, it's more the power cost (and bandwidth) that dictates this design direction, together with the significantly expanded application field for GPUs.
     
  5. ToTTenTranz

    Legend Veteran Subscriber

    Joined:
    Jul 7, 2008
    Messages:
    9,780
    Likes Received:
    4,431
    3dfx would be proud.
     
  6. Arun

    Arun Unknown.
    Moderator Legend Veteran

    Joined:
    Aug 28, 2002
    Messages:
    5,023
    Likes Received:
    299
    Location:
    UK
    You all realise that no NVIDIA GPU ever had an FP16 Multiply-Add unit before Tegra X1, right? They were using FP16 *registers* and also had a dedicated FP16 Normalize unit, but the core ALUs were always FP32.

    I'm also not sure how removing FP16 support had anything to do with blending, as the blender was always a separate unit on NVIDIA/AMD GPUs... To a large extent, NV3x/NV4x never really benefited from FP16 as much as they could have; FP16 wasn't fast - FP32 was slow!

    While there are many power benefits, there's mostly a significant performance benefit from the fact that it's a Vec2 ALU (i.e. it is simply reading 32-bit registers as 2x16-bit). Don't focus only on the area cost of the ALU - you should consider the area/power cost of the register file too...

    (Rampage wasn't FP16. Mojo was meant to have FP16 ALUs but that was likely little more than a draft specification/wishlist lying on an architect's desk by the time 3dfx went bankrupt...)
     
    #126 Arun, Mar 18, 2015
    Last edited: Mar 18, 2015
    spworley likes this.
  7. nAo

    nAo Nutella Nutellae
    Veteran

    Joined:
    Feb 6, 2002
    Messages:
    4,325
    Likes Received:
    93
    Location:
    San Francisco
    On NV47/PS3 it was very important to use FP16 registers where possible in order to save register space and get more threads running, which increased performance in most cases, except when one ended up thrashing the texture cache :)

    Funnily enough, in those cases one could fool the GPU by pretending to use more registers than necessary, scaling down the number of threads to avoid thrashing the texture cache and get better performance. Oh, nostalgia.. :)
     
  8. sebbbi

    Veteran

    Joined:
    Nov 14, 2007
    Messages:
    2,924
    Likes Received:
    5,288
    Location:
    Helsinki, Finland
    With no FP16 desktop hardware available for testing, this would have been equal to shipping code without even running it once. I assume nobody was crazy enough to do this.
    GPU latency hiding depends on the active thread count. Lower register usage means more threads can be run simultaneously; this is the biggest performance gain from FP16. I am also waiting to get 16-bit integers eventually, since that provides similar gains. 16-bit integers are big enough for most purposes (and obviously there is no precision loss either if your value fits in 16 bits).

    We do manual bit-packing tricks already to reduce integer register (and LDS) pressure. GCN has one-cycle mask+shift instructions, making packing/unpacking really fast. It also has instructions to pack/unpack a pair of FP16 values to a 32-bit register. So you can emulate some of the gains for variables that are infrequently used (every pack/unpack pair is 2 extra instructions, so you don't want to do this often). Obviously this kind of emulation provides zero power savings and adds some ALU instructions (and costs developer time). Native FP16 and 16-bit integers are very much welcome.
     
  9. Ailuros

    Ailuros Epsilon plus three
    Legend Subscriber

    Joined:
    Feb 7, 2002
    Messages:
    9,416
    Likes Received:
    178
    Location:
    Chania
    That's a good question for those green birds around here that were passionately evangelizing how "useless" it is. As already mentioned: the more power-sensitive things get, the more IHVs will look for solutions to increase efficiency for every possible use case.
     
  10. 3dilettante

    Legend Alpha

    Joined:
    Sep 15, 2003
    Messages:
    8,117
    Likes Received:
    2,860
    Location:
    Well within 3d
    Green or red? Or which green or red, I forget who's what color now.
    The nice thing with this modern take is that the FP16 functionality is overlaid on a hardware base that is robust from a 32-bit standpoint. Use cases that need the higher precision perform normally, while cases that can use FP16 can do better. I think people would notice it more if the next GPU came out with register and ALU resources optimized to FP16 without a change in the headline register and unit counts.

    I'm curious whether the 8-bit granularity that the most recent GCN ISA document exposes is useful for any notable situations, given the rather low ceiling that could entail.
     
  11. Ailuros

    Ailuros Epsilon plus three
    Legend Subscriber

    Joined:
    Feb 7, 2002
    Messages:
    9,416
    Likes Received:
    178
    Location:
    Chania
    I don't care about the colour as long as it's not pink; after two daughters some fathers may sense why :p

    I'm curious whether Pascal will take the same path as the X1 GPU, or whether its FP16 support comes with fewer conditions, or none at all.
     
  12. Malo

    Malo Yak Mechanicum
    Legend Veteran Subscriber

    Joined:
    Feb 9, 2002
    Messages:
    6,952
    Likes Received:
    3,032
    Location:
    Pennsylvania
    The amount of sensationalist journalism reporting that 10x Maxwell figure as "next GPU 10x as fast as Maxwell" is ridiculous. At least I cleaned up a few annoying sources on Facebook.
     
  13. spworley

    Newcomer

    Joined:
    Apr 19, 2013
    Messages:
    146
    Likes Received:
    190
    There are many science codes where the data has a very low dynamic range. 8 bits is enough, and in return you save storage, get better caching and (above all!) save power through less data movement. At GTC this week they gave some examples for 8-bit SGEMM, one of which was radio astronomy data. The win wasn't the higher ALU throughput, but the smaller data size.
     
  14. lanek

    Veteran

    Joined:
    Mar 7, 2012
    Messages:
    2,469
    Likes Received:
    315
    Location:
    Switzerland
    Effectively, I like this way of calculating performance gains.. if only it were so easy.
     
  15. Blazkowicz

    Legend Veteran

    Joined:
    Dec 24, 2004
    Messages:
    5,607
    Likes Received:
    256
    That figure was apparently in the context of a few GPUs linked together (two? four?) with NVLink, and thus working similarly to SMP for CPUs; a system with graphics cards plugged into PCIe slots is more like a cluster or "shared nothing" architecture, each GPU being its own little island.

    You go looking for a workload that needs SMP-like scaling ("scaling up" rather than "scaling out") and works with the FP16 sauce, and then the figure looks plausible, if expensive.
     
  16. lanek

    Veteran

    Joined:
    Mar 7, 2012
    Messages:
    2,469
    Likes Received:
    315
    Location:
    Switzerland
    It was his "boss calculation method"... lol (his own words).


    Of course, in a specific, particular case that wasn't supported before you could end up with that, but we all know performance won't even be doubled..

    (Anyway, maybe he was speaking about the FP64 rate, who knows.. 200 GFLOPS on Maxwell x 10 = 2 TFLOPS, lol)

    Sorry, it is a bit aggressive, but I really can't stand it when the Nvidia boss does marketing à la Apple. I believe next year Nvidia will use FP16 to quote their TFLOPS rate (similar to the Tegra X1 presentation), and use compression algorithms to calculate memory bandwidth (similar to their 960 launch)... With all the respect I have for this company, lately I really don't understand where they want to go with this type of marketing.. especially when it was done at a conference for professionals...

    They can keep this type of marketing for selling smartphones or their Tegra tablets..
     
    #136 lanek, Mar 22, 2015
    Last edited: Mar 22, 2015
  17. silent_guy

    Veteran Subscriber

    Joined:
    Mar 7, 2006
    Messages:
    3,754
    Likes Received:
    1,379
    So he's not allowed to tell researchers that, thanks to NVLink, weight propagation between connected GPUs will be improved by roughly a factor of 10X? Because that's how I read this particular slide.

    I only watched the video of the second keynote by the Google guy, but he said that they had to reduce the interconnectedness of the neural networks to allow GPUs to recalculate weights in parallel. So it seems that this is a relevant improvement for this particular audience.

    I appreciate your concern that researchers at such a conference may think 'OMG everything 10X faster', but maybe you should give the targeted audience (again: researchers) a bit more credit?

    You're advocating that even keynotes of scientific conferences should take the lowest common denominator of the web-viewing public into account. In that case, everybody might as well stay home and read YouTube comments.
     
    nnunn likes this.
  18. lanek

    Veteran

    Joined:
    Mar 7, 2012
    Messages:
    2,469
    Likes Received:
    315
    Location:
    Switzerland
    Because for you, the title of the slide is not readable enough (let alone his words)?

    But you are probably right.. Anyway, it is absolutely funny to see so many sites reporting "Pascal will be 10x faster than Maxwell.." ... so what makes them think that?
     
    #138 lanek, Mar 22, 2015
    Last edited: Mar 22, 2015
  19. pjbliverpool

    pjbliverpool B3D Scallywag
    Legend

    Joined:
    May 8, 2005
    Messages:
    7,583
    Likes Received:
    703
    Location:
    Guess...
    It's probably worth not getting too caught up in the 4x (FP16) and 10x (God knows what) performance-per-watt claims, since the baseline 2x perf/watt figure is already damn impressive. Lest we forget, that equates to Titan X performance at 125W. I can live with that!

    Imagine what they can do in the 980's 165W envelope!
     
  20. pharma

    Veteran Regular

    Joined:
    Mar 29, 2004
    Messages:
    2,875
    Likes Received:
    1,581
    Actually, the 10x Maxwell is an estimate on his part (he said as much in the video), but I believe he is drawing this estimate from the working Pascal and NVLink parts they currently have. Once you factor in HBM, I really don't think the 10x is far from what we should expect, but time will tell.
     