Nvidia Volta Speculation Thread

Discussion in 'Architecture and Products' started by DSC, Mar 19, 2013.

  1. ImSpartacus

    Regular Newcomer

    Joined:
    Jun 30, 2015
    Messages:
    252
    Likes Received:
    199
    Nvidia's datacenter business has simply exploded in the past year, so I figure they now have the volume to push these kinds of risky high-margin products.

     
    Man from Atlantis, pharma and Razor1 like this.
  2. Razor1

    Veteran

    Joined:
    Jul 24, 2004
    Messages:
    4,232
    Likes Received:
    749
    Location:
    NY, NY
    Interestingly though, it looks like anyone in the AI marketplace, even Google's Tensor chips for training, is going to have a tough time against GV100.
     
  3. 3dilettante

    Legend Alpha

    Joined:
    Sep 15, 2003
    Messages:
    8,122
    Likes Received:
    2,873
    Location:
    Well within 3d
    Is the interposer made of multiple exposed fields? The chip itself could straddle the boundary, or those regions could be stitched together with coarse enough interconnects. I recall that it is possible to get a single exposed field that is larger than a more standard stepper's reticle, just not cheaply.
     
    CSI PC and ImSpartacus like this.
  4. Ryan Smith

    Regular

    Joined:
    Mar 26, 2010
    Messages:
    611
    Likes Received:
    1,052
    Location:
    PCIe x16_1
    Winner winner. GV100's interposer has been exposed twice in order to produce a large enough interconnect area.
     
    Alexko, iMacmatician, Razor1 and 3 others like this.
  5. loekf

    Regular

    Joined:
    Jun 29, 2003
    Messages:
    613
    Likes Received:
    61
    Location:
    Nijmegen, The Netherlands
    880 mm2 is staggering for an ASIC, even bigger than Intel (cough cough...) Itaniums.

    Like a colleague once said, "Tiles're us (tm)".

    Pity they went for 960 GFLOPS; their marketing department must have missed their bonus.
     
  6. psurge

    Regular

    Joined:
    Feb 6, 2002
    Messages:
    939
    Likes Received:
    35
    Location:
    LA, California
    Hopefully we'll get more details on the SIMT improvements. Sounds like they can now also handle irreducible CFGs? Also, based on http://images.anandtech.com/doci/11360/ssp_445.jpg, it looks like Xavier has deep learning HW that is separate from the tensor cores.
     
  7. CarstenS

    Veteran Subscriber

    Joined:
    May 31, 2002
    Messages:
    4,798
    Likes Received:
    2,056
    Location:
    Germany
    Given that cost plays a role: if you design the chip accordingly, would it be possible to use two separate interposers (one per left/right duo of HBM stacks) as well? Since the memory stacks do not talk to each other, I would think this might be possible.
     
    Lightman likes this.
  8. Kaotik

    Kaotik Drunk Member
    Legend

    Joined:
    Apr 16, 2003
    Messages:
    8,183
    Likes Received:
    1,840
    Location:
    Finland
    In theory I don't see a problem, assuming you leave a gap between the corresponding GPU pin areas as well. But in this case, it's apparently still one big block, double exposed.
     
  9. ToTTenTranz

    Legend Veteran Subscriber

    Joined:
    Jul 7, 2008
    Messages:
    9,971
    Likes Received:
    4,565
    The impressive part for me is more of an "oh shit, I can't believe they were bullish enough to do this" than an actual technical achievement.
    Each wafer can only produce very few chips, and most probably the great majority of them come out with defects (at best salvageable by disabling units). With a 300mm wafer they're probably getting around 60-65 dies per wafer.
    They're only making this because they got clients to pay >$15k per GPU, meaning a 2% yield (practically 1 good GPU per wafer) already provides some profit.
    10% yields (6 good chips) would mean roughly $90K of revenue per wafer, of which they're probably keeping well over $80K in profit after putting the cards together (rough arithmetic sketched below).
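    A back-of-the-envelope sketch of those figures, using the classic gross-die formula; the price and yield numbers are the assumptions from the paragraph above, not disclosed data:

    ```cuda
    // wafer_math.cu -- rough numbers behind the estimate above; price and
    // yield figures are assumptions from the post, not disclosed data.
    #include <cstdio>
    #include <cmath>

    int main() {
        const double pi       = 3.141592653589793;
        const double wafer_d  = 300.0;   // mm, standard wafer diameter
        const double die_area = 815.0;   // mm^2, GV100 die size

        // Classic gross-die estimate: wafer area / die area, minus an edge-loss term.
        double gross = pi * (wafer_d / 2) * (wafer_d / 2) / die_area
                     - pi * wafer_d / std::sqrt(2.0 * die_area);
        std::printf("gross dies per 300mm wafer: ~%.0f\n", gross);   // ~63

        const double price    = 15000.0;          // assumed price per good GPU
        const double yields[] = {0.02, 0.10};     // assumed yield scenarios
        for (double yield : yields) {
            double good = gross * yield;
            std::printf("yield %2.0f%% -> ~%.1f good dies -> ~$%.0fK per wafer\n",
                        yield * 100.0, good, good * price / 1000.0);
        }
        return 0;
    }
    ```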


    The FP32 and FP64 unit increase almost matches the increase in die area. Unlike Pascal P100, the FP32 units don't seem to do 2*FP16 operations anymore, as the Tensor cores take over that role.
    So what they saved through smaller FP32 units and the general die area gains from the 12FF transition, they invested in the Tensor cores.


    The Tensor cores are surely unable to address values at arbitrary positions within the cubic matrices (otherwise they would just be regular FP16 ALUs). My guess is you could multiply 4*4 matrices built from two 4*1 vectors of "valid" FP16 values, fill the 3rd dimension with 1s, and just read out the first row at the end (EDIT: derp, forgot how to do algebra).
    That said, this results in 30 TFLOPs (120/4) of regular FP16 FMAD operations.
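    Not exactly the scheme above, but one concrete way to see where the divide-by-4 comes from, assuming the tensor op computes D = A*B + C on 4x4 FP16 tiles: if B is diagonal, the 64 multiply-accumulates collapse to 16 independent scalar FMAs (d_ij = a_ij*b_j + c_ij), i.e. a quarter of the unit does useful work. A host-side sketch of that identity (FP32 stands in for FP16 here):

    ```cuda
    // tensor_fma_sketch.cu -- illustrative only: emulates a 4x4 matrix FMA
    // (D = A*B + C) on the host and checks that a diagonal B reduces it to
    // 16 independent scalar FMAs, i.e. 1/4 of the 64 MACs doing useful work.
    #include <cstdio>

    int main() {
        float A[4][4], B[4][4] = {}, C[4][4], D[4][4];
        const float b[4] = {2.f, 3.f, 4.f, 5.f};     // per-column multipliers

        for (int i = 0; i < 4; ++i) {
            B[i][i] = b[i];                          // B is diagonal, rest stays 0
            for (int j = 0; j < 4; ++j) { A[i][j] = float(i + j); C[i][j] = 1.f; }
        }

        // What the matrix-FMA unit would compute: D = A*B + C (64 multiply-adds).
        for (int i = 0; i < 4; ++i)
            for (int j = 0; j < 4; ++j) {
                D[i][j] = C[i][j];
                for (int k = 0; k < 4; ++k) D[i][j] += A[i][k] * B[k][j];
            }

        // With diagonal B this is exactly d_ij = a_ij * b_j + c_ij (16 scalar FMAs).
        bool ok = true;
        for (int i = 0; i < 4; ++i)
            for (int j = 0; j < 4; ++j)
                ok &= (D[i][j] == A[i][j] * b[j] + C[i][j]);
        std::printf("diagonal-B matrix FMA == 16 scalar FMAs: %s\n", ok ? "yes" : "no");
        return 0;
    }
    ```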

    Other than being usable as dedicated FP16 units, I don't see any rendering application for the Tensor units. They could be used for AI inferencing in a game, though.

    For gaming, they'd probably be better off going back to FP32 units capable of doing 2*FP16 operations.
    Or, like they did with consumer Pascal, just ignore FP16 altogether, promote all FP16 variables to FP32, and call it a day. That would be risky because developers could start using a lot of FP16 in rendering in the future, but nvidia's consumer architectures aren't exactly known for being extremely future-proof.
     
    #209 ToTTenTranz, May 11, 2017
    Last edited: May 11, 2017
    Lightman, BRiT and milk like this.
  10. pharma

    Veteran Regular

    Joined:
    Mar 29, 2004
    Messages:
    2,929
    Likes Received:
    1,626
    DavidGraham, Lightman and Razor1 like this.
  11. CarstenS

    Veteran Subscriber

    Joined:
    May 31, 2002
    Messages:
    4,798
    Likes Received:
    2,056
    Location:
    Germany
    Yeah, I was not trying to debate GV100 specifically here, but future product options. For consumer products, multiple exposures on extremely large interposers, as well as the interposers themselves, seem prohibitively expensive. With a modular approach and proper planning, you could use a one-size-fits-all interposer once HBM itself has become more mainstream.
     
  12. xpea

    Regular Newcomer

    Joined:
    Jun 4, 2013
    Messages:
    372
    Likes Received:
    309
    Interesting stuff about Volta:
    source https://www.hpcwire.com/2017/05/10/nvidias-mammoth-volta-gpu-aims-high-ai-hpc/
     
    #212 xpea, May 11, 2017
    Last edited by a moderator: May 11, 2017
    tinokun, Alexko, ImSpartacus and 7 others like this.
  13. 3dilettante

    Legend Alpha

    Joined:
    Sep 15, 2003
    Messages:
    8,122
    Likes Received:
    2,873
    Location:
    Well within 3d
    Back in the 2009 Fermi thread, I remember off-handedly discussing a way for SIMT hardware to get past synchronization points that become split across the currently active and inactive sides of a diverged branch, by letting the hardware issue from each path round-robin. It may not be round-robin here, but the architecture now seems flexible enough not to permanently block further progress on threads that might be holding an operation the active path needs in order to make progress.

    Also contained in the Nvidia blog are mentions of a more streamlined ISA, and a shift to L0 instruction caches per SM. I think instruction buffers stopped being buffers in part because the instruction stream going into the SM would no longer be a FIFO sequence belonging to one active path, and a buffer wouldn't keep instructions around for when lanes happened to realign or hit the same code in a scenario like a common function or different iteration counts of the same loop.

    Nvidia seems to be committing more strongly to keeping up the SIMT facade, in part by correcting a major reason why Nvidia's threads weren't really threads. It's a stronger departure from GCN or x86, which are more explicitly scalar+SIMD. There are some possible hints of a decoupling of the SIMT thread and hardware thread in some of AMD's concepts of lane or wave packing, but nothing as clearly planned or as imminent as Nvidia's product.
    Perhaps it's time to debate again how much of a thread their "thread" really is?
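    To make the divergence problem concrete: the canonical failure case is a lock contended by diverged lanes of the same warp. On lock-step SIMT the lane that won the lock can be parked on the inactive side of the branch while the losers spin forever; Volta's per-thread program counters are claimed to guarantee eventual forward progress. A hedged sketch of that pattern (my own example, not vendor sample code):

    ```cuda
    // warp_lock.cu -- a pattern that can live-lock on lock-step SIMT but is
    // claimed to make forward progress with Volta's independent thread
    // scheduling. Compile for the Volta target, e.g. nvcc -arch=sm_70.
    #include <cstdio>

    __device__ int lock    = 0;
    __device__ int counter = 0;

    __global__ void contended_increment() {
        // All 32 lanes of the warp fight for the same lock. Pre-Volta, the
        // winning lane can sit on the inactive side of the divergent branch
        // while the losing lanes spin, so the lock is never released.
        bool done = false;
        while (!done) {
            if (atomicCAS(&lock, 0, 1) == 0) {    // try to take the lock
                counter += 1;                     // critical section
                __threadfence();                  // make the update visible
                atomicExch(&lock, 0);             // release
                done = true;
            }
        }
    }

    int main() {
        contended_increment<<<1, 32>>>();
        cudaDeviceSynchronize();
        int host_counter = 0;
        cudaMemcpyFromSymbol(&host_counter, counter, sizeof(int));
        std::printf("counter = %d (expected 32)\n", host_counter);
        return 0;
    }
    ```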
     
  14. Clukos

    Clukos Bloodborne 2 when?
    Veteran Newcomer

    Joined:
    Jun 25, 2014
    Messages:
    4,462
    Likes Received:
    3,793
  15. ImSpartacus

    Regular Newcomer

    Joined:
    Jun 30, 2015
    Messages:
    252
    Likes Received:
    199
    Yes, Anandtech reported it that way.

    "NVIDIA is genuinely building the biggest GPU they can get away with: 21.1 billion transistors, at a massive 815mm2, built on TSMC’s still green 12nm “FFN” process (the ‘n’ stands for NVIDIA; it’s a customized higher perf version of 12nm for NVIDIA)."

    http://www.anandtech.com/show/11367...v100-gpu-and-tesla-v100-accelerator-announced

    In addition, I don't think TSMC has historically had an "N" at the end of a process nickname, so it doesn't follow any conventions.
     
    Lightman likes this.
  16. lanek

    Veteran

    Joined:
    Mar 7, 2012
    Messages:
    2,469
    Likes Received:
    315
    Location:
    Switzerland
    That's fairly normal; who besides Nvidia is doing a big chip at TSMC on those processes? They are of course working closely with TSMC from start to finish to fix leakage etc. (I don't think there's much outside collaboration beyond test and optimization work for their specific needs; it's still 12nm FF, it's more a concern for the final production.)
     
  17. 3dilettante

    Legend Alpha

    Joined:
    Sep 15, 2003
    Messages:
    8,122
    Likes Received:
    2,873
    Location:
    Well within 3d
    The new ISA and refactored scheduler contradict some predictions of a highly iterative change from Pascal, at least for the HPC variant.
    The flexibility might not immediately impact code that has already been structured to avoid the program-killing facets of existing SIMD hardware, or possibly games, since they probably favor higher coherence (because of APIs, optimizing for efficiency with current methods, multi-vendor considerations, avoiding those same architectural pitfalls, etc.).

    However, some of the perf/watt gains to come may come from algorithms that were walled off from GPU consideration because they were too dangerous for the more restricted architectures. Unfortunately, maintaining two different algorithm bases doesn't sound easier on top of vendor- or device-specific code, so unless other GPUs start doing this, the full upside may not be realized for some time. Some elements could be accelerated, like reducing the impact of pixel-sync type stalls. A workgroup might be able to launch and get much of its work done despite one pixel hitting a sync barrier, rather than a much more significant gap nearer the front end applying to dozens of other pixels, and it might even be that the hazard would have resolved itself by the time it mattered.

    The Nvidia blog's discussion of other measures of creating groups of communicating threads may also be extended by this SIMT change to lead to a more informal way of capturing parallelism dynamically, since the warps within those groups are also more flexible.

    I did track down where I first thought about workarounds for SIMT's synchronization problem--back when Nvidia coined the term and much of the architectural arrangement was newer to me.
    https://forum.beyond3d.com/posts/1363056/

    The whole range of discussion in that portion of the thread and afterwards would be interesting to review in light of what is coming a cool 8 years later. Possibly, one of the bigger "this is dumb" elements of Nvidia's marketing may not be quite as dumb after all.
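    For what it's worth, the Cooperative Groups API in CUDA 9 (announced alongside Volta) exposes exactly that "groups of communicating threads" idea down to the sub-warp level. A small illustrative sketch, my own example rather than anything from the Nvidia blog: the lanes that took a branch form a group, elect a leader, and issue one atomic for the whole group.

    ```cuda
    // cg_sketch.cu -- illustrative use of CUDA 9 Cooperative Groups: the
    // currently active (diverged) lanes of a warp form a group that can
    // sync, shuffle and aggregate among itself. Requires CUDA 9 or later.
    #include <cstdio>
    #include <cooperative_groups.h>
    namespace cg = cooperative_groups;

    __global__ void aggregated_count(int *out) {
        int tid = blockIdx.x * blockDim.x + threadIdx.x;
        if (tid % 3 == 0) {                               // arbitrary divergent branch
            // coalesced_threads() names exactly the lanes active here,
            // however the rest of the warp diverged.
            cg::coalesced_group active = cg::coalesced_threads();
            if (active.thread_rank() == 0)                // leader of this group
                atomicAdd(out, (int)active.size());       // one atomic per group, not per lane
        }
    }

    int main() {
        int *out = nullptr;
        cudaMallocManaged(&out, sizeof(int));
        *out = 0;
        aggregated_count<<<1, 128>>>(out);
        cudaDeviceSynchronize();
        std::printf("lanes that took the branch: %d (expected 43)\n", *out);
        cudaFree(out);
        return 0;
    }
    ```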
     
    pharma likes this.
  18. Kaotik

    Kaotik Drunk Member
    Legend

    Joined:
    Apr 16, 2003
    Messages:
    8,183
    Likes Received:
    1,840
    Location:
    Finland
    More likely it's just that all the 12FFC risk production got hogged by NVIDIA.
     
    ToTTenTranz likes this.
  19. manux

    Veteran Regular

    Joined:
    Sep 7, 2002
    Messages:
    1,566
    Likes Received:
    400
    Location:
    Earth
    Why 12nm and not 10nm for manufacturing Volta? Did every other TSMC customer go 10nm while Nvidia wanted (and paid for) 12nm for some reason?
     
  20. ToTTenTranz

    Legend Veteran Subscriber

    Joined:
    Jul 7, 2008
    Messages:
    9,971
    Likes Received:
    4,565
    I thought it could be a different name for 10FF risk production. 10FF is expected to reach high-volume production in two of their gigafabs in H2 2017. 10FF is actually planned to come before 12FFC.
    However, transistor density in GV100 is very similar to the Pascal cards, so it doesn't sound like 10FF at all. TSMC claims a >50% area reduction in the 16FF+ -> 10FF transition, so a 21B-transistor chip wouldn't be this big (quick numbers below).

    TSMC's roadmap is pretty dense as it is, getting yet another completely different process doesn't sound productive for them.


    Maybe it's just 16FF+ with a few tweaks for being able to make such a huge die, and nvidia asked TSMC to call it "12FFN" because it sounds better on paper.
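    Quick sanity check on the density point, using GP100's public figures (15.3B transistors in roughly 610 mm2, numbers not quoted in this thread) against the 21.1B / 815 mm2 given above:

    ```cuda
    // density_check.cu -- compares published transistor densities; GP100
    // figures (15.3e9 transistors, ~610 mm^2) are assumed from NVIDIA's
    // Pascal disclosures, the GV100 figures are from the Anandtech quote.
    #include <cstdio>

    int main() {
        const double gp100_tr = 15.3e9, gp100_mm2 = 610.0;   // Pascal P100
        const double gv100_tr = 21.1e9, gv100_mm2 = 815.0;   // Volta V100

        double dp = gp100_tr / gp100_mm2 / 1e6;   // MTr per mm^2
        double dv = gv100_tr / gv100_mm2 / 1e6;
        std::printf("GP100: %.1f MTr/mm^2, GV100: %.1f MTr/mm^2 (+%.0f%%)\n",
                    dp, dv, (dv / dp - 1.0) * 100.0);
        // A real 16FF+ -> 10FF shrink (>50% area reduction claimed by TSMC)
        // would roughly double the density; ~3% is 16FF+-class density.
        return 0;
    }
    ```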
     