Nvidia Volta Speculation Thread

Discussion in 'Architecture and Products' started by DSC, Mar 19, 2013.

  1. CSI PC

    Veteran Newcomer

    Joined:
    Sep 2, 2015
    Messages:
    2,050
    Likes Received:
    844
    More info on the NVSwitch: https://www.nextplatform.com/2018/03/27/nvidia-memory-switch-welds-together-massive-virtual-gpu/
    Summary.
     
    #1101 CSI PC, Mar 27, 2018
    Last edited: Mar 27, 2018
  2. 3dilettante

    Legend Alpha

    Joined:
    Sep 15, 2003
    Messages:
    8,095
    Likes Received:
    2,814
    Location:
    Well within 3d
    It seems like the high bandwidth and shared memory space enabled by having NVLink connecting everything is the marketing point. At a device level I'm not sure it's fully transparent.
     
  3. CSI PC

    Veteran Newcomer

    Joined:
    Sep 2, 2015
    Messages:
    2,050
    Likes Received:
    844
    It is meant to be a unified memory space, so I assume it has to be transparent to minimise overheads <[edit] listened back and he mentions all GPUs using the same memory model>.
    It also needs to be a neat solution for the cache coherency supported by Volta.
    But you're right, not much else can be said for now as it is a high-level presentation.


    Edit:
    Past tense to current.
    The transparency level to me comes down to the NVLink protocol in the fabric switch (if it is truly a non-blocking crossbar).
     
    #1103 CSI PC, Mar 27, 2018
    Last edited: Mar 28, 2018
  4. Grall

    Grall Invisible Member
    Legend

    Joined:
    Apr 14, 2002
    Messages:
    10,801
    Likes Received:
    2,170
    Location:
    La-la land
    Memory access latency must surely be pretty rotten, with GPUs talking to each other across a 2-billion-transistor switching chip connected by a bunch of (probably fairly long) cables.
     
  5. Erinyes

    Regular

    Joined:
    Mar 25, 2010
    Messages:
    647
    Likes Received:
    92
    Turing hasn't yet taped out and is expected back from the fab any day now, apparently. Expected launch is late Q3, so the PC gaming market will at least have something for 2018. My source indicated that next is 7nm Ampere, due sometime in H1'19, and that 7nm gaming GPUs will be delayed given initial 7nm wafer availability and costs.
     
    Pixel, Samwell, xpea and 3 others like this.
  6. McHuj

    Veteran Regular Subscriber

    Joined:
    Jul 1, 2005
    Messages:
    1,410
    Likes Received:
    529
    Location:
    Texas
    I'm assuming you meant has taped out if it's expected to be back any day. Tape-out to working silicon can be 10-12 weeks with modern FinFET processes.

    Anyways, this makes me sad, it's been almost 2 years since I got my Pascal. I just hope the new boards deliver.
     
  7. CSI PC

    Veteran Newcomer

    Joined:
    Sep 2, 2015
    Messages:
    2,050
    Likes Received:
    844
    I was wondering about latency myself and was going to post that as well (it will be low, although still present and mostly masked given the operations it is used for), but I am not sure it is something that will truly be made public; the only ones who will know are clients with the DGX-2 or those looking at the NVSwitch beyond this in the future.
    The reason for the transistor count is that a crossbar is a pretty inefficient design (in number of connections, as done by Nvidia) when implemented as non-blocking.
    For the longer cables outside the NVSwitch, the best bet is to look at existing NVLink latency with V100 in the more normal DGX-1 implementation, or maybe the Quadro GV100, which can do a pair of NVLinks.

    That said, we may get more figures from them comparing the DGX-2 to, say, the DGX-1, which very indirectly gives us an idea of whether latency is a burden; one example they gave is that it is 10x faster than the DGX-1, I think for a huge training dataset, due to the unified memory space and the way the NVSwitch connects all the dGPUs - it has double the V100s but the performance difference is vastly more than that, and there latency is a non-issue.
    Yeah, such tests will be carefully chosen :) but still probably the best the public will see.

    Edit:
    Regarding latency and BW, possibly the biggest aspect will be the fabric crossbar switch connecting the two stacked enclosures (each with 8 GPUs); I forgot to mention that aspect.
     
    #1107 CSI PC, Mar 28, 2018
    Last edited: Mar 28, 2018
  8. Erinyes

    Regular

    Joined:
    Mar 25, 2010
    Messages:
    647
    Likes Received:
    92
    Yes that's what I meant. And you are correct, it can take that long. I'm actually quite curious to see how they perform. The fact that NV has finally gone for distinct gaming and compute focused architectures is good for both segments.
     
  9. Anarchist4000

    Veteran Regular

    Joined:
    May 8, 2004
    Messages:
    1,439
    Likes Received:
    359
    Latency is also a secondary concern for parallel architectures. It doesn't seem all that different from what Epyc did with Infinity Fabric, and situationally it's a superior solution.

    It probably isn't a crossbar but a mesh fabric like what AMD did with Infinity. Both of those techniques likely came out of DoE research on HPC. I'd imagine there is a fair chunk of SRAM on that chip to buffer all the connections, which accounts for much of the transistor count. Current Xeons wouldn't have the link bandwidth (though they probably could with proprietary signaling) or enough lanes; perhaps the upcoming scalable designs will. Epyc with IF might work, but I could see that pairing being problematic and a proprietary solution being more in Nvidia's interest.
     
  10. RecessionCone

    Regular Subscriber

    Joined:
    Feb 27, 2010
    Messages:
    499
    Likes Received:
    177
    The full Volta memory model is supported across all remote GPU memories connected by NVSwitch. You can dereference any pointer without doing any work in software to figure out where in the system that pointer points. You can use atomics.

    It’s not transparent to the GPUs themselves - obviously they have to have page tables that disambiguate the memory reference. But it’s all done in hardware, so from the programmer’s viewpoint it really does look like one memory.
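    To make that concrete, here is a minimal CUDA sketch (my own illustration, not something from Nvidia's material) of the "dereference any pointer, use atomics" model on a two-GPU, peer-capable box: GPU 0 owns the allocation, GPU 1 runs a kernel that atomically increments it through the same pointer, and the page tables plus NVLink/NVSwitch do the routing in hardware.

    // Illustrative sketch only: assumes two peer-capable GPUs (e.g. NVLink-connected
    // V100s); error checking omitted for brevity.
    #include <cstdio>
    #include <cuda_runtime.h>

    __global__ void remoteAtomicAdd(int *counter, int n)
    {
        int idx = blockIdx.x * blockDim.x + threadIdx.x;
        if (idx < n)
            atomicAdd(counter, 1);  // atomic on memory physically resident on another GPU
    }

    int main()
    {
        int *counter = nullptr;

        cudaSetDevice(0);
        cudaMalloc(&counter, sizeof(int));   // allocation lives in GPU 0's HBM2
        cudaMemset(counter, 0, sizeof(int));

        cudaSetDevice(1);
        cudaDeviceEnablePeerAccess(0, 0);    // map GPU 0's memory into GPU 1's page tables

        remoteAtomicAdd<<<4, 256>>>(counter, 1024);  // GPU 1 dereferences GPU 0's pointer
        cudaDeviceSynchronize();

        int result = 0;
        cudaMemcpy(&result, counter, sizeof(int), cudaMemcpyDeviceToHost);
        printf("remote atomic adds landed on GPU 0: %d\n", result);  // expect 1024
        return 0;
    }

    The point being that nothing in the kernel cares where the counter lives; scale the same idea across 16 GPUs behind NVSwitch and you get the single-memory view described above.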
     
  11. Samwell

    Newcomer

    Joined:
    Dec 23, 2011
    Messages:
    112
    Likes Received:
    129
    Have you heard anything about the process Turing is made on? Late Q3, with some products probably Q4, sounds very late for 12nm. 7nm products should be possible in Q3'19, which gives the products less than a year of life. Could it maybe be 10nm? That way Nvidia could skip the normal 7nm process and go directly to 7nm EUV.
     
  12. CSI PC

    Veteran Newcomer

    Joined:
    Sep 2, 2015
    Messages:
    2,050
    Likes Received:
    844
    Not sure, Anarchist, because Nvidia happily mention when they are using a mesh, and everywhere they are calling this a non-blocking crossbar switch (that is the crux of it and why I say "if" in the other posts), along with the transistor count and requirements for a crossbar.
    With the limited information so far it seems to me it is a different approach to AMD's and possibly more similar to another IHV's.

    Yeah, it seems Nvidia's approach came indirectly from DoE work, and I cannot help but think it was inspired by what IBM does internally with the Power9 fabric switch, which could manage 7TB/s aggregate on the on-chip switch (I need to go back and look at some of IBM's presentations).
    I would be hesitant to say which is superior, as there are also aspects of NVSwitch that have advantages, but it is for a dedicated and specific purpose.
     
    #1112 CSI PC, Mar 29, 2018
    Last edited: Mar 29, 2018
  13. Anarchist4000

    Veteran Regular

    Joined:
    May 8, 2004
    Messages:
    1,439
    Likes Received:
    359
    Not saying either is superior, just that they are variations on the same basic idea scaled differently. Both likely have merits for different applications and different requirements for flexibility. IF is restricted to PCIe signaling specs but works with a wider variety of hardware; NVSwitch is likely limited to a specific generation of GPU, with bandwidth maximized. A mesh is technically still a bunch of crossbars. IBM, InfiniBand, etc. are all similar network topologies for the most part.

    The non-blocking part is interesting, as the network will encounter congestion eventually and non-blocking only makes sense with exclusive access or significantly over-specced bandwidth. I'm guessing that's only in reference to cache access on another GPU, not shared memory, although the switch bandwidth and HBM2 bandwidth are equivalent.
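    For reference on that last point, using the published figures (so this is illustrative arithmetic rather than anything Nvidia has broken down for congestion behaviour): an NVSwitch has 18 NVLink ports at roughly 50GB/s bidirectional each, i.e. around 900GB/s of aggregate switch throughput, which is indeed in the same ballpark as the ~900GB/s of HBM2 bandwidth on a single V100.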

    As these topologies scale there will eventually be multiple hops to reach a destination. The result will be some sort of NUMA configuration.
     
  14. CSI PC

    Veteran Newcomer

    Joined:
    Sep 2, 2015
    Messages:
    2,050
    Likes Received:
    844
    You do not need NUMA or an equivalent for a cascaded crossbar switch; scaling comes down to how well NVLink works, along with various other design limitations (such as a limit to cascading, still needing a CPU:accelerator ratio for optimal performance, BW/connection limitations, etc.).
    Regarding the memory consideration you raise, non-blocking would cover all aspects of memory that can be associated with the current NVLink and Volta architecture; the limitation (more to do with BW/connections and some latency), and where more information is required, is how the enclosures are interconnected (still in the context of a crossbar switch).
    Most of this is academic anyway, because the DGX-2 "node" scaled out is enough for quite a while in the current setup, and a big jump in performance for where Nvidia is aiming it (HPC-related analytics, databases, science, modelling, DL).

    Just to add, like I mentioned earlier a lot will not be made public, so one needs to rely upon actual performance results compared to the DGX-1, such as when they used a massive dataset for training and the DGX-2 was 10x faster (Nvidia will be careful with examples, but most would be in line with what the product is for anyway).


    Edit:
    Expanded upon points.
     
    #1114 CSI PC, Mar 29, 2018
    Last edited: Mar 29, 2018
  15. pharma

    Veteran Regular

    Joined:
    Mar 29, 2004
    Messages:
    2,831
    Likes Received:
    1,541
    http://www.mediatechreviews.com/nvidia-dgx-2-largest-gpu/
     
  16. pharma

    Veteran Regular

    Joined:
    Mar 29, 2004
    Messages:
    2,831
    Likes Received:
    1,541
    Lightman, Grall and CSI PC like this.
  17. CSI PC

    Veteran Newcomer

    Joined:
    Sep 2, 2015
    Messages:
    2,050
    Likes Received:
    844
    Nice to see another comparison on their site beyond the massive-dataset training performance mentioned in the keynote, one closer to more traditional workloads (though the IFS benchmark still does not indicate the application).
     
    #1117 CSI PC, Mar 29, 2018
    Last edited: Mar 29, 2018
  18. CSI PC

    Veteran Newcomer

    Joined:
    Sep 2, 2015
    Messages:
    2,050
    Likes Received:
    844
    So the reason for 6 NVSwitches per baseboard is that it is the ideal ratio for NVLink GPUs with Volta (6 bricks/ports on a single V100), while also allowing full non-blocking to the other baseboard.
    The NVSwitches on one baseboard seem to share the same control plane with co-ordinated switching fabrics; this allows the single switch hop on the same baseboard while aggregating each GPU across all NVSwitches. With Volta having 6 NVLink bricks (ports), each brick is connected to a different NVSwitch (hence 6 NVSwitches), meaning all GPUs on one baseboard sit on all the same NVSwitches together at full bandwidth.

    The architecture allows the individual NVSwitches to combine (possible as the 6 switches already share the same control plane) and link-aggregate as a whole, meaning 300GB/s to the GPU even though a single GPU's "lanes" are spread amongst the 6 NVSwitches.
    The closest network switch example would be Multi-Chassis Link Aggregation; the advantage for Nvidia is that NVLink and Nvidia's associated proprietary protocols/algorithms (such as the Exchange/Sort hashing functions) are used all the way through the architecture, from GPU device to switch to GPU device. The limitation to scaling, like I mentioned earlier, would be NVLink itself, connections (ports), and the cascade limit / Multi-Chassis Link Aggregation-equivalent limit.
    But Nvidia would realistically stay with a similar topology going forward, and it would have big advantages even for the DGX-1 or any HPC solution with more than 4 GPUs per node (the point after which they are split/balanced between NUMA CPUs); trickle-down later on makes sense, but how much it impacts price also needs to be seen.

    With the bisection bandwidth now public, it is clear this is a true non-blocking crossbar switch design between the baseboards as well, each housing 8 GPUs, at the full BW potential of NVLink.
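    As a rough sanity check on those numbers using the published NVLink 2 figures (illustrative arithmetic on my part, not an Nvidia breakdown): each Volta brick is ~25GB/s per direction, ~50GB/s bidirectional, so 6 bricks x ~50GB/s gives the ~300GB/s per V100; and with 18 ports on each NVSwitch, 8 can face the GPUs on its own baseboard and 8 the matching switch on the other baseboard, which is what keeps the baseboard-to-baseboard path at full NVLink bandwidth.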

    Edit:
    If looking into Multi-Chassis Link Aggregation, I would be wary of using the wiki explanation as a complete understanding, compounded by the fact that it can be highly proprietary in implementation, more so if one has relevant intellectual property applicable to both device and switch.
     
    #1118 CSI PC, Apr 1, 2018
    Last edited: Apr 1, 2018
    pharma and iMacmatician like this.
  19. CSI PC

    Veteran Newcomer

    Joined:
    Sep 2, 2015
    Messages:
    2,050
    Likes Received:
    844
    More public information regarding NVSwitch: https://www.nextplatform.com/2018/04/13/building-bigger-faster-gpu-clusters-using-nvswitches/
    What is missing from this is the control plane information between the NVSwitches.
    The summary is 2 hops on the same board and 3 hops max to any GPU, and the bisection BW of 2.4TB/s pretty clearly means 8 lanes/ports per switch connecting to the other baseboard's NVSwitches for the non-blocking behaviour and low latency/hop count.
    One aspect still to consider is GPU saturation, where 1 GPU communicates simultaneously with more GPUs than the number of bricks/ports it has (6); management of this comes down to Nvidia's NVLink and coherent Unified Memory.
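    To put a rough number on that saturation case (illustrative arithmetic again, based on the ~300GB/s of NVLink per V100): a single GPU talking to all 15 peers at once still only has its 6 bricks of ingress/egress, so on average something like 300GB/s / 15, i.e. ~20GB/s per peer, which is where the fabric and the coherent Unified Memory management have to earn their keep.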

    Separately, but worth noting, elite partners are now pushing out information and looking to take orders for the DGX-2.
    It will be interesting to see how soon a certified solution based upon the NVSwitch/baseboard is launched; several elite partners have been certified to do this in the past with NVLink and Pascal/Volta.
     
    #1119 CSI PC, Apr 14, 2018
    Last edited: Apr 14, 2018
    ImSpartacus, nnunn and pharma like this.
  20. nnunn

    Newcomer

    Joined:
    Nov 27, 2014
    Messages:
    28
    Likes Received:
    23
    Webinar next Wednesday Apr 18, 2018 09:00 AM PDT:

    "NVIDIA DGX-2 - Breaking the Barriers to AI-Scale in the Enterprise"

    http://www.nvidia.com/object/webinar-portal.html
     
    CSI PC and pharma like this.