Nvidia Volta Speculation Thread

Discussion in 'Architecture and Products' started by DSC, Mar 19, 2013.

  1. silent_guy

    Veteran Subscriber

    Joined:
    Mar 7, 2006
    Messages:
    3,754
    Likes Received:
    1,379
    I don't know how important this kind of stuff is for enabling new algorithms.
    But if it is, and it sees adoption, AMD had better make sure to add the same feature.
     
  2. LiXiangyang

    Newcomer

    Joined:
    Mar 4, 2013
    Messages:
    81
    Likes Received:
    47
    I have contacted a local GPU supplier in Beijing. The listed price of a Tesla V100 is a bit cheaper than I thought, and it will become available sooner as well.

    The Tesla V100 PCIe will cost about the same as the Tesla P100 PCIe did at launch, and about 10%-20% more than the P100 costs now, and it will become available in China next month.

    As for the FP16 rate: according to their tech manager, it seems that apart from the tensor cores' mixed-precision computation, V100 doesn't have the 2x FP16 rate that P100 has. This is also confirmed by the CUDA 9.0 RC programming guide: the FP16 rate is the same as the FP32 rate. So in V100 Nvidia moved all its low-precision DL work onto the tensor cores (with better precision), which I think is a good idea.
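    For reference, a minimal sketch of the warp-level WMMA interface that CUDA 9 exposes for the tensor cores (my own example, not from the programming guide; FP16 inputs, FP32 accumulation, one 16x16x16 tile per warp):
    Code:
    // Tensor-core multiply-accumulate via the CUDA 9 WMMA API (illustrative sketch)
    #include <cuda_fp16.h>
    #include <mma.h>
    using namespace nvcuda;

    // One warp computes C = A * B + C for a single 16x16x16 tile.
    __global__ void wmma_tile(const half *a, const half *b, float *c)
    {
        wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> fa;
        wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::col_major> fb;
        wmma::fragment<wmma::accumulator, 16, 16, 16, float> fc;

        wmma::fill_fragment(fc, 0.0f);      // FP32 accumulator starts at zero
        wmma::load_matrix_sync(fa, a, 16);  // leading dimension = 16
        wmma::load_matrix_sync(fb, b, 16);
        wmma::mma_sync(fc, fa, fb, fc);     // executed on the tensor cores
        wmma::store_matrix_sync(c, fc, 16, wmma::mem_row_major);
    }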
     
    #542 LiXiangyang, Aug 14, 2017
    Last edited: Aug 14, 2017
    nnunn, pharma, CarstenS and 1 other person like this.
  3. Rufus

    Newcomer

    Joined:
    Oct 25, 2006
    Messages:
    246
    Likes Received:
    60
    Volta actually fixes livelocks, not deadlocks. Take a simplified producer/consumer example like this:
    Code:
    *flag = 1;            // flag starts raised
    if(consumer) {
      while(*flag) {}     // spinloop: wait for the producer to lower the flag
    } else { // producer
      *flag = 0;          // lower the flag, releasing the consumer
    }
    On existing GPUs, if the producer and consumer are in the same warp and the consumer (the while loop) executes first, the consumer will simply sit and spin forever. If the producer and consumer are in different warps, everything works fine.

    For Volta this Just Works, no matter where threads are or what order the program executes in.
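    To make that concrete, here's a minimal self-contained sketch of the same hand-off inside one warp (my own illustration; the kernel and variable names are made up, not from any NVIDIA sample). On pre-Volta parts the spin loop can starve the producer lane, so the kernel may never return; with Volta's independent thread scheduling both sides of the divergent branch keep making progress and it completes.
    Code:
    // intra-warp producer/consumer hand-off (hypothetical example)
    #include <cstdio>

    __global__ void handoff(volatile int *flag, int *out)
    {
        if (threadIdx.x == 0) {          // consumer lane
            while (*flag) { }            // spin until the producer clears the flag
            *out = 42;                   // hand-off observed
        } else if (threadIdx.x == 1) {   // producer lane, same warp
            *flag = 0;                   // release the consumer
        }
    }

    int main()
    {
        int *flag, *out;
        cudaMallocManaged(&flag, sizeof(int));
        cudaMallocManaged(&out, sizeof(int));
        *flag = 1;
        *out = 0;

        handoff<<<1, 32>>>(flag, out);   // one warp holds both threads
        cudaDeviceSynchronize();
        printf("out = %d\n", *out);      // prints 42 once the hand-off completes

        cudaFree(flag);
        cudaFree(out);
        return 0;
    }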
     
    BRiT likes this.
  4. 3dilettante

    Legend Alpha

    Joined:
    Sep 15, 2003
    Messages:
    8,135
    Likes Received:
    2,935
    Location:
    Well within 3d
    Couldn't it be interpreted that the producer thread is waiting for the consumer thread to yield execution priority? Each of the two threads is waiting for the other to do something, even if not through the same variable. The producer thread is static as far as its state goes, so would it be considered live enough for livelock?
     
  5. Rufus

    Newcomer

    Joined:
    Oct 25, 2006
    Messages:
    246
    Likes Received:
    60
    The producer waiting for the consumer to yield execution is exactly what's happening. The consumer thread will continuously read *flag, see it's 1, read it again, see it's 1. The producer never executes. The system as a whole is executing instructions so the system is live, but no thread is making any forward progress so the system is livelocked.

    Deadlock would be if all threads in the warp entirely stopped issuing instructions. This should not be possible on any GPU, unless there's a HW scheduler bug.
     
  6. silent_guy

    Veteran Subscriber

    Joined:
    Mar 7, 2006
    Messages:
    3,754
    Likes Received:
    1,379
    Thanks for the explanation of the difference between the two, but if we reasonably assume that most CPUs and GPUs on the market have no crippling scheduling bugs, doesn't that mean that everything we call a deadlock is actually a livelock (with the exception of, say, embedded systems where a CPU has a HALT instruction and can only be woken up by an interrupt of some sort)?

    What I mean is: deadlock or livelock, for the user the system is halted and dead. :)
     
  7. silent_guy

    Veteran Subscriber

    Joined:
    Mar 7, 2006
    Messages:
    3,754
    Likes Received:
    1,379
    I looked up the differences between deadlock and livelock:
    https://en.m.wikipedia.org/wiki/Deadlock

    IMO, from a warp point of view, it's a deadlock. There is no change in state for the warp.
    You could argue that it's a livelock from an SM point of view, since other warps can still progress, but that will eventually stop once all available warps complete, at which point the whole GPU is in deadlock as well.
     
  8. Rufus

    Newcomer

    Joined:
    Oct 25, 2006
    Messages:
    246
    Likes Received:
    60
    Point. From a software or program point of view it's deadlocked since no state is changing. From a hardware point of view it's livelocked since instructions are being issued. I'm a hardware guy so I think more on that side.

    But whatever you call it I think that producer/consumer example is the simplest way to understand the issue.
     
    nnunn likes this.
  9. silent_guy

    Veteran Subscriber

    Joined:
    Mar 7, 2006
    Messages:
    3,754
    Likes Received:
    1,379
    Me too, but I don't think I've ever heard the term used in more decades than I care to admit. ;-)
     
  10. Alessio1989

    Regular Newcomer

    Joined:
    Jun 6, 2015
    Messages:
    584
    Likes Received:
    286
    Now that Vega supports standard swizzle, I expect the same from Volta.
     
  11. Digidi

    Newcomer

    Joined:
    Sep 1, 2015
    Messages:
    226
    Likes Received:
    97
    Does anybody know what polygon output rate Volta gets? And how much Pascal has?
     
  12. CarstenS

    Veteran Subscriber

    Joined:
    May 31, 2002
    Messages:
    4,798
    Likes Received:
    2,056
    Location:
    Germany
    You really are obsessed with Polygons.
    There are three possibilities, since synthetic tests show only marginal differences between GP104 and GP102:
    A) There's a block in the driver to keep vertex data from flooding the pipeline, much like it was implemented in R600 in post-launch drivers, where the 50x geometry advantage over competing solutions just evaporated.
    B) There's a single hardware block limiting the throughput, which is not scaled/scalable between GP104 and GP102.
    C) There's another limitation somewhere in the pipeline that results in the same constraints for GP102 and GP104. The only thing I can think of here is the transfer rate between individual L2 partitions, since they need to share information about the triangles for Nvidia's distributed geometry approach, where each SM fetches one vertex every other or every third clock cycle for processing.

    I'm open to other suggestions, of course.
     
    Digidi likes this.
  13. ToTTenTranz

    Legend Veteran Subscriber

    Joined:
    Jul 7, 2008
    Messages:
    10,019
    Likes Received:
    4,602
    This puts to rest the question of where Volta's FP16 throughput comes from. The 30 TFLOPS FP16 comes from the tensor cores.

    Unless I'm reading it wrong and Volta only does 15 TFLOPS unless they're specific tensor operations? I can't find any slide claiming 30 TFLOPS FP16 for Volta; only Anandtech's article says so in its tables.

    I guess this also means the tensor cores can't work in parallel with the FP32 units (which should be able to do FP16 work as well through promotion).
     
  14. CarstenS

    Veteran Subscriber

    Joined:
    May 31, 2002
    Messages:
    4,798
    Likes Received:
    2,056
    Location:
    Germany
    The CUDA 9.0 RC doc IIRC states 64 FP16 instructions per clock per SM.
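    For scale, some back-of-the-envelope math of my own (assuming GV100's 80 SMs, roughly a 1.5 GHz boost clock, and that these are FMA-capable instructions): if those are scalar FP16 FMAs, 80 SMs × 64 × 2 FLOPs × ~1.5 GHz ≈ 15 TFLOPS, i.e. the same as the FP32 rate; only if each instruction operated on a packed half2 pair would that double to ~30 TFLOPS.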
     
    pharma likes this.
  15. ToTTenTranz

    Legend Veteran Subscriber

    Joined:
    Jul 7, 2008
    Messages:
    10,019
    Likes Received:
    4,602
    Which suggests the FP32 CUDA cores are promoting FP16 to FP32.
    Then either Anandtech's article about Volta is wrong, or CUDA 9 isn't giving access to the tensor cores to perform non-tensor FP16 calculations.
     
    CarstenS likes this.
  16. CarstenS

    Veteran Subscriber

    Joined:
    May 31, 2002
    Messages:
    4,798
    Likes Received:
    2,056
    Location:
    Germany
    Anandtech's piece was written immediately upon launch; maybe the information was not as clear-cut at that point. To my knowledge, neither the CUDA 9.0 RC doc nor the Volta whitepaper hints at Pascal-GP100-style double-rate FP16 outside the tensor cores. Table 1 on page 10 only gives numbers for FP32, FP64 and tensor FLOPS.

    Maybe (and this is just a wild guess) tensor cores will stay GV100-exclusive, while the gaming-oriented brethren get 2× rate FP16 to partially compensate for the lack of tensor cores. Or Nvidia will completely ignore FP16 for gaming, which would be interesting to see once the first games make use of it.
     
  17. keldor314

    Newcomer

    Joined:
    Feb 23, 2010
    Messages:
    132
    Likes Received:
    13
    Just to make things interesting, the above code may or may not actually terminate; it all depends on whether the compiler decides to schedule the then or the else branch first. If it schedules the else first, the producer will release the lock and the consumer will happily go on to the next step. Otherwise, deadlock. Since the ordering of mutually exclusive branches (only meaningful on SIMT processors) is undefined, who knows what will happen. People have reported that in CUDA the else clause is in fact executed first. Unless the optimizer decides to... Yeah, there's a reason that intra-warp locks produce very brittle code in the very best case.
     
    silent_guy likes this.
  18. xpea

    Regular Newcomer

    Joined:
    Jun 4, 2013
    Messages:
    372
    Likes Received:
    309
    fellix, Alexko, tinokun and 8 others like this.
  19. xpea

    Regular Newcomer

    Joined:
    Jun 4, 2013
    Messages:
    372
    Likes Received:
    309
    Tom's Hardware just published its own article about the Volta presentation at Hot Chips:
    http://www.tomshardware.com/news/nvidia-volta-gv100-gpu-ai,35297.html
     
    Grall, Heinrich4, Malo and 2 others like this.
  20. xpea

    Regular Newcomer

    Joined:
    Jun 4, 2013
    Messages:
    372
    Likes Received:
    309
    Good podcast about Volta and CUDA C++ (interesting stuff starts around 20 minutes):
    https://player.fm/series/cppcast/volta-and-cuda-c-with-olivier-giroux

     