Nvidia Volta Speculation Thread

Discussion in 'Architecture and Products' started by DSC, Mar 19, 2013.

Tags:
  1. CarstenS

    Veteran Subscriber

    Joined:
    May 31, 2002
    Messages:
    4,807
    Likes Received:
    2,073
    Location:
    Germany
    Thanks, I know Jonah. :) But I think this might also be a matter of how the question was asked, since he is very good at answering exactly what you ask him.

    It is my understanding that operands for any kind of ALU do not pop out of thin air, but must be fetched from somewhere or computed beforehand. I do not think that FP16 units are fed by a register file of their own. And a bit is a bit.
     
    #421 CarstenS, Jun 8, 2017
    Last edited: Jun 8, 2017
  2. OlegSH

    Regular Newcomer

    Joined:
    Jan 10, 2010
    Messages:
    365
    Likes Received:
    257
  3. silent_guy

    Veteran Subscriber

    Joined:
    Mar 7, 2006
    Messages:
    3,754
    Likes Received:
    1,379
    Do you mean: make a single FP64 core also capable of doing multiple FP16 ops?

    Everything is possible. The same pro and con arguments apply. The GTC Inside Volta slides don't hint at anything AFAICS.
     
    Anarchist4000 and CarstenS like this.
  4. CarstenS

    Veteran Subscriber

    Joined:
    May 31, 2002
    Messages:
    4,807
    Likes Received:
    2,073
    Location:
    Germany
    What I am trying to get my head around is this:
    Looking at it at the quarter-SM level, you have

    - L0 I-Cache
    - Warp Scheduler (Input select)
    - Dispatch (output select)
    - 1× 64 KiB Register File (probably sized to feed 8× FP64 at full speed)
    - 8× Load/Store Units (probably sized to feed 8× FP64 at full speed)
    - 8× FP64 FMA
    - TONS of additional adders and multipliers in both INT and FP fashion.
    and then you go in there and add another shitload of FP16 multipliers and adders.

    And on top of that, you have all those things connected with minimum latency to the same register file ports. For my layman's mind this just does not make sense, and I am trying to get a kind of understanding that does not make my head hurt. :)
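    As a rough sanity check on the "sized to feed 8× FP64 at full speed" guess, here is the operand arithmetic as a small sketch (the operand counts and widths are standard FMA facts; everything about porting and banking is an assumption on my part):

    Code:
    // Back-of-the-envelope operand bandwidth for one quarter-SM (assumptions, not vendor data).
    #include <cstdio>

    int main() {
        const int fp64_fmas        = 8;  // FP64 FMA units in the quarter-SM list above
        const int operands_per_fma = 3;  // d = a * b + c reads three source operands
        const int bytes_per_fp64   = 8;

        // Peak read traffic needed to keep all FP64 FMAs busy every cycle.
        const int bytes_per_cycle = fp64_fmas * operands_per_fma * bytes_per_fp64;  // 192 B/cycle
        printf("FP64 FMA operand reads: %d bytes/cycle out of the 64 KiB register file\n",
               bytes_per_cycle);

        // "A bit is a bit": the same 192 B/cycle re-read as FP16 operands is 96 values/cycle,
        // which is why extra FP16 ALUs don't automatically imply a wider register file.
        printf("Same bandwidth expressed as FP16 operands: %d values/cycle\n", bytes_per_cycle / 2);
        return 0;
    }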
     
    nnunn likes this.
  5. CSI PC

    Veteran Newcomer

    Joined:
    Sep 2, 2015
    Messages:
    2,050
    Likes Received:
    844
    That is what Jonah specifically mentioned in the past as not being possible with P100, due to the massive pressure on registers/bandwidth.
    Maybe this has changed with Volta, but if it had changed enough to make this possible, I would think Nvidia would mention it *shrug*.

    Cheers
     
    #425 CSI PC, Jun 8, 2017
    Last edited: Jun 8, 2017
  6. psurge

    Regular

    Joined:
    Feb 6, 2002
    Messages:
    940
    Likes Received:
    36
    Location:
    LA, California
    GTC slide 50 labels the output of summing fp32 accumulators and fp16 full precision products with "Convert to FP32 result". This slide is not exactly clarity incarnate, so this may well be a stretch, but if the accumulated value is already FP32, why would a conversion be necessary?

    Would it make any sense to use integer accumulators (not software visible) for the fp16 x fp16 products? It seems like all possible fp16 x fp16 products and sums of 4 such products would fit in an 82 bit fixed-point accumulator. How would adding such wide integers compare to adding fp32 values (which, from what I understand, involves mantissa shifts and rounding logic), in terms of area, power, and latency? I'm not a HW person, so it's totally non-obvious to me. I'm just wondering if something along those lines might be a valid reason not to use other FP32 SM resources for at least some of the accumulation process.
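    Working the 82-bit figure through, for what it's worth (my own arithmetic from the IEEE FP16 ranges, just to check the claim):

    Code:
    // Sanity check of the "82-bit fixed-point accumulator" figure (my arithmetic, nothing official).
    #include <cstdio>

    int main() {
        // FP16 magnitudes: smallest subnormal is 2^-24, largest finite is 65504 < 2^16.
        // An exact fp16 x fp16 product therefore spans weights from 2^-48 up to just under 2^32.
        const int frac_bits  = 48;  // bit positions below the binary point
        const int int_bits   = 32;  // bit positions at or above the binary point
        const int carry_bits = 2;   // headroom for summing 4 such products (factor of 4 = 2 bits)

        printf("fixed-point width for one exact product: %d bits\n", frac_bits + int_bits);  // 80
        printf("width for a sum of 4 products: %d bits (plus a sign bit)\n",
               frac_bits + int_bits + carry_bits);                                           // 82
        return 0;
    }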

    Also, does anyone know if the 4x4 matrix ops are actually programmer visible at the thread level? Or is it just a (warp-level) 16x16 matrix FMA?
     
  7. CSI PC

    Veteran Newcomer

    Joined:
    Sep 2, 2015
    Messages:
    2,050
    Likes Received:
    844
    Not sure you can get an accurate answer for now, as it is classified as a Preview Feature in CUDA 9, so more changes are coming.
    I have only seen it mentioned at the warp level so far, as the WMMA API (Warp Matrix Multiply Accumulate), even in the separate CUDA presentation.
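    For reference, the CUDA 9 preview exposes it roughly like this (a minimal sketch using the documented nvcuda::wmma warp-level intrinsics; the kernel itself is just an illustration, not Nvidia's sample):

    Code:
    // Warp-level WMMA sketch: one warp computes a 16x16 tile, D = A*B + C,
    // with fp16 inputs and fp32 accumulation (CUDA 9 preview API, sm_70).
    #include <mma.h>
    #include <cuda_fp16.h>
    using namespace nvcuda;

    __global__ void wmma_16x16x16(const half *a, const half *b, float *d) {
        wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> a_frag;
        wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::col_major> b_frag;
        wmma::fragment<wmma::accumulator, 16, 16, 16, float> acc_frag;

        wmma::fill_fragment(acc_frag, 0.0f);                  // C = 0
        wmma::load_matrix_sync(a_frag, a, 16);                // the whole warp cooperates on the load
        wmma::load_matrix_sync(b_frag, b, 16);
        wmma::mma_sync(acc_frag, a_frag, b_frag, acc_frag);   // warp-wide matrix multiply-accumulate
        wmma::store_matrix_sync(d, acc_frag, 16, wmma::mem_row_major);
    }

    Note that the fragments are opaque and all the calls are warp-synchronous, so nothing 4x4 is visible to an individual thread, which matches the warp-level-only reading.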
    Cheers
     
    #427 CSI PC, Jun 8, 2017
    Last edited: Jun 9, 2017
  8. 3dilettante

    Legend Alpha

    Joined:
    Sep 15, 2003
    Messages:
    8,137
    Likes Received:
    2,939
    Location:
    Well within 3d
    I'm not sure how the following would be interpreted:
    http://techreport.com/review/28499/amd-radeon-fury-x-architecture-revealed/2
    It's worded as if Fiji's sizing was up to the limit of the interposer's reticle. I thought AMD might have fudged things by having the chips on the interposer overhang onto areas not exposed for 65nm patterning.

    Some thoughts I had on the customized units were that while there is the truism that more silicon driven more slowly is better, more and longer wires are consistently not.
    Perhaps more accurate accounting needs to take that into account, in order to differentiate between two tightly optimized units and one larger unit, and whether the tradeoff in extra sequencing, signal travel, leakage, and other pitfalls of complexity can shift the balance. The granularity of power gating is usually coarser, such as at the SIMD block level, and its effectiveness may be hampered if the gating had to be integrated at sub-unit granularity on a block that cannot fully idle.

    Knowing that only a specific sequence of operations will occur in a physical space can remove a lot of mystery as to what wires need to go where. For example, if it's known that the adder phase isn't ever going to forward its results to the adder inputs, the option, its wires, and the multiplexing in the path can be removed.

    One item I think might be a win with a dedicated unit is designing it to minimize the impact of data amplification.
    If Nvidia's description is accurate in that there are 64 multiplications in parallel, each element is used 4 times--and in this thread there is the claim that it's simpler and more efficient to go from 16x4 to 16x16.
    The casual use of the word "forwarding" implies crossing the edges of pipeline stages, and if using the general units it follows that their pipeline latches, forwarding networks, and any lane-crossing paths stand to grow by up to an order of magnitude.
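    To put numbers on that amplification, the reference form of the 4x4 op as described (D = A*B + C) is just this loop nest; it shows the 64 multiplies and the 4x reuse of every input element, and says nothing about how the tensor unit is actually wired:

    Code:
    // Reference 4x4x4 mixed-precision matrix FMA: D = A*B + C.
    // 64 fp16 multiplies accumulated in fp32; each of the 16 elements of A and of B
    // is read 4 times, which is the data amplification discussed above.
    #include <cuda_fp16.h>

    __device__ void mma_4x4_reference(const __half A[4][4], const __half B[4][4],
                                      const float C[4][4], float D[4][4]) {
        for (int i = 0; i < 4; ++i)
            for (int j = 0; j < 4; ++j) {
                float acc = C[i][j];                    // fp32 accumulator
                for (int k = 0; k < 4; ++k)             // 4 products per output element
                    acc += __half2float(A[i][k]) * __half2float(B[k][j]);
                D[i][j] = acc;                          // 16 outputs, 64 multiplies in total
            }
    }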

    I would have thought the SMs were already under some pressure for wiring congestion, if the design is optimized for density and the clock speeds Nvidia's GPUs have reached for. Volta's plain single-lane FMAs operate at a 4 cycle latency, which might not have happened if several adders, additional cross-lane permutations, and 4-16x the bypassing were hanging off of it.
    Not knowing how many stages could be drive stages, whatever number of stages need to latch the 4-16x more 32-bit values would expand the lane.
    It's why I'm leery of using the existing critical path for methods that could generate KBs of extra context and run thousands of extra wires into the existing paths. That's more layers of logic, and I have doubts that the SMs are so free of congestion that they can swallow that many more wires without losing density, adding repeaters or more logic, and possibly losing a significant amount of clock speed.

    Potentially, a tensor unit could calculate what it could for shifting and control words on the 16 input elements first, then use bespoke logic that can only duplicate and broadcast the operands for this specific operation to the multipliers within a clock cycle.
    Since there is no mystery as to where the results must go physically, the 64 outputs from the multipliers could have physically direct and short paths into the adders--whatever form they might take.
    I would figure that the operation depth would be reduced with adders that took more than 2 inputs, and by adding things in parallel to the point that there's 32 or 16 values that need to move to the next pipe stage.

    A less than fully pipelined tensor unit or some kind of skewed pipelining might let the tensor unit dispense with intermediate storage. I'm not sure if the whole matrix operation could truly be fit into a general FP unit's clock cycle, but the dedicated tensor unit might not need the FP unit's clock cycle--and the FP unit's clock cycle wouldn't need to fit the tensor.
    Any non-standard timings or behaviors could be aided by the unit being separate and having dedicated sequencing logic, rather than expanding the general-purpose pipeline's sequencing behavior that already covers the behaviors of standard instructions.
     
  9. silent_guy

    Veteran Subscriber

    Joined:
    Mar 7, 2006
    Messages:
    3,754
    Likes Received:
    1,379
    Yes, I was wondering about that too. The information is a bit contradictory.

    http://www.memcon.com/pdfs/proceedings2014/NET104.pdf

    When you look at the slides above, you can see that the HBM ball-out area is 6mm × 3.2mm. The 3.2mm is considerably less than the 5.5mm width of the die.

    Now check out this die shot:

    www.flickr.com/photos/130561288@N04/28917073604/in/photostream

    Notice how the gold colored section that goes underneath the HBM die doesn't cover the HBM die completely.

    It's possible that this gold colored section marks the region where lithography was applied.
     
  10. Gipsel

    Veteran

    Joined:
    Jan 4, 2010
    Messages:
    1,620
    Likes Received:
    264
    Location:
    Hamburg, Germany
    They did. The interposer actually measures 1010mm² or something in that range, but the exposed area is still within the reticle limit (26mm × 32mm). The data connections for the HBM stacks are at the very edge of that area; the stacks themselves are partially outside. That can be seen pretty well in the photo silent_guy has shown.

    @silent_guy:
    Where do you see a contradiction?
     
  11. Anarchist4000

    Veteran Regular

    Joined:
    May 8, 2004
    Messages:
    1,439
    Likes Received:
    359
    That's the full core though. With the forwarding involved and accumulation, much of the pressure should be removed. Remember that an accumulator won't actually touch the register file, and that is the part dealing with FP32-sized data. A single vec2 operand could sustain the entire operation. What I've been proposing is dedicated FP16 muls at double speed forwarding to any FP32 accumulators that can be found. L0 would be like a forwarding network and possibly already shared between units. The arrangement might already be FP64, FP32, etc. lanes adjacent to each other with short, direct connections.

    It wouldn't have to be multiple. A single FP32/16 FMA op should be relatively straightforward. FP32 capable accumulators would seem the concern.

    Not register file ports, but simple single ported latches. Electrically they're identical, but emphasis on simple. Multiple, independent register files might make more sense. Accumulation and forwarding wouldn't be writing out results as they ultimately end up elsewhere. At a future time there would be an instruction to move the data back into the main RF.

    As mentioned above, pressure and bandwidth shouldn't be a concern here.
     
  12. ninelven

    Veteran

    Joined:
    Dec 27, 2002
    Messages:
    1,709
    Likes Received:
    122
    Bro... the math just does not add up.
     
  13. silent_guy

    Veteran Subscriber

    Joined:
    Mar 7, 2006
    Messages:
    3,754
    Likes Received:
    1,379
    The Hot Chips slides, where they say that they exceeded reticle limits.
     
  14. Kaotik

    Kaotik Drunk Member
    Legend

    Joined:
    Apr 16, 2003
    Messages:
    8,261
    Likes Received:
    1,949
    Location:
    Finland
    The interposer exceeds the reticle limit, but not the part where lithography was applied, which fits inside the limit. The rest is just blank silicon.
     
    Alexko and Lightman like this.
  15. CSI PC

    Veteran Newcomer

    Joined:
    Sep 2, 2015
    Messages:
    2,050
    Likes Received:
    844
    The discussion was about what Nvidia does now (OK, there seem to be multiple tangents in the thread, but my response was regarding the existing operation).
    How is the pressure removed?
    It is similar to doing DP4A (Int8 on FP32 CUDA cores); how can you not use the whole core if you want to improve throughput by running an FP16 operation over the FP64 core?
    Even DP2A uses the whole FP32 core.
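    For comparison, this is how DP4A occupies a full 32-bit lane today (the __dp4a intrinsic is real, sm_61 and up; the kernel around it is just an illustration):

    Code:
    // DP4A: 4-way int8 dot product with int32 accumulate, one instruction per FP32-class lane (sm_61+).
    __global__ void dp4a_example(const int *a_packed, const int *b_packed, int *out, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) {
            // Each int carries four packed signed 8-bit values; the core's whole 32-bit
            // datapath is busy doing the four multiplies plus the accumulate.
            out[i] = __dp4a(a_packed[i], b_packed[i], out[i]);
        }
    }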

    Jonah explicitly said you cannot use the FP64 cores in the way mentioned by the two earlier posts on P100, due to massive register/BW pressure, and that is ignoring the next ideal stage of using both FP64 and FP32 cores simultaneously for packed FP16 operation/computation.
    It really feels like there is a blurring between Nvidia and AMD here; are we talking about what is actually being done, or a theoretical alternative that for whatever reason neither has done?
    Cheers
     
    #435 CSI PC, Jun 9, 2017
    Last edited: Jun 9, 2017
  16. Anarchist4000

    Veteran Regular

    Joined:
    May 8, 2004
    Messages:
    1,439
    Likes Received:
    359
    The pressure would be removed IF you were caching the operands or sourcing fewer of them. The FP32 portion accounts for half of the total pressure without accumulation. 32x FP16 operands and 16x FP32 for the adder. With accumulation ONLY 32x FP16 elements would need to be sourced using the equivalent of 2 ports. The hard part of the operation is the FP32 addition, so getting the FP32 and FP64 cores wired in might make sense there.
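    Counting the operand traffic for one 4x4 op the way I read it (all assumptions on my side, just to make the pressure argument concrete):

    Code:
    // Operand traffic for one 4x4 mixed-precision matrix FMA (my accounting, not Nvidia's).
    #include <cstdio>

    int main() {
        const int fp16_inputs = 16 + 16;           // A and B: 16 fp16 elements each
        const int fp32_inputs = 16;                // C: 16 fp32 accumulator values
        const int fp16_bytes  = fp16_inputs * 2;   // 64 B
        const int fp32_bytes  = fp32_inputs * 4;   // 64 B -> half of the total traffic

        printf("sourcing everything from the RF: %d B per op (%d B fp16 + %d B fp32)\n",
               fp16_bytes + fp32_bytes, fp16_bytes, fp32_bytes);
        printf("accumulators held in the unit:   %d B per op (fp16 only)\n", fp16_bytes);
        return 0;
    }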

    FP64/32 cores should be able to perform unpacked FP16 math with very little modification. The only issue should come down to rounding not meeting the spec. With that understanding they should be able to avoid the pressure as they don't need FP64 values. Big difference between two or three FP16 and FP64 operands being sourced. The question is what the hardware can do versus what is being exposed. With Pascal there wasn't any emphasis on FP16 multiplication feeding into FP32 adders/accumulators for Tensor operations which blurs the line. They are now performing an operation wanting both FP16 and FP32 cores concurrently. Ideally without going through the register file. The warp level matrix operations for Tensors to my understanding don't require any new hardware for execution. Just the ability to forward the result, ideally through an internal cache. I'm unsure if Pascal's operand caches are shared across units. LDS could probably do this, but I'm not sure that's the optimal method.

    I'm not an expert in tensor math, but if you are accumulating in FP32 and multiplying in FP16, that FP32 result probably isn't being fed into the multiplication. Which would indicate the elements are likely cumulative and there is no need to immediately sum a row before continuing. Accumulate each element and sum when you finish.
     
    #436 Anarchist4000, Jun 9, 2017
    Last edited: Jun 9, 2017
  17. kalelovil

    Regular

    Joined:
    Sep 8, 2011
    Messages:
    555
    Likes Received:
    93
    http://www.anandtech.com/show/9390/the-amd-radeon-r9-fury-x-review/3
     
  18. ImSpartacus

    Regular Newcomer

    Joined:
    Jun 30, 2015
    Messages:
    252
    Likes Received:
    199
    Fudzilla apparently got some kind of tip.

    http://www.fudzilla.com/news/graphics/43873-next-geforce-doesn-t-use-hbm-2

    Unless Nvidia feels the need for a 700-esque Pascal refresh, "the next GeForce" will be Volta, probably GV104.

    This rumor would synergize well with the previous rumor that GV104 would drop in 2017 while other consumer Volta offerings would release in 2018 (presumably including GV102 with some of 2018's GDDR6 on a 384-bit bus).
     
    ieldra and pharma like this.
  19. doompc

    Joined:
    Mar 19, 2015
    Messages:
    7
    Likes Received:
    6
  20. pharma

    Veteran Regular

    Joined:
    Mar 29, 2004
    Messages:
    3,013
    Likes Received:
    1,690