Nvidia Volta Speculation Thread

Discussion in 'Architecture and Products' started by DSC, Mar 19, 2013.

  1. CarstenS

    Veteran Subscriber

    Joined:
    May 31, 2002
    Messages:
    4,933
    Likes Received:
    2,263
    Location:
    Germany
    Regarding "packed math": whether or not it will be enabled in the Volta-based consumer GeForces at launch, IMHO Nvidia simply cannot risk not having it ready in hardware as an emergency measure, in case it sees a sudden uptake in some heavily benchmarked titles. The usage model is already coming over from console and mobile, so at least cross-platform devs should be able to make use of it soon. Sorry for stating the obvious.
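    To be concrete about what "packed math" means at the code level, here is a minimal CUDA sketch using the half2 intrinsics from cuda_fp16.h (the kernel name and parameters are made up for illustration): two FP16 values share one 32-bit register and a single instruction operates on both.

        #include <cuda_fp16.h>

        // Scale-and-bias over an FP16 buffer, two elements per instruction.
        // Each __hfma2 is one packed FMA on a pair of half-precision values --
        // the "double rate FP16" case being discussed.
        __global__ void scale_bias_fp16(__half2 *data, int n2, __half2 scale, __half2 bias)
        {
            int i = blockIdx.x * blockDim.x + threadIdx.x;
            if (i < n2)
                data[i] = __hfma2(scale, data[i], bias);  // d = scale * d + bias, on both halves
        }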
     
  2. pharma

    Veteran Regular

    Joined:
    Mar 29, 2004
    Messages:
    3,526
    Likes Received:
    2,213
    I tend to think some people (consumers) might want to get involved in AI/deep learning/inferencing, and it would be nice if GeForce Voltas gave them that ability. Having tensor cores would be quite appropriate.
     
  3. Anarchist4000

    Veteran Regular

    Joined:
    May 8, 2004
    Messages:
    1,439
    Likes Received:
    359
    As a rule, silicon is most efficient near threshold voltage, because dynamic power scales with the square of the voltage. There could be other reasons that help, but in general more silicon is better. It's the same reason mobile chips run fully enabled parts at lower clocks.

    That assumes tensor cores and SIMDs are an even split. Even a 25% increase in execution units could decrease power at similar performance levels.
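    As a rough illustration (purely made-up numbers, assuming dynamic power scales as units x clock x V^2 and that voltage can drop roughly in line with clock): 1.25x the execution units at 0.8x the clock gives the same throughput, and if that allows about 0.9x the voltage, power lands around 1.25 x 0.8 x 0.9^2 ≈ 0.81x, i.e. roughly a 20% saving at the same performance.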

    Fiji was ~600 mm² with 4 stacks of HBM1. The memory was a bit smaller, but it stayed within limits. Exceeding those limits, even if only for FP64, implies they needed more area; otherwise they could just shrink the chip a bit so the memory fits.

    In this case they are used concurrently. The tensors are a mix of FP16 muls and FP32 adders that already existed. 30 TFLOPS of existing FP16 should already be enough to drive the tensor units, so why duplicate logic you already have just to leave it idle? Workloads seem unlikely to use both concurrently in this case, and the tensor numbers don't seem to imply concurrency.

    The new instructions are warp-level instructions as opposed to per-lane ones: one matrix multiplication instead of many parallel multiplications with dependencies. My whole point here is that the tensor ops execute on top of existing hardware with some modifications: the Vec2s provide the FP16 muls, and extra adders from the INT32 pipeline provide the FP32 accumulation of a tensor core. The instruction sequence allows a few liberties to be taken to increase performance.
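    For a concrete picture of a warp-level matrix instruction, here is a sketch using the WMMA fragment API that CUDA later exposed for Volta (the kernel name and the 16x16x16 tile choice are just for illustration; the point is that a whole warp cooperates on one matrix FMA rather than each lane doing its own scalar math):

        #include <cuda_fp16.h>
        #include <mma.h>
        using namespace nvcuda;

        // One warp computes D = A*B + C on 16x16 tiles: FP16 inputs, FP32 accumulation.
        __global__ void wmma_16x16x16(const half *a, const half *b, const float *c, float *d)
        {
            wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> a_frag;
            wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::col_major> b_frag;
            wmma::fragment<wmma::accumulator, 16, 16, 16, float> acc;

            wmma::load_matrix_sync(a_frag, a, 16);                    // leading dimension 16
            wmma::load_matrix_sync(b_frag, b, 16);
            wmma::load_matrix_sync(acc, c, 16, wmma::mem_row_major);  // pre-load C into the accumulator
            wmma::mma_sync(acc, a_frag, b_frag, acc);                 // warp-wide matrix FMA
            wmma::store_matrix_sync(d, acc, 16, wmma::mem_row_major);
        }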
     
    nnunn likes this.
  4. lanek

    Veteran

    Joined:
    Mar 7, 2012
    Messages:
    2,469
    Likes Received:
    315
    Location:
    Switzerland
    I can imagine some will (it's not so much for "consumers" as for university people who could work at home). Honestly, even if that were the case, I really doubt Nvidia will do it. In fact I give it a 10% chance of happening. Maybe on a $2000 Titan, and even there...

    I say that, but honestly in Japan, China and many other Asian countries there are already school programs where boys and girls of around 10 are doing robotics and related things, so why not? (Meanwhile I look at my own country and see they have only just started thinking about computing courses (maybe C++) in school, at around age 14.)
     
    #404 lanek, Jun 7, 2017
    Last edited: Jun 7, 2017
  5. manux

    Veteran Regular

    Joined:
    Sep 7, 2002
    Messages:
    2,060
    Likes Received:
    926
    Location:
    Earth
    I imagine those new thin laptops with a 1080 would be a good choice for learning about AI/ML/DNNs... If things get serious, any single-GPU solution is too slow to get practical results anyway. Perhaps Nvidia's Docker+cloud approach is decent, as not everyone wants to build custom farms.
     
  6. ieldra

    Newcomer

    Joined:
    Feb 27, 2016
    Messages:
    149
    Likes Received:
    116
    What will it get destroyed by in the server market? Vega is no faster than a Quadro GP100, which has higher memory bandwidth and comparable FP32/FP16 resources (but no DP4A). You are assuming it will have half the hardware idling; we still don't have any concrete information on mixed tensor + traditional pipeline usage. IIRC GV100 is 815 mm²? Vega 10 was estimated to be around 530 mm². A ~53% larger die, something like 8x the FP64 rate and 5x higher throughput in DL compares quite favorably to me.

    It is starting to seem like you are deliberately ignoring Nvidia's very clear comments regarding this: they are entirely separate. Tensor cores do not use the existing FP32 units, because it was explicitly stated that full-throughput FP32 would saturate only half the dispatch capacity per cycle; the remaining half can be used for all other instructions/units.

    There are two tensor cores per 8 FP64 units, 16 FP32 and 16 INT32 units. Each can execute a three-operand (4x4-matrix, or 16-wide-vector) FMA instruction in one cycle, judging by how this was written. I do not understand your insistence on denying it; at times it appears your end goal is to suggest tensor cores are a banal addition that can easily be done with existing logic re: Vega, when it is abundantly clear this is not so. You have been in such a rush to make these statements that you apparently forwent reading what limited details are available and mistakenly believed this was doing so-called tensor products.
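    As a back-of-the-envelope check on those throughput figures (using the announced V100 numbers, so treat it as an estimate): 80 SMs x 8 tensor cores x 64 FMAs per cycle x 2 flops per FMA x ~1.455 GHz ≈ 119 TFLOPS, which lines up with the ~120 "tensor TFLOPS" Nvidia quotes.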


    To reiterate...

    The multiplication of two 4x4 matrices results in a third 4x4 matrix with 16 elements; each element takes 4 FMA operations, so 16 elements = 64 operations. The data of matrix C can be loaded into the accumulator from the start, and each FMA of the FP16 elements of matrices A and B feeds into that accumulator directly.
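    Spelled out as a scalar reference (a sketch of the arithmetic only, not of how the hardware schedules it; the function name is made up):

        #include <cuda_fp16.h>

        // Scalar reference for one tensor-core op: D = A*B + C on 4x4 tiles.
        // 16 output elements x 4 FMAs each = 64 FMA operations per instruction.
        __device__ void mma_4x4_reference(const __half A[4][4], const __half B[4][4],
                                          const float C[4][4], float D[4][4])
        {
            for (int i = 0; i < 4; ++i)
                for (int j = 0; j < 4; ++j) {
                    float acc = C[i][j];                          // accumulator starts at C
                    for (int k = 0; k < 4; ++k)
                        acc = fmaf(__half2float(A[i][k]),         // FP16 inputs,
                                   __half2float(B[k][j]), acc);   // FP32 accumulate
                    D[i][j] = acc;
                }
        }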

    Edit:

    I am not certain how having four operands (4 pairs of FP16 products) accumulated in one cycle will work, since an accumulator usually takes one new operand at a time. It may well be pipelined, the question being how deep, but the wording suggests this is not the case. I am also very curious how this is handled in terms of dispatch, as it clearly far exceeds the capacity of the AWS on paper.
     
    xpea and pharma like this.
  7. silent_guy

    Veteran Subscriber

    Joined:
    Mar 7, 2006
    Messages:
    3,754
    Likes Received:
    1,380
    How is that relevant to this discussion? We're talking about a pretty small amount of area.

    A split how? In terms of mm2? In terms of usage? In terms of number of units? I'm confused.

    Fiji stayed within what limits?

    At more than 1000 mm², Fiji's interposer exceeded the limits of today's lithography machines, so they needed double exposure for the interposer. The core die did not. Why would GV100 be any different?

    It's certainly an option.

    But there are good arguments not to do it this way: additional power consumption for the non-tensor FP16 and FP32 cases, simplicity of the design, and ease of adding or removing a tensor core to an SM (or replacing it with an even faster integer equivalent for the inference versions).

    You seem to dismiss a separate core out of hand as if it's some ridiculous option. There are pros and cons for both.
     
    xpea, nnunn, CSI PC and 2 others like this.
  8. ieldra

    Newcomer

    Joined:
    Feb 27, 2016
    Messages:
    149
    Likes Received:
    116
    The more interesting idea is a DL-oriented chip with *more* tensor resources. They could potentially strip vec2 FP16, DP4A and FP64, double the tensor core count, and push close to 1/4 petaflop.
     
  9. Anarchist4000

    Veteran Regular

    Joined:
    May 8, 2004
    Messages:
    1,439
    Likes Received:
    359
    You have a quote for Vega's tensor flops? Not the FP16 figures, but when using all the hardware in a pipelined fashion for tensor operations on an architecture that hasn't been detailed yet? Right now you seem to be comparing apples and oranges.

    I'm not ignoring them, you just don't understand what they mean. The quotes you've provided say nothing about tensors. In fact they said the INT32 pipeline was the other half of that dispatch capacity, or probably dual-issued FP16 where Vec2s aren't required. That's exactly what I proposed for Vega a while ago, where the programmer didn't have to pack anything. It just seems silly to add limited FP32 cores to replace FP32 cores that already exist when in all likelihood they won't be running concurrently.

    You'll have to explain this abundantly clear part. Because the Nvidia statements run counter to what you've been saying. I'm not sure they say what you think they say. They explicitly state one thing, like FP32 and INT32 concurrently, and you come up with something completely different.

    So you're suggesting one cycle to load values into an accumulator, 4 cycles to process the multiplications thanks to dependencies on the adder, one to write out the value of the accumulator, then repeat that process? Since the products are being added into successive matrices, it would seem far simpler to stay decomposed and add up the components once all the multiplications are finished, completely avoiding the dependencies: one cycle per multiplication, as opposed to the 4-6 cycles you propose or interleaving operations with more complicated data paths.

    There is no such thing as a four-input adder. At best it's a series of dependent adders hidden in one really long clock cycle. The only equivalent of a multiple-input adder that comes to mind is quantum computing or analog computation involving op-amps. I don't foresee either of those in Volta.

    Small relative to what though? The entire chip or the area dedicated to logic? It's relevant because it should provide a means to a more efficient chip.

    I've never seen anything about Fiji using a double exposure on the interposer. My understanding was that it was as large as conventionally possible which defined the chip dimensions. If that wasn't the case Fiji wouldn't have had any trade-offs.

    What consumes extra power though? The non-tensor cores are there regardless, so you might as well make use of them. This does seem ridiculous to me, as it's being proposed to disable FP32/FP16 cores so that more FP32/FP16 cores can be added. The whole concept of the tensor core, from my view, is that a single tensor operation is executed across all the blocks concurrently: use the FP16 units for the multiplication, forward the result to the FP32+INT cores for adding/accumulation, then repeat. The only difference is that instead of running 16 threads across 16 hardware lanes, an entire matrix operation is completed in a single cycle using all of them, pipelining sequential operations across the blocks with specialized paths. That's why it seems ridiculous to me to replace hardware that already exists with more hardware.
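    To make that "stay decomposed" ordering concrete, here is a scalar sketch of one output element (the function name is made up; the point is that the four products are independent and only the final reduction touches the accumulator):

        #include <cuda_fp16.h>

        // One output element of the 4x4 tile: four independent FP16 products,
        // reduced as a tree into the FP32 accumulator instead of a 4-deep FMA chain.
        __device__ float dot4_decomposed(const __half a[4], const __half b[4], float c)
        {
            float p0 = __half2float(a[0]) * __half2float(b[0]);
            float p1 = __half2float(a[1]) * __half2float(b[1]);
            float p2 = __half2float(a[2]) * __half2float(b[2]);
            float p3 = __half2float(a[3]) * __half2float(b[3]);
            return ((p0 + p1) + (p2 + p3)) + c;   // two add levels, no serial dependency between products
        }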
     
  10. ieldra

    Newcomer

    Joined:
    Feb 27, 2016
    Messages:
    149
    Likes Received:
    116
    Here we go again with the imaginary support for matrix math à la Volta.

    You accused me of not understanding when you started waffling about tensor operations and 3-dimensional arrays as well. Read what was posted for you. @CSI PC posted a response from an Nvidia employee clearly stating the other half of the dispatch capacity was available to ALL OTHER INSTRUCTIONS.

    (quoting Anarchist4000) "You'll have to explain this abundantly clear part. Because the Nvidia statements run counter to what you've been saying. I'm not sure they say what you think they say. They explicitly state one thing, like FP32 and INT32 concurrently, and you come up with something completely different."
    I don't see how you can conclude that anything I stated contradicts the above; then again, after having seen your numerous posts detailing how banal an implementation of tensor products would be on Vega, I am hard-pressed to find this surprising.

    Multiplication is clearly done in one cycle. I don't know what you're on about; that probably makes two of us.

    You don't say...

    Use the FP16 units? What FP16 units? You mean the FP32 units capable of packed FP16 that you refuse to recognize? You have stretched your argument so thin with this long series of misunderstandings, false assumptions and general aversion towards anything Nvidia-related that you are poking holes in it all by yourself.

    They're all idiots, tensor cores are a waste of space, and Vega can match GV100 no problem.

    Gotcha.
     
    xpea likes this.
  11. ieldra

    Newcomer

    Joined:
    Feb 27, 2016
    Messages:
    149
    Likes Received:
    116
    I don't see how exactly I am supposed to explain the 'abundantly clear' part to you. Would you like me to read this out to you over VoIP?

    https://forum.beyond3d.com/posts/1984172/

    Or would you prefer I function as your forum secretary, to keep you up to date on the numerous posts you neglect to read on your path to the conclusion you reached well before looking at any actual information?
     
  12. CSI PC

    Veteran Newcomer

    Joined:
    Sep 2, 2015
    Messages:
    2,050
    Likes Received:
    844
    A Titan is a strong possibility; with Pascal it has been positioned more as a 'cheaper' high-end inferencing companion to Gx100 for labs/universities/etc.
    If I remember correctly, the Pascal Titan was first shown physically at Stanford by Jen-Hsun, in the context of his being there for DL.
    But the headache is that Nvidia do like to differentiate DL operation/instruction/compute capability between Gx100 and Gx102.

    What will be interesting is the price of the dedicated-DL 150W FHHL GV100 Volta, what has been disabled, and whether it will have complete DL functionality and scope (such as INT8 inferencing), or whether we can expect a GV102 version as well.
    The FHHL GV100 is a nice single-slot card with a good TDP, albeit more focused towards DL.



    Cheers
     
  13. CarstenS

    Veteran Subscriber

    Joined:
    May 31, 2002
    Messages:
    4,933
    Likes Received:
    2,263
    Location:
    Germany
    I can only remember that they said explicitly that GV100 SMs have one dispatcher each, followed by "the Volta GV100 SM includes separate FP32 and INT32 cores, allowing simultaneous execution of FP32 and INT32 operations at full throughput, while also increasing instruction issue throughput."

    At least in the Inside Volta blog I linked above, they do not mention anything else in this regard.
     
    Anarchist4000 and ieldra like this.
  14. ieldra

    Newcomer

    Joined:
    Feb 27, 2016
    Messages:
    149
    Likes Received:
    116
    Yeah, in the blog they mention nothing else. If you look at the post by @CSI PC I linked above, you will find a senior engineer at Nvidia responding to questions.
     
    CSI PC and CarstenS like this.
  15. CarstenS

    Veteran Subscriber

    Joined:
    May 31, 2002
    Messages:
    4,933
    Likes Received:
    2,263
    Location:
    Germany
    Found it - thx!

    But this only talks about scheduling and dispatch. What's probably the catch here is that for ... what can I call them ... non-auxiliary calculations, maybe, you'd also need free cycles on the register file, which might be a problem, since the 32-wide-warp-in-two-steps-of-16 arrangement is mainly there to economize on register file ports. IOW, the register file is fully occupied serving the FP32 ALUs their data for FMA operations.

    Is my understanding correct?

    edit: Thinking about it, the new warp-at-once scheduling/dispatch in conjunction with the doubled load/store capability seems to be the reason for the latency reduction for "core FMA math operations". Meanwhile, in every other cycle (where scheduling/dispatch would otherwise run dry), you can mostly do stuff that needs little or no register access, or you could switch both off in order to save power.
     
    #415 CarstenS, Jun 8, 2017
    Last edited: Jun 8, 2017
  16. Anarchist4000

    Veteran Regular

    Joined:
    May 8, 2004
    Messages:
    1,439
    Likes Received:
    359
    You do realize matrix operations have been supported on GPUs for some time? In fact practically every GPU since the '90s, when hardware T&L became available, has supported them.

    What waffling? Show me where there is ANY evidence for these claims you keep making. There is an entire spawned thread with detailed breakdowns of the FMA operations and where all these tensor flops come from, including the 8-flops-per-lane possibility I suggested, which accomplishes exactly what AMD has listed as a capability.

    So "so on" is the mythical tensor plus FP32 scheduling? Because this quote you keep referencing says absolutely nothing to support the claim you keep making. So please do read it back to me and show where this in any way reinforces your claim. Along with this impossible four input adder that is required and the basis for why operations run as they do. What I've been proposing is the very type of operation these software libraries set out to optimize.
     
  17. BRiT

    BRiT Verified (╯°□°)╯
    Moderator Legend Alpha

    Joined:
    Feb 7, 2002
    Messages:
    15,426
    Likes Received:
    13,897
    Location:
    Cleveland
    Come on now, there is no need to get chippy and start insulting one another. Please keep things on a technical level.

    If someone doesn't see things exactly your way, acting like children isn't going to win them over, but a higher-quality technical discussion might, if you try explaining things in a different manner.
     
  18. silent_guy

    Veteran Subscriber

    Joined:
    Mar 7, 2006
    Messages:
    3,754
    Likes Received:
    1,380
    I estimated the total FP16 area in GP100 to be in the range of 3.5% of the total die size.

    Look at slide 8 of the Fiji presentation at Hot Chips 2015 (https://www.hotchips.org/wp-content...-GPU-Epub/HC27.25.520-Fury-Macri-AMD-GPU2.pdf): "Larger than reticle interposer".

    Macri said in a Fiji pre-launch interview that "double exposure was possible but might be cost prohibitive". Maybe Fiji was cost prohibitive? Can you hear Totz' head explode? :)

    On a similar note: this could explain why Vega has only 2 HBM2 stacks. It stays very comfortably within single exposure territory. Even 4 stacks might have stayed within the limit.

    On one hand, there's the extra power from making a narrow function, a pure FP16 MAD, more general. This will always cost you one way or another.

    On the other hand, you should be able to save a considerable amount of logic and power for the pure tensor-core case as well. If you know you're always going to add 4 FP16 numbers and, ultimately, always going to add them into an FP32, there should be plenty of optimizations possible in terms of taking shortcuts with normalization, etc. For example, for just a 4-way FP16 adder, you only need one max function among 4 exponents instead of multiple 2-way ones. There's no way you won't have similar optimizations elsewhere.
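    To illustrate the kind of shortcut I mean, here's a heavily simplified toy model (positive, normal FP16 inputs only; no rounding, NaN/Inf or subnormal handling; the function name is made up): one exponent max, one alignment pass and one normalization of the wide sum, instead of three cascaded 2-way adds each doing its own alignment and normalization.

        #include <math.h>
        #include <stdint.h>

        // Toy model of a 4-way FP16 adder with a single shared exponent.
        static float add4_fp16_model(const uint16_t h[4])
        {
            int e[4]; uint32_t m[4];
            int emax = 0;
            for (int i = 0; i < 4; ++i) {
                e[i] = (h[i] >> 10) & 0x1F;          // biased exponent field
                m[i] = (h[i] & 0x3FF) | 0x400;       // mantissa with implicit leading 1
                if (e[i] > emax) emax = e[i];        // ONE max over all four exponents
            }
            uint32_t sum = 0;
            for (int i = 0; i < 4; ++i)
                sum += m[i] >> (emax - e[i]);        // align everything to emax in one pass
            return ldexpf((float)sum, emax - 25);    // value = sum * 2^(emax - 15 - 10)
        }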

    Speaking of which: in my earlier post, I somehow assumed that the tensor cores needed 2x the number of multipliers of the regular packed FP16. But that should be 4x, shouldn't it? Or am I missing something?

    That makes the case for a full dedicated tensor core stronger.

    Use the existing FP16 units *and add 3 extra units*. Versus: just use 4 extra units that are highly optimized for just one function. The latter doesn't seem too ridiculous to me.
     
  19. CarstenS

    Veteran Subscriber

    Joined:
    May 31, 2002
    Messages:
    4,933
    Likes Received:
    2,263
    Location:
    Germany
    What if you tried to shoehorn the FP64 units (as well) into doing Tensor-stuff?
     
  20. CSI PC

    Veteran Newcomer

    Joined:
    Sep 2, 2015
    Messages:
    2,050
    Likes Received:
    844
    Like I said earlier, Jonah Alben briefly mentioned that doing any such operations with the FP64 cores that way puts incredible pressure on registers/bandwidth, which is why they are just dedicated; for the same reason we don't see FP16 operations using both the FP32 and FP64 cores. He mentioned this when the P100 was launched: there was no way to make it work.
    It comes down to whether the changes in Volta would be enough to do all of that.

    Edit:
    To clarify, this is who Jonah is:
    Cheers
     
    #420 CSI PC, Jun 8, 2017
    Last edited: Jun 8, 2017