Nvidia Volta Speculation Thread

Thanks, I know Jonah. :) But I think this might also be a matter of exactly what question was asked, since he is very good at answering exactly what you ask him.

It is my understanding that operands for any kind of ALU do not pop out of thin air, but must be fetched from somewhere or computed beforehand. I do not think that FP16 units are fed by a register file of their own. And a bit is a bit.
 
What I am trying to get my head around is this:
Looking at it at a quarter-SM level, you have

- L0 I-Cache
- Warp Scheduler (Input select)
- Dispatch (output select)
- 1× 64 KiB Register File (probably sized to feed 8× FP64 at full speed)
- 8× Load/Store Units (probably sized to feed 8× FP64 at full speed)
- 8× FP64 FMA
- TONS of additional adders and multipliers in both INT and FP fashion.
and then you go in there and add another shitload of FP16 multipliers and adders.

And on top of that, you have all those things connected with minimum latency to the same register file ports. To my layman's mind this just does not make sense, and I am trying to get to a kind of understanding that does not make my head hurt. :)
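For a rough sense of why those register file ports matter, here is a back-of-envelope sketch (my own numbers, just taking the "sized to feed 8× FP64 at full speed" guess above at face value):

Code:
// Back-of-envelope operand traffic for one quarter-SM, assuming the register
// file is sized so that 8 FP64 FMAs can retire every clock (3 reads + 1 write each).
constexpr int fp64_fma_per_clk   = 8;
constexpr int read_bits_per_clk  = fp64_fma_per_clk * 3 * 64;  // 1536 bits/clk out of the RF
constexpr int write_bits_per_clk = fp64_fma_per_clk * 1 * 64;  //  512 bits/clk back into the RF
// Any extra FP16 multipliers running at a higher rate would want to share
// those same ports, which is exactly the congestion being puzzled over here.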
 
Do you mean: make a single FP64 core also capable of doing multiple FP16 ops?

Everything is possible. The same pro and con arguments apply. The GTC Inside Volta slides don't hint at anything AFAICS.
That was what Jonah specifically mentioned in the past as not being possible with P100 due to the massive pressure on register/BW.
This may have changed with Volta, but if it had changed enough to make this possible, I would think Nvidia would mention it *shrug*.

Cheers
 
GTC slide 50 labels the output of summing fp32 accumulators and fp16 full precision products with "Convert to FP32 result". This slide is not exactly clarity incarnate, so this may well be a stretch, but if the accumulated value is already FP32, why would a conversion be necessary?

Would it make any sense to use integer accumulators (not software visible) for the fp16 x fp16 products? It seems like all possible fp16 x fp16 products and sums of 4 such products would fit in an 82-bit fixed-point accumulator. How would adding such wide integers compare to adding fp32 values (which, from what I understand, involves mantissa shifts and rounding logic), in terms of area, power, and latency? I'm not a HW person, so it's totally non-obvious to me. I'm just wondering if something along those lines might be a valid reason not to use other FP32 SM resources for at least some of the accumulation process.
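(For what it's worth, the 82-bit figure checks out on the back of an envelope, if I have the fp16 limits right: |fp16| ≤ 65504 < 2^16 and the smallest fp16 subnormal is 2^-24, so an exact fp16 x fp16 product magnitude lies in [2^-48, 2^32). A fixed-point format with 32 integer bits and 48 fraction bits, i.e. 80 bits, holds any single product exactly, and summing 4 of them only needs 2 more bits -- 82, plus a sign bit if the accumulator is kept signed.)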

Also, does anyone know if the 4x4 matrix ops are actually programmer visible at the thread level? Or is it just a (warp-level) 16x16 matrix FMA?
 
GTC slide 50 labels the output of summing fp32 accumulators and fp16 full precision products with "Convert to FP32 result". This slide is not exactly clarity incarnate, so this may well be a stretch, but if the accumulated value is already FP32, why would a conversion be necessary?

Would it make any sense to use integer accumulators (not software visible) for the fp16 x fp16 products? It seems like all possible fp16 x fp16 products and sums of 4 such products would fit in an 82-bit fixed-point accumulator. How would adding such wide integers compare to adding fp32 values (which, from what I understand, involves mantissa shifts and rounding logic), in terms of area, power, and latency? I'm not a HW person, so it's totally non-obvious to me. I'm just wondering if something along those lines might be a valid reason not to use other FP32 SM resources for at least some of the accumulation process.

Also, does anyone know if the 4x4 matrix ops are actually programmer visible at the thread level? Or is it just a (warp-level) 16x16 matrix FMA?

Not sure you can get an accurate answer for now, as it is classified as a Preview Feature in CUDA 9, so there will be more changes coming.
I have only seen it mentioned at the warp level for now, as the WMMA API (Warp Matrix Multiply Accumulate), even in the separate CUDA presentation.
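For anyone curious, the preview API looks roughly like this per the CUDA 9 material - a minimal warp-level sketch of my own, with 16x16x16 being the only shape I've seen mentioned:

Code:
#include <cuda_fp16.h>
#include <mma.h>
using namespace nvcuda;

// One warp cooperatively multiplies a 16x16 FP16 tile pair and accumulates
// into FP32 fragments -- there is no per-thread view of the underlying 4x4 ops.
__global__ void wmma_16x16x16(const half *a, const half *b, float *c) {
    wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> a_frag;
    wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::col_major> b_frag;
    wmma::fragment<wmma::accumulator, 16, 16, 16, float> c_frag;

    wmma::fill_fragment(c_frag, 0.0f);                 // FP32 accumulators start at zero
    wmma::load_matrix_sync(a_frag, a, 16);             // the whole warp takes part in the load
    wmma::load_matrix_sync(b_frag, b, 16);
    wmma::mma_sync(c_frag, a_frag, b_frag, c_frag);    // the tensor-core FMA itself
    wmma::store_matrix_sync(c, c_frag, 16, wmma::mem_row_major);
}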
Cheers
 
Look at slide 8 of the Fiji presentation at HotChips 2015 (https://www.hotchips.org/wp-content...-GPU-Epub/HC27.25.520-Fury-Macri-AMD-GPU2.pdf) : "Larger than reticle interposer".

Macri said in a Fiji pre-launch interview that "double exposure was possible but might be cost prohibitive". Maybe Fiji was cost prohibitive? Can you hear Totz' head explode? :)
I'm not sure how the following would be interpreted:
http://techreport.com/review/28499/amd-radeon-fury-x-architecture-revealed/2
The reason why Fiji isn't any larger, he said, is that AMD was up against a size limitation: the interposer that sits beneath the GPU and the DRAM stacks is fabricated just like a chip, and as a result, the interposer can only be as large as the reticle used in the photolithography process. (Larger interposers might be possible with multiple exposures, but they'd likely not be cost-effective.) In an HBM solution, the GPU has to be small enough to allow space on the interposer for the HBM stacks. Koduri explained that Fiji is very close to its maximum possible size, within something like four square millimeters.
It's worded as if Fiji's sizing was up to the limit of the interposer's reticle. I thought AMD might have fudged things by having the chips on the interposer overhang onto areas not exposed for 65nm patterning.

On one hand, there's the extra power from making a narrow function, a pure FP16 MAD, more general. This will always cost you one way or the other.

On the other hand, you should be able to save a considerable amount of logic and power for the pure tensor core case as well. If you know you're always going to add 4 FP16 numbers and, ultimately, are always going to add them into an FP32, there should be plenty of optimizations possible in terms of taking shortcuts with normalization etc. For example, for just a 4-way FP16 adder, you only need 1 max function among 4 exponents instead of multiple 2-way ones. There's no way you won't have similar optimizations elsewhere.
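To make that concrete, here is a rough software model of the idea (purely my own sketch, nothing like real RTL): one max over the four exponents, one alignment shift per input, an integer sum, and a single normalize/round at the end, instead of three 2-way adds each doing their own align and round.

Code:
#include <climits>
#include <cmath>

// Conceptual model of a 4-way adder with a single alignment pass: find the
// max exponent once, shift each significand once, integer-add with no
// intermediate rounding, and normalize only at the very end.
double add4_single_alignment(const double p[4]) {
    int e[4];
    long long m[4];
    int emax = INT_MIN;
    for (int i = 0; i < 4; ++i) {
        double f = std::frexp(p[i], &e[i]);            // p[i] = f * 2^e[i], 0.5 <= |f| < 1
        m[i] = std::llround(f * double(1LL << 40));    // 40-bit significands, plenty for fp16 products
        if (p[i] != 0.0 && e[i] > emax) emax = e[i];
    }
    long long acc = 0;
    for (int i = 0; i < 4; ++i) {
        int shift = emax - e[i];                       // single alignment shift per input
        if (p[i] != 0.0 && shift < 63) acc += m[i] >> shift;
    }
    return std::ldexp(double(acc) / double(1LL << 40), emax);  // one normalization at the end
}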

Some thoughts I had on the customized units: while there is the truism that more silicon driven more slowly is better, more and longer wires are consistently not.
Perhaps more accurate accounting needs to take that into account, in order to differentiate between two tightly optimized units and one larger unit, and whether the tradeoff in extra sequencing, signal travel, leakage, and other pitfalls of complexity can shift the balance. The granularity of power gating is usually coarser, such as the SIMD-block level, and its effectiveness may be hampered if the gating had to be integrated at sub-unit granularity on a block that cannot fully idle.

Knowing that only a specific sequence of operations will occur in a physical space can remove a lot of mystery as to what wires need to go where. For example, if it's known that the adder phase isn't ever going to forward its results to the adder inputs, the option, its wires, and the multiplexing in the path can be removed.

One item I think might be a win with a dedicated unit is designing it to minimize the impact of data amplification.
If Nvidia's description of 64 multiplications in parallel is accurate, each element is used 4 times--and in this thread there is the claim that it's simpler and more efficient to go from 16x4 to 16x16.
The casual use of the word "forwarding" implies crossing the edges of pipeline stages, and if using the general units it follows that their pipeline latches, forwarding networks, and any lane-crossing paths stand to grow by up to an order of magnitude.
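As a plain reference point (my own sketch, not anything Nvidia has published), the D = A*B + C operation being discussed is just:

Code:
#include <cuda_fp16.h>

// Naive reference for the 4x4 FMA D = A*B + C a tensor core performs:
// the innermost multiply runs 4*4*4 = 64 times (the 64 parallel multiplies),
// and every element of A and B is read 4 times -- the data amplification a
// dedicated unit can keep on short, local wires.
void mma4x4_ref(const __half A[4][4], const __half B[4][4],
                const float C[4][4], float D[4][4]) {
    for (int i = 0; i < 4; ++i)
        for (int j = 0; j < 4; ++j) {
            float acc = C[i][j];                        // FP32 accumulator
            for (int k = 0; k < 4; ++k)
                acc += __half2float(A[i][k]) * __half2float(B[k][j]);  // FP16 inputs, products kept in FP32
            D[i][j] = acc;
        }
}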

I would have thought the SMs were already under some pressure for wiring congestion, if the design is optimized for density and the clock speeds Nvidia's GPUs have been reaching for. Volta's plain single-lane FMAs operate at a 4-cycle latency, which might not have happened if several adders, additional cross-lane permutations, and 4-16x the bypassing were hanging off of them.
Not knowing how many stages could be drive stages, whatever number of stages need to latch the 4-16x more 32-bit values would expand the lane.
It's why I'm leery of using the existing critical path for methods that could generate KBs of extra context and run thousands of extra wires into the existing paths. That's more layers of logic, and I have doubts that the SMs are so free of congestion that they can swallow that many more wires without losing density, adding repeaters or more logic, and possibly losing a significant amount of clock speed.

Potentially, a tensor unit could calculate what it could for shifting and control words on the 16 input elements first, then use bespoke logic that can only duplicate and broadcast the operands for this specific operation to the multipliers within a clock cycle.
Since there is no mystery as to where the results must go physically, the 64 outputs from the multipliers could have physically direct and short paths into the adders--whatever form they might take.
I would figure that the operation depth would be reduced with adders that took more than 2 inputs, and by adding things in parallel to the point that there's 32 or 16 values that need to move to the next pipe stage.

A less than fully pipelined tensor unit or some kind of skewed pipelining might let the tensor unit dispense with intermediate storage. I'm not sure if the whole matrix operation could truly be fit into a general FP unit's clock cycle, but the dedicated tensor unit might not need the FP unit's clock cycle--and the FP unit's clock cycle wouldn't need to fit the tensor.
Any non-standard timings or behaviors could be aided by the unit being separate and having dedicated sequencing logic, rather than expanding the general-purpose pipeline's sequencing behavior that already covers the behaviors of standard instructions.
 
I'm not sure how the following would be interpreted:
http://techreport.com/review/28499/amd-radeon-fury-x-architecture-revealed/2

It's worded as if Fiji's sizing was up to the limit of the interposer's reticle. I thought AMD might have fudged things by having the chips on the interposer overhang onto areas not exposed for 65nm patterning.
Yes, I was wondering about that too. The information is a bit contradictory.

http://www.memcon.com/pdfs/proceedings2014/NET104.pdf

When you look at the slides above, you can see that the HBM ball-out area is 6mm x 3.2mm. That 3.2mm is considerably less than the 5.5mm width of the die.

Now check out this die shot:

www.flickr.com/photos/130561288@N04/28917073604/in/photostream

Notice how the gold colored section that goes underneath the HBM die doesn't cover the HBM die completely.

It's possible that this gold colored section marks the region where lithography was applied.
 
I'm not sure how the following would be interpreted:
http://techreport.com/review/28499/amd-radeon-fury-x-architecture-revealed/2

It's worded as if Fiji's sizing was up to the limit of the interposer's reticle. I thought AMD might have fudged things by having the chips on the interposer overhang onto areas not exposed for 65nm patterning.
They did. The interposer actually measures 1010mm² or something in that range, but the exposed area is still within the reticle limit (26x32mm, i.e. ~832mm²). The data connections for the HBM stacks are at the very edge of that area; the stacks themselves are partially outside. That can be seen pretty well in the photo silent_guy has shown.

@silent_guy:
Where do you see a contradiction?
 
Like I said earlier, Jonah Alben briefly mentioned that doing any such operations with the FP64 core that way puts incredible pressure on register/BW, which is why it is just dedicated; nor do we see FP16 operations using both the FP32 and FP64 cores, for the same reason. He mentioned when the P100 was launched that there was no way to make it work.
Comes down to whether the changes to Volta would be enough to do all of that.
That's the full core though. With the forwarding involved and accumulation, much of the pressure should be removed. Remember that an accumulator won't actually touch the register file, and that is the part dealing with FP32-sized data. A single vec2 operand could sustain the entire operation. What I've been proposing is dedicated FP16 muls at double speed forwarding to any FP32 accumulators that can be found. L0 would be like a forwarding network and is possibly already shared between units. The arrangement might already be FP64, FP32, etc. lanes adjacent to each other with short, direct connections.

Do you mean: make a single FP64 core also capable of doing multiple FP16 ops?
It wouldn't have to be multiple. A single FP32/16 FMA op should be relatively straightforward. FP32-capable accumulators would seem to be the concern.

And on top of that, you have all those things connected with minimum latency to the same register file ports. To my layman's mind this just does not make sense, and I am trying to get to a kind of understanding that does not make my head hurt.
Not register file ports, but simple single-ported latches. Electrically they're identical, but the emphasis is on simple. Multiple, independent register files might make more sense. Accumulation and forwarding wouldn't be writing out results, as they ultimately end up elsewhere. At a future time there would be an instruction to move the data back into the main RF.

That was what Jonah specifically mentioned in the past as not being possible with P100 due to the massive pressure on register/BW.
This may have changed with Volta, but if it had changed enough to make this possible, I would think Nvidia would mention it *shrug*.

Cheers
As mentioned above, pressure and bandwidth shouldn't be a concern here.
 
That's the full core though. With the forwarding involved and accumulation, much of the pressure should be removed. Remember that an accumulator won't actually touch the register file, and that is the part dealing with FP32-sized data. A single vec2 operand could sustain the entire operation. What I've been proposing is dedicated FP16 muls at double speed forwarding to any FP32 accumulators that can be found. L0 would be like a forwarding network and is possibly already shared between units. The arrangement might already be FP64, FP32, etc. lanes adjacent to each other with short, direct connections.
The discussion was about what Nvidia does now (OK, there seem to be multiple tangents in the thread, but my response was regarding existing operation).
How is the pressure removed?
It is similar to doing DP4A (Int8 on FP32 CUDA cores); how can you not use the whole core if you want to improve throughput by running FP16 operations over the FP64 core?
Even DP2A uses the whole FP32 core.

Jonah explicitly said you cannot use the FP64 cores in the P100 in the way mentioned by the 2 earlier posts due to massive register/BW pressure, and that is ignoring the next ideal stage of using both FP64 and FP32 cores simultaneously for packed FP16 operations/computation.
It really feels there is a blurring between Nvidia and AMD here; are we talking about what is actually being done or a theoretical alternative that for whatever reason neither has done?
Cheers
 
The discussion was about what Nvidia does now (OK, there seem to be multiple tangents in the thread, but my response was regarding existing operation).
How is the pressure removed?
It is similar to doing DP4A (Int8 on FP32 CUDA cores); how can you not use the whole core if you want to improve throughput by running FP16 operations over the FP64 core?
Even DP2A uses the whole FP32 core.
The pressure would be removed IF you were caching the operands or sourcing fewer of them. Without accumulation the FP32 portion accounts for half of the total pressure: 32x FP16 operands plus 16x FP32 for the adder. With accumulation ONLY the 32x FP16 elements would need to be sourced, using the equivalent of 2 ports. The hard part of the operation is the FP32 addition, so getting the FP32 and FP64 cores wired in might make sense there.
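Putting rough numbers on that (my own arithmetic on the figures above, not anything Nvidia has stated): 32x FP16 operands are 32 x 16 = 512 bits and 16x FP32 accumulator inputs are 16 x 32 = 512 bits, so the FP32 side really is half of the ~1024 bits of operand traffic per op; keep the accumulators in-unit and only the 512 bits of FP16 inputs still have to come out of the register file.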

Jonah explicitly said you cannot use the FP64 cores in the P100 in the way mentioned by the 2 earlier posts due to massive register/BW pressure, and that is ignoring the next ideal stage of using both FP64 and FP32 cores simultaneously for packed FP16 operations/computation.
It really feels there is a blurring between Nvidia and AMD here; are we talking about what is actually being done or a theoretical alternative that for whatever reason neither has done?
FP64/32 cores should be able to perform unpacked FP16 math with very little modification. The only issue should come down to rounding not meeting the spec. With that understanding they should be able to avoid the pressure, as they don't need FP64 values; there's a big difference between sourcing two or three FP16 operands and sourcing FP64 operands. The question is what the hardware can do versus what is being exposed. With Pascal there wasn't any emphasis on FP16 multiplication feeding into FP32 adders/accumulators for Tensor operations, which blurs the line. They are now performing an operation that wants both FP16 and FP32 cores concurrently, ideally without going through the register file. The warp-level matrix operations for Tensors, to my understanding, don't require any new hardware for execution, just the ability to forward the result, ideally through an internal cache. I'm unsure if Pascal's operand caches are shared across units. LDS could probably do this, but I'm not sure that's the optimal method.

I'm not an expert in tensor math, but if you are accumulating in FP32 and multiplying in FP16, that FP32 result probably isn't being fed back into the multiplication. That would indicate the elements are likely cumulative, and there is no need to immediately sum a row before continuing. Accumulate each element and sum when you finish.
 
Look at slide 8 of the Fiji presentation at HotChips 2015 (https://www.hotchips.org/wp-content...-GPU-Epub/HC27.25.520-Fury-Macri-AMD-GPU2.pdf) : "Larger than reticle interposer".

Macri said in a Fiji pre-launch interview that "double exposure was possible but might be cost prohibitive". Maybe Fiji was cost prohibitive? Can you hear Totz' head explode? :)

Anandtech said:
The actual interposer die is believed to exceed the reticle limit of the 65nm process AMD is using to have it built, and as a result the interposer is carefully constructed so that only the areas that need connectivity receive metal layers.
http://www.anandtech.com/show/9390/the-amd-radeon-r9-fury-x-review/3
 
Fudzilla apparently got some kind of tip.

http://www.fudzilla.com/news/graphics/43873-next-geforce-doesn-t-use-hbm-2

Our well-informed sources tell us that the next Geforce will not use HBM 2 memory. It is too early for that, and the HBM 2 is still expensive. This is, of course, when you ask Nvidia, as AMD is committed to make the HBM 2 GPU - codenamed Vega for more than a year now. Back with "Maxwell", Nvidia committed to a better memory compression path and continued to do so with Pascal.

The next Geforce - and its actual codename is still secret - will use GDDR5X memory as the best solution around. We can only speculate that the card is even Volta architecture Geforce VbG. The big chip that would replace the 1080 ti could end up with the Gx104 codename. It is still too early for the rumored GDDR6, that will arrive next year at the earliest.

Unless Nvidia feels the need for a 700-esque Pascal refresh, "the next GeForce" will be Volta, probably GV104.

This rumor would synergize well with the previous rumor that GV104 would drop in 2017 while other consumer Volta offerings would release in 2018 (presumably including GV102 with some of 2018's GDDR6 on a 384-bit bus).
 