Nvidia Volta Speculation Thread

FP16 calculations are part of nV's mixed-precision capabilities in Pascal; in other words, the ALUs can do it, and they are not removing that in Volta. Why would they remove it from an already finished pipeline and do it only with Tensor cores? They already have it; they don't need to mention it again. They even added it to the Maxwell-based Tegra X1.
Presumably because they take up space and make scheduling a bit more complicated and weren't needed with the Tensor cores. In regards to gaming performance it's a different question as the vast majority of consumer Pascals won't have the capability. So that FP16 vs FP32 rate I've been quoting would be the highest flop rate for their respective parts.

As for Volta there is still very little information there and it's somewhat ambiguous. No guarantee consumer parts get the double rate FP16 like Pascal. They'd be in serious trouble if they didn't so I'd imagine it's included. That said all the official literature I've seen only lists FP32/64 and tensor ops. Tensor ops on cores they keep drawing as distinct units, although that seems unlikely to be the case. As I pointed out above, I have a feeling the tensor cores re-purpose the SM hardware and Nvidia hasn't shown that. That makes far more sense with the parallel INT32 pipeline.

Getting 64 parallel multiplies sourced from 16-element operands, followed by a full-precision accumulate of those multiplies and a third 16-element operand into a clock cycle efficiently and at the target clocks sounds non-trivial.
Technically they only need 20 operands per clock with some relatively simple broadcasts. Sixteen elements repeated sixteen times. The entire operation should only require 2x16 values to be read in and broadcasted. With accumulators and forwarding, operand bandwidth shouldn't be a problem at all. Keep in mind an accumulate doesn't actually need an operand as the result gets forwarded. You just have to wait a cycle before reading it which is where the single clock cycle is questionable. The tensor operations likely wouldn't be writing out any data until flushed and I'd almost guarantee it's a pipelined operation that is probably alternating accumulators. So in a single clock cycle there are 64 multiplications and 128 partial accumulations from the previous multiplications.

D=A*B+C is the normal FMA operation, but accumulation would be D+=A*B from an operand standpoint.
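For what it's worth, this warp-level D = A*B + C on 16x16x16 tiles is also how the operation is supposed to surface in CUDA 9's wmma preview API; a minimal single-tile sketch, assuming the API behaves as the devblog describes (launch and data setup omitted):

```
#include <cuda_fp16.h>
#include <mma.h>
using namespace nvcuda;

// One warp computes a 16x16x16 D = A*B + C tile: FP16 inputs, FP32 accumulate.
__global__ void wmma_tile(const half* a, const half* b, const float* c, float* d)
{
    wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> a_frag;
    wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::col_major> b_frag;
    wmma::fragment<wmma::accumulator, 16, 16, 16, float> acc_frag;

    wmma::load_matrix_sync(a_frag, a, 16);                      // leading dimension 16
    wmma::load_matrix_sync(b_frag, b, 16);
    wmma::load_matrix_sync(acc_frag, c, 16, wmma::mem_row_major);

    wmma::mma_sync(acc_frag, a_frag, b_frag, acc_frag);         // acc = a*b + acc

    wmma::store_matrix_sync(d, acc_frag, 16, wmma::mem_row_major);
}
```

In a full GEMM the same accumulator fragment is simply reused across the K loop, which is exactly the D += A*B accumulation pattern described above.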
 
Technically they only need 20 operands per clock with some relatively simple broadcasts.
I guess I'm not sure if I'm using the same terminology.
If going by operands in terms of the instruction or the SM's view of things, the unit reads from the register file that the other SIMD units read from, so I count 3 16-wide read operands.
If by operand you mean looking at it in terms of the matrix/SIMD elements, it's 48.
We've discussed this in the other thread, but I am not sure how to characterize how simple these broadcasts are, given the magnitude of the data amplification being mooted.

Sixteen elements repeated sixteen times.
This doesn't strike me as a good start towards a modest power increase.

With accumulators and forwarding, operand bandwidth shouldn't be a problem at all.
If the design starts to serialize execution across clock boundaries while using the existing SIMD pipelines, it requires accounting for the additional storage and forwarding width.
Keep in mind an accumulate doesn't actually need an operand as the result gets forwarded. You just have to wait a cycle before reading it which is where the single clock cycle is questionable. The tensor operations likely wouldn't be writing out any data until flushed and I'd almost guarantee it's a pipelined operation that is probably alternating accumulators. So in a single clock cycle there are 64 multiplications and 128 partial accumulations from the previous multiplications.
How many stages deep is this? I would like to step through the process to get an idea of the pipeline.
64 32-bit outputs from the multiplier per stage, and how many 32-bit inputs to the 128 adders?
Since this proposed process isn't single-cycle, I'm trying to get an estimate on the amount of inter-stage storage this is on top of the broadcasts that may be crossing clock boundaries as well.
 
Higher performance doesn't imply the packed math though. It could simply come down to a wider processor with improved clocks. We already know the chip is a good deal larger with more SIMDs.
It is about the relative value of that performance: for P100 the training previously had to be FP32, and now, with the latest cuDNN, training with Caffe2 is FP16.
The values involved do not correlate in any way with Tensor cores, and we can probably assume the clock speed is not going to be greater for V100 relative to the P100.
The charts I provided are nearly a perfect example of a Vec2 FP16 operation with the additional cores: the V100 has 2.4x the performance of the P100 running FP32 with Caffe2.
The V100 has 42% more FP32/FP16 cores, and Vec2 provides a near-2x performance increase in this situation.
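(Working those numbers: 2.4 / 1.42 ≈ 1.7, so the FP16 path in this Caffe2 run is delivering roughly 1.7x per core, i.e. most but not all of the theoretical 2x from Vec2.)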

This is why I explicitly chose the real-world training performance rather than inferencing results, as it provides the clearest indicator.
Also, historically Nvidia has mentioned the potential use of FP16 beyond DL with P100, applied to the imaging applications (a more scientific segment) they were also targeting.
Cheers
 
I guess I'm not sure if I'm using the same terminology.
If going by operands in terms of the instruction or the SM's view of things, the unit reads from the register file that the other SIMD units read from, so I count 3 16-wide read operands.
If by operand you mean looking at it in terms of the matrix/SIMD elements, it's 48.
We've discussed this in the other thread, but I am not sure how to characterize how simple these broadcasts are, given the magnitude of the data amplification being mooted.


This doesn't strike me as a good start towards a modest power increase.


If the design starts to serialize execution across clock boundaries while using the existing SIMD pipelines, it requires accounting for the additional storage and forwarding width.

How many stages deep is this? I would like to step through the process to get an idea of the pipeline.
64 32-bit outputs from the multiplier per stage, and how many 32-bit inputs to the 128 adders?
Since this proposed process isn't single-cycle, I'm trying to get an estimate on the amount of inter-stage storage this is on top of the broadcasts that may be crossing clock boundaries as well.
Man, there were benchmarks comparing GV100 to GP100 on a neural-net-oriented FP16 workload posted in this very thread.

It was made quite clear in the devblog that tensor and FP32/64 resources are separate
 
It even says so in the footnotes

This of course again is apples and oranges, since we all deemed GP100 to also support 2×FP16, right?


But that's not what I was getting at. I am talking about shared circuits between those "cores", i.e. the Tensor part re-using some of the adders and/or multipliers from the "regular" FP32 and/or FP64 ALUs.

Unfortunately, with regard to training, the P100 with Caffe2 and cuDNN 6 was FP32 only and not FP16; now, with the latest cuDNN and Volta, it can do FP16 training with Caffe2, which is probably why Nvidia was careful to select that particular real-world test to showcase V100.
The problem historically with sharing in the context you mention was pressure on the register file and its associated bandwidth (something Jonah Alben at Nvidia briefly touched upon at the launch of P100), and then there are the challenges mentioned by 3dilettante.
It would be impressive and nice if there is some overlap, but if there is, I would assume it would only be exposed through the Nvidia libraries (probably with constraints, and only some of them) rather than through a CUDA or C/C++ coded operation.
Cheers
 
This doesn't strike me as a good start towards a modest power increase.
That's the simplest form of the math operation. Sixteen elements multiplied by sixteen elements. So 256 FMAs or 512 operations for the full tensor. They are still chunking through that to fit it within the existing RF hardware.

I guess I'm not sure if I'm using the same terminology.
If going by operands in terms of the instruction or the SM's view of things, the unit reads from the register file that the other SIMD units read from, so I count 3 16-wide read operands.
If by operand you mean looking at it in terms of the matrix/SIMD elements, it's 48.
We've discussed this in the other thread, but I am not sure how to characterize how simple these broadcasts are, given the magnitude of the data amplification being mooted.
The operation should only require 32 elements to be read in total from the RF. Some discrepancy for flushing the cache, but the accumulators to my understanding will be performing many consecutive adds so the effect is minimal.

If the design starts to serialize execution across clock boundaries while using the existing SIMD pipelines, it requires accounting for the additional storage and forwarding width.
It would be a very basic serialization to get around the mismatched precision and increase throughput. I'm not suggesting a micro op serialization like on CPUs.

How many stages deep is this? I would like to step through the process to get an idea of the pipeline.
64 32-bit outputs from the multiplier per stage, and how many 32-bit inputs to the 128 adders?
Since this proposed process isn't single-cycle, I'm trying to get an estimate on the amount of inter-stage storage this is on top of the broadcasts that may be crossing clock boundaries as well.
Just two stages as a clock-speed optimization. As few as four 32-bit registers per lane should do it, and it would only require simple latches and maybe very small crossbars. Two of those are required for any accumulation step, which is a bit different from an adder. Accumulation is a one-operand function, unlike an add. You just need to hold onto a value long enough to complete the add so the multiplication can continue. Not that far off the micro-op setup CPUs use, but simplified for a GPU. The temporal adds are the real flop gain in this setup, and the clock speed is open to some interpretation.
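To make that concrete, here's a purely behavioral sketch of the alternating-accumulator idea (my own model, nothing Nvidia has shown): each "cycle", 64 new products are formed while the previous cycle's 64 products are folded into one of two alternating banks, so multiply and accumulate never wait on each other:

```
// Behavioral model of the alternating-accumulator scheme described above (illustrative only).
#include <array>
#include <cstdio>

int main()
{
    std::array<float, 64> acc[2] = {};   // two alternating accumulator banks
    std::array<float, 64> pending = {};  // products waiting one cycle to be added
    int bank = 0;

    for (int cycle = 0; cycle < 8; ++cycle) {       // arbitrary number of issue cycles
        // Stage 2: fold last cycle's 64 products into the current bank.
        for (int i = 0; i < 64; ++i)
            acc[bank][i] += pending[i];

        // Stage 1: issue 64 new multiplies this cycle (dummy FP16-style operands).
        for (int i = 0; i < 64; ++i)
            pending[i] = 0.5f * 2.0f;               // stand-in for a[i] * b[i]

        bank ^= 1;                                  // alternate banks each cycle
    }
    std::printf("acc[0][0] = %f, acc[1][0] = %f\n", acc[0][0], acc[1][0]);
}
```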

It was made quite clear in the devblog that tensor and FP32/64 resources are separate
If that's the case, there's a good chance they will get destroyed in the server market. Having half the hardware idling at any given time won't work out very well for efficiency. Although it might explain why a chip nearly double Vega's size is only marginally faster on theoretical numbers. Look at it this way: if V100 has 30 FP16 TFLOPs, that's one quarter of the tensor capability for half the workload. Run those at double rate and you have 60 TFLOPs for half the operation. Run the adds and suddenly 120 TFLOPs of tensor compute. All within a traditional pipeline with a lot of adders. You could write values to the RF, but power consumption will go up significantly. Coincidentally they specified V100 as 120 TFLOPs of tensor operations.
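(For reference, the published 120 number also falls straight out of the devblog's per-core rate: each Tensor core does a 4x4x4 matrix FMA per clock, i.e. 64 FMAs or 128 flops, and 80 SMs x 8 Tensor cores x 128 flops x ~1.455 GHz boost ≈ 120 TFLOPs; the 30 FP16 figure is likewise just 2x the ~15 FP32 TFLOPs.)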
 
Presumably because they take up space and make scheduling a bit more complicated and weren't needed with the Tensor cores. In regards to gaming performance it's a different question as the vast majority of consumer Pascals won't have the capability. So that FP16 vs FP32 rate I've been quoting would be the highest flop rate for their respective parts.

Consumer Pascal doesn't have it, but the Teslas do. Not sure about Volta, but if it's something they can leverage, I suspect it will have it. You have to understand, though, that for the vast majority of shaders 32-bit is kinda required, so for things like geometry calculations it isn't needed, but nV has no issue with those in their architecture. Also, most game engines aren't going to cater to the 1% of the 1% of cards. Vega is going to be short-lived as well; it's pretty obvious at this point that Volta will be only two quarters away from when Vega is available.
 
If that's the case, there's a good chance they will get destroyed in the server market. Having half the hardware idling at any given time won't work out very well for efficiency. Although it might explain why a chip nearly double Vega's size is only marginally faster on theoretical numbers. Look at it this way: if V100 has 30 FP16 TFLOPs, that's one quarter of the tensor capability for half the workload. Run those at double rate and you have 60 TFLOPs for half the operation. Run the adds and suddenly 120 TFLOPs of tensor compute. All within a traditional pipeline with a lot of adders. You could write values to the RF, but power consumption will go up significantly. Coincidentally they specified V100 as 120 TFLOPs of tensor operations.

It has always been separate cores for Nvidia regarding FP64 and FP32; not sure why this is an issue, considering the P100 and V100 have a 1:2 DP ratio while AMD has yet to truly present their DP Vega GPU, and it remains to be seen what TFLOPs of compute they manage.
Not sure why HW idling would be an issue in such a context, especially as both P100 and V100 are being deployed in the thousands and tens of thousands in supercomputers.
That is one reason Gx102 is actually the fastest FP32 and potentially the fastest inferencing GPU around from Nvidia, and I doubt this will change with GV102, which I would expect to again be faster in this context than GV100.

But in terms of instruction issue, you can get greater utilisation out of FP32/INT32 or a mixture of other instructions (it remains to be seen whether this also applies to Tensor cores, but it should).
I linked the following quote before, but it may have been missed; it has a theoretical example:
So, the new architecture allows FP32 instructions to run at full rate, and uses the remaining 50% of issue slots to execute all other types of instructions - INT32 for index/cycle calculations, load/store, branches, SFU, FP64 and so on. And unlike Maxwell/Pascal, full GPU utilization doesn't need to pack pairs of co-issued instructions into the same thread - each cycle can execute instructions from a different thread, so one thread performing a series of FP32 instructions and another thread performing a series of INT32 instructions will load both blocks to 100%

is my understanding correct?


Olivier Giroux (replying to Bulat Ziganshin, 20 days ago): That is correct.
Olivier is a senior Nvidia engineer.
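As a rough, purely illustrative sketch (mine, not from the quote above): in a grid-stride loop like the one below, the index and stride updates are INT32 work while the FMA is FP32 work, which is the kind of mix the separate INT32 pipe lets Volta keep issuing without the Pascal-style pairing constraints:

```
// Illustrative only: independent INT32 (index/stride) and FP32 (FMA) work
// that a Volta SM can issue concurrently from its separate pipes.
__global__ void saxpy_grid_stride(int n, float a,
                                  const float* __restrict__ x,
                                  float* __restrict__ y)
{
    for (int i = blockIdx.x * blockDim.x + threadIdx.x;   // INT32: index calculation
         i < n;                                           // INT32: compare/branch
         i += blockDim.x * gridDim.x)                      // INT32: stride update
    {
        y[i] = a * x[i] + y[i];                            // FP32: FMA
    }
}
```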

Cheers
 
If that's the case, there's a good chance they will get destroyed in the server market. Having half the hardware idling at any given time won't work out very well for efficiency.
I'd argue just the opposite.

It's pretty much a given that 2 exclusive functional blocks where one is operational and the other not will be more power efficient than 1 multi-function block.

This is especially important since a lot of use cases will be using one functional block almost exclusively, and never the other. It'd be a large drag on perf/W for no good reason.

Of course, it costs area, but that doesn't matter in this case.
 
I thought the FP64 and FP32 cores being separate was for power efficiency. I think this was discussed in regards to PowerVR's similar decisions with separate FP16 and FP32 cores. Perhaps a similar decision was made by nVidia for their tensor cores. Of course, it's not die size area efficient, but a power/area balance must be made somehow. Or, I'm just talking out of my ass as a complete layman. I do remember reading about this discussion probably in the mobile forum though.
 
I thought the FP64 and FP32 cores being separate was for power efficiency. I think this was discussed in regards to PowerVR's similar decisions with separate FP16 and FP32 cores. Perhaps a similar decision was made by nVidia for their tensor cores. Of course, it's not die size area efficient, but a power/area balance must be made somehow. Or, I'm just talking out of my ass as a complete layman. I do remember reading about this discussion probably in the mobile forum though.
I see silent guy beat me to it. I type too slow on my phone.
 
That is one reason Gx102 is actually the fastest FP32 and potentially the fastest inferencing GPU around from Nvidia, and I doubt this will change with GV102, which I would expect to again be faster in this context than GV100.
Doesn't that reinforce my point though? Tensor cores are primarily FP32 adders with some FP16. Unlike FP64 where it was mutually exclusive and a significant area cost. The only difference I can see between Tensor and regular SM is that one had more FP32 adders.

Data flow shouldn't be an issue with Tensors. It's basically matrix math which both parts accelerate somewhat natively. I'm not saying Nvidia's engineer is wrong, but FP64 and tensor are bad comparisons in this case. The advantage would be reworking the clocks and forwarding, but given exclusive use that should be trivial to implement on the existing hardware as the forwarding should already be there.

It's pretty much a given that 2 exclusive functional blocks where one is operational and the other not will be more power efficient than 1 multi-function block.
Except when there is a significant amount of overlap. If the standard hardware has packed math, the Tensors would be duplicating that same hardware and speculatively adding three adders while stripping other functions. They would save more power by doubling the execution capacity and clocking a bit lower. Nothing else there should use more power.

There is no reason the standard hardware couldn't accelerate Tensors as well. Those numbers don't appear to show up with Nvidia's figures. Nothing the Tensors appear to be doing would offer much performance over what the standard SM does. The advantage comes from L0 caching, density allowing more units, and probably some clock adjustments. Clock adjustments which apparently didn't occur as Tensor flops are an even multiple of FP32. Not 15-30% ahead or whatever the gain would be.
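(Indeed, with the announced figures, 120 / 15 is exactly 8x the FP32 rate, or 4x the packed FP16 rate, so the published numbers don't imply a separate clock for the Tensor units.)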
 
Except when there is a significant amount of overlap. If the standard hardware has packed math, the Tensors would be duplicating that same hardware and speculatively adding three adders while stripping other functions. They would save more power by doubling the execution capacity and clocking a bit lower. Nothing else there should use more power.

There is no reason the standard hardware couldn't accelerate Tensors as well. Those numbers don't appear to show up with Nvidia's figures. Nothing the Tensors appear to be doing would offer much performance over what the standard SM does. The advantage comes from L0 caching, density allowing more units, and probably some clock adjustments. Clock adjustments which apparently didn't occur as Tensor flops are an even multiple of FP32. Not 15-30% ahead or whatever the gain would be.


There is a reason why Google and nV went with tensor units for these types of calculations. It's not like they are dumb ;). If nV had the "packed math" in previous architectures, which they do, and could just tweak it to get the same amount of calculations as introducing tensor units at the same power consumption, I am sure they would have. There has to be a reason why they did what they did, and it's not like they just threw tensor cores in there at the last minute; it was planned five years ago to do it in Volta.
 
Except when there is a significant amount of overlap. If the standard hardware has packed math, the Tensors would be duplicating that same hardware and speculatively adding three adders while stripping other functions.
Yes, but that's irrelevant when area doesn't matter.

They would save more power by doubling the execution capacity and clocking a bit lower. Nothing else there should use more power.
I don't see why you'd save more power by doubling the executing capacity. Definitely not the non-FP16 operations.
 
Yes, but that's irrelevant when area doesn't matter.

I don't see why you'd save more power by doubling the executing capacity. Definitely not the non-FP16 operations.
Power curves. Twice the logic/performance with less than twice the power. Same reason larger GPUs always have better performance/watt at similar performance levels. So area definitely matters if you are cramming them into a server farm. You could easily be looking at 50-100% higher performance.

Seeing as how they pushed the die size to the limits with double exposure, area obviously is an issue.
 
Power curves. Twice the logic/performance with less than twice the power. Same reason larger GPUs always have better performance/watt at similar performance levels. So area definitely matters if you are cramming them into a server farm. You could easily be looking at 50-100% higher performance.

Seeing as how they pushed the die size to the limits with double exposure, area obviously is an issue.


That doesn't always happen; that's based on the architecture and where in the power envelope the final frequencies land.
 
Power curves. Twice the logic/performance with less than twice the power.
I don't follow.

Same reason larger GPUs always have better performance/watt at similar performance levels.
There could be various reasons why larger GPUs have better perf/W. The non-core doesn't scale along with the core, the cache hit rates may be better, ...

So area definitely matters if you are cramming them into a server farm. You could easily be looking at 50-100% higher performance.
Let's look at area and start from the GP100 and GP104 die shots.

You can see 30 and 20 functional blocks. Within those functional blocks, there are 2 sub-blocks. 1 sub-block has the same area for both: 2.6mm2. Those must be the TEX units. The other sub-blocks must be the SMs themselves. For GP104, it’s 4.1mm2. For GP100, it’s 7.5mm2.

We know that GP100 has double the register file capacity of GP104, but let's assume that the delta of 3.4mm2 is entirely due to 2x FP16 and 1/2-rate FP64 support.

FP32 requires a 24x24 multiplier, FP16 an 11x11 multiplier, and FP64 a 53x53 multiplier. This is a good first approximation of their relative sizes: FP64 ~ 4.9x FP32, and FP32 ~ 4.8x FP16. Let's round that down to 4x to compensate for the barrel shifters, which are more like 2x. This makes one FP64 16 times larger than one FP16. In those 3.4mm2, we need to fit 64 FP64 and 256 FP16, or an equivalent of (1024 + 256) = 1280 FP16. This gives a total area of 0.00266 mm2 per FP16, or 0.68mm2 per SM, or 20.4 mm2 for GP100.

GV100 has 84 SMs (but half the size) so that becomes 28.6 mm2 for the packed FP16 cores.

Nvidia now had the decision to add another 28.6mm2, or to add two of them for the tensor cores. For a total die size of 815mm2, that's a 3.5% area decision. This is a worst-case number, since we assume zero-area register files, and, in practice, it should be possible to make optimizations in the 4-input adder.
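For anyone who wants to replay the arithmetic, a trivial check using the same rounding assumptions as above:

```
// Sanity check of the rough area estimate above (same assumptions:
// one FP64 unit ~ 16x an FP16 unit, GP100 SM delta over GP104 = 3.4 mm2).
#include <cstdio>

int main()
{
    const double delta_mm2   = 3.4;
    const double fp16_equiv  = 64 * 16 + 256;          // 64 FP64 + 256 FP16 per GP100 SM
    const double per_fp16    = delta_mm2 / fp16_equiv; // ~0.00266 mm2
    const double per_sm      = 256 * per_fp16;         // ~0.68 mm2
    const double gp100_total = 30 * per_sm;            // ~20.4 mm2
    const double gv100_total = 84 * (per_sm / 2);      // ~28.6 mm2 (half-size SMs)

    std::printf("per FP16 %.5f mm2, per SM %.2f mm2, GP100 %.1f mm2, GV100 %.1f mm2 (%.1f%% of 815 mm2)\n",
                per_fp16, per_sm, gp100_total, gv100_total, 100.0 * gv100_total / 815.0);
}
```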

I don’t see the big deal, and I have no idea where your 50-100% higher performance is coming from.

Seeing as how they pushed the die size to the limits with double exposure, area obviously is an issue.
What double exposure?
You need a double exposure for the interposer, but you'd need that even for a 600mm2 core chip.
 
Doesn't that reinforce my point though? Tensor cores are primarily FP32 adders with some FP16. Unlike FP64 where it was mutually exclusive and a significant area cost. The only difference I can see between Tensor and regular SM is that one had more FP32 adders.

Data flow shouldn't be an issue with Tensors. It's basically matrix math which both parts accelerate somewhat natively. I'm not saying Nvidia's engineer is wrong, but FP64 and tensor are bad comparisons in this case. The advantage would be reworking the clocks and forwarding, but given exclusive use that should be trivial to implement on the existing hardware as the forwarding should already be there.
Maybe I misunderstood your point, but I thought you were saying that having different cores for FP32, FP64 and Tensor, with some idling/unused, was bad, even though it has been working out fine for them since Kepler.
Not sure how you see Tensor cores completely replacing traditional SGEMM/SGEMMEx/HGEMM, especially as they must use new instructions and operations and, importantly, a different data structure. Add the fact that P100 is not just Deep Learning with FP16 but also targets scientific imaging applications, and that is not even considering all the libraries Nvidia has. There is a place for both, and there is some potential for overlap in the use of instructions/operations, but Tensor cores do not replace the need for Vec2 on the FP32 cores, because they cannot be used for all libraries or instructions, and they add another level of coding/data requirements that not everyone will want to change to in one go.

I am coming from this post of yours.
If that's the case, there's a good chance they will get destroyed in the server market. Having half the hardware idling at any given time won't work out very well for efficiency. Although it might explain why a chip nearly double Vega's size is only marginally faster on theoretical numbers. Look at it this way: if V100 has 30 FP16 TFLOPs, that's one quarter of the tensor capability for half the workload. Run those at double rate and you have 60 TFLOPs for half the operation. Run the adds and suddenly 120 TFLOPs of tensor compute. All within a traditional pipeline with a lot of adders. You could write values to the RF, but power consumption will go up significantly. Coincidentally they specified V100 as 120 TFLOPs of tensor operations.

Nvidia's model structure makes sense to me, although at some point full-throughput FP16 operations must make their way down to the lower models.
So in reality there are two top GPUs: one with FP64 at a 1:2 ratio, and the other that removes FP64, resulting in a smaller die that actually has greater performance where DP is not required.
Tensor cores and Vec2 do not necessarily need to be exclusive, which is a different point to the above.
We may end up seeing GV102 with Vec2 FP16 on the FP32 cores (debatable, because it seems Nvidia still wants reasonable differentiation for its Gx100 range for now) and maybe an Int8 Tensor-core-related instruction/operation rather than FP16 - depends on whether Int8 inferencing has large enough traction.
Anyway, at some point Nvidia will need to trickle down full-throughput Vec2.
Cheers
 
Power curves. Twice the logic/performance with less than twice the power. Same reason larger GPUs always have better performance/watt at similar performance levels. So area definitely matters if you are cramming them into a server farm. You could easily be looking at 50-100% higher performance.

Seeing as how they pushed the die size to the limits with double exposure, area obviously is an issue.
But they do not need to push to the limit (size and power demand) if they remove DP cores.
610mm2 was the limit when P100 launched, but GP102 uses around 470mm2.
The NVLink P100 does 10.6 TFLOPs FP32 @ 300W and the PCIe P100 model 9.3 TFLOPs FP32 @ 250W, while the P40 does 12.1 TFLOPs FP32 @ 250W with its smaller die.
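(In perf-per-watt terms those numbers work out to roughly 35 GFLOPs/W for the NVLink P100 and 37 GFLOPs/W for the PCIe P100, versus about 48 GFLOPs/W for the P40.)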


IMO only the V100 with its 815mm2 die will be beyond 610mm2.
Cheers
 
They don't need the tensor cores for consumer Volta either. 80 SMs for the Volta Ti and 84 SMs for the Volta Titan will be plenty for a generational leap over Pascal.
 