Nvidia Pascal Announcement

From what I can see on the surface, I disagree. But I don't know how real neural-network practitioners use their machines - do they do training on one rig and inferencing on the other, so they can continue training a new iteration on the first one?

What I do know is that the harder the competition, the less able you will be to carry around dead weight, i.e. specialized products. With a chip nearing the size TSMC apparently can build comfortably, you would probably have to sacrifice something in order to expand the FP32 cores to 4×INT8 as well.
Although that goes against what Intel is finding out with its clients, as I quoted from a NextPlatform article - I should have done a search before posting to find confirmation and put it all together.
Specifically Knights Landing and its role not only in HPC analytics but also in deep-learning science.
Nvidia is making the situation more complex/expensive and going the opposite route to Intel, and Knights Landing is IMO a serious threat to Nvidia.
It will be interesting to see how this all pans out by the end of the year, but my point is they will have the chance to squeeze Nvidia a bit now at this end of the market.
Cheers
 
Unfortunately, that was almost 2 weeks ago; it's now back to 329.99. That's around 430-440 EUR shipped to where I live.

It's unclear to me why GP100 has dedicated double precision ALUs but not dedicated FP16 ALUs. Is it due to area, clock speeds, data routing, operand collection, register file organisation or something else?
One more consideration: re-usability of existing building blocks in different ASICs.

I forgot: How many extra bits would adders and multipliers in a FP32 unit need to achieve half-rate FP64?
 
It's unclear to me why GP100 has dedicated double precision ALUs but not dedicated FP16 ALUs. Is it due to area, clock speeds, data routing, operand collection, register file organisation or something else?
Maybe I misunderstand, but why would you want dedicated FP16 Cuda cores when FP16 can be handled by a mixed-precision FP32/FP16x2 core?
They cannot do the same mixed precision with the FP64 Cuda cores because of register bandwidth, and possibly also because of efficiency aspects of that core; it would be logical for Volta to improve on the mixed-precision cores.
Dedicated FP16 Cuda cores would mean less space for FP32 or FP64 ones.

Worth considering that the mixed-precision Cuda core has been around since Tegra X1 with FP16x2 ops, so this route to FP16 has been in development for a while.
Just as a reference from the Tegra X1 whitepaper:
while Tegra X1 also includes native support for FP16 Fused Multiply-Add (FMA) operations in addition to FP32 and FP64. To provide double rate FP16 throughput, Tegra X1 supports 2-wide vector FP16 operations, for example a 2-wide vector FMA instruction would read three 32-bit source registers A, B and C, each containing one 16b element in the upper half of the register and a second in the lower half, and then compute two 16b results (A*B+C), pack the results into the high and low halves of a 32-bit output which is then written into a 32-bit output register. In addition to vector FMA, vector ADD and MUL are also supported.
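
Just to make that concrete, here is a minimal sketch of what that packed path looks like from CUDA (my own illustration, not taken from the whitepaper; __half2 and __hfma2 come from cuda_fp16.h and need a GPU with native FP16 arithmetic):

#include <cuda_fp16.h>

// Each __half2 packs two 16-bit floats into one 32-bit register;
// __hfma2 computes a*b+c on both halves with a single instruction.
__global__ void fma_fp16x2(const __half2* a, const __half2* b,
                           const __half2* c, __half2* out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = __hfma2(a[i], b[i], c[i]);  // two FP16 FMAs at once
}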

Cheers
 
what?
Why not explain how you think they could implement dedicated FP16 Cuda cores without reducing the FP64 and FP32 ones, or how they could create a mixed-precision FP64 core without running into the existing register-bandwidth and efficiency limitations.
 
I forgot: How many extra bits would adders and multipliers in a FP32 unit need to achieve half-rate FP64?
The multiplier would need to support full double width on at least one operand and on the output to achieve that, with 2 loop iterations. Only quarter rate would be achievable without blowing up the multiplier array.
All adders would need to be scaled to the full double mantissa length (the full 52 bits plus overflow); looping would be pretty much pointless for these, given the trade-off between an additional accumulation register and making the adder wider. You also get additional gates spread throughout the unit to route between the different precision paths. Either that, or you essentially duplicate half of the FPU for each precision supported, and only reuse elements like the large multiplier array and perhaps the LUTs for the trigonometric functions.
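
To put rough numbers on that (my own back-of-the-envelope, just using the IEEE 754 field widths): FP32 has a 23+1 = 24-bit mantissa and FP64 a 52+1 = 53-bit one, so the adders roughly double in width (24 -> 53 bits plus overflow), while a full 53x53 multiplier array would have about 53^2 / 24^2 ≈ 4.9 times the partial products of a 24x24 one - hence widening only one operand and looping twice, as described above.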

I don't know what Nvidia did to achieve the double FP16 rate with the FP32 FPU - but given that Nvidia didn't boast much about power efficiency with GP100, and that a lot of die area is also wasted on the gigantic register files and the double-precision FPUs, I would think they went with the complicated but smaller approach of reusing as much of the logic inside each FPU as possible, even if that means sacrificing a bit of FP32 efficiency.

Oh, and it should be pretty obvious: mixing 64-bit in as well complicates the FPU even further. Even if you go for maximum integration, you still end up with a larger and significantly less efficient unit.
 
what?
Why not explain how you think they could implement dedicated FP16 Cuda cores without reducing the FP64 and FP32 ones, or how they could create a mixed-precision FP64 core without running into the existing register-bandwidth and efficiency limitations.
I haven't asserted anything. "It's unclear to me why GP100 has dedicated double precision ALUs but not dedicated FP16 ALUs."
 
I haven't asserted anything. "It's unclear to me why GP100 has dedicated double precision ALUs but not dedicated FP16 ALUs."
Again, why would you have dedicated FP16 when it makes more sense to combine it in a mixed-precision FP32/FP16x2 Cuda core (otherwise you must reduce the other Cuda core types to allow for double the number of FP16 cores relative to FP32 ones, along with the complexity it adds to the design, registers and cache), as outlined in the Tegra X1 and other Nvidia whitepapers?
There is a greater cost in making the FP64 Cuda core a mixed-precision one, and Nvidia has said in the past that this is not currently possible, probably for the reasons I briefly mentioned when you look at the Tegra X1 operation, along with the efficiency challenges.
Cheers
 
A 64 bit multiplier could easily be extended to handle packed 2x32b, but this is not power efficient. Multiplier complexity scales quadratically with bit count (for integer multiply; for FP multiply the mantissa bits scale quadratically and the exponent bits scale n log n). This is why you have dedicated 32 bit ALUs.

Similarly, packed 2x16b support doesn't complicate the 32 bit ALU much, so it doesn't hurt the most common GPU use case. It would of course be more power efficient to have separate 16 bit ALUs (PowerVR mobile chips have them), but 2x16b already (up to) doubles the perf/watt over running the same (16 bit) math as 32 bit float. I believe this is a good compromise for desktop GPUs. Mobile GPUs, on the other hand, mostly run OpenGL ES shaders (where mediump is dominant).
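
For a rough sense of scale (my own arithmetic, using 11-bit and 24-bit mantissas for FP16 and FP32, and ignoring gating of unused cross terms): two dedicated FP16 multipliers would be on the order of 2 x 11^2 = 242 partial-product cells versus the ~24^2 = 576 of the FP32 array they reuse, so separate units would be roughly 2.4x leaner on the multiplier side - but packed 2x16b still gets you (up to) the 2x perf/watt over promoting the maths to FP32, without spending any extra area.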
 
A 64 bit multiplier could easily be extended to handle packed 2x32b, but this is not power efficient. Multiplier complexity scales quadratically with bit count (for integer multiply; for FP multiply the mantissa bits scale quadratically and the exponent bits scale n log n). This is why you have dedicated 32 bit ALUs.

Similarly, packed 2x16b support doesn't complicate the 32 bit ALU much, so it doesn't hurt the most common GPU use case. It would of course be more power efficient to have separate 16 bit ALUs (PowerVR mobile chips have them), but 2x16b already (up to) doubles the perf/watt over running the same (16 bit) math as 32 bit float. I believe this is a good compromise for desktop GPUs. Mobile GPUs, on the other hand, mostly run OpenGL ES shaders (where mediump is dominant).
Just curious, how does AMD handle mixed precision with GCN, and has this changed with Polaris?
Thanks
 
A 64 bit multiplier could easily be extended to handle packed 2x32b, but this is not power efficient. Multiplier complexity scales quadratically with bit count (for integer multiply; for FP multiply the mantissa bits scale quadratically and the exponent bits scale n log n). This is why you have dedicated 32 bit ALUs.

Similarly, packed 2x16b support doesn't complicate the 32 bit ALU much, so it doesn't hurt the most common GPU use case. It would of course be more power efficient to have separate 16 bit ALUs (PowerVR mobile chips have them), but 2x16b already (up to) doubles the perf/watt over running the same (16 bit) math as 32 bit float. I believe this is a good compromise for desktop GPUs. Mobile GPUs, on the other hand, mostly run OpenGL ES shaders (where mediump is dominant).
Well, we have not seen any practical results for P100 Pascal in terms of register bandwidth and its limits, so we can only go by what a senior Nvidia engineer has said, which is that a mixed-precision operation on an FP64 Cuda core is register-bandwidth limited.
Considering this is just an evolution of Maxwell in terms of the streaming multiprocessors, registers, cache, etc., it is not a far stretch to consider that mixed precision is, for now, limited to the FP32 Cuda cores, and, like I mentioned, it is Volta that is truly designed as a mixed-precision GPU.
This is further exacerbated by how FP16x2 works, with its three 32-bit source registers per operation, while the dedicated FP32 Cuda cores are being used at the same time.

Unfortunately we only see undocumented/unreported aspects of the Kepler/Maxwell/Pascal designs on the Nvidia devblogs (that is where the fp16x2 and dp4a capabilities of Pascal GP104 were confirmed), and unfortunately I doubt we will see anyone posting tests with a P100 anytime soon.
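
As a side note, the dp4a path they documented there looks roughly like this from CUDA (a minimal sketch of my own, not taken from the devblog; __dp4a needs CUDA 8 and sm_61 or newer):

// __dp4a treats each 32-bit operand as four packed signed 8-bit values,
// multiplies them lane-wise and adds the four products to the
// 32-bit accumulator c - i.e. 4 INT8 MACs per instruction.
__global__ void dot_int8(const int* a, const int* b, int* acc, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        acc[i] = __dp4a(a[i], b[i], acc[i]);
}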
Cheers
 
Is there any evidence that GP100 is power-limited at full throughput double precision?
That's a really interesting and perceptive question. The energy needed to compute an FP64 FMA is about four times the energy needed for FP32. So 1/2 rate FP64 ALUs could in theory use twice the wattage of FP32! But is that increase minor compared to the significant overhead of data transfer and static RAM register memory access? My immediate guess is "the power difference is ignorable" but there's evidence that it's not. Modern high core count Xeons have to power gate and downclock 20% or more when running AVX code, showing the ALUs are using a large fraction of the Xeon's power budget. A GPU is even more ALU dense so it should be even more sensitive.

An easy way to test this is to take a Kepler Titan, Quadro, or Tesla (with unlocked 1/3 FP64 rate), and run, say, both SGEMM and DGEMM and look at the wall-socket power use. Any power difference will be even more distinct on P100 with its 1/2 FP64 rate.
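
Something along these lines would do it with cuBLAS (my own rough sketch; the matrix size, iteration count and the nvidia-smi polling interval are arbitrary placeholders):

#include <cublas_v2.h>
#include <cuda_runtime.h>
#include <stdio.h>

// Run SGEMM, then DGEMM, back to back on large matrices and watch the
// board power in another terminal, e.g. with
//   nvidia-smi --query-gpu=power.draw --format=csv -l 1
int main()
{
    const int N = 8192;  // arbitrary, just keeps the ALUs busy
    float  *sa, *sb, *sc;
    double *da, *db, *dc;
    cudaMalloc(&sa, sizeof(float)  * N * N);
    cudaMalloc(&sb, sizeof(float)  * N * N);
    cudaMalloc(&sc, sizeof(float)  * N * N);
    cudaMalloc(&da, sizeof(double) * N * N);
    cudaMalloc(&db, sizeof(double) * N * N);
    cudaMalloc(&dc, sizeof(double) * N * N);
    // Contents are irrelevant for a power test, but zero the inputs to be tidy.
    cudaMemset(sa, 0, sizeof(float)  * N * N);
    cudaMemset(sb, 0, sizeof(float)  * N * N);
    cudaMemset(da, 0, sizeof(double) * N * N);
    cudaMemset(db, 0, sizeof(double) * N * N);

    cublasHandle_t h;
    cublasCreate(&h);
    const float  fone = 1.0f, fzero = 0.0f;
    const double done = 1.0,  dzero = 0.0;

    // Repeat each GEMM long enough for the power draw to settle.
    for (int i = 0; i < 50; ++i)
        cublasSgemm(h, CUBLAS_OP_N, CUBLAS_OP_N, N, N, N,
                    &fone, sa, N, sb, N, &fzero, sc, N);
    cudaDeviceSynchronize();
    printf("SGEMM phase done - note the power reading\n");

    for (int i = 0; i < 50; ++i)
        cublasDgemm(h, CUBLAS_OP_N, CUBLAS_OP_N, N, N, N,
                    &done, da, N, db, N, &dzero, dc, N);
    cudaDeviceSynchronize();
    printf("DGEMM phase done - note the power reading\n");

    cublasDestroy(h);
    return 0;
}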
 
Modern high core count Xeons have to power gate and downclock 20% or more when running AVX code, showing the ALUs are using a large fraction of the Xeon's power budget. A GPU is even more ALU dense so it should be even more sensitive.
That's not a valid comparison. Those Xeons will downclock regardless of whether you use int, float or double ops, of course.
Don't forget that without SIMD these chips operate on at most 64-bit data (and with SIMD but without AVX, 128-bit) - ALUs, register files, etc. So if you use AVX it's not just the ALUs that use more energy; data transfer increases just as well, as it's now all using 256-bit values.
You could, of course, theoretically try to measure power consumption with AVX as well, trying to figure out whether doubles use more energy than floats (albeit you'd have to disable all that dynamic clocking stuff...). It's half rate too...
 
It's unclear to me why GP100 has dedicated double precision ALUs but not dedicated FP16 ALUs. Is it due to area, clock speeds, data routing, operand collection, register file organisation or something else?
Having thought about it, I spotted your trap. GP100 has no dedicated FP16 units because it has dedicated FP64 units.

And since we're approaching logic riddles here:
Is there any evidence that GP100 is power-limited at full throughput double precision?
Feeding out of the same set of registers endlessly? Rasterizers, ROPs and TMUs running empty?
 
That's a really interesting and perceptive question. The energy needed to compute an FP64 FMA is about four times the energy needed for FP32. So 1/2 rate FP64 ALUs could in theory use twice the wattage of FP32! But is that increase minor compared to the significant overhead of data transfer and static RAM register memory access? My immediate guess is "the power difference is ignorable" but there's evidence that it's not. Modern high core count Xeons have to power gate and downclock 20% or more when running AVX code, showing the ALUs are using a large fraction of the Xeon's power budget. A GPU is even more ALU dense so it should be even more sensitive.

An easy way to test this is to take a Kepler Titan, Quadro, or Tesla (with unlocked 1/3 FP64 rate), and run, say, both SGEMM and DGEMM and look at the wall-socket power use. Any power difference will be even more distinct on P100 with its 1/2 FP64 rate.
I've certainly noted, in the past, the huge throttling that AVX on Intel's big Xeons requires - but that has such narrow SIMDs (8-wide in GPU SP terms) and such a different application target that it's not the best comparison.

Do we have performance per watt numbers for Knights Landing? That is a conventional multi-precision SIMD ALU equivalent to 16-wide in GPU SP terms.

I wonder if an optimal GP100 DGEMM will need less tricky data movement, e.g. it may be possible to implement DGEMM without having to manually cache operands in shared memory, which is usually essential for SGEMM performance. If so, that could save power.

In the end, I don't believe SGEMM/DGEMM operate at the power-virus level (though they will definitely be pretty heavy). They don't stress the memory hierarchy beyond L1. One might argue that other high arithmetic-intensity kernels would make for a better comparison of double versus single precision, but I don't know what they'd be.

I think this leads back to the general question "what are your kernels?" Ratios of compute/RF/shared/cache/main-memory usage have a large effect on power usage. And then there's the argument that some kernels are easier to implement on either of AVX or CUDA and the ratios of compute versus memory-hierarchy usage could end up quite different when comparing the two approaches.

We could end up observing that Intel isn't targeting FLOPS per watt as much as it's targeting reduced man-years of optimisation effort to get from crappy HPC code to something that isn't embarrassing. So NVidia does the noble thing with performance per watt, but Intel just buys bums on seats.
 