Nvidia Pascal Speculation Thread

Impossible? Wow, strong words here. :)

SP:DP ratio clearly does not have to be a power of two. Cell for example has a 1:7 ratio. I'm sure there's a bunch of others too.


I'm really not sure the Cell architecture is worth using as an example for anything...

I don't know what changes have been made between Pascal and Kepler in this sense (let's forget Maxwell for this purpose)...
 
Impossible? Wow, strong words here. :)

SP:DP ratio clearly does not have to be a power of two. Cell for example has a 1:7 ratio. I'm sure there's a bunch of others too.

Not a power of two, but you can't divide 128 by 3 :) You can have a bit over 1:3 or a bit under 1:3, just not the exact ratio as long as you stay with 128-SP SMMs.

Going by the Maxwell SMM diagram at http://www.hardware.fr/articles/928-23/gm204-smm-nouveautes.html, one way to get to 1:3 would be to share a 32-wide DP execution unit between 3 of the single-precision SMM partitions. I have no idea whether that's a good idea, but it seems possible at least.

Then it would be just 96 FP32 SPs per SMM, or if you also use the FP64 units for FP32, you are at 1:4 again. It won't happen, as Pascal should be just an evolution of Maxwell. Some parts like the shaders were already changed in Maxwell, while Pascal should go more for things like proper async support, HBM and so on. 128 FP32 per SMM will stay.
My assumption of n times 4 FP64 units per SMM is based on this hardware.fr diagram. Right now they have 4 FP64 units per SMM and can add whatever multiple of this. So the nearest to 1:3 would be 44:128.
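For what it's worth, the arithmetic behind that is easy to check; a quick sketch (the n x 4 FP64 units per 128-SP SMM layout is this post's assumption, not a confirmed spec):

Code:
# Speculative sketch: which DP:SP ratios are reachable with n x 4 FP64
# units per 128-FP32 SMM, and which count lands closest to 1:3.
SP_PER_SMM = 128
candidates = range(4, SP_PER_SMM + 1, 4)          # multiples of 4 FP64 units
best = min(candidates, key=lambda dp: abs(dp / SP_PER_SMM - 1 / 3))
print(f"closest to 1:3 -> {best}:{SP_PER_SMM} (1:{SP_PER_SMM / best:.2f})")
# closest to 1:3 -> 44:128 (1:2.91)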
 
If there's no difference between the Maxwell ULP GPU FP16 implementation and Pascal's FP16, I don't see why they would really "need", in absolute terms, to avoid dedicated FP64 SPs. The above presentation with the supposed 4 TFLOPS DP must be ancient, by the way. But assuming for dumb speculative math's sake, synthesis of an FP64 SP @ 1 GHz under 16FF+ should come to less than 30 mm² for 2000 units altogether. What am I missing?
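Here's that dumb speculative math spelled out (the per-unit area budget is what the post's figures imply, not a real synthesis result):

Code:
# 2000 dedicated FP64 FMA units at 1 GHz line up with the presentation's
# ~4 TFLOPS DP figure; the <30 mm^2 claim implies the area budget below.
units, clock_ghz = 2000, 1.0
dp_tflops = units * 2 * clock_ghz / 1000      # 2 FLOPs per FMA per cycle
print(f"{dp_tflops:.1f} TFLOPS DP")           # 4.0 TFLOPS DP
print(f"area budget: {30.0 / units:.4f} mm^2 per unit")  # 0.0150 mm^2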
 
Well, that's another option - a 32-wide DP unit shared between 2 partitions which can also be used for SP, giving 1:3. I don't find your arbitrary-multiple-of-4 theory very convincing though - if you had e.g. 22-wide execution for DP shared across 2 partitions, you'd have to be able to issue from 2 warps per cycle to fill the execution units, and there wouldn't be a static mapping between execution lane and index inside the warp. That sounds complicated.
 
while Pascal should go more for stuff like proper Async Support
There is no indicator of this. Async execution of pure compute loads has worked since Kepler (when using CUDA, not the 3D queue), but mixed dispatch and draw call loads are supported neither by the frontend nor by the SM/X/M.

Nor has Nvidia announced any improvements in this area for Pascal.
 
Page 7 of this NVIDIA presentation has a DP performance and bandwidth roadmap for Tesla GPUs.

Pascal: ~4000 DP GFLOPS, ~1000 GB/s
Volta: ~7000 DP GFLOPS, ~1200 GB/s
(GFLOPS and bandwidth seem to be accurate to 2 and 3 significant figures respectively)

My guess is 1:2 DP for the relevant chips for ~8000 SP GFLOPS on Pascal, which would give ~32 (enabled) "SMP"s at ~980 MHz, ~36 SMPs at ~870 MHz, and ~40 SMPs at ~780 MHz (assuming 128 SP CCs per "SMP," this also counts enabled SMPs only).
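A quick sketch of that math (the SMP counts and clocks are the guesses from this post, assuming 128 FP32 cores per "SMP" and FMA counted as 2 FLOPs):

Code:
# SP GFLOPS = SMPs x 128 cores x 2 FLOPs (FMA) x clock
def sp_gflops(smps, mhz):
    return smps * 128 * 2 * mhz / 1000

for smps, mhz in [(32, 980), (36, 870), (40, 780)]:
    print(f"{smps} SMPs @ {mhz} MHz -> {sp_gflops(smps, mhz):.0f} SP GFLOPS")
# 32 SMPs @ 980 MHz -> 8028 SP GFLOPS
# 36 SMPs @ 870 MHz -> 8018 SP GFLOPS
# 40 SMPs @ 780 MHz -> 7987 SP GFLOPS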

I think your interpretation of the chart is wrong. The captions (Pascal, Volta, ...) in the chart indicate the computing power and memory bandwidth, not the squares. The squares are markers for the years.

This was my initial guess:
http://www.forum-3dcenter.org/vbulletin/showpost.php?p=10852347&postcount=1368

I tweaked my calculation a bit, and with 48 GFLOPS/W of FP32 computing power, a 1:3 ratio of FP64 to FP32 units and about 225 W TDP, you get about 10.8 TFLOPS FP32 and 3.6 TFLOPS FP64 computing power, which exactly matches the chart on page 7.

I also assume we'll see a similar organisation of compute units to GK210 (192 FP32 units with 64 dedicated FP64 units), the GK110 refresh with larger caches and register files. If you assume 1:2 or 1:4, the FP64 computing power doesn't match the chart on page 7.
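That estimate is just efficiency times power; spelled out (the 48 GFLOPS/W, 225 W and 1:3 ratio are all assumptions from this post):

Code:
gflops_per_watt, tdp_w = 48, 225
fp32 = gflops_per_watt * tdp_w / 1000          # TFLOPS
fp64 = fp32 / 3                                # 1:3 FP64:FP32 ratio
print(f"FP32: {fp32:.1f} TFLOPS, FP64: {fp64:.1f} TFLOPS")
# FP32: 10.8 TFLOPS, FP64: 3.6 TFLOPS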
 
I think your interpretation of the chart is wrong. The captions (Pascal, Volta, ...) in the chart indicate the computing power and memory bandwidth, not the squares. The squares are markers for the years.
That interpretation doesn't make sense for the existing chips.

Code:
Part   DP GFLOPS  DP GFLOPS  DP GFLOPS
       (Square)   (Caption)  (Actual)
K80         1900       2200       1864 (base), 2912 (max boost)
K40         1400       2000       1430 (base), 1680 (max boost)
K20         1200       1500       1173
M2090        600       1100        666

Part   GB/s       GB/s        GB/s
       (Square)   (Caption)   (Actual)
K40          290        360        288
K20          210        270        208
M2090        170        230        177

Square and Caption estimates are from rough pixel counting.

EDIT: Fixed K80 DP GFLOPS base value, it should be 1864 not 1870.
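For reference, the "Actual" base/boost values for K80 and K40 can be reproduced from the published SMX counts and clocks (64 FP64 units per SMX, FMA counted as 2 FLOPs):

Code:
def dp_gflops(gpus, smx, mhz):
    return gpus * smx * 64 * 2 * mhz / 1000

print(f"K80 base:  {dp_gflops(2, 13, 560):.0f}")   # 1864 (2x GK210 @ 560 MHz)
print(f"K80 boost: {dp_gflops(2, 13, 875):.0f}")   # 2912
print(f"K40 base:  {dp_gflops(1, 15, 745):.0f}")   # 1430 (GK110B @ 745 MHz)
print(f"K40 boost: {dp_gflops(1, 15, 875):.0f}")   # 1680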
 
Not a power of two, but you can't divide 128 by 3 :) You can have a bit over 1:3 or a bit under 1:3, just not the exact ratio as long as you stay with 128-SP SMMs.
Yeah, sure! But you could theoretically have 128-SP SMMs where each SP produces a DP result every 3 cycles instead of 1, right...? :p
 
Yeah, sure! But you could theoretically have 128-SP SMMs where each SP produces a DP result every 3 cycles instead of 1, right...? :p

There would be no point in that. If you use the same ALUs for the job, you should be doing either 1/4 rate (waste half the internal bandwidth when doing FP64, reuse the existing MUL resources) or 1/2 rate (use bandwidth efficiently, have twice the MUL resources you'd need for FP32).
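The bandwidth halves of that argument check out with simple operand accounting (a sketch; assumes 128 FP32 lanes and 3 source operands per FMA):

Code:
fp32_bytes = 128 * 3 * 4                 # bytes/cycle read at full FP32 rate
for rate in (4, 2):                      # 1/4 and 1/2 DP rates
    fp64_bytes = (128 // rate) * 3 * 8   # FP64 operands are twice as wide
    print(f"1/{rate} rate: {100 * fp64_bytes // fp32_bytes}% of FP32 operand bandwidth")
# 1/4 rate: 50% of FP32 operand bandwidth
# 1/2 rate: 100% of FP32 operand bandwidth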
 
There would be no point in that. If you use the same ALUs for the job, you should be doing either 1/4 rate (waste half the internal bandwidth when doing FP64, reuse the existing MUL resources) or 1/2 rate (use bandwidth efficiently, have twice the MUL resources you'd need for FP32).
In fact, there would be. Think 1:2 in terms of width and MUL array utilization, but factor in that the MUL array still has a higher latency when operating on FP64, the same as all the other adders needed later in the FPU.
 
In fact, there would be. Think 1:2 in terms of width and MUL array utilization, but factor in that the MUL array still has a higher latency when operating on FP64, the same as all the other adders needed later in the FPU.
OK, I feel stupid now.

~3:1 makes sense, but for a different reason. And that is if you don't use 2 quarters of a 52-bit MUL array, but only two 27-bit MUL arrays with a single loop (one pass \, one pass /, one final full-width addition). It's the latency of the following IEEE 754-specific exponent addition and shift circuit that is mostly constant. So it's technically a 4:1, but if the tail is long enough, it looks like 3:1, since FP64 gets the same (almost) static penalty as FP32.

If you can cut the latency of the tail, the 4:1 becomes more obvious, while 2:1 means wasting die space on the MUL array but being loop-free. And it's actually even closer to 1.5:1 if you don't reduce the tail for SP. Low latency in the tail comes at the cost of pipelining options, potentially requiring partially dedicated DP and SP backends.

Architectures with a ratio worse than 4:1 aren't looping more in the MUL array; the cost comes from reusing resources in the backend: full width for SP, looping operations, or even reused function units for DP.

Oh, and that even-number ratio? Most likely to simplify scheduling to the SMMs, allowing FP64 ops only at a fixed rate. It saves a lot of hassle if you can rely on a virtually fixed pipeline length at any given time (i.e. no simultaneous mixed operation); otherwise you would need to handle stalls.

Hey, FPUs aren't actually that complicated :D
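To make the MUL array reuse concrete, here's a toy integer model (plain Python, not hardware) of why a 53-bit mantissa product built from 27-bit arrays needs four partial products - two passes through two 27-bit arrays, hence the "technically 4:1" above:

Code:
import random

def mul53_via_27bit(a, b):
    # split each 53-bit mantissa into a 27-bit low half and the rest
    a_hi, a_lo = a >> 27, a & ((1 << 27) - 1)
    b_hi, b_lo = b >> 27, b & ((1 << 27) - 1)
    # four partial products, each fits a 27x27 MUL array;
    # two 27-bit arrays need two passes (one loop) to produce them all
    return (a_hi * b_hi << 54) + ((a_hi * b_lo + a_lo * b_hi) << 27) + a_lo * b_lo

a, b = random.getrandbits(53), random.getrandbits(53)
assert mul53_via_27bit(a, b) == a * b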
 
You could also do 1:3 by having 11 FP64 units per 32-wide partition and looping a warp over 3 cycles (11 × 3 = 33 slots for 32 lanes, wasting 1/33rd of the FP64 ALU slots with clock gating). Not pretty, not likely, but not impossible.
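The slot accounting for that trick (reading it as 11 units per 32-wide partition, which is what makes the 1/33 figure work out):

Code:
# 11 FP64 units looping a 32-wide warp over 3 cycles: 33 slots, 32 used.
units, cycles, warp = 11, 3, 32
print(f"waste: {units * cycles - warp}/{units * cycles} of FP64 slots")  # 1/33
print(f"rate: {warp}/{cycles} DP per cycle vs 32 SP per cycle -> exactly 1:3")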

Also, I still suspect this is finally the generation where NVIDIA makes an HPC-only chip without a rasteriser. They've already gone in that direction with K80 by making a new chip with a bigger register file etc. The next logical step is to optimise it further by removing 3D-only subsystems, and to optimise the flagship 3D GPU by keeping a very low FP64 ratio (and smaller local memory than the HPC chip, and possibly no NVLink).
 
I still suspect this is finally the generation where NVIDIA makes an HPC-only chip without a rasteriser. They've already gone in that direction with K80 by making a new chip with a bigger register file etc...
Is it possible GK210 already removed the raster/video specific units and used the die savings for the doubled shared memory and register files? That would explain why the compute-only K80 is the only SKU that uses GK210.
 
It is possible in terms of technical possibilities, but in terms of cost I would not bet on it. ;) There may be a slow transition from a pure rasterizer to a more elegant compute chip.
But keep in mind that some of the 3D functionality is still a good choice for HPC too (e.g. the TAU/TFU). What you can definitely strike with a red pencil is the UVD, display scanout, the ROPs (not 100% sure about those, but I haven't seen any code using them), some things on the frontend...

 