Nvidia Pascal Speculation Thread

Discussion in 'Architecture and Products' started by DSC, Mar 25, 2014.

Thread Status:
Not open for further replies.
  1. Tridam

    Regular Subscriber

    Joined:
    Apr 14, 2003
    Messages:
    541
    Likes Received:
    47
    Location:
    Louvain-la-Neuve, Belgium
    Please note that Nvidia has never been crystal clear about where those DP units sit. I did my best to represent what they were willing to share. And some details are missing to keep those diagrams readable: each 32-wide SP unit should actually be two 16-wide units.
     
  2. RecessionCone

    Regular Subscriber

    Joined:
    Feb 27, 2010
    Messages:
    499
    Likes Received:
    177
    Don't atomic memory ops execute in the ROP?
     
  3. fellix

    fellix Hey, You!
    Veteran

    Joined:
    Dec 4, 2004
    Messages:
    3,489
    Likes Received:
    400
    Location:
    Varna, Bulgaria
    Data/texture sampling will probably remain the last dedicated HW feature of all to be removed from the pipeline. It's still so much more efficient than a pure kernel substitute. I wouldn't mind a higher quality AF alternative though, since all of the latest GPU architectures now offer direct access to the texture cache by the shaders.
     
  4. huebie

    Newcomer

    Joined:
    Apr 10, 2012
    Messages:
    29
    Likes Received:
    5
    I have no clue :) I'm not a programmer... Edit: But since atomics heavily use VRAM access, it makes sense to me.

    Agreed.
    Does anybody know what happened to the SFUs in the transition from Kepler to Maxwell? Haven't read anything about it. What exactly was in the SFU in GK110 (and below)? Some sort of extension like SSE on CPUs?
     
  5. fellix

    fellix Hey, You!
    Veteran

    Joined:
    Dec 4, 2004
    Messages:
    3,489
    Likes Received:
    400
    Location:
    Varna, Bulgaria
    The SFU units are dedicated to handling higher-order, non-pipelined operations like RCP, RSQ, SIN, COS, MOV, attribute interpolation and similar. In some past NV architectures, the SFU was able to execute a parallel FMUL op under certain conditions.
     
  6. psurge

    Regular

    Joined:
    Feb 6, 2002
    Messages:
    939
    Likes Received:
    35
    Location:
    LA, California
    Noted, and interesting - do you know if each 16 wide unit executes a different warp over 2 cycles? If so, and a warp only has a few active threads, can a warp be processed in just 1 cycle?

    Regarding DP... I initially thought that having 1 DP unit per partition would be simpler than having to arbitrate for 2 shared units. But... it seems like even if your DP:SP ratio is very low, you probably want to do all your register reads for a warp at once, rather than as needed (otherwise, wouldn't there be interference with register file reads for subsequent SP ops?). And I guess then you are in the situation where you have a bunch of source data and you want to do a computation which will take a long time (because you have very few DP units). That starts looking like a texture lookup, so it seems like a good idea to ship off your DP source register values and switch warps, just like you would ship off addresses and gradients when doing a texture lookup (I don't really know what happens for texture lookups, so this is yet another assumption on my part), and you could presumably use the same datapaths for DP and texturing. So in the end, I feel like your diagram is very plausible.

    But... assuming my speculation has any relation to reality (and I am not a HW person, so that is very questionable), then I also think this approach probably makes a lot less sense if the DP:SP throughput ratio is high.
     
  7. Jawed

    Legend

    Joined:
    Oct 2, 2004
    Messages:
    10,873
    Likes Received:
    767
    Location:
    London
    Doesn't Maxwell have an operand collector?
     
  8. Razor1

    Veteran

    Joined:
    Jul 24, 2004
    Messages:
    4,232
    Likes Received:
    749
    Location:
    NY, NY
    Fermi had them, so they probably do.
     
  9. Ryan Smith

    Regular

    Joined:
    Mar 26, 2010
    Messages:
    611
    Likes Received:
    1,052
    Location:
    PCIe x16_1
  10. psurge

    Regular

    Joined:
    Feb 6, 2002
    Messages:
    939
    Likes Received:
    35
    Location:
    LA, California
    No idea. I guess my point is that you likely want something along those lines for long/unpredictable-latency ops, because you want to be able to execute other stuff in the meantime and not have to deal with unpredictable contention for register file ports. But you might want to just feed register file lookups directly to the ALUs for other kinds of operations. If my mental model of things is busted though, please let me know :)
     
  11. xpea

    Regular Newcomer

    Joined:
    Jun 4, 2013
    Messages:
    372
    Likes Received:
    309
    I don't know if my interpretation is correct, but from the last SC15 leaked slide it looks like big HPC Pascal will have a 1:2:4 ratio for FP64:FP32:FP16 precision:
    http://cdn.wccftech.com/wp-content/uploads/2015/11/NVIDIA-Pascal-GPU-Mixed-Precision.jpg
    Unified ALUs or dedicated FP64 ones, the question is still pending...

    My guess is that GP100 will have new mixed-precision ALUs to replace the old Kepler series and compete with Intel KNL, while GP102 will be the new gaming flagship with only FP32/16 Maxwell/X1-derived ALUs. Titan may be GP100 or GP102 depending on AMD pressure.

    source: http://wccftech.com/nvidia-pascal-volta-gpus-sc15/ (yes I know :embarrased:)
     
  12. CarstenS

    Veteran Subscriber

    Joined:
    May 31, 2002
    Messages:
    4,798
    Likes Received:
    2,056
    Location:
    Germany
    My guess would be that we will see more dedicated units again in the future (for FP64 at least; 16/32 should integrate nicely). The reason is in the posted picture: in the past there was much more area pressure, while today the focus shifts more and more toward power pressure.
     
  13. xpea

    Regular Newcomer

    Joined:
    Jun 4, 2013
    Messages:
    372
    Likes Received:
    309
    So you think HPC Pascal will have a 1:2 ratio of FP64:FP32 ALUs? :runaway:
    I may be wrong, but I think a single mixed-precision FP16/32/64 ALU will cost less die/logic/registers than one FP64 + two FP16/32 ALUs.
     
  14. Ailuros

    Ailuros Epsilon plus three
    Legend Subscriber

    Joined:
    Feb 7, 2002
    Messages:
    9,420
    Likes Received:
    179
    Location:
    Chania
    A wee bit more die area for quite a bit less power consumption.
     
    CarstenS likes this.
  15. itisravi

    Joined:
    Nov 25, 2015
    Messages:
    1
    Likes Received:
    0
    I am thinking Pascal will have 4096 FP32 and 2048 FP64 units. (128 FP32 & 64 FP64 per SM).

    Since DP GFlops = 2 x Freq x FP64 units, and assuming ~1000 MHz frequency, to achieve 4000+ GFLOPs it must have 2000+ FP64 units. Therefore, 2048 FP64 units make the most sense.

    For Volta, it should be 6144 FP32 and 3072 FP64 units. With a higher peak frequency of 1140 MHz, it will be able to hit 7 TFlops.

    Now as for the FP64:FP32 ratio, if it is 1:3, then

    1. Each SM must move back to "192 FP32 + 64 FP64" like K40/K80, as it will be difficult to implement 1:3 with 128 FP32 units.
    2. Pascal will have 6144 CUDA cores. I don't know if Nvidia would want to build such a big chip on a new process, considering the yield issues.
    3. Volta will have 9216 cores, which I don't think is even possible with the 16nm process.

    Looking from the perspective of consumer GPUs, it would make more sense for Nvidia to go from Maxwell (3072 cores) to Pascal (4096 cores) and then, as the process matures, to Volta (6144 cores).
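
    The peak-FLOPS arithmetic above can be sanity-checked with a quick sketch. Note that the unit counts and clocks here are this post's speculation, not confirmed specs:

    ```python
    # Peak throughput: each FMA unit does 2 FLOPs per cycle, so
    #   GFLOP/s = 2 * units * freq_MHz / 1000
    def peak_gflops(units, freq_mhz):
        return 2 * units * freq_mhz / 1000

    # Speculated Pascal: 2048 FP64 units at ~1000 MHz -> ~4096 DP GFLOP/s (4+ TFLOPs)
    print(peak_gflops(2048, 1000))

    # Speculated Volta: 3072 FP64 units at ~1140 MHz -> ~7004 DP GFLOP/s (~7 TFLOPs)
    print(peak_gflops(3072, 1140))
    ```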
     
  16. Voxilla

    Regular

    Joined:
    Jun 23, 2007
    Messages:
    710
    Likes Received:
    282
    We know Pascal is 4 TFlop/s DP.
    It has also been stated FP16 would be 4x Maxwell.
    That implies 12 TFlop/s SP and 24 TFlop/s FP16.

    48 SMs, each doing 128 FP32 / 64 FP64 (in 1.5 cycles?) and 256 FP16, could do that.
    So basically twice Maxwell, with flexible FP32 / FP16 / FP64.
    (Alternatively, 32 SMs with 192 FP32 / 384 FP16 and dedicated 64 FP64 (in 1 cycle).)

    For anything less (like 4096 cores), the 1 TB/s bandwidth seems total overkill.
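
    As a rough check of the 48-SM scenario above (the SM count, per-SM unit counts and the ~1 GHz clock are all speculative assumptions, not confirmed figures):

    ```python
    # Peak TFLOP/s for a given config; FMA counts as 2 FLOPs per unit per cycle.
    def peak_tflops(sms, units_per_sm, freq_ghz):
        return 2 * sms * units_per_sm * freq_ghz / 1000

    clk = 1.0  # assumed ~1 GHz
    print(peak_tflops(48, 128, clk))  # FP32: ~12.3 TFLOP/s
    print(peak_tflops(48, 256, clk))  # FP16: ~24.6 TFLOP/s
    ```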
     
    #416 Voxilla, Nov 25, 2015
    Last edited: Nov 25, 2015
  17. superjoeyprof

    Joined:
    Mar 11, 2015
    Messages:
    8
    Likes Received:
    2
    4 TFLOPS DP is planned for the top Tesla, but it will be a dual-GPU solution like the Tesla K80.
     
    Grall and Lightman like this.
  18. Voxilla

    Regular

    Joined:
    Jun 23, 2007
    Messages:
    710
    Likes Received:
    282
    Ok, that is interesting. Then we only need a Pascal with 12 TFlop/s FP32 and 2 TFlop/s FP64.
    That makes it more feasible with:
    48x SM each doing 128 FP32 / 64 FP64 (in 3 cycles) and 256 FP16
     
  19. iMacmatician

    Regular

    Joined:
    Jul 24, 2010
    Messages:
    773
    Likes Received:
    200
    Does the 1 TB/s memory bandwidth in the slide also refer to a dual-GPU card?

    The Tesla K80 has fewer enabled SMXs and a lower base clock than the K40, resulting in much lower FLOPS per GPU (at base clock) for the former. Since the slide from the last page seems to use the base clock of the K80 for the DP GFLOPS, I think the 4 DP TFLOPS for Pascal is also from the base clock. If the relevant Pascal chip is anywhere near Big Kepler in terms of power consumption, I would expect it to have closer to 3 DP TFLOPS in a single-GPU Tesla configuration, unless superjoeyprof's 4 TFLOPS value is unrelated to the value in the slide (in that case disregard this paragraph).
     
    pharma likes this.
  20. OlegSH

    Regular Newcomer

    Joined:
    Jan 10, 2010
    Messages:
    360
    Likes Received:
    252
    Why doesn't it match HBM2 bandwidth then?
    If the DP rate were 1/4 with 2 GPUs, bandwidth would be rated at 2 TBytes/second with 2x HBM2. K80 bandwidth perfectly matches its flops in the charts on page 7 - http://www.ecmwf.int/sites/default/files/HPC-WS-Posey_0.pdf
     
    pharma likes this.