Nvidia Pascal Speculation Thread

Oh, my fault. I completely forgot about the FirePro series, or rather that AMD named the chips the same despite using a different FPU configuration (able to scale arbitrarily from 2:1 to 16:1 at the cost of a slightly increased transistor count).

But still, 2:1 isn't optimal in terms of SP performance / die size or power consumption.
 
Out of these, Kepler was actually the one closest to the optimum. It's just that a 64-bit FMA/FMUL costs roughly 4x the hardware resources of the corresponding 32-bit operation; you can't cheat around that. There is some additional, mostly width-independent overhead for IEEE 754 edge-case handling.
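Back-of-the-envelope reasoning for the ~4x (my own numbers, not from any die shot): multiplier area grows roughly with the square of the significand width, and FP64 carries 53 significand bits versus 24 for FP32:

$$\left(\tfrac{53}{24}\right)^2 \approx 4.9$$

The adder and exponent logic scale more gently, which is why ~4x is a reasonable overall figure for the datapath.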

If anything, the SP to DP ratio is going to get worse than 4:1 in Pascal, not better. They might even be tempted to keep handling DP with dedicated units; that would be the only way they could achieve a better ratio.

Going 2:1 on a mixed-mode FPU effectively wastes resources while in SP mode - about 50%, actually. Going 8:1 or worse indicates "software emulation" (not literally software, just running FP in multiple passes through the integer ALU).

And didn't you just call out the 295X2 as a reference for DP performance?

That's a dual GPU card, and if you really want to go that way:
The champion in terms of DP performance is still AMD's old Tahiti X2/New Zealand/Malta series, setting the bar at 2 SP FLOPs or 1/2 DP FLOP per ALU lane and cycle, with the 2.5-year-old 7990 only recently getting beaten by Intel's Knights Landing - and not even by much, just ~50%.
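Rough numbers from memory (treat them as approximate): each Tahiti has 2048 ALUs at ~950 MHz with a 1:4 DP rate, so the 7990 as a whole delivers about

$$2 \times 2048 \times 2\,\text{FLOPs} \times \tfrac{1}{4} \times 0.95\,\text{GHz} \approx 1.95\,\text{DP TFLOPS},$$

against the roughly 3 DP TFLOPS quoted for Knights Landing - indeed only ~50% more.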

Multiple-Precision FPU

The problem with Kepler was that it had limited register bandwidth.
The sustainable throughput for FP32 ops is 128 FMAs/clk per SM, although one Kepler SM could achieve 192 FMAs/clk at peak.
https://forum.beyond3d.com/posts/1644206/

The register bandwidth was doubled over Fermi, which managed 64 FP32 FMAs/clk (hence Kepler's 128 FP32 FMAs/clk). Given that, delivering a 3:1 FP32:FP64 ratio for GK110 was a wise decision, as a higher FP64 throughput would have been register-bandwidth limited anyway.
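To put rough numbers on that (my own arithmetic, following the reasoning in the linked post): an FMA needs three source operands per lane and clock, so sustaining 128 FP32 FMAs/clk already requires

$$128 \times 3 \times 4\,\text{B} = 1.5\,\text{KB/clk}$$

of operand bandwidth per SM. Feeding all 192 lanes would take 2.25 KB/clk, and 64-bit DP operands double the per-op requirement again, so the register file becomes the ceiling well before the ALU count does.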
As for NV, I really don't think they implemented fully independent FP64 units in Kepler, even though their engineers state as much. That would be a waste of resources.

If you want to compare FP32 and FP64 units and their corresponding resource utilization, you need to take all the scheduling and routing logic into account. For a real-life comparison, just take AMD's GCN architecture. Pitcairn's CUs were roughly 4.5 mm² (1:16 FP64:FP32). Tahiti's were ~5.5 mm² (1:4 FP64 MUL:FP32 MUL; 1:2 FP64 ADD:FP32 ADD). For Hawaii there are no reliable die shots so far, but we can use Tahiti as a base and extrapolate from that. I don't think a Hawaii CU is drastically bigger than a Tahiti CU.
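Taking those (admittedly rough) area numbers at face value:

$$\frac{5.5 - 4.5}{4.5} \approx 22\%,$$

i.e. moving a GCN CU from 1:16 to 1:4 DP cost on the order of 20% extra area - the same kind of trade-off NV has to weigh for any mixed-precision Pascal SM.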

I don't want to compare video cards, but actual GPUs. If you do want to compare video cards, right now AMD delivers the one with the highest FP64 throughput per watt, the FirePro S9150, at over 10 FP64 GFLOPS/W. NV needs to beat that.
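For reference, from memory (so treat the exact figures as approximate): the S9150 is a 1:2-rate Hawaii with 2816 ALUs at 900 MHz and a 235 W board limit, which works out to

$$\frac{2816 \times 2 \times \tfrac{1}{2} \times 0.9\,\text{GHz}}{235\,\text{W}} \approx \frac{2534\,\text{DP GFLOPS}}{235\,\text{W}} \approx 10.8\,\text{GFLOPS/W}.$$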

NV will target 4:2:1 FP16:FP32:FP64; anything else would indicate that NV doesn't care about double precision.
 
As for NV, I really don't think they implemented fully independent FP64 units in Kepler, even though their engineers state as much. That would be a waste of resources.
Not in the case of the smaller Kepler chips, where dedicated FP64 units allow for a more compact SMX layout by reducing their count - though comparing the GK104 and GK110 multiprocessors, the savings aren't all that evident.
 
Apple's volumes are ginormous. If NV were to sell something like 80 million GPUs every quarter they would be shitting their pants out of sheer surprise and excitement. Alas, that's not the case.

Either way, SoCs/Apple or whatever else are not a sensible metric for comparing yields of GPU chips. If NV could get away with just 100 mm² chips and nothing else, that would be reason enough for them to rejoice. Are yields the same in cases where you have 5x or more the chip complexity/transistor count?

For the record, Apple isn't selling dramatically more chips on 16FF+ than it did at the start of 28nm, and Apple used only one foundry for the latter. So there's hardly any indication that yields are any better with 16FF+ than at the start of 28nm. Then comes cost, which is more in the double-"ouch" category for 16FF+.
 
Oh, my fault. I completely forgot about the FirePro series, or rather that AMD named the chips the same despite using a different FPU configuration (able to scale arbitrarily from 2:1 to 16:1 at the cost of a slightly increased transistor count).

But still, 2:1 isn't optimal in terms of SP performance / die size or power consumption.

Let's wait... a good night's sleep always helps.

With NV it's always like that: if they don't do any marketing around their next-gen DP performance, their next gen will simply lack DP performance. I think they will feature enough DP performance to be competitive and to provide a reasonable jump from GK180/GK210, which would be a factor of around 2~3x.
How do they achieve that?
If Pascal is not a huge jump from Maxwell, and Maxwell has dedicated FP64 units, Pascal will have them too. I think, if they include mixed precision, it will only affect the FP32 units. So in my opinion, GP100 could have something like 5000~6000 mixed-precision SPs and a number of DP units in the range of 1000~2000.
Volta could then bring a true change in the design of the stream processors.
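To sanity-check that range (my own back-of-the-envelope, using ~1 GHz as a placeholder clock): 1000~2000 dedicated DP units would give

$$1000\ldots2000 \times 2\,\text{FLOPs} \times 1\,\text{GHz} \approx 2\ldots4\,\text{DP TFLOPS},$$

i.e. roughly 1.5~3x a GK180 board (~1.4 DP TFLOPS, from memory), so the upper half of that unit count is what's needed for the 2~3x jump.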

I am just speculating... maybe this would be the other possibility. We have to keep in mind that NV introduced a new GPU microarchitecture with Maxwell, and HBM and FinFET will be a strain on NV, too. So maybe it would be better for NV to stick with the simplest option: take the mixed-precision units from GM20B (Tegra X1) and add some independent DP functionality.
 
I am just speculating... maybe this would be the other possibility. We have to keep in mind that NV introduced a new GPU microarchitecture with Maxwell, and HBM and FinFET will be a strain on NV, too. So maybe it would be better for NV to stick with the simplest option: take the mixed-precision units from GM20B (Tegra X1) and add some independent DP functionality.

You get twice the FP16 rate on GM20B only if, e.g., the ops are the same. That sounds too naive to me as an implementation, and I doubt they'd opt for something on the high end that works only under such conditions (wouldn't be the first time I'm wrong though...). Anyway, for the sake of the improbable I'll do a 180-degree twist and suggest that Pascal will have dedicated FP16 and FP64 SPs like the ULP SoC Rogue from IMG :p (oh and yes, it's obviously just a joke...)
 
You get twice the FP16 rate on GM20B only if, e.g., the ops are the same. That sounds too naive to me as an implementation, and I doubt they'd opt for something on the high end that works only under such conditions (wouldn't be the first time I'm wrong though...).
That sounds like VLIW2: a single instruction, but two sets of data words.
 
Given that data movement (even from registers) dominates the cost of computation so much, I wonder whether a design where each "SP" is a 64-bit unit with 3 operands and one result, and which can execute one DP, two SP or four HP operations per clock in some kind of limited VLIW setup, would make any sense.
 
Given that data movement (even from registers) dominates the cost of computation so much, I wonder whether a design where each "SP" is a 64-bit unit with 3 operands and one result, and which can execute one DP, two SP or four HP operations per clock in some kind of limited VLIW setup, would make any sense.
You and Nakai just gave me an idea...

What if each of these SPs is actually 256 bits wide: 192 bits in, 64 bits out. (Does that 192 bits ring a bell? Good. I didn't see the correlation until I did the math while writing this.) The compiler would be able to aggregate operations from two threads in a workgroup into a single SP VLIW2 instruction. For HP, that's actually a VLIW4 op, but still issued by only two threads. Only for DP would two threads actually need to issue their instructions in sequence.

That would, among other things, explain why Nvidia likes 192-bit memory interfaces so much...
 
GDDR5 chips have 32-bit data buses and are managed in pairs by a memory controller, yielding 64 bits. Depending on how factors like board complexity, target bandwidth, and capacity balance each other, jumping between powers of two in terms of memory controllers on a chip can be undesirable.
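For concreteness (standard GDDR5 numbers, nothing specific to any rumored part): a 192-bit interface is three 64-bit controllers, each driving a pair of 32-bit chips, and at, say, 7 Gbps per pin that works out to

$$\frac{192\,\text{bit} \times 7\,\text{Gbps}}{8} = 168\,\text{GB/s}.$$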

The controllers feed into cache lines of fixed length, and the SIMD lanes are behind multiple interfaces and layers of abstraction.
The one known example of 2x FP16 throughput uses a shared op because it avoids ripping out existing issue, register access, and scheduling logic.
The predicate bits remain the same, there's no need to double decoding, and the instruction won't do anything that would require ripping out the register file, since the doubled operation uses the same register IDs throughout.
Mixing "threads" in the absence of a simplifying assumption like a shared op involves revamping the hardware model presented to the system, since the hardware isn't increasing the number of bits for predication or other elements like scheduling slots.
 
Also, don't forget NVidia does lots of stuff in the compiler using knowledge of instruction and register throughputs, latencies and conflict mechanisms.
 
You get twice the FP16 rate on GM20B only if, e.g., the ops are the same. That sounds too naive to me as an implementation, and I doubt they'd opt for something on the high end that works only under such conditions (wouldn't be the first time I'm wrong though...). Anyway, for the sake of the improbable I'll do a 180-degree twist and suggest that Pascal will have dedicated FP16 and FP64 SPs like the ULP SoC Rogue from IMG :p (oh and yes, it's obviously just a joke...)

Of course the ops need to be the same, since these are SIMD units. And of course this kind of split should be implemented via VLIW-style ops. Does Maxwell have an array of dedicated FP64 units, or are these "enhanced" FP32 arrays with the ability to execute FP64 ops (at lower throughput)? If Maxwell has dedicated FP64 units, and Pascal is only a small step from Maxwell architecture-wise, it appears that Pascal will use a combination of FP64 and mixed-precision units. If they use pure mixed-precision units (with 4:2:1) and include, for example, 6000 SPs in their design, that corresponds to 3000 FP64 SPs (2:1), 6000 FP32 SPs, and 12000 FP16 SPs (4:2). So they would need to include 3000 large mixed-precision FP64 units. GK110 had an SP:DP ratio of 3:1, i.e. 960 FP64 SPs and 2880 FP32 SPs. Of course, I don't think those are fully "dedicated" FP64 SPs. So is the step from 960 FP64 SPs to 3000 mixed-precision FP64 SPs too big? If they go for true mixed-precision FP64 units, they need "VLIW4" ops in order to achieve their maximum FP16 throughput. Do they want that? I don't know...
 
There are rumors that Pascal will have no penalty or slowdown for using mixed precision; if they are true, then I don't think you will have different units.
 
There are rumors that Pascal will have no penalty or slowdown for using mixed precision; if they are true, then I don't think you will have different units.
Probably nothing different from how it's implemented in Tegra X1: vec2 packing reusing the FP32 ALUs, with all the implied practical limitations.
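For reference, this is roughly what that vec2 path looks like from the programmer's side on sm_53 (Tegra X1 / GM20B). A minimal sketch of my own using the cuda_fp16.h intrinsics - nothing Pascal-specific, and the kernel name is made up - just to show the "same op" restriction in practice: both FP16 values share one 32-bit register and one opcode.

```cuda
// Minimal sketch (my own example, not NVIDIA's) of the packed-FP16 path on
// sm_53 (Tegra X1 / GM20B): two FP16 values live in one 32-bit register and a
// single HFMA2 instruction operates on both lanes with the same op.
// Build with: nvcc -arch=sm_53 fp16x2.cu
#include <cuda_fp16.h>
#include <cstdio>

__global__ void saxpy_fp16x2(int n, float a, const float *x, const float *y, float *out)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (2 * i + 1 < n) {
        __half2 a2 = __float2half2_rn(a);                        // broadcast the scalar to both halves
        __half2 x2 = __floats2half2_rn(x[2 * i], x[2 * i + 1]);  // pack two adjacent elements
        __half2 y2 = __floats2half2_rn(y[2 * i], y[2 * i + 1]);
        __half2 r2 = __hfma2(a2, x2, y2);                        // one instruction, two FP16 FMAs
        out[2 * i]     = __low2float(r2);                        // unpack back to FP32
        out[2 * i + 1] = __high2float(r2);
    }
}

int main()
{
    const int n = 8;
    float *x, *y, *out;
    cudaMallocManaged(&x,   n * sizeof(float));
    cudaMallocManaged(&y,   n * sizeof(float));
    cudaMallocManaged(&out, n * sizeof(float));
    for (int i = 0; i < n; ++i) { x[i] = (float)i; y[i] = 1.0f; }

    saxpy_fp16x2<<<1, 32>>>(n, 2.0f, x, y, out);   // out[i] = 2*x[i] + y[i], computed in FP16
    cudaDeviceSynchronize();

    for (int i = 0; i < n; ++i) printf("%g ", out[i]);
    printf("\n");

    cudaFree(x); cudaFree(y); cudaFree(out);
    return 0;
}
```

The point being: if the two packed lanes wanted different operations, this path simply doesn't exist at the ISA level - which is exactly the conditional being discussed above.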
 
Nvidia Pascal win:

Along with touting the number of major HPC applications that are now GPU accelerated and the performance impact of that process, NVIDIA's other major focus at SC15 is to announce their next US government contract win. This time the National Oceanic and Atmospheric Administration (NOAA) is tapping NVIDIA to build a next-gen research cluster. The system, which doesn't currently have a name, is on a smaller scale than the likes of Summit & Sierra, and will be comprised of 760 GPUs. The cluster will be operational next year, and given the timing and the wording as a "next-generation" cluster, it's reasonable to assume that this will be Pascal powered.

http://www.anandtech.com/show/9791/...ation-to-build-tesla-weather-research-cluster
 
Nvidia talks Pascal: 16GB of memory at 1TB/s bandwidth

At the Japanese edition of NVIDIA's GPU Technology Conference, NVIDIA revealed some details about its 2016 graphics architecture, codenamed Pascal.

The Pascal GPU is fabbed at the Taiwan Semiconductor Manufacturing Company (TSMC) on the new 16nm FinFET process, which moves to non-planar transistors and should result in significant power savings.

Pascal will bring support for up to 32GB of HBM2 memory. Initial Pascal parts will launch with 16GB of HBM2 memory from SK Hynix and Samsung. The 16GB of HBM2 SDRAM (packed in four 4GB HBM2 stacks) will offer 1TB/s of bandwidth.
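That bandwidth figure is consistent with the standard HBM2 numbers: each stack has a 1024-bit interface at up to 2 Gbps per pin, so

$$4 \times \frac{1024\,\text{bit} \times 2\,\text{Gbps}}{8} = 4 \times 256\,\text{GB/s} = 1\,\text{TB/s}.$$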

"Pascal will also be available in multi-GPU packaging, replacing the Tesla K80 (NVIDIA skipped Maxwell-gen dual-GPU Tesla). Combined figures are very interesting to compare – 24GB GDDR5 and 480GB/s bandwidth should be replaced with 32GB HBM2 and 2TB/s bandwidth, mutually connected through NVLink rather than PCIe. The NVLink will enable up to 80GB/s, which should replace PLX PCIe Gen3 bridge chips that can only support 16GB/s (8GB/s per GPU). This part should be ‘warm up’ for 2018 and the Volta architecture".
http://www.guru3d.com/news-story/nvidia-talks-pascal-16gb-of-memory-and-1tbs-bandwidth.html
 
Page 7 of this NVIDIA presentation has a DP performance and bandwidth roadmap for Tesla GPUs.

Pascal: ~4000 DP GFLOPS, ~1000 GB/s
Volta: ~7000 DP GFLOPS, ~1200 GB/s
(GFLOPS and bandwidth seem to be accurate to 2 and 3 significant figures respectively)

My guess is 1:2 DP for the relevant chips for ~8000 SP GFLOPS on Pascal, which would give ~32 (enabled) "SMP"s at ~980 MHz, ~36 SMPs at ~870 MHz, and ~40 SMPs at ~780 MHz (assuming 128 SP CCs per "SMP," this also counts enabled SMPs only).
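The arithmetic behind those pairings, for anyone who wants to check: SP GFLOPS = SMPs × 128 CCs × 2 FLOPs/FMA × clock, e.g.

$$32 \times 128 \times 2 \times 0.98\,\text{GHz} \approx 8028\,\text{SP GFLOPS},$$

and the 36/870 and 40/780 combinations land at the same ~8 TFLOPS.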
 
Page 7 of this NVIDIA presentation has a DP performance and bandwidth roadmap for Tesla GPUs.

Pascal: ~4000 DP GFLOPS, ~1000 GB/s
Volta: ~7000 DP GFLOPS, ~1200 GB/s
(GFLOPS and bandwidth seem to be accurate to 2 and 3 significant figures respectively)

My guess is 1:2 DP for the relevant chips for ~8000 SP GFLOPS on Pascal, which would give ~32 (enabled) "SMP"s at ~980 MHz, ~36 SMPs at ~870 MHz, and ~40 SMPs at ~780 MHz (assuming 128 SP CCs per "SMP," this also counts enabled SMPs only).

A third more SMs and raw SP FLOPs than the Titan X doesn't sound anywhere near enough for a high-end Pascal given the process change. 1:3 DP sounds more realistic to me.
 
Can Pascal support 1:3 DP? I was under the impression that Pascal SMs are similar to Maxwell SMs, and Maxwell only supports 1:2^n DP as far as I am aware. If it can, then I agree that 1:3 makes a lot more sense.

Also, Teslas have lower clock speeds than GeForces, at least for the Fermi and later big chips, so regarding my above speculation I would expect a corresponding TITAN to reach 9-9.5 SP TFLOPS.

EDIT: "only supports 1:2^n DP" refers to Maxwell, not Pascal.
 