Nvidia Pascal Speculation Thread

Oh, my fault. I completely forgot about the FirePro series, or rather that AMD named the chips the same despite using a different FPU configuration (able to scale arbitrarily from 2:1 to 16:1 at the cost of a slightly increased transistor count).

But still, 2:1 isn't optimal in terms of SP performance / die size or power consumption.
 
Out of these, Kepler was actually the one closest to the optimum. It's just that a 64-bit FMA/FMUL costs roughly 4x the hardware resources of the corresponding 32-bit operation; you can't cheat around that. There is some additional, mostly width-independent overhead for IEEE 754 edge-case handling.
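Back-of-the-envelope reasoning for the ~4x (my own numbers, not from any die shot): multiplier area grows roughly with the square of the significand width, and FP64 carries 53 significand bits versus 24 for FP32:

$$\left(\tfrac{53}{24}\right)^2 \approx 4.9$$

The adder and exponent logic scale more gently, which is why ~4x is a reasonable overall figure for the datapath.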

If anything, the SP to DP ratio is going to get worse than 4:1 in Pascal, not better. They might even be tempted to keep handling DP with dedicated units; that would be the only way they could achieve a better ratio.

Going 2:1 on a mixed-mode FPU effectively wastes resources while in SP mode - about 50%, actually. Going 8:1 or worse indicates "software emulation" (not literally software, just running FP in multiple passes through the integer ALU).

And didn't you just call out the 295X2 as a reference for DP performance?

That's a dual GPU card, and if you really want to go that way:
The champion in terms of DP performance is still AMD's old Tahiti X2/New Zealand/Malta series, setting the bar at 2 SP FLOPs or 1/2 DP FLOP per ALU lane and cycle, with the 2.5-year-old 7990 only recently getting beaten by Intel's Knights Landing - and not even by much, just ~50%.
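Rough numbers from memory (treat them as approximate): each Tahiti has 2048 ALUs at ~950 MHz with a 1:4 DP rate, so the 7990 as a whole delivers about

$$2 \times 2048 \times 2\,\text{FLOPs} \times \tfrac{1}{4} \times 0.95\,\text{GHz} \approx 1.95\,\text{DP TFLOPS},$$

against the roughly 3 DP TFLOPS quoted for Knights Landing - indeed only ~50% more.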

Multiple-Precision FPU

The problem with Kepler was that it had limited register bandwidth.
The sustainable throughput for FP32 ops is 128 FMAs/clk per SM, although one Kepler SM could achieve 192 FMAs/clk at peak.
https://forum.beyond3d.com/posts/1644206/

The register bandwidth was doubled over Fermi, which managed 64 FP32 FMAs/clk (hence Kepler's 128 FP32 FMAs/clk). Given that, delivering a 3:1 FP32:FP64 ratio for GK110 was a wise decision, as a higher FP64 throughput would have been register-bandwidth limited anyway.
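To put rough numbers on that (my own arithmetic, following the reasoning in the linked post): an FMA needs three source operands per lane and clock, so sustaining 128 FP32 FMAs/clk already requires

$$128 \times 3 \times 4\,\text{B} = 1.5\,\text{KB/clk}$$

of operand bandwidth per SM. Feeding all 192 lanes would take 2.25 KB/clk, and 64-bit DP operands double the per-op requirement again, so the register file becomes the ceiling well before the ALU count does.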
As for NV, I really don't think they implemented fully independent FP64 units in Kepler, even though their engineers state as much. That would be a waste of resources.

If you want to compare FP32 and FP64 units and their corresponding resource utilization, you need to take all the scheduling and routing logic into account. For a real-life comparison, just take AMD's GCN architecture. Pitcairn's CUs were roughly 4.5 mm² (1:16 FP64:FP32). Tahiti's were ~5.5 mm² (1:4 FP64 MUL:FP32 MUL; 1:2 FP64 ADD:FP32 ADD). For Hawaii there are no reliable die shots so far, but we can use Tahiti as a base and extrapolate from that. I don't think a Hawaii CU is drastically bigger than a Tahiti CU.
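Taking those (admittedly rough) area numbers at face value:

$$\frac{5.5 - 4.5}{4.5} \approx 22\%,$$

i.e. moving a GCN CU from 1:16 to 1:4 DP cost on the order of 20% extra area - the same kind of trade-off NV has to weigh for any mixed-precision Pascal SM.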

I don't want to compare video cards, but actual GPUs. If you do want to compare video cards, right now AMD delivers the one with the highest FP64 throughput per watt, the FirePro S9150, at over 10 FP64 GFLOPS/W. NV needs to beat that.
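For reference, from memory (so treat the exact figures as approximate): the S9150 is a 1:2-rate Hawaii with 2816 ALUs at 900 MHz and a 235 W board limit, which works out to

$$\frac{2816 \times 2 \times \tfrac{1}{2} \times 0.9\,\text{GHz}}{235\,\text{W}} \approx \frac{2534\,\text{DP GFLOPS}}{235\,\text{W}} \approx 10.8\,\text{GFLOPS/W}.$$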

NV will target 4:2:1 FP16:FP32:FP64; anything else would indicate that NV doesn't care about double precision.
 
As for NV, I really don't think they implemented fully independent FP64 units in Kepler, even though their engineers state as much. That would be a waste of resources.
Not in the case of the smaller Kepler chips, where dedicated FP64 units allow for a more compact SMX layout by reducing their count - though comparing the GK104 and GK110 multiprocessors, the savings aren't all that evident.
 
Apple's volumes are ginormous. If NV were to sell something like 80 million GPUs every quarter they would be shitting their pants out of sheer surprise and excitement. Alas, that's not the case.

Either way, SoCs/Apple or whatever else are not a sensible metric for comparing yields of GPU chips. If NV could get away with just 100 mm² chips and nothing else, that would be reason enough for them to rejoice. Are yields the same in cases where you have 5x or more the chip complexity/transistor count?

For the record, Apple isn't selling dramatically more chips on 16FF+ than it did at the start of 28nm, and Apple used only one foundry for the latter. So there's hardly any indication that yields are any better with 16FF+ than at the start of 28nm. Then comes cost, which is more in the double-"ouch" category for 16FF+.
 
Oh, my fault. I completely forgot about the FirePro series, or rather that AMD named the chips the same despite using a different FPU configuration (able to scale arbitrarily from 2:1 to 16:1 at the cost of a slightly increased transistor count).

But still, 2:1 isn't optimal in terms of SP performance / die size or power consumption.

Let's wait... a good night's sleep always helps.

With NV it's always like that: if they don't do any marketing around their next-gen DP performance, their next gen will simply lack DP performance. I think they will feature enough DP performance to be competitive and to provide a reasonable jump from GK180/GK210, which would be a factor of around 2~3x.
How do they achieve that?
If Pascal is not a huge jump from Maxwell, and Maxwell has dedicated FP64 units, Pascal will have them too. I think, if they include mixed precision, it will only affect the FP32 units. So in my opinion, GP100 could have something like 5000~6000 mixed-precision SPs and a number of DP units in the range of 1000~2000.
Volta could then bring a true change in the design of the stream processors.
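To sanity-check that range (my own back-of-the-envelope, using ~1 GHz as a placeholder clock): 1000~2000 dedicated DP units would give

$$1000\ldots2000 \times 2\,\text{FLOPs} \times 1\,\text{GHz} \approx 2\ldots4\,\text{DP TFLOPS},$$

i.e. roughly 1.5~3x a GK180 board (~1.4 DP TFLOPS, from memory), so the upper half of that unit count is what's needed for the 2~3x jump.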

I am just speculating... maybe this would be the other possibility. We have to keep in mind that NV introduced a new GPU microarchitecture with Maxwell, and HBM and FinFET will be a strain on NV, too. So maybe it would be better for NV to stick with the simplest option: take the mixed-precision units from GM20B (Tegra X1) and add some independent DP functionality.
 
I am just speculating... maybe this would be the other possibility. We have to keep in mind that NV introduced a new GPU microarchitecture with Maxwell, and HBM and FinFET will be a strain on NV, too. So maybe it would be better for NV to stick with the simplest option: take the mixed-precision units from GM20B (Tegra X1) and add some independent DP functionality.

You get twice the FP16 rate on GM20B only if, e.g., the ops are the same. That sounds too naive to me as an implementation, and I doubt they'd opt for something on the high end that works only under such conditions (wouldn't be the first time I'm wrong though...). Anyway, for the sake of the improbable I'll do a 180-degree twist and suggest that Pascal will have dedicated FP16 and FP64 SPs like the ULP SoC Rogue from IMG :p (oh and yes, it's obviously just a joke...)
 
You get twice the FP16 rate on GM20B only if, e.g., the ops are the same. That sounds too naive to me as an implementation, and I doubt they'd opt for something on the high end that works only under such conditions (wouldn't be the first time I'm wrong though...).
That sounds like VLIW2: a single instruction, but two sets of data words.
 
Given that data movement (even from registers) dominates the cost of computation so much, I wonder whether a design where each "SP" is a 64-bit unit with 3 operands and one result, and which can execute one DP, two SP or four HP operations per clock in some kind of limited VLIW setup, would make any sense.
 
Given that data movement (even from registers) dominates the cost of computation so much, I wonder whether a design where each "SP" is a 64-bit unit with 3 operands and one result, and which can execute one DP, two SP or four HP operations per clock in some kind of limited VLIW setup, would make any sense.
You and Nakai just gave me an idea...

What if each of these SPs is actually 256 bits wide: 192 bits in, 64 bits out. (Does that 192 bits ring a bell? Good. I didn't see the correlation until I did the math while writing this.) The compiler would be able to aggregate operations from two threads in a workgroup into a single SP VLIW2 instruction. For HP, that's actually a VLIW4 op, but still issued by only two threads. Only for DP would two threads actually need to issue their instructions in sequence.

That would, among other things, explain why Nvidia likes 192-bit memory interfaces so much...
 
GDDR5 chips have 32-bit data buses and are managed in pairs by a memory controller, yielding 64 bits. Depending on how factors like board complexity, target bandwidth, and capacity balance each other, jumping between powers of two in terms of memory controllers on a chip can be undesirable.
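For concreteness (standard GDDR5 numbers, nothing specific to any rumored part): a 192-bit interface is three 64-bit controllers, each driving a pair of 32-bit chips, and at, say, 7 Gbps per pin that works out to

$$\frac{192\,\text{bit} \times 7\,\text{Gbps}}{8} = 168\,\text{GB/s}.$$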

The controllers feed into cache lines of fixed length, and the SIMD lanes are behind multiple interfaces and layers of abstraction.
The one known example of 2x FP16 throughput uses a shared op because it avoids ripping out existing issue, register access, and scheduling logic.
The predicate bits remain the same, there's no need to double decoding, and the instruction won't do anything that would require ripping out the register file, since the doubled operation uses the same register IDs throughout.
Mixing "threads" in the absence of a simplifying assumption like a shared op involves revamping the hardware model presented to the system, since the hardware isn't increasing the number of bits for predication or other elements like scheduling slots.
 
Also, don't forget NVidia does lots of stuff in the compiler using knowledge of instruction and register throughputs, latencies and conflict mechanisms.
 
You get twice the FP16 rate on GM20B only if, e.g., the ops are the same. That sounds too naive to me as an implementation, and I doubt they'd opt for something on the high end that works only under such conditions (wouldn't be the first time I'm wrong though...). Anyway, for the sake of the improbable I'll do a 180-degree twist and suggest that Pascal will have dedicated FP16 and FP64 SPs like the ULP SoC Rogue from IMG :p (oh and yes, it's obviously just a joke...)

Of course the ops need to be the same, since these are SIMD units. And of course this kind of split should be implemented via VLIW-style ops. Does Maxwell have an array of dedicated FP64 units, or are these "enhanced" FP32 arrays with the ability to execute FP64 ops (at lower throughput)? If Maxwell has dedicated FP64 units, and Pascal is only a small step from Maxwell architecture-wise, it appears that Pascal will use a combination of FP64 and mixed-precision units. If they use pure mixed-precision units (with 4:2:1) and include, for example, 6000 SPs in their design, that corresponds to 3000 FP64 SPs (2:1), 6000 FP32 SPs, and 12000 FP16 SPs (4:2). So they would need to include 3000 large mixed-precision FP64 units. GK110 had an SP:DP ratio of 3:1, i.e. 960 FP64 SPs and 2880 FP32 SPs. Of course, I don't think those are fully "dedicated" FP64 SPs. So is the step from 960 FP64 SPs to 3000 mixed-precision FP64 SPs too big? If they go for true mixed-precision FP64 units, they need "VLIW4" ops in order to achieve their maximum FP16 throughput. Do they want that? I don't know...
 
There are rumors that Pascal will have no penalty or slowdown for using mixed precision; if they are true, then I don't think you will have different units.
 
There are rumors that Pascal will have no penalty or slowdown for using mixed precision; if they are true, then I don't think you will have different units.
Probably nothing different from how it's implemented in Tegra X1: vec2 packing reusing the FP32 ALUs, with all the implied practical limitations.
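For reference, this is roughly what that vec2 path looks like from the programmer's side on sm_53 (Tegra X1 / GM20B). A minimal sketch of my own using the cuda_fp16.h intrinsics - nothing Pascal-specific, and the kernel name is made up - just to show the "same op" restriction in practice: both FP16 values share one 32-bit register and one opcode.

```cuda
// Minimal sketch (my own example, not NVIDIA's) of the packed-FP16 path on
// sm_53 (Tegra X1 / GM20B): two FP16 values live in one 32-bit register and a
// single HFMA2 instruction operates on both lanes with the same op.
// Build with: nvcc -arch=sm_53 fp16x2.cu
#include <cuda_fp16.h>
#include <cstdio>

__global__ void saxpy_fp16x2(int n, float a, const float *x, const float *y, float *out)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (2 * i + 1 < n) {
        __half2 a2 = __float2half2_rn(a);                        // broadcast the scalar to both halves
        __half2 x2 = __floats2half2_rn(x[2 * i], x[2 * i + 1]);  // pack two adjacent elements
        __half2 y2 = __floats2half2_rn(y[2 * i], y[2 * i + 1]);
        __half2 r2 = __hfma2(a2, x2, y2);                        // one instruction, two FP16 FMAs
        out[2 * i]     = __low2float(r2);                        // unpack back to FP32
        out[2 * i + 1] = __high2float(r2);
    }
}

int main()
{
    const int n = 8;
    float *x, *y, *out;
    cudaMallocManaged(&x,   n * sizeof(float));
    cudaMallocManaged(&y,   n * sizeof(float));
    cudaMallocManaged(&out, n * sizeof(float));
    for (int i = 0; i < n; ++i) { x[i] = (float)i; y[i] = 1.0f; }

    saxpy_fp16x2<<<1, 32>>>(n, 2.0f, x, y, out);   // out[i] = 2*x[i] + y[i], computed in FP16
    cudaDeviceSynchronize();

    for (int i = 0; i < n; ++i) printf("%g ", out[i]);
    printf("\n");

    cudaFree(x); cudaFree(y); cudaFree(out);
    return 0;
}
```

The point being: if the two packed lanes wanted different operations, this path simply doesn't exist at the ISA level - which is exactly the conditional being discussed above.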
 
Nvidia Pascal win:

Along with touting the number of major HPC applications that are now GPU accelerated and the performance impact of that process, NVIDIA's other major focus at SC15 is to announce their next US government contract win. This time the National Oceanic and Atmospheric Administration (NOAA) is tapping NVIDIA to build a next-gen research cluster. The system, which doesn't currently have a name, is on a smaller scale than the likes of Summit & Sierra, and will be comprised of 760 GPUs. The cluster will be operational next year, and given the timing and the wording as a "next-generation" cluster, it's reasonable to assume that this will be Pascal powered.

http://www.anandtech.com/show/9791/...ation-to-build-tesla-weather-research-cluster
 
Nvidia talks Pascal: 16GB of memory at 1TB/s bandwidth

At the Japanese edition of NVIDIA's GPU Technology Conference, NVIDIA revealed some details about its 2016 graphics architecture, codenamed Pascal.

The Pascal GPU is fabbed at the Taiwan Semiconductor Manufacturing Company (TSMC) on the new 16nm FinFET process, which moves to non-planar transistors and should result in significant power savings.

Pascal will bring support for up to 32GB of HBM2 memory. Initial Pascal parts will launch with 16GB of HBM2 memory from SK Hynix and Samsung. The 16GB of HBM2 SDRAM (packed in four 4GB HBM2 stacks) will offer 1TB/s of bandwidth.
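That bandwidth figure is consistent with the standard HBM2 numbers: each stack has a 1024-bit interface at up to 2 Gbps per pin, so

$$4 \times \frac{1024\,\text{bit} \times 2\,\text{Gbps}}{8} = 4 \times 256\,\text{GB/s} = 1\,\text{TB/s}.$$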

"Pascal will also be available in multi-GPU packaging, replacing the Tesla K80 (NVIDIA skipped Maxwell-gen dual-GPU Tesla). Combined figures are very interesting to compare – 24GB GDDR5 and 480GB/s bandwidth should be replaced with 32GB HBM2 and 2TB/s bandwidth, mutually connected through NVLink rather than PCIe. The NVLink will enable up to 80GB/s, which should replace PLX PCIe Gen3 bridge chips that can only support 16GB/s (8GB/s per GPU). This part should be ‘warm up’ for 2018 and the Volta architecture".
http://www.guru3d.com/news-story/nvidia-talks-pascal-16gb-of-memory-and-1tbs-bandwidth.html
 
Page 7 of this NVIDIA presentation has a DP performance and bandwidth roadmap for Tesla GPUs.

Pascal: ~4000 DP GFLOPS, ~1000 GB/s
Volta: ~7000 DP GFLOPS, ~1200 GB/s
(GFLOPS and bandwidth seem to be accurate to 2 and 3 significant figures respectively)

My guess is 1:2 DP for the relevant chips for ~8000 SP GFLOPS on Pascal, which would give ~32 (enabled) "SMP"s at ~980 MHz, ~36 SMPs at ~870 MHz, and ~40 SMPs at ~780 MHz (assuming 128 SP CCs per "SMP," this also counts enabled SMPs only).
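The arithmetic behind those pairings, for anyone who wants to check: SP GFLOPS = SMPs × 128 CCs × 2 FLOPs/FMA × clock, e.g.

$$32 \times 128 \times 2 \times 0.98\,\text{GHz} \approx 8028\,\text{SP GFLOPS},$$

and the 36/870 and 40/780 combinations land at the same ~8 TFLOPS.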
 
Page 7 of this NVIDIA presentation has a DP performance and bandwidth roadmap for Tesla GPUs.

Pascal: ~4000 DP GFLOPS, ~1000 GB/s
Volta: ~7000 DP GFLOPS, ~1200 GB/s
(GFLOPS and bandwidth seem to be accurate to 2 and 3 significant figures respectively)

My guess is 1:2 DP for the relevant chips for ~8000 SP GFLOPS on Pascal, which would give ~32 (enabled) "SMP"s at ~980 MHz, ~36 SMPs at ~870 MHz, and ~40 SMPs at ~780 MHz (assuming 128 SP CCs per "SMP," this also counts enabled SMPs only).

A third more SMs and raw SP FLOPs than the Titan X doesn't sound anywhere near enough for a high-end Pascal given the process change. 1:3 DP sounds more realistic to me.
 
Can Pascal support 1:3 DP? I was under the impression that Pascal SMs are similar to Maxwell SMs, and Maxwell only supports 1:2^n DP as far as I am aware. If it can, then I agree that 1:3 makes a lot more sense.

Also, Teslas have lower clock speeds than GeForces, at least for the Fermi and later big chips, so regarding my above speculation I would expect a corresponding TITAN to reach 9-9.5 SP TFLOPS.

EDIT: "only supports 1:2^n DP" refers to Maxwell, not Pascal.
 