Nvidia Pascal Speculation Thread

Half-precision (16-bit float), once undesirable, is now a selling point. Why'd they get rid of it in the first place?
Back when GPUs were Graphics Processing Units, FP16 wasn't "sufficient" for blending after enough passes in a lot of cases. Now that GPUs are General Purpose Compute Units, the lower precision can be useful once again for specific cases.

That statement isn't exactly true, but it's truly the compute side that makes it useful, and mostly the graphics side that had previously decided that it wasn't.
 
FP16 was used back in the days when the cost of the ALU parts on the die was still a decisive design factor and going full-time with an FP32 shader pipeline wasn't feasible. Today, it's more the power cost (and bandwidth) that dictates this design direction, together with the significantly expanded application field for GPUs.
 
You all realise that no NVIDIA GPU ever had a FP16 Multiply-Add unit before Tegra X1, right? They were using FP16 *registers* and also had a dedicated FP16 Normalize unit, but the core ALUs were always FP32.

I'm also not sure how removing FP16 support had anything to do with blending, as the blender was always a separate unit on NVIDIA/AMD GPUs... To a large extent NV3x/NV4x never really benefited from FP16 as much as they could have; FP16 wasn't fast - FP32 was slow!

While there are many power benefits, there's mostly a significant performance benefit from the fact that it's a Vec2 ALU (i.e. it is simply reading 32-bit registers as 2x16-bit). Don't focus only on the area cost of the ALU - you should consider the area/power cost of the register file too...

(Rampage wasn't FP16. Mojo was meant to have FP16 ALUs but that was likely little more than a draft specification/wishlist lying on an architect's desk by the time 3dfx went bankrupt...)
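To make the "Vec2 ALU" point concrete, here is a minimal sketch of what packed FP16 math looks like in CUDA (cuda_fp16.h, compute capability 5.3+, i.e. Tegra X1 and presumably its successors); the kernel and parameter names are made up for illustration:

Code:
// One 32-bit register holds two FP16 values ("half2"); a single packed
// instruction operates on both 16-bit lanes at once.
#include <cuda_fp16.h>

__global__ void saxpy_fp16x2(int n2, __half2 a, const __half2 *x, __half2 *y)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n2) {
        // __hfma2 issues one fused multiply-add over both halves of the register.
        y[i] = __hfma2(a, x[i], y[i]);
    }
}

The register-file argument is visible here too: the same 32-bit register that would hold one FP32 value now carries two FP16 values, so the throughput can double without adding register file capacity.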
 
On NV47/PS3 it was very important to use FP16 registers where possible in order to save register space and get more threads running, which increased performance in most cases, except when one ended up thrashing the texture cache :)

Funnily enough, in those cases one could fool the GPU by pretending to use more registers than necessary, to scale down the number of threads, avoid thrashing the texture cache and get better performance. Oh nostalgia.. :)
 
Theoretically games could already support this, since D3D11.1 supports specifying variables that are allowed to have lower precision (the lower precision isn't guaranteed, so this should always work, as the driver is free to ignore it). Doubtful though that anyone already does this...
With no FP16 desktop hardware available for testing, this would have been equal to shipping code without even running it once. I assume nobody was crazy enough to do this.
While there are many power benefits, there's mostly a significant performance benefit from the fact that it's a Vec2 ALU (i.e. it is simply reading 32-bit registers as 2x16-bit). Don't focus only on the area cost of the ALU - you should consider the area/power cost of the register file too...
The GPU's latency hiding depends on the active thread count. Lower register usage means more threads can be run simultaneously. This is the biggest performance gain from FP16. I am also waiting to eventually get 16-bit integers, since that provides similar gains. 16-bit integers are big enough for most purposes (and obviously there is no precision loss either if your value fits in 16 bits). We already do manual bit-packing tricks to reduce integer register (and LDS) pressure. GCN has one-cycle mask+shift instructions making packing/unpacking really fast. It also has instructions to pack/unpack a pair of FP16 values to a 32-bit register. So you can emulate some of the gains for variables that are infrequently used (every pack/unpack pair is 2 extra instructions, so you don't want to do this often). Obviously this kind of emulation provides zero power savings and adds some ALU instructions (and costs developer time). Native FP16 and 16-bit integers are very much welcome.
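For illustration, a rough CUDA sketch of the manual 16-bit packing trick described above (helper names are made up; on GCN the mask+shift happens in one cycle, this is just the generic form):

Code:
#include <stdint.h>

// Two 16-bit integers share one 32-bit register; each pack/unpack pair costs
// a couple of ALU instructions, so you only do it for infrequently used values.
__device__ inline uint32_t pack_u16x2(uint32_t lo, uint32_t hi)
{
    return (lo & 0xFFFFu) | (hi << 16);   // mask + shift + or
}

__device__ inline void unpack_u16x2(uint32_t v, uint32_t &lo, uint32_t &hi)
{
    lo = v & 0xFFFFu;                      // mask
    hi = v >> 16;                          // shift
}

As noted above, this emulation saves no power, but it halves the register (or LDS) footprint of the packed values.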
 
Half-precision (16-bit float), once undesirable, is now a selling point. Why'd they get rid of it in the first place?

That's a good question for those green birds around here that were passionately evangelizing how "useless" it is. As already mentioned: the more power-sensitive things get, the more IHVs will look for solutions to increase efficiency in every possible use case.
 
That's a good question for those green birds around here that were passionately evangelizing how "useless" it is. As already mentioned: the more power-sensitive things get, the more IHVs will look for solutions to increase efficiency in every possible use case.
Green or red? Or which green or red, I forget who's what color now.
The nice thing with this modern take is that the FP16 functionality is overlaid on a hardware base that is robust from a 32-bit standpoint. Use cases that need the higher precision perform normally, while cases that can use FP16 can do better. I think people would notice it more if the next GPU came out with register and ALU resources optimized to FP16 without a change in the headline register and unit counts.

I'm curious if the 8-bit granularity extract the most recent GCN ISA document exposes is useful for any notable situations, given the rather low ceiling that could entail.
 
Green or red? Or which green or red, I forget who's what color now.

I don't care about the colour as long as it's not pink; after two daughters some fathers may sense why :p

The nice thing with this modern take is that the FP16 functionality is overlaid on a hardware base that is robust from a 32-bit standpoint. Use cases that need the higher precision perform normally, while cases that can use FP16 can do better. I think people would notice it more if the next GPU came out with register and ALU resources optimized to FP16 without a change in the headline register and unit counts.

I'm curious if the 8-bit granularity extract the most recent GCN ISA document exposes is useful for any notable situations, given the rather low ceiling that could entail.

I'm curious whether Pascal will take the same path as the X1 GPU, or whether FP16 comes with fewer conditions, or none at all.
 
The amount of sensationalist journalism reporting that "10x Maxwell" claim as "next GPU 10x as fast as Maxwell" is ridiculous. At least I cleaned up a few annoying sources on Facebook.
 
I'm curious if the 8-bit granularity extract the most recent GCN ISA document exposes is useful for any notable situations, given the rather low ceiling that could entail.

There are many science codes where the data has a very low dynamic range. Using 8 bits is enough, and in return you save storage, get better caching and (above all!) save power through less data movement. At GTC this week they gave some examples for 8-bit SGEMM, one of which was radio astronomy data. The win wasn't the higher ALU throughput, but the smaller data size.
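As a rough CUDA sketch of that kind of code (kernel and array names are made up): the input stays packed as 8-bit samples, the arithmetic runs at higher precision in registers, and the win is the 4x smaller input traffic compared to FP32 data.

Code:
// One 32-bit load carries four 8-bit samples; only the stored data is 8-bit,
// the math and the accumulator stay in wider precision.
__global__ void sample_power_int8(int n4, const char4 *in, float *out)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n4) {
        char4 s = in[i];
        float p = (float)(s.x * s.x + s.y * s.y + s.z * s.z + s.w * s.w);
        atomicAdd(out, p);   // crude reduction, fine for a sketch
    }
}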
 
The amount of sensationalist journalism reporting that "10x Maxwell" claim as "next GPU 10x as fast as Maxwell" is ridiculous. At least I cleaned up a few annoying sources on Facebook.

Indeed, I like this way of calculating performance gains... if only it were that easy.
 
That figure was apparently in the context of a few GPUs linked together (two? four?) with NVLink, and thus working similarly to SMP for CPUs; a system with graphics cards plugged into PCIe slots is more like a cluster or "shared nothing" architecture, each GPU being its own little island.

You go looking for a workload that needs SMP-like scaling ("scaling up" rather than "scaling out") and works with the FP16 sauce, and then the figure looks plausible, if expensive.
 
That figure was apparently in the context of a few GPUs linked together (two? four?) with NVLink, and thus working similarly to SMP for CPUs; a system with graphics cards plugged into PCIe slots is more like a cluster or "shared nothing" architecture, each GPU being its own little island.

You go looking for a workload that needs SMP-like scaling ("scaling up" rather than "scaling out") and works with the FP16 sauce, and then the figure looks plausible, if expensive.

It was his "boss calculation method"... lol (his own words).



Of course, in a specific, particular case that wasn't supported before you could end up with that, but we all know performance will not even be doubled...

(Anyway, maybe he was talking about the FP64 rate, who knows... 200 GFLOPS on Maxwell x 10 = a 2 TFLOPS rate, lol.)

Sorry, this is a bit aggressive, but I really can't stand it when Nvidia's boss does marketing à la Apple. I believe next year Nvidia will use FP16 to quote their TFLOPS figures (similar to the Tegra X1 presentation) and use compression algorithms to calculate memory bandwidth (similar to their 960 launch)... With all the respect I have for this company, lately I really don't understand where they want to go with this type of marketing, especially when it's done at a conference for professionals...

They can keep this type of marketing for selling smartphones or their Tegra tablets...
 
With all the respect I have for this company, lately I really don't understand where they want to go with this type of marketing, especially when it's done at a conference for professionals...
So he's not allowed to say to researchers that, thanks to NVLink, weight propagation between connected GPUs will be improved by roughly a factor of 10x? Because that's how I read this particular slide.

I only watched the video of the second keynote by the Google guy, but he said that they had to reduce the interconnectedness of the neural networks to allow GPUs to recalculate weights in parallel. So it seems that this is a relevant improvement for this particular audience.

I appreciate your concern that researchers at such a conference may think 'OMG everything 10X faster', but maybe you should give the targeted audience (again: researchers) a bit more credit?

You're advocating that even keynotes of scientific conferences should take the lowest common denominator of the web-viewing public into account. In that case, everybody might as well stay home and read YouTube comments.
 
So he's not allowed to say to researchers that, thanks to NVLink, weight propagation between connected GPUs will be improved by roughly a factor of 10x? Because that's how I read this particular slide.

I only watched the video of the second keynote by the Google guy, but he said that they had to reduce the interconnectedness of the neural networks to allow GPUs to recalculate weights in parallel. So it seems that this is a relevant improvement for this particular audience.

I appreciate your concern that researchers at such a conference may think 'OMG everything 10X faster', but maybe you should give the targeted audience (again: researchers) a bit more credit?

You're advocating that even keynotes of scientific conferences should take the lowest common denominator of the web-viewing public into account. In that case, everybody might as well stay home and read YouTube comments.

Because for you, the title of the slide is not readable enough (let alone his words)?

But you are probably right... Anyway, it is quite funny to see so many sites reporting "Pascal will be 10x faster than Maxwell"... so what makes them think that?
 
It's probably worth not getting too caught up in the 4x (FP16) and 10x (God knows what) performance-per-watt claims, since the baseline 2x perf/W figure is already damn impressive. Lest we forget, that equates to Titan X performance at 125W. I can live with that!

Imagine what they could do in the 980's 165W envelope!
 
Actually, the 10x Maxwell figure is an estimate on his part (he said as much in the video), but I believe he is drawing this estimate from the working Pascal and NVLink parts they currently have. Once you factor in HBM, I really don't think the 10x is far from what we should expect, but time will tell.
 