Back when GPUs were Graphics Processing Units, FP16 often wasn't "sufficient" for blending after enough passes. Now that GPUs are general-purpose compute units, the lower precision can be useful once again for specific cases.

Half-precision (16-bit float), once undesirable, is now a selling point. Why'd they get rid of it in the first place?
With no FP16 desktop hardware available for testing, shipping such code would have been like shipping code without ever running it once. I assume nobody was crazy enough to do that.

Theoretically, games could already support this, since D3D11.1 supports marking variables as allowed to have reduced precision (the lower precision isn't guaranteed, so this should always work; the driver is free to ignore the hint). It's doubtful anyone already does this, though...
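To make that concrete, here is a minimal HLSL sketch using the D3D11.1 minimum-precision types; the resource and function names are illustrative, and a driver without FP16 support simply computes everything at full FP32:

```hlsl
// Hypothetical pixel shader using D3D11.1 minimum-precision types.
// min16float declares that FP16 precision is *sufficient*, not required,
// so the result is correct whether or not the driver takes the hint.
Texture2D<float4> gAlbedo  : register(t0); // illustrative resource names
SamplerState      gSampler : register(s0);

float4 PSMain(float4 pos : SV_Position, float2 uv : TEXCOORD0) : SV_Target
{
    // Color math rarely needs full FP32, so hint the driver.
    min16float4 albedo = (min16float4)gAlbedo.Sample(gSampler, uv);
    min16float  gain   = (min16float)1.5;
    return (float4)(albedo * gain);
}
```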
GPU latency hiding depends on the active thread count. Lower register usage means more threads can run simultaneously; this is the biggest performance gain from FP16. I am also waiting to eventually get 16-bit integers, since those provide similar gains. 16-bit integers are big enough for most purposes (and obviously there is no precision loss either, if your value fits in 16 bits). We already do manual bit-packing tricks to reduce integer register (and LDS) pressure. GCN has single-cycle mask+shift instructions that make packing/unpacking really fast, and it also has instructions to pack/unpack a pair of FP16 values in a 32-bit register. So you can emulate some of the gains for variables that are used infrequently (every pack/unpack pair is two extra instructions, so you don't want to do this often); a sketch follows below. Obviously this kind of emulation provides zero power savings, adds some ALU instructions, and costs developer time. Native FP16 and 16-bit integers are very much welcome.

While there are many power benefits, the significant performance benefit comes mostly from the fact that it's a vec2 ALU (i.e. it is simply reading 32-bit registers as 2x16-bit). Don't focus only on the area cost of the ALU; you should consider the area/power cost of the register file too...
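For illustration, a minimal HLSL compute-shader sketch of that pack/unpack emulation, assuming f32tof16/f16tof32 compile down to GCN's FP16 conversion instructions; the buffer, helper, and variable names are made up for the example:

```hlsl
// Hypothetical compute shader illustrating the pack/unpack emulation.
// Each pack/unpack pair costs a couple of extra ALU instructions, so it
// only pays off for values that are stored often but touched rarely.
RWStructuredBuffer<float2> gOut : register(u0); // illustrative output

uint PackHalf2(float2 v)
{
    // f32tof16 returns the FP16 bits in the low 16 bits of a uint.
    return f32tof16(v.x) | (f32tof16(v.y) << 16);
}

float2 UnpackHalf2(uint p)
{
    return float2(f16tof32(p & 0xFFFF), f16tof32(p >> 16));
}

// Storing two FP16 values per 32-bit word halves the LDS footprint.
groupshared uint gsPackedNormalXY[64];

[numthreads(64, 1, 1)]
void CSMain(uint tid : SV_GroupIndex)
{
    float2 n = float2(0.5, -0.25);              // placeholder value
    gsPackedNormalXY[tid] = PackHalf2(n);       // pack: 2 extra ALU ops
    GroupMemoryBarrierWithGroupSync();
    gOut[tid] = UnpackHalf2(gsPackedNormalXY[tid ^ 1]); // unpack neighbor
}
```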
Green or red? Or which green or red, I forget who's what color now.

That's a good question for those green birds around here who were passionately evangelizing how "useless" it is. As already mentioned: the more power-sensitive things get, the harder IHVs will look for ways to increase efficiency in every possible use case.
The nice thing with this modern take is that the FP16 functionality is overlaid on a hardware base that is robust from a 32-bit standpoint. Use cases that need the higher precision perform normally, while cases that can use FP16 do better. I think people would notice it more if the next GPU came out with register and ALU resources optimized for FP16 without a change in the headline register and unit counts.
I'm curious whether the 8-bit granularity extract that the most recent GCN ISA document exposes is useful in any notable situations, given the rather low ceiling that could entail.
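For what it's worth, one obvious candidate is byte-packed data; a minimal HLSL sketch, assuming the compiler can map the usual shift+mask idiom onto such byte-granularity extracts (the function name is made up):

```hlsl
// Hypothetical example: unpacking an RGBA8 color stored in one uint.
// Plain HLSL expresses this as a shift+mask per channel; hardware with
// 8-bit granularity extracts could collapse each channel to one op.
float4 UnpackRGBA8(uint bits)
{
    return float4( bits        & 0xFF,
                  (bits >> 8)  & 0xFF,
                  (bits >> 16) & 0xFF,
                   bits >> 24) / 255.0;
}
```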
The amount of sensationalist journalism reporting that "10x Maxwell" figure as "next GPU 10x as fast as Maxwell" is ridiculous. At least I cleaned up a few annoying sources on Facebook.
That figure was apparently in the context of a few GPUs (two? four?) linked together with NVLink, and thus working similarly to SMP for CPUs; a system with graphics cards plugged into PCIe slots is more like a cluster or "shared nothing" architecture, each GPU being its own little island.
You go looking for a workload that needs SMP-like scaling ("scaling up" rather than "scaling out") and works with the FP16 sauce, and then the figure looks plausible, if expensive.
So he's not allowed to tell researchers that, thanks to NVLink, weight propagation between connected GPUs will be improved by roughly 10X? Because that's how I read this particular slide.

With all the respect I have for this company, lately I really don't understand where they want to go with this type of marketing... especially when it was done at a conference for professionals.
I only watched the video of the second keynote, by the Google guy, but he said that they had to reduce the interconnectedness of their neural networks to allow GPUs to recalculate weights in parallel. So it seems that this is a relevant improvement for this particular audience.
I appreciate your concern that researchers at such a conference may think 'OMG everything 10X faster', but maybe you should give the targeted audience (again: researchers) a bit more credit?
You're advocating that even the keynotes of scientific conferences should take the lowest common denominator of the web-viewing public into account. In that case, everybody might as well stay home and read YouTube comments.