Alessio1989
Another thing I do not understand is why GPU manufacturers are moving to FP16 support only now, after more than half a decade of fighting over DP benchmarks. I do not believe it is just a hardware design choice (mining, anyone? xD)
I hope mining fads are finally gone. They want to use one architecture for all markets, mobile included. FP16 was always very useful there.
> Another thing I do not understand is why GPU manufacturers are moving to FP16 support only now, after more than half a decade of fighting over DP benchmarks. I do not believe it is just a hardware design choice (mining, anyone? xD)
For graphics, it seems to be a mobile/power thing. For compute, the only case I've heard is neural networks, but that seems to be a really big deal.
> I'm not sure if that's also what your posting implied, silent_guy, but I don't think current architectures can fetch an additional half-register (16 bits) to fully populate a 32-bit register that's already half used with 16-bit data. So you'd have to re-fetch everything, which would negate the power saving effect of 16-bit usage. Without any hard data, I can imagine that to be able to do that (fetch and populate 16-bit portions of registers independently), your data paths and control would need to be quite a bit more complex.
If I'm understanding him correctly, Alessio is asking why not make everything 64-bit based and stash two 32-bit values in a single register. I think that would only make sense if you can then use those two 32-bit values at the same time; otherwise you have a way overbuilt 64-bit system that hardly ever gets used. But that would introduce quite a few constraints to optimize for.
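For concreteness, here is a minimal CUDA sketch of the packing idea being discussed (not from the thread and not any vendor's actual implementation; the kernel name is made up, and it assumes a recent CUDA toolkit plus an sm_53-or-newer GPU): two 16-bit values share one 32-bit register, and a single packed instruction operates on both at the same time, which is exactly the "use both values at once" condition raised above.

```cuda
#include <cuda_fp16.h>
#include <cuda_runtime.h>
#include <cstdio>

// Two FP16 values packed into one 32-bit register (__half2); one packed
// FMA instruction then processes both halves at the same time.
__global__ void packed_fp16_fma(const __half2* a, const __half2* b,
                                const __half2* c, __half2* out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = __hfma2(a[i], b[i], c[i]);   // out = a * b + c, per 16-bit half
}

int main()
{
    const int n = 1024;                       // n __half2 elements = 2*n FP16 values
    __half2 *a, *b, *c, *out;
    cudaMallocManaged(&a,   n * sizeof(__half2));
    cudaMallocManaged(&b,   n * sizeof(__half2));
    cudaMallocManaged(&c,   n * sizeof(__half2));
    cudaMallocManaged(&out, n * sizeof(__half2));

    for (int i = 0; i < n; ++i) {
        a[i] = __floats2half2_rn(1.5f, 2.5f); // pack two different values per register
        b[i] = __floats2half2_rn(2.0f, 2.0f);
        c[i] = __floats2half2_rn(0.5f, 0.5f);
    }

    packed_fp16_fma<<<(n + 255) / 256, 256>>>(a, b, c, out, n);
    cudaDeviceSynchronize();

    // Unpack the low/high halves of the first result: expect 3.5 and 5.5.
    printf("%f %f\n", __low2float(out[0]), __high2float(out[0]));

    cudaFree(a); cudaFree(b); cudaFree(c); cudaFree(out);
    return 0;
}
```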
> For graphics, it seems to be a mobile/power thing. For compute, the only case I've heard is neural networks, but that seems to be a really big deal.
I know it was mostly related to mobile before; in fact, D3D 11.1 and Windows 8.0 introduced HLSL minimum precision (FP16/FP10) mostly for Windows Phone mobile: https://msdn.microsoft.com/en-us/library/hh404562.aspx#Min_Precision
> If I'm understanding him correctly, Alessio is asking why not make everything 64-bit based and stash two 32-bit values in a single register. I think that would only make sense if you can then use those two 32-bit values at the same time; otherwise you have a way overbuilt 64-bit system that hardly ever gets used. But that would introduce quite a few constraints to optimize for.
This.
> Like these, which show a dramatic image quality reduction?
Those are of course specifically crafted to show differences between FP16 and FP32. No one is arguing that everything should be forced to FP16, but if you've got some algorithm you know would give an "exact enough" result with FP16, why burn 80% (*) more power or so by doing it with FP32 (and in today's environments, that translates directly into either lower battery life or roughly 40% lower performance)?
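A quick back-of-the-envelope for that claim, taking the 80% figure above as an assumed per-operation energy ratio (the (*) footnote is not reproduced here), just to show how "80% more power" and "about 40% lower performance" are two sides of the same number:

```latex
% Assumption: one FP32 op costs about 1.8x the energy of an FP16 op.
E_{\mathrm{fp32}} \approx 1.8\,E_{\mathrm{fp16}}
\;\Rightarrow\;
\left.\frac{\text{FP32 throughput}}{\text{FP16 throughput}}\right|_{\text{fixed power}}
\approx \frac{1}{1.8} \approx 0.56
```

So at a fixed power budget FP32 lands at roughly 55-60% of the FP16 rate, i.e. in the ballpark of the 40% performance loss quoted above; at equal throughput it costs about 80% more power.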
Only now?! I do not understand why it should exist in the first place...
> Only now?! I do not understand why it should exist in the first place...
My hunch is that you will suddenly understand it loud and clear once it's supported in a superior way by AMD.
> Those are of course specifically crafted to show differences between FP16 and FP32. No one is arguing that everything should be forced to FP16, but if you've got some algorithm you know would give an "exact enough" result with FP16, why burn 80% (*) more power or so by doing it with FP32 (and in today's environments, that translates directly into either lower battery life or roughly 40% lower performance)?
These points bear repeating. Pathological tests have limited real-world relevance.
> Another thing I do not understand is why GPU manufacturers are moving to FP16 support only now, after more than half a decade of fighting over DP benchmarks. I do not believe it is just a hardware design choice (mining, anyone? xD)
Mining is generally cryptographic hashing, and AMD's initial GPU popularity for things like Litecoin came from its better integer handling.
> However, the physical node stagnation and the much lower tolerance for power consumption in mobile mean the implementations need to start making changes that compromise the uniformity and generality of the hardware to make up for the fact that their silicon isn't helping them along.
Not only there, but massively larger scale installations are becoming more and more TCO-constrained, which in the long term boils down to power (density, consumption, cooling, electricity bill).
> If I'm understanding him correctly, Alessio is asking why not make everything 64-bit based and stash two 32-bit values in a single register. I think that would only make sense if you can then use those two 32-bit values at the same time; otherwise you have a way overbuilt 64-bit system that hardly ever gets used. But that would introduce quite a few constraints to optimize for.
In a full-rate FP16 case, you would need four 16-bit values to fill the same size register, and since that is something that will happen in Pascal, it can't be too difficult to accomplish. So logic dictates that finding just two 32-bit values to work with would be only half as difficult as finding four...
"Judgment Day is inevitable."But FP64 and sometimes FP32 is wasted on neural networks, and that seems to be where all the money is going right now (for good reason it seems.)
> Mining is generally cryptographic hashing, and AMD's initial GPU popularity for things like Litecoin came from its better integer handling.
Bitcoin mining on GPUs started with CUDA back in 2010, but it was quickly realised that the 32-bit integer operations on AMD were way, way faster.
16-bit precision versus 24 versus 32 was something that was fought over on the way up to modern GPUs.
> Another thing I do not understand is why GPU manufacturers are moving to FP16 support only now, after more than half a decade of fighting over DP benchmarks. I do not believe it is just a hardware design choice (mining, anyone? xD)
GCN uses 4-way round-robin scheduling. It issues one instruction from a wave every 4th cycle, switching between four waves. This reduces the perceived latency (all common instruction results are immediately usable in the next instruction slot, simplifying the scheduling a lot), but it also means that IPC per thread is quite low. GCN needs a huge number of threads in flight simultaneously to utilize the hardware, and each thread needs to reserve enough registers for the maximum register usage in the shader code (even if the most expensive branch was not taken). This is starting to become a problem.
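A rough sketch of why that per-thread register reservation matters, using the 64 KB-per-SIMD figure quoted a few posts down plus the standard GCN numbers of 64-wide waves and 4-byte VGPRs (the arithmetic is mine, not from the post):

```latex
% One VGPR for a 64-wide wave occupies 64 \times 4\,\mathrm{B} = 256\,\mathrm{B},
% so a 64\,\mathrm{KB} register file holds 65536 / 256 = 256 wave-VGPRs per SIMD:
\text{waves per SIMD} \approx \left\lfloor \frac{256}{\text{VGPRs per wave}} \right\rfloor
\qquad \text{e.g. } 32 \to 8,\; 64 \to 4,\; 128 \to 2
```

(Hardware also caps this at 10 waves per SIMD.) Packing two FP16 values into one 32-bit VGPR roughly halves the VGPR count for FP16-heavy code, which feeds straight back into this wave count and therefore into latency hiding.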
> GCN uses 4-way round-robin scheduling. It issues one instruction from a wave every 4th cycle, switching between four waves. This reduces the perceived latency (all common instruction results are immediately usable in the next instruction slot, simplifying the scheduling a lot), but it also means that IPC per thread is quite low. GCN needs a huge number of threads in flight simultaneously to utilize the hardware, and each thread needs to reserve enough registers for the maximum register usage in the shader code (even if the most expensive branch was not taken). This is starting to become a problem.
That is true, but GCN's heavier reliance on concurrency and its occupancy concerns have been around for years now, and GCN is the laggard in this regard.
> Every computational device seems to be power constrained nowadays.
Yesterday, too.
> GCN uses 4-way round-robin scheduling. It issues one instruction from a wave every 4th cycle, switching between four waves.
No, it doesn't. A single hardware thread can issue to the VALU/SALU for as long as the code that fits into the instruction cache can run, until it hits a latency event (read/write memory, constant buffer, LDS, etc.).
> This reduces the perceived latency (all common instruction results are immediately usable in the next instruction slot, simplifying the scheduling a lot),
GCN can do that because the math pipeline is very short. It's no different in this regard from the VLIW chips.
> but it also means that IPC per thread is quite low. GCN needs a huge number of threads in flight simultaneously to utilize the hardware, and each thread needs to reserve enough registers for the maximum register usage in the shader code (even if the most expensive branch was not taken). This is starting to become a problem.
I think the more illuminating number is the peak register file bandwidth: 64 CUs with 64 lanes each, using 16 bytes at 1 GHz = 64 TB/s.
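Unpacking that figure, on the assumption that the 16 bytes stands for four 32-bit operands per lane per cycle (e.g. three source reads plus one destination write for an FMA):

```latex
64\ \text{CUs} \times 64\ \text{lanes} \times 16\ \tfrac{\text{B}}{\text{lane}\cdot\text{cycle}} \times 10^{9}\ \tfrac{\text{cycles}}{\text{s}}
= 65{,}536\ \text{GB/s} \approx 64\ \text{TB/s}
```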
There are 64 KB of registers per 16-wide SIMD, and four SIMDs per CU (256 KB of registers per CU in total). Fiji has 64 CUs, so in total the Fiji chip holds 16 megabytes of registers! These massive register files take up a big part of the die area and consume a big part of the total power. The sizes of the L1 caches (64 x 16 KB = 1 MB total) and the L2 cache (2 MB) are tiny compared to the register files.
> People have been optimizing memory bandwidth (8/16-bit normalized textures/buffers, DXT compression, bit-packing data) for ages. Memory optimization has been more important than ALU optimization for a long time. However, there has not been a good way to optimize register file storage. 16-bit integer and 16-bit float types allow programmers to optimize register file storage. This allows the GPU to run more complex kernels and/or run more threads concurrently, as the register file size is a hard limit on concurrency. Without enough concurrency, the GPU cannot hide memory latency, and it stalls often. The GPU doesn't have OoO machinery to hide the memory stalls, meaning that the stalls hurt a lot (several hundred cycles lost * 64-wide CU = 10k+ FLOPs).
NVidia abandoned OoO machinery. It just isn't worth the power and die area. NVidia relies upon static compilation.
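To put the quoted register-file-storage point in code form, here is a hypothetical little kernel (mine, not from the thread; same recent-toolkit and sm_53+ assumptions as the earlier sketch): the working set is held as packed __half2, so eight live values occupy four 32-bit registers instead of eight, and the bytes fetched from memory are halved as well, while the arithmetic still accumulates in FP32.

```cuda
#include <cuda_fp16.h>
#include <cuda_runtime.h>
#include <cstdio>

// Eight live FP16 taps held in four 32-bit registers (as float they would
// need eight), halving the register footprint and the memory traffic;
// the accumulation itself stays in FP32.
__global__ void average8_fp16(const __half2* in, float* out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;

    __half2 taps[4];                           // 8 FP16 values in 4 registers
    #pragma unroll
    for (int t = 0; t < 4; ++t)
        taps[t] = in[i * 4 + t];

    float acc = 0.0f;
    #pragma unroll
    for (int t = 0; t < 4; ++t) {
        float2 f = __half22float2(taps[t]);    // unpack to FP32 for the math
        acc += f.x + f.y;
    }
    out[i] = acc * 0.125f;                     // mean of the 8 taps
}

int main()
{
    const int n = 256;
    __half2* in;
    float* out;
    cudaMallocManaged(&in,  n * 4 * sizeof(__half2));
    cudaMallocManaged(&out, n * sizeof(float));

    for (int i = 0; i < n * 4; ++i)
        in[i] = __floats2half2_rn(1.0f, 3.0f); // every pair averages to 2.0

    average8_fp16<<<(n + 255) / 256, 256>>>(in, out, n);
    cudaDeviceSynchronize();
    printf("%f\n", out[0]);                    // expect 2.0

    cudaFree(in); cudaFree(out);
    return 0;
}
```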
> That is true, but GCN's heavier reliance on concurrency and its occupancy concerns have been around for years now, and GCN is the laggard in this regard.
Yes, AMD's only been doing 8-bit packed media instructions since late VLIW, such a laggard.
Mobile IPs have had lower precisions for years (or never lost them), Nvidia's mobile and now desktop IPs have brought back half precision, and Intel's graphics have even had 8-bit packed formats within 32-bit words for media processing.