AMD: Speculation, Rumors, and Discussion (Archive)

Another thing I do not understand is why GPU manufacturers move only now to FP16 support after more than half a decade of fighting over DP benchmarks. I do not believe that is just a hardware design choice (mining anyone? xD)
 
I hope mining fads are finally gone. They want to use one architecture for all markets, mobile included. FP16 was always very useful there.
 
I hope mining fads are finally gone. They want to use one architecture for all markets, mobile included. FP16 was always very useful there.

I think even FPGAs have been pushed out of the mining game, it's mostly ASICs now. So GPUs are probably quite safe from mining-related interference.
 
Another thing I do not understand is why GPU manufacturers move only now to FP16 support after more than half a decade of fighting over DP benchmarks. I do not believe that is just a hardware design choice (mining anyone? xD)
For graphics, it seems to be a mobile/power thing. For compute, the only case I've heard is neural networks, but that seems to be a really big deal.
 
I'm not sure if that's also what your post implied, silent_guy, but I don't think current architectures can fetch an additional half-register (16 bits) to fully populate a 32-bit register that's already half filled with 16-bit data. So you'd have to re-fetch everything, which would negate the power-saving effect of 16-bit usage. Without any hard data, I can imagine that to be able to do that (fetch and populate 16-bit portions of registers independently), your data paths and control logic would have to be quite a bit more complex.
If I'm understanding him correctly, Alessio is asking why not make everything 64-bit based and stash two 32-bit values in a single register. I think that would only make sense if you can then use those two 32-bit values at the same time; otherwise you have a way overbuilt 64-bit system that hardly ever gets used. But that would introduce quite a few constraints to optimize for.
 
For graphics, it seems to be a mobile/power thing. For compute, the only case I've heard is neural networks, but that seems to be a really big deal.
I know it was mostly related to mobile before; in fact, D3D 11.1 and Windows 8.0 introduced HLSL minimum precision (FP16/FP10) mostly for Windows Phone: https://msdn.microsoft.com/en-us/library/hh404562.aspx#Min_Precision.
I remember that a decade ago a lot of people were excited about FP64 support (even though official DP support only came with SM 5.0). Now a lot of people are excited about FP16 support on PC GPUs...
Anyway, games are still tied to FP32 (single precision) only, and personally the biggest upcoming innovations I see are HBMv2 (since HBMv1 looks like an epic fail, judging by AMD's Fury) and NVLink (or something equivalent; I don't know if PCI-E 4.0 will be as competitive, NVLink should come with higher bandwidth, and I don't know about latency).

If I'm understanding him correctly, Alessio is asking why not make everything 64-bit based and stash two 32-bit values in a single register. I think that would only make sense if you can then use those two 32-bit values at the same time; otherwise you have a way overbuilt 64-bit system that hardly ever gets used. But that would introduce quite a few constraints to optimize for.
This.
I am asking all this because, to be honest, I don't have a great GPU hardware implementation background: in my university course, hardware always played a secondary role (and was sometimes totally absent :\). So sometimes I find it difficult to understand some implementation details.
 
For graphics, it seems to be a mobile/power thing.

Like these which show a dramatic image quality reduction?

ALU precision
The precision of a GPU’s ALUs also has an impact on image quality, along with the precision you store the final results in. The biggest distinction between today’s desktop and mobile GPUs is that on mobile you’re much more likely to find very low-power (but also lower precision) ALUs.

Developers are able to influence which ALUs are used by the GPU when running their shaders, forcing them to consider ALU precision as another possible factor in image quality.

YOUi Labs offers an Android app which tests the floating point accuracy of the GPU inside your device; here are some very noticeably different results obtained on two competing GPUs:


[Image: YOUi Labs shader precision test output, FP32 vs FP16: http://www.alexvoica.com/wp-content/uploads/2015/07/PowerVR_SGX_RK3168_YouI_Labs_FP32-vs-FP16.jpg]

Another thing I do not understand is why GPU manufacturers move only now to FP16 support after more than half a decade of fighting over DP benchmarks. I do not believe that is just a hardware design choice (mining anyone? xD)

"Only now"?! I do not understand why it should exist in the first place...
 
Like these which show a dramatic image quality reduction?
"Only now"?! I do not understand why it should exist in the first place...
Those are of course specifically crafted to show differences between fp16 and fp32. No one is arguing that everything should be forced to fp16, but if you've got some algorithm you know would give an "exact enough" result with fp16, why burn 80% (*) more power or so by doing it with fp32 (and in today's environments, that translates directly into either lower battery life or 40% lower performance)?
I think the decision to make everything fp32 made sense at the time; at least for desktop usage, the performance of a chip was not really limited by power, so trading fp16 support for a smaller, lower-complexity chip was a good call.

(*) Of course I totally pulled the percentage number out of my ass but you get the idea...
 
I think a select group of people is still very excited about FP64 support for compute. But FP64, and sometimes FP32, is wasted on neural networks, and that seems to be where all the money is going right now (for good reason, it seems).
With FP16, you can double the number of coefficients you can store in memory, you can add more MAC units for less area, and I assume you also effectively double the number of coefficients you can transmit over NVLink, etc.
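To put rough numbers on the memory side (a purely illustrative network size, not something from the posts or the linked article):
\[
10^{8}\ \text{coefficients} \times 4\ \text{B (FP32)} = 400\ \text{MB}
\quad\text{vs}\quad
10^{8}\ \text{coefficients} \times 2\ \text{B (FP16)} = 200\ \text{MB},
\]
so the same memory capacity and the same link bandwidth carry twice as many coefficients.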
IMO GPU compute has been a solution in search of a problem ever since it was introduced. Now that they have found that problem, it's only logical that they capitalize on it.

Article on low precision neural nets: http://petewarden.com/2015/05/23/why-are-eight-bits-enough-for-deep-neural-networks/
 
Those are of course specifically crafted to show differences between fp16 and fp32. No one is arguing that everything should be forced to fp16, but if you've got some algorithm you know would give an "exact enough" result with fp16, why burn 80% (*) more power or so by doing it with fp32 (and in today's environments, that translates directly into either lower battery life or 40% lower performance)?
These points bear repeating. Pathological tests have limited real world relevance.
Using higher precision than is visually motivated is pure waste. Using lower precision either saves power, increases performance, or both.
We had a thread about the reasons for fp16 recently. I recommend reading it.
 
Another thing I do not understand is why GPU manufacturers move only now to FP16 support after more than half a decade of fighting over DP benchmarks. I do not believe that is just a hardware design choice (mining anyone? xD)
Mining is generally cryptographic hashing, and AMD's initial GPU popularity for things like Litecoin came from its better integer handling.
16-bit precision versus 24 versus 32 was something that was fought over on the way up to modern GPUs.

Double-precision floating point is still an important metric for certain workloads, and will remain so for a number of lucrative fields if they don't get swamped by Xeons.
There's money now in FP16, so there's FP16.

Solutions like Nvidia's dual FP16 functionality probably incur a modest hit to area and complexity, and they make software's job a bit harder.
If it were a matter of complicating the ALUs or waiting for the next node shrink that improved everything by a factor of 2 and bandwidth scaled appropriately, it probably wouldn't have been needed.
However, the physical node stagnation and the much lower tolerance for power consumption in mobile means the implementations need to start making changes that compromise the uniformity and generality of the hardware to make up for the fact that their silicon isn't helping them along.

ALUs themselves are a small component of overall hardware pipelines, and there are elements like the multiplier and routing hardware that scale worse than linearly. The data paths for those 32-bit or 16-bit values are a big factor in power and area. Nvidia may have found that with a modest hit to complexity in the ALUs and the logic needed to stitch them together, that there was generally more slack to play with in that operand size relative to the minimum area and timing imposed by the existing data paths.

There's plenty of other work besides FP16 that can make use of 32-bit values, so actually duplicating that much hardware seems to have not been cost-effective with current constraints.
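For a concrete feel of why packed FP16 "makes software's job a bit harder", here is a minimal CUDA sketch of half2 math (my own illustration, not something from the posts above; it assumes an sm_53+ part and a toolkit recent enough that the cuda_fp16.h conversion helpers are host-callable):

Code:
// Minimal packed-FP16 AXPY: y = a*x + y, two 16-bit values per 32-bit register.
// Compile with something like: nvcc -arch=sm_53 half2_axpy.cu   (sm_53 or newer)
#include <cuda_fp16.h>
#include <cstdio>

__global__ void axpy_half2(int n2, float a, const __half2* x, __half2* y)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n2) {
        __half2 a2 = __float2half2_rn(a);   // broadcast the scalar into both 16-bit lanes
        y[i] = __hfma2(a2, x[i], y[i]);     // one instruction, two FP16 fused multiply-adds
    }
}

int main()
{
    const int n = 1024;                     // number of FP16 elements (even)
    __half *x, *y;                          // laid out as plain 16-bit halves on the host...
    cudaMallocManaged(&x, n * sizeof(__half));
    cudaMallocManaged(&y, n * sizeof(__half));
    for (int i = 0; i < n; ++i) {
        x[i] = __float2half(1.5f);
        y[i] = __float2half(0.25f);
    }
    // ...and reinterpreted as packed pairs on the device.
    axpy_half2<<<(n / 2 + 255) / 256, 256>>>(n / 2, 2.0f,
        reinterpret_cast<const __half2*>(x), reinterpret_cast<__half2*>(y));
    cudaDeviceSynchronize();
    printf("y[0] = %f (expected 3.25)\n", __half2float(y[0]));
    cudaFree(x);
    cudaFree(y);
    return 0;
}

The catch is visible in the reinterpret_cast: the programmer has to arrange the data as pairs so that both 16-bit lanes of each 32-bit register are doing useful work.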
 
However, the physical node stagnation and the much lower tolerance for power consumption in mobile means the implementations need to start making changes that compromise the uniformity and generality of the hardware to make up for the fact that their silicon isn't helping them along.
Not only there, but massively larger scale installations are becoming more and more TCO constrained, which in the long term boils down to power (density, consumption, cooling, electricity bill).
 
If I'm understanding him correctly, Alessio is asking why not make everything 64-bit based and stash two 32-bit values in a single register. I think that would only make sense if you can then use those two 32-bit values at the same time; otherwise you have a way overbuilt 64-bit system that hardly ever gets used. But that would introduce quite a few constraints to optimize for.
In a full-rate FP16 case, you would need four 16-bit values to fill the same-size register, and since this is something that will happen in Pascal, that can't be too difficult to accomplish. So logic dictates that finding just two 32-bit values to work with would be only half as difficult as finding four... ;)

But FP64, and sometimes FP32, is wasted on neural networks, and that seems to be where all the money is going right now (for good reason, it seems).
"Judgment Day is inevitable."
 
Mining is generally cryptographic hashing, and AMD's initial GPU popularity for things like Litecoin came from its better integer handling.
16-bit precision versus 24 versus 32 was something that was fought over on the way up to modern GPUs.
Bitcoin mining on GPUs started with CUDA back in 2010 but it was quickly realised that the 32-bit integer operations on AMD were way, way faster.

Litecoin wasn't even invented then.

And, for extra irony, Litecoin was invented with the naive intention that the memory usage would prevent GPU acceleration, making it a "fair" shitcoin that anyone with a CPU could mine.
 
Another thing I do not understand is why GPU manufacturers move only now to FP16 support after more than half a decade of fighting over DP benchmarks. I do not believe that is just a hardware design choice (mining anyone? xD)
GCN is using 4 way round robin scheduling. It issues one instruction from a wave every 4th cycle, switching between four waves. This reduces the perceived latency (all common instruction results are immediately usable in the next instruction slot, simplifying the scheduling a lot), but it also means that IPC per thread is quite low. GCN needs a huge number of threads in flight simultaneously to utilize the hardware, and each thread needs to reserve enough registers for the maximum register usage in the shader code (even if the most expensive branch was not taken). This is starting to become a problem.

There are 64 KB of registers per 16-wide SIMD, and four SIMDs per CU (256 KB of registers per CU in total). Fiji has 64 CUs. In total, the Fiji chip holds 16 megabytes of registers! These massive register files take a big part of the die area, and consume a big part of the total power. The sizes of the L1 caches (64 x 16 KB = 1 MB total) and the L2 cache (2 MB) are tiny compared to the register files.

People have been optimizing the memory bandwidth (8/16 bit normalized textures/buffers, DXT compression, bit packing data) for ages. Memory optimization has been more important than ALU optimization for a long time. However, there has not been a good way to optimize the register file storage. 16 bit integer and 16 bit float types allow the programmers to optimize the register file storage. This allows the GPU to run more complex kernels and/or run more threads concurrently, as the register file size is a hard limit for the concurrency. Without enough concurrency, the GPU cannot hide memory latency, and it stalls often. The GPU doesn't have OoO machinery to hide the memory stalls, meaning that the stalls hurt a lot (several hundred cycles lost * 64-wide CU = more than 10k FLOPs).
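As a back-of-the-envelope sketch of that hard limit (assuming the 64 KB register file per SIMD corresponds to 256 wave-wide VGPRs, GCN's 10-wave cap, and ignoring allocation granularity):
\[
\text{waves per SIMD} = \min\!\left(10,\ \left\lfloor \frac{256}{\text{VGPRs per thread}} \right\rfloor\right)
\]
For example, a shader needing 84 VGPRs gets only 3 waves per SIMD; if packing pairs of 16-bit values squeezes it into 48 VGPRs, it gets 5 waves, i.e. roughly 1.7x the latency-hiding capacity from the same register file.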

Why now? The old VLIW hardware wasn't as register hungry as the new scalar architectures. The biggest gain from 16 bit registers was the ALU savings (not the register savings). The situation has changed, since GPUs have become wider and wider. Mobile is also a good reason to be power efficient. However, tightly power-optimized products designed first for mobile, such as NVIDIA Maxwell, seem to be a great fit for desktops as well. Every computational device seems to be power constrained nowadays.

Software support for 16 bit types was missing in DirectX 10 and in DirectX 11. DirectX 11.2 brought back support for 16 bit float types and added support for 16 bit integer types (min16float, min16int and min16uint). Now developers can actually use these types again, meaning that hardware support is meaningful.

Some people are scared about the image quality drop, because they still remember the old NVIDIA driver "optimizations" for Geforce FX that downgraded the shader math to 16 bit float. ATI had driver "optimizations" to downgrade the anisotropic filtering quality. The modern 16 bit integer and float support is fully developer controlled. It will be mostly the developers doing the optimizations this time. I don't think people should worry about banding and other issues, as long as developers use these features properly.

When used properly, it is impossible to notice the difference between 16 bit and 32 bit types. The final display output is only 8 bits per channel (256 values), and 16 bit float provides 10+1 mantissa bits (2048 values) for each stop (32 stops of brightness). Most color calculations can be done in 16 bit floats just fine. Also it is important to notice that not all calculations have any effect on the pixel colors. For example, an object viewport/occlusion culling or a light culling shader could use 16 bit floats (with conservative rounding). 16 bit integers obviously bring zero loss of data, as long as the integer is guaranteed to be less than 65536 at all times. This is true for most integer math inside a shader.
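A quick way to convince yourself of the 8-bit output argument: every 8-bit UNORM value n/255 survives a round trip through FP16 unchanged. A small host-only check (again my own sketch, assuming a CUDA toolkit where the cuda_fp16.h conversion helpers are host-callable):

Code:
// Check that every 8-bit UNORM colour value (n/255) survives an FP32 -> FP16 -> FP32
// round trip, i.e. FP16 has precision to spare for an 8-bit-per-channel display output.
#include <cuda_fp16.h>
#include <cstdio>

int main()
{
    int mismatches = 0;
    for (int n = 0; n < 256; ++n) {
        float c = n / 255.0f;                         // 8-bit UNORM value as float
        float back = __half2float(__float2half(c));   // round trip through a 16-bit float
        if ((int)(back * 255.0f + 0.5f) != n)         // re-quantize to 8 bits
            ++mismatches;
    }
    printf("mismatches: %d of 256\n", mismatches);    // prints: mismatches: 0 of 256
    return 0;
}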
 
GCN is using 4 way round robin scheduling. It issues one instruction from a wave every 4th cycle, switching between four waves. This reduces the perceived latency (all common instruction results are immediately usable in the next instruction slot, simplifying the scheduling a lot), but it also means that IPC per thread is quite low. GCN needs a huge number of threads in flight simultaneously to utilize the hardware, and each thread needs to reserve enough registers for the maximum register usage in the shader code (even if the most expensive branch was not taken). This is starting to become a problem.
That is true, but GCN's heavier reliance on concurrency and its occupancy concern has been around for years now, and GCN is the laggard in this regard.
Mobile IPs have had lower precisions--or never lost them, Nvidia's mobile and now desktop IPs have brought back half precision, and Intel's graphics have even had 8-bit packed formats within 32-bit words for media processing--for years. (edit: last point missed a set of packed GCN ops)

Perhaps more consistent is whether a design has pretensions for GPGPU, where the need for at least a 32-bit word as the first-class granularity became most important, and then whether there was a desire for an upgrade to high-throughput 64-bit. Giving up on those pretensions for a particular implementation has made the 16-bit decision simpler at 28nm.
I am uncertain at this point that this would have been taken as seriously if we weren't 2-3 years into a node transition that didn't happen.
The occupancy problem could have been halved if they could have just hopped to a 2x denser node with better power consumption, and I am starting to think GCN in particular was counting on living more of its lifespan at 20nm or below.

Slipping 16-bit capability (or lower) into the physical and timing gaps created by needing 32-bit paths and bookkeeping for operations that need the extra bit space is an incremental expense to hardware, even if things like GCN's SDWA introduces a shade of the VLIW access days, absent bank conflicts.

Every computational device seems to be power constrained nowadays.
Yesterdays, too.
 
GCN is using 4 way round robin scheduling. It issues one instruction from a wave every 4th cycle, switching between four waves.
No it doesn't. A single hardware thread can issue to the VALU/SALU for as long as the code that fits into instruction cache can run, until it hits a latency event (read/write memory, constant buffer, LDS etc.).

That's why you can get high throughput with only 2 or 3 hardware threads per VALU, if arithmetic intensity is very high.

This reduces the perceived latency (all common instruction results are immediately usable in the next instruction slot, simplifying the scheduling a lot),
GCN can do that because the math pipeline is very short. It's no different in this regard than the VLIW chips.

but it also means that IPC per thread is quite low. GCN needs a huge number of threads in flight simultaneously to utilize the hardware, and each thread needs to reserve enough registers for the maximum register usage in the shader code (even if the most expensive branch was not taken). This is starting to become a problem.

There are 64 KB of registers per 16-wide SIMD, and four SIMDs per CU (256 KB of registers per CU in total). Fiji has 64 CUs. In total, the Fiji chip holds 16 megabytes of registers! These massive register files take a big part of the die area, and consume a big part of the total power. The sizes of the L1 caches (64 x 16 KB = 1 MB total) and the L2 cache (2 MB) are tiny compared to the register files.
I think the more illuminating number is the peak register file bandwidth: 64 CUs with 64 lanes each using 16 bytes at 1GHz = 64TB/s.
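Spelling that out (my reading is that the 16 bytes per lane stands for three 32-bit source operands plus one destination per FMA):
\[
64\ \text{CUs} \times 64\ \text{lanes} \times 16\ \text{B} \times 10^{9}\ \text{clk/s} = 65{,}536\ \text{GB/s} \approx 64\ \text{TB/s},
\]
which is over a hundred times Fiji's 512 GB/s of HBM bandwidth.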

People have been optimizing the memory bandwidth (8/16 bit normalized textures/buffers, DXT compression, bit packing data) for ages. Memory optimization has been more important than ALU optimization for a long time. However, there has not been a good way to optimize the register file storage. 16 bit integer and 16 bit float types allow the programmers to optimize the register file storage. This allows the GPU to run more complex kernels and/or run more threads concurrently, as the register file size is a hard limit for the concurrency. Without enough concurrency, the GPU cannot hide memory latency, and it stalls often. The GPU doesn't have OoO machinery to hide the memory stalls, meaning that the stalls hurt a lot (several hundred cycles lost * 64-wide CU = more than 10k FLOPs).
NVidia abandoned OoO machinery. It just isn't worth the power and die area. NVidia relies upon static compilation.
 
That is true, but GCN's heavier reliance on concurrency and its occupancy concern has been around for years now, and GCN is the laggard in this regard.
Mobile IPs have had lower precisions--or never lost them, Nvidia's mobile and now desktop IPs have brought back half precision, and Intel's graphics have even had 8-bit packed formats within 32-bit words for media processing--for years.
Yes, AMD's only been doing 8-bit packed media instructions since late VLIW, such a laggard.
 