AMD: Speculation, Rumors, and Discussion (Archive)

If they hit 2x perf/watt of Fiji, they will be fine. They don't really need a revolutionary architecture, just GCN 1.4 or something with more power-saving tools and maybe more cache on 14/16nm.

If they could somehow hit 2x perf/watt of the Nano, then it would be pretty insane.
 
Hmm, no, architecturally something has to change. NV has the same advantage as AMD going to 16/14 nm, so you can't just base performance per watt on the node change.

Right, and they'd need double-rate FP16 to be competitive in deep learning applications and future DX12 games designed with FP16, plus support for Feature Level 12.1.

Beyond this, power-efficiency, power-efficiency, and then some more power-efficiency.
 
If they hit 2x perf/watt of Fiji, they will be fine. They don't really need a revolutionary architecture, just GCN 1.4 or something with more power-saving tools and maybe more cache on 14/16nm.
I don't think the Nano is the result of binning. Much more likely that they've done a serious voltage reduction and thus clock reduction as well. That's trading off perf/mm2 for perf/W. If they want to be at par again, they need both.
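A rough back-of-envelope on why that trade works (my own illustrative numbers, assuming dynamic power scales roughly with f·V² and ignoring leakage): drop voltage by ~10% and clocks by ~10% and dynamic power falls to about 0.9 × 0.9² ≈ 0.73 of the original, while throughput only drops ~10%, so perf/W improves by roughly 0.9 / 0.73 ≈ 1.23x. The die is the same size, though, which is exactly the perf/mm² hit.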
 
Right, and they'd need double-rate FP16 to be competitive in deep learning applications and future DX12 games designed with FP16, plus support for Feature Level 12.1.
At the moment NVIDIA are trailing AMD with high performance FP16 support on consumer or professional PC graphics...
 
At the moment NVIDIA are trailing AMD with high performance FP16 support on consumer or professional PC graphics...

As far as I know (and please correct me if I'm wrong), Tonga and Fiji can store FP16 values in 16-bit registers, but they can't compute twice as many operations on FP16 operands as they can on FP32 ones, i.e. peak throughput is the same in both cases, although the lower register pressure with FP16 can lead to higher effective performance.

Meanwhile, Tegra X1 has double-rate FP16 support (wherever that chip may be) and more importantly, NVIDIA will bring that to Pascal. So it seems to me that AMD ought to have it too.
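To put a rough number on the register-pressure point (GCN-style figures from memory, purely illustrative): with 256 VGPRs available per SIMD, a shader that needs 120 registers per thread fits only 2 waves in flight, while one that packs enough values as FP16 pairs to get down to 80 registers fits 3, so there's more latency hiding even though the peak ALU rate is unchanged.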
 
Half-precision (FP16) and lower-precision (FP10) support is completely USELESS for PC gaming; it is meant to be used on mobile to reduce power and improve battery life. DirectX graphics also do NOT require true FP16/FP10 support, i.e. an FP32 computation can simply be truncated to produce an FP16 value. FP16/FP10 can be useful on PC for development purposes only. Current PC GPUs with FP16 support should just do a simple FP32 computation and then truncate the mantissa.
 
This is probably wrong. Even if the ALU throughput is the same, you still have less register pressure and therefore a higher compute unit occupancy.

But one could also imagine higher throughput with FP16 in some future GPUs, even on the desktop. Some circuitry scales non-linearly with operand bit width, so FP16 units take even less than half the space of an FP32 unit.
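A quick sketch of that non-linear scaling (back-of-envelope, ignoring the exponent logic, rounding and the adder): the multiplier array grows roughly with the square of the significand width, and FP16 has an 11-bit significand (10 stored bits plus the hidden bit) versus 24 bits for FP32, so the multiplier alone lands around (11/24)² ≈ 0.21 of the FP32 area, i.e. well under half.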

I don't want to go into details, but for the products that I worked on, we would have been happy to invest some time for FP16, if that meant less register usage. And even more so if it would have meant higher throughput. That would have been a really low hanging fruit compared to some other optimizations.
 
Half the space with half the precision. Just because you do not usually consider all 23 bits of the FP32 mantissa, it doesn't mean those bits are useless, since they help produce more precise results in computations. With true FP16 hardware support, a noticeable precision error can occur more frequently.
Personally, I would be more interested to see which hardware still does not support FP64 on D3D12, since with the last driver I installed even my Surface Pro 3 now supports double-precision floats... Good FP64 support on IGPs could be ideal for GPGPU physics acceleration...
It would also require a complete GPU redesign from scratch. Another interesting format to support is integer; some GPUs still use the FP32 and FP64 units to handle 24-bit and 32-bit integers...
 
How would true FP16 support impact the rest of the GPU design? I mean, how would FP16 hardware support impact FP32, FP64 and integer performance? Also, current consumer GPUs share something like 99% (maybe a little less?) of the workstation and server GPU design, and I don't know how useful FP16 would be on those GPUs.
Finally, I still dream of 10-bit-per-colour-channel screens on consumer monitors (i.e. 10-bit-per-channel textures, i.e. no more r8g8b8a8 on render targets..), but that's still a dream.. ): Too many panels are still 6-bit + dithering instead of "true" 8-bit, and widespread 10-bit panels are still far away.
 
How would true FP16 support impact the rest of the GPU design? I mean, how would FP16 hardware support impact FP32, FP64 and integer performance? Also, current consumer GPUs share something like 99% (maybe a little less?) of the workstation and server GPU design, and I don't know how useful FP16 would be on those GPUs.
At least the way nvidia seems to implement it, there's very little impact on the rest of the GPU design. All the registers used are still 32-bit, they just hold 2x fp16 values, so outside the ALUs (which process these 2 values simd-style) there are no changes needed at all. I don't know how useful fp16 would be in other markets, but presumably it doesn't cost all that much area, so it shouldn't be a big deal.
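A minimal CUDA sketch of that packed layout, just to make it concrete (assuming the half2 intrinsics from cuda_fp16.h on an sm_53+ part like Tegra X1; whether Pascal exposes exactly the same thing is still speculation):

Code:
#include <cuda_fp16.h>

// Each 32-bit register holds a pair of FP16 values; a single __hmul2
// multiplies both halves SIMD-style, so the register file, scheduler
// and load/store paths can stay plain 32-bit.
__global__ void scale_fp16x2(const __half2 *in, __half2 *out, float scale, int n2)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n2) {                               // n2 = number of half2 pairs
        __half2 s = __float2half2_rn(scale);    // broadcast the scale into both 16-bit lanes
        out[i] = __hmul2(in[i], s);             // two FP16 multiplies in one instruction
    }
}

The only real caveat is that the data has to be laid out as half2 pairs up front; the launch itself looks like any other kernel.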
 
At least the way nvidia seems to implement it, there's very little impact on the rest of the GPU design. All the registers used are still 32-bit, they just hold 2x fp16 values, so outside the ALUs (which process these 2 values simd-style) there are no changes needed at all. I don't know how useful fp16 would be in other markets, but presumably it doesn't cost all that much area, so it shouldn't be a big deal.
So, the SIMD vector units will still be 32-bit, used as 2x16 instead of "true" 16-bit hardware/register units? Why not apply this design to FP64 too, i.e. 64-bit registers used to hold 1x64-bit, 2x32-bit or 4x16-bit values? (probably because then they would no longer sell Tesla GPUs? DX )
 
So, the SIMD vector units will still be 32-bit, used as 2x16 instead of "true" 16-bit hardware/register units? Why not apply this design to FP64 too, i.e. 64-bit registers used to hold 1x64-bit, 2x32-bit or 4x16-bit values? (probably because then they would no longer sell Tesla GPUs? DX )
Because 32-bit is what you mostly want; you don't want to design your architecture around FP64, which you'll hardly ever need (especially for consumer gpus). And there's a lot of value in having a scalar design at that level - you just take the inefficiencies this "mild" 2x simd gets you for fp16, precisely because it's still the same 32-bit-reg-based architecture. That should get you some/most of the benefits a "pure" fp16 scalar design would, without having to invest heavily in more complex register fetch/store, twice the instruction issue throughput, etc., which would be totally wasted when not using fp16 (note though I don't actually know if Pascal implements it the same way, but I'd assume so).
 
So, the SIMD vector units will still be 32-bit, used as 2x16 instead of "true" 16-bit hardware/register units? Why not apply this design to FP64 too, i.e. 64-bit registers used to hold 1x64-bit, 2x32-bit or 4x16-bit values? (probably because then they would no longer sell Tesla GPUs? DX )

As far as I understand, this is pretty much how Hawaii works, which is fine for a dual-purpose GPU (FirePro/Radeon) but wasteful for a purely gaming-oriented design such as Fiji.
 
Why not apply this design to FP64 too, i.e. 64-bit registers used to hold 1x64-bit, 2x32-bit or 4x16-bit values? (probably because then they would no longer sell Tesla GPUs? DX )
Probably because you lose quite a bit of flexibility in terms of register usage: you can only achieve 2 ops in parallel if both halves of the double-wide register are used. If not, half of the data that you fetched goes to waste. Your compiler would need to be much better at scheduling things exactly right.
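A tiny CUDA-flavoured illustration of that constraint (my own example, nothing vendor-specific): the doubled rate only shows up when values are actually paired up in one 32-bit register; a lone FP16 value leaves the other lane idle.

Code:
#include <cuda_fp16.h>

// Paired: one __hfma2 issues two FP16 fused multiply-adds (sm_53+),
// so both 16-bit lanes of the 32-bit register do useful work.
__device__ __half2 axpy_pair(__half2 a, __half2 x, __half2 y)
{
    return __hfma2(a, x, y);
}

// Unpaired: still consumes a full instruction slot and a full 32-bit
// register, but only one FP16 FMA's worth of work gets done.
__device__ __half axpy_lone(__half a, __half x, __half y)
{
    return __hfma(a, x, y);
}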
 
I'm not sure if that's also what your post implied, silent_guy, but I don't think current architectures can fetch an additional half-register (16 bits) to fully populate a 32-bit register that's already half used with 16-bit data. So you'd have to re-fetch everything, which would negate the power-saving effect of 16-bit usage. Without any hard data, I can imagine that being able to do that (fetching and populating 16-bit portions of registers independently) would make your data paths and control quite a bit more complex.
 