f32tof16 + f16tof32 could of course be used inside the shader to store packed data manually in integer registers. You'd pay a few extra instructions per light source to convert the accumulators back to fp32, do the fp32 math, and then convert back to fp16. This would give you some GPR gains but would also increase the ALU count (instead of decreasing it). On modern GPUs you are seldom purely register, ALU, or bandwidth bound. This kind of trick might be beneficial in this particular case (light accumulation variables), but if you need the fp16 variable more often, the back-and-forth conversion costs become a problem (without native fp16 ALUs you need to convert a lot more).
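A minimal sketch of what that packed accumulator could look like in HLSL; the loop structure and LightContribution are hypothetical stand-ins:

```hlsl
// Sketch: keep the RGB light accumulator packed as fp16 halves in two
// integer registers instead of three fp32 registers.
uint accRG = 0; // R in the low 16 bits, G in the high 16 bits
uint accB  = 0; // B in the low 16 bits

for (uint i = 0; i < lightCount; ++i)
{
    float3 c = LightContribution(i); // hypothetical per-light evaluation

    // Unpack to fp32, accumulate, repack to fp16. This is the extra
    // per-light ALU cost mentioned above.
    float r = f16tof32(accRG)       + c.r;
    float g = f16tof32(accRG >> 16) + c.g;
    float b = f16tof32(accB)        + c.b;

    accRG = f32tof16(r) | (f32tof16(g) << 16);
    accB  = f32tof16(b);
}

float3 result = float3(f16tof32(accRG), f16tof32(accRG >> 16), f16tof32(accB));
```

Three fp32 accumulators shrink to two integer registers, at the cost of a handful of conversion and shift/or instructions per light.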
We have used these kinds of tricks to conserve LDS space, but not to conserve registers. Packing two 16-bit integers together is also quite nice on GCN, since it has (full rate) combined mask+shift instructions.
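For the LDS case, the packing itself is just a mask and a shift; the groupshared layout here is a made-up example:

```hlsl
// Sketch: two 16-bit values per 32-bit LDS slot, halving LDS usage.
groupshared uint packedDepth[128]; // 256 logical 16-bit values

void StoreDepthPair(uint slot, uint d0, uint d1)
{
    // Mask + shift: the combined op GCN can execute at full rate.
    packedDepth[slot] = (d0 & 0xFFFF) | (d1 << 16);
}

uint2 LoadDepthPair(uint slot)
{
    uint v = packedDepth[slot];
    return uint2(v & 0xFFFF, v >> 16);
}
```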
In this case I'm thinking of packing three 10-bit integers per register to save four registers overall. But anyway, if your register allocation is at 110, then saving four registers isn't going to magically get you down to 84, which is what you need to gain a whole extra hardware thread on the ALU.
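The 10-bit variant could look something like this (the field layout is illustrative, not from the original post):

```hlsl
// Sketch: three 10-bit integers in one 32-bit register instead of three.
uint Pack10(uint3 v)
{
    return (v.x & 0x3FF) | ((v.y & 0x3FF) << 10) | ((v.z & 0x3FF) << 20);
}

uint3 Unpack10(uint p)
{
    return uint3(p & 0x3FF, (p >> 10) & 0x3FF, (p >> 20) & 0x3FF);
}
```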
The X360 shader compiler had an [isolate] tag to force it to compile blocks in a vacuum (forcing register lifetimes to stay inside the block). You could also use it on variables to force the compiler to recalculate them (or reload from constant memory / L1). I kind of miss this.
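From memory, usage looked roughly like this; treat the exact block-attribute syntax as an assumption, and the fog code is purely illustrative:

```hlsl
// Sketch: the attribute told the X360 compiler to schedule this block in
// isolation, so its temporaries didn't stretch register lifetimes across
// the rest of the shader.
[isolate]
{
    float3 fogParams = ComputeFog(worldPos); // hypothetical helper
    color.rgb = lerp(color.rgb, fogColor, fogParams.z);
}
```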
Oh man, I would love that.
However, the GCN microcode is very nice and clean, so I could see myself hand-writing some performance-critical shaders if the compiler doesn't cooperate. In the old DX9 (SM 2.0) era, I wrote all the shaders by hand in DX assembly. As shaders were limited to 64 instruction slots, this was necessary to keep the slot count down and fit our lighting shaders on the GPU (we were among the first developers to ship a game with deferred rendering).
I have not-so-fond memories of debugging complex memory addressing in IL by writing values to a buffer instead of the kernel's output. On PC, assembly is still subject to the whims of the driver, so the only real solution seems to be writing/patching the ELF binary directly. Are there whims to deal with on console if you write assembly?
Loops in general are a good way to force the compiler to keep variable lifetimes inside a region.
Yes, sometimes it even makes sense to have two or more loops in sequence doing exactly the same work, but on different data: e.g. three loops, one for each of red, green, and blue.
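A sketch of that per-channel split, assuming a hypothetical per-light function EvaluateLight:

```hlsl
// Sketch: one loop per channel. Each loop keeps only one fp32 accumulator
// (plus its own temporaries) live at a time, instead of all three at once.
float r = 0.0;
[loop] for (uint i = 0; i < lightCount; ++i)
    r += EvaluateLight(i).r; // hypothetical per-light evaluation

float g = 0.0;
[loop] for (uint j = 0; j < lightCount; ++j)
    g += EvaluateLight(j).g;

float b = 0.0;
[loop] for (uint k = 0; k < lightCount; ++k)
    b += EvaluateLight(k).b;

float3 result = float3(r, g, b);
```

The obvious trade is that EvaluateLight runs three times per light, so this only wins when registers, not ALU, are the bottleneck.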
Barriers can also be used to prevent the compiler from moving data loads (= GPR allocations) over them.
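In a compute shader that looks roughly like this (resource names and DoHeavyWork are placeholders, and how strictly a given compiler treats the barrier as a scheduling fence varies):

```hlsl
StructuredBuffer<float4> inputA;
StructuredBuffer<float4> inputB;
RWStructuredBuffer<float4> output;

[numthreads(64, 1, 1)]
void main(uint tid : SV_DispatchThreadID)
{
    float4 early = inputA[tid];
    float4 acc   = DoHeavyWork(early); // hypothetical register-hungry phase

    // Intended to stop the compiler from hoisting the next load above
    // this point and holding 'late' in GPRs during the first phase.
    GroupMemoryBarrierWithGroupSync();

    float4 late = inputB[tid]; // stays below the barrier
    output[tid] = acc + late;
}
```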
Neither those nor mem_fence work well for this in OpenCL (nested ifs are much better).
Other techniques I forgot to mention last night:
- reduction - if a loop's iteration count is fixed and can be a power of two, use multiple work items to each compute a distinct loop iteration, and reduce at the end (see the sketches after this list)
- read-ahead - at the cost of some GPRs (e.g. if you have 100+ and have no chance of getting down to 84), read the data at the start of the loop that will be consumed on the following iteration. Works very nicely as long as there's substantial work in the loop (also sketched below)
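A sketch of the reduction idea as an HLSL compute shader, assuming a fixed 64-iteration loop and a hypothetical ComputeIteration body:

```hlsl
// Sketch: 64 work items each run one former loop iteration, then
// tree-reduce the partial results in LDS.
groupshared float partial[64];

[numthreads(64, 1, 1)]
void Reduce(uint gi : SV_GroupIndex)
{
    partial[gi] = ComputeIteration(gi); // hypothetical former loop body
    GroupMemoryBarrierWithGroupSync();

    // Power-of-two tree reduction: 64 -> 32 -> ... -> 1.
    for (uint stride = 32; stride > 0; stride >>= 1)
    {
        if (gi < stride)
            partial[gi] += partial[gi + stride];
        GroupMemoryBarrierWithGroupSync();
    }
    // partial[0] now holds the full result.
}
```

And the read-ahead (software pipelining) idea, with a placeholder buffer and a hypothetical ShadeLight doing the substantial per-iteration work:

```hlsl
// Sketch: fetch iteration i+1's data while iteration i's math is in
// flight. Costs one extra float4 of GPRs, hides the load latency.
StructuredBuffer<float4> lights; // placeholder

float3 AccumulateLights(uint lightCount)
{
    float3 acc  = 0;
    float4 next = lights[0];

    for (uint i = 0; i < lightCount; ++i)
    {
        float4 cur = next;
        if (i + 1 < lightCount)
            next = lights[i + 1]; // issued early for the next iteration
        acc += ShadeLight(cur);   // hypothetical substantial work
    }
    return acc;
}
```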
It's a good step in the right direction. On-chip communication between different shaders cooperating with each other in a fine-grained manner would be the final goal. Hopefully someday we get a flexible system that allows us to define our own shader stages and the communication between them (in a way that allows the GPU to load balance well between the compute units).
To catch up with what could have been possible with Larrabee years ago.
I didn't like the DX11 tessellation design, since it added two additional predefined shader stages (PS+VS+GS+CS was already quite a lot). This clearly pointed to the need for user-defined stages and communication. I would have preferred something more flexible instead of lots of API bulk designed around a single feature. If predefined stages don't go away, we will soon have 10+ different stages with slightly different semantics (and a HUGE, bloated API).
I can't remember seeing an alternative tessellation API that was demonstrably as capable yet cleaner. Anyway, I'm not sure yet more pipeline stages are likely.
There could be alternative pipeline styles (e.g. "ray trace").
What we need is the ability to lock L2 cache lines (or call them last-level cache lines if you want to be generic) for on-chip pipe buffering. GPUs have loads of L2. NVIDIA is effectively doing this already as part of load-balancing tessellation.