ID buffer and DR FP16

It would be interesting to do a comparison when running both modes on Tonga, Fiji, and Polaris.
 
That would be the other interesting test. Getting the difference between 16-bit processing (less pressure on registers as I remember), 32-bit processing and 2x16-bit.
I know the results were benched with early code (benchmark and driver), but as it is a synthetic benchmark, we shouldn't expect to get more out of it in reality. But it seems the new AMD architecture needs that to come closer to the 1080 Ti.
 
So it is ~13% faster in a non-realistic benchmark.
So it is a nice to have but not a must have.
Throw in another few percent from using the ID buffer, and perhaps the 4.2 TF GPU ends up working like 5 TF part of the same architecture. That closes the gap with the competition while being out earlier and costing less - all in all a sweet design win if so.
 
Whatever tricks devs use to produce TLOU2, Death Stranding and Spiderman visuals in 4K, I'm all up for it. :) And if code from multiplatform games can be used to boost PC gaming as well, that's a double bonus.

Vega is coming late, but at least it looks like another solid offering from AMD, tech-wise.
 
Did you mean 6 TF?
 
4.2 ->6.0 is a 40% lift, he's definitely not suggesting a 40% lift.

Oh, thanks for the clarification. So 4.2TF performing like 5TF, and therefore 6TF performing like 5.2TF. That really "closes the gap with the competition while being out earlier and costing less - all in all a sweet design win", as Shifty Geezer stated. A 0.2TF difference is minuscule compared to a gargantuan 0.7TF difference.
 
What, no. He implied a lift from 4.2 to 5, which is huge already, nearly a 20% lift. That's just a number in the sky as well. Honestly, that's so massive that if two such features that are easy to implement could consistently provide that much lift, you're talking about a shift in price/performance for a whole family of GPUs. Everyone would have it.

He did not imply that 6TF would drop to 5.2. Why would it do that? Why less?
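For anyone who wants to check the percentages being thrown around here, a quick sketch of the arithmetic. The function name is made up for illustration, and the TF figures are just the ones quoted in this thread, not measurements.

```python
def lift_percent(base_tf, effective_tf):
    """Relative uplift, in percent, of effective_tf over base_tf."""
    return (effective_tf / base_tf - 1.0) * 100.0

# Figures quoted in the thread (speculative, not benchmark results).
print(f"4.2 -> 5.0 TF: {lift_percent(4.2, 5.0):.1f}% lift")
print(f"4.2 -> 6.0 TF: {lift_percent(4.2, 6.0):.1f}% lift")
```

So 4.2 to 5 works out to roughly 19%, and 4.2 to 6 to roughly 43%, which is in line with the "nearly 20%" and "40%" mentioned above.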
 
Sorry, shouldn't you expect the same pie-in-the-sky deficit (definitely some degree of performance lost) from doing it in software, opposing the pie-in-the-sky boost from doing it in hardware?
 
It doesn't work quite like that.

The 6TF figure assumes 32-bit precision; when we quote FLOPS, that is the "default" unit of calculation. Rapid packed math fits two 16-bit values into one register and performs the same operation on both, so for those operations you're running at 2x. That's a gain in performance. It won't apply to everything, since a great deal of calculations require 32-bit. But for things that don't, if the ALU is the bottleneck, then rapid packed math would speed things up.

For the ID buffer, we have no way of knowing its performance benefit over software emulation. I've not seen the numbers. The improvement from the ID buffer will differ from title to title, depending on how the developer chooses to resolve reconstruction. So once again, unsure.
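To make the "two 16-bit values in one 32-bit slot" idea concrete, here is a rough CPU-side sketch in Python. It only illustrates the data layout and the one-operation-on-both-lanes idea; it is not how the GPU executes packed math, and the helper names (`pack_half2`, `unpack_half2`, `packed_add`) are made up for this example.

```python
import struct

def pack_half2(a, b):
    """Pack two floats as fp16 into one 32-bit integer (a in the low 16 bits)."""
    return struct.unpack("<I", struct.pack("<2e", a, b))[0]

def unpack_half2(word):
    """Split a 32-bit integer back into its two fp16 lanes."""
    return struct.unpack("<2e", struct.pack("<I", word))

def packed_add(x, y):
    """Add both fp16 lanes of two packed words.

    One call stands in for a single packed instruction that does two
    additions at once; results are rounded back to fp16 when repacked.
    """
    x0, x1 = unpack_half2(x)
    y0, y1 = unpack_half2(y)
    return pack_half2(x0 + y0, x1 + y1)

a = pack_half2(1.5, 2.0)
b = pack_half2(0.25, 4.0)
print(hex(a), unpack_half2(packed_add(a, b)))
```

Each packed word occupies the same 32 bits a single fp32 value would, which is where both the 2x rate and the lower register pressure come from.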
 
Why? Two products are identical. Then one adds a 20% boost. That doesn't cause the other to have a 20% penalty. I'm simply saying the double-pumped FP16 isn't massive in itself, but has the potential in cahoots with the ID buffer, another unknown quantity, to do a good job of enabling PS4Pro to punch above its weight. Which is interesting from a hardware and software engineering perspective, if a GPU can get a significant performance boost from a couple of 'free' extras.
 
Thanks, that makes sense.
 
Oh I was thinking the "why" in your query/scenario was the second system's performance penalty from directly emulating how the first boosts 20%. Otherwise, why bother with a hardware implementation if you can just do it close enough in software. Diminishing returns. But I get what you are saying. Thanks.
 
I wouldn't say that a "great deal of calculations require 32 bit". Some calculations require 32 bit, but 16 bit is fine for lots of graphics calculations. However, if a shader requires 32 bit in some places, it must add conversion instructions on both sides (convert f16->f32, calculate at full precision, convert f32->f16). These conversions add extra ALU cost. This reduces the potential ALU gains, and can even slow down the shader if 16 bit and 32 bit calculations aren't separated well enough.
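A toy cost model may help show how conversions eat into the gain. Everything here is an assumption for illustration: fp16 ops cost half a cycle, fp32 ops one cycle, and each f16<->f32 conversion one cycle. Real instruction costs depend on the hardware and the compiler.

```python
def alu_cycles(fp16_ops, fp32_ops, conversions, convert_cost=1.0):
    """Estimated ALU cycles for a mixed-precision shader (toy model).

    Assumes double-rate fp16 (0.5 cycle per op), full-rate fp32
    (1 cycle per op) and a fixed hypothetical cost per conversion.
    """
    return fp16_ops / 2 + fp32_ops + conversions * convert_cost

# Three hypothetical 100-op shaders.
baseline = alu_cycles(fp16_ops=0, fp32_ops=100, conversions=0)
well_separated = alu_cycles(fp16_ops=80, fp32_ops=20, conversions=2)
interleaved = alu_cycles(fp16_ops=20, fp32_ops=80, conversions=40)

print(baseline, well_separated, interleaved)
```

Under these assumptions the well-separated shader drops from 100 to 62 cycles, while the interleaved one, where every fp16 op is wrapped in conversions, ends up at 130 and is slower than plain fp32.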

But the biggest thing isn't ALU related at all. People are used to calculating GPU performance by a simple FLOPS number, because manufacturers scale up all GPU units (ROPs, TMUs, bandwidth, geometry units) when they add more ALU to keep the GPU balanced. Doubling only ALU units isn't going to be a big gain, even if they could do 2x rate fp32 math with no packing, conversions or other issues. Most shaders aren't 100% ALU bound; there are many other bottlenecks that limit shader performance. If you only increase ALU, these other bottlenecks will dictate the performance.

There are obviously some shaders that are heavily ALU bound. If an ALU heavy shader doesn't require fp32 precision, you likely get nice gains by using double rate fp16 in this shader. That's it. Double rate fp16 helps some shaders, but it doesn't make everything faster. Don't blindly look at the FLOPS numbers in marketing slides.
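The bottleneck argument above can be sketched with a very simple model: assume ALU work and everything else (memory, texturing, ROPs and so on) overlap perfectly, so shader time is whichever is longer. That overlap assumption, the function names, and the time units are all hypothetical; real GPUs are messier.

```python
def shader_time(alu_time, other_time):
    """Shader time if ALU work fully overlaps with all other work."""
    return max(alu_time, other_time)

def speedup_from_double_rate(alu_time, other_time, fp16_fraction):
    """Speedup when the given fraction of ALU work runs at double rate."""
    before = shader_time(alu_time, other_time)
    new_alu = alu_time * (1 - fp16_fraction) + alu_time * fp16_fraction / 2
    after = shader_time(new_alu, other_time)
    return before / after

# Hypothetical shaders: (ALU time, time spent on other bottlenecks).
for alu, other in [(10, 2), (6, 5), (4, 8)]:
    gain = speedup_from_double_rate(alu, other, fp16_fraction=1.0)
    print(f"ALU {alu}, other {other}: {gain:.2f}x")
```

Even with every ALU op converted to fp16, only the heavily ALU-bound shader gets the full 2x. The mildly ALU-bound one gets 1.2x before the other bottleneck takes over, and the one that was never ALU bound gets nothing.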
 
It may help in a console though, that the software (in theory) targets the hardware and not vice versa as is the case on PC. You could have a PC card (the old X1950 XT) with too many math resources and they'll idle. On console, in theory, nothing needs to idle.

Of course, with 3rd parties developing for X1X, X1, X1S, PS4, PS4 Pro, PC, maybe Switch, 3rd party hardware targeting is likely pretty curtailed at this point. First party could see the fruit in theory...
 
Yes, console software is able to take advantage of the hardware strengths and bottlenecks better, because you can target a single hardware configuration. But keep in mind that base PS4 still exists and it doesn't have double rate fp16. If you start preferring extremely ALU heavy code that better suits a 2x fp16 machine, your performance could suffer on a 1x fp32 machine as a result. I wouldn't expect cross platform games to take full advantage of fp16 until all major PC IHVs support it. With Vega, the situation improves. Now both Intel and AMD support fp16 in their GPUs. If Volta brings double rate fp16 to desktop, we can expect even more fp16 support in games. This of course would also help PS4 Pro in cross platform games.
 
It will probably have more impact on next-generation consoles.
 
Definitely. But obviously there are some 1st party studios who are going to make really good use of it on PS4 Pro. The latest Frostbite presentation also showed some gains, so fp16 adoption is definitely under way. The Vega launch will help adoption, since developers can now also optimize on their development workstations and see gains immediately.
 
On HZD Checkerboarding: http://www.eurogamer.net/amp/digita...zero-dawn-the-making-of-ps4-pros-best-4k-game

"There are different ways to do checkerboarding as well," adds Giliam, who told us that they 'rolled their own' solution as opposed to using Sony's reference model. "You can have more information per pixel, or less information per pixel when rendering checkerboarding and depending on how much information you have, you can go for different checkerboard resolve techniques. We came up with one that doesn't need a lot of extra data at the per-pixel level and that gave us some performance boosts as well in the rendering of the whole geometry and the lighting pass."

Sounds like how Sebbbi would describe using his own VT system over Tiled Resources. I assume Sony's reference model includes the usage of ID Buffer.
 
Too bad they didn't ask a question about the ID buffer or FP16 and how they used them in the Pro version. I am very surprised they didn't, actually. It sounds like an odd omission and makes for an incomplete interview.
 