Xenos FP10 mode clarification

Rockster

I have read conflicting statements about the usefulness of Xenos's FP10 mode. It's my understanding that in the NV3x cores, there were two different levels of floating point precision available within the ALUs themselves (FP16 or FP32). At times, with long shaders, the accumulation errors of FP16 were noticeable. All ATI pixel shader ALUs operate internally at FP24, and the newer NV4x+ ALUs always work at FP32, but intermediate results can be rounded back to FP16 to save register space.

With Xenos, I assume all ALUs are always working at FP32 internally and that intermediate results can be saved to registers at that same level of precision. FP16 is an available input and output format, along with other non-float formats.

Is it correct to say that FP10 is simply another available output format, and is not a precision level used within the ALUs at all? That it is a coarser float format to save bandwidth between the parent and daughter dies and, more importantly, to save space in eDRAM? And that any errors incurred through its use would be a result of multipass techniques or alpha blends, both of which "should" be minimal?

For HDR, an FP16 framebuffer could/would still be used for render-to-texture operations, but developers would just target sizes that best fit the available eDRAM. Final rendering passes would then be performed and output with FP10 to reduce the tile requirement. Yes / No?
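For reference, a minimal sketch of how such a coarser float format could be packed. The 7-bit-mantissa / 3-bit-exponent channel split (in a 10-10-10-2 layout) is the commonly cited description of Xenos's FP10, but the bias, rounding and lack of denormal handling below are illustrative assumptions, not documented hardware behaviour:

```cpp
// Illustrative sketch only: pack a non-negative float into a small unsigned
// "mini-float" with a given mantissa/exponent width. Calling it with 7 and 3
// mimics the 7e3 channels usually attributed to Xenos's FP10 format.
#include <cstdint>
#include <cmath>

uint32_t encodeMiniFloat(float v, int mantissaBits, int expBits, int bias)
{
    if (v <= 0.0f) return 0;                      // no sign bit; clamp at zero
    int e;
    float m = std::frexp(v, &e);                  // v = m * 2^e, m in [0.5, 1)
    int exp = (e - 1) + bias;                     // rebias so the mantissa is in [1, 2)
    float frac = m * 2.0f - 1.0f;                 // fractional part after the implicit 1
    uint32_t q = (uint32_t)std::lround(frac * (1 << mantissaBits));
    if (q == (uint32_t)(1 << mantissaBits)) { q = 0; ++exp; }        // rounding carried over
    int maxExp = (1 << expBits) - 1;
    if (exp < 0) return 0;                                            // too small: flush to zero
    if (exp > maxExp) { exp = maxExp; q = (1u << mantissaBits) - 1; } // clamp to format max
    return ((uint32_t)exp << mantissaBits) | q;
}

// e.g. one FP10 colour channel (assumed bias of 0): encodeMiniFloat(x, 7, 3, 0)
```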
 
Thanks, I have read most of those and that's why I'm looking for clarification. Much of that seems to imply that intermediate register storage for a running shader would be FP10, which I don't think is accurate. I would think only the final result of a completed shader would be converted to FP10.
 
Rockster said:
Thanks, I have read most of those and that's why I'm looking for clarification. Much of that seems to imply that intermediate register storage for a running shader would be FP10, which I don't think is accurate. I would think only the final result of a completed shader would be converted to FP10.

It's just a rendertarget format.
Doesn't affect internal precision at all.
 
It's for HDR only.

For example, the NV40 in SM 3.0 mode runs its shaders at FP32 while HDR is rendered at FP16.

On Xenos it will be the same: FP32 for the shaders and FP10 or FP16 for HDR.

Now some say we will get artifacting much more quickly with FP10, and that may be true, but many games will work around it to keep artifacts to a minimum, and those that can't will use FP16. The benefit is less bandwidth used and thus higher fps.
 
jvd said:
The benefit is less bandwidth used and thus higher fps.

One of the devs could correct me on this, but I do not think the issue is bandwidth on Xenos. It is a matter of performance of the logic. Basically, the performance hit of FP10 is similar to 32-bit, while FP16 requires substantially more performance from the ROPs, and therefore more logic transistors on the eDRAM die, which leads to a bigger/hotter/more expensive chip.

The other issue may be tiling. 2 tiles has a 0% performance hit; 3 tiles is a 1-5% performance hit, so the latency in tiling seems to catch up some. Using FP16 with 4x MSAA @ 720p could result in 4 or 5 tiles. Assuming the best case scenario (that 2 tiles is right at the cusp, so the extra 3rd tile was only a 5% hit total), 5 tiles would be a further 10% hit on performance, assuming everything scales linearly. 15% is pretty big if you are right around 30fps; that is the difference between 25 and 30 fps.

Worst case scenarios could be worse.

But I am willing to bet we may see some developers give it a try, and we may see that some devs like 2xMSAA with FP16. That may be a nice tradeoff, but only time will tell.
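To put rough numbers on the tile counts above, here is a back-of-the-envelope sketch. It assumes colour and a 32-bit depth/stencil sample both live in the 10 MB eDRAM per MSAA sample and that tiles pack perfectly; real tile counts depend on how the screen is split into rectangular regions, so treat the results as estimates only:

```cpp
// Rough eDRAM tiling arithmetic for 720p with 4x MSAA.
#include <cstdio>
#include <cmath>

int tilesNeeded(int width, int height, int msaa, int colorBytes, int depthBytes)
{
    const double edram = 10.0 * 1024 * 1024;   // 10 MiB of eDRAM (assumed exact size)
    double bytes = double(width) * height * msaa * (colorBytes + depthBytes);
    return (int)std::ceil(bytes / edram);
}

int main()
{
    //                                        720p         MSAA colour depth
    printf("FP10/INT8 colour: %d tiles\n", tilesNeeded(1280, 720, 4, 4, 4));  // 3
    printf("FP16 colour:      %d tiles\n", tilesNeeded(1280, 720, 4, 8, 4));  // 5
}
```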
 
Acert93, I don't think that's correct. It is a bandwidth issue, in much the same way ROPs in PC GPUs are limited to writing 2 MSAA samples per clock. With 32GB/sec between the parent and daughter die, there is sufficient bandwidth to write 8 32-bit pixels per clock or 4 64-bit pixels per clock. It doesn't make much sense to upgrade the ROPs for 64-bit color without an associated increase in bandwidth. Because FP10 is still 32-bit, bandwidth and fillrate remain the same as INT8. The size of the color buffer doubles with FP16, so the tile requirement goes up as stated. However, the impact as I understand it is unknown for 2 tiles and approximately 5% more for 3 tiles. I'm sure this is extremely application dependent, however, and is a function of the number of objects that cross tile boundaries.

My question revolved around dependent calculation errors, and whether values are stored in the FP10 format between operations, or whether values maintain their higher precision while the shader runs and the FP10 format is only used as a rendertarget. The latter appears to be the case, as I suspected.
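A quick arithmetic check of those per-clock figures. The 500 MHz GPU clock, the colour-plus-32-bit-Z export size, and the padded 16-byte slot for the FP16 case are all assumptions, and compression or blend read-back traffic is ignored:

```cpp
// Sanity check of the bandwidth figures above under the stated assumptions.
#include <cstdio>

int main()
{
    const double linkBytesPerSec = 32.0e9;      // 32 GB/s parent -> daughter die
    const double clockHz         = 500.0e6;     // assumed 500 MHz GPU clock
    const double bytesPerClock   = linkBytesPerSec / clockHz;   // 64 bytes per clock

    printf("32-bit colour + 32-bit Z (8-byte export):   %.0f pixels/clock\n", bytesPerClock / 8);
    printf("64-bit colour, assumed 16-byte export slot: %.0f pixels/clock\n", bytesPerClock / 16);
}
```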
 
Sounds like these discussions are mainly about Doom 3 style rendering, where the render target is read back into the pixel shaders for each lighting pass. FP16 would have an advantage over FP10 in those cases, but Doom 3 seems to work just fine with standard 32-bit colors (though it sounds like Carmack had to pull some pretty fancy tricks to make it work). I'd expect next-gen games to use shadow buffers instead of stencil shadows, making each lighting pass a loop in the pixel shader instead of a full rendering pass. Everything stays FP32 until the final export to an FP10 or FP16 render target.

The real question is, what is the visual difference between...
- FP32 -> FP16, post-filtered to Int8
- FP32 -> FP10, post-filtered to Int8.
Will the blooms look any different? Are low light variations lost?
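One way to get a feel for that last question is a toy accumulation test: add a number of dim lighting contributions, quantising the running total after each blend the way a multipass render target would, and compare against an FP32 reference. The 10-bit and 7-bit mantissa widths stand in for FP16 and FP10 and are assumptions about fractional precision only; real blending, filtering and tone mapping are ignored:

```cpp
// Toy multipass accumulation experiment: how much error does repeated
// round-tripping through a narrow-mantissa render target introduce?
#include <cstdio>
#include <cmath>

// Round x to 'bits' fractional mantissa bits (exponent range left unbounded).
double quantize(double x, int bits)
{
    if (x == 0.0) return 0.0;
    int e;
    double m = std::frexp(x, &e);               // x = m * 2^e, m in [0.5, 1)
    double scale = std::ldexp(1.0, bits + 1);   // +1 because m sits below 1.0
    return std::ldexp(std::round(m * scale) / scale, e);
}

int main()
{
    const int    passes = 64;
    const double light  = 0.013;                // one dim light's contribution

    double ref = 0.0, fp16ish = 0.0, fp10ish = 0.0;
    for (int i = 0; i < passes; ++i)
    {
        ref     += light;                                 // FP32-style reference
        fp16ish  = quantize(fp16ish + light, 10);         // FP16-like render target
        fp10ish  = quantize(fp10ish + light, 7);          // FP10-like render target
    }
    printf("reference:       %f\n", ref);
    printf("10-bit mantissa: %f (err %.2f%%)\n", fp16ish, 100 * std::fabs(fp16ish - ref) / ref);
    printf(" 7-bit mantissa: %f (err %.2f%%)\n", fp10ish, 100 * std::fabs(fp10ish - ref) / ref);
}
```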
 