FP16 and market support

Simon F said:
As I said before, the NV chip seems to suffer performance drops with increasing register usage which would seem to indicate a lack of register space and/or some strange limitations on access.
The reason for the register usage performance hit is explained here:

http://www.3dcenter.de/artikel/cinefx/index5_e.php

While analyzing the Gatekeeper function we noticed that the number of quads in the pipeline depends directly on the number of temp registers. The fewer temp registers are used, the more quads fit into memory.

The recommendation from nVidia aims at having as many quads as possible in the pipeline. Why is this so important? We found three central reasons:

* Before a quad can take another pass through the entire pipeline, it is necessary to send an empty quad down the pipe for technical reasons. This is of course detrimental to the usable performance, but the influence is smaller the fewer empty quads are necessary. And that can be achieved by increasing the number of quads in the pipeline.
* Because of the length of the pipeline and the latencies of sampling textures, it is possible that the pipeline is full before the first quad reaches its end. In this case the Gatekeeper has to wait as long as it takes the quad to reach the end, and every clock cycle that passes means wasted performance. An increased number of quads in the pipeline lowers the risk of such pipeline stalls.
* The textures to read from can change on every pass through the pipeline. Because few quads result in few texture samples being read in a row, the cache hit rate decreases and more memory bandwidth is required.
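In other words, quads in flight are capped by register-file capacity, and the pipeline pads with empty quads once too few fit. A minimal sketch of that relationship (all figures here are illustrative assumptions, not vendor numbers):

```python
# Rough model of the Gatekeeper argument: quads in flight are capped by
# register-file capacity, and utilization drops once fewer quads fit than
# the pipeline is deep. All numbers here are illustrative assumptions.
REGISTER_FILE_SLOTS = 512   # assumed FP32 temp slots shared by all quads
PIPELINE_DEPTH = 220        # assumed stages a quad occupies end to end

def utilization(fp32_temps):
    quads_in_flight = REGISTER_FILE_SLOTS // fp32_temps
    # once fewer quads fit than there are stages, the gaps are filled
    # with empty quads, which do no useful work
    return min(1.0, quads_in_flight / PIPELINE_DEPTH)

for temps in (2, 4, 6, 8):
    print(f"{temps} FP32 temps -> ~{utilization(temps):.0%} of peak")
```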
 
Simon F said:
DemoCoder said:
But if 50% of the average workload could be handled fine by FP16, wouldn't the optimal architecture be one that has HW FP16 support for 50% of those workloads that needed it, and HW FP32/FP24 for the other 50%? In that way, transistors saved by implementing FP16 (instead of FP24) for some part of the pipeline can be applied to either implementing FP32 or more FP24 units for the rest.

My personal conjecture (and I stress that it is only educated speculation) is that there were no transistors saved in the NV FP16 implementation. My guess is that it's an FP32 pipeline with additional units to expand 16->32 at the front end and to compact 32->16 at the back end. Putting in separate 32- and 16-bit pipelines would be silly. Each 32-bit register could thus store twice as many floating-point values when used in a 16-bit mode.
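As a quick illustration of the packing idea (my own numpy sketch, not a description of the actual register file):

```python
import numpy as np

# Two fp16 values occupy the same 32 bits as one fp32 value, so a register
# file addressed in 32-bit words holds twice as many halves as floats.
halves = np.array([1.5, -0.25], dtype=np.float16)
packed = halves.view(np.uint32)      # one 32-bit word holding both halves
print(hex(packed[0]))                # raw word (byte order is the platform's)
print(packed.view(np.float16))       # unpacked again: [ 1.5  -0.25]
```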

As I said before, the NV chip seems to suffer performance drops with increasing register usage which would seem to indicate a lack of register space and/or some strange limitations on access. Having the 16-bit storage would thus give a big boost in performance by halving the number of physical registers required.

That's basically what I said ages back in this thread.

Just because FP16 does not show a huge performance increase over FP32 in NV3x does not mean this will always be the case (or that the lack of a performance difference was intentional). We all know NV3x has had its problems. nVidia chose not to try to fix those problems in hardware but to focus on getting future chips right (see the Merrill Lynch analysis of the NV30 launch and Huang's comments on the low-k process and NV3x).
 
This thread is kinda sad. :cry:

Anybody who is truly interested in this topic should read the two threads Reverend linked to, as they contain tons of information.

If you are too lazy to read all of them, at least take a look at sireric's post, which perfectly explains the decision to go FP24 on ATI's (and probably MS's) behalf, and Dave H's post, which is IMO a very good summary of this topic.

Merry Christmas everybody. :)
 
My opinion:

1) ATi did a better job performance-wise.

2) nVidia didn't do quite as good a job performance-wise. Its architecture seems limited by things other than FP32 and FP16 support directly.

3) About FP16, FP24 and FP32: if I look at it from an HLSL viewpoint, I find it very frustrating that when I choose a float (FP32) I get FP24 on an ATi card but real FP32 on an nVidia card, and when I choose a half (FP16) I get FP24 on an ATi card and FP16 on an nVidia card. If you ask me, this makes things frustrating, because you aren't guaranteed to get FP32 when you ask for it.
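To see what those format choices mean numerically, here is a small sketch (my own; it rounds a value to the stored mantissa widths usually attributed to these formats: 10 bits for FP16, 16 for FP24, 23 for FP32):

```python
from math import frexp, ldexp

# Round x to a given number of stored mantissa bits (ignoring exponent
# range): 10 bits ~ FP16, 16 ~ FP24, 23 ~ FP32.
def quantize(x, mantissa_bits):
    m, e = frexp(x)                      # x = m * 2**e, with 0.5 <= |m| < 1
    scale = 2.0 ** (mantissa_bits + 1)   # +1 for the implicit leading bit
    return ldexp(round(m * scale) / scale, e)

x = 1.00001
for name, bits in (("FP16", 10), ("FP24", 16), ("FP32", 23)):
    print(f"float/half as {name}: {quantize(x, bits)!r}")
```

FP16 rounds the 1e-5 offset away entirely, FP24 lands on the nearest representable step a few parts per million off, and FP32 keeps it almost exactly.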
 
After looking at sireric's post, I have something to comment on:
the multipliers and adders would be nearly twice as big as well
Adders increase linearly with bit size, so an FP32 adder won't be twice as big as an FP24 adder; the multiplier, however, would be about twice the size. For this to be an important consideration, of course, you would need to know exactly how big the adders/multipliers are in proportion to the rest of the shader, and how big the shader is in proportion to the rest of the architecture.
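In rough numbers (a back-of-envelope check, assuming the usual mantissa widths of 16 bits for FP24 and 24 for FP32, implicit bit included):

```latex
\text{adder area} \propto n:\quad \tfrac{24}{16} = 1.5\times
\qquad\qquad
\text{multiplier area} \propto n^2:\quad \left(\tfrac{24}{16}\right)^2 = 2.25\times
```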

Anyway, I'm still not convinced that there's enough subtexel precision in FP24 for proper texture addressing. Unfortunately, there are no available applications that would adequately test this, so I cannot be sure. It would be really nice if there was...
 
Chalnoth said:
Anyway, I'm still not convinced that there's enough subtexel precision in FP24 for proper texture addressing.

Unfortunately, there are no available applications that would adequately test this, so I cannot be sure. It would be really nice if there was...

If there are NO applications that can show 24-bit precision to be insufficient, wouldn't that indicate that for the current generation it most likely is enough?......
 
OpenGL guy said:
If I read what you're saying correctly, you're saying that ATI should have done FP16, then added support for FP24. If you have 8 FP16 units that can also do FP24, then you've gained nothing. If you mean that ATI should have done FP16 and then added extra FP24 units, how is this a better use of resources? You're talking about using more transistors than were already used. Even if it gains you an extra execution unit (you can do two FP16 ops or one FP24 op), it still means a larger chip, which means higher cost, lower yields, etc.

Let's say 50% of the instructions executed on your architecture require only FP16 to have no discernible difference in output. The rest require at least FP32 (texture address ops) or, say, FP24.

Let's say an FP16 unit requires N transistors, an FP24 unit requires 2.25N transistors (1.5^2 x N, if area scales with the square of the bit width), and an FP32 unit requires 4N, just for the sake of argument.

Now let's say (for the sake of simplicity) that the ATI chip has 16 FP24 units, so cost(16*FP24) = 36N. However, these units only need FP24 on about half the workload. A hypothetical 8*FP16 + 8*FP24 chip would cost 26N + C (some unknown constant for multi-precision dispatch). In other words, the chip with 26N + C transistors is smaller and cheaper than the 16*FP24 chip, but produces nearly identical output and performance.

"Using transistors where they are needed most"

An 8*FP16 + 8*FP32 chip would cost 40N, only slightly more than the 16*FP24 chip.
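For what it's worth, the arithmetic checks out. A trivial sketch of this cost model (the quadratic area scaling is the stated assumption):

```python
# DemoCoder's cost model: unit area scales with the square of the bit
# width, normalized so an FP16 unit costs N (i.e. 1.0 here).
def unit_cost(bits):
    return (bits / 16) ** 2            # FP16 -> 1N, FP24 -> 2.25N, FP32 -> 4N

print(16 * unit_cost(24))                      # 16xFP24         -> 36.0 N
print(8 * unit_cost(16) + 8 * unit_cost(24))   # 8xFP16 + 8xFP24 -> 26.0 N (+C)
print(8 * unit_cost(16) + 8 * unit_cost(32))   # 8xFP16 + 8xFP32 -> 40.0 N
```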

Isn't ATI already running a multi-precision pipeline, with FP32 for texture ops and FP24 for color ops? If so, the extra complexity of dispatching ops of differing precision is already there.

Any "tuned" chip should have functional units tuned to execute the expected workloads. If people are doing boatloads of POWS, RSQS, and SINCOSes for example, then you should load up on those at the expense of other things. It all depends what the workload is. Everything is a tradeoff in design. You can go for a uber-generic design, and your performance on random tasks might be better, but your performance on the average game might not be.

Anyway, I think ATI has a bigger problem than just this repetitive precision argument, which is how to take their architecture to VS/PS3.0 and make it run fast. The corner cutting and tradeoffs for 3.0 are much more complex.
 
Chalnoth said:
After looking at sireric's post, I have something to comment on:
the multipliers and adders would be nearly twice as big as well
Adders increase linearly with bit size
Well, not necessarily. If you go for that option then the execution time would also go up by a linear amount. That could mean that you'd require extra latency in order to make the circuit operate with the given clock frequency. Alternatively, you could use a more gate-hungry algorithm which would execute in fewer clock cycles. <shrug>

BTW: I'm not entirely sure I agree with some of the conclusions drawn by 3D Centre. (I'm not sure I even understand what they are trying to say in Point2!). I can certainly see the texture fetch latency argument (it was my thinking as well) but I'm wondering if that's all there is to it. Has anyone tried seeing the effect of using more registers with textures that are guaranteed to be in cache and thus unlikely to cause stalls?
 
I haven't kept up with the fine details of chip logic for a long time, but surely a multiply unit can't be that complex to produce?

The Motorola 68020 was the first chip I can remember using a barrel shifter for multiplication, which I think should work for floating point with modifications to take the exponent into account, but then I've never really delved deeply into FP stuff.
 
Just trying to think logically: let's say we have a 16-bit mantissa.


Example with a 16-bit mantissa (bit positions 0-f in hex):

bits:           0 0 1 0  0 0 1 0  0 1 1 0  1 0 0 0
position:       0 1 2 3  4 5 6 7  8 9 a b  c d e f

OR of pairs:     0   1    0   1    1   1    1   0
OR of fours:       1        1        1        1
OR of eights:          1                  1

(On each layer, every second value does not need to be evaluated when searching top-down; those were the Xs in my original diagram.)

Okay, this is drawn upside down, but you OR each pair of bits together and store the value (well, you would delay the signal somehow, but meh). ORs rather than XORs, since an XOR of two 1s gives 0 and would lose a leading one. The Xs represent values that don't have to be evaluated.

So if you ignore the savings from the Xs, you would need about 2N OR gates.

Okay, now you start from the top of my diagram. If the evaluated bit on the first layer is one, we shift N/2 bits. Then on the second layer, based on whether or not the first evaluated bit was one, we pick which half, and so it goes on. So the number of shifter stages will be log base 2 of N.

There will be 2N-ish gates used on the crossbars too.

Okay, it's not the smallest amount, but using the above method I can't see how normalisation is such a big problem, because it only seems to be order N in transistor complexity. Maybe I'm missing something; I am tired, after all.
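Here is that search written out (a minimal sketch of the scheme as I read it, with ORs doing the pair-combining):

```python
# Binary-search normalization: an OR tree says which half of the word holds
# the leading one; each level picks a half and contributes a shift of
# N/2, N/4, ..., 1 -- log2(N) stages and order-N gates overall.
def normalize_shift(bits, n=16):
    shift = 0
    width = n
    while width > 1:
        half = width // 2
        # OR of the upper half of the remaining window (one tree node)
        upper = (bits >> (n - shift - half)) & ((1 << half) - 1)
        if upper == 0:
            shift += half        # leading one is in the lower half
        width = half
    return shift

m = 0b0010001001101000
print(normalize_shift(m))        # 2: shift left by 2 to normalize
```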
 
Simon F said:
My personal conjecture (and I stress that it is only educated speculation) is that there were no transistors saved in the NV FP16 implementation.
My conjecture would match this conjecture.
 
radar1200gs said:
I haven't kept up with the fine details of chip logic for a long time, but surely a multiply unit can't be that complex to produce?
As a mathemagician of my acquaintance is fond of saying, "It Depends."

You can make very tiny multipliers if you don't mind taking many cycles to do it. You can make huge, complicated multipliers if you need things to go fast, you can pipeline them for high throughput, etc.

So there are lots of choices - that's where it gets complicated.
 
Simon F said:
BTW: I'm not entirely sure I agree with some of the conclusions drawn by 3D Centre. (I'm not sure I even understand what they are trying to say in Point2!). I can certainly see the texture fetch latency argument (it was my thinking as well) but I'm wondering if that's all there is to it. Has anyone tried seeing the effect of using more registers with textures that are guaranteed to be in cache and thus unlikely to cause stalls?

The problem with the texture fetch is not the latency; the problem is that the cache hit rate goes down. But that is not the main reason for the bad performance. The main reason is point 2.

First, I have to apologise for the poor comprehensibility of the whole text. It is not the translation's fault; the original is no better on this point.

Point 2 means what you said in another post: the register file of the NV3X chips is too small to store many temp values per pixel and let the pixel processor run at full speed at the same time. We know that the NV35 has room for 256*2 quad FP32 Vector4 values. We also know that there are more than 200 stages in the whole pixel processor. Each pixel quad can only be at one stage at a time, and each pixel quad needs a portion of the register file. If you use only 2 FP32 temps per pixel, everything is fine. If you use 4 FP32 temps per pixel, the problem starts: the register file only has room for 128 pixel quads. After the gatekeeper has inserted 128 pixel quads into the pipeline, it has to stop. At this point the first pixel quad has not yet reached the end of the pipeline (> 200 stages), but the show must go on, and the gatekeeper starts to insert empty pixel quads instead. Each of these empty quads is a waste of performance. The more temps you use per pixel, the more empty quads are in the pipeline, and the speed gets worse with every two additional FP32 temps.

To defuse this problem, nVidia added an option to store two fp16 values in one fp32 register. This is (with one exception) the whole fp16 support in the NV3X chips.

The exception is the calculation of 1/sqrt(x). The shader core can only calculate this at fp16 in one cycle; if you need it at full fp32, you have to spend two cycles.
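A quick sketch of why the packing matters, using the figures from this post (256*2 = 512 quad FP32 register slots, a pipeline of more than 200 stages); the packing rule and stall criterion are my simplification:

```python
# Why the fp16 packing helps: two fp16 temps share one fp32 register slot,
# doubling the quads that fit before the gatekeeper must pad with empties.
REGISTER_SLOTS = 512   # 256*2 quad FP32 Vector4 values, per the post above
STAGES = 200           # "> 200 stages" in the pixel processor

def quads_in_flight(temps, fp16=False):
    slots = (temps + 1) // 2 if fp16 else temps   # two fp16 per fp32 slot
    return REGISTER_SLOTS // slots

for temps in (4, 8):
    q32 = quads_in_flight(temps)
    q16 = quads_in_flight(temps, fp16=True)
    print(f"{temps} temps: {q32} quads at fp32, {q16} at fp16 "
          f"(empty quads appear below {STAGES})")
```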
 
radar1200gs said:
...
The same thing goes for game developers. In the real world, most DX9 class chips sold to consumers will support FP16 and developers will ignore it and the potential performance increases at their peril.

You are looking at the issue exactly backwards. Why would any hardware manufacturer want to do fp16 at all if he can do fp24, which runs as fast as, if not faster than, fp16 runs on the products which support fp16? This is only underscored by the fact that fp24 is the API target, not fp16.

A manufacturer wouldn't, in my view, simply because it isn't needed for anything.

So why does nVidia need fp16?

Answer: Because the nV3x architecture doesn't support any rendering precision *above* fp16 which is competitive in terms of performance.

If your assumption is that fp16 is "always faster" than a higher rendering precision such as fp24, that assumption is incorrect: the actual performance of a gpu architecture is not determined solely by the precision of the fp pipeline, but by all of the other factors in gpu design which lie outside of fp rendering precision. And that's why R3x0 does fp24 as fast as, and faster than, nV3x does fp16. Estimating gpu performance strictly by the rendering precision of the fp pipeline is a mistake.

Hence, it will be "he who markets fp16-dependent products" who will suffer, as opposed to developers who write software which displays optimally at fp24. Relative to the 3d hardware market, nothing has changed from 18 months ago, when people were buying nVidia 3d cards because they provided the best performance. All that's changed in the last 16 months or so is that the same people who were buying nVidia then, are now buying ATi--for the same reasons. The market will tend to center around better standards when they can be brought to market in a practical and economic sense. So, just as 16-bit integer usurped and replaced 8-bit integer, and 24/32-bit integer replaced 16-bit integer as a market standard in both software and hardware, it should be no surprise that a speedy implementation of fp24 is preferred above a speedy implementation of fp16.
 
Reverend said:
I think it would be wrong to attach figures/percentages to the differences. It is difficult to provide "percentage of difference" in the first place. I mean, how many percent is 32-bit better than 16-bit color?

It's far better to just list a smattering of examples where the differences between the two formats exists, and how severe or negligible the differences are. At the risk of sounding like a broken record, here and here are threads with lively and informative discussions about the differences.
I understand what you are saying, Reverend. What I was looking for is just some context for the discussion with regard to precision. I recall that in a previous thread you gave me a link showing some precision issues with TROAD. If that was the type of trade-off that happens, then fine: I would agree that the transistors saved, or utilized elsewhere, were a good idea. For the sake of argument, let's say FP64 is the ultimate precision, beyond which there would never be a reason to go. Would FP32 bring you 99.5% of the way there? Would FP24 bring you 97% of the way there?
Dio said:
FP16 is pretty much useless for anything but colour calculations.
FP24 is fine for texture calculations but starts to lose accuracy a touch on long chains of operations.
FP32 is as FP24, only pushes the breakdown point out somewhat further (about six times as many instructions).
FP64 pushes out the breakdown point so far that problems are very unlikely.
Dio, thankfully, has provided the context I was looking for. If you or anyone else cares to dispute his thinking, I am all ears.
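Dio's breakdown points are easy to visualize with a toy experiment (my own sketch, reusing the usual 10/16/23 stored-mantissa-bit widths; it accumulates many small increments and watches where each format stops resolving them):

```python
from math import frexp, ldexp

# Accumulate 10,000 increments of 0.0001 at each format's mantissa width.
# Once the increment falls below half a ulp of the running sum, additions
# round away to nothing: FP16 stalls early, FP24 and FP32 barely drift.
def quantize(x, mantissa_bits):
    m, e = frexp(x)
    scale = 2.0 ** (mantissa_bits + 1)
    return ldexp(round(m * scale) / scale, e)

for name, bits in (("FP16", 10), ("FP24", 16), ("FP32", 23)):
    acc = 0.0
    for _ in range(10000):
        acc = quantize(acc + 0.0001, bits)
    print(f"{name}: {acc:.6f} (exact: 1.000000)")
```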
 
This thread appears to have become Chalnoth vs. the rest of the world.

FP24 is the standard, right or wrong in his eyes, but I am quite sure that, based upon the input of all the companies involved, MS made the best decision regarding FP precision vs. performance.


ATi is a lot smarter than you are, and I am quite sure that all of their engineers are smarter than the ones over at nVidia, based upon the current generation's performance and adherence to the DX9 spec....
 
Demirug said:
We also know that there are more than 200 stages in the whole pixel processor.

I think we shouldn't call these stages; this is execution unit latency.
When you talk about quads going through the pipeline, we should think of them as threads.

So think about it this way (rephrasing what you said):
The FX creates a thread for every quad.
The number of threads that can run depends on the peak(!) register usage.
The FX has very high latency, so it has to run a lot of threads to avoid being idle.

Now, we know that nVidia has produced FP32 vector processors before (vertex shaders), and those had a latency of only 6 clocks.
So the only cause of the high latency should be texturing.

But I somewhat feel that 200 is unrealistically high. Also, the latency shouldn't be fixed but should depend on whether the cache is hit or not.
So you should have lower latency when cache hits are guaranteed, and thus more registers should be usable in that case.
 
YeuEmMaiMai said:
...and I am quite sure that all of their engineers are smarter than the ones over at nVidia based upon the current generation's performance and adherance to the DX9 spec...
I hope you realize that is a very serious insult to any Nvidia employee. Making one small misjudgement and having a bit less luck doesn't make them all morons. After all, Nvidia cards are still high quality and beat the corresponding ATI cards on several points other than ps 2.0 performance. I'd like to see you make the design decisions for the next generation of graphics cards, and we'll see if they work out well on all aspects...
 
Sure, we can give it different names, but that does not change the behavior.

Yes, most of the stages (nVidia calls them slots) are used for the texture latency (nv: > 176). But even if you do not use the texture unit, you still have to pass through all of these stages, because the texture unit bypass has the same length. This is necessary because the order of the quads cannot change; there is no real thread scheduler.

You can test the behavior: write some pixel shaders of the same size but with different numbers of temps and no textures, and you will see that they slow down as the number of temps increases.
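Something along these lines could generate the test shaders (a hypothetical sketch: the ps_2_0 mnemonics are standard, but the assembling/timing harness, e.g. via D3DXAssembleShader, is left out):

```python
# Hypothetical generator for the experiment above: ps_2_0 shaders of equal
# length that differ only in how many temp registers they touch.
def make_shader(num_temps, num_instructions=32):
    lines = ["ps_2_0"]
    for i in range(num_temps):              # touch every temp once
        lines.append(f"mov r{i}, c{i}")
    for i in range(num_instructions):       # fixed-length dependent chain
        dst, src = i % num_temps, (i + 1) % num_temps
        lines.append(f"mad r{dst}, r{dst}, r{src}, c0")
    lines.append("mov oC0, r0")
    return "\n".join(lines)

print(make_shader(num_temps=4))
```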
 
Simon F said:
Chalnoth said:
After looking at sireric's post, I have something to comment on:
the multipliers and adders would be nearly twice as big as well
Adders increase linearly with bit size
Well, not necessarily. If you go for that option then the execution time would also go up by a linear amount.
I have seen several adder designs (most well-known is probably the Brent-Kung adder) that achieve both linear size and logarithmic execution time. These adders are usually slower than the fastest known adders (Sklansky, Kogge-Stone) by a constant factor of about 1.5 to 2. (For an overview of the mentioned adders, look here)

Of course, for FP addition there are a lot of hairy circuits to normalize inputs before addition and renormalize/round results after addition; the fastest designs still have logarithmic execution time and size proportional to n*log n (although for FP addition in particular there is a large number of size/speed tradeoffs that can be made).
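For the curious, here is the parallel-prefix idea behind those adders (a minimal Kogge-Stone sketch in software; real hardware would add the FP normalization circuitry around it):

```python
# A Kogge-Stone parallel-prefix adder in miniature: generate/propagate
# signals are combined over spans that double each level, so the carry
# chain resolves in log2(n) levels instead of n ripple stages.
def kogge_stone_add(a, b, n=32):
    g = [(a >> i) & (b >> i) & 1 for i in range(n)]    # carry generate
    p = [((a >> i) ^ (b >> i)) & 1 for i in range(n)]  # carry propagate
    d = 1
    while d < n:                                       # log2(n) levels
        g = [g[i] | (p[i] & g[i - d]) if i >= d else g[i] for i in range(n)]
        p = [p[i] & p[i - d] if i >= d else p[i] for i in range(n)]
        d *= 2
    # after the prefix pass, g[i] is the carry out of bit i
    carry_in = [0] + g[:n - 1]
    s = 0
    for i in range(n):
        s |= ((((a >> i) ^ (b >> i)) ^ carry_in[i]) & 1) << i
    return s

assert kogge_stone_add(12345678, 87654321) == 12345678 + 87654321
```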
 