On FP16 and FP32 performance...

Bigus Dickus · Nov 21, 2002

Just what is the expected performance hit for enabling FP color on the R300? I know there probably aren't benchmarks available to show that yet, but I'm asking more from a theoretical point of view.

What will that affect? Do the pixel shaders take longer to perform a given operation on an FP32 texel than on an int 32 texel? Does it take more bandwidth to move texture data? Does it take a larger footprint in memory?

What is the difference between the way the R300 handles FP color vs. the way the NV30 does it? I know the NV30 has dedicated hardware for int 32, FP16 and FP32, per the B3D interview. I've seen speculation that the NV30 would be two or three times faster than the R300 using 128 bit color (or 96, as the case might be). Is there any rationale for such claims?

What governs performance with FP color? Bandwidth, shaders, otherwise...

Lots of questions, yes, but this is one area I'm not too informed in. It's well known what the impact of enabling anisotropic filtering is on different architectures, or SSAA, or MSAA, but little is known (to me at least) about the impact of enabling high precision color.

I almost want to think that if the architecture was designed to natively handle 128/96 bit internal FP precision, then there would not be a tremendous performance penalty for using it. Is this hunch correct?

Luminescent · Nov 21, 2002

I was wondering the same things. Wish sireric, aran, or some other hardware engineer could help us out. I do know that on integers, the NV30 should be quite a bit faster due to its dual series of register combiners (each series containing eight combiners). Thus it can execute 2 integer ops per clock. Combining hardware also seems to execute a tad faster than general purpose processing hardware.

Now about the floating-point execution times, there should be no reason why clock for clock, the R300 should lag behind the NV30 in f32 number crunching, unless the NV30 has more fpu's or can issue more instructions. In f16 of course there is the NV30's benefit of dual execution.

What would be impressive to me would be an array of 32-bit 4 scalar units for each pixel processor in the NV30, which could each be split into two 16-bit sections, and given a scalar instruction, as well as a 4-way parallel vector. This would mean that the pixel core would have to be able to issue up to 4 scalar instructions and 1 vector per clock. That would be impressive and highly configurable (assuming the scalar units are similar to the R300's).

KimB · Nov 21, 2002

Only not altogether practical since the vast majority of data used in 3D scenes is vector-based. It just takes far fewer transistors to support nothing but 4-vectors (Or, in ATI's case, 3-vectors + 1 scalar or 4-vectors).

OpenGL guy · Nov 21, 2002

Bigus Dickus said:
Just what is the expected performance hit for enabling FP color on the R300?

Do you mean 128-bit float buffers or floating point in the pixel shader?

demalion · Nov 21, 2002

Umm, I thought that it was made clear that there was no performance hit... just no performance increase from decreasing from maximum precision.

I could do a search, as I'm pretty sure we've had an answer from the horse's mouth somewhere, but so could you!

But perhaps I misunderstood their comment.

EDIT: wow, the page refresh was 4 minutes old? TV sucks up time pretty well :-?

I'm going to guess he's going to give my answer if you don't answer "128-bit float buffer"...

the suspense is killing me!

OpenGL guy · Nov 21, 2002

demalion said:
I'm going to guess he's going to give my answer if you don't answer "128-bit float buffer"...

How could 128-bit float buffers be free? Figure out how much bandwidth they require: 8 pix per cycle * 16 bytes per pix * 325 MHz = 41.6 GB/s. No one ever said using float buffers would be free... why do you think people use compressed textures whenever possible? I mean, 32-bit textures aren't free, are they? If they were, there would be no reason to compress textures in the first place.
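The arithmetic above is easy to reproduce. A minimal sketch, assuming one color write per pixel per clock at the 8 pixels/clock and 325 MHz figures quoted in the post; the function name and the 4-byte comparison case are illustrative and not from the post:

```python
def write_bandwidth_gb_s(pixels_per_clock, bytes_per_pixel, clock_mhz):
    """Peak color-write bandwidth in GB/s, with 1 GB taken as 1e9 bytes."""
    bytes_per_second = pixels_per_clock * bytes_per_pixel * clock_mhz * 1e6
    return bytes_per_second / 1e9

# 128-bit float buffer: 4 channels x 4 bytes = 16 bytes per pixel.
fp128 = write_bandwidth_gb_s(8, 16, 325)
# Plain 32-bit integer buffer for comparison (4 bytes per pixel).
int32 = write_bandwidth_gb_s(8, 4, 325)

print(f"128-bit float buffer: {fp128:.1f} GB/s")
print(f"32-bit integer buffer: {int32:.1f} GB/s")
```

This only counts writes at peak fill rate; reads for blending, Z traffic and texture fetches would come on top of it.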

3dcgi · Nov 21, 2002

OpenGL guy said:
demalion said:

I'm going to guess he's going to give my answer if you don't answer "128-bit float buffer"...

How could 128-bit float buffers be free?

I think demalion was agreeing with this statement. He meant it was free unless a 128-bit buffer was used.

DemoCoder · Nov 21, 2002

Is there any performance difference between 32-bit integer pipeline calcs and 96-bit pipeline calcs? (assuming integer 32-bit framebuffer)

That is, will the R300 run faster if you select integer mode?

demalion · Nov 21, 2002

3dcgi reads me right, and DC is asking the same question I think BD was.

God, this nv30 launch is sprouting questions that have us trying to eat into your spare time. Do you need to go have a beer and relax before answering all these demands?

If so, I'll understand, at least.

OpenGL guy · Nov 21, 2002

DemoCoder said:
Is there any performance difference between 32-bit integer pipeline calcs and 96-bit pipeline calcs? (assuming integer 32-bit framebuffer)

The floating point pipeline is full speed. DX9 requires a full floating point pixel pipeline and R300 can do this without a reduction in performance.

Humus · Nov 21, 2002

DemoCoder said:
Is there any performance difference between 32-bit integer pipeline calcs and 96-bit pipeline calcs? (assuming integer 32-bit framebuffer)

That is, will the R300 run faster if you select integer mode?

Not as far as I know. There's no performance difference between running a shader in GL_ATI_fragment_shader or an equivalent shader in GL_ARB_fragment_program. None that I could measure at least.

Bigus Dickus · Nov 21, 2002

OK, so it's becoming just a bit less vague now.

Using float buffers takes a performance hit because of the larger data size... meaning higher memory usage and more bandwidth requirement to move data from memory, correct? Or am I going way out on a limb there?

If that's somewhat correct, then the NV30 and R300 will have the same bandwidth and memory usage requirements... right?

The other part of the question then is about the pixel shader performance. If I'm reading correctly, the R300 was designed for 32 bit integer and 32, 64 and 96 bit floats to all run at the same speed through the shaders.

So there would be a general performance degradation due to the increased bandwidth requirements, but not due to shader performance. Am I in the right ball-park here?

Then, the NV30 would appear to have dedicated hardware for 32 bit integer, 64 and 128 bit floats. From the B3D interview, it sounds like enabling FP16 has a performance hit, because there is separate hardware for each, and the int 32 hardware naturally runs faster, being optimized for that task. I can only assume this is talking about shader performance, or is there something key I'm missing here?

It almost sounds to me like 64 and 128 bit floats on the NV30 will reduce performance, simply because 32 bit integer has been optimized on dedicated hardware, whereas on the R300 the same hardware handles all modes. If correct, then regardless of whether the NV30 can address two 64 bit floats in the same time it can one 128 bit float, it will appear that the performance (of the shaders?) reduces as the mode becomes higher precision. 32 bit integer might fly... but you lose speed from there.

Does this mean that the NV30's greatest advantage is in 32 bit integer, despite what the "two 64 bit vs. one 128 bit" stuff makes it sound like?

Damn. Perhaps it's just as vague to me as ever. :-?

Luminescent · Nov 21, 2002

With respect to its own performance, yes, the NV30 sounds like it will be slower when using floating point. However, the R300 has no dedicated register combiners for integer operations, so it has nothing to compare its f16 and f32 performance to (which is essentially the same in the pixel processor). If it had a dedicated integer unit, it would probably be faster at handling the format. If you look closely at the following diagram of the NV30 pixel pipeline: http://www.beyond3d.com/articles/nv30r300/index.php on page 9, you'll notice why the pipe of the NV30 runs two integer ops per clock at a faster rate. It has two sets of register combiners (more specialized hardware) for the pixel pipeline.

For the sake of performance speculation, in the GeForce FX presentation (http://www.nvidia.com/view.asp?PAGE=geforcefx_testimonials, around minute 54), it was indicated that the gigaflop number for the NV30 (200, 51 from the pixel shader) came from 32-bit precision flops. This would mean NV30 could execute 12.75 ops per cycle, per pipeline in f32 and 25.5 ops per cycle, per pipeline on f16 ops. In comparison, the R300 could execute 9 ops (4 fmads and a complex scalar) per cycle in one of its pixel processors in f32 format.
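The per-pipeline figures come from dividing the quoted pixel shader FLOPS by pipelines times clock. A rough sketch of that division, assuming NV30's announced 500 MHz core clock and 8 pixel pipelines (neither number is stated in the post), counting an fmad as two ops, and assuming f16 simply doubles the f32 rate:

```python
NV30_PIXEL_SHADER_GFLOPS = 51.0  # figure quoted from the presentation
NV30_CLOCK_MHZ = 500             # assumption: announced core clock
NV30_PIPELINES = 8               # assumption: announced pixel pipelines

def ops_per_clock_per_pipe(gflops, clock_mhz, pipelines):
    """Floating-point ops issued per clock in each pixel pipeline."""
    return gflops * 1e9 / (clock_mhz * 1e6 * pipelines)

nv30_f32 = ops_per_clock_per_pipe(
    NV30_PIXEL_SHADER_GFLOPS, NV30_CLOCK_MHZ, NV30_PIPELINES
)
nv30_f16 = 2 * nv30_f32  # assumption: f16 runs at exactly twice the f32 rate

# R300 per the post: 4 fmads (2 ops each) plus 1 complex scalar op.
r300_f32 = 4 * 2 + 1

print(f"NV30 f32: {nv30_f32} ops/clock/pipe")
print(f"NV30 f16: {nv30_f16} ops/clock/pipe")
print(f"R300 f32: {r300_f32} ops/clock/pipe")
```

These are peak marketing-style numbers, so they say nothing about how many of those ops a real shader can keep busy.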

Bigus Dickus · Nov 21, 2002

Yes, and my point (and question) is still the same. Basically, it comes down to this:

Will the NV30's shader performance drop as higher precision modes are selected, since dedicated hardware has been optimized for each mode and it stands to reason (and is supported by the information directly above) that the computational power drops as precision is increased?

If true, then the NV30 would lose ground relative to the R300 in higher precision modes (granted, it might have quite a bit of superiority in pixel shader power from the beginning in int 32 mode, due to ops per cycle or clock speed or both).

OpenGL guy · Nov 21, 2002

Bigus Dickus said:
So there would be a general performance degradation due to the increased bandwidth requirements, but not due to shader performance. Am I in the right ball-park here?

The bandwidth requirements are only increased if you use a large format for output. In other words, if you write to a buffer with more than 32 bits per pixel or use multiple render targets, then the bandwidth requirements would increase over plain 32 bit rendering.
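Put another way, the write cost depends on what leaves the pipeline, not on the precision used inside it. A small sketch of that relationship, with illustrative function names and example formats that are not from the post:

```python
def color_write_bytes(bits_per_target, num_targets=1):
    """Bytes written per pixel across all bound render targets."""
    return (bits_per_target // 8) * num_targets

def cost_vs_32bit(bits_per_target, num_targets=1):
    """Write traffic relative to a single 32-bit render target."""
    return color_write_bytes(bits_per_target, num_targets) / color_write_bytes(32)

# Shader precision never appears as an input: only output format and
# the number of render targets change the result.
examples = [
    ("32-bit integer", 32, 1),
    ("64-bit float", 64, 1),
    ("128-bit float", 128, 1),
    ("4 x 32-bit targets", 32, 4),
]
for label, bits, targets in examples:
    print(f"{label:>20}: {cost_vs_32bit(bits, targets):.0f}x")
```

A float shader writing to a plain 32-bit target lands on the first row, which is the "no extra bandwidth" case described above.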

DemoCoder · Nov 21, 2002

No Bigus, you are reading it backwards.

The NV30 will likely run the same speed at 128-bit precision as the R300, perhaps faster. However, in lower precision modes, you get a speed boost. This is only natural. If I choose to go from double precision to single precision, or integer mode, I should go faster, not stay the same.

The R300 has no specific optimizations for lower precision modes, hence you get no benefit from going to lower precision modes. In 32-bit mode, there are functional units and transistors on the R300 just sitting there doing nothing. NVidia has implemented their pipeline so that shader operations are dispatched to a pool of execution units so that in lower precision modes, the extra units are utilized to do additional work.

What this means is that on older games that use the 32-bit integer pipeline, the NV30 is likely to be faster because it gets a speed boost from dropping back precision.

Remember what 3dfx said back in the days of 16-bit vs 32-bit? "If your 32-bit runs the same speed as your 16-bit, then you have a poor 16-bit implementation."

The correct way to look at this is: NV30 runs 128-bit at "full speed" (probably most 128-bit ops take 1 clock cycle). But runs 64-bit at up to double speed (2 ops per cycle) and perhaps even 32-bit at up to 4x speed (4 ops per cycle).

OpenGL guy · Nov 21, 2002

DemoCoder said:
The R300 has no specific optimizations for lower precision modes, hence you get no benefit from going to lower precision modes. In 32-bit mode, there are functional units and transistors on the R300 just sitting there doing nothing. NVidia has implemented their pipeline so that shader operations are dispatched to a pool of execution units so that in lower precision modes, the extra units are utilized to do additional work.

While the other execution units of the NV30 stay unused. You can't have everything. In other words, if you are using floating point, the integer pipelines are unused, if you are using integer, the floating point units are unused.

DemoCoder said:
The correct way to look at this is: NV30 runs 128-bit at "full speed" (probably most 128-bit ops take 1 clock cycle). But runs 64-bit at up to double speed (2 ops per cycle) and perhaps even 32-bit at up to 4x speed (4 ops per cycle).

But there's only one texture unit per pixel, so what's the point? Old games use textures, not shaders.

KimB · Nov 21, 2002

DemoCoder said:
The correct way to look at this is: NV30 runs 128-bit at "full speed" (probably most 128-bit ops take 1 clock cycle). But runs 64-bit at up to double speed (2 ops per cycle) and perhaps even 32-bit at up to 4x speed (4 ops per cycle).

I don't think it's possible to run 32-bit at 4x speed, since it's using the integer pipeline. It seems logical to assume that all 32-bit integer data will be switched to floating-point format for moving it through the pipeline (since FP16 has a 10-bit mantissa, no accuracy will be lost).
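The mantissa claim can be checked directly for storage round trips. A quick sketch, assuming the hardware FP16 format matches IEEE half precision (1 sign, 5 exponent, 10 mantissa bits) and that 8-bit color is normalized to 0..1 by dividing by 255; it says nothing about error accumulated over a chain of arithmetic:

```python
import struct

def to_fp16(value):
    """Round a Python float to the nearest IEEE half-precision value."""
    return struct.unpack("<e", struct.pack("<e", value))[0]

def survives_fp16(level):
    """True if an 8-bit color level comes back unchanged after FP16 storage."""
    normalized = level / 255.0
    return round(to_fp16(normalized) * 255.0) == level

# Try every possible 8-bit channel value.
lossless = all(survives_fp16(level) for level in range(256))
print("All 256 levels survive FP16:", lossless)
```

The worst-case rounding error is about 255 / 2048, or roughly 0.12 of one 8-bit step, which is why every level rounds back to itself.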

The same is true with the Radeon 9700, though it seems pretty certain that all calculations are done at the full 24-bit precision, internally (This doesn't mean there will be any benefit in the output...it all depends on how the math is done). The main issue here is that ATI didn't implement complete 32-bit precision, meaning that they couldn't get the benefit of dividing that 32 bits into two 16-bit floating point pipelines (or vice versa).

Regardless, the truth is that we still don't actually know whether the NV30 will be the same speed in 16-bit ops and half the speed in 32-bit ops, or the same speed in 32-bit ops and twice the speed in 16-bit ops. Given the transistor count, the second definitely seems plausible.

KimB · Nov 21, 2002

OpenGL guy said:
While the other execution units of the NV30 stay unused. You can't have everything. In other words, if you are using floating point, the integer pipelines are unused, if you are using integer, the floating point units are unused.

I'd be highly surprised if this were the case. Since the floating-point pipeline has the precision to support the 8-bit integer format, it seems only natural that this would be used.

This doesn't mean that there would be any noticeable benefit for the final output (The only quote that I've seen related to this didn't seem completely clear on the subject. That is, the nVidia representative could just have been referring to the external framebuffer storage format).

KimB · Nov 21, 2002

Oh, and one final thing. Just because it (possibly) can calculate twice as many ops on 32-bit data per clock as the Radeon 9700, this doesn't mean that there will be any benefit for any current shader. That is, the shader would have to have quite a few more instructions per pixel than textures per pixel. Given the limited number of instructions available for DX8 shaders, it seems only natural that this essentially never happened. Doesn't mean it's impossible...just unlikely.
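A toy model makes the point concrete. This is only a sketch, assuming arithmetic and texture fetches overlap perfectly, one texture fetch per clock per pipeline, and made-up shader sizes; none of the numbers are measurements:

```python
import math

def clocks_per_pixel(alu_ops, tex_fetches, alu_per_clock=1, tex_per_clock=1):
    """Clocks per pixel when the slower of ALU and texture work sets the pace."""
    alu_clocks = math.ceil(alu_ops / alu_per_clock)
    tex_clocks = math.ceil(tex_fetches / tex_per_clock)
    return max(alu_clocks, tex_clocks)

# Hypothetical DX8-style shader: about as many ops as texture fetches.
short_1x = clocks_per_pixel(alu_ops=4, tex_fetches=4, alu_per_clock=1)
short_2x = clocks_per_pixel(alu_ops=4, tex_fetches=4, alu_per_clock=2)

# Hypothetical longer shader: many ops per texture fetch.
long_1x = clocks_per_pixel(alu_ops=16, tex_fetches=2, alu_per_clock=1)
long_2x = clocks_per_pixel(alu_ops=16, tex_fetches=2, alu_per_clock=2)

print("short shader:", short_1x, "->", short_2x, "clocks")
print("long shader: ", long_1x, "->", long_2x, "clocks")
```

In this model, doubling the ALU rate does nothing for the short shader because texturing is still the limit, and only the arithmetic-heavy one gets faster.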
