A quick question on the R300?

Fuz

The R300 has 8 Rendering pipelines and 1 Texture unit per pipeline.
My question is: if everything else remained the same, but the texture units per pipe were increased to 2, what sort of advantage would that have? In what situations would we see the biggest increase in performance, and is there any situation where there would be no gain?

I have a basic idea of the advantages, but it would be good to hear from the more knowledgeable.

Any feedback would be appreciated. Thanks

Fuz
 
The obvious answer is:

You could see an advantage where an application is
a.) using multi-texturing
b.) currently fillrate limited (not bandwidth limited)

I haven't done fillrate testing on the 9700 yet so I cannot tell where those points are, but I suspect that it can easily happen when DXTC textures are used.

On the other hand, if using 32-bit uncompressed textures, it might become bandwidth limited, so it might not be an advantage.

I think UT2003 uses texture compression (that's why the GF4 takes such a large hit with aniso), and it uses many texture layers, so it's very likely that its score would improve.

Also, because MSAA has no fillrate hit but requires higher bandwidth, it will benefit less. (The difference between AA on/off would be larger.)

This is of course assuming they don't switch to DDRII, in which case having two TMUs will likely be an improvement across the board.
 
Thanks Hyp-X.

So, if you are fillrate limited, then adding the second texture unit would help, but because at the moment in most situations we are bandwidth limited, the second texture unit won't be missed so much.

Also because MSAA has no fillrate hit but require higher bandwidth it will benefit less.

Adding the second texture unit will have no effect on the AA performance then.

This of course assuming they don't switch to DDRII in which case having two TMU's will likely be an improvement across the board.

I wonder how much of an overall improvement this would make, because if the rumours are to be believed then we can take a rough guess at the performance of upcoming hardware.

Fuz
 
Also, from my understanding, a second TMU is useful primarily in DX7-esque coding that uses standard look-up texturing. As pixel shading and procedural texturing become more common in the future (the distant future, it seems), the computational power (ops per cycle) of the pipeline will be more important than the number of TMUs per pipeline.

I suppose it is possible, at some point, that hardware might revert back to a shared TMU between multiple pipelines.

How was that for someone talking out their ass? :)
 
I didn't say it is bandwidth limited.

One can't even say that - it may be different from game to game.
Even more likely, some parts of a game scene are fillrate limited while others are bandwidth limited.

The biggest bandwidth hog in normal rendering is:
reading the Z-buffer, writing the Z-buffer, writing the color buffer.

Say you have an R9700 non-Pro (300/300) just for the ease of calculation.
I assume Z is 8 bit/pixel avg. (because of compression).

Pixel cost (w/o texturing): 8+8+32 bit = 48 bit
Bandwidth available per pixel per clock: 256*2/8 bit = 64 bit (256-bit bus, DDR, 8 pipelines)

When doing single texturing:
This leaves 64-48 = 16 bit per pixel free.
It's most likely enough for DXTC, but it will be bandwidth limited on uncompressed textures.

When doing 2x texturing:
This leaves 2*64-48 = 80 (2*40) bit per pixel free.
It's more than enough for DXTC even with aniso. It might even be enough with uncompressed textures (w/o aniso), especially if one of them is a low-resolution one (e.g. a lightmap).

When doing 3x texturing:
This leaves 3*64-48 = 144 (3*48) bit per pixel free.
Not much difference here.

You can see that adding a second TMU will most certainly help 2x texturing but it won't likely be a 2x speed increase, except in special cases.
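
To illustrate why the gain is usually less than 2x, here is a minimal sketch, not a measurement. It assumes a pipeline spends on each pixel whichever is longer: the texturing clocks or the clocks needed to move the data. The 48-bit framebuffer cost and the 64 bits per clock come from the figures above; the per-layer texture traffic (8 bits for compressed, 32 bits for uncompressed) and the function names are hypothetical values chosen only for illustration.

```python
import math

FRAMEBUFFER_BITS = 48   # Z read + Z write + colour write (8 + 8 + 32), from the post
BITS_PER_CLOCK = 64     # per pipeline per clock on the 300/300 card, from the post


def clocks_per_pixel(layers, tmus, texture_bits_per_layer):
    """Clocks one pipeline spends on a pixel: the slower of texturing and memory traffic."""
    fill_clocks = math.ceil(layers / tmus)
    bits_needed = FRAMEBUFFER_BITS + layers * texture_bits_per_layer
    bandwidth_clocks = bits_needed / BITS_PER_CLOCK
    return max(fill_clocks, bandwidth_clocks)


def second_tmu_speedup(layers, texture_bits_per_layer):
    """Speed ratio of a 2-TMU pipeline over a 1-TMU pipeline in this simple model."""
    one_tmu = clocks_per_pixel(layers, 1, texture_bits_per_layer)
    two_tmus = clocks_per_pixel(layers, 2, texture_bits_per_layer)
    return one_tmu / two_tmus


# Hypothetical per-layer texture traffic: 8 bits (compressed) vs 32 bits (uncompressed).
for label, bits in (("compressed", 8), ("uncompressed", 32)):
    print(f"2 layers, {label}: {second_tmu_speedup(2, bits):.2f}x")
```

With these assumed numbers the compressed case is the "special case" that reaches 2x, while the uncompressed case becomes bandwidth limited as soon as the second TMU is added and gains far less.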

But let's consider DDRII @ 450 + 2xTMU

When doing single or 2x texturing:
This leaves 96-48 = 48 (2*24) bit per pixel free.
It's plenty for 1x texturing. It's likely enough for DXTC even with some aniso. While it's not enough for 32-bit uncompressed textures, it might not cause a large slowdown in many cases.

When doing 3x or 4x texturing:
This leaves 2*96-48 = 144 (3*48 = 4*36) bit per pixel free.
It will mostly be enough for high quality textures as well.
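
The per-pixel budgets in this post can be reproduced with a few lines of arithmetic. This is only a sketch of the same back-of-the-envelope calculation: it assumes the core stays at 300 MHz in the DDRII case (which is what the 96-bit figure implies), ignores bandwidth efficiency, and uses made-up function names.

```python
import math

BUS_BITS = 256          # memory bus width, from the post
PIPELINES = 8
FRAMEBUFFER_BITS = 48   # 8 (Z read) + 8 (Z write) + 32 (colour write)


def bits_per_clock(core_mhz, mem_mhz):
    """Bandwidth per pipeline per core clock, double data rate assumed."""
    return BUS_BITS * 2 / PIPELINES * mem_mhz / core_mhz


def texture_budget(layers, tmus, core_mhz, mem_mhz):
    """Bits left for texture fetches per pixel, in total and per texture layer."""
    clocks = math.ceil(layers / tmus)   # clocks the pipeline needs for all layers
    free = clocks * bits_per_clock(core_mhz, mem_mhz) - FRAMEBUFFER_BITS
    return free, free / layers


for name, tmus, core, mem in (("300/300, 1 TMU", 1, 300, 300),
                              ("300/450, 2 TMUs", 2, 300, 450)):
    for layers in range(1, 5):
        free, per_layer = texture_budget(layers, tmus, core, mem)
        print(f"{name}, {layers}x texturing: {free:.0f} bits free ({per_layer:.0f} per layer)")
```

The first configuration gives 16, 80 and 144 bits free for 1x, 2x and 3x texturing; the second gives 48 bits for 1x or 2x and 144 bits for 3x or 4x, matching the numbers above.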


And before someone else points out, I did not calculate with bandwidth efficiency. :)
 
Does ATI have some sort of pipeline breakdown? We might have talked about this before, I don't remember?

I mean, instead of doing 8x1, can they switch the engine over to 4x2? The reason I ask is because an 8x1 design can be VERY inefficient when dealing with small polies.
 
Dave said:
Does ATI have some sort of pipeline breakdown? We might have talked about this before, I don't remember?

I mean, instead of doing 8x1, can they switch the engine over to 4x2? The reason I ask is because an 8x1 design can be VERY inefficient when dealing with small polies.

There are rumors about a possible 4 pipes config in the upcoming R9500 but nobody knows...

How do you mean 'small'? Small number of p. or what? :-?
 
Well, it isn't too difficult to make an engine that is flexible in its config. It can adjust the config based on what the driver says is ideal. So if a game is using only a single texture it will flip to 8x1 mode, whereas if it uses 2+ textures it will go to 4x2 mode. It is a matter of controlling the ALUs and including the functionality to switch (it really doesn't change that much).

As for the polygon thing, if a polygon is say 2 pixels high and 1 pixel wide, only 3 or 4 pixels are used for that polygon. In such a case, you are wasting 4 of your pixel pipes because that triangle is being rendered and the other pipes are sitting idle.
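
A rough sketch of that idle-pipe effect, under the simplifying assumption that every pipe is locked to the same triangle until it is finished (real hardware may not work this way). The function name is made up for illustration.

```python
import math


def pipe_utilization(pixels, pipes):
    """Fraction of pipe slots doing useful work while one triangle is drawn."""
    clocks = math.ceil(pixels / pipes)   # whole clocks needed to cover the triangle
    return pixels / (clocks * pipes)


for pixels in (3, 4, 100):
    for pipes in (8, 4):
        print(f"{pixels:3d}-pixel triangle on {pipes} pipes: "
              f"{pipe_utilization(pixels, pipes):.0%} busy")
```

Under this assumption a 4-pixel triangle keeps only half of an 8-pipe design busy but all of a 4-pipe design, while large triangles keep either design almost fully occupied.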
 
Well, don't the texturing pipes use FP representation and calculation, not integer? I'm probably mistaken somehow, but doesn't ALU imply integer?
 
Yeah, I guess so. Guess we just always did it that way because it was a term we used so much, and when we did, we always knew what we were talking about (be it floating point, integer, etc.) based on the context. And ALU just rolls off the tongue better than FPU.
 
Yeah, I thought it was something like that. ALU is nicer to say; it sounds like less of a swear, as well. I suppose that's just one of those differences between the industry and academia.
 
Assuming the following:

1. The Radeon 9700 is actually capable of filtering four bilinear samples per pixel pipeline per clock, or two trilinear samples.

2. The move to two textures per pixel pipeline per clock does not increase the processing power, just allows the same power to be assigned to two textures instead of just one.

In this situation, the primary benefit would be when anisotropic filtering is disabled. Anisotropic filtering could also receive some benefit, provided there was additional hardware for computing the degree of anisotropy for the second texture per pixel pipeline per clock. If the additional hardware for the second aniso degree calculation was not there, then the second texture unit would do little more than produce a larger performance hit for anisotropic filtering (aniso performance would be no worse than 8x1, just non-aniso performance would be significantly higher).

Of course, this is assuming that the test situation is fillrate-limited, which is not going to be too easy today. We may have to wait until DOOM3 for a truly fillrate-limited test.
 
Dave said:
I mean, instead of doing 8x1, can they switch the engine over to 4x2? The reason I ask is because an 8x1 design can be VERY inefficient when dealing with small polies.
I talked with an ATI hw engineer at GraphicsHardware2002 and he told me they addressed this issue. Obviously he couldn't share any details with me.

ciao,
Marco
 
I don't see any reason why 8x1 must be inefficient with small polies. It just depends on how independent the different pipelines are. The hardware could be made so that four must work on the same tri, or that two must work on the same tri, and so on. I see no absolute necessity for the hardware to lock an 8x1 pipeline into calculating on just one tri at a time.

However, there may be a limitation that they all work with the same fragment program at once.
 
Chalnoth said:
Assuming the following:

1. The Radeon 9700 is actually capable of filtering four bilinear samples per pixel pipeline per clock, or two trilinear samples.

Are you sure this is the case? Ever since the original Radeon, 1 TMU could handle 1 bilinear sample. I remember this from their diagrams about what their 3-TMU pipe could do (it could handle either 3 bilinear samples, or 1 trilinear and 1 bilinear). I don't see why they would regress on this. Plus, if trilinear samples took up 2 pipes instead of one, I would think the framerate hit for trilinear vs. bilinear filtering would be much higher. Do you have a URL where you got this info?
 
Actually, I got it from these message boards. I'm reasonably certain it was OpenGLguy.

So, it isn't certain, but I find it rather likely (particularly given the rather small anisotropic performance hit).
 