DX9 and multiple TMUs

CoolAsAMoose

Newcomer
I have seen some concern over the omission of a second TMU per pipeline in the R300. Maybe I am mistaken, but isn't the whole idea of the pixel shader that you "program" effects that earlier HW needed multiple textures for? If so, it makes no sense at all to talk about dual TMUs per pipeline in a DX9 chip (if performance on old software is not a big concern, that is). Am I mistaken?

So what is the next step in pixel parallelism? Besides moving to 16 pipelines, I would say a super-scalar pixel shader capable of executing multiple pixel shader instructions in parallel. Or are pixel shader instructions always cumulative, making parallel execution of them impossible?

Please help enlighten me!
 
You have different instruction types in a pixel shader. The two main groups are arithmetic instructions and texture sampling instructions. Arithmetic instructions work on registers and execute the actual mathematics involved in pixel shading; the texture sampling instructions calculate texture addresses and fetch the actual data through the cache from memory. Now, what people used to know as a TMU has turned into a texture sampling unit which executes the sampling instructions, and the pipeline has turned into a little CPU that does the maths work. And that's where the whole blur starts: how capable is that little CPU, when does it do things in a single clock and when does it require multiple clocks, do the two parts work in parallel or do they wait for each other, what about latency, etc.? The whole thing is a lot more complex than in the old days.
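The split described above can be sketched as a toy cost model (this is an illustration, not any real GPU's scheduling): a program is a mix of texture and arithmetic instructions, and total time depends on whether the sampling unit and the "little CPU" overlap their work.

```python
# Toy model of the two instruction classes: "tex" instructions fetch data,
# "arith" instructions do the math. Cycle costs here are made-up examples.
from dataclasses import dataclass

@dataclass
class Instr:
    kind: str      # "tex" or "arith"
    cycles: int    # hypothetical cost; real costs vary per implementation

def shader_cycles(program, overlap=False):
    """Total cycles for a program, optionally letting the texture unit and
    the arithmetic unit work in parallel instead of waiting on each other."""
    tex = sum(i.cycles for i in program if i.kind == "tex")
    arith = sum(i.cycles for i in program if i.kind == "arith")
    return max(tex, arith) if overlap else tex + arith

prog = [Instr("tex", 1), Instr("arith", 1), Instr("tex", 2), Instr("arith", 1)]
print(shader_cycles(prog))                # serial: 5 cycles
print(shader_cycles(prog, overlap=True))  # overlapped: 3 cycles
```

The gap between the two numbers is exactly the "do the 2 parts work in parallel" question: two chips with identical instruction sets can differ a lot here.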

I am working on an article about the vertex shader which will explain why having 4 vertex shaders does not always mean the same thing, since each vertex shader implementation will be slightly different. The same principles remain true for pixel shaders. There is no fixed definition of a pixel shader implementation, just the functionality it has to support; how efficient it is and how this is done is a completely different matter.

Take a look at the diagram:

fppu.gif


It shows one yellow block, which is equal to a texture unit, plus an address processor and a color processor.

Want to speed things up? Get more of those units, or make the existing units more functional or faster.
 
Kristof. I think an article like that will be helpful to a lot of people. I've been trying to compare R300 to Parhelia (P10 is a little harder) and one thing that is confusing is the definition of texture unit. I.e. what filtering can each unit do and maintain 1 pixel per clock, etc. Maybe your article will also give a few possibilities for why Parhelia's vertex shaders are so slow. It seems to me they optimized for long shader programs, but at the expense of short ones?
 
It only makes sense to add the capability to compute with a second (or more...) texture per clock if that logic takes up a small portion of the total pixel shader unit.

That is, it sounds like a great idea if that logic increases the pixel shader unit's size by, say, 20%.

If, however, it would prevent the addition of new pixel shader units, then it might not be such a good idea. I have a suspicion that each additional texture per clock would add closer to 80% to a pixel shader unit's size.

Then the problem comes in the form of: When is it better to have fewer texture units per pipe?

Well, consider 8x1 vs. a 4x2 pipeline. Here's the breakdown:
(situation): (8x1 perf) : (4x2 perf)

1 texture : 60 : 30
2 textures: 30 : 30
3 textures: 20 : 15
4 textures: 15 : 15
5 textures: 12 : 10
6 textures: 10 : 10

Anyway, as you can see, if lots of textures are going to be used, the 4x2 pipeline wouldn't be much worse. But, if few textures are in use, it could be significantly worse.
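The table above follows from a simple rule: a pipe can apply at most its TMU count in textures per pass, so throughput scales as pipes divided by the number of passes. A small sketch (Chalnoth's numbers are just these pixels-per-clock values multiplied by an arbitrary 7.5 scale factor):

```python
# Idealized throughput model: ignores bandwidth, cache, and latency effects.
from math import ceil

def pixels_per_clock(pipes, tmus_per_pipe, textures):
    # Each pass applies up to tmus_per_pipe textures; odd counts waste a TMU.
    passes = max(1, ceil(textures / tmus_per_pipe))
    return pipes / passes

for n in range(1, 7):
    print(n, pixels_per_clock(8, 1, n), pixels_per_clock(4, 2, n))
```

This makes the even/odd effect explicit: at 2, 4, 6 textures the two configurations tie, while at 1, 3, 5 the 8x1 pulls ahead because the 4x2's second TMU sits idle for one pass.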

But why use a 4x2 pipeline? Well, I don't expect the NV30 to have a 4x2 pipeline. It will probably have an 8x1 pipeline just like the R300 (Given JC's comments that too many video cards today focus too much on multitexturing performance...). Having fewer texture units per pipeline is certainly much more flexible, giving game developers more freedom and allowing for more efficiency in more circumstances.

But, the only good reason that anybody should use more than one texture unit per pipeline is if it could be done without significantly increasing the number of transistors. Also, it would be pointless at this time to bother with increasing the number of TMUs if they are disabled when anisotropic is enabled, since it should be absolutely pointless to run any future high-end graphics card without anisotropic filtering enabled.
 
Chalnoth said:
But, the only good reason that anybody should use more than one texture unit per pipeline is if it could be done without significantly increasing the number of transistors. Also, it would be pointless at this time to bother with increasing the number of TMUs if they are disabled when anisotropic is enabled, since it should be absolutely pointless to run any future high-end graphics card without anisotropic filtering enabled.

I think this is the crucial point. My guess is that the texturing capabilities of modern GPUs are limited by the number of read ports in the texture cache. Adding extra ports to a chunk of SRAM increases logic complexity *and* impacts cycle time (to the point where it's not worth doing).

Cheers
Gubbi
 
Chalnoth said:
Anyway, as you can see, if lots of textures are going to be used, the 4x2 pipeline wouldn't be much worse. But, if few textures are in use, it could be significantly worse.

That's not quite correct. A 4x2 pipeline is exactly as good as an 8x1 when an even number of textures is used, but performs worse with an odd number of textures.
 
My point was that I doubt many games have a constant number of textures for every surface. The number of textures is likely to vary, so if the game uses relatively few, 4x2 is worse (since some surfaces are almost certainly going to have 1, 3, etc.). If the game uses many, then 4x2 is almost as good.

Update: Games like DOOM3, though they may use lots of textures, will use few to none on some passes, giving a significant advantage to an 8x1 pipeline.
 
Gubbi,

There's always the option of clocking the cache faster rather than adding tons of ports to it. The Alpha EV6 processor for example has at least one cache clocked at double core clockrate (I think it's the L1 icache), for example...
 
The reason for my original question (starting this thread) was that I was under the impression that many (most?) effects that today are accomplished using multiple textures will instead be procedurally generated using shader programs. Of course I see a lot of cases, like some cases of bump mapping, where dual/multiple textures are the best solution. But aren't most uses of dual texturing simple light maps, where shader programs could create a better result?

So my point is: won't future applications (running on future hardware) use fewer texels per pixel than today's apps?

I do realize that performance in today's games also counts..........
 
I seriously doubt that the number of textures per pixel, on average, will decrease, in most situations.

Consider DOOM3 for just a moment. The four textures per pass in the GeForce3/4 cards isn't enough for many surfaces in this upcoming game. The same holds true for UT2k3. It has been stated by both Tim Sweeney and John Carmack that they can do their operations in a single pass on the Radeon 8500, while it takes two or three on a GeForce3/4.

And while it is true, to some extent, that many of these shaders might even use fewer textures with DX9 PS, as time goes on, DX9 games will use even more textures, just because they can.

Additionally, I think much of the reduction in the number of passes with PS 1.4 comes from the fact that you can access the same texture more than once (if I remember correctly...).

While it is true that light maps may be going out, if you're going to get rid of light maps entirely, you have to replace them with something else for most geometry (such as a bump map) for proper lighting.
 
I can only speak for software rendering, but procedural shaders like you seem to be talking about can rarely replace painted texture maps. They are cool to have and can be used well for things like dirt, water, or similarly simple surfaces, but ultimately procedural shaders are only a small part of what I imagine a realtime shading engine could/should be used for.

More importantly, you want the ability to use a number of texture maps for different "channels", e.g. to define specular, diffuse, bump, transparency or translucency values for a surface. Each of these would be a separate pixel shader program in current realtime 3D technology, I believe. To accomplish a really realistic-looking surface (e.g. a character's head) you can need four or more different texture map layers. I think in such a situation performance can only improve by having multiple TMUs per pipeline at your disposal, or wouldn't that be the case? Having pixel shaders doesn't reduce the need for applying more than one texture map; IMHO quite the contrary, though it does enable you to do a lot funkier stuff with those texture maps... ;)
 
Procedural shaders used to simulate natural stone, wood, dirt, etc. are typically based on fractal algorithms with some sort of noise basis function. In hardware, the noise functions will be implemented as 1D, 2D, or 3D textures, so what the non-interactive CG people would call purely procedural textures will still involve many texture lookups when implemented in hardware. A standard procedural marble texture will be implemented as a lookup into a 1D texture modified by some noise function, with one 3D texture lookup per octave of noise summed and 3-4 octaves required to get a good approximation. That's 4-5 texture lookups just to get the color. Additional texture lookups would be necessary if you want a bump mapped or environment mapped surface. Even with just procedural textures, we're going to be seeing more--not fewer--textures used per pixel with this next generation of hardware.
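The lookup count in the marble example can be made concrete with a sketch (Python standing in for a pixel shader; `noise3d` and `ramp` are stubs representing the 3D noise texture and the 1D color-ramp texture the hardware would sample):

```python
import math

def marble(p, noise3d, ramp, octaves=4):
    """Classic marble: sine stripes perturbed by summed ('turbulence') noise.
    In hardware each noise3d() call is a 3D texture lookup and ramp() a 1D
    lookup, so the color alone costs octaves + 1 lookups per pixel."""
    x, y, z = p
    turb = sum(abs(noise3d(x * 2**o, y * 2**o, z * 2**o)) / 2**o
               for o in range(octaves))
    return ramp(math.sin(x + turb))

# Stub "texture" lookups so the sketch runs; count how many are issued.
calls = {"count": 0}
def noise3d(x, y, z):
    calls["count"] += 1
    return math.sin(x * 12.9898 + y * 78.233 + z * 37.719)
def ramp(t):
    calls["count"] += 1
    return (t + 1) / 2    # map [-1, 1] to a color-ramp coordinate

marble((0.3, 0.7, 0.1), noise3d, ramp)
print(calls["count"])   # 5: four noise octaves + one color-ramp lookup
```

Five lookups for the base color alone, before any bump or environment map, matches Grue's 4-5 estimate above.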

--Grue
 
Hrm, one little tidbit from NVIDIA's CineFX whitepaper: NV30 has 16 texture units which can be reused as many times as needed. Assuming 8 pixel pipelines, that would indicate an 8x2, wouldn't it?
 
Additionally, you save a lot of memory space with procedural textures, so that can, and probably will in part contribute to even higher res textures in the rest of the scene
 
Hrm, one little tidbit from NVIDIA's CineFX whitepaper: NV30 has 16 texture units which can be reused as many times as needed. Assuming 8 pixel pipelines, that would indicate an 8x2, wouldn't it?

If it indeed states that it has 16 texture units, then yes it would indicate an 8x2 architecture. Where did you see this though? If you're referring to the chart on page 8, that's simply the DX9 16 texture reference.
 
I'd like to know how the 9700 does trilinear. I can only assume that it can do it in a single clock with one TMU. Otherwise, I don't see the benefit of having only one TMU per pipe.
 

slide 15


Agreed. Looks definitely like a 8x2 architecture. I wonder how they plan on keeping that fed given David Kirk's comments about a 128bit memory architecture. Will DDR-2 be sufficient? Further things to note:

1) Arbitrary number of dependent texture reads. VERY nice.
2) Constants and operations use the same memory, i.e. if you have 512 constants, your max program size is reduced to 512 ops.
 
ATi is proud that Radeon 9700 is the first mainstream graphics chip with eight parallel pixel rendering pipelines. This is twice the amount of pipelines found in current high-end graphics chips. At a clock of 325 MHz, the eight pipelines are able to supply a fill rate of 8 * 325 = 2,600 Mpixels/s. Each pixel rendering pipeline has one texture unit, so the multi texturing fill rate is the same as the single texturing fill rate above. It might look as if one texture unit per pipeline is very little, but if you calculate the memory bandwidth requirement of eight parallel pipes with one texture unit doing a trilinear 32-bit color texture lookup, you will understand why two texture units wouldn't have made an awful lot of sense: 32 bit * 8 (trilinear filtering requires 8 texels to be read) * 8 (eight pipelines) = 2048 bit. 2048 bit would have to be read per clock, but 'only' 512 bit per clock are provided by the 256 bit-wide DDR memory interface of Radeon 9700. Bilinear filtering mode would still require 1024 bit per clock. Two texture units per pipe could never be fed by the memory interface. This is why it wouldn't have made sense to add those units.
http://www17.tomshardware.com/graphic/02q3/020718/radeon9700-07.html

Can anyone comment on this part of Tom's article, and give me the consequences for the 8x2 architecture of the NV30?
 