DX9 and multiple TMUs

For a simplistic example, an IMR with dual TMUs (and enough bandwidth to support it) could conceivably do true trilinear "for free" compared to a deferred renderer with one TMU.

Apart from architectural differences, I don't see the R300 "hurting" much with any kind of texture filtering so far. Unless you mean something other than an 8x1 vs. 8x2 case.
 
MDolenc,

I'm not sure what you're trying to say.

First you appeared to imply that R-300 wasn't PS 2.0 compliant because it doesn't have 16 texture accesses per clock. (That's what Chalnoth was responding to.)

Now you are saying that you know R-300 can do 16 texture accesses in one pass, just that an 8x2 design could do it in one clock. (Given enough bandwidth, of course.)

Are you saying that PS 2.0 has some requirement for texture accesses per clock?
 
Apart from architectural differences, I don't see the R300 "hurting" much with any kind of texture filtering so far. Unless you mean something other than an 8x1 vs. 8x2 case.

Aniso doesn't require texture reads from different mip-maps for one texture. Thus, more TMUs don't necessarily help aniso. (It depends on how many texels each TMU can fetch per clock.)

ATI did state that TRILINEAR (at least combined with aniso) would take some performance hit. They said "minimal", but whatever that means, it's more than zero. ;) We also don't know if ATI's "trilinear" will turn out to be "true" trilinear, or if they will use some auto-generation of mip-maps. I was speaking of a hypothetical "true" trilinear, where 2 different mip-maps need to be read for every filtered result.
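To put some numbers on that hypothetical "true" trilinear: a quick sketch of the texel-fetch arithmetic behind the dual-TMU example earlier in the thread (the helper names are made up for illustration, not any real API):

```python
def bilinear_fetches():
    # Bilinear: 4 texels from a single mip level.
    return 4

def trilinear_fetches():
    # "True" trilinear: a bilinear sample from each of the two
    # nearest mip levels (8 texels total), then a lerp between them.
    return 2 * bilinear_fetches()

# A TMU that fetches one bilinear sample (4 texels) per clock needs
# 2 clocks per trilinear result; a dual-TMU pipe with enough
# bandwidth could do it in 1, i.e. trilinear "for free".
clocks_single_tmu = trilinear_fetches() // 4   # -> 2
clocks_dual_tmu = trilinear_fetches() // 8     # -> 1
```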
 
I was actually pointing at those differences in implementation (whether on the software or the hardware side).

I personally can live with a fast performing anisotropic algorithm combined with "just" bilinear, but that's a completely different story.

If a vendor is supposed to implement one more TMU per pipe for cases (like the trilinear example you posted) where workarounds can achieve very good approximations, I don't see much use for it.

Of course more is always better, but there are other factors to get balanced too.
 
Joe DeFuria said:
I was speaking of a hypothetical "true" trilinear, where 2 different mip-maps need to be read for every filtered result.
Some TMUs can only read from one mipmap (Parhelia, Kyro [with "fast trilinear"]) while others can do trilinear filtering in one clock (GeForce, Radeon). Of course trilinear is still slower than bilinear on these because of slightly higher bandwidth/cache requirements. But fill rate is identical.
 
What I'm talking about is performance. Radeon 9700 can't output one fully loaded ps.2.0 pixel per clock. Every other pixel shader capable chip can output at least one fully loaded pixel of its pixel shader generation (GeForce 3 & 4 can output one ps.1.x pixel, Radeon 8500 can output one ps.1.4 pixel). What I'm saying is that it would be nice to have hardware that can do one ps.2.0 pixel per clock (and that's what you need if you want to make ps.2.0 really useful).
 
MDolenc said:
Radeon 9700 can't output one fully loaded ps.2.0 pixel per clock.

How do you know this? I would presume the same thing to be true [per pipeline] in real world performance, but we don't really know how many clocks it'll take the Radeon 9700 to apply, say, 16 textures [in one pass].
 
What I'm talking about is performance. Radeon 9700 can't output one fully loaded ps.2.0 pixel per clock.

There is MUCH more to performance than how many texels can be read per clock, which is what the TMUs do.

The GPU has to "do things" with the texture data...perform ops, possibly write to some render target...read back the results, etc.

In other words, even if the R-300 can not read enough texture data in one clock to produce a so-called "fully loaded PS 2.0 pixel", that doesn't mean that's where the bottleneck lies. To put it another way, if some other part has an 8x2 pipeline, that does not mean it can "produce" a PS 2.0 pixel per clock. All it means is that it can READ IN enough data to produce such a pixel. There is a big difference, and ultimate performance cannot be construed simply by looking at texture read ability.
 
Joe DeFuria said:
For a simplistic example, an IMR with dual TMUs (and enough bandwidth to support it) could conceivably do true trilinear "for free" compared to a deferred renderer with one TMU.

GeForce 1 with 1 TMU/pipe could also do trilinear "for free": not all TMUs are equal. On a tiler, you would get a much better return on investment by increasing the number of pipelines rather than adding extra TMUs.
 
MDolenc said:
What I'm talking about is performance. Radeon 9700 can't output one fully loaded ps.2.0 pixel per clock. Every other pixel shader capable chip can output at least one fully loaded pixel of its pixel shader generation (GeForce 3 & 4 can output one ps.1.x pixel, Radeon 8500 can output one ps.1.4 pixel). What I'm saying is that it would be nice to have hardware that can do one ps.2.0 pixel per clock (and that's what you need if you want to make ps.2.0 really useful).

I'd be rather surprised if the NV30 could output one 1024-instruction pixel shader in a single clock, 16 textures or no.
 
LeStoffer said:
How do you know this? I would presume the same thing to be true [per pipeline] in real world performance, but we don't really know how many clocks it'll take the Radeon 9700 to apply, say, 16 textures [in one pass].
Pixel shader 2.0 REQUIRES 16 different textures and up to 32 texture samples. You CANNOT read TWO DIFFERENT TEXTURES with one TMU.
You have 8 t# (texture coordinate) registers, 16 s# (texture) registers, and 32 texture address instructions. So you can load 32 different texels from 16 different textures.
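The per-pass limits being cited can be collected into a rough sketch (the numbers come from the discussion above; the check function itself is a hypothetical illustration, not part of any real API):

```python
# ps.2.0 per-pass resource limits as cited in this thread.
PS20_LIMITS = {
    "texture_coordinate_registers": 8,    # t0-t7
    "sampler_registers": 16,              # s0-s15, one per texture
    "texture_address_instructions": 32,   # texture samples per pass
    "arithmetic_instructions": 64,
}

def fits_in_one_ps20_pass(samples, arithmetic):
    """Rough check: does a shader fit the ps.2.0 per-pass limits?"""
    return (samples <= PS20_LIMITS["texture_address_instructions"]
            and arithmetic <= PS20_LIMITS["arithmetic_instructions"])
```

So a shader using 32 samples and 64 arithmetic instructions is exactly at the spec's ceiling for one pass; anything beyond that would have to be split across passes.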

LeStoffer said:
There is MUCH more to performance than how many texels can be read per clock, which is what the TMUs do.
Of course there is. R300 still CAN do 64 arithmetic instructions in one clock, which is also required by ps.2.0. You can write a ps.2.0 shader that uses 8 textures, 16 texture samples and 64 arithmetic instructions, and R300 will run it at one pixel per clock, BUT that's below the ps.2.0 spec! Now when you use 16 textures and 32 samples there is no way to do this in one clock on R300, so when you do this R300 might as well do 128 arithmetic instructions, and that is over the ps.2.0 spec. So R300 can provide 16 address instructions and 64 arithmetic instructions in one clock, or 32 address instructions and 128 arithmetic instructions in two clocks (160 instructions). You need to be able to do 32 address instructions in one pass to run ps.2.0, and on R300 this takes two clocks.
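MDolenc's arithmetic here can be laid out explicitly. This is a sketch of the model in his post (16 address and 64 arithmetic instructions retired per clock across the chip), which is his claim, not confirmed hardware data:

```python
import math

# Per-clock throughput claimed for R300 in the post above (unconfirmed).
ADDR_PER_CLOCK = 16
ARITH_PER_CLOCK = 64

def r300_clocks(address_instructions, arithmetic_instructions):
    """Clocks needed for one pixel under the post's throughput model."""
    return max(math.ceil(address_instructions / ADDR_PER_CLOCK),
               math.ceil(arithmetic_instructions / ARITH_PER_CLOCK))

# A shader below the ps.2.0 maximum: one pixel per clock.
print(r300_clocks(16, 64))    # -> 1
# The full ps.2.0 maximum of 32 samples: two clocks, which also
# leaves room for 128 arithmetic instructions in the same time.
print(r300_clocks(32, 128))   # -> 2
```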

Chalnoth said:
I'd be rather surprised if the NV30 could output one 1024-instruction pixel shader in a single clock, 16 textures or no.
That's not what I'm saying! What I'm saying is that any 8x2 configuration which can do 8 arithmetic instructions per pipeline (96 instructions, which is less than 160 instructions on R300) could be SIGNIFICANTLY FASTER in ps.2.0.
 
MDolenc said:
Of course there is. R300 still CAN do 64 arithmetic instructions in one clock, which is also required by ps.2.0. You can write a ps.2.0 shader that uses 8 textures, 16 texture samples and 64 arithmetic instructions, and R300 will run it at one pixel per clock, BUT that's below the ps.2.0 spec! Now when you use 16 textures and 32 samples there is no way to do this in one clock on R300, so when you do this R300 might as well do 128 arithmetic instructions, and that is over the ps.2.0 spec. So R300 can provide 16 address instructions and 64 arithmetic instructions in one clock, or 32 address instructions and 128 arithmetic instructions in two clocks (160 instructions). You need to be able to do 32 address instructions in one pass to run ps.2.0, and on R300 this takes two clocks.

Errr... you might want to rethink that statement. "64" arithmetic instructions "per clock" would mean that you can, for example, add 64 full IEEE float numbers within one clock... now that's just not realistic, especially if you take into account that all these instructions can be dependent. What you want to say is "per pass"; how many clocks a "pass" takes is a wholly different matter.

Also, the DX specifications say little to nothing about how many clocks a certain operation would take; all they specify is functionality per pass. In other words, DX does not care "how" you actually do it, as long as you can do it "correctly".

K~
 
MDolenc said:
Of course there is. R300 still CAN do 64 arithmetic instructions in one clock, which is also required by ps.2.0. You can write a ps.2.0 shader that uses 8 textures, 16 texture samples and 64 arithmetic instructions, and R300 will run it at one pixel per clock, BUT that's below the ps.2.0 spec! Now when you use 16 textures and 32 samples there is no way to do this in one clock on R300, so when you do this R300 might as well do 128 arithmetic instructions, and that is over the ps.2.0 spec. So R300 can provide 16 address instructions and 64 arithmetic instructions in one clock, or 32 address instructions and 128 arithmetic instructions in two clocks (160 instructions). You need to be able to do 32 address instructions in one pass to run ps.2.0, and on R300 this takes two clocks.
I really think you don't understand the difference between a "pass" and a "clock". The amount of work you can do in one clock cycle has nothing to do with PS 2.0. PS 2.0 specifies what instructions and how many instructions there can be in a PS 2.0 program. How long it takes you to execute such a program is irrelevant to the specification.

Now, if you support PS 2.0, then you have to support programs with up to 64 ALU and 32 address instructions. That would be considered the maximum length of a pass if you had to do multipass rendering. For example, let's say you have a shader program that is 2048 ALU instructions, then you would probably need at least 32 passes in order to compute the output.
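The pass count in that example works out as simple division (64 ALU instructions per pass is the ps.2.0 limit quoted above; this sketch ignores any instructions that would have to be duplicated between passes, so it is a lower bound):

```python
import math

ALU_PER_PASS = 64  # ps.2.0 arithmetic instruction limit per pass

def passes_needed(total_alu_instructions):
    """Minimum number of passes to multipass a long shader."""
    return math.ceil(total_alu_instructions / ALU_PER_PASS)

print(passes_needed(2048))  # -> 32, matching the example above
print(passes_needed(64))    # -> 1, fits in a single pass
```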

Some people have said that the extra passes are slow: This is true, but I really don't think the cost of extra passes is a big deal when you are talking about such long shaders!
 
OpenGL guy said:
Some people have said that the extra passes are slow: This is true, but I really don't think the cost of extra passes is a big deal when you are talking about such long shaders!

If you consider that each pass changes the state, and therefore stalls the pipelines, I would think it could be a very large issue.

In order for multipass to have a small speed hit, this is what I see as necessary:

1. Either the state change stall is much smaller than the time to execute one pass, or:
2. The software manages the multiple passes (no auto-multipass, or some specialized auto-multipass...), so that the number of pipeline stalls is reduced (i.e. runs one pass over a large number of triangles, then the next pass, and so on).
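Point 2 above amounts to reordering the loops so that state changes scale with the number of passes rather than passes times batches. A schematic sketch of the two orderings (all names hypothetical; "state change" is just counted, not performed):

```python
def interleaved_multipass(num_batches, num_passes):
    # Switch shader state for every batch on every pass:
    # one potential stall per (batch, pass) pair.
    changes = 0
    for _batch in range(num_batches):
        for _ps_pass in range(num_passes):
            changes += 1  # bind pass state, then draw this batch
    return changes

def batched_multipass(num_batches, num_passes):
    # Run each pass over ALL batches before switching state:
    # one potential stall per pass, amortized over every batch.
    changes = 0
    for _ps_pass in range(num_passes):
        changes += 1      # bind pass state once
        for _batch in range(num_batches):
            pass          # draw batch with the current state
    return changes

# e.g. 100 batches and 4 passes: 400 potential stalls vs. 4.
print(interleaved_multipass(100, 4), batched_multipass(100, 4))
```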

The question is, is the available PS program length in the R300 long enough to make the hit from worst-case multipass small compared to the time it takes to execute one max-length program? (As a side note, this topic has made me rethink my stance on the need for unlimited-size programs...large enough program sizes might be enough...though the VS really needs to have long or unlimited progs available...). Obviously, the NV30 looks like it will have a much smaller speed hit from multipass than the R300.

And will any HLSL compilers (DX, OpenGL, Cg, RenderMonkey) generate auto-multipass?
 
Chalnoth said:
If you consider that each pass changes the state, and therefore stalls the pipelines, I would think it could be a very large issue.
Not all state changes cause pipeline stalls.
In order for multipass to have a small speed hit, this is what I see as necessary:

1. Either the state change stall is much smaller than the time to execute one pass, or:
This is obvious.
2. The software manages the multiple passes (no auto-multipass, or some specialized auto-multipass...), so that the number of pipeline stalls is reduced (i.e. runs one pass over a large number of triangles, then the next pass, and so on).
You would have to multipass each batch of primitives.
And will any HLSL compilers (DX, OpenGL, Cg, RenderMonkey) generate auto-multipass?
If they don't, they are pretty much worthless.
 
OpenGL guy said:
Not all state changes cause pipeline stalls.

Could you enlighten us, then? What kind of state change wouldn't cause a pipeline stall? Is it realistically possible for a compiler to always, or at least usually, compile multiple passes that generally don't cause stalls?

2. The software manages the multiple passes (no auto-multipass, or some specialized auto-multipass...), so that the number of pipeline stalls is reduced (i.e. runs one pass over a large number of triangles, then the next pass, and so on).
You would have to multipass each batch of primitives.

Might this be hard without HOS?

If they don't, they are pretty much worthless.

I hope all current developers of HLSL's agree with you.
 
Chalnoth said:
OpenGL guy said:
Not all state changes cause pipeline stalls.

Could you enlighten us, then? What kind of state change wouldn't cause a pipeline stall? Is it realistically-possible for a compiler to always, or at least usually, compile multiple passes that generally don't cause stalls?
You want me to give out proprietary information? Sorry, I can't do that. But I can say that I've worked at two different graphics chip manufacturers and both had states that could be changed without pipeline stalls. This doesn't mean ALL states can be changed freely, but many common ones can be.
 
Well, I may be able to imagine a couple that might be changed without stalls.

One might be a change that only swaps the textures used, or the blending mode (e.g. changing the BlendFunc), or some other change that doesn't alter the rendering so much that it would change the amount of processing power (clocks), or the amount of input/output data, required to render.

If this is true, then could it be guaranteed that every 100-instruction shader would execute in the same amount of time? I highly doubt it. Some passes might need dependent texture reads, others wouldn't. Some passes might need eight textures, others four.

I think the most problematic thing would be if the programmer wanted to use flow control. If the hardware supports none, then it would need to do a separate pass for each loop/branch (Assuming even this is possible...I should hope it is, for ATI's sake...). This could get especially bad if the amount of code executed within one loop/branch was very small, but there were potentially a large number of loops/branches.
 
Chalnoth said:
Well, I may be able to imagine a couple that might be changed without stalls.

One might be a change that only changes the textures rendered, or the blending mode (i.e. change the BlendFunc), or some other change that doesn't modify the rendering so much that it would change the amount of processing power (clocks), or the amount of input/output data required to render.
Well, you just took your first step in understanding pipelining in 3D hardware :D
If this is true, then could it be guaranteed that every 100-instruction shader would execute in the same amount of time? I highly doubt it. Some passes might need dependent texture reads, others wouldn't. Some passes might need eight textures, others four.
Why should the length of time per pass matter? The driver would know when the pass was completed.
I think the most problematic thing would be if the programmer wanted to use flow control. If the hardware supports none, then it would need to do a separate pass for each loop/branch (Assuming even this is possible...I should hope it is, for ATI's sake...). This could get especially bad if the amount of code executed within one loop/branch was very small, but there were potentially a large number of loops/branches.
Are you talking pixel or vertex shaders? I don't see any mention of looping/branching in nVidia's leaked NV30 pixel shader specs, so all of your above comments apply to them as well. In fact, even more so, because they'll have a LOT more wasted gates (i.e. if each pass is relatively sparse in pixel shader ops, then what is the use of supporting 1024 of them?)
 
nVidia does support a form of flow control in the PS. It consists of executing all possible branches, and then choosing one to output.

As for the comment on how long it takes a program to execute, I should think that if you're going to piggy-back programs, you'd need them all to execute in the same time to preserve parallelism. If the execution lengths aren't kept the same, then it could become a nightmare to keep the pipelines full, for the same reasons that pipelines get stalled with state changes where, say, the number of textures rendered changes. If I'm understanding everything you've written, this should be correct, shouldn't it?
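The "execute all possible branches, then choose one" approach described above is essentially predication, and can be sketched conceptually like this (a model of the idea, not NV30's actual instruction set; the branch bodies are arbitrary examples):

```python
def flattened_branch(condition, x):
    # Hardware without real flow control evaluates BOTH branch
    # bodies for every pixel...
    taken = x * 2.0        # result of the "then" branch
    not_taken = x + 1.0    # result of the "else" branch
    # ...then selects one result per pixel (a predicated move).
    # The cost is the SUM of both branches, not just the one taken,
    # but every pixel executes in the same number of clocks, which
    # is what keeps the pipelines lock-stepped.
    return taken if condition else not_taken
```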
 