DX9 and multiple TMUs

Chalnoth said:
I should think that if you're going to piggy-back programs, you'd need them all to execute in the same amount of time to preserve parallelism. If the execution lengths aren't kept the same, then it could become a nightmare to keep the pipelines full, for the same reasons that pipelines will get stalled with state changes where, say, the number of textures rendered changes. If I'm understanding everything you've written, this should be correct, shouldn't it?
Each pixel pipeline will be executing the same pixel shader program with, possibly or likely, different data. The fact that they finish at different times is pretty much irrelevant because the driver won't know: it will only know when all pixel pipes are finished, and that's really all that matters.

Of course, the driver doesn't necessarily have to wait on the hardware at all, but that's a different issue.
 
Well, if 3D pixel pipelines are really in the hundreds deep, then there's no way you could keep the pipelines full without equal, or very nearly equal, program execution times for each pixel.
 
Chalnoth said:
Well, if 3D pixel pipelines are really in the hundreds deep, then there's no way you could keep the pipelines full without equal, or very nearly equal, program execution times for each pixel.
Why should this even matter? Again, I think you are making some assumptions about 3D hardware that aren't necessarily true.

For example, take nvidia's GeForce 4. It can handle 4 pixels arranged in a 2x2 square at the same time. Well, I can bet you with 99.99% certainty that the same pipe in that 2x2 arrangement hits the same pixel every time. Thus if the pipes are numbered:
0 1
2 3

Then the pixel at (0,0) is always hit by pipe 0. It doesn't matter which pass you are on.

Now, if the output data has to be used as input next time around, then you'll have to synchronize things, but this can often be done at the hardware level without driver interaction, which means that you can use the CPU for other useful tasks.
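
Here's a minimal sketch of that fixed quad-to-pipe mapping. The modulo arithmetic is my own illustration of the idea, not a documented GeForce4 detail:

```python
# Illustration of a fixed 2x2 quad-to-pipe mapping, as described above.
# The exact scheme is an assumption; the point is that a given screen
# position always lands on the same pipe, pass after pass.

def pipe_for_pixel(x: int, y: int) -> int:
    """Return the pipe index (0..3) for screen coordinate (x, y),
    assuming pipes are laid out as:
        0 1
        2 3
    over every 2x2 block of the screen."""
    return (x & 1) + 2 * (y & 1)

# The pixel at (0, 0) is always handled by pipe 0, regardless of pass:
assert pipe_for_pixel(0, 0) == 0
assert pipe_for_pixel(1, 0) == 1
assert pipe_for_pixel(0, 1) == 2
assert pipe_for_pixel(1, 1) == 3
```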
 
Chalnoth said:
But...imagine this. What if nVidia has a sort of framebuffer compression that only stores one color value to the framebuffer when FSAA is enabled, if that pixel is completely covered by the current triangle? If this were the case, even 8x FSAA could be implemented without too much of a memory bandwidth hit.

I have a feeling ATi is doing this with the 9700.
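
As a purely illustrative sketch of the scheme being speculated about here (names and layout are invented for illustration; no shipping chip is documented to work this way):

```python
# Sketch of the speculated framebuffer compression: if every sample of a
# pixel is covered by the same triangle, store one colour plus a
# "fully covered" flag instead of N sample colours.

SAMPLES = 8  # e.g. 8x FSAA

def store_pixel(framebuffer, addr, sample_colors, fully_covered):
    if fully_covered:
        # One colour write, regardless of the FSAA level.
        framebuffer[addr] = ("compressed", sample_colors[0])
    else:
        # Edge pixel: all samples must be written out.
        framebuffer[addr] = ("expanded", list(sample_colors))

# Interior pixels (no polygon edges, as in a fillrate test) cost a single
# colour write; only edge pixels pay the full multisample bandwidth.
fb = {}
store_pixel(fb, 0, [0xFF0000] * SAMPLES, fully_covered=True)
store_pixel(fb, 1, [0xFF0000] * 4 + [0x000000] * 4, fully_covered=False)
```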


LeStoffer said:
Some things that Mark J. Kilgard mentions about future hardware (slide 36 onward):

* More texture units (4 today, 16 soon)
- Huh? We have two per pipeline today. I cannot make sense of this...

Parhelia has 4 per pipe, although he was probably referring to textures per pass, as others have suggested.
 
3dcgi said:
I have a feeling ATi is doing this with the 9700.

Well, one benchmark that should show it pretty well is the 3DMark2k1 fillrate test. All of the FSAA modes, from 2x on up, should show the same result.
 
Chalnoth said:
3dcgi said:
I have a feeling ATi is doing this with the 9700.

Well, one benchmark that should show it pretty well is the 3DMark2k1 fillrate test. All of the FSAA modes, from 2x on up, should show the same result.
Only if you get perfect compression.
 
Well, if the method I suggested were used, and there was no stencil buffer in use, then there would be no hit above 2x FSAA for a scene that contains no poly edges (like the fillrate test).

This assumes, of course, that there's still enough video memory to handle the textures, and that there's no loss in performance due to lowered memory efficiency (since the pixels wouldn't all be lined up nicely in a row).
 
You actually need 8x2 (well, you could do 16x1 or 4x4) to be fully ps.2.0 ready. Pixel shader 2.0 requires 16 different textures and 32 texture samples. Radeon 9700 can handle 8 different textures and 16 texture samples (that's ps.2.0 talk) per clock. This is where those 160 instructions come from... (16 texture address instructions + 64 arithmetic instructions) * 2 clocks = 160 instructions. So if they want to be ps.2.0 compliant, they could also allow 128 arithmetic instructions (which is beyond ps.2.0). You cannot fetch two different textures from the same TMU in the same cycle.
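
A quick tally of the figures being used here (the two-clock factor is this post's assumption about the 9700, and it's disputed in the reply below):

```python
# Tally of the figures quoted above; the "* 2 clocks" factor is the
# poster's assumption about the Radeon 9700, disputed in the reply below.
texture_address_instructions = 16
arithmetic_instructions = 64
clocks = 2
total_instructions = (texture_address_instructions + arithmetic_instructions) * clocks
print(total_instructions)  # 160
```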

Simply not true... (and that PS 2.0 talk too)

The R300 can do exactly what you are claiming it can't.
 
Joe DeFuria said:
A 256-bit bus should easily be enough to feed an 8x2 architecture.

That, I disagree with. I don't think it's easily enough. I don't think the 128-bit bus is enough for the GeForce4 / Radeon 8500 pipelines. They seem to be bandwidth limited more often than not.

That being said, this doesn't mean an additional TMU would be useless, because you are not always bandwidth limited.

Based on R300 performance, I'd say that the 256-bit bus seems to be about the right pairing for an 8x1 pipeline. And while I think it would get some performance boost from an additional TMU in certain situations, it's not at all clear to me that it would be worth the additional silicon cost.
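
A rough back-of-the-envelope version of that pairing argument, assuming core and memory clocks are roughly equal (a simplification just to show the proportions):

```python
# Back-of-the-envelope bandwidth per pixel per clock, assuming memory and
# core clocks are roughly equal (a simplification for illustration).

def bits_per_pixel_per_clock(bus_width_bits, ddr, pixels_per_clock):
    return bus_width_bits * (2 if ddr else 1) / pixels_per_clock

# 4-pipe parts on a 128-bit DDR bus (GeForce4 / Radeon 8500 class):
print(bits_per_pixel_per_clock(128, True, 4))   # 64.0 bits per pixel per clock

# 8-pipe part on a 256-bit DDR bus (R300 class):
print(bits_per_pixel_per_clock(256, True, 8))   # 64.0 -- the same ratio,
# which is why the 256-bit bus looks like the right pairing for 8x1: an
# 8x2 design would have no more bandwidth per pixel than today's already
# bandwidth-limited 4x2 parts.
```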

Here's what I posted in another thread regarding Tom's comment:

First, Tom's calculation doesn't take into account texture caching, which reduces texture bandwidth IMMENSELY. Today's good GPUs hardly ever have to read a texture sample from memory twice when drawing a polygon, except when tiling a texture. Still, locally speaking, his figure holds roughly true.

Consider single-texturing. When minification is happening, bilinear filtering has texture bandwidth requirements of 32 bits per pixel maximum. Trilinear requires about 40 bits per pixel max, because one mip map is always 1/4 the resolution - however, since it requires 2 clocks to do the trilinear filtering (assuming 2 mipmaps are used instead of 1), that's only 20 bits per pixel per clock.

Remember, these are max figures, too. Increasing LOD bias lowers this, as does looking at oblique angles. When textures are closer to the camera, magnification spreads the texture over more pixels, reducing this much more (3DMark2K1 has almost negligible texture bandwidth requirements for this reason - I'm talking only a few bits per pixel).
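
A worked version of those per-pixel texture figures, assuming uncompressed 32-bit texels and the caching behaviour described above:

```python
# Worked version of the per-pixel texture bandwidth figures above,
# assuming uncompressed 32-bit texels and roughly one *new* texel fetched
# per pixel at 1:1 minification (the rest come from the texture cache).

texel_bits = 32

# Bilinear, minification: about one new texel per pixel, since each texel
# is shared with neighbouring pixels.
bilinear_bits_per_pixel = 1.0 * texel_bits        # 32 bits/pixel max

# Trilinear adds the next mip level, which has 1/4 the texel density,
# so roughly 1.25 new texels per pixel...
trilinear_bits_per_pixel = 1.25 * texel_bits      # 40 bits/pixel max

# ...but if filtering the two mip levels takes two clocks, the per-clock
# requirement is halved.
trilinear_bits_per_pixel_per_clock = trilinear_bits_per_pixel / 2   # 20

print(bilinear_bits_per_pixel, trilinear_bits_per_pixel,
      trilinear_bits_per_pixel_per_clock)
```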

Most GPUs, including the GF2, GF3, GF4, Radeon 8500, and R300, have about 64 bits of bandwidth per pixel per clock (give or take). You need 32 bits for the colour buffer write, and both Z reads and writes are necessary. With Z-compression, this is 16-64 bits per pixel, depending on compression (an average of 32, maybe?). This leaves only a bit for texture bandwidth, but again, texture bandwidth is not nearly as bad as Tom says it is. From here, the greater the texture bandwidth, the lower the efficiency. Alpha textures are a bit different, needing a Z-read and both a colour read and write (~80 bits/pix + texture bandwidth).
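
Putting those numbers into a rough per-pixel budget (the 32-bit average for compressed Z read + write is the guess from the paragraph above):

```python
# Rough per-pixel, per-clock bandwidth budget from the paragraph above.
# All numbers are approximations.

available_bits_per_pixel_per_clock = 64   # typical for GF2/GF3/GF4/8500/R300

colour_write_bits = 32
z_read_write_bits = 32    # 16-64 depending on compression; ~32 on average

texture_budget = (available_bits_per_pixel_per_clock
                  - colour_write_bits - z_read_write_bits)
print(texture_budget)     # ~0 bits nominally left for textures, which is
# why low real-world texture bandwidth (caching, magnification, LOD bias)
# matters: texture traffic mostly has to fit in the slack.

# Alpha-blended pixels: colour read + colour write + compressed Z read,
# roughly the ~80 bits/pixel figure above, before texture traffic.
alpha_pixel_bits = 32 + 32 + 16
print(alpha_pixel_bits)   # 80
```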

Generally, a second texture unit will help out a lot in multitexturing, because texture bandwidth is usually quite low. Some parts of the screen are bandwidth limited, so the performance gain isn't 100%, but it's still significant. Just look at RV250 vs. R200 in Quake 3 or Jedi Knight. The difference is quite noticeable.

I think R300 could have used the extra texture unit, but leaving it out was probably the best way to keep the transistor count in check. It's already a huge piece of silicon that probably costs significantly more than ATI's chips have cost in the past.

Overall, it's a design choice. A bigger question is the number of math shader ops that can be executed per pipe, per clock. I think R300 can do one op per clock per pipe to match the single texture unit per pipe, as more is not very useful at the moment.

As shaders get more complex, neither bandwidth nor textures per cycle will be all that critical. I think 8 pixel pipes is the max that we'll be seeing for a while, because having more shader stages in the shading pipeline will be just as good as extra pipes.

Parhelia, for example, can do a lot of shading ops per pixel, per clock, due to its "36-stage pixel shader". Theoretically, it is quite a bit faster than R300 in this sense. Unfortunately, PS 1.3 does not lend itself to particularly complex shaders. Still, it would be very interesting to compare the R300, GeForce4, and Parhelia in this respect.

Considering all this, it would be interesting to see what design decisions are made in future GPUs (or VPUs, as they are now called :) ), like NV30 and R400.
 
MDolenc said:
What I'm talking about is performance. Radeon 9700 can't output one fully loaded ps.2.0 pixel per clock. Every other pixel shader capable chip can output at least one fully loaded pixel of its pixel shader generation (GeForce 3 & 4 can output one ps.1.x pixel, Radeon 8500 can output one ps.1.4 pixel). What I'm saying is that it would be nice to have hardware that can do one ps.2.0 pixel per clock (and that's what you need if you want to make ps.2.0 really useful).

This can't possibly be true. I suppose Geforce3/4 could do one pixel per clock, since they only have a maximum instruction length of 8 instructions + 4 texture accesses, but the 8500 couldn't possibly do 12 texture accesses and 16 shader ops in one clock (that's what ps 1.4 is capable of). That's absolutely ridiculous, as it only has 8 texture units (2 per pipe), and I doubt it can do 4 math ops per pipe per clock in the pixel shader.
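
A quick sketch of that arithmetic, using the figures from the paragraph above:

```python
# Why a maximal ps.1.4 shader can't run at one pixel per clock on the
# Radeon 8500, using the figures from the paragraph above.

texture_accesses = 12    # maximum texture addressing instructions in ps.1.4
shader_ops       = 16    # maximum arithmetic instructions in ps.1.4
tmus_per_pipe    = 2     # Radeon 8500: 4 pipes x 2 TMUs each

# Each pipe needs at least this many clocks per pixel just for the
# texture fetches:
min_clocks_for_textures = texture_accesses / tmus_per_pipe
print(min_clocks_for_textures)   # 6.0 clocks per pixel, textures alone

# And true one-pixel-per-clock throughput would also mean sustaining all
# of the shader's arithmetic in a single clock per pipe:
print(shader_ops)                # 16 math ops per pipe per clock
```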

Making hardware powerful enough to do any ps 2.0 shader at one pixel per clock would be almost useless for quite a while. How many shaders would use an instruction length that big? How much silicon would be wasted 95% of the time? How much would such a monster cost?
 