Are Multi TMUs per pipe really outdated?

Hellbinder[CE] said:
So really... an 8x2 would be the ideal. Being that while Shaders are taking over more of the advanced texturing features, requiring fewer TMUs, Tri and Aniso use is increasing, requiring an additional TMU to maintain timely sample rates.

What would the math look like to show the target bandwidth needed to support a fully loaded 8x2?

No, I think it'd be ideal if each pixel pipeline were just capable of one trilinear-filtered pixel per clock. It's rather silly to bother to optimize for bilinear-filtered performance these days.
 
No, I think it'd be ideal if each pixel pipeline were just capable of one trilinear-filtered pixel per clock. It's rather silly to bother to optimize for bilinear-filtered performance these days.

Except that then the texture bandwidth demands would be too great for such a design to be of much use (at least without texture compression).

Take the case of an 8x1 design outputting a trilinear-filtered texel from every pipe on every clock. That's 2048 bits of texel reads every clock (assuming 32-bit textures). Of course these reads are coming from texture cache, not from DRAM; but texture cache can only help to the degree that you inherently reuse the same texels when working on nearby pixels.

Assuming a neutral LOD bias, isotropic bilinear/trilinear have an inherent texel reuse level of 4:1. (AFAICT from my understanding of mip-map selection; if someone can confirm or correct this, please go ahead.) So that means you're spending 512 bits of DRAM bandwidth per clock just for texture reads, and this is before taking cache and memory controller inefficiencies into consideration. Note that the most bandwidth-heavy of all the current/near-future 8x1 cards have 256-bit wide DDR DRAM buses running just a shade slower than core clock. In other words, even the 9700/9700Pro are slightly short of the bandwidth required for the texture samples to support 8 trilinear-filtered pixels a clock. And that's without leaving any bandwidth left over for, say, the z-buffer and framebuffer.
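The arithmetic above can be checked with a quick back-of-the-envelope calculation. The clock figures below are illustrative 9700 Pro-style numbers (310 MHz memory, 325 MHz core), not exact specs:

```python
# Bandwidth needed for 8 pipes of single-clock trilinear, 32-bit textures.
pipes = 8
bits_per_texel = 32
texels_per_trilinear = 8       # two bilinear footprints of 4 texels each
reuse = 4                      # inherent texel reuse assumed in the text

cache_bits_per_clock = pipes * texels_per_trilinear * bits_per_texel
dram_bits_per_clock = cache_bits_per_clock / reuse
print(cache_bits_per_clock)    # 2048 bits read from texture cache per clock
print(dram_bits_per_clock)     # 512 bits that must ultimately come from DRAM

# A 256-bit DDR bus delivers 512 bits per *memory* clock; with memory
# running slightly slower than core, the per-core-clock budget falls short.
bus_bits = 256 * 2             # DDR: two transfers per memory clock
mem_clock, core_clock = 310e6, 325e6
available_per_core_clock = bus_bits * mem_clock / core_clock
print(round(available_per_core_clock, 1))   # ~488.4 bits < 512 needed
```

And that is before reserving any bandwidth for z-buffer and framebuffer traffic, exactly as argued above.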

So basically the answer is, even if ATI added the ability to do trilinear filtering in one clock, you would never be able to output 8 trilinear pixels in one clock without being bandwidth limited first. Now, that in and of itself is not a good reason to keep from putting in trilinear in one clock functionality, assuming--and I agree with you on this--that nobody should be running a 9700 class card with bilinear filtering. After all, better to put out some trilinear pixels in one clock and some in two than require two clocks for all of them.

Even when you consider that expanding each TMU from 4 to 8 texel sampling units will cost transistors and possibly clock rate, it still seems like a good trade-off if you're always going to be outputting trilinear filtered pixels. Where this assumption fails is on something like the Doom 3 engine, where you render a whole pass with no texture reads, only writing z-buffer values. Given that ATI (and everyone else) not only expects but encourages this sort of engine to become very popular in the coming years, it makes sense for them to push more lowest common denominator pixel fillrate instead of optimizing for a particular case which might not always exist (even when trilinear filtering is turned on).

This, incidentally, is the answer to the original topic of this thread: yes, multiple TMUs per pipe are obsolete, because future engines will prize flexible speed over designs hardwired to favor particular situations. Along those lines, I'd expect 16x1 (presumably with the ability for different pipes to be working on different polygons at one time, at least in two 8x1 segments) to come next, not 8x2.
 
horvendile said:
Hm... I thought that loopback made having more than one TMU per pipe kind of unnecessary? Ignoring cost, that is. Is there any performance disadvantage to loopback compared to passing the data on to the next TMU?

What I am trying to say is: I thought that with loopback, an 8x1 arrangement could do everything as fast as a 4x2, but the reverse would not necessarily be true. Have I missed something?
I suppose that depends on:
  • whether each pipe in the 8x1 arrangement has to work on the same polygon or whether they can work on different polys
  • whether they have to work on adjacent pixels (in, say, a 4x2 block) or can process arbitrary pixels
As polygons become smaller, an 8Px1T pipe arrangement might be less efficient than a 4Px2T (assuming there are a few textures to process per pixel).
 
Dave H said:
Assuming a neutral LOD bias, isotropic bilinear/trilinear have an inherent texel reuse level of 4:1. (AFAICT from my understanding of mip-map selection; if someone can confirm or correct this, please go ahead.) So that means you're spending 512 bits of DRAM bandwidth per clock just for texture reads, and this is before taking cache and memory controller inefficiencies into consideration. Note that the most bandwidth-heavy of all the current/near-future 8x1 cards have 256-bit wide DDR DRAM buses running just a shade slower than core clock. In other words, even the 9700/9700Pro are slightly short of the bandwidth required for the texture samples to support 8 trilinear-filtered pixels a clock. And that's without leaving any bandwidth left over for, say, the z-buffer and framebuffer.
The problem with such a calculation is that the two mipmaps needed for trilinear interpolation of each given pixel have different resolutions, with the result that you get a much higher level of reuse (about 3-4x as high) for the low-resolution mipmap than the high-resolution one. Also, 4:1 reuse sounds a bit low for the high-resolution mipmap (you need to select mipmap levels so that neither level exhibits texture aliasing, so you need to keep 1 texel a bit larger than 1 pixel most of the time, giving higher than 4:1 reuse), so in sum I would estimate the bandwidth usage for 8-pipe trilinear 32-bit texturing at about 250-320 bits per clock, rather than >500.
 
The problem with such a calculation is that the two mipmaps that are needed for trilinear interpolation of each given pixel have different resolutions, with the result that you get a much higher level of reuse (about 3-4x as high) for the low-resolution mipmap than the high-resolution one.

Right. The low-resolution mip-map will have exactly 4x less detail than the high-resolution one, so ignoring cache effects you should get ~4x greater reuse.

Also, 4:1 reuse sounds a bit low for the high-resolution mipmap (you need to select mipmap levels so that neither level exhibits texture aliasing, so you need to keep 1 texel a bit larger than 1 pixel most of the time, giving higher than 4:1 reuse),

Are you sure about this? I glanced at the details in the OpenGL spec once, and while they were certainly a bit confusing (couldn't quite figure out what all of the parameters were without investing more time than I wanted to), I *do* know that the high-res mip-map is 1/2 a level lower than what you'd use if you were only doing bilinear. In other words, I think you might get texture aliasing if you only used the high-res mip-map (it'd be like doing bilinear with LOD bias -0.5); it's my impression that the blend with the low-res map is supposed to remove that aliasing.

The 4:1 level, incidentally (and as you suggest) is what you get if you pick the two "closest" mip-maps (i.e. for one, 1 texel is a bit larger than 1 pixel, for the other, 1 texel is a bit smaller than 1 pixel). Of course polygons at an angle to the screen screw this up: you would think that they'd give you worse reuse (fewer screen pixels for the same "amount" of texture), but there's some compensating term in the mip-map selection algorithm, with the end result being... :?:
 
Dave H said:
Are you sure about this? I glanced at the details in the OpenGL spec once, and while they were certainly a bit confusing (couldn't quite figure out what all of the parameters were without investing more time than I wanted to), I *do* know that the high-res mip-map is 1/2 a level lower than what you'd use if you were only doing bilinear. In other words, I think you might get texture aliasing if you only used the high-res mip-map (it'd be like doing bilinear with LOD bias -0.5); it's my impression that the blend with the low-res map is supposed to remove that aliasing.
Oh. OK, fine-reading the OpenGL spec seems to confirm that you are right on this one.
The 4:1 level, incidentally (and as you suggest) is what you get if you pick the two "closest" mip-maps (i.e. for one, 1 texel is a bit larger than 1 pixel, for the other, 1 texel is a bit smaller than 1 pixel). Of course polygons at an angle to the screen screw this up: you would think that they'd give you worse reuse (fewer screen pixels for the same "amount" of texture), but there's some compensating term in the mip-map selection algorithm, with the end result being... :?:

Polygons viewed at an angle will give *better* reuse when doing plain trilinear. What happens when a polygon is viewed from an angle is that it will be squished together along an axis in screen-space, with each texel stretched out along the perpendicular axis. The mipmap calculation formula will then adjust the mipmap level such that this squishing does not result in texture aliasing, the result being that each texel corresponds to an area ~1 pixel wide along the squish axis and many pixels long. With each texel corresponding to multiple pixels, you get better reuse the sharper the angle you view the texture from. Of course, this has the side effect of blurring the texture excessively along the perpendicular axis; this is what anisotropic mapping is supposed to fix (with aniso, I suspect that the texel reuse is pretty much the same no matter what angle the polygon is viewed from).
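The mipmap-level adjustment described above can be sketched with the standard LOD formula: take the longer of the two texture-space gradients across screen x and y, and use its log2 as the mip level. At a sharp viewing angle one gradient grows, pushing the LOD up (a coarser mip), which is what stretches each texel over many pixels. This is a simplified sketch of the usual selection rule, not any particular chip's implementation:

```python
import math

def mip_lod(dudx, dvdx, dudy, dvdy):
    # rho = length of the larger texture-space gradient (in texels per pixel)
    rho = max(math.hypot(dudx, dvdx), math.hypot(dudy, dvdy))
    return math.log2(max(rho, 1.0))   # clamp: never sharper than mip 0

print(mip_lod(1.0, 0.0, 0.0, 1.0))    # 0.0: head-on, ~1 texel per pixel
print(mip_lod(1.0, 0.0, 0.0, 4.0))    # 2.0: squished 4:1, coarser mip chosen
```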
 
Dave H, you're making good points. With 32-bit textures, you would need up to 40-bits per pixel of texture access (average of one texel per pixel of the hi-res mipmap, and one texel per 4 pixels for the next lower mipmap). Along with a 32-bit colour write and a Z read/write (compressed), you are correct in saying that can be well over the ~60 bits per pipe per clock that the 9700 architecture provides (assuming 100% memory efficiency, which is very unreasonable), let alone the 32 bits per pipe per clock that NV30 has.

However, you forgot to consider a few factors.

One is texture compression. This reduces texture size by a factor of 4 or 6, and suddenly texture bandwidth is not very important.
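The "factor of 4 or 6" comes straight out of the S3TC/DXTC block sizes: DXT1 packs a 4x4 texel block into 64 bits, DXT3/DXT5 into 128 bits. A quick check (the 6:1 figure is DXT1 measured against 24-bit RGB rather than 32-bit RGBA):

```python
# Compression ratios for the S3TC/DXTC formats relative to uncompressed.
texels_per_block = 16                          # 4x4 block
dxt1_ratio_rgba = texels_per_block * 32 / 64   # vs 32-bit RGBA
dxt1_ratio_rgb = texels_per_block * 24 / 64    # vs 24-bit RGB
dxt5_ratio = texels_per_block * 32 / 128       # vs 32-bit RGBA
print(dxt1_ratio_rgba)   # 8.0
print(dxt1_ratio_rgb)    # 6.0
print(dxt5_ratio)        # 4.0
```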

Another is the fact that when there's magnification rather than minification, you only need to sample the top level mipmap. Depending on the magnification, the texel to pixel ratio can be quite low, like in the 3DMark2001 fillrate tests or for lightmaps. This is especially true at high resolutions. In these cases texture bandwidth is nearly negligible, and in multitexture situations you really want another texture unit so that you can use extra bandwidth. As arjan said, viewing at angles also causes magnification, reducing bandwidth.

Finally, even when you are bandwidth limited, only when you need over 2x the bandwidth than what is provided will single cycle trilinear be useless. Otherwise you will still get some improvement. For the 9700, if a section of pixels needs on average 80 bits per pixel, but you can do the math in one cycle, then pixel rate is reduced to about 6 pix per clock (8 * 60/80), whereas if you needed 2 engine cycles to create the pixel, you're down to 4 pixels per clock. Still a 33% hit by using 1 TMU.
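The partial-benefit argument above, put numerically (using the same illustrative figures: ~60 bits per pipe per clock of supply, a scene demanding 80 bits per pixel):

```python
# Single-cycle vs two-cycle trilinear when bandwidth-limited.
pipes = 8
supply = 60.0   # bits available per pipe per clock (9700-class figure)
demand = 80.0   # bits needed per pixel in this hypothetical scene

single_cycle_rate = pipes * min(1.0, supply / demand)  # bandwidth-limited
two_cycle_rate = pipes * min(0.5, supply / demand)     # engine-limited at 2 clocks
print(single_cycle_rate)   # 6.0 pixels per clock
print(two_cycle_rate)      # 4.0 pixels per clock: the 33% hit quoted above
```

Only once demand exceeds 2x supply does the min() in both expressions collapse to the same bandwidth-limited value, making single-cycle trilinear moot.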


All these things make for a tough decision regarding TMUs. It seems the best thing to do is just simulate both scenarios with various scenes, and make a decision based on that.
 
arjan de lumens said:
Of course, this has the side effect of blurring the texture excessively along the perpendicular axis; this is what anisotropic mapping is supposed to fix (with aniso, I suspect that the texel reuse is pretty much the same no matter what angle the polygon is viewed from)

Well, considering the fact that anisotropic is currently implemented by taking multiple bilinear/trilinear samples within a pixel, the texel reuse will remain the same with or without anisotropic (assuming the same level of texture aliasing), since each bilinear/trilinear sample will still have the same dimensions. Of course, there will be less texel reuse from pixel to pixel, though I'm not sure what matters more. The anisotropic filtering will undoubtedly require fillrate to perform.
 
Dave H said:
Even when you consider that expanding each TMU from 4 to 8 texel sampling units will cost transistors and possibly clock rate, it still seems like a good trade-off if you're always going to be outputting trilinear filtered pixels. Where this assumption fails is on something like the Doom 3 engine, where you render a whole pass with no texture reads, only writing z-buffer values. Given that ATI (and everyone else) not only expects but encourages this sort of engine to become very popular in the coming years, it makes sense for them to push more lowest common denominator pixel fillrate instead of optimizing for a particular case which might not always exist (even when trilinear filtering is turned on).

Well, I think that moving into the future, the important thing here will be, what is the balance between PS computational power, and texture filtering computational power? After all, it only seems natural that many textures will remain plain-old-8888 format, possibly compressed, for some time to come. It appears that current architectures do this integer processing in parallel with the PS unit.

So, for a given number of cycles in the PS, how many textures are going to need accessing? How long will it take those texels to get through the filtering stage? At what stage in processing are the colors of the sampled textures going to be needed? Anyway, these are all processing balance questions that need to be answered by the hardware companies. How each answers these questions will likely be a significant factor in the future performance of their GPU's.

As for DOOM3's z-only rendering, it seems to me that what should be done instead is to have the multiple z-checks per pixel pipeline per clock put to use for not just multisampling, basically adding a significantly-faster z-only rendering pipeline. I'm not really sure why this should affect texture filtering power per pixel pipeline, though it does seem that this is the reason why only one texture per pipeline is supported.
 
Chalnoth said:
arjan de lumens said:
Of course, this has the side effect of blurring the texture excessively along the perpendicular axis; this is what anisotropic mapping is supposed to fix (with aniso, I suspect that the texel reuse is pretty much the same no matter what angle the polygon is viewed from)

Well, considering the fact that anisotropic is currently implemented by taking multiple bilinear/trilinear samples within a pixel, the texel reuse will remain the same with or without anisotropic (assuming the same level of texture aliasing), since each bilinear/trilinear sample will still have the same dimensions. Of course, there will be less texel reuse from pixel to pixel, though I'm not sure what matters more. The anisotropic filtering will undoubtedly require fillrate to perform.

The 'correct' mipmap selection algorithm when aniso is enabled is different from when it is disabled - enabling aniso will, for pixels of polygons viewed at an angle, perform multiple bi/trilinear samples at a high-resolution mipmap level rather than a single sample at a much lower resolution level. The result is that the blurring effect of plain trilinear is removed, and the texel reuse is reduced substantially compared to plain trilinear.
 
arjan de lumens said:
The 'correct' mipmap selection algorithm when aniso is enabled is different from when it is disabled - enabling aniso will, for pixels of polygons viewed at an angle, perform multiple bi/trilinear samples at a high-resolution mipmap level rather than a single sample at a much lower resolution level. The result is that the blurring effect of plain trilinear is removed, and the texel reuse is reduced substantially compared to plain trilinear.

But it should be the same from sample to sample within a pixel when compared to the individual pixels without anisotropic. That's what I was attempting to say. Anyway, whichever way you slice it, anisotropic will still have more memory bandwidth/fillrate ratio than plain bilinear/trilinear for any architecture that doesn't have additional texture filtering power for anisotropic in the pipeline (that is, it will have more memory bandwidth available than plain bilinear/trilinear).
 
Chalnoth said:
But it should be the same from sample to sample within a pixel when compared to the individual pixels without anisotropic. That's what I was attempting to say.
That's the way I understood that you were saying it, and I still disagree. With plain trilinear viewed from a sharp angle, each texel (sampled from a low-resolution mipmap) will correspond to a long stripe of pixels (giving high reuse and blurring); with aniso, each texel will correspond to about one sample point (sampled from a higher-resolution mipmap; giving lower reuse).
Anyway, whichever way you slice it, anisotropic will still have more memory bandwidth/fillrate ratio than plain bilinear/trilinear for any architecture that doesn't have additional texture filtering power for anisotropic in the pipeline (that is, it will have more memory bandwidth available than plain bilinear/trilinear).
True, when you count the effect of framebuffer accesses ...
 
There seems to be somewhat of a misconception here, namely that 32-bit texture performance is all that matters.

As a well-known texture compression advocate, I personally think that compressed texture performance is what really matters. There's no reason that current applications can't compress at least 50% of their texture sets, and in the vast majority of cases 80%+.

With trilinear filtering on, even with a side-by-side it is very hard to tell which is which.
 
Color texture maps can often be compressed with good results. But we still lack good compression methods for bumpmaps/normal maps. So they will probably have to stay 32-bit for some time still.
 
Thanks everyone for some very informative replies! I'm learning a lot! :D :D

Mintmaster said:
One is texture compression. This reduces texture size by a factor of 4 or 6, and suddenly texture bandwidth is not very important.

Well I didn't forget texture compression so much as purposely ignore it. But now I wonder if I was right to do that. I'm actually rather ignorant on this question: how often is texture compression used these days? Does it tend to be turned on by default in most new games/current drivers?

Another is the fact that when there's magnification rather than minification, you only need to sample the top level mipmap. ... This is especially true at high resolutions. In these cases texture bandwidth is nearly negligible, and in multitexture situations you really want another texture unit so that you can use extra bandwidth.

:oops: Doh! Should have realized that for myself. But it's a very interesting point, especially the observation about high resolutions. (Bumping up the resolution therefore makes you less bandwidth-limited then...)

Finally, even when you are bandwidth limited, only when you need over 2x the bandwidth than what is provided will single cycle trilinear be useless.
...
All these things make for a tough decision regarding TMU's. It seems the best thing to do is just simulate both scenarios with various scenes, and make a decision based on that.

There's always something a little silly about discussions on the Internet second-guessing some mid-level design decision like this, because obviously the engineers making the decision considered all the possibilities, ran simulations that go far beyond the capabilities of any layman on the Internet in revealing the remifications of each choice, and went with whatever came out best. So, obviously, limiting each TMU to one bilinear filtered pixel per clock was indeed the best decision for the R300.

The question, given your well-taken point that the ability to do trilinear in one clock should greatly increase fillrate even if bandwidth limitations stop you from running trilinear at full fillrate, is why? My guess is that it comes down to a low-level implementation issue. I would assume that the pixel pipelines are very highly, um...pipelined. That when we talk about "doing bilinear in one clock", we are not talking about calculating 4 texel positions, fetching those 4 texels, filtering them, fetching current z-value, comparing, writing new z-value, and writing to the framebuffer (am I missing anything?) all in one clock cycle. Instead, I'm pretty sure we're talking about doing all those things in several clock cycles, but with a throughput of one pixel per pipe per cycle.

Even so, there are some obvious reasons why moving to (a throughput of) one trilinear pixel per cycle might be more trouble than it's worth. One is that adding the extra pipeline stages (in each of 8 different pixel pipes) means a fair amount of extra transistors. (It's not just the extra texel sampling units; you have extra registers to save state between pipeline stages, plus extra control.) Given R300's ambitious task of putting a fast, complete DX9 GPU on a .15u process and selling it as low as the $150 bracket, it's safe to assume transistors were pretty tight. A second reason is that even if all 8 texel samples aren't for the same pixel, with a fully pipelined design you're still fetching 8 texels from the texture cache every clock. Of course there are going to be some clever ways around it, but...it is *tough* to design a cache that'll do that at a high clock rate.

So...those are my new guesses as to why they decided against trilinear in one clock. (Obviously weighed against the simulated performance benefit of having it.)

Chalnoth said:
After all, it only seems natural that many textures will remain plain-old-8888 format, possibly compressed, for some time to come.

Certainly. It's always been my impression that the FP formats are only useful for increased precision for various calculations, and as a way to store textures used in those calculations. Err, that's not very specific. That normal color-textures (like, you know--the ones that actually get stuck onto polygons) will stay 8888 for the foreseeable future.

It appears that current architectures do this integer processing in parallel with the PS unit.

So, for a given number of cycles in the PS, how many textures are going to need accessing? How long will it take those texels to get through the filtering stage? At what stage in processing are the colors of the sampled textures going to be needed?

Clearly shading performance is going to be a huge part of overall GPU performance in the future. I do wonder, though, the degree to which shaders will be running parallel with the fixed-function pixel pipeline. Not because it can't be done, but because it seems like (for near-future games at least) PS will only be done on a small fraction of overall rendered texels if we're to get playable framerates. If most of the screen is still covered by "normal" pixels, I doubt you'll be able to get use out of the PS simultaneously with rendering those pixels. (I would assume the PS and the fixed-function pipeline have to be working on the same triangle at any one time, correct?)

Or maybe in the not-too-distant future we'll get games which run shaders on all pixels just as Doom3 puts bump maps on every surface.

As for DOOM3's z-only rendering, it seems to me that what should be done instead is to have the multiple z-checks per pixel pipeline per clock put to use for not just multisampling, basically adding a significantly-faster z-only rendering pipeline.

It might take explicit hardware support for such a feature (to be able to calculate the z-values of any 4 pixels in one pipe, instead of just 4 subpixels at various offsets from one pixel), but it does seem like an easy and very useful modification. I wonder if this is implemented in the R300/NV30 generation...? (And if it is implemented on the NV30, why can't it do PJMS??)

arjan de lumens said:
The 'correct' mipmap selection algorithm when aniso is enabled is different from when it is disabled - enabling aniso will, for pixels of polygons viewed at an angle, perform multiple bi/trilinear samples at a high-resolution mipmap level rather than a single sample at a much lower resolution level. The result is that the blurring effect of plain trilinear is removed, and the texel reuse is reduced substantially compared to plain trilinear.

Very interesting. It's kind of funny that I'd assumed the effects on texel reuse of isotropic filtered angled polys on the one hand and anisotropic on the other to be exactly the opposite of what you've said they are. But that's because I wasn't thinking of how to adjust the mip-map selection to make everything look correct, but rather of trying to make the closest analogy to the forward-facing poly case, because I actually understood that one! :oops:

But I think I understand everything now. :D

A couple questions:

First, doesn't aniso therefore run into the "problem" of hitting the largest mip-map size (and therefore becoming blurry but fast) more quickly?

And second, even with the smaller amortized z/framebuffer costs, it seems like if aniso incurs worse texel reuse it might still have a bandwidth/fillrate ratio not considerably different from isotropic filtered pixels. Is AF really primarily a fillrate hit? (Put another way: are there any common settings--other than bumping up the resolution, duh--which will tend to cost relatively less on a GFFX-style card than a 9700?)
 
Dio said:
There's no reason that current applications can't compress at least 50% of their texture sets, and in the vast majority of cases 80%+.

With trilinear filtering on, even with a side-by-side it is very hard to tell which is which.

Guess that answers that question! So, are there usually settings to turn off texture compression in games (I haven't noticed them)?
 
arjan de lumens said:
Color texture maps can often be compressed with good results. But we still lack good compression methods for bumpmaps/normal maps. So they will probably have to stay 32-bit for some time still.
16 bits is sufficient for normals.

revealing the remifications of each choice,
remifications: v: The acts of putting comments into a BASIC program?
 
remifications: v: The acts of putting comments into a BASIC program?

:cry: Tens of thousands of words in my post and you have to pick on the one I spelled wrong! :cry:

And then to be so cruel as to associate it with the dullest construct in the dullest programming language ever! Why oh why couldn't I have written reifications instead?? And to think my parents put me through an Ivy League education, just to have it all end up like this. :oops:
 
Simon F said:
16 bits is sufficient for normals.
Depends on the application - some may even need 16 bits/channel, although if you encode in tangent space that's only a 2-vector. The ATI car demo for R9700 has an option to use 8-bit or 16-bit per channel and the difference is quite striking.
 
Dio said:
Simon F said:
16 bits is sufficient for normals.
Depends on the application - some may even need 16 bits/channel, although if you encode in tangent space that's only a 2-vector. The ATI car demo for R9700 has an option to use 8-bit or 16-bit per channel and the difference is quite striking.
I must admit I didn't specify how you should pack them! I was just thinking that 16bits gives 65k possible vectors. Assuming my "back of the envelope" calculations are correct, then if those 65k are spread 'evenly' on a sphere then there is angular difference of <1 degree between the points.
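That back-of-the-envelope calculation can be reproduced directly: spread 2^16 unit vectors evenly over the sphere's 4*pi steradians, and the angular width of each vector's patch comes out well under a degree:

```python
import math

# 65k evenly spread unit normals: angular spacing between neighbours.
points = 2 ** 16
steradians_per_point = 4 * math.pi / points          # solid angle per normal
angular_width_deg = math.degrees(math.sqrt(steradians_per_point))
print(round(angular_width_deg, 2))                   # ~0.79 degrees
```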

That seems more than subtle enough for computer graphics.

BTW I remember reading a paper that was discussing how a normal represented as 3 floats had enough accuracy to hit a sub-centimetre sized target on Mars from the Earth!

Dave H said:
Tens of thousands of words in my post and you have to pick on the one I spelled wrong!
Oh feel free to find mine... probably evxery second oone is wrongg :). It just struck me as amusing.
 