Are Multi TMUs per pipe really outdated?

Pretty much everything I am reading indicates that a single TMU (or single TMU function) per pipe is the way of the future, for various reasons such as shaders taking the place of traditional texturing.

My question: what about anisotropic filtering support? I am not sure about the process myself. Is this a function that would benefit from having 2 TMUs per pipe? I know that the R300 takes a pretty big performance hit in newer games with Quality Aniso enabled. The NV30 looks to do much better, but that is much more a function of the core being clocked at 500 MHz.

Is aniso filtering performance in games like UT 2003, with its liberal use of texturing, being held back by the 8x1 design? If so, and if future products keep a single TMU or even drop TMUs entirely, how would high levels of aniso be supported without breaking the performance bank?
 
I think the thing is that TMUs are just "cheaper" in terms of the die space they require, but for performance reasons you're better off with more pipes than more TMUs. So one reason we're seeing more pipes now, I think, is just that fabrication technology has advanced to the point where it's feasible.

The other is that memory bandwidth bottlenecks have eased over the last couple of years. Back in the time of the GF2, you could have loaded it up with 100 pipes and it wouldn't have run any faster, simply because it was starved for bandwidth. This allowed the 2-pipe Radeon to be almost as fast, thanks to its hierarchical-Z. Unfortunately, 2 pipes is still 2 pipes, and even back then you were better off with more pipes and fewer TMUs (also, the 3rd TMU was kind of pointless, unfortunately).

In a way it's always made sense to have more pipes. I think basically what's happened is that, with more pipes, adding additional TMUs just isn't as important. All the same, the step up to 8 pipes was pretty big, so that is probably why ATI and Nvidia chose to go with only a single TMU. As fabrication processes advance we may see additional TMUs added, but probably never more than 2. I guess the question is whether texturing is still as much of a bottleneck, or whether geometry, lighting and pixel shading power is more important.
 
more TMUs more SSAA P0w4h :LOL:

R350:
8x2
400/1000 MHz (DDRII)
hardware truform
SSAA, MSAA+SSAA capable drivers
32x Aniso

that would be nice, I wouldn't really care for better PS/VS, I still haven't seen that many games even using basic DX8 shading, at least not any good ones, so I think the basic DX9 VS/PS will do just fine
however, more bandwidth and fillrate is always nice no matter how "undeveloped" games are: you can always use more Aniso and FSAA :D

sorry, ranting
 
I still don't feel that SSAA will make a comeback, at least not in the forms we've seen in the past. And multiple TMUs will not aid in supersampling.

As for anisotropic filtering, it's not necessary to add multiple TMUs in order to improve performance. It is possible to just add additional filtering units (adding support for more textures per pixel would require more transistors), similar to how many video cards have the same multitexturing fillrate with and without trilinear filtering.
 
My guess is that it's all about transistor count.
It should be something like this:

num_of_pixel_pipes * pp_cost + num_of_tmus * tmu_cost + num_of_arithmetic_units * au_cost

What I think is that with floating point arithmetic, "tmu_cost" and "au_cost" have gone up so much relative to "pp_cost" that it isn't worth packing multiples of them into the same pixel pipe.

In other words a 4x2 architecture wouldn't be much cheaper than the 8x1 so there's no point doing that.
If that theory turns out to be true, we might see 16x1 before 8x2...
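
A minimal sketch of that cost formula, just to show the shape of the argument. The function name and every cost figure here are made up for illustration (they are not real transistor counts), and it assumes a 4x2 part would also carry two arithmetic units per pipe.

```python
def chip_cost(pipes, tmus_per_pipe, aus_per_pipe, pp_cost, tmu_cost, au_cost):
    """Relative transistor cost of a pipes x tmus_per_pipe layout."""
    total_tmus = pipes * tmus_per_pipe
    total_aus = pipes * aus_per_pipe
    return pipes * pp_cost + total_tmus * tmu_cost + total_aus * au_cost


# Hypothetical relative costs: units cheap next to the pipe vs. units that dominate it.
cheap_units = dict(pp_cost=10, tmu_cost=2, au_cost=2)
costly_units = dict(pp_cost=10, tmu_cost=20, au_cost=20)

for label, costs in (("cheap units", cheap_units), ("costly units", costly_units)):
    cost_4x2 = chip_cost(4, 2, 2, **costs)
    cost_8x1 = chip_cost(8, 1, 1, **costs)
    # The closer this ratio gets to 1, the less a 4x2 saves over an 8x1.
    print(label, cost_4x2, cost_8x1, round(cost_4x2 / cost_8x1, 2))
```

With these invented numbers the 4x2 costs about two thirds of the 8x1 when the units are cheap, but about nine tenths once the unit costs dominate, which is the point being made above.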
 
Chalnoth said:
And multiple TMUs will not aid in supersampling.

Am I off my rocker again, or isn't it true that adding a TMU would increase texel fillrate, which in turn would increase SSAA performance?
 
Ante P said:
isn't it true that adding a TMU would increase texel fillrate

Yes it would.

which in turn would increase SSAA performance?

No it wouldn't.
SSAA eats pixel fillrate just as much as texel fillrate.
Actually it eats pixel fillrate more if you do anisotropic too, because the required level of anisotropy will go down on some parts of the image.

Now a 16x1 architecture on the other hand...
 
What I think is that with floating point arithmetic, "tmu_cost" and "au_cost" have gone up so much relative to "pp_cost" that it isn't worth packing multiples of them into the same pixel pipe.

I'm not sure TMU cost has/will go up significantly with FP. For instance, R300 can sample a 128bit FP texture, but it does so over 4 cycles.
 
Again, on another note - adding a second TMU to R300 could help out SSAA performance if Trilinear is the target, as Trilinear is currently a two cycle op with R300.
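
A rough sketch of the two SSAA points made above (pixel fillrate as the limit with bilinear, a second TMU helping with trilinear). It is a toy model, not a description of real hardware: it assumes each TMU completes one bilinear fetch per clock, that trilinear costs two bilinear fetches, that a second TMU can take the second fetch, and that bandwidth is never the limit. The function name is hypothetical.

```python
def clocks_per_frame(pixels, pipes, tmus_per_pipe, bilinear_fetches_per_pixel, ssaa=1):
    """Clocks to fill a frame, limited by the slower of pixel rate and texel rate."""
    samples = pixels * ssaa                  # supersampling shades every subsample
    pixel_clocks = samples / pipes           # one pixel per pipe per clock
    texel_clocks = samples * bilinear_fetches_per_pixel / (pipes * tmus_per_pipe)
    return max(pixel_clocks, texel_clocks)


frame = 1024 * 768

# Single bilinear texture, 4x SSAA: pixel rate is the limit, so 8x2 gains nothing.
print(clocks_per_frame(frame, 8, 1, 1, ssaa=4), clocks_per_frame(frame, 8, 2, 1, ssaa=4))

# Single trilinear texture (two bilinear fetches), 4x SSAA: the second TMU halves the time.
print(clocks_per_frame(frame, 8, 1, 2, ssaa=4), clocks_per_frame(frame, 8, 2, 2, ssaa=4))
```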
 
Hm... I thought that loopback made having more than one TMU per pipe kind of unnecessary? Ignoring cost, that is. Is there any performance disadvantage to loopback compared to passing the data on to the next TMU?

What I am trying to say is: I thought that with loopback, an 8x1 arrangement could do everything as fast as a 4x2, but the reverse would not necessarily be true. Have I missed something?
 
DaveBaumann said:
Again, on another note - adding a second TMU to R300 could help out SSAA performance if Trilinear is the target, as Trilinear is currently a two cycle op with R300.

I thought (and have read here) that each pipeline can process 4 bilinear filtered texels(?) each cycle. So this is not true??

Why does the R300 need 4 cycles to process one 128bit FP pixel per pipeline? I thought (according to the press releases etc.) that the pipelines are 96bit pipelines, able to process one 96bit FP pixel / pipeline / cycle.
 
horvendile said:
Hm... I thought that loopback made having more than one TMU per pipe kind of unnecessary? Ignoring cost, that is. Is there any performance disadvantage to loopback compared to passing the data on to the next TMU?

What I am trying to say is: I thought that with loopback, an 8x1 arrangement could do everything as fast as a 4x2, but the reverse would not necessarily be true. Have I missed something?

There are always advantages and trade-offs to think about. If you have a dual texturing application (on all polys) then, assuming no other bottlenecks and parity in the number of samples per pipe, a 4x2 will produce 4 dual texture pixels in 1 cycle whereas an 8x1 will produce 8 dual texture pixels in 2 – so there’s parity there. The point is that you don’t know how many texture layers the app is going to request.

However, currently most people use Trilinear and, on high end boards, anisotropic filtering, so increasing the number of samples that can be sampled per clock is a good thing (if the bandwidth can sustain it).

mboeller said:
I thought (and have read here) that each pipeline can process 4 bilinear filtered texels(?) each cycle. So this is not true??

I’d say it produces only 4 texels per clock, as in 4 texture samples (= 1 bilinear sample).

From the R300 developer documentation:

RADEON 9500/9700 can perform point or bilinear filtering of one texture request per clock cycle per pixel shader pipe, if texture format does not exceed 32 bits. For texture formats fatter than 32 bits it will take 2 clocks for processing 64 bit texture formats and 4 clocks for 128 bit formats. Trilinear filtering doubles number of clocks because it requires two bilinear blends. For all floating point formats on RADEON 9500/9700 only point filtering is supported.
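
The clock counts in that passage can be written down as a small lookup. This is only a sketch of the quoted rule: the function name is made up, and treating every width between 33 and 64 bits as a 2-clock format is an assumption (the documentation only names 64 and 128 bit formats). It also ignores the restriction that floating point formats are point-filtered only.

```python
def r300_texture_clocks(format_bits, trilinear=False):
    """Clocks per texture request per pipe, following the quoted R300 rule."""
    if format_bits <= 32:
        clocks = 1
    elif format_bits <= 64:      # assumption: anything up to 64 bits behaves like 64
        clocks = 2
    elif format_bits <= 128:     # assumption: anything up to 128 bits behaves like 128
        clocks = 4
    else:
        raise ValueError("format wider than 128 bits is not covered by the quoted rule")
    # Trilinear needs two bilinear blends, so it doubles the clock count.
    return clocks * 2 if trilinear else clocks


for bits in (32, 64, 128):
    print(bits, r300_texture_clocks(bits), r300_texture_clocks(bits, trilinear=True))
```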

mboeller said:
Why does the R300 need 4 cycles to process one 128bit FP pixel per pipeline? I thought (according to the press releases etc.) that the pipelines are 96bit pipelines, able to process one 96bit FP pixel / pipeline / cycle.

The pixel shader pipeline is 96bit wide, but that doesn’t necessarily mean the texture sampler samples 96/128-bits per clock. Once the shader has its data it can operate at 96-bit precision.
 
DaveBaumann said:
From the R300 developer documentation:

RADEON 9500/9700 can perform point or bilinear filtering of one texture request per clock cycle per pixel shader pipe, if texture format does not exceed 32 bits. For texture formats fatter than 32 bits it will take 2 clocks for processing 64 bit texture formats and 4 clocks for 128 bit formats. Trilinear filtering doubles number of clocks because it requires two bilinear blends. For all floating point formats on RADEON 9500/9700 only point filtering is supported.


The pixel shader pipeline is 96bit wide, but that doesn’t necessarily mean the texture sampler samples 96/128-bits per clock. Once the shader has its data it can operate at 96-bit precision.


Thanks !
 
DaveBaumann said:
a 4x2 will produce 4 dual texture pixels in 1 cycle whereas an 8x1 will produce 8 dual texture pixels in 2 – so there’s parity there. The point is that you don’t know how many texture layers the app is going to request.

Thanks.
Shouldn't that mean that if there is only one texture layer and no fancy filtering, the 8x1 is twice as fast as the 4x2 (since the second TMU is unused), if there are two layers they are the same speed, if there are three layers the 8x1 is, what, 1/3 faster than 4x2, and so forth?
So that 8x1 is often better than and never worse than a 4x2?
And, as said above, completely disregarding things such as transistor count.

Just trying to make sure I understand.

Are there any latencies whatsoever involved in loopback? Or in tossing data to the next TMU, should there be one?
 
horvendile said:
Thanks.
Shouldn't that mean that if there is only one texture layer and no fancy filtering, the 8x1 is twice as fast as the 4x2 (since the second TMU is unused), if there are two layers they are the same speed, if there are three layers the 8x1 is, what, 1/3 faster than 4x2, and so forth?
So that 8x1 is often better than and never worse than a 4x2?
And, as said above, completely disregarding things such as transistor count.

That's correct.
More pixel pipelines = better.
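
The layer-count comparison above can be sketched like this. It assumes simple bilinear layers, one texture fetch per TMU per clock, loopback with no extra latency, and no other bottlenecks; the function name is hypothetical.

```python
from fractions import Fraction
from math import ceil


def pixels_per_clock(pipes, tmus_per_pipe, layers):
    """Finished pixels per clock when each pixel needs `layers` texture layers."""
    # Each trip through the pipe applies up to tmus_per_pipe layers.
    passes = ceil(layers / tmus_per_pipe)
    return Fraction(pipes, passes)


for layers in range(1, 5):
    rate_8x1 = pixels_per_clock(8, 1, layers)
    rate_4x2 = pixels_per_clock(4, 2, layers)
    # Ratio > 1 means the 8x1 is ahead; it never drops below 1 in this model.
    print(layers, rate_8x1, rate_4x2, rate_8x1 / rate_4x2)
```

Under those assumptions the 8x1 is twice as fast with one layer, level with two, and one third faster with three, matching the figures in the post.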
 
However, currently most people use Trilinear and, on high end boards, anisotropic filtering, so increasing the number of samples that can be sampled per clock is a good thing (if the bandwidth can sustain it).

So really, an 8x2 would be the ideal: while shaders are taking over more of the advanced texturing features and so require fewer TMUs, trilinear and aniso use is increasing, which requires an additional TMU to maintain timely sample rates.

What would the math look like to show the target bandwidth needed to support a fully loaded 8x2?
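
One rough way to set up that math, as a worst-case sketch only. It assumes every bilinear fetch reads 4 texels of 4 bytes each straight from memory with no texture cache or compression, and it ignores color and Z traffic entirely, so real demand would look very different. The 400/1000 MHz clocks are the wish-list figures from earlier in the thread; the 256-bit bus and both function names are assumptions.

```python
def peak_texel_bandwidth_gbs(pipes, tmus_per_pipe, core_mhz,
                             texels_per_fetch=4, bytes_per_texel=4):
    """Uncached texture read bandwidth in GB/s if every TMU fetches every clock."""
    bytes_per_clock = pipes * tmus_per_pipe * texels_per_fetch * bytes_per_texel
    return bytes_per_clock * core_mhz / 1000.0   # MB/s scaled to GB/s


def memory_bandwidth_gbs(bus_bits, effective_mhz):
    """Theoretical memory bandwidth in GB/s for a given bus width and data rate."""
    return bus_bits / 8 * effective_mhz / 1000.0


demand = peak_texel_bandwidth_gbs(pipes=8, tmus_per_pipe=2, core_mhz=400)
supply = memory_bandwidth_gbs(bus_bits=256, effective_mhz=1000)
print(f"uncached texel demand: {demand:.1f} GB/s, memory supply: {supply:.1f} GB/s")
```

The gap between the two numbers is why texture caching matters so much: a fully loaded 8x2 could not be fed from memory alone under these assumptions.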
 
Hellbinder[CE] said:
So really, an 8x2 would be the ideal: while shaders are taking over more of the advanced texturing features and so require fewer TMUs, trilinear and aniso use is increasing, which requires an additional TMU to maintain timely sample rates.

What would the math look like to show the target bandwidth needed to support a fully loaded 8x2?

But a 16x1 would be better?
 