R300: 1 texture unit per pixel pipeline

I would say that about 50% of your transistor count is texture units

Radeon R100: 30 million transistors, 2 pipes, 3 TMUs per pipe
Radeon R200: 60 million transistors, 4 pipes, 2 TMUs per pipe
Radeon R300: 110 million transistors, 8 pipes, 1 TMU per pipe


I would bet that the following have not changed very much since the R100, maybe some tweaks and updates:

VGA core

IDCT engine

basic TMU structure
 
YeuEmMaiMai said:
I would say that about 50% of your transistor count is texture units

Radeon R100: 30 million transistors, 2 pipes, 3 TMUs per pipe
Radeon R200: 60 million transistors, 4 pipes, 2 TMUs per pipe
Radeon R300: 110 million transistors, 8 pipes, 1 TMU per pipe


I would bet that the following have not changed very much since the R100, maybe some tweaks and updates....

Well, I'd imagine that the difference between 6 TMUs in R100 and 8 TMUs in R3xx would not equate to 80M transistors....;) (Wasn't R100 2x3?) Likewise, by your numbers there is a 50M transistor-count difference between R2xx and R3xx, but both architectures use 8 TMUs.
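
Just to put numbers on that objection, here's a quick back-of-the-envelope sketch in Python (it uses only the per-chip figures quoted in this thread; the totals and deltas are simple arithmetic):

```python
# Back-of-the-envelope check using the figures quoted above.
chips = {
    "R100": {"transistors_m": 30,  "pipes": 2, "tmus_per_pipe": 3},
    "R200": {"transistors_m": 60,  "pipes": 4, "tmus_per_pipe": 2},
    "R300": {"transistors_m": 110, "pipes": 8, "tmus_per_pipe": 1},
}

for name, c in chips.items():
    total_tmus = c["pipes"] * c["tmus_per_pipe"]
    print(f"{name}: {total_tmus} TMUs total, {c['transistors_m']}M transistors")

# R100 -> R300: TMU count goes 6 -> 8, yet the budget grows by ~80M.
# R200 -> R300: TMU count stays at 8, yet the budget grows by ~50M.
# So TMUs alone can't plausibly account for ~50% of the transistor budget.
```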

You have to factor in the number of pixel pipes, which also require transistors, not to mention FP circuitry, which is also expensive in terms of transistor count.

But I do generally agree that it's 8x1 instead of 8x2 for R3xx because of the lower transistor count for 8x1 (8x2 could well hurt yields). It's at least conceivable that if ATi moves R350 to .13 microns it could then spare the transistors for 8x2...unless, of course, they have other priorities for the architecture that they consider more important.
 
And consider that unless you've got the bandwidth to feed those additional texture units, they're just going to sit there stalled. Actually, they're going to do worse than that: the added transistor count will increase heat and power consumption and thereby lower the clock speed, leaving you with a chip that doesn't substantially outperform an 8x1 architecture and yet costs more to bring to market.
 
But R200 added things like TruForm and more advanced PS 1.4 support. All I am saying is that TMUs take up a lot of transistor real estate compared to things like the VGA core and DVD engine...


Also, the vertex power of the R300 is double that of the R200.
 
HyperZ III also added a lot more transistors...which is why you see this feature switched off in 9500NP and also in the GFFX equivalent 5200.

The 8 pipelines + DX9 PS/VS + HZIII + deeper caches (guessing here) + FP Units = 107+ million transistors.
 
I believe an Nx1 architecture is the most efficient. There is no silicon wasted on additional texture units which may or may not be used when rendering.
 
dominikbehr said:
I believe an Nx1 architecture is the most efficient. There is no silicon wasted on additional texture units which may or may not be used when rendering.

Yes, but wouldn't the second TMU always be active whenever at least trilinear filtering is used (since even R300 needs two clocks to apply trilinear)?
Is there a technical possibility of a 'stripped-down' second TMU or dedicated hardware whose sole purpose is to speed up trilinear or aniso?
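
A minimal clock-cost sketch of that idea, assuming a TMU delivers one bilinear sample per clock and trilinear costs two bilinear samples (the function is hypothetical, just to illustrate why the second TMU would stay busy under trilinear):

```python
import math

def clocks_to_texture(samples_needed: int, tmus_per_pipe: int) -> int:
    """Clocks to fetch one pixel's texture samples, assuming each TMU
    delivers one bilinear sample per clock (trilinear = 2 samples)."""
    return math.ceil(samples_needed / tmus_per_pipe)

# One trilinear texture per pixel:
print(clocks_to_texture(2, 1))  # Nx1: 2 clocks (the lone TMU never idles)
print(clocks_to_texture(2, 2))  # Nx2: 1 clock (both TMUs busy)
```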
 
John Reynolds said:
And consider that unless you've got the bandwidth to feed those additional texture units, they're just going to sit there stalled. Actually, they're going to do worse than that: the added transistor count will increase heat and power consumption and thereby lower the clock speed, leaving you with a chip that doesn't substantially outperform an 8x1 architecture and yet costs more to bring to market.

Agree and disagree.

If ATI goes to 8x2 and keeps the "pixel rate to bandwidth" ratio similar to what they have today, then there can be significant gains. I do agree that you'd be hitting bandwidth limitations more often than not, but I also believe that at 8x1 you are not hitting bandwidth limitations often enough for doubling the texel rate to be a no-gain.

Consider that the GeForce4 Ti is a 4x2 on a 128-bit bus.

Assuming the Loci memory controller and memory-saving techniques would be at least as good as the GeForce4 Ti's, what would be the problem with having 8x2 on a 256-bit bus? Do we think the GeForce4 Ti would be just as fast if it dropped the 2nd TMU?

8x2 on a 256-bit bus is exactly double the pixel rate, texel rate, and bandwidth of a GeForce4 Ti. And again, to be clear, such a part might be bandwidth limited more often than not, but there's probably significant performance to be gained.
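
The per-clock doubling is straightforward to check on paper; here's a sketch assuming equal core and memory clocks (which the real parts wouldn't share exactly):

```python
# Nominal per-clock ratios vs. GeForce4 Ti, assuming equal clocks.
gf4ti    = {"pipes": 4, "tmus_per_pipe": 2, "bus_bits": 128}
r3xx_8x2 = {"pipes": 8, "tmus_per_pipe": 2, "bus_bits": 256}  # hypothetical

pixel_ratio = r3xx_8x2["pipes"] / gf4ti["pipes"]
texel_ratio = (r3xx_8x2["pipes"] * r3xx_8x2["tmus_per_pipe"]) / \
              (gf4ti["pipes"] * gf4ti["tmus_per_pipe"])
bw_ratio = r3xx_8x2["bus_bits"] / gf4ti["bus_bits"]

print(f"pixels/clock: {pixel_ratio:.0f}x")  # 2x
print(f"texels/clock: {texel_ratio:.0f}x")  # 2x
print(f"bus width:    {bw_ratio:.0f}x")     # 2x
```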

I won't guess for sure one way or another about Loci's config, but I certainly don't rule out the possibility of 8x2.
 
Joe DeFuria said:
Consider that the GeForce4 Ti is a 4x2 on a 128-bit bus.

Consider NV35, which is 4x2 on a 256-bit bus.

Also look at Radeon 8500 -> Radeon 9000. There is no big performance drop between them.

In my opinion Nx1 architectures are better, especially because in NxM architectures some texture units stay idle, either because of bandwidth limitations or because the application uses number_of_textures % M != 0.
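
The utilization penalty is easy to quantify; here's a sketch assuming the pipe samples up to M textures per clock and nothing else stalls (the function is mine, purely illustrative):

```python
import math

def tmu_utilization(num_textures: int, m: int) -> float:
    """Fraction of TMU issue slots doing useful work when a pass
    applies num_textures textures on an NxM design, no other stalls."""
    clocks = math.ceil(num_textures / m)
    return num_textures / (clocks * m)

for t in range(1, 5):
    print(f"{t} texture(s): Nx1 = {tmu_utilization(t, 1):.0%}, "
          f"Nx2 = {tmu_utilization(t, 2):.0%}")
# Nx1 stays at 100%; Nx2 drops to 50% whenever num_textures % 2 != 0.
```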

Also, please note that in pixel shading architectures, after you read textures you have to actually process them. With simple Quake lightmaps on an Nx2 arch you can get away with TEX, TEX, MUL output, tex1, tex2. What if you use more textures? With 4 textures you need at least 3 ALU ops to combine them, so you need a double-pumped ALU or something similar. But then you cannot use the ALUs in parallel when an instruction depends on the result of the previous one. Now maybe you want to add out-of-order execution here? You end up with enough transistors for 8 pipelines packed into 4 pipelines, limiting you to 4 pixels per clock of output.
Then you throw a single texture, or a color fill, at this arch and you end up with most of the transistors idling...
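
To make the ALU side of the argument concrete, a small sketch of the combine cost (assuming 2-input ALU ops; "chain depth" is the dependent-instruction depth even with a balanced combine tree):

```python
import math

def combine_cost(num_textures: int) -> tuple[int, int]:
    """Combining T texture results with 2-input ALU ops takes T - 1 ops;
    even a balanced tree still has ceil(log2(T)) dependent steps."""
    ops = max(num_textures - 1, 0)
    depth = math.ceil(math.log2(num_textures)) if num_textures > 1 else 0
    return ops, depth

for t in (2, 4, 8):
    ops, depth = combine_cost(t)
    print(f"{t} textures: {ops} ALU ops, dependent chain depth {depth}")
# 2 textures: 1 op  -> the simple TEX, TEX, MUL lightmap case above
# 4 textures: 3 ops -> needs a double-pumped ALU or extra clocks
```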

Mr. David Kirk said that 3D rendering is an embarrassingly parallel problem. But some people can really complicate it.

Simple things should be simple. Complex things should be possible. -Alan Kay
 
Points well taken dominik. :D

Again, I certainly agree that 8x1 would be far more efficient. It's just that in the super-high-end space, the most cost-effective / efficient design is not necessarily what you're after. You're after the absolute best performance, so you get the bragging rights / brand recognition.

Consider NV35, which is 4x2 on a 256-bit bus.

And consider NV18 which is 2x2 on a 128 bit bus. ;)

Also look at Radeon 8500 -> Radeon 9000. There is no big performance drop between them.

Agreed...not very big, but it's there. And given the same price, what do people recommend? The 8500. Of course, because the 8500 is presumably more expensive to make, that's not really what ATI wants, so 4x1 in the mainstream / value segment makes perfect sense: you want to be as efficient as possible there.

Also please note that in pixel shading architectures after you read textures you have to actually process them....

Yes, I mentioned this in some other post. The disadvantage to this set-up is that you don't increase shading performance per clock. (But then, shading performance per clock is currently an R300 strong point.)

I think one key thing to consider (marketing-wise) for a product coming out in Q3/Q4/Q1 '04 is that it be as good a "Doom3" renderer as possible. Doom3, AFAIK, is basically lots of "relatively simple" pixel shading operations...even the NV1x and R100 cores are capable of them. So having more pixel shading power may not be the target for cards this fall.
 