DX9 and multiple TMUs

Can anyone comment on this part of Tom's article, and give me the consequences for the 2*8 architecture of the NV30?

IMO, a 2*8 architecture would certainly be somewhat faster than a 1*8 architecture even with the same memory interface, but you wouldn't get anywhere near 2X the performance. You are not ALWAYS memory bandwidth limited.

Put it this way: if it were absolutely useless to have 2*8 pipelines with a 256 bit interface, as Tom purports, it would also be absolutely useless to have 4*2 pipelines with a 128 bit interface.

Of course, the top cards of today (Radeon 8500 and GeForce4) are exactly that: 4*2 architectures with a 128 bit interface. While we're always clamoring for more bandwidth, I don't recall anyone arguing that the extra TMUs on the Radeon and GeForce4 are useless.
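
To put some rough numbers on that (every figure below is an assumption for illustration, not a quoted spec), the ratio of peak texel-fetch demand to peak bus bandwidth works out the same in both cases:

```python
# Back-of-the-envelope comparison: a 4x2 chip on a 128-bit bus versus
# a hypothetical 8x2 chip on a 256-bit bus, same clocks. All numbers
# are illustrative assumptions, not quoted specs.

def texel_demand_gbps(core_mhz, pipes, tmus, bytes_per_texel=4):
    # Worst case: one uncached 32-bit texel per TMU per clock.
    return core_mhz * 1e6 * pipes * tmus * bytes_per_texel / 1e9

def bus_bandwidth_gbps(effective_mem_mhz, bus_bits):
    return effective_mem_mhz * 1e6 * (bus_bits / 8) / 1e9

# 4x2 on 128 bits (GeForce4 Ti4600-ish numbers):
print(texel_demand_gbps(300, 4, 2), bus_bandwidth_gbps(650, 128))
# -> 9.6 GB/s demanded vs 10.4 GB/s available

# Hypothetical 8x2 on 256 bits:
print(texel_demand_gbps(300, 8, 2), bus_bandwidth_gbps(650, 256))
# -> 19.2 GB/s demanded vs 20.8 GB/s available: the same balance
```

So if one pairing is "useless", the other must be too, which is the point.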

I can't really see the practicality, though, of having more than 2 TMUs per pipe. And I only see the second TMU as really being useful to allow simultaneous access to two mip-map levels for things like anisotropic / trilinear filtering...
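
To make the two-mip-level point concrete, here's a toy sketch (the bilinear fetch is reduced to a point sample, since only the structure matters):

```python
# Why a second TMU helps trilinear filtering: one trilinear sample
# blends bilinear samples from two adjacent mip levels. Two TMUs can
# issue both fetches in the same clock; one TMU must serialize them.

def bilinear_sample(mip_chain, mip, u, v):
    # Placeholder: point-samples the mip level instead of blending
    # four texels, since only the two-fetch structure matters here.
    level = mip_chain[mip]
    h, w = len(level), len(level[0])
    return level[min(int(v * h), h - 1)][min(int(u * w), w - 1)]

def trilinear_sample(mip_chain, u, v, lod):
    lower = min(int(lod), len(mip_chain) - 2)
    frac = lod - lower
    a = bilinear_sample(mip_chain, lower, u, v)      # fetch on TMU 0
    b = bilinear_sample(mip_chain, lower + 1, u, v)  # fetch on TMU 1
    return a + (b - a) * frac                        # blend across mips

# Toy 2-level mip chain: 2x2 base level and 1x1 top level.
mips = [[[0.0, 1.0], [1.0, 0.0]], [[0.5]]]
print(trilinear_sample(mips, 0.25, 0.25, 0.5))  # 0.25
```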
 
A 256-bit bus should easily be enough to feed an 8x2 architecture.

After all, a 128-bit bus is enough to feed today's 4x2 architectures, isn't it?

Anyway, if nVidia does put forward an 8x2 architecture, what we'll probably see is better non-aniso, non-FSAA performance than the R300, but more equivalent performance when either is enabled.
 
A 256-bit bus should easily be enough to feed an 8x2 architecture.

That, I disagree with. I don't think it's "easily" enough. I don't think the 128 bit bus is enough for the GeForce4 / Radeon 8500 pipelines. They seem to be bandwidth limited more often than not.

That being said, this doesn't mean an additional TMU would be useless, because you are not always bandwidth limited.

Based on R-300 performance, I'd say that the 256 bit bus seems to be about the right pairing for an 8*1 pipeline. And while I think it would get some performance boost from an additional TMU in certain situations, it's not at all clear to me that it would be worth the additional silicon cost.
 
Joe DeFuria said:
That, I disagree with. I don't think it's "easily" enough. I don't think the 128 bit bus is enough for the GeForce4 / Radeon 8500 pipelines. They seem to be bandwidth limited more often than not.

Well, since the GeForce4 almost never has much of a performance hit from enabling 2x FSAA, it should be pretty clear that, at least when FSAA is not enabled, there is enough memory bandwidth on this video card.

When FSAA is enabled, I agree, there certainly is not. It is for this reason that the use of an 8x2 pipeline is questionable (along with aniso speed, of course...), since most, if not all, of the performance increase would come when neither FSAA nor aniso is enabled.

But...imagine this. What if nVidia has a sort of framebuffer compression that stores only one color value to the framebuffer when FSAA is enabled, if that pixel is completely covered by the current triangle? If this were the case, even 8x FSAA could be implemented without too much of a memory bandwidth hit.

Quick note: This sort of algorithm wouldn't be as efficient as Matrox's FAA (which I still doubt could work 100% properly), since it has to track full pixel coverage for all triangles, not just edge triangles.
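
A minimal sketch of what such a scheme might look like (purely hypothetical guesswork, not a known NV30 feature): store one color for fully covered pixels, and all subsamples otherwise.

```python
# Hypothetical FSAA framebuffer compression (guesswork, not a known
# NV30 feature): when a triangle fully covers a pixel, all subsamples
# share one color, so write a single color plus a flag instead of N.

SAMPLES = 8  # e.g. 8x FSAA

def store_pixel(fb, x, y, subsample_colors, fully_covered):
    if fully_covered:
        fb[(x, y)] = ("compressed", subsample_colors[0])  # 1 write, not 8
    else:
        fb[(x, y)] = ("full", list(subsample_colors))     # edge pixel

def resolve_pixel(fb, x, y):
    tag, data = fb[(x, y)]
    return data if tag == "compressed" else sum(data) / SAMPLES

fb = {}
store_pixel(fb, 0, 0, [0.8] * SAMPLES, fully_covered=True)       # interior
store_pixel(fb, 1, 0, [1.0, 0.0] * (SAMPLES // 2), fully_covered=False)
print(resolve_pixel(fb, 0, 0), resolve_pixel(fb, 1, 0))  # 0.8 0.5
```

Interior pixels dominate most scenes, which is where the bandwidth savings would come from.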
 
Well, obviously 8x2 would require twice the texture cache.

Perhaps this isn't all that hard to do on a .13 micron process.
 
Well, I'm just attempting to point out that there are ways around memory bandwidth limitations other than increasing the memory bandwidth.
 
Of course....for all we know NV30 is a deferred renderer in the sense of PowerVR. On the other hand, NV30's memory architecture might end up being worse than what R-300 employs.

If you recall, you said "A 256-bit bus should easily be enough to feed an 8x2 architecture."

"Easily", in my mind, does not mean "let's think up some new ways comabt memory bandwidth, and if that is implemented, 256 bits would be enough..."
 
I think that in the future the idea of chips having 'separate TMUs' will ease away (or maybe it already is easing away).

Instead, you will have a resource pool (n texture reads per clock, m pixel shader operations per clock) that will be used on the available pixels in flight at any particular time.

The GF4 and R8500 are comparable to a Pentium-type architecture, where there was a U pipe that always ran and a V pipe that only ran if there was a pairable instruction (i.e. the second texture unit).

Future chips (I don't know to what extent either the R9700 or NV30 will do this) will likely move towards an Athlon / Pentium II+ type architecture, where the next piece of data that needs work is fed to the next available execution unit. This means fewer transistors sit around doing nothing and you get more bang for your buck.
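
As a toy illustration of the pooled model (hypothetical, not a description of any real chip): texture reads from all pixels in flight queue up, and each clock the free units take whatever is at the head of the queue.

```python
# Toy "resource pool" scheduler: n texture units per clock, fed from a
# shared queue of pending reads rather than hard-wired to pipelines.

from collections import deque

def run(reads, units_per_clock):
    clocks = 0
    while reads:
        for _ in range(min(units_per_clock, len(reads))):
            reads.popleft()  # a real chip would start this fetch now
        clocks += 1
    return clocks

# 5 pixels, 3 texture reads each, pool of 8 units:
reads = deque((pix, tex) for pix in range(5) for tex in range(3))
print(run(reads, 8))  # 2 clocks; no unit idles while work is queued
```

With TMUs bound to pipes, the odd third read per pixel would leave half the units idle for a clock; the pool soaks that up.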
 
Of course....for all we know NV30 is a deferred renderer in the sense of PowerVR. On the other hand, NV30's memory architecture might end up being worse than what R-300 employs.

Was that sarcastic?

I don't think NV30 will be anything apart from an IMR. As far as the memory controller is concerned, that remains to be seen.

Even the R300's efficiency hasn't been analyzed completely yet; the card has shown some highly impressive percentages so far, but I wouldn't call that a thorough analysis.

By the way, if it were a deferred renderer, why would there be a need for a second TMU anyway?
 
Well, considering what we know about NV30, assuming a 300+ MHz clock and 8 pixel pipelines x 2 texture units (I'm reaching on the clock speed, but I'd venture it's a safe guess that it won't be slower than an R300 in clock speed on .13um ;)), that means the fillrate will have at least doubled over the GeForce4 Ti4600. Without a corresponding increase in memory bandwidth, are we going to see something similar to the GeForce2 GTS? Or are we missing a vital piece of information?
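
The arithmetic behind that (the NV30 clock is an assumption, as in the post):

```python
gf4_ti4600 = 300e6 * 4 * 2   # 4 pipes x 2 TMUs at 300 MHz
nv30_guess = 300e6 * 8 * 2   # hypothetical 8x2 at the same clock
print(gf4_ti4600 / 1e9, nv30_guess / 1e9)  # 2.4 vs 4.8 Gtexels/s
# Fillrate doubles; without more bandwidth the chip could sit starved
# at peak, much like the GeForce2 GTS did against its memory.
```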
 
You actually need 8x2 (well, you could also do 16x1 or 4x4) to be fully ps.2.0 ready. Pixel shaders 2.0 require 16 different textures and 32 texture samples. Radeon 9700 can do 8 different textures and 16 texture samples (in ps.2.0 terms) per clock. This is where those 160 instructions come from: (16 texture address instructions + 64 arithmetic instructions) * 2 clocks = 160 instructions. So if they want to be ps.2.0 compliant, they could also allow 128 arithmetic instructions (which is beyond ps.2.0). You cannot fetch two different textures from the same TMU in the same cycle.
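
Taking those numbers at face value, the arithmetic works out like this:

```python
# Spelling out the claim above, using the post's own figures.
pipes, tmus = 8, 1                 # Radeon 9700: 8x1
fetches_per_clock = pipes * tmus   # 8 different textures per clock
textures_needed = 16               # ps.2.0 allows 16 different textures
clocks = textures_needed // fetches_per_clock          # = 2
instructions = (16 + 64) * clocks                      # = 160
print(clocks, instructions)
# An 8x2 (or 16x1, or 4x4) layout fetches 16 per clock instead,
# covering all 16 ps.2.0 textures in a single clock.
```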
 
Was that sarcastic?

Yes and no.

I also don't believe it will be anything but an IMR, but we don't know if nvidia implemented any "additional" bandwidth saving techniques beyond any evolutionary improvements over what they've already done.

By the way if it would be a deferred renderer, why would there be a need for a second TMU anyway?

Much the same reason why you want more TMUs for IMRs. Deferred rendering doesn't give you infinite texture read access. ;) More TMUs = more texture reads per clock.

For a simplistic example, an IMR with dual TMUs (and enough bandwidth to support it) could conceivably do true trilinear "for free" compared to a deferred renderer with one TMU.
 
ben6 said:
Well, considering what we know about NV30, assuming a 300+ MHz clock and 8 pixel pipelines x 2 texture units (I'm reaching on the clock speed, but I'd venture it's a safe guess that it won't be slower than an R300 in clock speed on .13um ;)), that means the fillrate will have at least doubled over the GeForce4 Ti4600. Without a corresponding increase in memory bandwidth, are we going to see something similar to the GeForce2 GTS? Or are we missing a vital piece of information?

Hmm, I just had a look at the Siggraph2002 OverviewOfGraphicsHardware.

Some things that Mark J. Kilgard mentions about future hardware (slide36 ->):

* More texture units (4 today, 16 soon)
- Huh? We have two per pipeline today. I cannot make sense of this...
* More effective early Z culling
* Extra fast stencil/depth only rendering

- So no NV30 as a deferred renderer in the sense of PowerVR architecture.

Interesting paper BTW.
 
* More texture units (4 today, 16 soon)
- Huh? We have two per pipeline today. I cannot make sense of this...
You can access 4 different textures on GeForce 3 & 4 in one shader. In DX 9 you can access up to 16 different textures.
 
MDolenc said:
You can access 4 different textures on GeForce 3 & 4 in one shader. In DX 9 you can access up to 16 different textures.

Yes, I know, but he was talking about texture units. Maybe it was just a typo though...
 
MDolenc said:
You actually need 8x2 (well you could do 16x1 or 4x4) to be fully ps.2.0 ready.

That is not true. 16 textures in one pass does not mean 16 textures in one clock. In fact, the different pixel pipelines of the R300 (or NV30) may never work together for multitexturing.
 
Chalnoth said:
That is not true. 16 textures in one pass does not mean 16 textures in one clock. In fact, the different pixel pipelines of the R300 (or NV30) may never work together for multitexturing.
Of course it does not. Radeon 9700 can provide 16 textures in one pass, but 8x2 hardware would be able to provide 16 textures in one clock.

Yes, I know, but he was talking about texture units. Maybe it was just a typo though...
He didn't say that was per pipeline... It's just how many units are available to the pixel shader.
 