G84/G86-TMUs - Quad or Octo?

AnarchX · Oct 3, 2007

Does G84 have four Quad-TMUs or, since NV just added 4 TAs here, two Octo-TMUs?

I ask because I saw some theoretical fill-rate tests in which G84 had trouble reaching its theoretical maximum, or even coming close to it, and I wondered why.

http://techreport.com/articles.x/12285/4
http://www.elitebastards.com/cms/in...sk=view&id=379&Itemid=27&limit=1&limitstart=4
In my opinion it cannot be a bandwidth limitation, because the difference between GT and GTS is only 15% at Techreport (closer to the 25% difference in core clock), although the GTS has 42% more bandwidth.
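
A quick sketch of the arithmetic behind that argument. The clocks are assumptions based on the commonly quoted reference specs (8600 GT: 540 MHz core, 700 MHz memory; 8600 GTS: 675 MHz core, 1000 MHz memory, same bus width on both), not values taken from the linked reviews; the 15% figure is the one quoted above.

```python
def pct_gain(faster, slower):
    """Percentage by which `faster` exceeds `slower`."""
    return (faster / slower - 1.0) * 100.0

# Assumed reference clocks in MHz (not taken from the linked reviews).
GT = {"core": 540.0, "mem": 700.0}
GTS = {"core": 675.0, "mem": 1000.0}

core_gain = pct_gain(GTS["core"], GT["core"])
# Same bus width assumed on both cards, so bandwidth scales with memory clock.
bw_gain = pct_gain(GTS["mem"], GT["mem"])
measured_gain = 15.0  # fill-rate gap quoted above from the Techreport test

print(f"core clock: +{core_gain:.0f}%")
print(f"bandwidth:  +{bw_gain:.0f}%")
print(f"measured:   +{measured_gain:.0f}%")
```

Under these assumptions the measured gap sits much closer to the core-clock gap than to the bandwidth gap, which is the point being made.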

Jawed · Oct 3, 2007

I suspect G84 is ALU-limited here. This may be because it can't interpolate the texture coordinates fast enough to keep the TMUs fed - or if you prefer there's too many TAs.

Dunno.

Jawed

fellix · Oct 3, 2007

Is the interpolator logic locked to the shader clock domain?

Jawed · Oct 4, 2007

fellix said:
Is the interpolator logic locked to the shader clock domain?

The shaders do the interpolation!

That's what all the fuss about the multifunction interpolator was before G80 appeared: the special function unit is joined at the hip to the interpolator:

http://forum.beyond3d.com/showthread.php?t=31854

Jawed

fellix · Oct 4, 2007

Anyway, my point is that, since the 163.71 driver with the new NVAPI, this limitation could be evaluated to some degree by scaling the shader clock without touching the rest of the chip.

Xmas · Oct 4, 2007

Jawed said:
I suspect G84 is ALU-limited here. This may be because it can't interpolate the texture coordinates fast enough to keep the TMUs fed - or if you prefer there's too many TAs.

That should be easy to test by using the same texture coordinates for multiple texture layers.

Jawed · Oct 4, 2007

Or NVidia will, soon, conveniently launch a GPU that has a much higher ALU:TEX ratio whilst also having a 1:1 TA:TF ratio?...

Jawed

fellix · Oct 4, 2007

If G84's TMUs are so heavily constrained by the interpolation rate, what could be done better for the 8800GT chip, if it uses the same cluster design (1:1 TA:TF)?

AnarchX · Oct 4, 2007

Actually, this question came up in a discussion about the 8800GT.

After some thought, and given Jawed's suspicion, I doubt that it will have a 1:1 ratio. That was probably just a decision for G84/G86, whose buyers do not use trilinear or anisotropic filtering so often.

Jawed · Oct 4, 2007

fellix said:
If G84's TMUs are so heavily constrained by the interpolation rate, what could be done better for the 8800GT chip, if it uses the same cluster design (1:1 TA:TF)?

There are three options:

1. clock the SPs far higher, e.g. 2.4GHz. G84's SPs run at 1.45GHz, while the core is 0.675GHz - a clocking ratio of 2.15:1. If this new chip were 2.4/0.8, say, then the clocking ratio would be 3:1 - not a huge improvement, but maybe enough?
2. double the width of an SP array - currently an SP array is 8-wide, so this would be 16-wide. This does make it significantly more complex because of the register file/texturing/constant-buffer wiring to route data to/from the SPs. It also makes life more difficult for the fine-grained scheduler's scoreboarding unit (e.g. keeping track of when a pixel's texture results have been returned by the TMU)
3. double the number of SP arrays within each cluster. This increases the complexity of coarse-grained scheduling within the GPU (deciding where a pixel is sent to be executed and then collating pixels when they've been fully shaded) and creates a different kind of wiring problem twixt SPs and texturing/constant-buffers.
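
A rough sketch of what the first option buys, using only the clocks quoted above. The `clock_ratio` helper is hypothetical, and treating the shader:core clock ratio as a proxy for ALU work available per texture clock is a simplifying assumption.

```python
def clock_ratio(shader_ghz, core_ghz):
    """Shader clocks available per core (texture) clock."""
    return shader_ghz / core_ghz

g84 = clock_ratio(1.45, 0.675)        # G84 as shipped
hypothetical = clock_ratio(2.4, 0.8)  # the 2.4/0.8 example above
improvement = hypothetical / g84      # relative gain in ALU clocks per TMU clock

print(f"G84:          {g84:.2f}:1")
print(f"hypothetical: {hypothetical:.2f}:1")
print(f"gain:         {improvement:.2f}x")
```

By the same simplified measure, doubling the array width or the number of arrays would give 2x at unchanged clocks, so the clock bump alone is the smallest of the three steps.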

I don't know which of 2 or 3 is more friendly towards implementing double-precision. We're expecting DP to be, at best, half the speed of single precision, so DP shouldn't, in itself, create a major increase in wiring complexity (between major function blocks). Erm...

I think 3 is the way to go, with a dose of 1 sprinkled on top, but we'll have to wait and see.

Of course it would be nice to see a review that investigated G84 texturing against clocks, to see if ALU:TEX ratio is getting in the way.

Jawed

PeterAce · Oct 4, 2007

Jawed said:
There are three options:

1. clock the SPs far higher, e.g. 2.4GHz. G84's SPs run at 1.45GHz, while the core is 0.675GHz - a clocking ratio of 2.15:1. If this new chip were 2.4/0.8, say, then the clocking ratio would be 3:1 - not a huge improvement, but maybe enough?

2. double the width of an SP array - currently an SP array is 8-wide, so this would be 16-wide. This does make it significantly more complex because of the register file/texturing/constant-buffer wiring to route data to/from the SPs. It also makes life more difficult for the fine-grained scheduler's scoreboarding unit (e.g. keeping track of when a pixel's texture results have been returned by the TMU)

3. double the number of SP arrays within each cluster. This increases the complexity of coarse-grained scheduling within the GPU (deciding where a pixel is sent to be executed and then collating pixels when they've been fully shaded) and creates a different kind of wiring problem twixt SPs and texturing/constant-buffers.

I don't know which of 2 or 3 is more friendly towards implementing double-precision. We're expecting DP to be, at best, half the speed of single precision, so DP shouldn't, in itself, create a major increase in wiring complexity (between major function blocks). Erm...

I think 3 is the way to go, with a dose of 1 sprinkled on top, but we'll have to wait and see.

Of course it would be nice to see a review that investigated G84 texturing against clocks, to see if ALU:TEX ratio is getting in the way.

Jawed

Your options sparked a flashback to December 2006, when I was puzzling over the following:

http://forum.beyond3d.com/showpost.php?p=891262&postcount=305

Xmas · Oct 4, 2007

fellix said:
If G84's TMUs are so heavily constrained by the interpolation rate, what could be done better for the 8800GT chip, if it uses the same cluster design (1:1 TA:TF)?

"So much constrained"? I fail to see how 70-76% of theoretical peak rate in a synthetic multitexturing test indicates a real bottleneck.
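
To put those percentages in absolute terms, here is a minimal sketch. It assumes the tested card is an 8600 GTS with 16 texture address units at a 675 MHz core clock, and one texel per unit per clock; neither the unit count nor the choice of card is stated in this thread.

```python
def peak_texel_rate(units, core_mhz):
    """Theoretical peak in Gtexels/s, assuming one texel per unit per core clock."""
    return units * core_mhz / 1000.0

# Assumed configuration: 16 texture address units at 675 MHz (8600 GTS).
peak = peak_texel_rate(16, 675.0)
low, high = 0.70 * peak, 0.76 * peak  # the 70-76% range mentioned above

print(f"peak:     {peak:.2f} Gtexels/s")
print(f"measured: {low:.2f} to {high:.2f} Gtexels/s")
```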

fellix · Oct 4, 2007

You will, when you look at the past 90% rates from the other architectures.

Xmas · Oct 4, 2007

Which are equally of very limited relevance in real games.
