G84/G86-TMUs - Quad or Octo?

AnarchX

Veteran
Does G84 have four Quad-TMUs or, since NV just added 4 TAs here, two Octo-TMUs?

I ask because I saw some theoretical fill-rate tests in which G84 had trouble reaching its theoretical maximum, or even coming close to it, and I wondered why.

http://techreport.com/articles.x/12285/4
http://www.elitebastards.com/cms/in...sk=view&id=379&Itemid=27&limit=1&limitstart=4
In my opinion it cannot be a bandwidth limitation, because the difference between GT and GTS is only 15% at Techreport (closer to the 25% difference in core clock), although the GTS has 42% more bandwidth.
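
To put rough numbers on that argument, here is a minimal Python sketch using only the three percentages quoted above. The function name is made up for illustration, and dividing one percentage by another is only a crude indicator of which resource the result tracks.

```python
# Gains of the GTS over the GT, as quoted in the post above.
measured_gain = 0.15    # fill-rate difference seen at Techreport
core_clock_gain = 0.25  # core-clock difference
bandwidth_gain = 0.42   # memory-bandwidth difference


def fraction_realised(measured, available):
    """Share of an available resource increase that shows up in the result."""
    return measured / available


vs_core = fraction_realised(measured_gain, core_clock_gain)
vs_bandwidth = fraction_realised(measured_gain, bandwidth_gain)

print(f"vs core clock: {vs_core:.2f}")
print(f"vs bandwidth:  {vs_bandwidth:.2f}")
```

The measured gain is a larger share of the core-clock increase than of the bandwidth increase, which fits the argument that bandwidth is not the main limiter, but it does not track either one fully.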
 
I suspect G84 is ALU-limited here. This may be because it can't interpolate the texture coordinates fast enough to keep the TMUs fed - or if you prefer there's too many TAs.

Dunno.

Jawed
 
Anyway, my point is that, since the 163.71 driver with the new NVAPI, this limitation could be evaluated to some degree by scaling the shader clock without touching the rest of the chip.
 
I suspect G84 is ALU-limited here. This may be because it can't interpolate the texture coordinates fast enough to keep the TMUs fed - or if you prefer there's too many TAs.
That should be easy to test by using the same texture coordinates for multiple texture layers.
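
The logic of that test can be sketched with a toy two-stage bottleneck model in Python. Everything here is an assumption for illustration: the function, the model itself, and especially the two per-clock rates, which are hypothetical and not measured G84 figures.

```python
def pixels_per_clock(layers, unique_coord_sets, tex_per_clock, interp_per_clock):
    """Pixel rate when either texturing or coordinate interpolation is the limit."""
    tex_limit = tex_per_clock / layers                    # one fetch per layer
    interp_limit = interp_per_clock / unique_coord_sets   # one interpolation per coordinate set
    return min(tex_limit, interp_limit)


# Hypothetical rates, chosen only so that interpolation is the tighter limit.
TEX_PER_CLOCK = 16.0
INTERP_PER_CLOCK = 12.0

# Four texture layers, each with its own coordinates.
separate = pixels_per_clock(4, 4, TEX_PER_CLOCK, INTERP_PER_CLOCK)
# Four texture layers sharing a single coordinate set.
shared = pixels_per_clock(4, 1, TEX_PER_CLOCK, INTERP_PER_CLOCK)

print(separate, shared)
```

If the real chip behaves like this model, the shared-coordinate case should get noticeably closer to the theoretical texel rate; if both cases come out the same, interpolation is probably not the limit.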
 
Or NVidia will, soon, conveniently launch a GPU that has a much higher ALU:TEX ratio whilst also having a 1:1 TA:TF ratio?...

Jawed
 
If G84's TMUs are so much constrained by the interpolation rate, what could be done better for the 8800GT SKU chip, if it uses the same cluster design (1:1 TA:TF)?
 
Actually, this question came up in a discussion about the 8800GT. ;)

After some thought, and given Jawed's suspicion, I doubt that it will have a 1:1 ratio. That was probably just a decision for G84/G86, whose buyers do not use trilinear or anisotropic filtering as often.
 
If G84's TMUs are so much constrained by the interpolation rate, what could be done better for the 8800GT SKU chip, if it uses the same cluster design (1:1 TA:TF)?
There are three options:
  1. clock the SPs far higher, e.g. 2.4GHz. G84's SPs run at 1.45GHz, while the core is 0.675GHz - a clocking ratio of 2.15:1. If this new chip were 2.4/0.8, say, then the clocking ratio would be 3:1 - not a huge improvement, but maybe enough?
  2. double the width of an SP array - currently an SP array is 8-wide, so this would be 16-wide. This does make it significantly more complex because of the register file/texturing/constant-buffer wiring to route data to/from the SPs. It also makes life more difficult for the fine-grained scheduler's scoreboarding unit (e.g. keeping track of when a pixel's texture results have been returned by the TMU)
  3. double the number of SP arrays within each cluster. This increases the complexity of coarse-grained scheduling within the GPU (deciding where a pixel is sent to be executed and then collating pixels when they've been fully shaded) and creates a different kind of wiring problem twixt SPs and texturing/constant-buffers.
I don't know which of 2 or 3 is more friendly towards implementing double-precision. We're expecting DP to be, at best, half the speed of single precision, so DP shouldn't, in itself, create a major increase in wiring complexity (between major function blocks). Erm...

I think 3 is the way to go, with a dose of 1 sprinkled on top, but we'll have to wait and see.
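
For a rough feel of how the three options compare, here is a small Python sketch built on the clocks quoted above. It assumes the TMUs run at core clock and that the TMU count per cluster stays unchanged; the function names and the "relative ALU per TEX" measure are made up for illustration.

```python
def clock_ratio(shader_ghz, core_ghz):
    """Shader-clock to core-clock ratio."""
    return shader_ghz / core_ghz


g84_ratio = clock_ratio(1.45, 0.675)  # G84 clocks from the post
new_ratio = clock_ratio(2.4, 0.8)     # the hypothetical 2.4/0.8 chip


def relative_alu_per_tex(ratio, sp_multiplier=1):
    """ALU rate per TMU clock relative to G84 (assumes an unchanged TMU count)."""
    return ratio * sp_multiplier / g84_ratio


options = {
    "1: higher SP clock": relative_alu_per_tex(new_ratio),
    "2: 16-wide SP array": relative_alu_per_tex(g84_ratio, 2),
    "3: twice the SP arrays": relative_alu_per_tex(g84_ratio, 2),
    "3 plus 1": relative_alu_per_tex(new_ratio, 2),
}

for name, gain in options.items():
    print(f"{name}: {gain:.2f}x")
```

Under these assumptions option 1 alone gives roughly 1.4x, while 2 or 3 give 2x before any clock change, which is why 1 on its own is "not a huge improvement".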

Of course it would be nice to see a review that investigated G84 texturing against clocks, to see if ALU:TEX ratio is getting in the way.

Jawed
 

Your options sparked a flashback to December 2006 when I was puzzling about the following :

http://forum.beyond3d.com/showpost.php?p=891262&postcount=305
 
If G84's TMUs are so much constrained by the interpolation rate, what could be done better for the 8800GT SKU chip, if it uses the same cluster design (1:1 TA:TF)?
"So much constrained"? I fail to see how 70-76% of theoretical peak rate in a synthetic multitexturing test indicates a real bottleneck.
 
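For scale, a quick Python sketch of what 70-76% of peak means in absolute terms. It assumes 16 texture address units (the four-quad or two-octo count from the opening question), one bilinear texel per TA per clock, and the 0.675GHz core clock mentioned earlier in the thread; the variable names are illustrative.

```python
TA_UNITS = 16      # assumed: four quad-TMUs or two octo-TMUs
CORE_GHZ = 0.675   # G84 core clock quoted earlier in the thread

# Theoretical peak in Gtexels/s, assuming one texel per TA per clock.
peak = TA_UNITS * CORE_GHZ

low = 0.70 * peak
high = 0.76 * peak

print(f"peak: {peak:.2f} Gtexels/s")
print(f"70-76% of peak: {low:.2f} to {high:.2f} Gtexels/s")
```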