"Free Trilinear" on G80

DavidC

Sorry if a similar thread was made before, but:

I was looking through Archmark results and noticed the Geforce 8800GTX scores: http://www.beyond3d.com/forum/showpost.php?p=866340&postcount=39

When Geeforcer said "free trilinear" is possible on the G80, how did he get that conclusion?? Is it because Bilinear/Trilinear scores are very similar to each other??

I can't help but show you guys this link: http://www.forum-3dcenter.org/vbulletin/showthread.php?t=321049

Look on the first page for the Archmark results. It's for the G965/GMA X3000. On the textured fillrate score, it seems to behave similarly to the G80, scoring about the same on both bilinear and trilinear. Now I don't really know what that means, but maybe you guys do. Does it mean the G965 also has "free" trilinear??
 
I would like to see results with newer drivers, because those aren't too "coherent", let us say. From a certain point of view, this would look like free trilinear. But I think the X3000 doesn't do trilinear at all in early drivers (it forces everything to bilinear), so that might be a simpler and much better explanation for this. It does warrant some verification, though! Does anyone have one to retest? :)


Uttar
 
Given that the G80 has "twice the filtering power" per TMU compared to GeForce 6, 7 etc. (almost all other GPUs, actually), it's not surprising. Likewise, the GeForce 1 and some S3 card also have free trilinear.
 
You can have 'free' trilinear when you have enough filtering power but something else is the bottleneck.
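
A rough way to picture this is a toy throughput model in which a texture fetch costs whichever is larger: the filtering time, or the time of whatever else limits the pipeline. The function name, the `other_cycles` parameter and the numbers below are illustrative assumptions, not measured G80 behaviour.

```python
def cycles_per_fetch(bilerps_needed, bilerps_per_clock, other_cycles=1.0):
    """Toy model: filtering overlaps with other work, so the slower stage sets the pace.

    other_cycles is a hypothetical stand-in for the non-filtering bottleneck
    (texture addressing, shader math, and so on).
    """
    filter_cycles = bilerps_needed / bilerps_per_clock
    return max(filter_cycles, other_cycles)


BILINEAR, TRILINEAR = 1, 2  # bilinear interpolations needed per sample

# Filter unit doing one bilerp per clock: trilinear takes twice as long.
single = (cycles_per_fetch(BILINEAR, 1), cycles_per_fetch(TRILINEAR, 1))

# Filter unit doing two bilerps per clock: both modes hit the other bottleneck.
double = (cycles_per_fetch(BILINEAR, 2), cycles_per_fetch(TRILINEAR, 2))

print(single, double)
```

In the second case trilinear looks "free" only because bilinear leaves half of the filtering capacity idle.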
 
Yeah all that does is make me nervous that we're not getting enough speed out of bilinear. In a glass half empty sort of way, "free trilinear" just means "inefficient bilinear" ;).

Of course as noted by Nick, texture reads/filtering can be "free" if you're doing enough math, geometry, etc.
 
Yeah all that does is make me nervous that we're not getting enough speed out of bilinear. In a glass half empty sort of way, "free trilinear" just means "inefficient bilinear" ;)
Well, it means fewer bilinears per mm^2, but more trilinears per mm^2.
Assuming they are implementing this by double-pumping the filtering units (which we can't be certain of, but it'd make sense), rather than doubling their count, you'd expect this to be more of a "free bonus" though. But it does remain to be seen how they are actually implementing it!


Uttar
 
But it does remain to be seen how they are actually implementing it!
Agreed, but trilinear is simply more work (both memory and math), so if it's going the same speed as bilinear, it seems to me that bilinear could have been made to go faster. It's disappointing when there's hardware specific to one task that is apparently going idle otherwise. I have the same complaint with "free MSAA", but in that case there's some compression arguments that come into play which somewhat legitimize it.
 
Agreed, but trilinear is simply more work (both memory and math)
It's not free in terms of memory bandwidth, though, and neither is MSAA. As for math, IF this is achieved by custom designing the unit and double-pumping it, it's fairly cheap, relatively speaking.
And no matter how you achieve it, the idea is the same: if doubling the filtering power of the TMU costs X, while just doubling the number of TMUs costs 3X, it makes sense to do the former.
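
To make the X versus 3X argument concrete, here is a small sketch that compares samples per clock per unit of area for the two options. It reads "doubling the number of TMUs costs 3X" as meaning one whole TMU costs 3X; that reading, and every name in the code, is an assumption made for illustration.

```python
def rate_per_area(samples_per_clock, area):
    """Throughput normalised by silicon cost (both in arbitrary units)."""
    return samples_per_clock / area


TMU_AREA = 3.0      # assumed cost of one whole TMU, in units of X
EXTRA_FILTER = 1.0  # assumed cost of doubling one TMU's filtering power

# Option A: one TMU with doubled filtering.
# It delivers 1 bilinear or 1 trilinear sample per clock.
area_a = TMU_AREA + EXTRA_FILTER
bilinear_a = rate_per_area(1, area_a)
trilinear_a = rate_per_area(1, area_a)

# Option B: two plain TMUs.
# Together they deliver 2 bilinear samples per clock, or 1 trilinear.
area_b = 2 * TMU_AREA
bilinear_b = rate_per_area(2, area_b)
trilinear_b = rate_per_area(1, area_b)

print(bilinear_a, bilinear_b)    # option B wins on bilinear per area
print(trilinear_a, trilinear_b)  # option A wins on trilinear per area
```

Under these assumed costs the result matches the "fewer bilinears per mm^2, but more trilinears per mm^2" summary.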

Remember that the G80 doesn't just double trilinear speed. It doubles filtering speed in general. From that point of view, "either trilinear or 2x AF or FP16" is free, math-wise. The only real-world case where it's going to waste is when the (non-FP16) texture is being magnified. Arguably, that's a case we want to encourage developers to move away from, anyway! :)


Uttar
 
I edited the thread title. Not that I have something against neologisms, but "trilinearing" was maybe too edgy for most of us. ;)
 
Agreed, but trilinear is simply more work (both memory and math), so if it's going the same speed as bilinear, it seems to me that bilinear could have been made to go faster. It's disappointing when there's hardware specific to one task that is apparently going idle otherwise. I have the same complaint with "free MSAA", but in that case there's some compression arguments that come into play which somewhat legitimize it.
I was thinking along the same lines as you. If you've doubled the buses to the memory and shader units, you already have to access different locations for the mipmaps, and also doubled the filtering units, why not generalize it? But then I realized a few things which I posted in the G80 architecture thread.

You need double the pixels in flight (via increased thread count and/or batch size) if you're going to double the number of textures you fetch per clock (at the same efficiency). This requires quite a bit of hardware, including double the register space.

I think it was a good design decision. 32 G80-style TMUs are as good as 64 traditional TMUs for volume textures, 64- and 128-bit textures, and ordinary textures wherever two or more mipmaps are needed, so all the heavier texturing loads are twice as fast.
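
The point about pixels in flight follows from Little's law: to keep a pipeline busy, the work in flight has to equal throughput times latency. A minimal sketch, where the latency, registers per pixel and bytes per register are all made-up numbers chosen only to show the scaling:

```python
def pixels_in_flight(fetches_per_clock, latency_clocks):
    """Little's law: work in flight = throughput * latency."""
    return fetches_per_clock * latency_clocks


def register_bytes(pixels, regs_per_pixel=8, bytes_per_reg=16):
    """Register space needed to hold state for every in-flight pixel."""
    return pixels * regs_per_pixel * bytes_per_reg


LATENCY = 200  # hypothetical texture fetch latency, in clocks

base = pixels_in_flight(32, LATENCY)     # 32 fetches per clock
doubled = pixels_in_flight(64, LATENCY)  # 64 fetches per clock

print(base, doubled)
print(register_bytes(base), register_bytes(doubled))
```

Whatever the real latency is, doubling the fetch rate at the same efficiency doubles both the pixels in flight and the register space behind them.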
 
Who cares nowadays if trilinear is free or not? Or to rephrase that question who on God's green earth would use on a G80 just trilinear and not high quality AF straight away?

Trilinear vs. bilinear AF isn't "free" in the strict sense, since one might find cases where the drop between them can account for something less than 20%; and that's also the reason why the trilinear optimisation still exists in the driver.

Even worse why would I care for improved bilinear or even worse bilinear AF performance? Because I'm fond of blurry mipmaps in the first case or mipmap banding in the second as simple examples?
 
You need double the pixels in flight (via increased thread count and/or batch size) if you're going to double the number of textures you fetch per clock (at the same efficiency). This requires quite a bit of hardware, including double the register space.
Hmm, yeah that's a good point that I hadn't thought of.

It'd be nice to have super-efficient point sampling, or perhaps something like Fetch4 but more flexible (i.e. support multi-component textures). From my messing around it seems that some precision is lost by using hardware interpolation with fp32 (not sure why... any ideas?), and furthermore sometimes you want to grab the four samples, evaluate some function and *then* weight the result. It'd be nice if this was as fast - or nearly as fast - as using "hardware" filtering.

Ailuros said:
Even worse why would I care for improved bilinear or even worse bilinear AF performance? Because I'm fond of blurry mipmaps in the first case or mipmap banding in the second as simple examples?
Because there's a lot of things that you can do with interpolation that aren't just "filter diffuse texture using interpolated polygon texture coordinates". For some of those, some subset of {AF,trilinear,bilinear} is inapplicable. The example that comes to mind is using bilinear filtering to do post-processing... trilinear and aniso usually don't make sense here. Another example is summed-area tables in which bilinear is useful, but again not trilinear or AF.
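
As an illustration of the post-processing case, the sketch below uses a single bilinear sample, placed exactly between four texels, to get their average. That is a 2x2 box downsample with one fetch per output pixel, and with only one mip level involved there is nothing for trilinear or AF to add. The `bilinear` and `downsample_2x` helpers are a software stand-in written for this example, not any real API.

```python
import math


def bilinear(tex, x, y):
    """Sample a 2D list of floats; texel centres sit at integer coordinates."""
    h, w = len(tex), len(tex[0])
    x0, y0 = int(math.floor(x)), int(math.floor(y))
    x1, y1 = min(x0 + 1, w - 1), min(y0 + 1, h - 1)  # clamp at the edge
    fx, fy = x - x0, y - y0
    top = tex[y0][x0] * (1 - fx) + tex[y0][x1] * fx
    bottom = tex[y1][x0] * (1 - fx) + tex[y1][x1] * fx
    return top * (1 - fy) + bottom * fy


def downsample_2x(tex):
    """Halve the resolution with one bilinear sample per output pixel."""
    return [[bilinear(tex, 2 * i + 0.5, 2 * j + 0.5)
             for i in range(len(tex[0]) // 2)]
            for j in range(len(tex) // 2)]


image = [[1.0, 3.0, 0.0, 4.0],
         [5.0, 7.0, 0.0, 4.0]]
half = downsample_2x(image)
print(half)  # each output value is the mean of a 2x2 block
```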
 
From my messing around it seems that some precision is lost by using hardware interpolation with fp32 (not sure why... any ideas?)
AFAIK, because the four texels are scaled to match the highest exponent of the four, then interpolated.
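
If that explanation is right, the effect can be imitated in software. The sketch below is only a simulation of the idea, not a description of any actual hardware: it snaps every texel to a fixed-point grid set by the largest exponent, with an assumed 24-bit mantissa, and then interpolates. The `align_to_max_exponent` helper and its `mantissa_bits` parameter are hypothetical.

```python
import math


def align_to_max_exponent(texels, mantissa_bits=24):
    """Quantise texels to a grid whose step is set by the largest exponent."""
    exps = [math.frexp(t)[1] for t in texels if t != 0.0]
    if not exps:
        return list(texels)  # all zeros: nothing to align
    step = math.ldexp(1.0, max(exps) - mantissa_bits)
    return [round(t / step) * step for t in texels]


def lerp(a, b, t):
    return a * (1 - t) + b * t


texels = [1024.0, 1e-5]  # one large and one tiny neighbour
aligned = align_to_max_exponent(texels)

exact = lerp(texels[0], texels[1], 1.0)            # weight fully on the tiny texel
hardware_like = lerp(aligned[0], aligned[1], 1.0)  # the tiny texel was flushed away

print(exact, hardware_like)
```

Colour data rarely mixes magnitudes like this, which fits the remark that trivial cases are unaffected.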
 
AFAIK, because the four texels are scaled to match the highest exponent of the four, then interpolated.
I see, that would explain it. Makes it a bit less useful for fp filtering then (other than trivial cases like colour data)... Still, better to have it than not :) It'd definitely be nice to have a fast point sampling path, although the current speed is certainly "not bad".
 
Because there's a lot of things that you can do with interpolation that aren't just "filter diffuse texture using interpolated polygon texture coordinates". For some of those, some subset of {AF,trilinear,bilinear} is inapplicable. The example that comes to mind is using bilinear filtering to do post-processing... trilinear and aniso usually don't make sense here. Another example is summed-area tables in which bilinear is useful, but again not trilinear or AF.

One more annoying L-question and then I'll shrug back into my corner: I'm under the impression that bilinear has been free for years now; how would one theoretically speed up bilinear in that sense? (These are honest questions, for the record.)
 
One more annoying L-question and then I'll shrug back into my corner: I'm under the impression that bilinear has been free for years now; how would one theoretically speed up bilinear in that sense? (These are honest questions, for the record.)
Bilinear filtering is only free (even when it's the base filter) when conditions to make it so prevail. Further, Andy's not talking about it being more 'free', he just wants more of it (more bilerps to burn).
 
AFAIK, because the four texels are scaled to match the highest exponent of the four, then interpolated.
This is basically what happens every time there's an fp calculation going on (more or less), so in the end it's just a matter of implementation accuracy.
The loss of precision isn't something they can't avoid, though (by burning more transistors ;) ).
 
Further, Andy's not talking about it being more 'free', he just wants more of it (more bilerps to burn).
Yeah, I just hate having hardware that goes unused... this is a lot better in G80 with unified shaders and scalar processors, but it can always get better :)

The thing that I really want with respect to filtering is for "software" filtering (trilinear, bilinear, or even aniso) to be *almost* as efficient as "hardware" filtering. As it is, it's quite disappointing to "fall off the fast path" if you need to do something as trivial as - for example - percentage closer filtering. Now of course bilinear PCF has a special hardware implementation (with special texture types and sampling functions), but that's entirely my point!

ATI's Fetch4 is a good start... giving you back the four interpolants at the same speed as a bilinear. As I mentioned though, it's too limited.

I'm willing to accept some speed decrease from doing filtering "manually", but it shouldn't be four times slower (for bilinear). It'd be nice to be able to do more complicated things that involve many (coherent, offset) texture samples without losing the benefit of the hardware filtering "fast path".
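
Percentage closer filtering shows why "evaluate some function and *then* weight the result" cannot be replaced by an ordinary filtered fetch. In the sketch below, the four depths stand for four point samples, and the helper names and values are made up for illustration.

```python
def weight4(s00, s10, s01, s11, fx, fy):
    """Bilinear weighting of four values with fractional offsets fx, fy."""
    top = s00 * (1 - fx) + s10 * fx
    bottom = s01 * (1 - fx) + s11 * fx
    return top * (1 - fy) + bottom * fy


def filter_then_compare(depths, fx, fy, ref):
    """What a plain bilinear fetch gives: blend the depths, then test once."""
    return 1.0 if weight4(*depths, fx, fy) >= ref else 0.0


def pcf(depths, fx, fy, ref):
    """Percentage closer filtering: test each texel, then blend the results."""
    lit = [1.0 if d >= ref else 0.0 for d in depths]
    return weight4(*lit, fx, fy)


depths = (0.2, 0.8, 0.2, 0.8)  # four point-sampled shadow map texels
print(filter_then_compare(depths, 0.5, 0.5, ref=0.6))  # hard 0 or 1
print(pcf(depths, 0.5, 0.5, ref=0.6))                  # soft in-between value
```

The math in `pcf` is trivial; the cost being complained about is the four separate point-sampled reads needed to get `depths` in the first place.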
 
As ALU:TEX ratios increase, then software filtering becomes more bearable, I guess.

I was rather hoping that we'd have started seeing the end of the fixed-function TMU and ROP pipelines this generation (at least the blending), but well, it seems like that's still a few years off.

Jawed
 
As ALU:TEX ratios increase, then software filtering becomes more bearable, I guess.
Yeah the math doesn't hurt at all on the G80 or R580, it's just that four point-sampled texture reads are close to four times slower than one bilinear. There are some reasons that I can understand for this to be the case (like the increased register pressure, etc. noted above), but it'd be nice to be more like 10% slower rather than 400%.
 