I don't have any specific numbers to hand, but a year or so back, when I had to implement filtering in a shader for the ATI path (using FP16 textures on an ATI 9800, versus the NV 6600's hardware implementation) using techniques discussed in this thread, I didn't see such bad performance. ISTR the ATI path was within 2-3 Hz of the NV path - admittedly it's hardly a scientific comparison! Low-end hardware might well have masked it (I actually killed the 9800 with this project in the end).
A lot of 9800Pros died because the fan failed... I had one that did that. It caused very very strange behaviour in Excel when it was on its last legs...
FWIW, a few people I've spoken to who would have some influence on pushing programmable blending said it's really not a priority. I'm not sure I agree (but haven't thought about it much), but they argue that it's such a big change for a relatively small class of algorithms - most uses could be shoe-horned onto existing blend ops (e.g. use simple additive blending but abstract it out to f(x) = g(x) + h(x) - compute the complex functions g and h in shaders and then use the simple FF addition to composite...).
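To make that shoe-horning concrete, here's a minimal sketch of the decomposition in Python (purely illustrative - the function names and values are made up, and in real hardware h(dst) would need a separate pass or a texture read of the framebuffer, since the shader can't read dst directly):

```python
def g(src):
    # "shader" computes an arbitrary function of the incoming fragment
    return src * src            # e.g. some non-linear tone curve

def h(dst):
    # a second pass computes an arbitrary function of the framebuffer value
    return 0.5 * dst

def fixed_function_add(a, b):
    # the only blend op the ROP provides: simple saturating addition
    return min(a + b, 1.0)

# The "complex" blend F(src, dst) = g(src) + h(dst) is realised by
# evaluating g and h in shaders, then compositing with the simple FF add:
result = fixed_function_add(g(0.5), h(0.5))
```

The point being that any blend expressible as a sum of per-input terms fits the existing fixed-function pipeline; it's the blends that need arbitrary read-modify-write of the destination that don't.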
My main motivation in saying that was thinking of all that computing power sitting idle when there's no texture filtering/blending being done.
But there's the precision problem: there's hardly any FP32 texture filtering capability in current GPUs - it's mainly int8 - so maybe generalising it like this would cost too much in routing/scheduling etc.
Then again, with int16 and int32 as part of the ALU pipeline in D3D10, the common ground between the ALU pipeline and the TMU pipeline (which needs to filter int and FP 32-bit formats) seems to have increased.
There's also density, though - I assume for a given level of performance a fixed function filtering/blending pipeline will just be smaller.
Still, I can't help thinking there must come a cross-over point, based perhaps on available bandwidth versus bilinear filtering capability. Beyond that point the fixed-function bilinear pipeline's utility will simply tail off.
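That cross-over is easy to sketch as back-of-envelope arithmetic. All the figures below are made-up assumptions for illustration, not data from the thread - the point is just that once the bus can't feed the bilinear units, the extra fixed-function capability goes unused:

```python
# When does memory bandwidth, rather than the fixed-function bilinear
# units, bound filtering throughput? (Assumed figures throughout.)

bandwidth_gb_s    = 64.0   # assumed memory bandwidth
texel_bytes       = 8      # FP16 RGBA texel
texels_per_bilerp = 4      # 2x2 footprint, pessimistically ignoring cache reuse

# bilinear fetches per second the bus could feed (in G-bilerps/s):
bw_limited_gtexels = bandwidth_gb_s / (texel_bytes * texels_per_bilerp)

tmus      = 16             # assumed number of bilinear units
clock_ghz = 0.5            # assumed core clock
ff_limited_gtexels = tmus * clock_ghz   # one bilerp per TMU per clock

print(bw_limited_gtexels, ff_limited_gtexels)  # 2.0 vs 8.0: bandwidth-bound
```

With these (invented) numbers the fixed-function units could do 4x more bilerps than the bus can supply texels for - which is the tail-off: past the cross-over, adding bilinear hardware buys nothing, while general ALUs could at least do other work.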
I admit, texture addressing, LODding, biasing, and filtering/blending maths is stuff I always trip up on.
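For reference, the bilinear part at least is small - here's a sketch of a reference bilinear filter in Python (clamp-to-edge addressing assumed; texture and coordinates are illustrative, and real hardware obviously does this in fixed point with cache-aware fetching):

```python
import math

def bilinear(tex, u, v):
    """Reference bilinear filter; tex is a row-major 2D list, u/v in texel space."""
    w, h = len(tex[0]), len(tex)
    x0, y0 = int(math.floor(u)), int(math.floor(v))
    fx, fy = u - x0, v - y0          # fractional weights

    def texel(x, y):
        # clamp-to-edge addressing for the sketch
        return tex[max(0, min(h - 1, y))][max(0, min(w - 1, x))]

    # lerp horizontally across the 2x2 footprint, then vertically
    top = texel(x0, y0)     * (1 - fx) + texel(x0 + 1, y0)     * fx
    bot = texel(x0, y0 + 1) * (1 - fx) + texel(x0 + 1, y0 + 1) * fx
    return top * (1 - fy) + bot * fy
```

Three lerps per sample - which is exactly why a dedicated unit is so dense, and why emulating it in the ALUs (plus the four addressed fetches) looks expensive by comparison.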
Maybe the argument is more pertinent to ROPs? I guess ROPs have a more uneven workload, specifically blending and Z testing. Isn't Z testing within a shader program the holy grail? (Admittedly, with dire parallelism consequences, i.e. read-after-write conflicts.) Wasn't this the thrust of David Kirk's argument against ROPs that support AA + FP filtering - that they're too expensive for their utility - and that programmable output merge would come?...
Jawed