RV870 texture filtering

If only the attribute interpolation is moved to the shaders, that makes more sense. In that case Dave should really be a bit more elaborate and sensitive to context, though ... the topic talks about texture filtering, after all. Hell, not all attributes are used for texture access. If this turns out to be true it's really a bit of a faux pas ... "texture interpolation" is a mismatch of terms: it's not an accurate way of referring to attribute interpolation, and it very strongly suggests filtering (anisotropic filtering strictly speaking isn't interpolation, but bilinear filtering is ... and bilinear is the basis for everything else).
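For reference, this is all bilinear does: blend the four nearest texels by the fractional parts of the texture coordinate. A minimal scalar sketch in C, with made-up inputs:

```c
#include <stdio.h>

/* Bilinear filtering: blend the four texels surrounding the sample
   point, weighted by the fractional parts of the texture coordinate. */
static float bilinear(float t00, float t10, float t01, float t11,
                      float fu, float fv)
{
    float top    = t00 + fu * (t10 - t00); /* lerp along u, upper row */
    float bottom = t01 + fu * (t11 - t01); /* lerp along u, lower row */
    return top + fv * (bottom - top);      /* lerp along v */
}

int main(void)
{
    /* sample 30% of the way across and 60% down between four texels */
    printf("%f\n", bilinear(0.0f, 1.0f, 0.0f, 1.0f, 0.3f, 0.6f));
    return 0;
}
```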

Filtering is done on the TUs, attribute interpolation is done on the ALUs. Would be pretty suicidal to do otherwise, I think.
 
Well with specialized instructions it could be made fast enough ... but unless you could overlap it with normal shading instructions in some way it would leave the FP unused, so it doesn't make a lot of sense.
 

Wavey was simply overcome with the joy of launching the 5870, hence his wording :D
 
Eventually, D3D will require FP32 interpolation precision and AMD wins!
Sorry for being somewhat misleading. I was excited that texture filtering precision will eventually increase.

It's 8-bit for the texture coordinate fractional part according to the CUDA Programming Guide, probably higher on real chips, so there are at least 24x9 hardware multipliers.
Going to 24x24 isn't really a big deal, ~2.5x increase in transistor count, and it can be compensated by an increase in overall utilization of the ALUs.
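To make the 9-bit weight point concrete: with 8 fractional bits the two weights per axis span 0..256, which takes 9 bits to represent. A minimal sketch of the quantization (illustrative only; the real datapath is undocumented):

```c
#include <stdint.h>
#include <stdio.h>

/* Texture coordinate fraction quantized to 8 bits, as the CUDA
   Programming Guide describes for filtering weights (1.8 fixed point).
   Illustrative only; real chips may carry more bits internally. */
int main(void)
{
    float    u    = 12.3456f;                /* coordinate in texel space */
    uint32_t fx   = (uint32_t)(u * 256.0f);  /* 8 fractional bits */
    uint32_t frac = fx & 0xFF;               /* 8-bit fractional part */
    uint32_t w_hi = frac;                    /* weight of the upper texel */
    uint32_t w_lo = 256 - frac;              /* 0..256 needs 9 bits */
    printf("frac=%u w_lo=%u w_hi=%u\n", frac, w_lo, w_hi);
    return 0;
}
```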
 
I don't quite get why this sort of interpolation would be faster, as it seems to be limited by bandwidth and shader calculations.

For true trilinear you need 8 texels, which would imply the filter rate would drop to 68 / 2 GTex/s because of bandwidth limitations. Some of the tests don't show this drop in performance, so I don't see how that is possible.
This new way of filtering also seems to imply that all anisotropic filtering is done in the shaders.
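For context, trilinear is just two bilinear lookups in adjacent mip levels blended by the fractional LOD, which is where the 8 texels (and the potential halving of bilinear rate) come from. A sketch reusing the `bilinear` helper from the example further up (inputs are placeholders):

```c
/* Trilinear = bilinear on mip N + bilinear on mip N+1 + one lerp:
   2 x 4 = 8 texels per filtered result, hence half the bilinear rate
   if texel bandwidth is the limit. */
static float trilinear(const float n0[4], const float n1[4],
                       float fu0, float fv0,   /* fractions in mip N   */
                       float fu1, float fv1,   /* fractions in mip N+1 */
                       float lod_frac)
{
    float a = bilinear(n0[0], n0[1], n0[2], n0[3], fu0, fv0);
    float b = bilinear(n1[0], n1[1], n1[2], n1[3], fu1, fv1);
    return a + lod_frac * (b - a);
}
```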

Per SIMD group (16 x 5 ALUs), 64 texture bytes can be fetched from the L1 cache per clock cycle, or 16 rgba 32-bit texels: enough data for bilinear filtering of 4 rgba texels.
To me it seems at least 4 shader instructions are needed per rgba texel to do bilinear filtering. This would mean that all shader ALUs would be fully utilized, leaving no room for real shader calculations...
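To sanity-check that 4-instruction estimate (my own back-of-the-envelope, assuming the TU delivers four texels plus ready-made weights): on a vec4 ALU the weighted sum is one MUL and three MADs.

```c
/* Weighted-sum formulation of bilinear on 4-wide vectors:
   result = t00*w00 + t10*w10 + t01*w01 + t11*w11
   = 1 vec4 MUL + 3 vec4 MADs = 4 wide instructions per rgba texel. */
typedef struct { float r, g, b, a; } vec4;

static vec4 madd(vec4 acc, vec4 t, float w)   /* one vec4 MAD */
{
    vec4 o = { acc.r + t.r * w, acc.g + t.g * w,
               acc.b + t.b * w, acc.a + t.a * w };
    return o;
}

static vec4 bilinear_rgba(vec4 t00, vec4 t10, vec4 t01, vec4 t11,
                          float w00, float w10, float w01, float w11)
{
    vec4 acc = { t00.r * w00, t00.g * w00,
                 t00.b * w00, t00.a * w00 }; /* 1 vec4 MUL  */
    acc = madd(acc, t10, w10);               /* 3 vec4 MADs */
    acc = madd(acc, t01, w01);
    acc = madd(acc, t11, w11);
    return acc;
}
```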
Agreed, all these changes look too revolutionary.
AF needs a variable number of samples, depending on du,dv.
Decompression requires interpolators; it was probably the same hardware units that did the filtering.

4 DP4s or MADDs for each bilinear RGBA fetch? Seems reasonable.
But that assumes the texture unit provides the coefficients? How precise are they? And how much bandwidth is needed between the TUs and the ALUs?

Maybe the RV870 ALU "knows" the dimensions of the texture? That way it could calculate the coefficients itself; let's assume with 4 new 1D instructions, making a total of 20 instruction slots for a single bilinear fetch.
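A sketch of what that coefficient setup might look like if the ALU knows the texture dimensions (purely speculative; the helper and its ~2 ops per axis are my guess at those "4 new 1D instructions"):

```c
#include <math.h>

/* Hypothetical per-axis coefficient setup: one MAD to go from a
   normalized coordinate to texel space, one floor, one subtract.
   Two axes lands in the ballpark of the 4 extra 1D slots guessed
   above. Compile with -lm. */
static void texel_setup(float u, int width, int *iu, float *fu)
{
    float up = u * (float)width - 0.5f; /* normalized -> texel space */
    float fl = floorf(up);
    *iu = (int)fl;  /* integer texel index, used for addressing */
    *fu = up - fl;  /* fractional part, used as the filter weight */
}
```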

This review mentions 80 full-speed FP16 texture interpolation units and 80 texture addressing units:
http://www.ixbt.com/video3/cypress-part1.shtml
 
The way I now understand it from the above is that the texture coordinates across a triangle are now interpolated by the shader core, where before this was apparently done with fixed-function hardware.
The texture filtering itself is still done by the texture units and is fixed function.
 
Voxilla: I'm afraid you are confusing texture filtering and interpolation of vertex attributes. These are two different things. Interpolators aren't related to texturing; these units are completely separate. RV770 had 10 texturing units and 8 interpolators. RV870 still has full-fledged (20) texturing units (which are even more capable); only the interpolators were removed, and their work is done via the shader core.

G80 and GT200 used separate mini-SPs for this work. They are located in the shader core but not used for general shading. E.g. G80 had 128 SPs for general shading and 128 mini-SPs for interpolation and special functions.
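For anyone wondering how small the interpolation workload actually is: per scalar attribute channel it's a plane-equation evaluation, two MADs per pixel. A minimal sketch, assuming triangle setup has already produced the screen-space gradients (perspective correction, i.e. interpolating attr/w and 1/w and dividing, is omitted):

```c
/* Screen-space attribute interpolation as a plane equation:
   attr(x, y) = a0 + dadx * x + dady * y  (two MADs per channel).
   On RV870 these MADs simply run on the regular shader ALUs instead
   of dedicated interpolators or G80-style mini-SPs. */
static float interp_attr(float a0, float dadx, float dady,
                         float x, float y)
{
    return a0 + dadx * x + dady * y;
}
```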

//edit:
The way I now understand it from the above is that the texture coordinates across a triangle are now interpolated by the shader core, where before this was apparently done with fixed-function hardware.
The texture filtering itself is still done by the texture units and is fixed function.
Yes, that's correct.
 
Texture interpolators have been removed from the design and their work is now done on the shader core. In general we are seeing this as a performance improvement; it's also the reason why one of the Vantage feature tests gets a disproportionate increase over the previous gen.
Is that the Texture Fillrate test or POM?

Jawed
 
No, those are quite in line with what you'd expect given the 4870's results. It's Perlin Noise that stands out quite a bit.
Really? Perlin Noise stands out, but so do the two other tests I mentioned:

HD5870 v HD4890, theoretically (ignoring bandwidth) precisely 2x:


http://www.ixbt.com/video3/cypress-part2.shtml
  • Texture fillrate - 1880 v 883 = 2.13x - this is a pure texturing test, so the rate of texture coordinate interpolation is important, though I now realise this can't be the test Dave was referring to, because it seemingly uses fp16 texels, which should run at half rate and therefore be texture-rate limited, not interpolation limited
  • POM - 59.5 v 26.8 = 2.20x - I don't know much about this test (other POM tests are heavy on dynamic branching)
  • Perlin Noise - 157.1 v 60.5 = 2.6x - this test, theoretically, does volume texture lookups (3DMark06 does; I don't know what Vantage does). Volume texture lookups use dependent coordinate calculations, so there is no interpolation of texture coordinates, and a change in architecture relating to interpolation should make no difference
Perlin noise may benefit from the math changes. Unfortunately, without shader code for these tests, it's a guessing game. But I notice that the 3DMark06 Perlin noise has 40 DP3s executed as DP4s, which is a 20% overhead.

In prior discussions of Vantage Perlin noise:

http://forum.beyond3d.com/showthread.php?p=1252196#post1252196

there's some doubt about its bottleneck. In theory math changes shouldn't improve it on ATI if it's not ALU-bound.

Jawed
 
So, it's Perlin Noise getting this boost, right? And I thought it was math limited. *doh*
I think it was that one, and yes, the test in general is quite math intensive; however, we aren't exactly short on math power, so in the previous gen this test was interpolator limited.
 
From what I was told, pull-model interpolation is an optional feature, usable under DX11. If the software chooses to use it, then, and only then, are the shaders utilized for interpolation.
This might be true from an API perspective, though a hardware design is free to always use the shaders.
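To make "pull model" concrete: instead of receiving fixed pixel-center values, the shader asks for an attribute at a location of its own choosing. A minimal sketch on top of the plane-equation form above (not actual D3D11 API code; the HLSL intrinsics look different):

```c
/* Pull-model style evaluation: the shader evaluates the attribute at
   an arbitrary offset (dx, dy) from the pixel center instead of being
   handed a single fixed interpolated value. */
static float eval_attr_at(float a0, float dadx, float dady,
                          float x, float y, float dx, float dy)
{
    return a0 + dadx * (x + dx) + dady * (y + dy);
}
```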

While we're at it:
What's decelerating your chips in Vantage's "GPU Cloth" feature test? There's hardly an improvement there.
Decelerating?
 

Everything under (850x2)/750=2.26 would not be out of the ordinary in my book; well, at least not totally counterintuitive. So I ruled out Tex-Fill and POM, since they maxed out at ~2.20.

BTW: Numbers directly from AMD (i hope it's ok to post them now):
[image: amdbenchguidecrop.png, cropped a bit for convenience]

edit:
I think Dave was talking specifically about 3DMark Vantage


I think it was that one, and yes, the test in general is quite math intensive; however, we aren't exactly short on math power, so in the previous gen this test was interpolator limited.

Thanks Dave (again)!
BTW - did you save something this time around too, like you did with the HD 4850? ;-)


Decelerating?

See above. It seems abnormally slow compared to the other feature tests, in comparison to both its own predecessors as well as NVidia's (mathematically massively weaker) offerings.
 
Well, if anyone actually knew what Futuremark was doing with their shader(s), it would be easier to answer... if only I had Vantage :)
 
So, now that the wait is over, when will AMD persuade Futuremark to create a DX11 benchmark application? That would IMO be a nice complement to the 5000 series cards and to Win7.
 
I'm sure they are working on a DX11 graphics test, hopefully this time not tied to PhysX but instead using compute shaders.

Regards,
SB
 
I'm sure they are working on a DX11 graphics test, hopefully this time not tied to PhysX but instead using compute shaders.

Regards,
SB
Don't think so. If that were true, you would have heard something about the new 3DMark by now.
But all I hear about is Futuremark's first game, Shattered Horizon.
 
I think that AMD needs to step up to the plate and provide a DX11 benchmark application, be it from Futuremark or from someone else. IMO, it's imperative to have such a program to demonstrate how well the 5000 series video cards perform when DX11 is used. They will have to go the extra mile to provide such application(s) so they can actually demonstrate the effectiveness of DX11 with their hardware (again IMO).
 