Do I understand correctly that texture filtering is now done by the shaders?
So, for example, for trilinear filtering 8 texels need to be fed to the shaders, which then filter them down to one value...
Obviously this saves some filtering ALUs. It probably also increases filtering accuracy, as the fixed-function filtering was only about 8 bits accurate.
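To make the ALU cost concrete, here's a minimal sketch of trilinear filtering in Python (not actual shader code; the lerp/trilinear names and argument layout are made up for illustration). It's just seven lerps per channel over the 8 fetched texels, 4 from each of two adjacent mip levels:

def lerp(a, b, t):
    # linear interpolation between a and b by fraction t
    return a + (b - a) * t

def trilinear(t, fx, fy, fz):
    # t: 8 texel values, t[0:4] from mip N, t[4:8] from mip N+1
    # fx, fy: bilinear fractions within each mip; fz: fraction between mips
    x00 = lerp(t[0], t[1], fx)
    x01 = lerp(t[2], t[3], fx)
    x10 = lerp(t[4], t[5], fx)
    x11 = lerp(t[6], t[7], fx)
    lo = lerp(x00, x01, fy)   # bilinear result on mip N
    hi = lerp(x10, x11, fy)   # bilinear result on mip N+1
    return lerp(lo, hi, fz)   # blend between the two mips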
Since a lot more data needs to be passed to the shaders, it seems the bandwidth from the L1 texture caches to the shaders has doubled.
Total L1 bandwidth is claimed to be around 1 TB/s.
For bilinear 8-bit RGBA you need 4 × 4 = 16 bytes per sample, so up to 1 TB/s / 16 B ≈ 60 GTex/s is possible.
Per SIMD cluster the L1 bandwidth is thus 1 TB/s / 20 = 50 GB/s.
The bus width from L1 to a SIMD should thus be 50 GB/s / 850 MHz ≈ 59 bytes.
Almost certainly this is really 64 bytes, making the L1 bandwidth 20 × 64 B × 850 MHz = 1.088 TB/s and the bilinear filter rate 68 GTex/s, exactly as claimed.
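A quick back-of-envelope check of the numbers above in Python (20 SIMDs, 64-byte bus and 850 MHz clock are the estimates from this post, not official specs):

simds = 20
clock = 850e6                        # Hz
bus = 64                             # bytes per SIMD per clock (rounded up from ~59)
l1_bw = simds * bus * clock          # total L1 bandwidth in bytes/s
bilinear_cost = 4 * 4                # 4 texels x 4 bytes (RGBA8) per bilinear sample
print(l1_bw / 1e12)                  # 1.088 (TB/s)
print(l1_bw / bilinear_cost / 1e9)   # 68.0 (GTex/s)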