Old 03-Feb-2005, 23:52   #1
Dave Baumann
Gamerscore Wh...
 
Join Date: Jan 2002
Posts: 12,947
Default FP16 Bilinear Filtering

Does anyone know the cost of doing an FP16 Bilinear Filter on NV40? 2 Cycles? 3 Cycles? Something else?
__________________
Expand. Accelerate. Dominate.
Tweet Tweet!
Dave Baumann is offline   Reply With Quote
Old 04-Feb-2005, 09:35   #2
Zeross
Member
 
Join Date: Jun 2002
Location: France
Posts: 233
Default

It seems that using FP16 textures costs 2 cycles on the NV40 whether you're using point sampling or bilinear filtering.
__________________
Twitter
Zeross is offline   Reply With Quote
Old 05-Feb-2005, 15:39   #3
Tridam
Regular
 
Join Date: Apr 2003
Location: Louvain-la-Neuve, Belgium
Posts: 523
Default

FP16 bilinear filtering is free on NV40. However, its texturing unit can't output more than 2 FP16 components per cycle (that's the same with every GPU).

FP16 point sampling x or xy : 1 cycle
FP16 point sampling xyz or xyzw : 2 cycles
FP16 bilinear filtering x or xy : 1 cycle
FP16 bilinear filtering xyz or xyzw : 2 cycles
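
A rough sketch of how those four figures fall out of the 2-components-per-cycle limit. This is only a model of the numbers above, not anything taken from NVIDIA documentation; the function name and the constant are made up for illustration.

```python
import math

# Assumption taken from the figures above: the texture unit returns at most
# 2 FP16 components per cycle, whether the texel is point sampled or filtered.
FP16_COMPONENTS_PER_CYCLE = 2


def fp16_fetch_cycles(components: int) -> int:
    """Cycles needed to return one FP16 texel with 1 to 4 components."""
    if not 1 <= components <= 4:
        raise ValueError("a texel has between 1 and 4 components")
    return math.ceil(components / FP16_COMPONENTS_PER_CYCLE)


if __name__ == "__main__":
    for swizzle in ("x", "xy", "xyz", "xyzw"):
        print(f"FP16 {swizzle}: {fp16_fetch_cycles(len(swizzle))} cycle(s)")
```

Since filtering is free, the same function covers both the point sampling and the bilinear rows.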
__________________
Damien Triolet - HardWare.fr
Sorry for my bad English. Maybe one day it'll be better :D
Tridam is offline   Reply With Quote
Old 06-Feb-2005, 22:21   #4
akira888
Member
 
Join Date: Jul 2003
Location: Houston
Posts: 652
Default

Would it make sense that the reason for that is that the data bus between the texture filtering unit and the shading units is only 32 bits wide (designed for the usual case of RGBA_8 textures)?
__________________
"The struggle of man against power is the struggle of memory against forgetting." -Milan Kundera
akira888 is offline   Reply With Quote
Old 07-Feb-2005, 01:21   #5
arjan de lumens
Senior Member
 
Join Date: Feb 2002
Location: gjethus, Norway
Posts: 1,256
Default

Quote:
Originally Posted by akira888
Would it make sense that the reason for that is that the data bus between the texture filtering unit and the shading units is only 32 bits wide (designed for the usual case of RGBA_8 textures)?
Actually, yes. There was a presentation at the Graphics Hardware 2004 conference where some people studied bottlenecks in GPUs for GPGPU-type tasks, and found that NO current GPU had a path from the texture cache to the pixel shader that was wider than 32 bits. (link to presentation (300K ppt)) So you could read a 4-component FP32 texture if you wanted to (on any of R3xx, R4xx, NV3x, NV4x, unfiltered), but the hardware would take 4 cycles to deliver the data on all architectures. This was shown to be a cache->shader path limitation, not a memory bandwidth limitation (this can be easily tested by benchmarking reads of grossly magnified textures).
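
To put numbers on that, here is a minimal sketch that assumes the only limit is a 32-bit-wide path from the texture cache to the shader. The function name, the format table and the default width are my own illustrative choices; the model ignores filtering cost and memory bandwidth entirely.

```python
import math

# Assumed width of the texture cache -> pixel shader path, per the
# presentation mentioned above.
PATH_WIDTH_BITS = 32


def delivery_cycles(components: int, bits_per_component: int,
                    path_width_bits: int = PATH_WIDTH_BITS) -> int:
    """Cycles to push one texel across a path of the given width."""
    texel_bits = components * bits_per_component
    return math.ceil(texel_bits / path_width_bits)


# (components, bits per component) for a few 4-component formats
FORMATS = {
    "RGBA8": (4, 8),
    "RGBA FP16": (4, 16),
    "RGBA FP32": (4, 32),
}

if __name__ == "__main__":
    for name, (components, bits) in FORMATS.items():
        print(f"{name}: {delivery_cycles(components, bits)} cycle(s)")
```

Under that assumption a 4-component FP32 texel needs 4 cycles and a 4-component FP16 texel needs 2, which matches the figures quoted in this thread.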
arjan de lumens is offline   Reply With Quote
Old 07-Feb-2005, 18:13   #6
Luminescent
Senior Member
 
Join Date: Aug 2002
Location: Miami, Fl
Posts: 1,036
Default

Current GPU architectures can read 4 FP32 values in a cycle, but can something like NV4x bilinearly filter those 4 FP32 values in a single cycle, bandwidth limitations aside?
__________________
"Friendship is unnecessary, like philosophy, like art... It has no survival value; rather it is one of those things that give value to survival."
-C.S. Lewis
Luminescent is offline   Reply With Quote
Old 07-Feb-2005, 18:55   #7
arjan de lumens
Senior Member
 
Join Date: Feb 2002
Location: gjethus, Norway
Posts: 1,256
Default

Quote:
Originally Posted by Luminescent
Current GPU architectures can read 4 FP32 values in a cycle
Source? It may seem obvious that they SHOULD have that ability, but actual benchmarking so far tells a different story.
arjan de lumens is offline   Reply With Quote
Old 07-Feb-2005, 21:27   #8
Xmas
Off-season
 
Join Date: Feb 2002
Location: On the pursuit of happiness
Posts: 3,019
Default

Quote:
Originally Posted by Luminescent
Current GPU architectures can read 4 FP32 values in a cycle, but can something like NV4x bilinearly filter those 4 FP32 values in a single cycle, bandwidth limitations aside?
NV4x can't filter FP32 values. And if the TMUs are limited to 32 bit output per clock (before conversion to FP32), there certainly are no units that would generate more than that.
Xmas is offline   Reply With Quote
Old 07-Feb-2005, 21:43   #9
Sage
13 short of a dozen
 
Join Date: Aug 2002
Location: Southern Methodist University
Posts: 935
Default

why would they cripple fp reads in this way? surely it wouldn't be very difficult to double or even quadruple that since we're talking about an on-chip bus. do they just not expect anyone to ever actually use fp textures on current-generation hardware?
__________________
This post powered by Macintosh.
Sage is offline   Reply With Quote
Old 07-Feb-2005, 21:55   #10
Luminescent
Senior Member
 
Join Date: Aug 2002
Location: Miami, Fl
Posts: 1,036
Default

Quote:
Originally Posted by arjan de lumens
Quote:
Originally Posted by Luminescent
Current GPU architectures can read 4 FP32 values in a cycle
Source? It may seem obvious that they SHOULD have that ability, but actual benchmarking so far tells a different story.
I was using the following statement of yours as my source:
Quote:
So you could read a 4-component FP32 texture if you wanted to (on any of R3xx, R4xx, NV3x, NV4x, unfiltered), but the hardware would take 4 cycles to deliver the data on all architectures.
I guess I glanced at it too quickly and added onto it. At first glance it seemed you were indicating that the pixel units could read 4 FP32 values from the texture units in a single cycle if it weren't for the fact that the data path from the tex units to the pixel shader was crippled to 32 bits.

Secondly, I meant 4 FP16 values, since, as Xmas pointed out, NV40 cannot filter FP32 textures.
__________________
"Friendship is unnecessary, like philosophy, like art... It has no survival value; rather it is one of those things that give value to survival."
-C.S. Lewis
Luminescent is offline   Reply With Quote
Old 07-Feb-2005, 22:28   #11
akira888
Member
 
Join Date: Jul 2003
Location: Houston
Posts: 652
Default

Quote:
Originally Posted by Sage
why would they cripple fp reads in this way? surely it wouldn't be very difficult to double or even quadruple that since we're talking about an on-chip bus. do they just not expect anyone to ever actually use fp textures on current-generation hardware?
Probably (disclosure: speaking as a hobbyist coder and not an engineer) because they correctly expected the vast majority of texture reads to have only a 32-bit return value, and therefore it simply wasn't worth doubling the width of the on-chip data bus (which would use valuable die area) to accelerate a relatively uncommon operation.
__________________
"The struggle of man against power is the struggle of memory against forgetting." -Milan Kundera
akira888 is offline   Reply With Quote
Old 08-Feb-2005, 15:07   #12
Xmas
Off-season
 
Join Date: Feb 2002
Location: On the pursuit of happiness
Posts: 3,019
Default

As I tried to point out, it's not only the data path, it's the number of units as well. There would be no point in restricting one but not the other.
So there are only two FP16 bilinear interpolators, while there are four 8bit capable interpolators. They could be implemented as 2* FP16 + 2* FX8, or maybe you can somehow combine two FX8 interpolators to form one FP16 interpolator (though I don't see an easy way to do that)
Xmas is offline   Reply With Quote
Old 08-Feb-2005, 17:40   #13
Chalnoth
 
Join Date: May 2002
Location: New York, NY
Posts: 12,678
Default

Quote:
Originally Posted by Xmas
So there are only two FP16 bilinear interpolators, while there are four 8bit capable interpolators. They could be implemented as 2* FP16 + 2* FX8, or maybe you can somehow combine two FX8 interpolators to form one FP16 interpolator (though I don't see an easy way to do that)
The only issue with this idea is that it takes three interpolations to do the summation for bilinear texture filtering.
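
For reference, a minimal sketch of bilinear filtering written as three linear interpolations per component: two along x, then one along y. The function names and the tuple layout are just for illustration and say nothing about how the hardware actually arranges its units.

```python
def lerp(a: float, b: float, t: float) -> float:
    """Linear interpolation between a and b by weight t in [0, 1]."""
    return a + t * (b - a)


def bilinear(t00: float, t10: float, t01: float, t11: float,
             fx: float, fy: float) -> float:
    """Bilinear filter of one component, built from three lerps."""
    top = lerp(t00, t10, fx)       # lerp 1: along x on the upper row
    bottom = lerp(t01, t11, fx)    # lerp 2: along x on the lower row
    return lerp(top, bottom, fy)   # lerp 3: along y between the two rows


def bilinear_texel(texels, fx: float, fy: float):
    """Filter a 2x2 footprint of multi-component texels.

    texels is (t00, t10, t01, t11), each a tuple of components.
    A 4-component texel costs 4 * 3 = 12 scalar lerps in this form.
    """
    t00, t10, t01, t11 = texels
    return tuple(bilinear(a, b, c, d, fx, fy)
                 for a, b, c, d in zip(t00, t10, t01, t11))


if __name__ == "__main__":
    footprint = ((0.0, 0.0), (2.0, 1.0), (4.0, 0.0), (6.0, 1.0))
    print(bilinear_texel(footprint, 0.5, 0.5))
```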
Chalnoth is offline   Reply With Quote
Old 08-Feb-2005, 18:29   #14
Xmas
Off-season
 
Join Date: Feb 2002
Location: On the pursuit of happiness
Posts: 3,019
Default

Quote:
Originally Posted by Chalnoth
Quote:
Originally Posted by Xmas
So there are only two FP16 bilinear interpolators, while there are four 8bit capable interpolators. They could be implemented as 2* FP16 + 2* FX8, or maybe you can somehow combine two FX8 interpolators to form one FP16 interpolator (though I don't see an easy way to do that)
The only issue with this idea is that it takes three interpolations to do the summation for bilinear texture filtering.
That's why I wrote "FP16 bilinear interpolators".

And there's one additional MAD for trilinear/AF sample accumulation.
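
A small sketch of what I mean by sample accumulation, assuming each already-filtered bilinear sample is folded into a running total with one multiply-add. The names `mad`, `accumulate` and `trilinear` are illustrative only.

```python
def mad(a: float, b: float, c: float) -> float:
    """Multiply-add: a * b + c."""
    return a * b + c


def accumulate(samples, weights) -> float:
    """Weighted sum of filtered samples, one MAD per sample."""
    acc = 0.0
    for sample, weight in zip(samples, weights):
        acc = mad(sample, weight, acc)
    return acc


def trilinear(sample_mip0: float, sample_mip1: float, lod_frac: float) -> float:
    """Blend two bilinear samples from adjacent mip levels."""
    return accumulate([sample_mip0, sample_mip1], [1.0 - lod_frac, lod_frac])


if __name__ == "__main__":
    print(trilinear(1.0, 3.0, 0.5))
    # An anisotropic filter would accumulate more samples the same way,
    # here four equally weighted ones.
    print(accumulate([2.0, 4.0, 6.0, 8.0], [0.25] * 4))
```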
Xmas is offline   Reply With Quote
Old 08-Feb-2005, 19:57   #15
Chalnoth
 
Join Date: May 2002
Location: New York, NY
Posts: 12,678
Default

Actually, it was that one additional interpolator for sample accumulation that I was concerned with. I was assuming each interpolator would pretty much automatically be operating on 4-component objects (though it seems that in nVidia's case the FP16 interpolators are a bit more flexible and capable of dual-issue).
Chalnoth is offline   Reply With Quote
Old 08-Feb-2005, 20:55   #16
FUDie
Member
 
Join Date: Sep 2002
Posts: 559
Default

Quote:
Originally Posted by Chalnoth
Actually, it was that one additional interpolator for sample accumulation that I was concerned with. I was assuming each interpolator would pretty much automatically be operating on 4-component objects
That doesn't appear to be the case for FP16.
Quote:
(though it seems that in nVidia's case the FP16 interpolators are a bit more flexible and capable of dual-issue).
Dual-issue? Where? Looks more like a "loopback" arrangement. I.e. x and y are interpolated first then z and w. There's no dual issue involved here: There's a full 2-component FP16 bilinear interpolator.

-FUDie
__________________
Ph.D. - Piled Higher and Deeper
FUDie is offline   Reply With Quote
Old 08-Feb-2005, 21:00   #17
Chalnoth
 
Join Date: May 2002
Location: New York, NY
Posts: 12,678
Default

Quote:
Originally Posted by FUDie
Dual-issue? Where? Looks more like a "loopback" arrangement. I.e. x and y are interpolated first then z and w. There's no dual issue involved here: There's a full 2-component FP16 bilinear interpolator.

-FUDie
Actually, that does make more sense. The initial way I was thinking you'd split up a filtering operation would be to have interpolator 1 average samples 1 and 2, have interpolator 2 average samples 3 and 4, then have a third interpolator average the results of the first two. It makes more sense, I suppose, to divide this up so that rather than having just two interpolators where three are needed (a large waste in computation), you'd make them 2-component instead of 4-component.
Chalnoth is offline   Reply With Quote
Old 09-Feb-2005, 05:41   #18
Humus
Crazy coder
 
Join Date: Feb 2002
Location: Stockholm, Sweden
Posts: 3,216
Default

Quote:
Originally Posted by Sage
why would they cripple fp reads in this way? surely it wouldn't be very difficult to double or even quadruple that since we're talking about an on-chip bus. do they just not expect anyone to ever actually use fp textures on current-generation hardware?
Internal buses are hardly free. Besides, when you're using floating point textures, you're likely also doing a decent amount of math, which should balance out the extra latency.
__________________
[ Visit my site ]
I speak for myself and only myself.
Humus is offline   Reply With Quote
Old 09-Feb-2005, 06:15   #19
Chalnoth
 
Join Date: May 2002
Location: New York, NY
Posts: 12,678
Default

Quote:
Originally Posted by Sage
why would they cripple fp reads in this way? surely it wouldn't be very difficult to double or even quadruple that since we're talking about an on-chip bus. do they just not expect anyone to ever actually use fp textures on current-generation hardware?
Well, I would tend to think that it's the math units that are the limitation here. It's got to take quite a few more transistors for FP16 interpolators than the usual FX8 interpolators. And if the math units can't do it any faster, why waste any transistors on data paths?
Chalnoth is offline   Reply With Quote
Old 09-Feb-2005, 15:40   #20
Sage
13 short of a dozen
 
Join Date: Aug 2002
Location: Southern Methodist University
Posts: 935
Default

okay I guess I didn't read carefully enough, I was under the impression that the bus was the limiting factor.
__________________
This post powered by Macintosh.
Sage is offline   Reply With Quote
Old 09-Feb-2005, 15:44   #21
Chalnoth
 
Join Date: May 2002
Location: New York, NY
Posts: 12,678
Default

Quote:
Originally Posted by Sage
okay I guess I didn't read carefully enough, I was under the impression that the bus was the limiting factor.
Well, to be fair, there's no way to determine for certain whether or not this is the case (without being able to look at the architecture). But it makes most sense that the math units are the limitation.
Chalnoth is offline   Reply With Quote
Old 09-Feb-2005, 18:33   #22
akira888
Member
 
Join Date: Jul 2003
Location: Houston
Posts: 652
Default

If that were so, then why does point sampling an FP32 texture have a 4-cycle latency? Point sampling shouldn't require any interpolation at all.
__________________
"The struggle of man against power is the struggle of memory against forgetting." -Milan Kundera
akira888 is offline   Reply With Quote
Old 09-Feb-2005, 20:10   #23
Chalnoth
 
Join Date: May 2002
Location: New York, NY
Posts: 12,678
Default

Quote:
Originally Posted by akira888
If that were so, then why does point sampling an FP32 texture have a 4-cycle latency? Point sampling shouldn't require any interpolation at all.
Now that is rather odd, because point sampling an FP32 texture requires the same bandwidth as bilinear filtering an FX8 texture. It would seem that the architecture isn't as well optimized for sampling FP32 textures as it could be.
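
The arithmetic behind that comparison, as a quick sketch that assumes 4-component texels in both cases. The helper name is made up for illustration.

```python
def bits_fetched(texels: int, components: int, bits_per_component: int) -> int:
    """Raw texel data read for one texture sample."""
    return texels * components * bits_per_component


# Point sampling reads 1 texel; bilinear filtering reads a 2x2 footprint.
point_fp32 = bits_fetched(texels=1, components=4, bits_per_component=32)
bilinear_fx8 = bits_fetched(texels=4, components=4, bits_per_component=8)

if __name__ == "__main__":
    print(f"point sampled FP32 RGBA: {point_fp32} bits")
    print(f"bilinear FX8 RGBA: {bilinear_fx8} bits")
```

Both come to 128 bits per sample, so raw fetch bandwidth alone doesn't explain a 4x difference in cost.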
Chalnoth is offline   Reply With Quote
Old 09-Feb-2005, 22:09   #24
psurge
Member
 
Join Date: Feb 2002
Location: LA, California
Posts: 825
Default

Chalnoth - a complete guess, but maybe it's because the texture caching logic fetches/determines cache hits in 2x2 texel blocks. Or, if the latency is always 4 cycles, then probably the bus between the texture unit and the shader is only 32 bits wide, so you would need 4 cycles to transfer one FP32 4-vector.
psurge is offline   Reply With Quote
Old 10-Feb-2005, 01:21   #25
Mintmaster
Senior Member
 
Join Date: Mar 2002
Posts: 3,779
Default

Quote:
Originally Posted by Sage
why would they cripple fp reads in this way? surely it wouldn't be very difficult to double or even quadruple that since we're talking about an on-chip bus. do they just not expect anyone to ever actually use fp textures on current-generation hardware?
While akira's explanation is the most important reason, I think another issue is the way in which FP textures are currently used.

When doing HDR post processing, the bandwidth needed will slow you down anyway, since you have a 1:1 mapping between pixels and texels. Say you were doing a blur of 4 pixels: you need to read 256 bits of data, then write 64 when you're done. I think many of today's GPUs have only 32 bits of bandwidth per pipe per clock, and rarely will you get >90% utilisation.

It makes sense to me. Why make the GPU capable of more than the memory will be able to feed it? Only when you get into ordinary usage of FP textures (i.e. not 1:1) will bandwidth be less of an issue, and I think the sky in FarCry's HDR mode is the only example so far.
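
To make the blur example concrete, a back-of-the-envelope sketch using my guessed figures from above (4 FP16 RGBA reads, one FP16 RGBA write, 32 bits of memory bandwidth per pipe per clock). The function and its defaults are illustrative assumptions, not measurements.

```python
def blur_clocks(taps: int = 4, bits_per_texel: int = 64,
                bits_written: int = 64, bits_per_clock: int = 32,
                utilisation: float = 1.0) -> float:
    """Clocks one pipe spends moving data for a single output pixel."""
    total_bits = taps * bits_per_texel + bits_written
    return total_bits / (bits_per_clock * utilisation)


if __name__ == "__main__":
    print(f"ideal: {blur_clocks():.1f} clocks per pixel")
    print(f"at 90% utilisation: {blur_clocks(utilisation=0.9):.1f} clocks per pixel")
```

With those assumptions memory traffic alone costs about 10 clocks per pixel, well above the 8 clocks that four 2-cycle FP16 fetches would take.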
Mintmaster is offline   Reply With Quote


All times are GMT +1. The time now is 08:39.

