Memory BW vs. texture units?

freka586

Newcomer
When looking at the rumoured specs of the 1GB version of AMD's 2900, it hit me that I know too little about the texture and sampling stages of the modern pipeline.

Say these cards provide about 140 GB/s of memory BW, and that texture operations are handled by 16(?) texture units. In what cases would you see a substantial performance benefit compared with, e.g., the roughly 100 GB/s 512MB 2900, or previous-generation products?

Unfortunately I do not have very much experience from the gaming perspective, so perhaps these questions are utterly foolish.... For the workstation application I am working on we are 100% bound by the number of texture units. I guess this is because we perform loads of trilinearly filtered volume texture lookups. We saw almost no perf. increase at all for HD2900XT compared to X1900XT.
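To make "bound by the number of texture units" concrete, here's a back-of-the-envelope sketch in Python. All the figures are assumptions for illustration (16 bilinear filter units, a ~742 MHz core clock, the rumoured 140 GB/s, RGBA8 texels, and a trilinear 3D fetch costing two bilinear operations over 8 texels), not vendor-confirmed specs:

```python
# Rough estimate: is a trilinearly filtered volume lookup limited by the
# filter units or by memory bandwidth? All numbers below are assumptions.

TEX_UNITS   = 16        # assumed bilinear filter units
CORE_CLK    = 742e6     # Hz, assumed HD 2900 XT core clock
MEM_BW      = 140e9     # bytes/s, rumoured 1GB GDDR4 board
TEXEL_BYTES = 4         # RGBA8

# A trilinearly filtered 3D fetch blends 8 texels (2 slices x 4 each),
# i.e. two bilinear operations per lookup.
BILERPS_PER_LOOKUP = 2
TEXELS_PER_LOOKUP  = 8

filter_rate = TEX_UNITS * CORE_CLK / BILERPS_PER_LOOKUP        # lookups/s
# Worst case: every texel misses the texture cache.
bw_rate_cold = MEM_BW / (TEXELS_PER_LOOKUP * TEXEL_BYTES)
# More typical: ~90% of texels come from cache, so only 10% hit memory.
bw_rate_warm = MEM_BW / (TEXELS_PER_LOOKUP * TEXEL_BYTES * 0.1)

print(f"filter-limited:               {filter_rate / 1e9:.2f} Glookups/s")
print(f"bandwidth-limited (no cache): {bw_rate_cold / 1e9:.2f} Glookups/s")
print(f"bandwidth-limited (90% hit):  {bw_rate_warm / 1e9:.2f} Glookups/s")
```

With any reasonable texture-cache hit rate, the filter-unit ceiling sits well below the bandwidth ceiling, which would explain why extra memory bandwidth does nothing for this workload.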
 
Are you asking how much extra performance you'll gain going from a 512MB card to a 1GB one?
(Very little, I'd guess.)
Or how much extra performance you'll gain from a 1GB card when you're bottlenecked by texture lookups? (Same as the previous answer, I'd guess.)

Or are you asking: when you're bottlenecked by texture lookups, what would gain you more performance, more memory BW or more texture units?
 
It seems he's basically asking why his application doesn't see a benefit moving from R580 to R600. At first I was going to say that it won't if it's just manipulating INT8 textures, but a quick look at TechReport's and XbitLabs' fillrate tests shows that R600 improves over R580 in synthetic tests. freka's app may not be as straightforward as those tests, though, and I'm not sure whether those tests are showing the benefit of R600's bandwidth advantage or of its extra texture address units.

The obvious answers are:

* 1GB buys you nothing over 512MB if your app isn't using more than 512MB, and
* the GDDR4 (aka 1GB, for now) XT's extra bandwidth buys you nothing over the GDDR3 version if bandwidth isn't your bottleneck.

But this all hinges on what "loads of trilinearly filtered volume texture lookups" means to R580 and R600, assuming the lookups are the bottleneck. That's beyond me. If I had to guess (blindly), I'd say that those lookups are entirely dependent on INT8 bilerps, in which case R600 won't show an improvement beyond its somewhat higher clock speed. (HW.fr proves my point, though I'm not sure why TR's and Xbit's filtering tests didn't show the same.)

If the app were pushing FP16 textures around, then you'd expect to see R600 embarrass R580.
 
I second Pete's comments in every area, and I'd just like to emphasize that the texture filtering capability of the 2900XT is far narrower than any other stage in the texturing chain, so it's almost certainly the bottleneck. Take a look at the architecture chart here and check the sampler array rectangle.

As the 1GB 2900XT may have a higher core clock speed, it will show some increase - but that's probably not the amount you're looking for. Not until you switch to FP16 textures, anyway.
 
I guess this is because we perform loads of trilinearly filtered volume texture lookups. We saw almost no perf. increase at all for HD2900XT compared to X1900XT.
You'll want an NVIDIA card, then. G80 has four times the performance per clock, and overall you should get about twice the performance even on the cheaper 8800GTS.

Volume texture performance was one of the reasons I went for G80 this generation.

Which you don't need 95% of the time. Can't make any sense of this design decision..

Marco
Neither can I, to be honest. Seems like FP16 filtering is really only marginally useful for some aspects of HDR post processing. I guess 32 texels per clock with half speed FP16 was too expensive? Maybe this was a half-step before R650.
 
Neither can I, to be honest. Seems like FP16 filtering is really only marginally useful for some aspects of HDR post processing.
Where doing it in the shader with Fetch4 wouldn't really be much slower (if at all), anyway...
 
Which you don't need 95% of the time. Can't make any sense of this design decision..
I'd say there are two distinct points here, FP16 filtering rate and FP16 fetch rate. The bilinear filter itself is probably not the most expensive part, and in light of sRGB textures as well as formats like R9G9B9E5_SHAREDEXP or R11G11B10_FLOAT, higher precision filtering at full rate does make some sense.
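The sRGB point deserves a number or two. Filtering sRGB textures correctly means decoding to linear first, and sRGB packs much of its precision into the darks, so an 8-bit fixed-point filter can't hold the decoded values apart. A quick Python illustration (using the standard sRGB transfer function; the precision claim about a hypothetical 8-bit linear filter path is the assumption being demonstrated):

```python
# Why 8-bit fixed point isn't enough to filter sRGB textures in linear
# space: the first several distinct sRGB codes all collapse to the same
# 8-bit linear value, while FP16's fine spacing near zero keeps them apart.

def srgb_to_linear(c8):
    """Decode an 8-bit sRGB code to a linear-light float in [0, 1]."""
    c = c8 / 255.0
    return c / 12.92 if c <= 0.04045 else ((c + 0.055) / 1.055) ** 2.4

for code in (1, 2, 3, 4):
    lin = srgb_to_linear(code)
    print(f"sRGB {code} -> linear {lin:.6f} -> 8-bit linear {round(lin * 255)}")
```

Every one of those distinct sRGB codes rounds to 8-bit linear 0, so any filter running at 8-bit precision treats them as identical; that's the case for FP16 (or better) as the minimum filtering precision.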
 
Where doing it in the shader with Fetch4 wouldn't really be much slower (if at all), anyway...
Fetch4 is indeed pretty efficient in my experience, but remember that it's limited to 1-component textures! If it were possible to fetch the 4 bilinear samples of an arbitrary texture that would indeed be interesting, although my understanding is that it wouldn't be much faster than simply doing the four texture fetches (particularly with the new offset instructions).

So although Fetch4 seems useful in theory, in practice it really only seems to be useful for PCF, and now that DX10 mandates the comparison sampling modes even that seems of somewhat questionable utility...
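For anyone unsure what "doing it in the shader" amounts to: the weighting math a Fetch4-style gather leaves to you is just a handful of MADs. A CPU sketch in Python (not shader code; the clamp addressing and texel-center convention are illustrative assumptions):

```python
# Emulate one bilinear fetch from four point samples -- the four texels
# a Fetch4/gather instruction would hand back for a 1-component texture.

def bilinear(tex, u, v):
    """tex: 2D list of floats; (u, v) in texel space, centers at x + 0.5."""
    h, w = len(tex), len(tex[0])
    x, y = u - 0.5, v - 0.5          # shift so texel centers get weight 1
    x0, y0 = int(x), int(y)
    fx, fy = x - x0, y - y0          # fractional weights
    clamp = lambda i, n: max(0, min(i, n - 1))
    # The four point fetches (what gather returns in one instruction):
    t00 = tex[clamp(y0,     h)][clamp(x0,     w)]
    t10 = tex[clamp(y0,     h)][clamp(x0 + 1, w)]
    t01 = tex[clamp(y0 + 1, h)][clamp(x0,     w)]
    t11 = tex[clamp(y0 + 1, h)][clamp(x0 + 1, w)]
    # The "in the shader" part: two lerps horizontally, one vertically.
    top = t00 * (1 - fx) + t10 * fx
    bot = t01 * (1 - fx) + t11 * fx
    return top * (1 - fy) + bot * fy

tex = [[0.0, 1.0],
       [2.0, 3.0]]
print(bilinear(tex, 1.0, 1.0))   # midpoint of all four texels -> 1.5
```

Three lerps per channel is cheap next to the fetch itself, which is why manual filtering after a gather (or four point fetches) costs little extra ALU.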
 
I'd say there are two distinct points here, FP16 filtering rate and FP16 fetch rate. The bilinear filter itself is probably not the most expensive part, and in light of sRGB textures as well as formats like R9G9B9E5_SHAREDEXP or R11G11B10_FLOAT, higher precision filtering at full rate does make some sense.
I see your point, but I question their decision to run RGBA8 filtering and FP64 filtering at the same rate, when that rate is nothing to write home about for a monster chip like R600.
The vast, vast majority of texture fetches currently come from RGBA8 textures, and IMHO that will stay true for the conceivable future; slowing them down to FP64 filtering rate (though some would say they just improved the FP64 filtering rate...) is not exactly the smartest decision they could have made, IMHO again.
It would have made sense if FP64 filtering at least ran close to their direct competition's rate (with RGBA8 filtering just running at half rate), but AFAIK that is not the case.

Marco
 
I agree that R600's texturing rate is too low. All I'm saying is that you want sRGB textures at full speed, and 10 or 11 bit fixed point precision isn't enough for that. So having FP16 as minimum filter precision makes sense.
 