G92 quadruple bilinear texel rate?

FP16 fetch is half rate on all G8x/G92, and FP32 fetch runs at a quarter of the INT8 rate on this architecture.

Number of TAs doesn't have much to do with FP16 bilinear texel rate.
 
Number of TAs doesn't have much to do with FP16 bilinear texel rate.

So do you mean that some other component in the pipeline causes the half rate, and the TAs just sit idle?
 
G92 GT has 56 texture address units (TA) and 56 texture filtering units (TF), which means it can produce 56 bilinear filtered texels per clock. It can do so with INT8 data (classic 32-bit RGBA, if you will) as Buntar described, but not with data formats larger than 32 bits per texel (namely FP16, half-precision float at 64 bits, and FP32, single-precision float at 128 bits). That's why FP16 filtering runs at half speed and FP32 filtering at only a quarter of it, and yes, the TAs will just sit idle.

The original G80 was different in that it had half the TA units for the same number of TF units. This meant anisotropic filtering incurred only a very small performance hit. It also created the appearance that FP16 was full speed (when in fact even INT8 filtering was half speed due to the lack of TAs).
It's all just a function of available bandwidth, however. For example, even if you could do FP16 and FP32 filtering at full speed (and spent the transistor budget to do so), you probably wouldn't see full-speed texel rates in practice, as you'd run out of bandwidth.
It's the same for post-pixel-shader blending: G8x/G92 only runs at full speed with 32-bit data (UINT8), FP16 runs at half speed, and enabling blending reduces performance even further.
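To make the TA/TF interaction concrete, here's a rough back-of-the-envelope model (a sketch only; the unit counts and core clocks are the commonly quoted 8800GT/8800GTX specs, not measurements from this thread):

```python
def bilinear_texels_per_clock(ta_units, tf_units, format_bits):
    """Peak bilinear filtered texels per clock for a given texel width.

    Each TF unit filters one 32-bit (INT8 RGBA) texel per clock; wider
    formats take proportionally more clocks. The TA units cap the rate
    at one address per texel regardless of format.
    """
    tf_rate = tf_units * min(1.0, 32.0 / format_bits)  # filtering limit
    return min(ta_units, tf_rate)                      # addressing limit

# (TA, TF, core MHz) -- commonly quoted specs, assumed here
chips = {"G92 (8800GT)": (56, 56, 600), "G80 (8800GTX)": (32, 64, 575)}

for chip, (ta, tf, mhz) in chips.items():
    for fmt, bits in [("INT8", 32), ("FP16", 64), ("FP32", 128)]:
        gtex = bilinear_texels_per_clock(ta, tf, bits) * mhz / 1000.0
        print(f"{chip} {fmt}: {gtex:.1f} GTex/s")
```

Note how the G80 numbers reproduce both effects described above: INT8 is TA-limited to 32 texels/clock, so FP16 looks "full speed" relative to it.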
 
Just three quick corrections fwiw: A) Vec1/Vec2 FP10/FP16/INT16 is full-speed. B) Vec1 FP32 is full-speed, Vec2 FP32 is half-speed. C) INT8 blending is half-speed (same as FP10/FP16). And yes, that's a retarded design flaw, arguably justifiable by little else than marketing ("free HDR!")
 
Just three quick corrections fwiw: A) Vec1/Vec2 FP10/FP16/INT16 is full-speed. B) Vec1 FP32 is full-speed, Vec2 FP32 is half-speed. C) INT8 blending is half-speed (same as FP10/FP16). And yes, that's a retarded design flaw, arguably justifiable by little else than marketing ("free HDR!")

I don't know about blending, but I disagree with your points A and B. The performance of G8x bilinear filtering doesn't seem to differ with different numbers of texture channels. The diagram below should prove my point (half-speed FP16, quarter-speed FP32, and "half-speed" INT8 on G80 due to the lack of TAs):

[Image: bench-filter-4x4-bilinear.png — 4x4 bilinear filtering benchmark results]
 
And yes, that's a retarded design flaw, arguably justifiable by little else than marketing ("free HDR!")
You think so? Each ROP has a 64-bit path to memory, and handles blending for 4 pixels. The blending alone needs 256 bits of data for INT8, and then you have texture data and z reads. If you wanted full-speed INT8 blending, you'd need ~5 data transfers per core clock, i.e. 3GHz memory. INT8 at half speed is a very good choice.

Maybe you could make the case that FP16 blending should have been 1/4 speed, but it doesn't really matter. I doubt they wasted much space doing it the way they did.
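For what it's worth, that arithmetic works out like this (the 600MHz core clock is the assumed 8800GT figure; the rest follows the post):

```python
pixels_per_rop = 4
path_bits = 64                                # per-ROP path to memory

# INT8 blend = 4-byte framebuffer read + 4-byte write per pixel
blend_bits = pixels_per_rop * (32 + 32)       # 256 bits per ROP per clock
transfers_for_blend = blend_bits / path_bits  # 4 transfers just for blending
transfers_total = 5                           # + texture data and z reads (estimate)

core_mhz = 600                                # assumed 8800GT core clock
print(f"Blend alone: {transfers_for_blend:.0f} transfers/clock over the 64-bit path")
print(f"Full-speed INT8 blend needs ~{core_mhz * transfers_total / 1000:.0f} GHz memory data rate")
```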
 
Buntar: I generated the data for that graph, so I should know... ;) Because of the lack of TAs *and* an FP32 bug on G80 (confirmed by NV, under the drivers used for that test), it is indeed impossible to see what I'm referring to there. These numbers (for G92) should hopefully be a fair bit clearer!
G92 Trilinear
----------------------------------
DXT1: 16783.216533MTexops/s
DXT3: 16783.216533MTexops/s
DXT5: 16666.666418MTexops/s

INT8 Vec1: 16783.216533MTexops/s
INT8 Vec2: 16783.216533MTexops/s
INT8 Vec3: 16783.216533MTexops/s
INT8 Vec4: 16783.216533MTexops/s

FP10: 8391.608267MTexops/s
RGB9E5: 8391.608267MTexops/s

Depth16: 8333.333209MTexops/s
Depth24: 8391.608267MTexops/s
Depth32: 8391.608267MTexops/s

FP16 Vec1: 16783.216533MTexops/s
FP16 Vec2: 16783.216533MTexops/s
FP16 Vec3: 8362.369213MTexops/s
FP16 Vec4: 8391.608267MTexops/s

INT16 Vec1: 16551.723891MTexops/s
INT16 Vec2: 16666.666418MTexops/s
INT16 Vec3: 8391.608267MTexops/s
INT16 Vec4: 8391.608267MTexops/s

FP32 Vec1: 16783.216533MTexops/s
FP32 Vec2: 8421.052506MTexops/s
FP32 Vec3: 4203.152302MTexops/s
FP32 Vec4: 4195.804133MTexops/s
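As a quick sanity check, those trilinear figures line up almost exactly with the theoretical peak, assuming the commonly quoted 8800GT configuration of 56 TF units at a 600MHz core clock (not stated in the post itself):

```python
tf_units, core_hz = 56, 600e6
peak_bilinear = tf_units * core_hz     # 33.6 GTex/s bilinear
peak_trilinear = peak_bilinear / 2     # trilinear = two bilinear passes
print(f"Theoretical trilinear peak: {peak_trilinear/1e6:.0f} MTexops/s")
# -> 16800, which matches the measured ~16783 MTexops/s for full-speed
#    formats, and ~8392 / ~4196 for the half- and quarter-rate ones.
```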

As for blending: I agree it's not a big problem, and if you do the calculations it's only a very minor bottleneck. If you look at a 'classic' particle workload, I think depth often isn't even read (hier-z...) and it doesn't need to be written either. As for texturing, you only read one DXT5 texture; that's 1 or 2 bytes per pixel, depending on bilinear or trilinear filtering.

So if you estimate blending to take exactly 12 bytes/pixel for your average particle, counting memory subsystem inefficiencies, then the 8800GT would be perfectly balanced: it requires, and has, exactly 57.6GB/s for that. But I can definitely imagine scenarios where it takes, say, 10 bytes/pixel; then you've just lost 17% performance for that part of the frame. You could argue that this is no longer the case with 4x MSAA, but not every benchmark and/or game is run with AA, obviously.
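Spelled out (the ROP count and clocks are assumed from the usual 8800GT specs; the 12 bytes/pixel is the estimate above, and half-rate blending follows the earlier corrections):

```python
rops, core_hz = 16, 600e6
blend_rate = rops / 2 * core_hz           # half-rate INT8/FP16 blend: 4.8 Gpix/s
bytes_per_pixel = 12                      # estimated particle cost incl. inefficiencies
needed_bw = blend_rate * bytes_per_pixel  # 57.6 GB/s

mem_hz, bus_bits = 900e6, 256             # GDDR3 (DDR), 256-bit bus
available_bw = mem_hz * 2 * bus_bits / 8  # 57.6 GB/s
print(f"needed {needed_bw/1e9:.1f} GB/s vs available {available_bw/1e9:.1f} GB/s")
```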

And while the 8800GT seems mostly balanced, the fact that the blending rate is so borderline means the balance doesn't hold in every SKU; the 8800 Ultra, for example. Anyhow, keeping FP16 at the same speed as INT8 doesn't make a lot of sense to me; but you're right that it doesn't matter much and I should just STFU about this for once! :) It really isn't fair to put this bottleneck on the same footing as triangle setup on G80, as I sometimes did, either...

It is going to be interesting to see what happens when NV switches to GDDR5 though (2500MHz+ vs 900MHz for G92's GDDR3). If they 'only' double the number of ROPs per memory partition, they'd need ~40% higher clock rates to achieve identical ROP performance/bit of bandwidth. While ~850MHz isn't really unrealistic by itself, it might be on a larger chip than G92 on 65/55nm. Hmmm...
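In numbers (memory clocks as quoted above; the doubled-ROP configuration is the stated hypothetical):

```python
gddr3_hz, gddr5_hz = 900e6, 2500e6
bw_ratio = gddr5_hz / gddr3_hz      # ~2.78x bandwidth per pin
rop_ratio = 2                       # hypothetically doubled ROPs per partition
clock_scale = bw_ratio / rop_ratio  # ~1.39 -> ~40% higher core clock needed
print(f"Core clock must rise ~{(clock_scale - 1) * 100:.0f}% "
      f"(e.g. 600 MHz -> ~{600 * clock_scale:.0f} MHz)")
```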
 
It is going to be interesting to see what happens when NV switches to GDDR5 though (2500MHz+ vs 900MHz for G92's GDDR3).
I wouldn't count on GDDR5 being 2.5GHz from the start. Those numbers for new memory technologies seem to be consistently overstated by quite a bit. I think the initial announcements for GDDR4 were 1.8GHz and it should have been available a year ago or so, yet even today you can't get anything higher than 1.4GHz. So I'd actually be surprised if the initially available parts (not just some samples) were 2.5GHz+ rather than a more modest 2GHz or so.
 
Arun: Yes, those numbers are clearer. I didn't know about a bug on G80 (was it a driver bug? If so, has it been fixed?).
Would be interesting to see the same tests done with bilinear though. Would all the results just double, or would there be exceptions?
 