8800 GTX OpenGL extensions

KimB

They're published on nVidia's site:
http://developer.nvidia.com/object/nvidia_opengl_specs.html

Some highlights:
1. EXT_framebuffer_sRGB: sRGB color space framebuffer support. This appears to be a fully gamma-correct framebuffer (with gamma 2.2 assumed).
2. EXT_packed_float: a 32-bit packed pixel format for three unsigned floating point values, with a 5-bit exponent for each channel and 6, 6, and 5 mantissa bits for R, G, and B, respectively. There is no sign bit.
3. EXT_texture_compression_latc and EXT_texture_compression_rgtc: new texture compression formats for two-component textures.
4. EXT_texture_shared_exponent: another 32-bit floating point format, with a 5-bit shared exponent and 9 mantissa bits each for R, G, and B. There is no sign bit. (A decode sketch for both of these unsigned formats follows this list.)
5. NV_depth_buffer_float: support for 32-bit floating point depth buffers that aren't necessarily clamped to the range [0,1], and for a 64-bit packed depth-stencil format (32-bit depth, 8-bit stencil, 24 bits unused).
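
For reference, here's a minimal decode sketch for the two unsigned formats above, based on my reading of the EXT_packed_float and EXT_texture_shared_exponent specs (5-bit exponents with bias 15; the helper names are just illustrative, not part of either extension):

#include <cmath>
#include <cstdint>

// Decode one unsigned-float component: 6 mantissa bits for the 11-bit R/G
// fields, 5 mantissa bits for the 10-bit B field, 5-bit exponent, bias 15.
static float decodeUnsignedFloat(uint32_t bits, int mantissaBits)
{
    uint32_t m = bits & ((1u << mantissaBits) - 1u);
    uint32_t e = bits >> mantissaBits;
    float    f = float(m) / float(1u << mantissaBits);
    if (e == 0)  return std::ldexp(f, -14);              // denormal range
    if (e == 31) return m ? NAN : INFINITY;              // special values
    return std::ldexp(1.0f + f, int(e) - 15);            // normal values
}

// EXT_packed_float (R11F_G11F_B10F): R in the low 11 bits, then G, then B.
static void decodePackedFloat(uint32_t v, float rgb[3])
{
    rgb[0] = decodeUnsignedFloat( v        & 0x7FFu, 6);
    rgb[1] = decodeUnsignedFloat((v >> 11) & 0x7FFu, 6);
    rgb[2] = decodeUnsignedFloat((v >> 22) & 0x3FFu, 5);
}

// EXT_texture_shared_exponent (RGB9E5): three 9-bit mantissas sharing one
// 5-bit exponent; no implicit leading 1, so value = mantissa * 2^(E - 15 - 9).
static void decodeSharedExponent(uint32_t v, float rgb[3])
{
    float scale = std::ldexp(1.0f, int(v >> 27) - 15 - 9);
    rgb[0] = float( v        & 0x1FFu) * scale;
    rgb[1] = float((v >>  9) & 0x1FFu) * scale;
    rgb[2] = float((v >> 18) & 0x1FFu) * scale;
}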
 
Yeah, but the discussion perspective should be different on this forum than on a programming forum :)
 
The two new 32-bit FP texture formats are interesting, in contrast to ATI's 7e3 implementation on Xenon. Is there any information on whether these two formats can be used as framebuffers?
 
EXT_texture_shared_exponent: another 32-bit floating point format, with a 5-bit shared exponent and 9 mantissa bits each for R, G, and B. There is no sign bit.
This is mandated by the DX10 spec, but let me just stress how awesome I think it is. Being able to filter that stuff in hardware means that HDR textures (for the sky, for example) now become cheap and usable. It's still a fair bit more expensive than DXTC textures in terms of bandwidth and storage space, but hey, it's HDR!

When you look at how FP16 filtering works on G7x (and probably G8x, which I'd suspect extends the same principle to FP32, depending on the precision the DX10 spec requires), it makes perfect sense that this is supported. Basically, FP16 filtering is already done via a shared exponent internally. When doing bilinear on 4 texels, it transforms them to high-precision integers with a shared exponent, and then does the filtering on that. The maximum error is quite minimal, but you could still find a corner case where it'd become relatively significant, I think, because it doesn't take the weights into account when determining the shared exponent. So FP32 filtering on G80 might not be fully usable for GPGPU in the unlikely event they'd need filtering in their apps; but that doesn't make it any less awesome for games (variance shadow mapping ftw!)
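
Purely as an illustration of that idea (this is not NVIDIA's actual hardware path, and the internal precision is a guess), a shared-exponent bilinear filter might look something like this; note that the exponent is chosen from the texel magnitudes alone, which is exactly where the corner case above comes from:

#include <algorithm>
#include <climits>
#include <cmath>

static float bilinearSharedExponent(const float texel[4], const float weight[4])
{
    const int kInternalBits = 24;    // assumed internal fixed-point precision

    // Shared exponent = exponent of the largest texel; the filter weights are
    // not considered, so a heavily weighted but tiny texel can lose bits.
    int sharedExp = INT_MIN;
    for (int i = 0; i < 4; ++i) {
        int e = 0;
        std::frexp(texel[i], &e);
        sharedExp = std::max(sharedExp, e);
    }

    // Convert each texel to an integer at that scale, blend, convert back.
    double sum = 0.0;
    for (int i = 0; i < 4; ++i) {
        long long fixedPoint =
            std::llround(std::ldexp(texel[i], kInternalBits - sharedExp));
        sum += double(weight[i]) * double(fixedPoint);
    }
    return float(std::ldexp(sum, sharedExp - kInternalBits));
}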

And the FP32 depth buffer and FP32+stencil depth buffer are awesome too. Does anyone know if those are required by D3D10? It'd be nice if that became standard for the next generation, as it makes some techniques a bit less expensive, or at least higher quality, depending on how you handle it.


Uttar
 
The two new 32-bit FP texture formats are interesting, in contrast to ATI's 7e3 implementation on Xenon. Is there any information on whether these two formats can be used as framebuffers?
Well, the two new FP formats were certainly written in the spec as if they were primarily texture formats. I'll see if I can't find out, though.

Edit: Looked a bit more closely, and it looks like one of them can be rendered to. From the EXT_packed_float extension:
"This extension also provides support for rendering into an unsigned
floating-point rendering format with the assumption that the texture
format described above could also be advertised as an unsigned
floating-point format for rendering."

The shared exponent format, however, appears to be only a texture format.
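
If you want to check that in practice, something like this should do it (sketch only, assuming your GL headers/loader expose the EXT_packed_float enum and the EXT_framebuffer_object entry points):

#include <GL/glew.h>   // or any loader that provides the EXT enums/functions

// Create an R11F_G11F_B10F texture, attach it to an FBO, and see whether the
// driver reports the framebuffer as complete (i.e. renderable).
bool packedFloatIsRenderable(int width, int height)
{
    GLuint tex = 0, fbo = 0;
    glGenTextures(1, &tex);
    glBindTexture(GL_TEXTURE_2D, tex);
    glTexParameteri(GL_TEXTURE_2D, GL_TEXTURE_MIN_FILTER, GL_NEAREST);
    glTexImage2D(GL_TEXTURE_2D, 0, GL_R11F_G11F_B10F_EXT,
                 width, height, 0, GL_RGB, GL_FLOAT, NULL);

    glGenFramebuffersEXT(1, &fbo);
    glBindFramebufferEXT(GL_FRAMEBUFFER_EXT, fbo);
    glFramebufferTexture2DEXT(GL_FRAMEBUFFER_EXT, GL_COLOR_ATTACHMENT0_EXT,
                              GL_TEXTURE_2D, tex, 0);

    bool ok = glCheckFramebufferStatusEXT(GL_FRAMEBUFFER_EXT)
              == GL_FRAMEBUFFER_COMPLETE_EXT;

    glDeleteFramebuffersEXT(1, &fbo);
    glDeleteTextures(1, &tex);
    return ok;
}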
 
So FP32 filtering on G80 might not be fully usable for GPGPU in the unlikely event they'd need filtering in their apps; but that doesn't make it any less awesome for games (variance shadow mapping ftw!)
Yeah I've just been playing around with that this morning, and it's a beautiful thing:

- No more nonsense of choosing texture formats based on some feature matrix... choose what you want and it "just works"! No silly performance cliffs either.

- VSM can now be trivially implemented by rendering to a 2x FP32 target, with full filtering (trilinear + aniso). Hell, you can even turn on multisampling, by my understanding (haven't tested that one yet... a few more code changes), to get nice anti-aliased shadow edges. (There's a small visibility-test sketch after this list.)

- This card is just stupid-fast... I've been throwing big workloads at it and getting >300fps. I think the options for implementing even better filtering in software are numerous now. I haven't tested, but it would be neat if software filtering were nearly as fast as hardware (should it be, by design?). Even if it isn't though, it's still ridiculously fast.
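
For context on the VSM item above, this is the visibility test (Chebyshev's inequality) as I understand the technique, written out on the CPU for clarity; on the GPU the same few operations run in the pixel shader on the hardware-filtered (E[z], E[z^2]) pair fetched from the FP32 shadow map:

#include <algorithm>

float vsmVisibility(float meanZ, float meanZSq, float receiverZ)
{
    if (receiverZ <= meanZ)
        return 1.0f;                                   // receiver in front: fully lit

    // Variance of the depth distribution over the filter footprint.
    float variance = std::max(meanZSq - meanZ * meanZ, 1e-6f);
    float d        = receiverZ - meanZ;

    // Chebyshev upper bound on the fraction of the footprint not in shadow.
    return variance / (variance + d * d);
}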

I am nothing but impressed with this card so far. It opens up so many possibilities for graphics and otherwise, and takes a large burden off of the programmer in terms of "hacking around" card features.

Next up, let's see how this beast can handle dynamic branching!
 
VSM can now be trivially implemented by rendering to a 2x FP32 target, with full filtering (trilinear + aniso). Hell, you can even turn on multisampling, by my understanding (haven't tested that one yet... a few more code changes), to get nice anti-aliased shadow edges.
I think it'd be MUCH faster to render with no colour buffer and an FP32 depth buffer for VSM, and then do a second fullscreen pass to compute z*z into another texture. Potentially, you could even do it as another depth-only pass, using the pixel shader to modify depth based on reading the z-buffer of your previous pass! I suspect the much higher Z/clock rate still applies when you're doing that, but I didn't test it, obviously.
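
A rough outline of that two-pass idea (sketch only: renderShadowCasters and drawFullscreenQuad are hypothetical app-side helpers, and the fullscreen pass's fragment shader, which reads the depth texture and writes z*z, is omitted):

#include <GL/glew.h>   // assumes the EXT_framebuffer_object entry points are loaded

void renderShadowCasters();                 // hypothetical scene callback
void drawFullscreenQuad(GLuint program);    // hypothetical fullscreen-quad helper

void renderVsmMoments(GLuint depthOnlyFbo, GLuint momentsFbo,
                      GLuint depthTexture, GLuint momentsProgram)
{
    // Pass 1: depth-only shadow render, no colour buffer bound (fast-Z path).
    glBindFramebufferEXT(GL_FRAMEBUFFER_EXT, depthOnlyFbo);
    glDrawBuffer(GL_NONE);
    glReadBuffer(GL_NONE);
    glClear(GL_DEPTH_BUFFER_BIT);
    renderShadowCasters();

    // Pass 2: fullscreen quad that reads the depth texture and writes the
    // squared-depth moment into a float colour texture for later filtering.
    glBindFramebufferEXT(GL_FRAMEBUFFER_EXT, momentsFbo);
    glDrawBuffer(GL_COLOR_ATTACHMENT0_EXT);
    glBindTexture(GL_TEXTURE_2D, depthTexture);
    drawFullscreenQuad(momentsProgram);
}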

Given you'll be using trilinear filtering, your rate for a 1xFP32 texture is 32 TEX/clk; if you used a 2xFP32 texture, it'd only be 16 TEX/clk. So as far as I can see, and if those numbers are correct (which they ought to be), using two 1xFP32 textures instead of a single 2xFP32 one should be just as fast. That, plus the optimization I proposed above, could result in very nice speed gains.

Hell, you can even turn on multisampling, by my understanding (haven't tested that one yet... a few more code changes), to get nice anti-aliased shadow edges.
I think you can, yes. Remember the Z-rate of the G80 is the same when you're rendering to a 2048x2048 buffer as when you're doing 4x MSAA on a 1024x1024 one, though! So all you get for it is a rotated grid instead of an ordered one, imo. And for a shadow map, I'm not sure that's really an advantage, although I haven't thought about it too much at this point, to be honest.

I haven't tested, but it would be neat if software filtering were nearly as fast as hardware (should it be, by design?). Even if it isn't though, it's still ridiculously fast.
Software filtering shouldn't be as fast as hardware filtering. I suspect that given the number of MADDs you've got there, though, you could get a quite decent implementation going that wouldn't be stupidly slower for filtering FP32 data. Remember you basically got the equivalent of Fetch4 on the G80 as D3D10 requires it, although I didn't check how that's implemented in OpenGL yet. And some nice low-level stuff is available that should help with software filtering too, like TXQ. So tons of potential there definitely, although you shouldn't expect amazing performance either - but it might be very usable for some algorithms anyway.

I am nothing but impressed with this card so far. It opens up so many possibilities for graphics and otherwise, and takes a large burden off of the programmer in terms of "hacking around" card features.
Yup, can't wait to finally get my own eventually... :)


Uttar
 
I suspect the much higher Z/clock rate still applies when you're doing that, but I didn't test it, obviously.
If you write fp32 Z from the shader, it'll be just as fast as if you wrote 1 fp32 color from the shader. You don't get the magic Fast Z mode when shader writes Z.

Remember you basically got the equivalent of Fetch4 on the G80 as D3D10 requires it
D3D10.0 doesn't have Fetch4 (or equivalent), and neither does G80 :(
 
If you write fp32 Z from the shader, it'll be just as fast as if you wrote 1 fp32 color from the shader. You don't get the magic Fast Z mode when shader writes Z.
Ah well. Now that I think about it, that makes perfect sense: if your Z compression works via planes, and the trick you're using is sending compressed data throughout the chip, changing depth per-pixel couldn't possibly give you a speed advantage here. The optimization I proposed would still work though except for that specific part, so I'd still assume it to give a nice speed boost.
D3D10.0 doesn't have Fetch4 (or equivalent), and neither does G80 :(
Awww :( Hopefully the D3D10.1 refresh will, then.


Uttar
 
if your Z compression works via planes, and the trick you're using is sending compressed data throughout the chip, changing depth per-pixel couldn't possibly give you a speed advantage here.
You don't need to use alternative representations (plane/barycentrics/whatever) to go around that limit. If you've got a limited data path to memory/ROP out of your shader, then you'll run at that rate, regardless of everything else. And if that path is narrower than 70 GSamples/sec (or whatever the Z rate was measured to be), then you'd be limited by that.
 
You don't need to use alternative representations (plane/barycentrics/whatever) to go around that limit. If you've got a limited data path to memory/ROP out of your shader, then you'll run at that rate, regardless of everything else. And if that path is narrower than 70 GSamples/sec (or whatever the Z rate was measured to be), then you'd be limited by that.
Ah, that's true - there's no good reason that I can see to allow a fair bit more pixels to enter or exit the shader core per cycle than the ROPs could handle in 99.9% of cases. The only other case where this would matter is if you tried to KILL pixels in a z-only pass based on something, but without modifying the depth (then, compression wouldn't even be a factor). I can't see much of anything that'd require that, but I'm still mentioning it since it could be an easy way to verify that limit on any architecture where KILL isn't too expensive.

Clearly, for various implementation-simplicity reasons, it might still be slightly higher than what the ROPs allow (or slightly lower, but then you couldn't expose the ROPs' full rate even in the most basic of cases) - and it can't hurt to be able to refill the pipelines slightly faster if they've been idling, I'd imagine! (for example, after a z-only pass...)

Although at least on G8x, that should only reduce latency tolerance once a sufficient number of objects are available to the cluster to hide its units' own latency - on G7x, that'd just kill overall performance no matter what, since the texture unit and its related latency hiding weren't decoupled! And that happened more often than just when refilling the pipelines, AFAIK.


Uttar
EDIT: Just thought I'd mention that the reason I'm saying you'd have to fill up the shader core after a z-only pass is that I don't think the VS would be the bottleneck there most of the time. But yeah, obviously, at least it wouldn't be completely empty; unlike on a non-unified architecture, where your pixel shading pipelines would obviously have nothing in them.
 
I think it'd be MUCH faster to render with no colour buffer and an FP32 depth buffer for VSM, and then do a second fullscreen pass to compute z*z into another texture.
Depth textures aren't very good for VSM in my experience... the nonlinear precision is highly problematic, and to my knowledge they don't support filtering anyway, so the data would just have to be copied to a standard texture regardless. Plus, the G80 doesn't seem to suffer any performance hit for MRTs (yay!), so the two components can still be rendered just fine in one pass, even to two textures if necessary.
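
For the one-pass MRT route, the setup would be roughly this (sketch; zTexture and zSquaredTexture are hypothetical float textures created elsewhere, and the fragment shader writes z to output 0 and z*z to output 1):

#include <GL/glew.h>   // assumes EXT_framebuffer_object and glDrawBuffers are available

void bindMomentTargets(GLuint fbo, GLuint zTexture, GLuint zSquaredTexture)
{
    glBindFramebufferEXT(GL_FRAMEBUFFER_EXT, fbo);
    glFramebufferTexture2DEXT(GL_FRAMEBUFFER_EXT, GL_COLOR_ATTACHMENT0_EXT,
                              GL_TEXTURE_2D, zTexture, 0);
    glFramebufferTexture2DEXT(GL_FRAMEBUFFER_EXT, GL_COLOR_ATTACHMENT1_EXT,
                              GL_TEXTURE_2D, zSquaredTexture, 0);

    // Route fragment shader outputs 0 and 1 to the two attachments.
    const GLenum bufs[2] = { GL_COLOR_ATTACHMENT0_EXT, GL_COLOR_ATTACHMENT1_EXT };
    glDrawBuffers(2, bufs);
}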

Software filtering shouldn't be as fast as hardware filtering.
It would be really nice if it got closer than previous generations... if there's still dedicated hardware doing the filtering (rather than just the ALUs), that's disappointing.

Remember you basically got the equivalent of Fetch4 on the G80 as D3D10 requires it, although I didn't check how that's implemented in OpenGL yet.
I wasn't aware of Fetch4 working on G80... that said, it's a bit limited anyway, since it only applies to 1-component textures. Sure, one can always split stuff up, but at some point (four 1-component textures?) it's probably going to be slower again. There's the frac/etc. involved in calculating texture addresses, but other than that there shouldn't be much overhead to the software filtering method (especially with unnormalized lookups now).
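
As a CPU-side reference for the arithmetic involved - the same floor/frac/lerp a shader-side software filter would do on unnormalized coordinates (fetchTexel here is a hypothetical point-sample lookup supplied by the caller):

#include <cmath>

float softwareBilinear(float u, float v,                 // unnormalized texel coords
                       float (*fetchTexel)(int x, int y))
{
    // Texel centres sit at half-integer coordinates, so shift by 0.5 before
    // splitting into an integer address and fractional weights.
    float fu = std::floor(u - 0.5f), fv = std::floor(v - 0.5f);
    float fx = (u - 0.5f) - fu,      fy = (v - 0.5f) - fv;

    int x = int(fu), y = int(fv);
    float t00 = fetchTexel(x,     y);
    float t10 = fetchTexel(x + 1, y);
    float t01 = fetchTexel(x,     y + 1);
    float t11 = fetchTexel(x + 1, y + 1);

    // Two horizontal lerps, then one vertical lerp.
    float top    = t00 + fx * (t10 - t00);
    float bottom = t01 + fx * (t11 - t01);
    return top + fy * (bottom - top);
}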

And some nice low-level stuff is available that should help for software filtering too, like TXQ.
TXQ?

So tons of potential there definitely, although you shouldn't expect amazing performance either - but it might be very usable for some algorithms anyway.
It's actually quite usable even on previous hardware... fp32 bilinear was fairly fast even on 7x00's and X1x00's. Not as fast as hardware of course, but not too bad.

Anyway, I'm still thinking that rendering to a single texture is probably best. I'm probably going to need at least 4 components (for some extensions that I'm doing), so Fetch4-type lookups - even if supported - no longer buy me much, by my understanding. Furthermore, the texture generation pass is usually not the bottleneck... it's the filtering on the other end that eats power :)
 
Depth textures [...] to my knowledge they don't support filtering anyway, so the data would just have to be copied to a standard texture regardless.
Depth textures have been filtered since GeForce FX; although the filtering precision was low there, it has improved with GeForce 6 and again with G80, as they added filtering hardware for more texture formats.

A "Texture Query" instruction. See NV_gpu_program4 for details.
 