On textures and compression

n00body

Newcomer
*note*: this question pertains to both the PS3 and the 360 as separate entities. This is not a comparison thread, and I will not tolerate others flooding it with one-line flamebait/console-versus comments. I would ask that devs who have actually been coding for these machines be the primary respondents. With that understood, my question is:

I would like to inquire about the sorts of texture formats and compression employed by devs in development for the 360 and the PS3. Is it mostly common PC formats, or are there system-specific types?

From my current understanding, I suspect that the popular choice for the 360 is the .dds format, since it combines a nice feature set in a standardized package and is part of DirectX. And with it being an ATI chip inside, 3Dc is practically a given for 360 normal maps. Anything else to add to, or complement, this list?

For the PS3, I had heard about several texture formats called RGBA####, where each '#' corresponds to the bits allotted to the corresponding color value and alpha of each variant. Aside from that, I have heard that the PS3 is capable of using .dds and 3Dc as well, since the two are relatively open. Anything to pile on without being hit with an NDA?
 
S3TC and some kind of lossless on top to stretch main-mem a bit further, and that's about it I think.

S3TC is the only thing supported in hardware on the GPU/VPU AFAWK.
You could probably make the shaders do some work, but then the result would have to be sent to main mem, wasting bandwidth.

I think a combination of procedural textures for stuff that's suited to that approach, and virtual textures/clipmapping for explicitly defined textures, is the way forward.
 
Aren't there any new and better compression schemes they could use (besides the "new" 3Dc)? After all, all of this is relatively old tech.

Or could they just use the CPU to apply any compression they want, with L2 cache locking? Would that be better?
 
pc999 said:
Or could they just use the CPU to apply any compression they want, with L2 cache locking? Would that be better?
No, because you need to decompress on the other end. If the textures are compressed using PC999 compression, the GPU is going to need to run PC999 decompression on them to use them. I also don't think there's much scope for better lossless compression, certainly not for a notable amount of BW savings over existing compression schemes.
 
I was thinking of decompressing it on the CPU and passing it, already decompressed, to the GPU via the L2.

I was also thinking about this; there should be something in between that can beat 6:1 and give a boost in texture quality.
 
pc999 said:
I was also thinking about this; there should be something in between that can beat 6:1 and give a boost in texture quality.

I think neither one was talking about something like you said (CPU compression). Carmack was most likely alluding to his "Megatexture" thing...
 
I know Yann L said that paletted textures offered some very nice compression, but as that isn't supported anymore, it's not so useful. However, with fast dynamic branching and large register arrays, it might be possible to do a paletted texture lookup in the pixel shader.

In fact, if you laid out your palette colours intelligently, placing similar colours near each other, you could probably compress your grayscale index texture as well. Maybe get 10:1 compression. If you need more quality, perhaps place two indices in each component of the index texture, then blend the respective colours in the shader. Then again, this wouldn't work with AF, so it has its problems. Still, I suspect it would be best applied only to foreground textures anyway, which don't need AF so much, because those textures would take up the largest screenspace and mean the fewest shader switches.

EDIT:

Just looked at the DX10 specs, and it's better than I thought. You can create constant buffers representing your palettes and just share them between shaders. Since you can have up to 4096 palette colours per buffer and 16 buffers bound per shader, there should be no problems for all the textures you might need.
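
To make the idea concrete, here's a rough CPU-side sketch of the lookup the shader would be doing. The 256-entry palette, the two-index 8:8 layout and the 50/50 blend are just my assumptions for illustration, not anything taken from DX10 itself.

```c
/* CPU-side sketch of the paletted lookup a pixel shader could do with
 * dynamic branching / shared constant buffers. The 256-entry palette,
 * the 8:8 two-index layout and the 50/50 blend are all illustrative
 * assumptions, not anything from an actual API. */
#include <stdint.h>

typedef struct { float r, g, b, a; } rgba;

/* Plain paletted fetch: a single 8-bit index per texel. */
static rgba sample_paletted(const uint8_t *indices, int width,
                            const rgba palette[256], int x, int y)
{
    return palette[indices[y * width + x]];
}

/* Higher-quality variant: an 8:8 index texture carrying two indices per
 * texel, whose palette colours are blended in the "shader". */
static rgba sample_two_index_blend(const uint8_t *indices, int width,
                                   const rgba palette[256], int x, int y)
{
    const uint8_t *texel = &indices[(y * width + x) * 2];
    rgba c0 = palette[texel[0]];
    rgba c1 = palette[texel[1]];
    rgba out = { (c0.r + c1.r) * 0.5f, (c0.g + c1.g) * 0.5f,
                 (c0.b + c1.b) * 0.5f, (c0.a + c1.a) * 0.5f };
    return out;
}
```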
 
pc999 said:
I was thinking of decompressing it on the CPU and passing it, already decompressed, to the GPU via the L2.
That would save main RAM BW but wouldn't be possible, as the GPU goes looking for textures. If there were a mechanism for the GPU to request a texture via CPU, the CPU would have to fetch, decompress and deliver the texture to GPU with very high latency. The hardwired texture compression schemes are directly accessed and used on the GPU.

As has been noted, the point of texture compression is chiefly to save BW, not memory (though reducing textures to 1/6th their size is obviously a good thing when you only have 512 MB RAM total!), and you need a compression scheme that works well for that purpose...
I was also thinking about this; there should be something in between that can beat 6:1 and give a boost in texture quality.
As the discussion in the thread points out, the requirements for texture compression to be effective at saving GPU BW are very different from those of schemes used to reduce picture size and save on storage.

I guess one option could be substantial texture caches and a more complex compression scheme like JPEG2000 which decompresses into the texture cache. In the context of what schemes the next-gen consoles are providing, though, I don't think there are any new tricks in the hardware. At least we haven't heard of any to my knowledge.
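
For concreteness, the 6:1 figure corresponds to DXT1/S3TC's 4 bits per pixel against 24-bit RGB (against 32-bit RGBA it's 8:1); a quick back-of-the-envelope check:

```c
/* Back-of-the-envelope texture sizes: DXT1 packs a 4x4 texel block into
 * 64 bits, i.e. 4bpp, which is where the roughly 6:1 figure comes from
 * (relative to 24-bit RGB; it's 8:1 relative to 32-bit RGBA). */
#include <stdio.h>

int main(void)
{
    const long w = 1024, h = 1024;
    long rgb24  = w * h * 3;   /* 3,145,728 bytes */
    long rgba32 = w * h * 4;   /* 4,194,304 bytes */
    long dxt1   = w * h / 2;   /*   524,288 bytes */

    printf("RGB24 : %ld bytes\n", rgb24);
    printf("RGBA32: %ld bytes\n", rgba32);
    printf("DXT1  : %ld bytes (%.0f:1 vs RGB24, %.0f:1 vs RGBA32)\n",
           dxt1, (double)rgb24 / dxt1, (double)rgba32 / dxt1);
    return 0;
}
```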
 
Shifty Geezer said:
If there were a mechanism for the GPU to request a texture via CPU, the CPU would have to fetch, decompress and deliver the texture to GPU with very high latency.
I guess you meant "low latency" :) The option you described is trivial :LOL:


I guess one option could be substantial texture caches and a more complex compression scheme like JPEG2000 which decompresses into the texture cache. In the context of what schemes the next-gen consoles are providing, though, I don't think there are any new tricks in the hardware. At least we haven't heard of any to my knowledge.
The trouble is that it may not help for cases where you are only sampling a small subset of a large texture. JPEG schemes are not great for random access and so decompressing the lot for a 32x32 pixel subset could be very expensive.
 
SimonF said:
JPEG schemes are not great for random access and so decompressing the lot for a 32x32 pixel subset could be very expensive
Well, macroblocks can be decoded independently; you don't need to decode the whole map to get one out.
Yes, they aren't aligned to anything, and variable size will make any hw implementation a pain in the ass, but then again the macroblock decode block itself will be orders of magnitude more complex than something that unpacks your typical VQ schemes, so the overhead from dealing with variable-sized blocks could still be minor, relatively speaking.

The real question is whether all the real estate dedicated to such a decoder would be worth it.
 
JPEG 2000 appears to support random access:

http://www.jpeg.org/jpeg2000/j2kpart9.html

This is for Xenos:

Textures are fetched from main memory through a 32-KB, 16-way set-associative texture cache. The texture cache is optimized for 2D and 3D, single-element, high-reuse data types. The purpose of the texture cache is to minimize redundant fetches when bilinear-filtering adjacent samples, not to hold entire textures.

The texture samplers support per-pixel mipmapping with bilinear, trilinear, and anisotropic filtering. Trilinear filtering runs at half the rate of bilinear filtering. Anisotropic filtering is adaptive, so its speed varies based on the level of anisotropy required. Textures that are not powers of two in one or more dimensions are supported with mipmapping and wrapping.

The texture coordinates can be clamped inside the texture polygon when using multisample antialiasing to avoid artifacts that can be caused by sampling at pixel centers. This is known as centroid sampling in Direct3D 9.0 and can be specified per interpolator by the pixel shader writer.

The following texture formats are supported:
  • 8, 8:8, 8:8:8:8
  • 1:5:5:5, 5:6:5, 6:5:5, 4:4:4:4
  • 10:11:11, 11:11:10, 2:10:10:10
  • 16-bit per component fixed point (one-, two-, and four-component)
  • 32-bit per component fixed point (one-, two-, and four-component)
  • 16-bit per component floating point (limited filtering)
  • 32-bit per component floating point (no filtering)
  • DXT1, DXT2, DXT3, DXT4, DXT5
  • 24:8 fixed point (matches z-buffer format)
  • 24:8 floating point (matches z-buffer format)
  • New compressed formats for normal maps, luminance, and so on, as described following
Fetching up to 32-bit deep textures runs at full speed (one bilinear cycle), fetching 64-bit deep textures runs at half-speed, and fetching 128-bit deep textures runs at quarter speed. A special fast mode exists that allows four-component, 32-bit-per-component floating-point textures to be fetched at half-speed rather than quarter-speed. The packed 32-bit formats (10:11:11, 11:11:10, 2:10:10:10) are expanded to 16 bits per component when filtered so they run at half speed. Separate nonfilterable versions of these formats exist that run at full speed.

When filtering 16-bit per component floating-point textures, each 16-bit value is expanded to a 16.16 fixed-point value and filtered, potentially clamping the range of the values. The total size of the expanded values determines at which rate the sampling operates (full speed for one-component sampling, half speed for two-component sampling, and quarter speed for four-component sampling). Separate nonfilterable 16-bit-per-component floating-point formats also exist.

DXT1 compressed textures are expanded to 32 bits per pixel, resulting in a significant improvement in quality over the expansion to 16 bits per pixel that existed on Xbox.

The following new compressed texture formats are available:
  • DXN—a two-component 8-bit-per-pixel format made up of two DXT4/5 alpha blocks
  • DXT3A—a single-component 4-bit-per-pixel format made up of a DXT2/3 alpha block
  • DXT5A—a single-component 4-bit-per-pixel format made up of a DXT4/5 alpha block
  • CTX1—a two-component 4-bit-per-pixel format similar to DXT1 but with 8:8 colors instead of 5:6:5 colors
  • DXT3A_AS_1_1_1_1—a four-component format encoded in a DXT2/3 alpha block where each bit is expanded into a separate channel
The texture formats with eight or fewer bits per component, including the compressed texture formats, can be gamma corrected using an approximation to the sRGB gamma 2.2 curve to convert from gamma space to linear light space. This correction happens for free and is applied to sampled data prior to any texture filtering.

A special type of texture fetch can index into a texture "stack" that is up to 64 textures deep. A texture stack is a set of 2D textures (potentially with mipmaps) that are stored contiguously in memory.
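
(As an aside, here's a rough C sketch, written from memory, of how a single 64-bit DXT1 block expands to 16 texels; the other DXT-family formats listed above build on similar endpoint-plus-index blocks. Treat it as illustrative rather than spec-exact.)

```c
/* Illustrative decode of one 64-bit DXT1 block into a 4x4 group of RGBA
 * texels. Written from memory of the standard S3TC layout, so treat it
 * as a sketch rather than a spec-exact reference (rounding may differ). */
#include <stdint.h>

typedef struct { uint8_t r, g, b, a; } rgba8;

static rgba8 expand565(uint16_t c)            /* 5:6:5 -> 8:8:8:8 */
{
    rgba8 o;
    o.r = (uint8_t)(((c >> 11) & 0x1F) * 255 / 31);
    o.g = (uint8_t)(((c >>  5) & 0x3F) * 255 / 63);
    o.b = (uint8_t)(( c        & 0x1F) * 255 / 31);
    o.a = 255;
    return o;
}

static rgba8 blend(rgba8 a, rgba8 b, int wa, int wb)
{
    rgba8 o;
    o.r = (uint8_t)((wa * a.r + wb * b.r) / (wa + wb));
    o.g = (uint8_t)((wa * a.g + wb * b.g) / (wa + wb));
    o.b = (uint8_t)((wa * a.b + wb * b.b) / (wa + wb));
    o.a = 255;
    return o;
}

static void decode_dxt1_block(const uint8_t block[8], rgba8 out[16])
{
    uint16_t c0  = (uint16_t)(block[0] | (block[1] << 8));
    uint16_t c1  = (uint16_t)(block[2] | (block[3] << 8));
    uint32_t idx = (uint32_t)block[4]         | ((uint32_t)block[5] << 8) |
                   ((uint32_t)block[6] << 16) | ((uint32_t)block[7] << 24);
    rgba8 pal[4];

    pal[0] = expand565(c0);
    pal[1] = expand565(c1);
    if (c0 > c1) {                            /* four-colour mode */
        pal[2] = blend(pal[0], pal[1], 2, 1);
        pal[3] = blend(pal[0], pal[1], 1, 2);
    } else {                                  /* three colours + transparent */
        pal[2] = blend(pal[0], pal[1], 1, 1);
        pal[3].r = pal[3].g = pal[3].b = pal[3].a = 0;
    }

    for (int i = 0; i < 16; i++)              /* 2-bit index per texel */
        out[i] = pal[(idx >> (2 * i)) & 0x3];
}
```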

Jawed
 
Fafalada said:
Well, macroblocks can be decoded independently; you don't need to decode the whole map to get one out.
But (with JPEG) they are Huffman encoded, so unless you store an auxiliary structure with bit-level pointers to the start of each macroblock, you do have to "decode" the Huffman data until you get to the end of the macroblock in question.

Now if a scheme uses adaptive Huffman or arithmetic encoding, then you can't even do that.
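
A minimal sketch of the auxiliary structure described above: build a per-macroblock table of bit offsets in one sequential pass, after which any block can be decoded without walking through its predecessors. The skip_macroblock/decode_macroblock calls are placeholders I've made up to stand for a real entropy decoder; they are not functions from any actual JPEG library.

```c
/* Sketch of the auxiliary structure described above: a table of bit
 * offsets, one per macroblock, built in a single sequential pass so
 * that any macroblock can later be decoded without touching the ones
 * before it. skip_macroblock()/decode_macroblock() stand in for a real
 * entropy decoder; they are not functions from any actual JPEG library. */
#include <stddef.h>
#include <stdint.h>

typedef struct {
    const uint8_t *bitstream;   /* entropy-coded data                */
    size_t         num_blocks;  /* macroblocks in the image          */
    uint64_t      *bit_offset;  /* bit_offset[i] = start of block i  */
} block_index;

/* Placeholder: a real decoder would parse variable-length Huffman data
 * to find where the next macroblock starts. */
uint64_t skip_macroblock(const uint8_t *bits, uint64_t cursor);

/* Placeholder: decode one 16x16 macroblock starting at a bit offset. */
void decode_macroblock(const uint8_t *bits, uint64_t cursor,
                       uint8_t out_texels[16 * 16]);

/* One O(n) sequential pass to build the index... */
void build_block_index(block_index *idx)
{
    uint64_t cursor = 0;
    for (size_t i = 0; i < idx->num_blocks; i++) {
        idx->bit_offset[i] = cursor;
        cursor = skip_macroblock(idx->bitstream, cursor);
    }
}

/* ...after which a texture fetch can decode just the block it needs. */
void fetch_block(const block_index *idx, size_t block,
                 uint8_t out_texels[16 * 16])
{
    decode_macroblock(idx->bitstream, idx->bit_offset[block], out_texels);
}
```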
 
Didn't the Jaguar use some kind of simple scanline-based realtime JPEG-like decompression?

Some kind of simple realtime DCT-based compression must be possible nowadays.
 
Simon, the best you can do with fixed rate coding is screw around in the margins. There is just not much more to gain there anymore, I think you screwed around enough in that respect ;)

You wouldn't want to use Huffman or arithmetic coding for coefficients, obviously, but I would guess you could get a decent 4x4 transform coder well below 10 cycles of latency (not that I have really sat down and tried to check my intuition about how well such a coder could compress, or the feasibility of the necessary parallel decoding).

As for adaptive coding, I think backwards adaptation is highly overrated.
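
To give an idea of the scale involved, the cheapest building block for such a 4x4 transform coder would be a Walsh-Hadamard butterfly over rows and columns. The sketch below is only an illustration of that transform stage (my choice of transform, not anything specified here); the quantisation and coefficient coding that would make it an actual coder are left out.

```c
/* A 4-point Walsh-Hadamard butterfly applied to the rows and then the
 * columns of a 4x4 block: about the cheapest transform stage a "4x4
 * transform coder" could be built on. Quantisation and coefficient
 * coding are omitted entirely. */
static void hadamard4(int v[4])
{
    int a = v[0] + v[2], b = v[0] - v[2];
    int c = v[1] + v[3], d = v[1] - v[3];
    v[0] = a + c;  v[1] = b + d;
    v[2] = a - c;  v[3] = b - d;
}

/* Forward 4x4 transform of a block of texel values (in place). The
 * inverse is the same butterfly pair followed by a divide by 16. */
static void wht4x4(int block[4][4])
{
    int i, j, col[4];
    for (i = 0; i < 4; i++)              /* rows */
        hadamard4(block[i]);
    for (j = 0; j < 4; j++) {            /* columns */
        for (i = 0; i < 4; i++) col[i] = block[i][j];
        hadamard4(col);
        for (i = 0; i < 4; i++) block[i][j] = col[i];
    }
}
```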
 
MfA said:
Simon, the best you can do with fixed rate coding is screw around in the margins. There is just not much more to gain there anymore, I think you screwed around enough in that respect ;)
I wouldn't say the latest compression scheme from Mister F is screwing around in the margins. 2bpp with alpha and a minimal silicon footprint is pretty darn impressive, and a substantial improvement over S3TC.
But of course you are right that there is limited scope for further improvement on a general fixed rate scheme beyond this point.
 
They're getting 10:1 ratio.
Except it's not really entirely attributable to "compression" in the same sense being talked about -- that is to say, file-level compression as opposed to "effective compression" of some "virtual" result constructed out of multiple files. It's more about deconstructing down to the tiled constituents out of which the image was built, meaning that you get an effective compression ratio. It's those constituents that are further compressed, and the ratio on those is not 10:1 at all.

JPEG 2000 appears to support random access:
Makes sense, given that J2k is a wavelet-based scheme, so you can probably just point to any old single pixel in the low-pass and, based on where it is in the image, know where to look in all the highpass subbands (assuming you've decoded down to an image of the transform itself). It also suggests that the image could be used in such a way as to get implicit miplevels for free (i.e. just don't decode all the way down the band hierarchy).
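
As a toy illustration of the "implicit miplevels" point: in a one-level Haar decomposition the low-pass band is already a half-resolution version of the image, so stopping part-way down the band hierarchy is effectively stopping at a coarser mip. JPEG 2000 uses longer filters than plain Haar, so the sketch below (my own, with made-up names) only shows the general idea.

```c
/* Toy one-level 2D Haar-style decomposition: the LL (low-pass) band is
 * already a half-resolution average of the image, i.e. roughly the next
 * mip level. The three high-pass bands are dropped for brevity; keeping
 * them is what allows the full-resolution image to be reconstructed.
 * (The orthodox Haar normalisation differs by a constant factor.) */
#include <stdint.h>

/* src is w x h (w, h even); ll receives the (w/2) x (h/2) low-pass band. */
static void haar_ll(const uint8_t *src, int w, int h, uint8_t *ll)
{
    for (int y = 0; y < h; y += 2) {
        for (int x = 0; x < w; x += 2) {
            int sum = src[y * w + x]       + src[y * w + x + 1] +
                      src[(y + 1) * w + x] + src[(y + 1) * w + x + 1];
            ll[(y / 2) * (w / 2) + x / 2] = (uint8_t)(sum / 4);
        }
    }
}
```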
 