FX color compression

nVidia, however, seems to be mostly insisting on how good they are at using their 128-bit memory bus when 4x MSAA is enabled.
Why would they want to brag about that, anyway? Well, that's easy to answer. If all they did was use some type of flags, the waste could become quite problematic.
So, nVidia says that waste is nearly non-existent when using 4x MSAA.
ATI, on the other hand, brags about having 6:1 Color Compression, without ever saying how much of their 256-bit memory bus is wasted when they achieve such a compression. So we can't really say which is the more efficient.

nVidia's "4:1" compression claim would mean that instead of writing/reading 4 (sub)pixels @ 32BPP over their 128-bit memory bus, they actually write/read up to *16* (sub)pixels with it in optimal cases.
It would make sense to have flags saying which subpixel colors are equal, i.e. which of the stored colors each subpixel uses.
So you've already got 32 bits reserved for flags: there are at most 16 subpixels, each picking one of at most 4 colors. Picking one of 4 takes 2 bits, so it's 16x2=32.

That leaves 96 bits available for the (sub)pixel colors. IIRC, writing alpha to the frame buffer is useless in most cases, so there are really only 24 bits per color to write. And 24*4 = 96.
Someone please correct me if writing alpha isn't useless.
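
To double-check the bit budget, here's a toy sketch of the layout I'm imagining (the packing and names are my own assumption, not anything nVidia has documented):

```c
#include <stdio.h>

/* Toy sketch of the packing guessed at above; the field sizes are my own
 * assumption, not a documented NV30 format. One 128-bit block holds a
 * 2x2 pixel quad at 4x MSAA (16 subpixels) as a 4-entry palette of
 * 24-bit colors plus a 2-bit palette index per subpixel. */
int main(void)
{
    const int subpixels    = 4 * 4;  /* 4 pixels x 4 samples            */
    const int index_bits   = 2;      /* selects one of 4 palette colors */
    const int palette_size = 4;
    const int color_bits   = 24;     /* RGB only, alpha dropped         */

    int flag_bits   = subpixels * index_bits;     /* 16 * 2 = 32 */
    int color_total = palette_size * color_bits;  /* 4 * 24 = 96 */

    printf("flag bits:  %d\n", flag_bits);
    printf("color bits: %d\n", color_total);
    printf("total:      %d (fits in 128 bits: %s)\n",
           flag_bits + color_total,
           flag_bits + color_total <= 128 ? "yes" : "no");
    return 0;
}
```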

Do those calculations make sense?


Uttar
 
Uttar said:
Do those calculations make sense?

No...

Utilization of the memory bus is not anything like you describe (for one thing, NVidia has four 32-bit channels that are supposed to function independently; for another, information sent on the memory bus can't be "interpreted" in any way by the memory. It's just reads and writes of such-and-such word, with no way for flags and such to be interpreted by the memory chips.)
 
antlers4 said:
Uttar said:
Do those calculations make sense?

No...

Utilization of the memory bus is not anything like you describe (for one thing, NVidia has four 32-bit channels that are supposed to function independently; for another, information sent on the memory bus can't be "interpreted" in any way by the memory. It's just reads and writes of such-and-such word, with no way for flags and such to be interpreted by the memory chips.)

Thanks for correcting me.
The first 32x4 argument does seem to make my idea illogical.
However, where did I say flags had to be interpreted by the memory chips? What I meant is that those flags are written into memory, then interpreted by the GPU when it reads the memory back (that's the decompression).

After rethinking what I wrote, it sounds like it makes no sense anyway. What happens when those 16 subpixels don't all use one of the four colors? Such a case isn't handled by that scheme. I'll have to think about it a little more.
BTW, could someone tell me whether writing alpha to the framebuffer is useless?


Uttar
 
Uttar said:
BTW, could someone tell me whether writing alpha to the framebuffer is useless?
Uttar

It can be used for quite a few things. For example, Tenebrae uses it for attenuation of lights by distance when the GF3/4's four textures are all used for other things, by drawing the attenuation into the color buffer alpha and then using source * destination alpha + destination * 1 for the blending.
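
To make that blend concrete, here's a little sketch of the arithmetic (the values are made up for illustration; in OpenGL terms it corresponds to glBlendFunc(GL_DST_ALPHA, GL_ONE)):

```c
#include <stdio.h>

/* Sketch of: result = source * destination_alpha + destination * 1.
 * The destination alpha is assumed to already hold the light attenuation
 * drawn in an earlier pass; the numbers below are purely illustrative. */
typedef struct { float r, g, b, a; } Colour;

static Colour blend_dst_alpha_one(Colour src, Colour dst)
{
    Colour out;
    out.r = src.r * dst.a + dst.r;  /* source scaled by the attenuation */
    out.g = src.g * dst.a + dst.g;  /* stored in destination alpha,     */
    out.b = src.b * dst.a + dst.b;  /* added onto what's already in     */
    out.a = src.a * dst.a + dst.a;  /* the colour buffer                */
    return out;
}

int main(void)
{
    Colour light   = { 1.0f, 0.8f, 0.6f, 1.0f };  /* light contribution       */
    Colour framebf = { 0.2f, 0.2f, 0.2f, 0.5f };  /* dst alpha = attenuation  */
    Colour res = blend_dst_alpha_one(light, framebf);
    printf("blended: %.2f %.2f %.2f\n", res.r, res.g, res.b);
    return 0;
}
```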

Other uses come to mind as well, like drawing polygons front to back, putting the coverage value into the color buffer alpha, and using it to blend further polygons; that was pretty popular at one time and is probably still used for line AA.
 
For many games, writing alpha is useless, but some games require it. For example, some Quake 3 levels will look incorrect if alpha is not written.
 
So let's see.

When the R300 does MSAA, colour compression does take place. Doesn't Z-buffer compression happen as well? I would say Z-buffer compression does work with ATI's MSAA, as it did on nVidia's GeForce 3 IIRC.

Or am I mixing everything up again?
 
3dcgi said:
For many games, writing alpha is useless, but some games require it. For example, some Quake 3 levels will look incorrect if alpha is not written.

So those levels look incorrect in 16bit mode too?
Any examples?
 
Hyp-X said:
So those levels look incorrect in 16bit mode too?
Any examples?

I remember a Quake 3 level with a tower and a fairly open area, but that's it. I'm sure that's not helpful, but I don't know the name of the level. I never thought about 16-bit mode, so I'm not sure how that is handled. Because Quake 3 has its own shader system, every level can be unique in how the lighting and effects are applied. It depends on the preference of the level designer. Sometimes destination alpha is used for lighting/blending, although I'm not sure exactly how it is used.
 
Quake 3 used dest alpha to blend static lightmapped textures with moving unlit textures.

You'd lay the lightmap down first.
Then blend the first texture with the lightmap while writing alpha.
Then the unlit texture is blended using the dest alpha.
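
Roughly, in code form (the exact blend factors vary per shader; the ones below are my own guesses for illustration, not pulled from the game's source):

```c
#include <stdio.h>

/* Rough sketch of the multipass scheme described above, operating on a
 * single colour-buffer pixel. Blend factors are assumptions. */
typedef struct { float r, g, b, a; } Pixel;

int main(void)
{
    Pixel fb;                                    /* colour buffer pixel        */
    Pixel lightmap = { 0.6f, 0.6f, 0.6f, 0.8f }; /* static lightmap, with alpha */
    Pixel base     = { 0.9f, 0.5f, 0.3f, 1.0f }; /* lightmapped wall texture    */
    Pixel moving   = { 0.2f, 0.4f, 1.0f, 1.0f }; /* moving unlit texture        */

    /* Pass 1: lay the lightmap down, writing its alpha into the buffer. */
    fb = lightmap;

    /* Pass 2: modulate the base texture by the lightmap (dst * src),
     * keeping the destination alpha written in pass 1. */
    fb.r *= base.r;  fb.g *= base.g;  fb.b *= base.b;

    /* Pass 3: blend the unlit texture in, weighted by destination alpha
     * (assumed factors: src * dstA + dst * (1 - dstA)). */
    fb.r = moving.r * fb.a + fb.r * (1.0f - fb.a);
    fb.g = moving.g * fb.a + fb.g * (1.0f - fb.a);
    fb.b = moving.b * fb.a + fb.b * (1.0f - fb.a);

    printf("final pixel: %.2f %.2f %.2f\n", fb.r, fb.g, fb.b);
    return 0;
}
```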
 
I agree with Dave.
FX colour compression is active even without AA. nVidia declared this in response to ATI (after the nVidia presentation ATI claimed to have colour compression too, and nVidia replied that theirs works even without AA). Colour compression is the MAIN cause of the high bandwidth efficiency of the GeForce FX.

If you look at the fillrate of an ATI 9700 in single and multi texturing, you will notice that the multitexturing rate is very close to the maximum allowed by 8 pipelines, but the single texturing rate is far away! This is due to the high bandwidth required to handle 8 frame buffer pixels per clock.
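
As a back-of-the-envelope check of that (using commonly quoted 9700 Pro numbers and ignoring compression and caches, so these are rough figures, not measurements):

```c
#include <stdio.h>

/* Rough bandwidth estimate for single texturing at 8 pixels/clock.
 * Assumed per-pixel traffic: 4 bytes colour write + 4 bytes Z read
 * + 4 bytes Z write; compression and caches are ignored. */
int main(void)
{
    double core_mhz   = 325.0;          /* 9700 Pro core clock           */
    double pixels_clk = 8.0;            /* 8 pixel pipelines             */
    double mem_mhz    = 310.0;          /* DDR clock                     */
    double bus_bytes  = 256.0 / 8.0;    /* 256-bit bus = 32 bytes wide   */

    double available = mem_mhz * 1e6 * 2.0 * bus_bytes / 1e9;      /* GB/s */
    double per_pixel = 4.0 + 4.0 + 4.0;                            /* bytes */
    double needed    = core_mhz * 1e6 * pixels_clk * per_pixel / 1e9;

    printf("available bandwidth: %5.1f GB/s\n", available);  /* ~19.8 */
    printf("needed at 8 px/clk:  %5.1f GB/s\n", needed);      /* ~31.2 */
    return 0;
}
```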

Static colour compression cuts the bandwidth the GF FX needs by a factor of four and allows nVidia to get away with a memory bus of only 128 bits.

The real question is.... HOW DOES IT WORK?
 
After reading an LMA II tech brief, I just realized we left out a factor when trying to figure out the compression method:
DDR doubles the effective memory bus width.
Also, that tech brief clearly says the GF4 is able to process 256-bit chunks (128x2; you've got to consider DDR, as I said above) if that's what's optimal. So, antlers4, it sounds like they aren't really independent.
And what's the use of being able to process bigger chunks? Well, compression works better with bigger chunks, so it makes a lot of sense to allow it.

Am I right on that? Now, if I am...

Supposing 256-bit chunks (which can't always happen due to small triangles, but I'd guess that's what nVidia is considering for their 4:1 ratio), how in the world would nVidia get a 4:1 ratio?
With 256 bits, you could write/read eight 32BPP pixels at once.
So to get such a 4:1 ratio, you've got to be able to write/read thirty-two 32BPP pixels at once. Woah, congrats to the nVidia engineers who figured out how to do that in real time...
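
Just to spell the arithmetic out:

```c
#include <stdio.h>

/* How many 32BPP pixels would a 4:1 ratio have to squeeze into one
 * 256-bit chunk? */
int main(void)
{
    int chunk_bits   = 256;
    int pixel_bits   = 32;
    int uncompressed = chunk_bits / pixel_bits;  /* 8 pixels          */
    int compressed   = uncompressed * 4;         /* 32 pixels at 4:1  */
    printf("%d pixels per chunk uncompressed, %d needed for 4:1\n",
           uncompressed, compressed);
    return 0;
}
```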


Uttar
 
Uttar said:
After reading an LMA II tech brief, I just realized we left out a factor when trying to figure out the compression method:
DDR doubles the effective memory bus width.
Also, that tech brief clearly says the GF4 is able to process 256-bit chunks (128x2; you've got to consider DDR, as I said above) if that's what's optimal. So, antlers4, it sounds like they aren't really independent.
Regarding the independent channels:
In the memory controller, there are four memory channels (let's call them A B C D), each connected via 32 data traces to one or more memory chips (which requires a chip select signal; let's call the chips/chip-pairs a b c d).
Additionally, there's an AGP connection.

DDR2 transmits two bits per clock per pin, and has a burst length of four. So each access to memory through one of the 32-bit channels sends a chunk of 32 * 2 * 4 = 256 bits in 4 clock cycles.
Now, each channel being independent means that as channel A is reading the 28th 256bit chunk from chip a, channel B can be writing to the 5th 256bit chunk in memory chip b simultaneously.
Of course you cannot access multiple chunks in chip b at the same time, because each channel is hardwired to one memory chip (pair).

That's why the data is usually spread across the mem chips ('interleave'), so you get an even distribution of accesses per chip. I don't know how graphics chips manage that, as there are different buffers with different access profiles (framebuffer, Z-buffer, textures). I guess that's one of the secrets for efficient bandwidth use.
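
Here's a quick sketch of the interleaving idea (the round-robin mapping is just an illustration, not the actual controller layout):

```c
#include <stdio.h>

/* Consecutive 256-bit (32-byte) chunks are spread round-robin across the
 * four channels A B C D, so that linear accesses keep all channels busy.
 * The mapping below is an illustration of interleaving in general. */
int main(void)
{
    const unsigned chunk_bytes = 32;        /* one 256-bit chunk      */
    const char *channel_name = "ABCD";

    for (unsigned addr = 0; addr < 8 * chunk_bytes; addr += chunk_bytes) {
        unsigned chunk   = addr / chunk_bytes;
        unsigned channel = chunk % 4;       /* round-robin over A..D  */
        printf("address 0x%04X -> chunk %u -> channel %c\n",
               addr, chunk, channel_name[channel]);
    }
    return 0;
}
```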
 
Hmm, yes.
But couldn't you store in a cache that a specific part of memory has GOT to be read in 512-bit chunks? Then you'd have to wait for two memory channels to be free to read the two different spots in memory where the two parts are stored.
It would be useful for compression, no? Of course, such a design would be very complex to implement. But it could have benefits.

Anyway, assuming that's not true for a while...
You're saying that with DDR2, you've got 256-bit chunks if you use 32-bit memory channels?
And with DDR1, you'd have 128-bit chunks? So, DDR2 would have an advantage for compression?
Or are you just saying DDR2 sends 8 chunks of 32 bits at once, but that those 32 bit chunks cannot be related among themselves?

Anyway, I'd guess saying they "cannot" be related is not quite true. I'd guess they can be, but the design then becomes a lot more complex (which means nVidia could very well have done it, seeing how they like to claim their architecture uses its 128-bit bus amazingly well).


Uttar
 
Uttar said:
Hmm, yes.
But couldn't you store in a cache that a specific part of memory has GOT to be read in 512-bit chunks? Then you'd have to wait for two memory channels to be free to read the two different spots in memory where the two parts are stored.
It would be useful for compression, no? Of course, such a design would be very complex to implement. But it could have benefits.
Why would that be complex? If a part of the chip wants to process data in 512 bit chunks, it has to tell the memory controller to read/write two chunks.


Anyway, assuming that's not true for a while...
You're saying that with DDR2, you've got 256-bit chunks if you use 32-bit memory channels?
And with DDR1, you'd have 128-bit chunks? So, DDR2 would have an advantage for compression?
Or are you just saying DDR2 sends 8 chunks of 32 bits at once, but that those 32 bit chunks cannot be related among themselves?
DDR1 can also have a burst length of 4, IIRC.
What do you mean by 'not related'? Those 8*32 bits are of course in sequence.
 
You guys are expecting miracles from nVidia that are unlikely. It isn't possible to do lossless, fixed-ratio compression except in limited circumstances (as in MSAA, when you know all the colors will be identical for any pixel that's not on a polygon edge or intersection). The amount of compression the FX will be able to do without MSAA will be negligible in most real-world situations (it might work out well for cel shading, though).
 
Z-compression is lossless, and I can demonstrate that it is possible to compress a Z-buffer losslessly with a very high compression rate.

Compressing the frame buffer is quite different, but I think that in various rendering phases this may be possible, with or without AA active.
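
For example, here is a minimal sketch of one well-known approach (plane/delta encoding of a screen tile). This is a generic illustration, not necessarily what the R300 or GeForce 3 actually do:

```c
#include <stdio.h>
#include <stdint.h>
#include <stdbool.h>

/* Represent a TILE x TILE block of Z values as a base value plus two
 * constant per-pixel deltas. This is exact (lossless) whenever the tile
 * is covered by a single plane; otherwise the tile is stored raw. */
#define TILE 4

static bool compress_tile(const int32_t z[TILE][TILE],
                          int32_t *base, int32_t *dzdx, int32_t *dzdy)
{
    *base = z[0][0];
    *dzdx = z[0][1] - z[0][0];
    *dzdy = z[1][0] - z[0][0];
    for (int y = 0; y < TILE; y++)
        for (int x = 0; x < TILE; x++)
            if (z[y][x] != *base + x * *dzdx + y * *dzdy)
                return false;   /* not planar: fall back to uncompressed */
    return true;
}

int main(void)
{
    int32_t z[TILE][TILE];
    for (int y = 0; y < TILE; y++)           /* build a planar test tile */
        for (int x = 0; x < TILE; x++)
            z[y][x] = 100000 + 3 * x + 7 * y;

    int32_t base, dzdx, dzdy;
    if (compress_tile(z, &base, &dzdx, &dzdy))
        printf("planar tile: 16 Z values stored as 3 (base=%d dzdx=%d dzdy=%d)\n",
               base, dzdx, dzdy);
    else
        printf("tile stored uncompressed\n");
    return 0;
}
```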
 
Xmas got it right except for one detail.

The minimum burst length of 4 on DDRII is the number of data transfers it does in a burst, not the number of clocks.
So on a 32-bit bus, the minimum block is 32*4 = 128 bits.
And for DDR(I) it's 32*2 = 64 bits (with minimum burst length = 2).
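
Or, spelled out as a trivial check:

```c
#include <stdio.h>

/* Minimum block size = bus width * burst length
 * (the burst length counts data transfers, not clocks). */
int main(void)
{
    int bus_width = 32;                                   /* bits per channel */
    printf("DDR(I), burst 2: %d bits\n", bus_width * 2);  /* 64  */
    printf("DDRII,  burst 4: %d bits\n", bus_width * 4);  /* 128 */
    return 0;
}
```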
 