R600 nugget...

This patent is pretty old. It references another patent which talks about storing 2 samples per pixel in a custom framebuffer (reducing the number of pixels which are incompressible). Finally, the patent makes no mention of quads, or higher precision color values.

Xmas - I don't understand your comment - the coverage mask tells you which samples inside a pixel have a particular color. One way or another, doesn't any lossless compression algorithm have to store information identical to this?
 
Mintmaster said:
Cool. I always thought it was strange 1080p wasn't 60 fps, and in fact didn't believe it at first when someone here told me. Seemed kind of pointless to me.

Yep, it was pointless, which was why people complained and 60Hz was added. After all, a line doubler/scaler can take a 1080i@60 signal and produce a 1080p@30 signal, so there isn't much gain.

I'm looking forward to the 1080 sets coming out. We almost bought the Panny HD2+ set last Christmas, but now that I have an idea of the xHD3 prices I'm glad we didn't. Looks like the pixel shifting (or whatever you call it) is working great for cost reduction in the HD3 and xHD3.

I bought a 4th-gen 5065W Samsung last year, and I'm a little bummed. I had heard the first xHD3 sets would cost well over $10k, but in fact they cost less than my 720p Sammy, AND they have an ATSC tuner built in, AND they have a 10000:1 contrast ratio.

Because TI's DMD is built using CMOS tech, I see no end to their cost advantages. With every process shrink, their yields will go up. This means 1080p will get cheaper and cheaper, OR they could use a tri-core approach, put three DMDs (red/green/blue) on one chip, and eliminate the color wheel. In any case, DLP scales better according to Moore's law than either LCD or PDP (which is the worst in terms of cost scaling). It's been what, 10 years, and they still haven't managed to shrink PDP's dot pitch (pixel size) below 0.6mm!

BTW, you don't need to tell me how DLP chips are made. I'll be studying MEMS at Caltech in the fall :)

Caltech is a good choice. Not too many girls, I hear, though. :) Caltech has more Nobel Prize winners on staff than any other university. MEMS is a great field to study. It's a hard choice between microtech/nanotech and biotech. Both fields have bright futures.

I think PDP is going to die, and the only thing that could possibly replace it is FED (field emission display), which is another semiconductor-process-oriented tech that essentially embeds an electron gun at each pixel. I think it uses carbon nanotubes.

I'm still dripping wet after seeing some LCD+LED HDR displays at CES. Maybe some DLP hybrid will dispose of the Xenon lamp and use an array of LEDs to make HDR DLP displays.
 
Jawed said:
A coverage mask is non-lossy. I don't know why you think it's lossy.

Jawed
I didn't say that a coverage mask itself is lossy. I said that storing coverage masks along with color values seems, to me, better suited to lossy compression algorithms.


psurge said:
This patent is pretty old. It references another patent which talks about storing 2 samples per pixel in a custom framebuffer (reducing the number of pixels which are incompressible). Finally, the patent makes no mention of quads, or higher precision color values.
Quads are an implementation detail of the shader pipeline; they don't influence antialiasing that much. Higher precision color values seem to be a pretty straightforward extension of the algorithm. Notice how the patent often just says N bits of color information.

Xmas - I don't understand your comment - the coverage mask tells you which samples inside a pixel have a particular color. One way or another, doesn't any lossless compression algorithm have to store information identical to this?
You don't need a coverage mask if your compression algorithm only knows two states for pixel data: compressed or uncompressed. In the former case you store only a single color per pixel; in the latter you store all samples sequentially. Additionally, you need a single-bit flag per data block that indicates whether the block is compressed or not.
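To make that concrete, here's a minimal sketch of such a two-state scheme (my own illustration with made-up names, using a 2x2 block at 4xAA; not any particular GPU's actual format):

```c
#include <stddef.h>
#include <stdint.h>

/* One AA sample: 4 bytes colour + 4 bytes Z/stencil. */
typedef struct { uint32_t color; uint32_t z; } Sample;

#define BLOCK_PIXELS 4   /* a 2x2 pixel block */
#define AA_SAMPLES   4   /* 4xAA              */

/* A "compressed" block stores one colour/Z pair per pixel; an
 * "uncompressed" block stores every sample sequentially.        */
static size_t block_bytes(int compressed_flag)
{
    return compressed_flag ? BLOCK_PIXELS * sizeof(Sample)
                           : BLOCK_PIXELS * AA_SAMPLES * sizeof(Sample);
}

/* Expand raw block storage into per-sample form using the flag bit. */
static void read_block(const void *raw, int compressed_flag,
                       Sample out[BLOCK_PIXELS][AA_SAMPLES])
{
    const Sample *src = raw;
    for (int p = 0; p < BLOCK_PIXELS; p++)
        for (int s = 0; s < AA_SAMPLES; s++)
            out[p][s] = compressed_flag ? src[p] : src[p * AA_SAMPLES + s];
}
```

The point is simply that a compressed block occupies a fraction of the space, and the single flag bit is all the decompressor needs to know which layout it is reading.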

It might seem like a good idea to store coverage masks along with the colors, especially for cases where only two polygons cover a pixel. But this adds complexity to the compression/decompression unit; it might not buy you much because of memory interface granularity; you need to make sure the worst case doesn't require more space than no compression at all (apart from the flags); you need more on-chip memory for compression flags; etc.

Embedded memory that includes the ROPs and sample compression greatly simplifies handling of an antialiased framebuffer and the transmission of pixel data to the framebuffer.
 
nAo said:
Joe DeFuria said:
It is interesting to me that given a 65nm process, I would estimate it should be possible to include enough eDRAM on a PC chip to finally cover the "standard" 1600x1200 resolution. This would have similar frame buffer characteristics/abilities to Xenon's 90nm chips at 1280x720 resolution.
At the same time another IHV could use the eDRAM transistor budget to vastly increase their next GPU's 'shading power'... who is going to win? :)
(I'm not being sarcastic, it's a legitimate question..)
Will depend on the benchmark, as usual. However, one benefit of using eDRAM is that it alleviates the pressure on the external memory subsystem. So you could either use that to improve your price/performance, or, for the high-end solutions, to do correspondingly better on any task that is dependent on external memory bandwidth. Also, since you'd design your eDRAM redundantly, it shouldn't be a major problem in terms of yield due to defects. It will still reduce the number of dies per wafer due to pure size, but if you exchanged the eDRAM for logic, I'd suspect that your yield would be somewhat lower.

So how would it balance out? Beats me; it's not my field, the question assumes accurate crystal-balling, and it further raises the question of whether the presence of such an architecture would make certain ways of doing things more attractive than they would otherwise be. Again, it would depend on your benchmark, but my general respect for efficient memory subsystems makes me believe the design would be advantageous as long as you can integrate sufficient amounts of eDRAM. That is, I have no idea as to performance, but it should be cheaper in terms of overall system costs. :)
 
I noticed that while playing HL2/Counter-Strike: Source, much of the aliasing now comes from pixel shaders. I can turn on 16x AF and 6x FSAA, but any surface that has a shader with any high-frequency specularity produces shimmering (let's say, rough floor tiles). You can clearly see that this is due to a lack of shader AA (just move the mouse ever so slightly, and watch how the shaded surfaces "crawl").

There appear to be two solutions: SuperSampling, or using gradient instructions in the shaders. But to implement gradient instructions in the shader, you need neighboring pixel values, and this implies some kind of "lock-step" quad pixel rendering.

So I would say that quads are here to stay, unless you want to give up on antialiasing textures, and not just edges.

I have always been a heavy proponent of a shader-bound future, but any realistic lighting solution is going to need to address ambient/diffuse global illumination, and of course shadows, and these problems are not solved through beefier shaders; they are solved via more fillrate.

So I can't help but wonder if perhaps the way to scale a GPU properly is not fewer ROPs with ALUs piled on high, but "shallow" pipes: 1 ALU unit per pipe, but many, many pipes. So I'd take the R500, keep the eDRAM, but add 32 ROPs and delete 16 ALUs, so we have 32 ROPs, each with 1 ALU. I'd also get rid of unified shading and move vertex processing to the CPU, which is, hopefully, by now a CELL-like or Xbox 360-like CPU with multiple cores and vector units.
 
Xmas - Assume that an AA sample is 8 bytes (4 colour + 4 Z/Stencil) and we're doing 8xAA.

Using a coverage mask:

- The worst case is 8 visible fragments all sharing one pixel. That consumes 72 bytes.

- 7 visible fragments all sharing one pixel consume 63 bytes.

- 6 visible fragments consume 54 bytes.

Without using a coverage mask, 8xAA consumes 64 bytes regardless of the number of visible fragments (except for the special case of 1). You then have to find a compression scheme that will work with very low quantities of input symbols (2 to 8).
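As a quick sanity check of those numbers (a throwaway sketch, assuming each stored fragment costs 8 bytes of colour/Z plus a 1-byte coverage mask at 8xAA):

```c
#include <stdio.h>

int main(void)
{
    const int sample_bytes = 8;   /* 4 colour + 4 Z/stencil       */
    const int mask_bytes   = 1;   /* 8-bit coverage mask for 8xAA */
    const int aa_samples   = 8;

    /* With coverage masks, cost scales with the number of visible fragments. */
    for (int frags = 8; frags >= 6; frags--)
        printf("%d visible fragments, with masks: %d bytes\n",
               frags, frags * (sample_bytes + mask_bytes));

    /* Without masks, every sample is stored explicitly regardless. */
    printf("any fragment count, without masks: %d bytes\n",
           aa_samples * sample_bytes);
    return 0;
}
```

That prints the 72/63/54-byte cases against the flat 64 bytes of straight 8xAA storage.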

In the end, a coverage mask seems like a good solution.

Jawed
 
DemoCoder said:
There appear to be two solutions: SuperSampling, or using gradient instructions in the shaders. But to implement gradient instructions in the shader, you need neighboring pixel values, and this implies some kind of "lock-step" quad pixel rendering.
Even using gradient instructions, wouldn't you still be very prone to aliasing every 2 pixels in either direction due to all pixels in a quad getting the same gradient? Also, it would only help half of the aliased edges, right?

Even if the gradient instruction wasn't "localized" to quads, what exactly do you think is the best way to use it to ameliorate the artifact you mentioned? I haven't read much on this stuff.
 
Jawed said:
Xmas - Assume that an AA sample is 8 bytes (4 colour + 4 Z/Stencil) and we're doing 8xAA.

Using a coverage mask:

- The worst case is 8 visible fragments all sharing one pixel. That consumes 72 bytes.

- 7 visible fragments all sharing one pixel consume 63 bytes.

- 6 visible fragments consume 54 bytes.

Without using a coverage mask, 8xAA consumes 64 bytes regardless of the number of visible fragments (except for the special case of 1). You then have to find a compression scheme that will work with very low quantities of input symbols (2 to 8).

In the end, a coverage mask seems like a good solution.

Jawed
Now consider that your memory interface is partitioned into several 64-bit channels, with your GDDR3 memory modules having a burst length of four. This means the granularity of memory access is 256 bits, or 32 bytes.
So you may want to use a block-based compression algorithm so you don't waste as much bandwidth in the common case of a fully covered pixel. Four pixels with color and Z fit into 32 bytes, i.e. at least 2x2 blocks of pixels seem appropriate. For each block you need to store a flag that indicates its compression, and this flag needs to be in low-latency memory so you know beforehand how many bytes to read when accessing that block.

Let's say you want antialiasing to work up to 1920x1440. With 2x2 blocks and 1 bit per flag, that's 86400 bytes of low-latency memory. Not exactly cheap. If you take 4x4 blocks, this goes down to 21600 bytes. A more complex compression scheme might require more than one flag bit.
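The flag-memory figures are easy to reproduce (a quick sketch, assuming exactly one flag bit per block):

```c
#include <stdio.h>

int main(void)
{
    const int width = 1920, height = 1440;
    const int block_dims[] = { 2, 4 };   /* 2x2 and 4x4 pixel blocks */

    for (int i = 0; i < 2; i++) {
        int  n      = block_dims[i];
        long blocks = (long)(width / n) * (height / n);
        printf("%dx%d blocks: %ld flag bits -> %ld bytes on-chip\n",
               n, n, blocks, blocks / 8);
    }
    return 0;
}
```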

If you don't use indirection and pointers, i.e. no additional AA sample buffer, you need to allocate space for all possible samples in the framebuffer. At 1280x1024 with 8xAA, back- and Z-buffer alone take up 80 MiB. Thus 4xAA is more feasible, which means the savings from a better compression method are smaller.

Assuming this compression method could bring a 4x4 "edge block" down to 192 bytes (rounded to memory access granularity) instead of 512 bytes on average, and edge blocks were 5% of the frame, the better method would require ~11% less framebuffer bandwidth. That is a significant saving, but it also means more complexity on the die and more research required.
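Both figures check out under reasonable assumptions (my sketch, taking 8 bytes per sample and assuming a fully covered 4x4 block costs 128 bytes, i.e. one colour/Z pair per pixel, under either scheme):

```c
#include <stdio.h>

int main(void)
{
    /* Back- and Z-buffer at 1280x1024 with 8xAA, 8 bytes per sample. */
    double fb_bytes = 1280.0 * 1024 * 8 * 8;
    printf("1280x1024, 8xAA: %.1f MiB\n", fb_bytes / (1024 * 1024));

    /* Average bytes per 4x4 block at 4xAA, with 5% of blocks being "edge" blocks. */
    double edge_frac   = 0.05;
    double plain_block = 16 * 8;    /* fully covered block: 128 bytes */
    double simple      = (1 - edge_frac) * plain_block + edge_frac * 512;
    double better      = (1 - edge_frac) * plain_block + edge_frac * 192;
    printf("bandwidth saving: %.1f%%\n", 100.0 * (1.0 - better / simple));
    return 0;
}
```

That gives 80.0 MiB and roughly an 11% saving.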

The more samples you have, the more attractive storing coverage masks becomes. But the bigger the framebuffer becomes, too. That's why I said it seems better suited to lossy algorithms, IMO.

Maybe ATI is actually using it already, I don't know. NVidia doesn't.



btw, you don't need a coverage mask per visible fragment. An index mask that indicates which fragment is visible at a specific sample position saves some space. E.g. when you have two visible fragments, one mask is enough because the other is the inverse.
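A rough illustration of the saving (my own sketch; with 8 sample positions, a shared index mask needs ceil(log2(F)) bits per sample for F visible fragments, versus one full mask per fragment):

```c
#include <math.h>
#include <stdio.h>

int main(void)
{
    const int samples = 8;   /* 8xAA sample positions per pixel */

    for (int frags = 2; frags <= 8; frags *= 2) {
        int per_fragment_bits = frags * samples;   /* one 8-bit mask per fragment */
        int index_mask_bits   = samples * (int)ceil(log2(frags));
        printf("%d visible fragments: %d bits vs %d bits\n",
               frags, per_fragment_bits, index_mask_bits);
    }
    return 0;
}
```

For two fragments that's 8 bits instead of 16, exactly the "one mask is enough" case.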

One problem your approach and the one from the patent have (same for Matrox' Fragment AA) is that polygon intersection edges don't get antialiased because there's only one Z value per visible fragment, not per sample. But I guess that's acceptable.
 
Jawed said:
In R500 the AA samples don't live in the EDRAM. The framebuffer is purely for the frame itself. AA samples are held in local memory until they can be resolved into a completed pixel.

Jawed

Yes, they do.
AA samples live in EDRAM until they can be resolved into a completed pixel in local memory.
If you cannot fit the entire antialiased framebuffer in EDRAM you need to render it in multiple runs (tiles).
 
DemoCoder said:
I think PDP is going to die, and the only thing that could possibly replace it is FED (field emission display), which is another semiconductor-process-oriented tech that essentially embeds an electron gun at each pixel. I think it uses carbon nanotubes.

I'm still dripping wet after seeing some LCD+LED HDR displays at CES. Maybe some DLP hybrid will dispose of the Xenon lamp and use an array of LEDs to make HDR DLP displays.
Interesting topic, though a bit OT. Anyway, I'm drooling for the FED variation called "SED" ( http://www.canon.com/technology/detail/device/sed_display/ ) and for OLED ( http://www.universaldisplay.com/tech.htm ). Production for SED is said to begin at the end of 2005. OLED will take another 2-3 years.

LCD+LED displays are interesting, too. Sony's Qualia 005 high end LCD display already uses this technology. Mighty expensive right now, though.
 
Xmas - the patent describes the use of a linked list to hold AA samples (when using compressed AA sample sets).

You need a linked list because you don't know how many AA samples (in a compressed storage mechanism, using a coverage mask) you'll need for each pixel.

When a new AA sample needs to be stored (a new visible fragment appears that covers at least one AA sample point), the memory address it's stored at is unpredictable. If the pixel currently has three AA samples, then the new AA sample will be stored at the "first memory address available", leaving the three existing AA samples where they are. Each fragment's coverage mask in the linked list is updated (if required), and the list's final fragment has its pointer set to point to the new fragment's AA sample.

An alternative to using a linked list would be to scrap the original memory locations and write the updated AA sample set to a fresh, contiguous piece of memory.

The patent describes the use of a stack to keep track of memory locations that are freed up when AA samples are discarded.

The patent seems to be quite emphatic about the use of a linked list.
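For what it's worth, here's a rough data-structure sketch of that arrangement as I read it (made-up field names and sizes, not the patent's actual layout):

```c
#include <stdint.h>

#define SAMPLE_POOL_SIZE 65535
#define END_OF_LIST      0xFFFF

/* One stored fragment in the AA sample memory: its colour/Z, the mask of
 * sample positions it covers, and a link to the next fragment belonging
 * to the same pixel.                                                     */
typedef struct {
    uint32_t color;
    uint32_t z_stencil;
    uint8_t  coverage;   /* 8xAA coverage mask                  */
    uint16_t next;       /* index of next entry, or END_OF_LIST */
} AAEntry;

/* Stack of freed entry indices, reused when samples are discarded. */
typedef struct {
    uint16_t slot[SAMPLE_POOL_SIZE];
    int      top;
} FreeStack;

/* "First memory address available": pop a freed slot if one exists,
 * otherwise take the next fresh one.                                 */
static uint16_t alloc_entry(FreeStack *fs, uint16_t *next_fresh)
{
    return (fs->top > 0) ? fs->slot[--fs->top] : (*next_fresh)++;
}

static void free_entry(FreeStack *fs, uint16_t idx)
{
    fs->slot[fs->top++] = idx;
}
```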

When a pixel is entirely covered by a fragment, there is no need to store any AA sample data. This is the basis of ATI's compressed AA sample set architecture. When the majority of pixels in a frame don't contain an edge, AA sample compression is a big win.

This is why the patent is entitled:

Method and apparatus for video graphics antialiasing using a single sample frame buffer and associated sample memory

To be quite honest I can't address the memory incoherency (and therefore bandwidth wastage due to granularity) problem that you describe. I can only say that this is the technique ATI has described.

Perhaps some brave soul will create a model representing the conflicting variables in AA:

- number of triangles per frame
- number of pixels per frame
- overdraw
- average triangle size
- degree of AA
- super-sampling versus edge multi-sampling

and evaluate the memory efficiency (bandwidth versus latency versus consumption) of uncompressed versus various compressed AA sample set schemes :D
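As a starting point, a toy version of such a model might look like this (my own sketch; it ignores overdraw, triangle counts, block granularity and the flag storage Xmas mentioned, and just compares per-pixel storage with made-up scene statistics):

```c
#include <stdio.h>

/* Rough framebuffer bytes per frame for three storage schemes:
 * no compression, simple two-state compression, and coverage masks. */
static void estimate(int width, int height, int aa,
                     double edge_frac, double frags_per_edge_pixel)
{
    const double sample_bytes = 8.0;   /* 4 colour + 4 Z */
    double pixels = (double)width * height;

    double uncompressed = pixels * aa * sample_bytes;
    double simple       = pixels * ((1 - edge_frac) * sample_bytes
                                    + edge_frac * aa * sample_bytes);
    double cov_mask     = pixels * ((1 - edge_frac) * sample_bytes
                                    + edge_frac * frags_per_edge_pixel
                                      * (sample_bytes + 1));   /* +1 byte mask */

    printf("%dx%d %dxAA: uncompressed %.1f MiB, simple %.1f MiB, "
           "coverage-mask %.1f MiB\n", width, height, aa,
           uncompressed / 1048576, simple / 1048576, cov_mask / 1048576);
}

int main(void)
{
    estimate(1280, 1024, 4, 0.05, 2.5);   /* invented scene statistics */
    estimate(1600, 1200, 8, 0.05, 3.0);
    return 0;
}
```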

Jawed
 
geo said:
Edit: Maybe the scaler is for 720p to 1080i? Most (all?) of the first-gen HDTVs couldn't do 720p, but did do 1080i. At least our previous CRT Mits was that way.

Duh. Looking at the [H] piece, they aren't supporting 1080p but are supporting 1080i, so that scaler must be for 720p to 1080i, which would increase the number of HDTVs that could use it in HD mode. Given the lifetime of the platform, it is a bit disappointing that they won't be doing 1080p.
 
Jawed, my last post was my thoughts on storing coverage masks in "traditional" architectures without eDRAM. That's why I wrote "If you don't use indirection and pointers". Indirection increases latency significantly, therefore pointers and linked lists may not be feasible if you only have high-latency external memory.



I wonder how close this patent is to other implementations that store AA samples separately, like Matrox' FAA and 3DLabs' SuperScene AA.
 
Latency isn't an issue if you pipeline the blending/filtering process. At the heart of that is some kind of buffer. Additionally, while you're waiting for AA samples to be retrieved, the EDRAM blend/filter module can process the constant stream of brand-new fragments coming out of the GPU. It won't go idle.

The first step in blending/filtering brand-new fragments is to decide whether they need blending or filtering. Only the EDRAM can make that first decision, and it can do so without the AA sample set generated for the pixel by prior fragments, because the pixel in the frame buffer holds colour + Z plus a flag indicating whether an AA sample set needs to be fetched.

So anti-aliasing a new fragment requires at least two passes around the loop centred on the EDRAM blend/filter unit.

The first pass can occur as soon as the new fragment is generated by the GPU.

The second pass (if required) is dependent on the complete AA sample set, which is where the latency of a local memory access comes in. While waiting for the second pass, the EDRAM blend/filter unit is processing other new fragments, or performing the "second pass" on older fragments.
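Purely to illustrate that flow (my reading of it, with made-up names; not ATI's actual hardware), pass 1 looks only at the framebuffer pixel, and a fragment only waits on a local-memory fetch if the AA sample set is actually needed:

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

typedef struct {
    uint32_t color;
    uint32_t z;
    bool     has_aa_samples;  /* an AA sample set for this pixel lives in local memory */
} FBPixel;

typedef struct {
    uint32_t color;
    uint32_t z;
    uint8_t  coverage;        /* 8xAA coverage mask, 0xFF = fully covered */
} Fragment;

/* Pass 1: can this fragment be resolved from the framebuffer pixel alone? */
static bool pass1(FBPixel *p, Fragment f)
{
    if (f.coverage == 0xFF && !p->has_aa_samples) {
        if (f.z < p->z) {     /* assume "less" wins the depth test */
            p->z     = f.z;
            p->color = f.color;
        }
        return true;          /* done; no sample-set fetch needed */
    }
    return false;             /* queue for pass 2 once the sample set arrives */
}

int main(void)
{
    FBPixel  px = { 0x000000, UINT32_MAX, false };
    Fragment f  = { 0xFF0000, 100, 0xFF };

    if (pass1(&px, f))
        printf("resolved in pass 1\n");
    else
        printf("waiting on AA sample fetch for pass 2\n");
    return 0;
}
```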

The prior patent:

Method and apparatus for video graphics antialiasing

provides some alternative scenarios for AA sample compression, but uses basically the same concepts.

Pipelining is the key here. I'm not familiar with Matrox's or 3DLabs's AA techniques so I can't compare :(

Jawed
 
Joe DeFuria said:
It is interesting to me that given a 65nm process, I would estimate it should be possible to include enough eDRAM on a PC chip to finally cover the "standard" 1600x1200 resolution. This would have similar frame buffer characteristics/abilities to Xenon's 90nm chips at 1280x720 resolution.

For the time being I'd speculate that we might see WGF2.0 GPUs either in late 2006 or early 2007. Will 65nm be an option by then?

Assuming it will be, chip complexity will rise again to new record heights, and adding at least 32MB of eDRAM on top of that sounds like an extremely high transistor-count estimate to me.

There are no "standard" resolutions, especially going forward, as they continuously scale. I'd expect quite affordable 24" or bigger displays by 2007, and with a bit of luck SED displays will find their way into the TV market first.

Finally, I'd expect and want next-generation GPUs to be able to combine antialiasing with float HDR, and not just at some mediocre resolution. Neither 10 nor 32MB of eDRAM sounds like enough for such tasks. And since no-one would expect me to break my past track record, there are also other solutions for increasing bandwidth if needed in the future. Feel free to shoot me :LOL:
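To put some rough numbers behind that (my own back-of-the-envelope sketch, assuming FP16 RGBA at 8 bytes per sample plus 4 bytes of Z/stencil, with every AA sample stored):

```c
#include <stdio.h>

int main(void)
{
    const double bytes_per_sample = 8.0 + 4.0;   /* FP16 RGBA + 32-bit Z/stencil */
    const int res[][2] = { {1280, 720}, {1600, 1200} };
    const int aa[]     = { 1, 2, 4 };

    for (int r = 0; r < 2; r++)
        for (int a = 0; a < 3; a++) {
            double mib = res[r][0] * res[r][1] * aa[a]
                         * bytes_per_sample / (1024 * 1024);
            printf("%dx%d, %dxAA float HDR: %.1f MiB\n",
                   res[r][0], res[r][1], aa[a], mib);
        }
    return 0;
}
```

Even 1280x720 without AA already lands above 10 MiB under those assumptions, and 1600x1200 with any AA blows well past 32MB.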
 
Well, fitting the entire framebuffer in eDRAM isn't the only way to use it to gain efficiency. One could use it like modern CPUs use cache, for instance.
 
Yeah, but if you don't tile, then you find yourself constantly flushing and reloading the cache, so you end up memory bound anyway. At the very least, you've got to be able to keep the z-buffer in memory, but if someone enables FB blending, and you're not tiling, I don't see what keeping the FB in eDRAM buys you.
 
Chalnoth said:
Well, fitting the entire framebuffer in eDRAM isn't the only way to use it to gain efficiency. One could use it like modern CPUs use cache, for instance.

The difference being that, in such a case, the amount of embedded RAM would be relatively small, and thus the hardware cost equally small. And I'm not entirely sure it's really such a necessity after all, with virtual memory showing up in the next API.

Look at DemoCoder's answer; my last sentence in my former reply was targeting more or less something similar. There was a presentation from Kirk (if my memory doesn't betray me), where in one section there was a question of how to gain higher amounts of bandwidth in the future; one option was eDRAM and the other TBDR. Both sound equally unlikely at this stage. ;)
 