Some comments on 3DLabs SuperScene AA
It's good to see at least one 3D hardware vendor using a sparse matrix sampling pattern, even if it is in their professional product. I hope they use it in their consumer P10 as well. It provides excellent quality AA at all edge angles, including near horizontal and vertical edges.
Their A buffer technique could easily be extended to perform the same kind of optimizations as Matrox's FAA while keeping all their current AA quality. These optimizations could increase their AA performance considerably.
Dave Baumann
17-May-2002, 13:20
It's good to see at least one 3D hardware vendor using a sparse matrix sampling pattern, even if it is in their professional product.
If you mean the 16x16 area grid that up to 16 samples can be filled out on, then I believe P10 will use something similar, although not (initially) to the same number of samples.
I'll be doing a P10 tech preview next, so hopefully stuff like this will wash out in that. Although it will be based off the same presentations other sites have already published, I was able to chat to them on the phone and get all the questions that were brought up by them answered, so hopefully ours will go a little deeper.
Dave,
that is great news and I cannot wait. Did they give you a ballpark time frame of when you might get a real P10 to test?
Section 2.2 of the Z3 paper gives a good explanation of using a sparse matrix sampling pattern (they use the term sparse supersampling to describe it).
http://research.compaq.com/wrl/people/jouppi/Z3/Z3paper.pdf
Dave Baumann
17-May-2002, 13:54
Yeah, they are doing the same in P10.
Hmmm, now that's what I call AA. Anyone care to make an educated guess as to the estimated street price of a P10 based consumer card? ;)
Jerry Cornelius
18-May-2002, 14:17
This looks effectively very much like Matrox's FAA with a different name to me. If they are doing multisampling as the article states, then the only difference is their implementation. Seems to me it's just a committed fragment buffer with better sampling. Matrox doesn't say how they manage their fragment lists, but it's probably pretty similar to the "slots" referred to in the article.
Tagrineth
18-May-2002, 19:56
Actually, because 3Dlabs does real multisampling rather than detecting fragments then multisampling them, 3Dlabs's method is more compatible.
We don't know what exactly 3Dlabs does, but calling it multisampling sure as hell does not cover it :)
SA, it looks like SuperScene already has some of the same optimizations FAA does (see their comments on dynamic sample allocation). They claim that memory usage for a typical scene at 16x is 3-4x the framebuffer size.
speculation :
They say they can take 2,4,8, or 16 samples per pixel, so perhaps they have one buffer with 2 samples linked to a 16 sample buffer for those pixels which need it.
It's basically the same as FAA, except it assumes that most pixels will require at least some AA, just not the full 16 samples. If this assumption is true then a 2x + 16x buffer will take less memory than a 1x + 16x buffer.
or...maybe i'm just misunderstanding the FAA optimizations you're referring to...
Jerry Cornelius
18-May-2002, 21:05
Actually, because 3Dlabs does real multisampling rather than detecting fragments then multisampling them, 3Dlabs's method is more compatible.
Ya, it works on more than edges, or does it? They would have to include some slope information in the "slot" in order to AA polygon intersections. Maybe that's mfa's missing bytes.
There actually seems to be quite a bit of difference based on the information they have talked about so far.
Both Matrox and 3Dlabs seem to be using modified A buffer techniques based on their descriptions. The simple A buffer works as follows:
A set of bit masks is stored in each pixel, with one bit per sample. So for 16x AA you would need 16 bits or a 2 byte mask, and for 32x AA you would need a 4 byte mask. Each bit mask is associated with a color, z, stencil, mask, and next level pointer. A set of these 5 values is called a level or slot. There are one or more levels/slots per pixel.
The A buffer starts with one level in each pixel and adds more to a particular pixel as needed. The levels are sorted front to back by their z values. Initially all masks are 0; when a sample inside a pixel gets an opaque value set for it, the corresponding mask bit is set to one. Masks in levels behind are ANDed with the NOT of the masks in front of them, turning off their sample bits for any opaque values in front of them.
After the entire screen has been rendered, the pixels are traversed and processed. The levels for a pixel are processed front to back as follows: the color in the level is multiplied by the number of bits set to one in its mask, divided by the total number of bits in the mask. This weighted color is accumulated for the pixel until all levels for the pixel are processed. The final accumulated color is stored in the frame buffer.
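As a rough illustration of the simple A buffer described above, here is a minimal Python sketch for a single pixel. The names (Level, insert_fragment, resolve) are made up for this example, stencil and the next level pointer are left out, and dropping levels whose mask becomes empty is an assumption of the sketch rather than part of the description.

```python
from dataclasses import dataclass

SAMPLES = 16                      # 16x AA -> one 16-bit mask per level
FULL_MASK = (1 << SAMPLES) - 1


@dataclass
class Level:
    z: float        # depth of the fragment (smaller = closer)
    color: tuple    # (r, g, b) as floats
    mask: int       # one bit per sample, 1 = this level owns the sample


def insert_fragment(levels, z, color, mask):
    """Add an opaque fragment to a pixel's level list (hypothetical helper)."""
    levels.append(Level(z, color, mask & FULL_MASK))
    levels.sort(key=lambda lv: lv.z)          # keep levels front to back
    covered = 0
    kept = []
    for lv in levels:
        lv.mask &= ~covered                   # AND with NOT of masks in front
        covered |= lv.mask
        if lv.mask:                           # assumption: drop empty levels
            kept.append(lv)
    levels[:] = kept


def resolve(levels):
    """Accumulate each level's color weighted by its share of the samples."""
    total = [0.0, 0.0, 0.0]
    for lv in levels:
        weight = bin(lv.mask).count("1") / SAMPLES
        for i in range(3):
            total[i] += lv.color[i] * weight
    return tuple(total)


pixel = []
insert_fragment(pixel, 5.0, (0.0, 0.0, 1.0), FULL_MASK)   # blue background
insert_fragment(pixel, 1.0, (1.0, 0.0, 0.0), 0x00FF)      # red edge, 8 samples
final_color = resolve(pixel)
```

Here the red fragment owns 8 of the 16 samples and the blue background keeps the other 8, so the resolved pixel is an even blend of the two.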
Problems with the Simple A buffer
1. Implicit edges (caused by interpenetrating triangles) are not anti-aliased at all.
2. AA storage requirements are unpredictable and scene dependent. You either allocate far more storage than you typically need and hope you don't exceed it, or you run out of storage in the middle of a render causing aliasing in the remainder of the image.
3. The sampling pattern is typically ordered grid. 16x ordered grid AA only produces 4 different color gradations instead of 16 at edges that are near horizontal or near vertical. This is because the 4 samples in a row (for near horizontal edges) or 4 samples in a column (for near vertical edges) all turn on or off at the same time giving only 4 effective samples instead of 16. As the angle moves from horizontal to an intermediate angle, the number of effective samples slowly improves from 4, to 5, and eventually to 16. So you get much less than 16 effective samples for many if not most edge angles and as few as 4 effective samples in many common cases while doing the work and using storage for all 16 samples.
4. Every pixel in the entire scene must be post processed, even though only a few usually need it.
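Problem 3 is easy to check numerically. The sketch below sweeps an axis-aligned edge across one pixel and counts how many different non-zero coverage values 16 samples can produce. The ordered grid is the 4x4 case described above; the second pattern places one sample in each row and column of a 16x16 grid using an arbitrary permutation picked for this example (it is not claimed to be the pattern any of the hardware discussed here uses).

```python
def coverage_levels(samples, axis, steps=1000):
    """Count distinct non-zero coverage values as an edge sweeps the pixel.

    axis=1 sweeps a horizontal edge (compares y), axis=0 a vertical one.
    """
    seen = set()
    for k in range(steps + 1):
        edge = k / steps
        seen.add(sum(1 for s in samples if s[axis] < edge))
    seen.discard(0)
    return len(seen)


# 4x4 ordered grid: samples at the centers of 16 equal cells.
ordered = [((i + 0.5) / 4, (j + 0.5) / 4) for i in range(4) for j in range(4)]

# Hypothetical sparse pattern: one sample per row and column of a 16x16 grid.
PERM = [5, 12, 2, 9, 15, 0, 7, 11, 3, 14, 8, 1, 13, 6, 10, 4]
sparse = [((col + 0.5) / 16, (row + 0.5) / 16) for row, col in enumerate(PERM)]

ordered_h = coverage_levels(ordered, axis=1)   # horizontal edge, ordered grid
sparse_h = coverage_levels(sparse, axis=1)     # horizontal edge, sparse pattern
sparse_v = coverage_levels(sparse, axis=0)     # vertical edge, sparse pattern
```

Under these assumptions the ordered grid gives 4 coverage levels for an axis-aligned edge, while the one-per-row-and-column pattern gives 16 for both the horizontal and the vertical edge.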
Solutions:
Z3
Z3 was developed specifically as a 3d hardware AA method based on the A buffer that solves the first three problems mentioned above (but not the fourth).
1. Z3 stores the two slopes dz/dx and dz/dy (of the triangle plane) as well as the center z value in each level/slot. This fully represents the underlying triangle that was sampled. This allows fragments to be intersected within a pixel so that implicit edges are antialiased. The intersection evaluation is done with very low precision parallel hardware multipliers as the pixel is processed.
2. Z3 uses a fixed number of preallocated levels (usually 3) and no level to level pointer. Whenever the number of levels exceeds the preallocation, levels are merged. Experiments showed that this was extremely effective, since multiple levels were mostly caused by different triangles belonging to the same surface. The levels are merged in a way that the same surfaces are merged into the same level. Having more than 3 completely different surfaces in the same pixel is very rare, and even when it does occur the front-most 3 surfaces usually dominate the color of the pixel.
3. Z3 uses a sparsely sampled grid pattern rather than an ordered grid.
By creating a denser grid than you have samples, such that the number of sample points is less than or equal to both the width and the height of the grid, you get 16 effective sample points for all edge angles for 16x AA. This greatly improves the AA effectiveness for a given number of samples and amount of work. That is, for 16x AA, instead of a 4x4 grid you use at least a 16x16 grid and sparsely sample that denser grid with a special random pattern. As a result you get 16 effective samples at all edge angles, even those that are nearly horizontal or nearly vertical. The above sampling pattern is known as sparse supersampling. Since it is just a sampling pattern, it has nothing to do with supersampling per se. I (rather inappropriately) have been referring to it as a sparse matrix pattern. Of course it has nothing to do with matrices either. A more appropriate term would be a sparsely sampled grid pattern.
Ideally you would use different patterns from a set of say 16 patterns so that nearby pixels would not use the same pattern. This reduces pixel to pixel aliasing artifacts.
4. Not solved. Z3 post processes every pixel in the scene.
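To illustrate why the z slopes in item 1 matter for implicit edges, here is a small sketch that evaluates two fragments' planes at each sample position and decides which one is in front. Z3 does this with low precision hardware multipliers; the floating point version below only shows the idea, and the 4-sample layout and function names are invented for the example.

```python
# Hypothetical 4-sample layout, offsets measured from the pixel center.
OFFSETS = [(-0.25, -0.25), (0.25, -0.25), (-0.25, 0.25), (0.25, 0.25)]


def sample_z(frag, dx, dy):
    """Evaluate a fragment's plane at an offset from the pixel center."""
    zc, dzdx, dzdy = frag
    return zc + dzdx * dx + dzdy * dy


def front_mask(frag_a, frag_b):
    """Bit i is set where frag_a is strictly closer than frag_b at sample i."""
    mask = 0
    for i, (dx, dy) in enumerate(OFFSETS):
        if sample_z(frag_a, dx, dy) < sample_z(frag_b, dx, dy):
            mask |= 1 << i
    return mask


# Two planes with the same center z that cross inside the pixel.
a = (0.5, 1.0, 0.0)     # (center z, dz/dx, dz/dy)
b = (0.5, -1.0, 0.0)
with_slopes = front_mask(a, b)

# Same fragments with the slopes thrown away: a single z per fragment.
flat = front_mask((0.5, 0.0, 0.0), (0.5, 0.0, 0.0))
```

With slopes, the two fragments split the samples between them. With a single z per fragment the depth test cannot tell them apart anywhere in the pixel, so the implicit edge goes unresolved.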
3Dlabs SuperScene
This seems to be a modification of the simple A buffer algorithm. There seem to be 2 major modifications to solve problems 1 and 3. Problems 2 and 4 seem to be unsolved.
1. SuperScene seems to be using some modification that allows them to detect implicit edges. This probably accounts for the extra storage it needs per level, though the mechanism hasn't been discussed.
2. This problem seems to be unsolved. It seems to rely on preallocating 2 slots/levels and estimating the amount of dynamic storage and hopefully having enough.
3. SuperScene seems to be using a sparsely sampled grid pattern much like Z3. They should therefore get excellent quality AA for all edge angles.
4. Does not seem to be solved. SuperScene seems to need to post process every pixel in the scene.
Overall, SuperScene is a very effective method of AA, although I prefer the lower storage requirements of Z3's z slopes and fixed number of merged slots/levels. I am glad to see them use a sparsely sampled grid as this provides very effective AA for a given number of samples.
Solving problem 4 by segregating the AAed pixels from the non-AAed pixels and post processing only the AAed pixels, something like FAA, would improve its performance considerably.
Matrox FAA
This seems to be another modification of the A buffer. In this case they solved problem 4 only; problems 1, 2, and 3 seem to be unsolved. In addition, stenciling seems to be an issue.
1. Seems to be unsolved. FAA seems to use a single z value per level/slot. This means that implicit edges will not be anti-aliased.
2. FAA seems to allocate the extra storage it needs dynamically. It does not seem to bound the number of levels and merge them as Z3 does, so they must rely on preallocating more than enough, and hope they don't run out.
3. FAA seems to be using an ordered grid pattern or some similar variation. This substantially reduces the number of effective samples especially for near horizontal or near vertical edges.
4. By separating the AAed pixels from the non-AAed pixels, they only need to post process the AAed pixels. This greatly improves performance over most other AA techniques.
FAA has really taken a leap in AA performance by solving problem 4. Unfortunately, they also need to address problems 1 through 3 to make it a complete AA solution. They also need to correctly save and process the stencil values.
SuperScene "solves" 4 by updating the AAed pixels every time a new poly is added to the pixel. It will be faster than a post filter on all pixels, as long as (AAed pixels)*(average number of polys in AAed pixel) is smaller than the total number of pixels.
Processing the total color for all the levels for each pixel update could be very expensive, since you will be accessing all the level colors and computing and saving the total color many times instead of once for each AAed pixel, only to throw all that computation away with the next pixel update.
To calculate the cost of processing the total color as you go, you must include extra processing due to depth complexity, since some triangles are likely to be processed out of order. In addition, the cost of processing as you go grows with each fragment added. The first update would cost 1 fragment, the next update would cost 2 fragments, the next update 3, and so on, for a total of (1+2+3+...+n) = n*(n+1)/2, which grows with the square of the number of fragments in the pixel. Also, you must transfer the computed colors from the front slots to the actual frame buffer for all the pixels, since they all use slots; this costs TotalPixels.
Thus the cost to process as you go is:
(AAed pixels)*(n*(n+1)/2) + TotalPixels
where n is the average number of updates to an AAed pixel, including those due to overwritten depth complexity. If n were, say, 6 pixel updates (say 3 due to dc, and 3 kept), it would cost:
21*(AAed pixels)+TotalPixels --- to process them as you go.
if AAed Pixels = 10% of total pixels:
= 3*TotalPixels -- to process them as you go
That is why it makes sense to post process.
In addition, you would need 43*1280*1024 = ~54 meg of extra frame buffer space for the 2 preallocated slots, which is why you don't want to preallocate storage.
If SuperScene post processed the slots, then the cost would be m*AAedPixels for the color computation, where m is the final average number of fragments kept per pixel. The cost for moving them to the frame buffer is TotalPixels, so the total cost in this case is:
m*AAedPixels+TotalPixels
For the example above the cost is:
3*AAedPixels+TotalPixels
= 1.3*TotalPixels --- the cost with post processing all pixels
I have got to think therefore that SuperScene AA is post processing the pixels.
The problem is allocating slots and post processing them for all the pixels.
By segregating the AAed pixels from the non-AAed pixels you would only need to post process the AAed pixels. You only access the level colors and compute the total color once and only for the AAed pixels. The colors for the non-AAed pixels would already be stored in the frame buffer, much like FAA.
In this case it would only cost m*AAed pixels for the color computation. The cost for moving them to the frame buffer is AAed pixels so the total cost in this case is:
Segregating AAed cost = (m+1)*AAedPixels where m<n above
In the example above m=3 so the cost with segregated AAed pixels is:
4*AAed pixels
=~0.4 total pixels --- for segregated AAed pixels
or about 7 times less processing and memory bandwidth than processing as you go, and 3 times less than post processing all the pixels.
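The three cost formulas above can be put side by side in a few lines of Python. All costs are expressed in units of TotalPixels; the function names are made up for this sketch, and the inputs (10% AAed pixels, n = 6 updates, m = 3 kept fragments) are the example values from the text.

```python
def cost_as_you_go(aa_fraction, n):
    """Refilter on every update: (AAed pixels)*n*(n+1)/2 + TotalPixels."""
    return aa_fraction * n * (n + 1) / 2 + 1.0


def cost_post_all(aa_fraction, m):
    """Post process every pixel: m*(AAed pixels) + TotalPixels."""
    return m * aa_fraction + 1.0


def cost_segregated(aa_fraction, m):
    """Post process only the AAed pixels: (m + 1)*(AAed pixels)."""
    return (m + 1) * aa_fraction


AA_FRACTION = 0.10   # share of pixels that need AA
N_UPDATES = 6        # average updates per AAed pixel, including overdraw
M_KEPT = 3           # average fragments kept per AAed pixel

as_you_go = cost_as_you_go(AA_FRACTION, N_UPDATES)
post_all = cost_post_all(AA_FRACTION, M_KEPT)
segregated = cost_segregated(AA_FRACTION, M_KEPT)

# How much cheaper the segregated scheme is than the other two.
ratio_vs_as_you_go = as_you_go / segregated
ratio_vs_post_all = post_all / segregated
```

Changing AA_FRACTION, N_UPDATES or M_KEPT shows how sensitive the comparison is to the assumed scene statistics.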
In addition, slots are created only for AAed pixels, beyond the standard frame buffer and depth buffer. In this case 3*22*1280*1024*10% = ~8.3 meg of slot storage. Colors, z values, stencils, etc. are stored in their final locations in the frame and depth buffers for non-AAed pixels.
You might notice that the cost for SuperScene with post processing all the pixels is only 1.3*TotalPixels for 16x AA in the example, which is in fact quite remarkable. One problem is the 54 meg of preallocated storage, though.
However, by segregating the AAed pixels and storing the rest in the final frame and depth buffers, they could reduce the cost to only 0.4*TotalPixels for 16x AA and use only ~8.3 meg of additional storage in the example.
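The storage figures can be recomputed the same way. The byte counts per pixel (43 for the two preallocated slots, 22 per slot for the segregated case) are taken from the example above; treating "meg" as 2**20 bytes is an assumption of this sketch, as is the helper name.

```python
WIDTH, HEIGHT = 1280, 1024
MIB = 1024 * 1024    # assumption: "meg" means 2**20 bytes


def slot_storage_meg(bytes_per_pixel, pixel_fraction=1.0):
    """Slot storage in meg for the given share of the screen's pixels."""
    return bytes_per_pixel * WIDTH * HEIGHT * pixel_fraction / MIB


# Two preallocated slots for every pixel on screen.
preallocated_meg = slot_storage_meg(43)

# Three 22-byte slots, but only for the 10% of pixels that need AA.
segregated_meg = slot_storage_meg(3 * 22, pixel_fraction=0.10)
```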
I did put the quotation marks around "solved" for a reason. :)
I'm not saying that SuperScene is The Ohh Soo Great Algorithm, I just said what they do. Or at least what I interpret this as: As each pixel is rendered, a final color is produced based on the average of all samples taken. The final color is written in the appropriate image buffer as each multisample pixel is changed.
I agree that I was too optimistic in my calc, but you were on the other hand too pessimistic. Overdraw should be accounted for, but not like that. I can't see any reason for that formula to be true for any case except if all 'n' polys are visible at the final step. When a fragment's mask is empty, that fragment should be removed, and if there's just one fragment in the pixel I assume that it's given a "non-fragmented" treatment.
And I assumed that the cards do give the non-AAed pixels a special treatment, storing the color directly (and only) in the back buffer, because otherwise the on-the-fly filtering would lose a big benefit. I also assumed that the filtered pixel is stored in the back buffer.
This means that there is no 'TotalPixels' in the formula for filtering cost. And since invisible fragments shouldn't be in the filtering, the other part should also look different.
Let's take your example. Assume that the 6 pixel updates come from polys # 1...6, where 1 is closest and 6 furthest away. Assume that polys 1...3 are visible in the final pixel. Then depending on render order, the cost would vary:
Worst case: order: 126543 => cost = 1+2+3+3+3+3 = 15
front2back: order: 123456 => cost = 1+2+3+0+0+0 = 6
back2front: order: 654321 => cost = 1+1+1+1+2+3 = 9
But if non-AAed pixels are treated differently, the 1's shouldn't count since it's nothing more than what any IMR would have to do. This would give the costs 14/5/5.
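These numbers can be reproduced with a small simulation. It assumes one particular coverage layout that is not spelled out above: polys 3 to 6 each cover the whole pixel, poly 2 covers half of it and poly 1 a quarter, so that polys 1 to 3 end up visible. The cost of an update is taken as the number of fragments left in the pixel after it, or 0 if the incoming fragment is completely hidden. All names are hypothetical.

```python
FULL = 0xFFFF
# Poly id doubles as depth: 1 is closest, 6 is furthest away.
MASKS = {1: 0x000F, 2: 0x00FF, 3: FULL, 4: FULL, 5: FULL, 6: FULL}


def update_costs(order, masks=MASKS):
    """Return the filtering cost of each update for the given render order."""
    frags = {}                     # poly id -> sample mask still visible
    costs = []
    for poly in order:
        visible = masks[poly]
        for other, mask in frags.items():
            if other < poly:       # closer fragments hide the incoming one
                visible &= ~mask
        if visible == 0:
            costs.append(0)        # fully hidden: nothing to refilter
            continue
        for other in list(frags):
            if other > poly:       # incoming fragment hides those behind it
                frags[other] &= ~visible
                if frags[other] == 0:
                    del frags[other]
        frags[poly] = visible
        costs.append(len(frags))
    return costs


worst = update_costs([1, 2, 6, 5, 4, 3])
front_to_back = update_costs([1, 2, 3, 4, 5, 6])
back_to_front = update_costs([6, 5, 4, 3, 2, 1])


def multi_fragment_cost(costs):
    """Ignore updates that leave a single fragment (plain non-AAed writes)."""
    return sum(c for c in costs if c > 1)
```

Summing each list gives the totals for the three orders, and multi_fragment_cost gives the totals once the single-fragment updates are left out.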
Then you do a calculation based on the assumption that 10% of the pixels have 3 kept fragments. Isn't that a bit much? From 3Dlabs and Matrox it seems that 10% of the pixels having 2 kept fragments is more likely. And pixels with more fragments are much rarer.
Notice also that I talked about the cost for the filtering calculation; the bandwidth cost doesn't grow like this. By filtering on-the-fly you don't need to read the fragments to do the filtering, since the filtering is done when you've just been working on the pixel anyway. With post processing you have to read all fragments, including the non-AAed pixels.
The result is that it should indeed be competitive with post processing, even in your example. But it does of course not reach the nice numbers of segregated AAed pixels. One should however remember that there's a benefit in the regular memory access you get from preallocated slots, and that's not accounted for here.
Good catch on the quote about SuperScene updating the frame buffer with each pixel update.
I agree that the calculations above are oversimplified on the pessimistic side, especially concerning the overdraw.
If they allocate memory in a way that accesses all the slot information on every update, and it seems they do, then using the entire slot information, including the level colors, makes good sense.
TotalPixels would indeed disappear from the compute-as-you-go scenario if you stored the calculated total color directly in the final frame buffer, which I am sure they do. As a result, I agree the compute-as-you-go approach competes well with post processing all the pixels.
As you mention though, segregating the AAed pixels and post processing them should still give SuperScene AA a significant performance benefit. To make it worthwhile with regard to memory bandwidth though, you would need to separate the slot color from the rest of the slot information by storing the slot color in a strided fashion relative to the rest of the slot memory. One important benefit is that segregating AAed pixels would also reduce their total memory requirements substantially, since it effectively uses the frame and depth buffer as the first slot for non-AAed pixels.
Mintmaster
19-May-2002, 15:54
I just want to say that this B3D article really opened my eyes to how important jittered sample locations are. A far cleaner edge is obtained this way. It's a shame that the Radeon 8500 doesn't implement jittering correctly (or at least from what I hear). If they had 3x jittered supersampling in the following pattern:
_______
|......x....|
|.........x.|
|.x.........|
"""""""""'
They would probably get a better effect than all of NVidia's 4x methods even on edges, without any MSAA drawbacks.
I'm not a big fan of multisampling or FAA. These will be very short term solutions, as more complex pixel shaders (especially those making use of compare functions) will necessitate more sampling within a polygon.
I don't see how you can really avoid supersampling. Otherwise you'll just be doing edge antialiasing, which only fixes a small part of the screen. Some procedural textures (like marble), reflective bumpy surfaces, and many other things just can't be fixed by these "hack" forms of antialiasing. Also, anisotropic filtering can screw some of these effects up within the interior of a surface.
I think including multisampling in the category of FSAA is wrong. It's really edge and intersection antialiasing, and intersections are not very common AFAIK. Matrox was right to call their method FAA instead of making the consumer think it's just very efficient at FSAA, as NVidia did.
Bambers
19-May-2002, 16:56
I just want to say that this B3D article really opened my eyes to how important jittered sample locations are. A far cleaner edge is obtained this way. It's a shame that the Radeon 8500 doesn't implement jittering correctly (or at least from what I hear). If they had 3x jittered supersampling in the following pattern:
_______
|......x....|
|.........x.|
|.x.........|
"""""""""'
They would probably get a better effect than all of NVidia's 4x methods even on edges, without any MSAA drawbacks.
I was hoping for that when I got the card back in November. Bit of a disappointment to find it was only a 3x vertical stretch. The 800x600 max res limit is also odd.
I might have missed this, but what are the memory requirements for SuperScene? I.e. how much memory must be allocated to AA for 16x at 1280x1024 resolution? This obviously takes away from texture memory and might require the card to have a significant amount of memory.
Dave Baumann
23-May-2002, 08:55
This obviously takes away from texture memory and might require the card to have a significant amount of memory.
It's not a unified frame buffer, so SuperScene does not impact texture space. Wildcat III 6110 has 64MB of framebuffer and I think 1280x1024 was the highest res at which SS could be enabled.