So what do you think about S3's DeltaChrome DX9 chip?

You're right that a true fork would cost a lot of state: about N times as much as without the fork (N = number of samples). But it could be seen more as a "virtual" fork. The state at the fork position is stored, and then the SSAA part is run iteratively to the end, one subpixel at a time.
But that's of course just a subset of your idea.
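A minimal C sketch of that "virtual" fork (all names and the register-file size are mine, purely illustrative): the shared MSAA part runs once, its live registers are snapshotted at the fork, and the SSAA tail is replayed serially per subpixel from that snapshot, so only one extra copy of the state is ever stored instead of N.

Code:
typedef struct { float r[8]; } RegFile;          /* live temp registers (size assumed) */

void run_shared_part(RegFile *st);               /* hypothetical: the MSAA portion */
void run_ssaa_tail(RegFile *st, int sample);     /* hypothetical: per-subpixel tail */

void run_pixel(int num_samples)
{
    RegFile state;
    run_shared_part(&state);          /* common part runs once */

    RegFile snapshot = state;         /* the single saved copy at the fork */
    for (int s = 0; s < num_samples; ++s)
    {
        state = snapshot;             /* restore instead of keeping N live forks */
        run_ssaa_tail(&state, s);     /* run the SSAA part to the end, one sample at a time */
    }
}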

You don't even need dynamic loops to do it your way. The number of subsamples should be fixed at render time. It could be accessed through one new PS instruction: SET_SAMPLE Rn, where Rn holds an integer that tells which sample is currently accessed. Let it automatically add offsets to iterators, and change which framebuffer pixel is read/written. If Rn < 0, switch back to MSAA mode (you'd need to do the downfiltering yourself).
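As a rough C model of what SET_SAMPLE Rn could mean in hardware (the struct layout and names are mine; framebuffer redirection is omitted for brevity): the instruction just selects the active sample, and every iterator read is then offset through it.

Code:
typedef struct {
    float iterated[4];        /* iterator value at the pixel center */
    float offset[16][4];      /* per-sample iterator deltas, fixed at render time */
    int   cur_sample;         /* < 0 means MSAA mode */
} PixelCtx;

/* what "SET_SAMPLE Rn" might do: select the active sample */
void set_sample(PixelCtx *px, int rn)
{
    px->cur_sample = rn;      /* rn < 0: back to MSAA, shader downfilters itself */
}

/* every iterator read then goes through the current sample's offset */
float read_iterator(const PixelCtx *px, int ch)
{
    if (px->cur_sample < 0)
        return px->iterated[ch];
    return px->iterated[ch] + px->offset[px->cur_sample][ch];
}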

Actually, changing the sample pattern within the PS isn't likely to be easy (at least for geometry sampling), so that's probably best left constant. But even with the method above, you could do it by loading an iterator into a temp reg before you do SET_SAMPLE, and then adding your own constants as shifts to change texture sample positions. And in fact, if you just want to supersample textures (or texture + some nonlinear filtering), and then blend the result to one value and use it for MSAA, that's possible today.
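The "possible today" case might look like this in C-style shader pseudocode (the tap count, the offsets, and tex2D are stand-ins): a few texture fetches at manually shifted coordinates are blended to one color, which ordinary MSAA then applies to all covered geometry samples.

Code:
typedef struct { float r, g, b, a; } Color;

Color tex2D(float u, float v);   /* stand-in for a texture fetch */

/* 4 extra taps around the iterated coordinate, blended to one MSAA color */
Color shade_pixel(float u, float v, const float du[4], const float dv[4])
{
    Color sum = { 0, 0, 0, 0 };
    for (int k = 0; k < 4; ++k)
    {
        Color c = tex2D(u + du[k], v + dv[k]);   /* offsets loaded as constants */
        sum.r += c.r; sum.g += c.g; sum.b += c.b; sum.a += c.a;
    }
    sum.r *= 0.25f; sum.g *= 0.25f; sum.b *= 0.25f; sum.a *= 0.25f;
    return sum;
}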

I'm not sure what you mean by changing sample size; it sounds like something that is very hard to do.
The number of samples is also something that I think should be constant (not surprising, given that I think the sample pattern should be constant). But again, if you mean a number of samples that will be blended to one MSAA color inside the PS, then yes, that should be possible.

But now we need this to be used. As far as I know, no games are yet explicitly coded for FSAA, other than possibly an in-game FSAA level slider, which is far from what we talked about above. So it must be possible for the driver to change a shader to add the "supersampled" fog and frame buffer blending at the end.
As long as the hardware supports shaders that are a bit longer, and has a few more temp registers than reported, it should be possible.
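A sketch of the tail the driver could append, assuming per-sample depth and destination color can be read (fog_factor, fb_read/fb_write, and the alpha blend are all placeholder assumptions, not real API):

Code:
typedef struct { float r, g, b, a; } Color;

float fog_factor(float z);                      /* placeholder fog curve */
Color fb_read (int x, int y, int s);            /* assumed per-sample FB read */
void  fb_write(int x, int y, int s, Color c);   /* assumed per-sample FB write */

/* appended after the original shader: per-sample fog, then FB blend */
void epilogue(int x, int y, int n, Color src, const float z[], Color fogc)
{
    for (int s = 0; s < n; ++s)
    {
        float f = fog_factor(z[s]);             /* fog differs per sample via depth */
        Color c = { src.r * f + fogc.r * (1 - f),
                    src.g * f + fogc.g * (1 - f),
                    src.b * f + fogc.b * (1 - f),
                    src.a };
        Color d = fb_read(x, y, s);             /* the "supersampled" FB blend */
        c.r = c.r * c.a + d.r * (1 - c.a);
        c.g = c.g * c.a + d.g * (1 - c.a);
        c.b = c.b * c.a + d.b * (1 - c.a);
        fb_write(x, y, s, c);
    }
}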


But how about the performance?
:(
Well, that is of course a problem. The cost of doing fog and FB blend in the PS in SS style, when all else is MS, could be quite a big part of the PS instruction budget for short shaders. So the instructions that do that part had better be efficient.
 
Basic -

Err sorry ... you are right, the sample pattern should be fixed at render time (not the number of samples, however, since partially covered pixels won't cover all the samples for a pixel).

By sample size, I simply mean "the number of bytes stored for each sample". It should probably be set to a power of 2, for easy addressing. Since programmable hardware is now doing the blending, and the render target isn't necessarily a bunch of pixels anymore, why limit the sample size to a 32-bit color? You might as well say that a sample in your buffer has some fixed size, and leave interpretation of the sample data to the pixel program.
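With a power-of-two sample size, addressing stays cheap; a minimal sketch (the parameter names are mine):

Code:
#include <stddef.h>

/* e.g. sample_size_log2 = 4 for 16-byte samples */
unsigned char *sample_address(unsigned char *base,
                              size_t pixel_index, size_t sample_index,
                              unsigned samples_per_pixel_log2,
                              unsigned sample_size_log2)
{
    size_t linear = (pixel_index << samples_per_pixel_log2) + sample_index;
    return base + (linear << sample_size_log2);   /* shifts only, no multiply */
}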

As for performance sucking... besides increasing the clock speed of the functional units and/or adding more of them, how about this:

The number of AA samples is expected to go up to 16X/32X and above. For an average pixel, most of the N samples taken will contain identical data (colors, for example).

So instead of storing N samples, store J samples S1, S2, ..., SJ with coverage masks Cs1, Cs2, ..., CsJ, where 1 <= J <= N.
The shader then operates on the set of samples given by a coverage mask C.

Code:
for (i = 0; i < J; ++i)
{
    if ((C & Csi) == 0)   //sample i isn't covered by this fragment
        continue;

    //compute blend result
    ...
    //store result as a sample with coverage mask (C & Csi)
    ...
}

The coverage masks would be a way of both reducing the storage cost of a buffer being rendered to, and of performing the minimum number of blends required.

So in addition to your SET_SAMPLE Rn instruction, how about a CSET_SAMPLE instruction to set up the next sample Si with Csi intersecting C?
On top of STORE_SAMPLE, you could have a CSTORE_SAMPLE instruction which stores the same result to the sample locations in (C & Csi).
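In terms of the loop above, the pair might behave roughly like this (a C model of the semantics, not a proposed encoding; store_sample is a hypothetical stand-in for the payload write):

Code:
/* CSET_SAMPLE: advance to the next stored sample Si whose mask
 * intersects the fragment's coverage C; -1 when none is left */
int cset_sample(const unsigned Cs[], int J, unsigned C, int prev)
{
    for (int i = prev + 1; i < J; ++i)
        if (Cs[i] & C)
            return i;
    return -1;
}

void store_sample(int i, unsigned mask);   /* hypothetical payload write */

void shade_fragment(const unsigned Cs[], int J, unsigned C)
{
    for (int i = cset_sample(Cs, J, C, -1); i >= 0;
         i = cset_sample(Cs, J, C, i))
    {
        /* one blend result computed here ... */
        store_sample(i, C & Cs[i]);        /* CSTORE_SAMPLE: one write covering (C & Csi) */
    }
}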

Regards,
Serge
 
I agree with the idea of "have one block of data per (sub)pixel, and let the pixel program pack whatever it wants into it". I agree that this means that different things packed into the pixel could have varying size. I just thought it sounded as if you wanted a different amount of data per pixel, and that could be hard to implement.

You're right that going to a SuperScene or maybe even Z3-like framebuffer could save pixel shading power.

Maybe it could be made available through a special kind of loop, or loop index. When it's used, the hardware automatically runs the loop once for each subpixel intersection that needs it.

But even though there are things to gain with it, it could increase complexity a great deal too.
 
Basic,

Re: complexity... Computing C is already necessary (C is needed and modified by the Z test prior to shading). Finding the samples Si with (Csi & C) != 0 can be done prior to shading a pixel... how about additional pixel shader input registers:

SC : number of samples being operated on
SDi : sample data for sample Si.

A shader would start in one of two modes. Supersampling mode would force a separate computation for each sample in C. Multisampling mode would set up SDi to the samples with (C & Csi) != 0.

SC would be used as a loop count using the standard instructions... SET_SAMPLE Ri and OUT_SAMPLE Ri would then be sufficient to do everything described...
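Pulling the registers together, a C model of the two start modes (SC and SDi are from the posts above; everything else, including the mask layout and 16-sample cap, is assumed):

Code:
#define MAX_SAMPLES 16

typedef struct {
    int      SC;                       /* number of samples being operated on */
    unsigned SDmask[MAX_SAMPLES];      /* coverage of each entry, (C & Csi) */
} ShaderInputs;

void setup_inputs(ShaderInputs *in, int supersampling,
                  unsigned C, const unsigned Cs[], int J)
{
    in->SC = 0;
    if (supersampling) {
        /* SSAA start: one separate computation per set bit of C */
        for (int s = 0; s < MAX_SAMPLES; ++s)
            if (C & (1u << s))
                in->SDmask[in->SC++] = 1u << s;
    } else {
        /* MSAA start: one entry per stored sample with (C & Csi) != 0 */
        for (int i = 0; i < J; ++i)
            if (C & Cs[i])
                in->SDmask[in->SC++] = C & Cs[i];
    }
    /* the shader then loops SC times, using SET_SAMPLE k to bind SDk
     * and OUT_SAMPLE k to write the result back under SDmask[k] */
}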
 