Hardware MSAA

It's kind of arrogant to assume that I and a lot of other people don't understand something. It's quite possible we're ignorant of something, but that doesn't mean we wouldn't understand it if it were properly explained.
Since when is saying you don't understand equivalent to saying you can't or won't understand even if I explain it? And why would I bother explaining it if I felt the latter were true?

With software rasterization you can keep data movement to a minimum because it can be done more locally.
Of course you are going to think that way if you ignore all the data flow that distributes the information required for rasterization into the shader units.

Nothing about rasterization can actually be kept more local. First of all, going from vertices to quads necessitates a complete redistribution of data by its very nature, so you can't eliminate that data flow. Then you have early Z/stencil testing to cull quads before they even enter the shader pipeline. That involves specialized functionality which would have to be repeated in each shader unit in order to eliminate the rasterizer, along with duplicated data flow because the culling is no longer centralized. Load balancing also becomes far more complicated since pixel generation isn't centralized.
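To make that concrete, here's a rough C++ sketch of the inner loop of a software rasterizer (all names and structure are mine, purely illustrative, no tiling or SIMD): vertex data has to be redistributed into per-quad pixel work, interpolated and early-Z tested before anything reaches the shader, and every shader unit doing this for itself means duplicating exactly this machinery and its data flow.

```cpp
#include <algorithm>

// Illustrative only: real rasterizers tile, vectorize and cull hierarchically,
// but the data flow is the same - vertex data gets redistributed into
// per-quad pixel work before any shading happens.
struct Vertex { float x, y, z; };

// Edge function: >= 0 when (px,py) is on the inside of edge (a,b).
static float edge(const Vertex& a, const Vertex& b, float px, float py) {
    return (px - a.x) * (b.y - a.y) - (py - a.y) * (b.x - a.x);
}

void rasterize_triangle(const Vertex& v0, const Vertex& v1, const Vertex& v2,
                        float* depth, int width, int height,
                        void (*shade_quad)(int x, int y, const float coverage[4])) {
    // Bounding box, snapped to 2x2 quad boundaries.
    int minX = std::max(0, (int)std::min({v0.x, v1.x, v2.x})) & ~1;
    int minY = std::max(0, (int)std::min({v0.y, v1.y, v2.y})) & ~1;
    int maxX = std::min(width  - 1, (int)std::max({v0.x, v1.x, v2.x}));
    int maxY = std::min(height - 1, (int)std::max({v0.y, v1.y, v2.y}));

    float area = edge(v0, v1, v2.x, v2.y);
    if (area <= 0.0f) return;                        // back-facing or degenerate

    for (int y = minY; y <= maxY; y += 2) {
        for (int x = minX; x <= maxX; x += 2) {
            float coverage[4];
            bool anyVisible = false;
            for (int s = 0; s < 4; ++s) {            // the 4 pixels of a 2x2 quad
                coverage[s] = 0.0f;
                int pxi = x + (s & 1), pyi = y + (s >> 1);
                if (pxi >= width || pyi >= height) continue;
                float px = pxi + 0.5f, py = pyi + 0.5f;   // sample at pixel centre
                float w0 = edge(v1, v2, px, py);
                float w1 = edge(v2, v0, px, py);
                float w2 = edge(v0, v1, px, py);
                if (w0 < 0 || w1 < 0 || w2 < 0) continue; // outside the triangle
                // Early Z: interpolate depth (screen-space linear, no perspective
                // correction for brevity) and test before any shading happens.
                float z = (w0 * v0.z + w1 * v1.z + w2 * v2.z) / area;
                float& zbuf = depth[pyi * width + pxi];
                if (z >= zbuf) continue;                  // occluded: culled here
                zbuf = z;
                coverage[s] = 1.0f;
                anyVisible = true;
            }
            if (anyVisible) shade_quad(x, y, coverage);   // only now does "shading" start
        }
    }
}
```

The point isn't that this is slow per se; it's that the interpolation, coverage and early-Z work above is exactly what currently sits in one small block of dedicated logic next to the dispatcher.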

I think the best way to see why centralized, FF rasterization is here to stay is to just think of the rasterizer as part of the dispatcher, which is needed to oversee all the computation units. Rasterization is not a workload that needs to be distributed, but rather a way of distributing the real workload.
Never say never. Fixed-function alpha testing and fog also didn't need any flexibility in moving data around. Yet they are gone now.
It's generic because the test has become arbitrary. The actual killing of the 'pixel' is now also used for other functionality.
Alpha testing is not gone. It's simply been exposed as a shader function now (texkill/clip) instead of a state change in the API. Alpha test has always been synonymous with per-pixel culling, and GPUs will always have specialized hardware to handle that differently from Z-culling. There's nothing more generic about it now either aside from aesthetics. If Microsoft took away clip() but gave the alpha test state change back, all you'd have to do is add a boolean to your program and modify the alpha value at the end to match your test.
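In other words, the two are trivially interchangeable. Here's a hypothetical sketch in C++ of both forms of the same per-pixel cull (names are mine, purely for illustration):

```cpp
// The fixed-function alpha test state and the shader-side kill (texkill/clip)
// express the same per-pixel cull; only the API surface differs.
struct AlphaTestState { bool enabled; float reference; };

// Old style: an API render state applied after the shader runs.
bool passes_alpha_test(float alpha, const AlphaTestState& s) {
    return !s.enabled || alpha >= s.reference;    // e.g. D3DCMP_GREATEREQUAL
}

// New style: the shader itself decides; returning false plays the role of clip().
bool shade_and_maybe_kill(float alpha, float threshold) {
    if (alpha - threshold < 0.0f) return false;   // clip(alpha - threshold)
    return true;                                  // pixel survives and gets written
}
```

Whether the API exposes it as a render state or as clip() in the shader, the hardware underneath still just needs a per-pixel kill.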

Fog is a pretty meaningless example, because nobody ever said that fog will never be generalized.

For texturing, remember that I said filtering, not address calculations. Simply getting 4-8 samples per clock into each shader unit instead of one filtered sample (which is required to get close to FF performance) will cost more die space than the FF filtering unit itself.
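To show what "one filtered sample" hides, here's a rough C++ sketch of plain bilinear filtering (single channel, clamp addressing, no mipmapping; names are mine): every filtered result is already four texel fetches plus three lerps, which the FF unit delivers per clock.

```cpp
#include <cmath>

// Rough sketch of what a single bilinearly filtered fetch costs in software:
// four texel reads plus three lerps, before even considering mipmapping,
// trilinear or anisotropic filtering. Names and layout are illustrative.
struct Texture {
    const float* texels;   // single-channel for brevity
    int width, height;
    float fetch(int x, int y) const {             // clamp addressing
        x = x < 0 ? 0 : (x >= width  ? width  - 1 : x);
        y = y < 0 ? 0 : (y >= height ? height - 1 : y);
        return texels[y * width + x];
    }
};

float sample_bilinear(const Texture& tex, float u, float v) {
    float x = u * tex.width  - 0.5f;              // texel-centre convention
    float y = v * tex.height - 0.5f;
    int x0 = (int)std::floor(x), y0 = (int)std::floor(y);
    float fx = x - x0, fy = y - y0;

    float t00 = tex.fetch(x0,     y0);
    float t10 = tex.fetch(x0 + 1, y0);
    float t01 = tex.fetch(x0,     y0 + 1);
    float t11 = tex.fetch(x0 + 1, y0 + 1);

    float top    = t00 + fx * (t10 - t00);        // lerp along x
    float bottom = t01 + fx * (t11 - t01);
    return top + fy * (bottom - top);             // lerp along y
}
```

Multiply that by trilinear (two of these plus another lerp) or anisotropic filtering and it's clear why one filtered texel per clock from a small FF unit is hard to beat for the die area it costs.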

No matter how inefficient something may seem to hardware designers, there's no other choice but to follow the demands of the software developers. If they want more generic memory accesses, texture samplers will eventually disappear. We can disagree on whether this is what they want, but I don't think fixed-function hardware can turn it around (in the long run).
Unless you think devs will one day prefer 1/4 performance for the majority of texels in order to have filtering flexibility, or forgo filtering for the vast majority of memory reads (i.e. >98%) by the shader, there won't be an impetus to remove FF filtering from a GPU. I emphasize the G, because I'm talking about a processor used primarily for graphics, and filtering hasn't really changed for many decades. If HPC miraculously finds an application that allows them to displace GPUs, that's a different story.

Now, if we move to path tracing and it needs 50x the memory accesses that final pixel texturing does, I guess it's possible that devs will do this, but I don't think that's reasonable.
That was the whole point! It's a perfect example. Fixed-function hardware disappears because the usage evolves in favor of something more programmable.
No, it didn't disappear. It was expanded into the programmable unit. What you see with the API is different from what hardware developers implement. ATI never really had fixed function TnL. R100's vertex unit was just short of a DX8 vertex shader. I wouldn't be surprised if GF1/2 were the same. GF3 didn't remove DOT3 from GF1/2 and replace it with a pixel shader. Both had the same register combiner architecture of which DOT3 was a small part. GF3 just added dependent texturing (finally catching up to Matrox G400 pixel ability) and DX8 exposed it all differently.

Now ask yourself why nobody is planning to implement those features in hardware...
Fantastic argument. "Nobody is implementing bubble sort hardware in a CPU. Therefore all fixed function hardware is on the way out..."

There's definitely a huge software challenge ahead of us. But that doesn't make it a bad idea.
I wasn't talking about the software challenge. I was talking about the additional hardware needed to make software rasterization possible. You unfortunately glossed over the paragraph I wrote about the paper on software rasterization.
 
If you look at the area associated with a rasteriser, you couldn't put sufficient ALUs down in that area to give equivalent performance, so you end up with ALUs that are bigger consumers of power being active for longer; the net result is higher power consumption.

I've actually been round this loop recently with fixed-function interpolation, and although the overall area and even utilisation of a programmable solution look better, the extra power consumption wasn't even close to being an acceptable compromise.
Short term, absolutely! No argument there. Fixed-function dedicated hardware will remain critically important for some time to come.

But that's for running today's applications. There is no doubt the workloads will continue to get much more diverse. Just look at the evolution over the past ten years. Today's highly programmable shaders made absolutely zero sense for a game like Max Payne. Unified pipelines, dynamic flow control, double-precision floating point? Are you kidding me? Yet it was all added within a few years' time, and we can no longer live without features like these because games and other applications make use of them (or benefit from them).

So as crazy as software rasterization and texture sampling may sound today, people will continue to buy new hardware both for higher performance and for new features. Vertex and pixel processing has already become highly programmable; it wasn't cheap to implement, but the resulting flexibility enabled some amazing progress. The next expensive step, once the transistor budget allows it, is to make rasterization and texture sampling more generic.

Don't for a second judge that by the effect it would have on the performance of today's games. Mark my words: by then Crysis 2 will be running smoothly on a CPU (just like Max Payne runs smoothly on a CPU today). So future GPUs had better have some really impressive capabilities!
To do what you appear to want will require a significant increase in complexity which means higher power and less performance.
That's also true for all the programmability that has been added in the past decade. But the higher (relative) power and lower (relative) performance was worth it because the goal wasn't to run Max Payne at 10,000 FPS, the goal was to run Crysis 2 and a whole bunch of other applications at adequate performance.
We only have to look to Intel for examples of this.
Larrabee was simply too much too soon (or too little too soon depending on your point of view).

Basically, the entire software world can't be forced to adopt a new programming approach overnight just by introducing new hardware with a radically new architecture and extensive new capabilities. Despite the potential of running existing applications more efficiently (with a significant rewrite) and enabling whole new classes of applications, it's just too big an investment and risk for software companies.

But that doesn't mean it's proof that this kind of architecture is not the future. It's just proof that things evolve faster when taking baby steps rather than trying to take leaps.
 
Now that many of the major developers of 3D engines have chosen a deferred shading pipeline, should we expect hardware-accelerated MSAA to be dropped from future architectures? I'm not referring so much to rasterization or MSAA buffer generation, but to the compression, blending and resolve steps of the process.

That hardware appears to be pretty much useless on the latest engine technology, so why keep it around? I expect new hardware to be fast enough to eat the bandwidth and compute costs of emulating these steps on the shader units for older forward renderers.
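For reference, a shader-based resolve in a deferred renderer boils down to something like this per pixel (written here as plain C++ rather than a compute shader, with illustrative names):

```cpp
#include <vector>
#include <cstddef>

// Illustrative box-filter resolve of an MSAA color buffer done "by hand",
// the way a deferred renderer would do it in a shader pass instead of
// relying on the fixed-function resolve/blend/compression hardware.
struct Color { float r, g, b, a; };

void resolve_msaa(const std::vector<Color>& msaaColor,   // width*height*samples, sample-major per pixel
                  std::vector<Color>& resolved,          // width*height
                  int width, int height, int samples) {
    resolved.resize((size_t)width * height);
    for (int y = 0; y < height; ++y) {
        for (int x = 0; x < width; ++x) {
            Color sum = {0, 0, 0, 0};
            size_t base = ((size_t)y * width + x) * samples;
            for (int s = 0; s < samples; ++s) {          // average all subsamples
                const Color& c = msaaColor[base + s];
                sum.r += c.r; sum.g += c.g; sum.b += c.b; sum.a += c.a;
            }
            float inv = 1.0f / samples;
            resolved[(size_t)y * width + x] = { sum.r * inv, sum.g * inv,
                                                sum.b * inv, sum.a * inv };
        }
    }
}
```

In practice deferred engines tweak this (per-sample shading only at detected edges, resolving after lighting or tone mapping), which is part of why the fixed-function resolve path goes unused there.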

Thoughts?

Some of us still play older games that support "traditional" MSAA just fine. But if it can be done just the same using a new method, then I'm fine with it, as long as I can still use FSAA in older games without any adverse effects.
 