About MLAA SMS update more explanations (source gaf):
'“It was extremely expensive at first. The first not so naive SPU version, which was considered decent, was taking more than 120 ms, at which point, we had decided to pass on the technique. It quickly went down to 80 and then 60 ms when some kind of bottleneck was reached. Our worst scene remained at 60ms for a very long time, but simpler scenes got cheaper and cheaper. Finally, and after many breakthroughs and long hours from our technology teams, especially our technology team in Europe, we shipped with the cheapest scenes around 7 ms, the average Gow3 scene at 12 ms, and the most expensive scene at 20 ms.
In term of quality, the latest version is also significantly better than the initial 120+ ms version. It started with a quality way lower than your typical MSAA2x on more than half of the screen. It was equivalent on a good 25% and was already nicer on the rest. At that point we were only after speed, there could be a long post mortem, but it wasn’t immediately obvious that it would save us a lot of RSX time if any, so it would have been a no go if it hadn’t been optimized on the SPU. When it was clear that we were getting a nice RSX boost ( 2 to 3 ms at first, 6 or 7 ms in the shipped version ), we actually focused on evaluating if it was a valid option visually. Despite of any great performance gain, the team couldn’t compromise on quality, there was a pretty high level to reach to even consider the option. And as for the speed, the improvements on the quality front were dramatic. A few months before shipping, we finally reached a quality similar to MSAA2x on almost the entire screen, and a few weeks later, all the pixelated edges disappeared and the quality became significantly higher than MSAA2x or even MSAA4x on all our still shots, without any exception. In motion it became globally better too, few minor issues remained which just can’t be solved without sub-pixel sampling.
There would be a lot to say about the integration of the technique in the engine and what we did to avoid adding any latency. Contrarily to what I have read on few forums, we are not firing the SPUs at the end of the frame and then wait for the results the next frame. We couldn’t afford to add any significant latency. For this kind of game, gameplay is first, then quality, then framerate. We had the same issue with vsync, we had to come up with ways to use the existing latency. So instead of waiting for the results next frame, we are using the SPUs as parallel coprocessors of the RSX and we use the time we would have spent on the RSX to start the next frame. With 3 ms or 4 ms of SPU latency at most, we are faster than the original 6ms of RSX time we saved. In the end it’s probably a wash in term of latency due to some SPU scheduling consideration. We had to make sure we could kick off the jobs as soon as the RSX was done with the frame, and likewise, when the SPU are done, we need the RSX to pick up where it left and finish the frame. Integrating the technique without adding any latency was really a major task, it involved almost half of the team, and a lot of SPU optimization was required very late in the game.”
“For a long time we worked with a reference code, algorithm changes were made in the reference code and in parallel the optimized code was being optimized further. the optimized version never deviated from the reference code. I assume that doing any kind of cheap approximation would prevent any changes to the algorithm. There’s a point though, where the team got such a good grip of the optimized version that the slow reference code wasn’t useful anymore and got removed. We tweaked some values, made few major changes to the edge detection code and did a lot of testing. I can’t stress it enough. every iteration was carefully checked and evaluated.”