As I understand it, it scales better at higher resolutions. The example given was 720p not being suitable for CUDA, but what about the resolutions used on PC, which run roughly 2-4x that pixel count?
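For a rough sense of scale, here is a quick pixel-count comparison against 720p (the PC resolutions below are examples I picked, not figures from the discussion):

    # Pixel counts of some common PC resolutions relative to 720p.
    # Example resolutions chosen by me for illustration.
    base = 1280 * 720  # 720p: 921,600 pixels
    for name, w, h in [("1080p", 1920, 1080), ("1440p", 2560, 1440)]:
        px = w * h
        print(f"{name}: {px:,} pixels, {px / base:.2f}x the pixels of 720p")

That lands at about 2.25x for 1080p and 4x for 2560x1440, which is where the 2-4x range comes from.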
They could also use it on a large batch of images; the higher the resolution, the better the gain.
But do you really stand to gain that much when it already runs fast? The example given was MLAA taking 5 ms to process a 1024x1024 render on a 3.0 GHz quad-core CPU.
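Assuming the cost scales roughly linearly with pixel count (a simplification; MLAA cost also depends on edge density), that 5 ms figure extrapolates something like this:

    # Naive extrapolation of the quoted 5 ms / 1024x1024 figure, assuming
    # cost is proportional to pixel count (ignores edge density, caches).
    ref_pixels, ref_ms = 1024 * 1024, 5.0
    for name, w, h in [("720p", 1280, 720), ("1080p", 1920, 1080), ("1440p", 2560, 1440)]:
        est_ms = ref_ms * (w * h) / ref_pixels
        print(f"{name}: ~{est_ms:.1f} ms")

So roughly 4.4 ms at 720p, 10 ms at 1080p and 18 ms at 1440p under that crude model, which is where the GPU starts to look attractive for higher resolutions.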
It depends on the architecture. The speed-up may not be linear due to data dependencies and other overhead that only the implementer would know about. You would also need to integrate the whole process into the GPU pipeline.
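As a toy illustration of why the speed-up can flatten out, here is an Amdahl's-law-style sketch with a made-up serial fraction and a fixed dispatch overhead; none of these numbers are measurements:

    # Toy model: speed-up limited by a serial fraction plus a fixed
    # per-dispatch overhead. All numbers are illustrative assumptions.
    def speedup(cores, serial_fraction=0.1, overhead_ms=0.5, total_ms=5.0):
        serial_ms = total_ms * serial_fraction
        parallel_ms = total_ms * (1 - serial_fraction) / cores
        return total_ms / (serial_ms + parallel_ms + overhead_ms)

    for n in (1, 4, 16, 64, 256):
        print(f"{n:4d} cores -> {speedup(n):.2f}x")

With those assumptions the speed-up plateaus below 5x no matter how many cores you add, which is the kind of non-linearity I mean.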
GOWAA runs on an architecture where the GPU, CPU and memory are arranged so they can share data efficiently. So T.B. could throw the more flexible CPU cores at the image data via DMA. Perhaps future PCs will be like this; I have no idea.
If we wait a while, I suspect we will see different GPU implementations for various problem sizes. If a single-threaded CPU implementation can solve the problem in 5 ms, I don't think they need to throw a massive number of cores at it (that would introduce too much overhead), so the per-core implementation needs to be very efficient.
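To put that in perspective, a back-of-the-envelope sketch of how little work each core gets when 5 ms of work is spread across many cores (the per-core overhead figure is a pure assumption):

    # If 5 ms of total work is split evenly across N cores, the slice per
    # core shrinks fast, so any fixed per-core cost (setup, sync, poor
    # cache behaviour) becomes a large fraction. Illustrative numbers only.
    total_us = 5000.0            # 5 ms of single-threaded work
    per_core_overhead_us = 10.0  # assumed fixed cost per core (made up)
    for n in (4, 32, 256, 1024):
        work_us = total_us / n
        print(f"{n:5d} cores: {work_us:7.1f} us of work per core, "
              f"overhead is {per_core_overhead_us / work_us:.0%} of it")

At a few hundred cores each core only gets tens of microseconds of work, so the per-core code has almost no room for inefficiency.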