So every GPU thread processes a single pixel in this approach. In the MLAA algorithm, however, pixels are not independent; they have to be processed in a rather strict order. In other words, MLAA is not embarrassingly parallel and is thus hard to implement on a GPU. Edge detection is not the issue.
I'll go back into my cage now.
Reading the MLAA paper, I see that the first step is to find the "edges", of which only the longest are considered (the primary edges). These are then split up into L-shaped structures, and the color averaging is applied using a connecting triangle (or rather its area)!
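To make the averaging step a bit more concrete, here is a minimal sketch of the blend weights along a single L-shape. It rests on a simplified model that is my assumption, not a quote from the paper: the primary edge is `length` pixels long, the secondary edge is one pixel, and the connecting line runs from the midpoint of the secondary edge to the far end of the primary edge. The name `l_shape_weights` is made up for illustration.

```python
def l_shape_weights(length):
    """Area of the connecting triangle inside each pixel along the primary edge.

    Assumed model: the line drops linearly from height 0.5 at x = 0
    to height 0 at x = length, so the area under it in pixel i is the
    line's height at the pixel centre, x = i + 0.5.
    """
    return [0.5 * (1 - (2 * i + 1) / (2 * length)) for i in range(length)]


weights = l_shape_weights(4)
print(weights)  # [0.4375, 0.3125, 0.1875, 0.0625]
```

The point to note is that every weight depends on the detected `length`, so detecting the same edge with a different length changes the ramp for all pixels along it.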
Now, suppose you want a parallel version of this algorithm that fires up all SPUs, for instance using domain decomposition:
-Considering 4 SPUs, one should split the image into at least four equal pieces and process each piece independently.
-If you run this pattern detection independently on each piece of the image, the number of patterns and especially their shapes ('longest primary edge') could change, right?
-Especially at the 'artificial' boundaries between the sub-domains...
-If the number and shape of the patterns change, the triangle used to determine the new pixel colors differs from the single-SPU case, so the resulting colors differ, and so does the anti-aliasing of the image.
-Typically, if you want good load balancing, you should split the image into more than four pieces, which exacerbates these problems.
-The load-balancing problem I see is that, in theory, one SPU could detect no edges in its sub-domain and sit idle while the others do their hard averaging work, unless special care is taken in such situations (i.e. dynamic load balancing!).
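The points above can be illustrated with a toy example on one row of edge flags. This is only a sketch: `edge_runs` and `tiled_edge_runs` are hypothetical helpers, the "edge" is just a run of `True` values, and real MLAA works on 2D patterns rather than a single row.

```python
def edge_runs(flags):
    """Return (start, length) for every maximal run of True in flags."""
    runs = []
    start = None
    for i, flag in enumerate(flags):
        if flag and start is None:
            start = i
        elif not flag and start is not None:
            runs.append((start, i - start))
            start = None
    if start is not None:
        runs.append((start, len(flags) - start))
    return runs


def tiled_edge_runs(flags, tile_width):
    """Detect runs in each tile separately, as independent SPUs would."""
    runs = []
    for x0 in range(0, len(flags), tile_width):
        for start, length in edge_runs(flags[x0:x0 + tile_width]):
            runs.append((x0 + start, length))
    return runs


# 16-pixel row with one 10-pixel edge starting at x = 2
row = [False] * 2 + [True] * 10 + [False] * 4

print(edge_runs(row))           # [(2, 10)]
print(tiled_edge_runs(row, 4))  # [(2, 2), (4, 4), (8, 4)]
```

With four tiles, the single 10-pixel edge turns into three shorter ones, and the last tile finds no edge at all, which is exactly the idle-SPU situation.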
What interests me:
- Can one generally say that the shorter primary edges caused by the domain decomposition yield worse IQ when the triangles are used for averaging, compared to the single-SPU case?
If so, this could be a major drawback of the algorithm, because the only alternative I see for a parallel version is to somehow communicate with neighboring domains to find the unique pattern. This smells like a difficult "quality versus SPU time" quest!
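One way the "communicate with neighbor domains" idea could look, purely as a sketch and not something taken from the paper: each tile is allowed to read up to `max_search` pixels beyond its own boundary (a halo), and for every edge pixel it measures the distance to both ends of its run. The names `run_extent`, `tile_extents` and `max_search` are hypothetical.

```python
def run_extent(flags, x, max_search):
    """Distance from edge pixel x to the left and right end of its run.

    Each direction is capped at max_search steps, which bounds how far
    a tile has to read into its neighbours.
    """
    left = 0
    while left < max_search and x - left - 1 >= 0 and flags[x - left - 1]:
        left += 1
    right = 0
    while right < max_search and x + right + 1 < len(flags) and flags[x + right + 1]:
        right += 1
    return left, right


def tile_extents(flags, x0, x1, max_search):
    """Extents for the edge pixels owned by the tile [x0, x1)."""
    return {x: run_extent(flags, x, max_search)
            for x in range(x0, x1) if flags[x]}


# Same toy row as a single 10-pixel edge starting at x = 2
row = [False] * 2 + [True] * 10 + [False] * 4

whole = tile_extents(row, 0, len(row), max_search=16)
merged = {}
for x0 in range(0, len(row), 4):
    merged.update(tile_extents(row, x0, x0 + 4, max_search=16))

print(merged == whole)  # True
```

As long as the halo is at least as long as the edge, the tiled result matches the single-pass result. A smaller `max_search` means fewer reads outside the tile but truncated edge lengths, which is the quality versus time trade-off in miniature.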