To tell you the truth, I didn't thoroughly examine that particular algorithm, but the essential idea is centering errors about zero.
Update: I went ahead and thought about it. Since the chance of ending up with an error of -1 for an 8-bit output is 1/4 in the case with truncated intermediate results (assuming you need two odd intermediates, each truncated after its division by two, to get an error at the end), the proper centering would be to add one to one of the four numbers 50% of the time, giving a -1 error 1/8 of the time and a +1 error 1/8 of the time. This doesn't account for errors due to truncation of the final value, so it may need some adjustment.
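Here's a rough sketch (in Python, not the original program) of that four-value case. The function names, the choice of which input gets the +1, and the baseline are all mine: I'm measuring error against the "final truncation only" result, since that's the part I set aside above.

```python
import random

def avg4_truncated(a, b, c, d):
    # Pairwise average of four values, truncating after each division by two.
    return (((a + b) >> 1) + ((c + d) >> 1)) >> 1

def avg4_centered(a, b, c, d):
    # Centering trick described above: 50% of the time, add 1 to one of the
    # four inputs before averaging. Which input gets the +1 is my own choice;
    # any single input flips the parity of exactly one intermediate sum.
    if random.random() < 0.5:
        a += 1
    return avg4_truncated(a, b, c, d)

def error_histogram(avg_fn, trials=1_000_000):
    # Error is measured against (a+b+c+d) >> 2, i.e. only the final
    # truncation, since the final-truncation error is set aside here.
    hist = {}
    for _ in range(trials):
        # randrange(255) keeps the values 8-bit even after a possible +1.
        a, b, c, d = (random.randrange(255) for _ in range(4))
        err = avg_fn(a, b, c, d) - ((a + b + c + d) >> 2)
        hist[err] = hist.get(err, 0) + 1
    return {k: v / trials for k, v in sorted(hist.items())}

if __name__ == "__main__":
    random.seed(0)
    print("truncated:", error_histogram(avg4_truncated))
    print("centered: ", error_histogram(avg4_centered))
```

Running it prints the empirical error distributions for the plain truncated version and the centered version, so the probabilities above can be checked directly.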
By centering errors about zero, you can significantly reduce error accumulation. Centering the errors about zero means that half of the errors encountered are additive, while half are subtractive. With the most basic truncation, all errors are additive, meaning that every error compounds every other error.
Anyway, yes, I was wrong about reducing the accuracy of intermediate results. However, the essential idea is that the number of bits required for certain calculations doesn't have to increase as quickly as you might think.
Quick example: I produced a program some time ago that did the following:
Situation one (not centered):
1. 50% chance of adding 1 to the result (represents when one binary value is truncated)
2. 25% chance of adding 2 to the result (represents when both binary values are truncated)
3. 25% chance of doing nothing (simulating no error).
Situation two (centered):
1. 25% chance of adding 1 to the result.
2. 25% chance of subtracting 1 from the result.
3. 50% chance of doing nothing.
Basically, situation two is just situation one shifted down by 1 each pass, which centers it about zero. Right off the bat, it looks like it should produce less error. Each loop was run until the accumulated result reached a certain constant number, which we'll call T. T simulates the amount of "acceptable error," which would usually consist of bits that are truncated before being used. N is the number of loops executed:
For the non-centered case:
N = T (in the real world, this means error is proportional to the number of passes, which occurs whenever errors are always additive)
For the centered case:
N = (T^2)/2 (which, in the real world, means that error is proportional to the square root of the number of passes, which occurs whenever errors are centered about zero).
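For anyone who wants to reproduce this, here's a quick Python sketch of the sort of loops described above (again, not the original program). I'm assuming the stopping condition is the accumulated error reaching T in magnitude; the exact constants you get depend on that choice, but the linear-versus-quadratic growth of N with T is the point.

```python
import random

def passes_until_error(step_fn, limit, trials=1_000):
    """Average number of passes before |accumulated error| reaches `limit`."""
    total_passes = 0
    for _ in range(trials):
        err = 0
        n = 0
        while abs(err) < limit:
            err += step_fn()
            n += 1
        total_passes += n
    return total_passes / trials

def step_not_centered():
    # Situation one: +1 (50%), +2 (25%), 0 (25%); every error is additive.
    r = random.random()
    if r < 0.50:
        return 1
    if r < 0.75:
        return 2
    return 0

def step_centered():
    # Situation two: +1 (25%), -1 (25%), 0 (50%); errors centered about zero.
    r = random.random()
    if r < 0.25:
        return 1
    if r < 0.50:
        return -1
    return 0

if __name__ == "__main__":
    random.seed(0)
    for T in (8, 16, 32, 64):
        n1 = passes_until_error(step_not_centered, T)
        n2 = passes_until_error(step_centered, T)
        print(f"T={T:3d}  not centered: N ~ {n1:8.1f}  centered: N ~ {n2:10.1f}")
```

Doubling T should roughly double N in the non-centered case and roughly quadruple it in the centered case.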
Anyway, I'd have to examine precisely how this applies to real-world situations, but the general idea should be clear: it's not absolutely necessary to have N bits of extra accuracy to average 2^(N-1) values, because centered errors grow with the square root of the number of operations rather than linearly.
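To make that last claim a bit more concrete, here's a small experiment along those lines (my own construction, not the program above): it averages 2^D samples pairwise with no guard bits at all, once with plain truncation and once with a random 0-or-1 added before each shift as a stand-in for the centering idea. The truncated result drifts away from the exact average roughly in proportion to the depth, while the dithered one should stay within about a unit.

```python
import random

def tree_average(samples, combine):
    """Average len(samples) == 2**D values by repeated pairwise combining."""
    level = list(samples)
    while len(level) > 1:
        level = [combine(level[i], level[i + 1]) for i in range(0, len(level), 2)]
    return level[0]

def avg_truncate(x, y):
    # Plain truncation: when x + y is odd, the error is always -0.5.
    return (x + y) >> 1

def avg_dithered(x, y):
    # Stand-in for centering: add a random 0 or 1 before the shift, so when
    # the sum is odd the error is +0.5 or -0.5 with equal probability
    # instead of always -0.5.
    return (x + y + random.getrandbits(1)) >> 1

if __name__ == "__main__":
    random.seed(0)
    for depth in (4, 8, 12, 16):
        samples = [random.randrange(256) for _ in range(1 << depth)]
        exact = sum(samples) / len(samples)
        err_trunc = tree_average(samples, avg_truncate) - exact
        err_dith = tree_average(samples, avg_dithered) - exact
        print(f"depth {depth:2d} ({1 << depth:5d} samples): "
              f"truncated error {err_trunc:+.2f}, dithered error {err_dith:+.2f}")
```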