Differences between the refrast and real HW are not limited to FP24. The ordering of operations in the vertex shader, and the HW implementation of that math, can lead to slight LSB changes in the vertex data. Likewise, clipping can be performed differently on different architectures, which again changes LSBs.
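For a concrete sense of how operation ordering alone can move the last bit, here is a small C sketch (the values are made up for illustration, and it assumes strict single-precision IEEE evaluation, as on any SSE or ARM target): float addition is not associative, so two pipelines that merely reorder the same sum can land one LSB apart.

#include <stdio.h>

int main(void)
{
    float a = 1.0f, b = 4e-8f, c = 4e-8f;

    float order1 = (a + b) + c;   /* one evaluation order        */
    float order2 = a + (b + c);   /* the same math, reassociated */

    printf("order1 = %.9f\n", order1);               /* 1.000000000             */
    printf("order2 = %.9f\n", order2);               /* 1.000000119, one LSB up */
    printf("bitwise equal: %d\n", order1 == order2); /* 0                       */
    return 0;
}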
Then setup and rasterization lead to more differences. Typically, a HW unit will sort the vertices of each triangle so that a permutation of the vertices does not change what is rendered, but there is no set way to do that sort; each implementation does it differently. The precision maintained for attribute iteration and for Z computation also varies: Z could be done in FP or in integer, and with how many bits? Each hardware version will give slightly different results. You could even imagine a 1/2-pixel shift between HW versions. Shading, and the precision used in shading, add more differences. Finally, reduction to the final frame buffer (from shader precision down to 8 bits/component) can be done in several ways; another 1/2 LSB of change is trivial to pick up there.
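To make that last "1/2 LSB" point concrete, here is a hedged sketch of two equally legitimate reductions from shader precision to 8 bits/component; the helper names and the sample value are mine, not any particular HW's behavior.

#include <stdio.h>

/* 0.0..1.0 -> 0..255 by truncation */
static unsigned char quantize_trunc(float c)
{
    return (unsigned char)(c * 255.0f);
}

/* 0.0..1.0 -> 0..255 by round-to-nearest */
static unsigned char quantize_round(float c)
{
    return (unsigned char)(c * 255.0f + 0.5f);
}

int main(void)
{
    float c = 0.5f;   /* lands exactly between two 8-bit codes (127.5) */

    printf("truncate: %d\n", quantize_trunc(c));   /* 127 */
    printf("round:    %d\n", quantize_round(c));   /* 128 */
    return 0;
}

Both conventions are defensible; they simply disagree by one code on roughly half of all inputs, which is exactly the kind of difference described above.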
In general, we shoot for "self-consistency" -- we assume that the data coming in does not have gaps or T-vertices, and we guarantee not to introduce any. But small differences will be abundant -- and they have always been there.
Even WHQL certification acknowledges this process: it uses a fuzzy compare against a reference image, on very SIMPLE geometry. With a complex scene, the differences grow even more.
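Purely to illustrate the idea (the tolerance, the pass criterion, and the 8-bit RGBA layout below are assumptions for the sketch, not WHQL's actual procedure), a fuzzy per-pixel compare could look like this:

#include <stdio.h>
#include <stdlib.h>

/* Returns 1 if every channel of every pixel is within 'tol' LSBs of the
 * reference, 0 otherwise.  Both images are 8-bit RGBA, 'count' pixels each. */
static int fuzzy_compare(const unsigned char *test,
                         const unsigned char *ref,
                         size_t count, int tol)
{
    for (size_t i = 0; i < count * 4; ++i) {
        int diff = abs((int)test[i] - (int)ref[i]);
        if (diff > tol)
            return 0;
    }
    return 1;
}

int main(void)
{
    unsigned char ref[4]  = { 10, 20, 30, 255 };
    unsigned char test[4] = { 11, 20, 29, 255 };  /* off by one LSB in two channels */

    printf("pass: %d\n", fuzzy_compare(test, ref, 1, 1));  /* 1: within tolerance */
    return 0;
}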
In the end, the following seems true to me:
1) Doing pixel compares against the refrast is a waste of time.
2) Doing pixel compares between two different HW implementations is a waste of time.
3) Comparing images between two drivers on the same HW can give you a little information, but it's very fluffy. Generally, a new driver means *something* has changed.
4) Comparing the same HW and the same drivers, but with changes in the app or with app detection, can have value if done properly.