That's basic stuff for graphics programmers. We have all burned our hands with WB several times on many platforms. Once a buffer is full (all 64 bytes written), it is written out and released. The problems start when you have partially filled buffers. If your NT stores don't align to 64 bytes, the buffer will sit around, waiting for the last bytes to be stored.
If both threads on a core are doing NT stores, only four buffers, on average, are available to each. If each thread has multiple simultaneous streams, you can end up exhausting the write buffers, causing flushes. Now imagine that a partially filled buffer gets flushed, one that would have been filled with all 64 bytes had it sat around long enough. The next store into this 64-byte region will allocate a new fill buffer. That new buffer can only ever be partially filled, because the first bytes of the region have already been flushed, so it is not released when it is supposed to be. You now end up with multiple partially filled buffers sitting around, taking up this very limited, precious resource.
So:
1. Align your NT stores to 64-byte regions.
1a. If you can't, flush the NT stream once you have written the unaligned start in the first 64-byte region, to free the buffer.
2. Keep NT streams to a minimum (rewrite code, spread threads with NT writes on different physical cores)
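To make point 1 concrete, here is a rough sketch in C. It is not a tuned implementation, and the names `head_bytes` and `nt_copy` are made up for illustration. Note that it uses a variation on 1a: instead of writing the unaligned start with NT stores and then flushing, it writes the unaligned start and the leftover tail with ordinary stores, so the NT stream only ever touches whole, aligned 64-byte regions. It assumes x86 with SSE2 and falls back to plain memcpy elsewhere.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#if defined(__SSE2__)
#include <emmintrin.h>  /* _mm_loadu_si128, _mm_stream_si128, _mm_sfence */
#endif

#define LINE 64u  /* assumed size of one write-combining buffer */

/* Bytes from p up to the next 64-byte boundary (0 if p is already
   aligned), capped at n. */
static size_t head_bytes(const void *p, size_t n)
{
    size_t mis  = (size_t)((uintptr_t)p & (LINE - 1));
    size_t head = mis ? LINE - mis : 0;
    return head < n ? head : n;
}

/* Hypothetical helper: copy n bytes so that NT stores always cover
   complete, aligned 64-byte regions. */
static void nt_copy(void *dst, const void *src, size_t n)
{
    unsigned char *d = dst;
    const unsigned char *s = src;
    assert(n == 0 || (d != NULL && s != NULL));

    /* Unaligned start: ordinary stores, so no partial NT buffer is opened. */
    size_t head = head_bytes(d, n);
    memcpy(d, s, head);
    d += head; s += head; n -= head;

    size_t lines = n / LINE;
#if defined(__SSE2__)
    for (size_t i = 0; i < lines; i++) {
        /* Load the whole line first, then issue the four 16-byte NT
           stores back to back so the buffer fills completely. */
        __m128i a = _mm_loadu_si128((const __m128i *)(s +  0));
        __m128i b = _mm_loadu_si128((const __m128i *)(s + 16));
        __m128i c = _mm_loadu_si128((const __m128i *)(s + 32));
        __m128i e = _mm_loadu_si128((const __m128i *)(s + 48));
        _mm_stream_si128((__m128i *)(d +  0), a);
        _mm_stream_si128((__m128i *)(d + 16), b);
        _mm_stream_si128((__m128i *)(d + 32), c);
        _mm_stream_si128((__m128i *)(d + 48), e);
        d += LINE; s += LINE;
    }
    _mm_sfence();  /* order the NT stores before any later stores */
#else
    memcpy(d, s, lines * LINE);  /* fallback without SSE2 */
    d += lines * LINE; s += lines * LINE;
#endif
    n -= lines * LINE;

    /* Leftover tail (fewer than 64 bytes): ordinary stores again. */
    memcpy(d, s, n);
}
```

Calling `nt_copy(dst + 5, src, 300)` with a 64-byte-aligned `dst` would copy 59 bytes with ordinary stores, stream three full 64-byte regions, and finish the remaining 49 bytes with ordinary stores.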
I am mostly interested in the specific remarks, such as "When a store hits on a write buffer that has been written to earlier with a different memory type than that store, the buffer is closed and flushed.". What does "different memory type" mean? Does it mean that each standard store instruction (to standard cached memory) also causes all open WB buffers to flush? And does this also happen if one SMT thread does standard stores while the other does NT/WB stores?