I wonder how they manage to avoid partial flushes when they have two SMT threads running on the same core. Let's say both SMT threads are doing heavy non-temporal writes (to their own locations, obviously). One of the threads stalls because of a cache miss (very common), but hasn't completed a whole line of NT writes. The other thread continues without stalls. I would assume that this soon leads to the partially written NT line being evicted. When this happens, the evicted line needs to be combined with the existing contents (as not all bytes of the line were overwritten). When the stalled thread later recovers, it will resume writing from halfway through the NT line, and again the data needs to be combined with the existing memory contents, as the whole NT line wasn't written. This would be awfully slow. This is why I would assume that each SMT thread has its own ("L0") write-combine buffers. I am just trying to find out whether there's a difference between Intel and AMD in this regard. This is AMD's first SMT design, after all. It might be that they have overlooked something.
Intel's optimization document recommends paying attention to the number of independent buffers being written to and trying to segregate different types of write traffic. The software is expected to do what it can to increase the chances that a full run of non-temporal writes will occur before eviction or some other condition for writeback is met. That means trying to avoid regular loads in the middle of filling a buffer, and capping the number of independent buffers being written to in an inner loop. For prior generations, Intel only guaranteed 4 such arrays for simultaneous use. (3.6.10)
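As a rough illustration of that advice, here is a minimal C sketch that caps the number of output arrays touched per pass and always writes a whole line to one array before moving to the next. The function name, the 64-byte line size, and the use of SSE2 intrinsics are my assumptions for the example, not something taken from Intel's document; the cap of 4 is just the figure mentioned above.

```c
#include <assert.h>
#include <emmintrin.h>  /* SSE2: _mm_stream_si128, _mm_set1_epi8, _mm_sfence */
#include <stddef.h>
#include <stdint.h>

#define LINE_BYTES  64  /* assumed line / write-combining buffer size */
#define MAX_STREAMS 4   /* cap on arrays written per pass (figure quoted above) */

/* Hypothetical helper: fill nstreams arrays of n bytes each with `value`
 * using non-temporal stores. Each dst[s] must be 64-byte aligned and n
 * must be a multiple of LINE_BYTES. */
static void fill_streams_nt(uint8_t *const *dst, size_t nstreams,
                            size_t n, uint8_t value)
{
    assert(n % LINE_BYTES == 0);
    const __m128i v = _mm_set1_epi8((char)value);

    /* Process the arrays in groups so that one pass never writes to more
     * than MAX_STREAMS independent buffers. */
    for (size_t first = 0; first < nstreams; first += MAX_STREAMS) {
        size_t last = first + MAX_STREAMS;
        if (last > nstreams)
            last = nstreams;

        for (size_t off = 0; off < n; off += LINE_BYTES) {
            for (size_t s = first; s < last; s++) {
                assert(((uintptr_t)dst[s] % LINE_BYTES) == 0);
                __m128i *out = (__m128i *)(dst[s] + off);
                /* Four back-to-back 16-byte stores cover one full line,
                 * with no loads in between. */
                _mm_stream_si128(out + 0, v);
                _mm_stream_si128(out + 1, v);
                _mm_stream_si128(out + 2, v);
                _mm_stream_si128(out + 3, v);
            }
        }
    }
    _mm_sfence();  /* order the NT stores before later stores */
}
```

Whether grouping like this actually helps on a given core depends on how many buffers it has and when it flushes them, which is exactly the part that is poorly documented.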
If the code is multithreaded, section 8.5.5 includes a recommendation of using a small buffer in writeback memory to coalesce write-combine partial writes into a single final copy to WC memory rather than engaging in a stream of partial writes that might be interrupted. Section 7.7.1 advocates a similar use of small buffers and final full-line WC copies.
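The staging-buffer idea can be sketched roughly as follows. This is only my reading of the recommendation, not code from Intel's manual: the helper names, the 64-byte line size, and the 8-byte "partial write" granularity are all assumptions, and the destination here is ordinary memory written with non-temporal stores rather than a real WC mapping.

```c
#include <assert.h>
#include <emmintrin.h>  /* SSE2: _mm_load_si128, _mm_stream_si128, _mm_sfence */
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define LINE_BYTES 64  /* assumed line / write-combining buffer size */

/* Hypothetical helper: push one complete line from a cached staging
 * buffer to dst with four consecutive non-temporal stores. */
static void flush_line_nt(void *dst, const void *staging)
{
    assert(((uintptr_t)dst % LINE_BYTES) == 0);
    assert(((uintptr_t)staging % 16) == 0);
    const __m128i *src = (const __m128i *)staging;
    __m128i *out = (__m128i *)dst;

    /* Do all the loads first so nothing interrupts the run of stores. */
    __m128i a = _mm_load_si128(src + 0);
    __m128i b = _mm_load_si128(src + 1);
    __m128i c = _mm_load_si128(src + 2);
    __m128i d = _mm_load_si128(src + 3);

    _mm_stream_si128(out + 0, a);
    _mm_stream_si128(out + 1, b);
    _mm_stream_si128(out + 2, c);
    _mm_stream_si128(out + 3, d);
}

/* Hypothetical helper: copy n bytes (a multiple of LINE_BYTES) to a
 * line-aligned dst. The small partial writes land in the staging buffer,
 * which lives in normal write-back memory; only full lines go out. */
static void copy_via_staging(void *dst, const uint8_t *src, size_t n)
{
    _Alignas(LINE_BYTES) uint8_t staging[LINE_BYTES];
    assert(n % LINE_BYTES == 0);

    for (size_t off = 0; off < n; off += LINE_BYTES) {
        for (size_t i = 0; i < LINE_BYTES; i += 8)
            memcpy(staging + i, src + off + i, 8);  /* stand-in for partial writes */
        flush_line_nt((uint8_t *)dst + off, staging);
    }
    _mm_sfence();  /* order the NT stores before later stores */
}
```

The point of the extra copy is that a stall or a sibling thread can only interrupt the cached staging writes, which are cheap to resume, while the final copy is a short uninterrupted burst covering the whole line.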
The documented details for write-combining have become sparser over time, but the overall tone is that software is responsible for avoiding problems with partial writes or thrashing.
As to how AMD is different, it's unclear. The documentation for AMD's older cores gave a more detailed list of the conditions under which they might flush their buffers, whereas Intel's does not. That lack of transparency was part of the motivation behind some earlier advice to avoid write-combining or non-temporal instructions altogether, since the exact behaviors vary so much between designs and are not always communicated.
I did see a rather oblique tweet to the effect that some of the uncached operations under discussion regarding Zen's performance in certain games might not be as uncached as some are assuming, but I did not see any additional detail.