AMD RyZen CPU Architecture for 2017

Once a buffer is full (all 64 bytes written) it is written out and released. The problems start when you have partially filled buffers. If your NT stores don't align to 64 bytes, the buffer will sit around, waiting for the last bytes to be stored.

If both threads on a core are doing NT stores, only four buffers, on average, are available to each. If each thread has multiple simultaneous streams you can end up exhausting the write buffers, causing flushes. Now, imagine a partially filled buffer gets flushed, one that would have been filled to 64 bytes if it had sat around long enough. The next store in this 64-byte region will allocate a new fill buffer. This fill buffer will now be marked partially filled, because the first writes were already flushed, so it is not released when it is supposed to be. You end up with multiple partially filled buffers that sit around taking up this very limited, precious resource.

So:
1. Align your NT stores to 64-byte regions.
1a. If you can't, flush the NT stream once you have written the unaligned start in the first 64-byte region, to free the buffer.
2. Keep NT streams to a minimum (rewrite code, spread threads with NT writes on different physical cores)
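A minimal sketch of points 1 and 2 (my own helper, plain C++ with SSE intrinsics, not Ryzen-specific and untested on Family 17h): stream only whole 64-byte lines, handle the unaligned head and tail with ordinary cached stores, and fence once at the end.

```cpp
#include <immintrin.h>
#include <cstddef>
#include <cstdint>

// Sketch: fill whole write-combining buffers (64 bytes) per iteration so each
// buffer can be closed as soon as it is filled.
void nt_fill(uint8_t* dst, const uint8_t* src, size_t bytes)
{
    uint8_t* const end = dst + bytes;

    // Head: plain cached stores up to the first 64-byte boundary.
    while ((reinterpret_cast<uintptr_t>(dst) & 63) != 0 && dst < end)
        *dst++ = *src++;

    // Body: exactly one full 64-byte line of streaming stores per iteration.
    for (; dst + 64 <= end; dst += 64, src += 64) {
        __m128i a = _mm_loadu_si128(reinterpret_cast<const __m128i*>(src) + 0);
        __m128i b = _mm_loadu_si128(reinterpret_cast<const __m128i*>(src) + 1);
        __m128i c = _mm_loadu_si128(reinterpret_cast<const __m128i*>(src) + 2);
        __m128i d = _mm_loadu_si128(reinterpret_cast<const __m128i*>(src) + 3);
        _mm_stream_si128(reinterpret_cast<__m128i*>(dst) + 0, a);
        _mm_stream_si128(reinterpret_cast<__m128i*>(dst) + 1, b);
        _mm_stream_si128(reinterpret_cast<__m128i*>(dst) + 2, c);
        _mm_stream_si128(reinterpret_cast<__m128i*>(dst) + 3, d);
    }

    // Tail: plain cached stores again.
    while (dst < end)
        *dst++ = *src++;

    // Make the streamed data globally visible before another core reads it.
    _mm_sfence();
}
```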
That's basic stuff for graphics programmers. We have all burned our hands with WB several times on many platforms.

I am mostly interested in the specific remarks, such as "When a store hits on a write buffer that has been written to earlier with a different memory type than that store, the buffer is closed and flushed.". What does "different memory type" mean? Does it mean that each standard store instruction (to standard cached memory) also causes all open WB buffers to flush? And does this also happen if one SMT thread does standard stores and the other does NT/WB stores?
 
I am mostly interested in the specific remarks, such as "When a store hits on a write buffer that has been written to earlier with a different memory type than that store, the buffer is closed and flushed.". What does "different memory type" mean? Does it mean that each standard store instruction (to standard cached memory) also causes all open WB buffers to flush? And does this also happen if one SMT thread does standard stores and the other does NT/WB stores?

Write combining buffers are not coherent with the rest of the memory system (heck, they can get flushed in any order the core decides), so I would be extremely surprised if cached stores had any influence at all. I think "memory type" has to do with the size of the store. I'm guessing it flushes a partially filled buffer in chunks matching the stores used to fill it; when the store size changes, the buffer is flushed.
 
I am mostly interested in the specific remarks, such as "When a store hits on a write buffer that has been written to earlier with a different memory type than that store, the buffer is closed and flushed.". What does "different memory type" mean? Does it mean that each standard store instruction (to standard cached memory) also causes all open WB buffers to flush? And does this also happen if one SMT thread does standard stores and the other does NT/WB stores?

I interpret it to refer to the different memory types supported by the MTRR register. From 2.13: "AMD Family 17h processor supports the memory type range register (MTRR) and the page attribute table (PAT) extensions, which allow software to define ranges of memory as either writeback (WB), write-protected (WP), writethrough (WT), uncacheable (UC), or write-combining (WC). Defining the memory type for a range of memory as WC allows the processor to conditionally combine data from multiple write cycles that are addressed within this range into a merge buffer."

If true, then this warning about flushes is rather a corner case. Just don't go interleaving writes to a buffer with VirtualProtect calls that change that buffer's memory type and you'll be okay.
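For reference, a hedged sketch of what that per-page memory type looks like in practice on Windows (my example, not from the optimization guide): WC is requested at allocation time, so ordinary cached stores elsewhere never change it.

```cpp
#include <windows.h>

// Sketch: request write-combining pages from the OS. Whether plain system RAM
// actually gets mapped WC is up to the OS/driver; GPU upload heaps from the
// graphics API are the more common source of WC pages in graphics code.
void* alloc_wc_buffer(SIZE_T bytes)
{
    return VirtualAlloc(nullptr, bytes, MEM_RESERVE | MEM_COMMIT,
                        PAGE_READWRITE | PAGE_WRITECOMBINE);
}

void free_wc_buffer(void* p)
{
    VirtualFree(p, 0, MEM_RELEASE);
}
```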
 
"All EPYC 7000 Processors have 8 Channels DDR4 and 128 PCIe Lanes"

That makes the EPYC 7251 an odd product. That's a lot of silicon to be selling for $400-$600 with 3/4 of the cores disabled, and it doesn't turbo any higher than the other models either.
 
"All EPYC 7000 Processors have 8 Channels DDR4 and 128 PCIe Lanes"

That makes the EPYC 7251 an odd product. That's a lot of silicon to be selling for $400-$600 with 3/4 of the cores disabled, and it doesn't turbo any higher than the other models either.

Perhaps, but based on the TDP, we're talking about pretty crappy silicon. Recycling 4 very poor dies into a >$400 SKU doesn't sound like such a bad idea to me.
 
Pretty basic stuff. However it specifically mentions "may not be closed for significant periods of time". I am just wondering what kind of effect this has if the CPU reads the data again soon (CPU->CPU NT writes & reads in the same frame, like Ashes of the Singularity).
If the data being written is expected to be re-read soon after, is it desirable to skip the cache hierarchy that allows it to be re-read quickly?
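To make that trade-off concrete, a small sketch (my own hypothetical helper, untested): choose cached vs. streaming stores per block depending on whether the destination is expected to be read back soon.

```cpp
#include <immintrin.h>
#include <cstdint>

// Hypothetical helper: copy one 64-byte block, picking the store flavour based
// on whether a consumer will read the data again shortly afterwards.
inline void store_block_64(void* dst, const void* src, bool reread_soon)
{
    const __m256i lo = _mm256_loadu_si256(reinterpret_cast<const __m256i*>(src) + 0);
    const __m256i hi = _mm256_loadu_si256(reinterpret_cast<const __m256i*>(src) + 1);
    if (reread_soon) {
        // Regular write-back stores keep the line in the cache hierarchy,
        // so a near-term re-read can hit in cache.
        _mm256_storeu_si256(reinterpret_cast<__m256i*>(dst) + 0, lo);
        _mm256_storeu_si256(reinterpret_cast<__m256i*>(dst) + 1, hi);
    } else {
        // Non-temporal stores bypass the caches: good for write-once data,
        // but a near-term re-read has to come all the way from memory.
        // dst must be 32-byte aligned for the streaming variant.
        _mm256_stream_si256(reinterpret_cast<__m256i*>(dst) + 0, lo);
        _mm256_stream_si256(reinterpret_cast<__m256i*>(dst) + 1, hi);
    }
}
```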

Does this mean that BAD THINGS HAPPEN if I have two SMT threads running on the same core and one is doing WC writes and the other is just running normal code (lots of writes & reads)? Do my write combine buffers get partially flushed all the time?
If one thread is writing non-combining data to the same 64-byte region as a write-combining buffer, it would appear that this counts as an event that will close an ongoing write-combining buffer. That event does seem to apply to all write-combining buffers in the core.
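If that interference is the concern, one mitigation is the "different physical cores" advice from earlier in the thread. A sketch of that (my own; the even/odd logical-processor numbering is only the common Windows layout, so verify it with GetLogicalProcessorInformationEx before relying on it), with the rest of the application's threads given masks that exclude this core's sibling:

```cpp
#include <windows.h>

// Sketch: pin the streaming thread to one logical processor of its own
// physical core. Assumes logical processors 2n and 2n+1 are the two SMT
// threads of physical core n, which is common but not guaranteed.
bool pin_to_physical_core(unsigned core_index)
{
    const DWORD_PTR mask = DWORD_PTR(1) << (core_index * 2);
    return SetThreadAffinityMask(GetCurrentThread(), mask) != 0;
}
```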

Per AMD's documentation, there are a number of events that can prompt a flush. It would seem that AMD's implementation tries to be conservative in the face of any ambiguities or long-latency events that could interact with a WC buffer of uncertain state.

Intel's line-fill buffer method for write combining seems like it could be more aggressive, or perhaps Intel simply has not chosen to state the flush conditions as exhaustively.
There are fewer events that would flush all WC buffers, but it seems like sufficient cache traffic could evict individual lines more frequently.
Since not every Intel core does well with write-combining or actually handles NT stores non-temporally, it may not be consistently worse/better.

Same question for code that has function calls (function parameters written to the stack). Do I need to avoid function calls (other writes) while writing WC data? I have always done my WC writes in tight inner loops (no function calls), but I am just wondering whether not doing so is a problem on Ryzen.
I think it comes down to whether there is an event that would prompt other pipeline flushes, or whether traffic subject to more stringent ordering/visibility rules might clash with WC data or potentially any data co-resident on its cache line. PUSH and POP don't appear in the table.
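For what it's worth, the "tight inner loop" pattern described in the question can be sketched like this (hypothetical Params/fill_wc names; dst is assumed 16-byte aligned and tail handling is omitted): hoist everything the loop needs into registers up front, so the only stores issued while the WC buffers are open are the streaming stores themselves.

```cpp
#include <immintrin.h>
#include <cstddef>

struct Params { float scale; float bias; };

// Sketch: no calls and no other memory writes inside the streaming loop.
void fill_wc(float* __restrict dst, const float* __restrict src,
             size_t count, const Params& p)
{
    const __m128 scale = _mm_set1_ps(p.scale);   // hoisted before the loop
    const __m128 bias  = _mm_set1_ps(p.bias);

    for (size_t i = 0; i + 4 <= count; i += 4) {
        __m128 v = _mm_loadu_ps(src + i);
        v = _mm_add_ps(_mm_mul_ps(v, scale), bias);
        _mm_stream_ps(dst + i, v);               // only streaming stores here
    }
    _mm_sfence();
}
```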


The document seems like it's missing a lot of sections, relative to the 15h guide.
Also, I think it might be incorrect on the load buffer size and FP scheduler size, compared to earlier Zen presentations.

The data cache seems to be interesting:
The L1 DC is a banked structure. Two loads per cycle can access the DC if they are to different banks. The DC banks that are accessed by a load are determined by address bits 5:2, the size of the load, and the DC way. DC way is determined using the linear-address-based utag/way-predictor (see section below).
Rather than a static 16 bank scheme for determining conflicts, there's a dynamic component in the DC way determination built into the way prediction/microtag array.
There's an undisclosed hash function based on the virtual address access history and dynamic behaviors related to aliasing or hash conflicts.
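Ignoring that undisclosed way/predictor component entirely, the static part of the quoted rule can be modelled as 16 four-byte banks selected by address bits 5:2. A toy model of that (illustrative only, not a faithful conflict predictor):

```cpp
#include <cstdint>

// Which of the 16 four-byte banks (address bits 5:2) a load touches.
static uint32_t banks_touched(uintptr_t addr, unsigned size_bytes)
{
    uint32_t mask = 0;
    for (uintptr_t a = addr; a < addr + size_bytes; a += 4)
        mask |= 1u << ((a >> 2) & 0xF);   // bits 5:2 pick one of 16 banks
    return mask;
}

// Two same-cycle loads are modelled as conflicting if any bank overlaps;
// the real hardware additionally factors in the predicted DC way.
static bool may_bank_conflict(uintptr_t a, unsigned size_a,
                              uintptr_t b, unsigned size_b)
{
    return (banks_touched(a, size_a) & banks_touched(b, size_b)) != 0;
}
```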
 
I have a feeling that hyperscalers are going to be all over EPYC. Massive savings at scale from 1P systems and potential performance benefits from ditching NUMA.

I suspect EC2/GCE VMs will still be on Intel for marketing reasons but for managed services like object storage EPYC is a winner.
 
"All EPYC 7000 Processors have 8 Channels DDR4 and 128 PCIe Lanes"

That makes the EPYC 7251 an odd product. That's a lot of silicon to be selling for $400-$600 with 3/4 of the cores disabled, and it doesn't turbo any higher than the other models either.

It might be comical: 8 CCXs with only one core per CCX enabled. But if you're in a situation where you only care about fitting in as much memory as you can, it makes for an affordable machine with at least 256GB, upgradable to well more than that. E.g., a statistician who has to fit everything into memory because that's easier.
 
It might be comical: 8 CCXs with only one core per CCX enabled. But if you're in a situation where you only care about fitting in as much memory as you can, it makes for an affordable machine with at least 256GB, upgradable to well more than that. E.g., a statistician who has to fit everything into memory because that's easier.
NVDIMM-based storage servers or some heterogeneous setups might make use of that. NVDIMMs would have a far larger capacity and benefit more from a large L3 than from a bunch of cores. Similarly, an array of GPUs probably wouldn't need all the cores if the CPU is serving largely as a PCIe backplane.
 
Massive savings at scale from 1P systems and potential performance benefits from ditching NUMA.
EPYC is "NUMA on chip" just like Ryzen and Threadripper. Clusters of 4 cores (8 threads) have their own dedicated L3 cache. There's no huge shared LLC like in Intel designs.
 
You know things are going well for your side when you can make bland statements like these and somehow they're very funny.


 
EPYC is "NUMA on chip" just like Ryzen and Threadripper. Clusters of 4 cores (8 threads) have their own dedicated L3 cache. There's no huge shared LLC like in Intel designs.
I was aware of the CCX/L3 layout and the infinity fabric but I didn't think memory latency would be as high as going through another socket over QPI.
 
If 7nm is as good as they claim and on time, next-gen Zen will be insane!
What's even better, it will plug into my AM4 board, so I will only have to eBay my R7 1700 and buy something with Zen 2 under the heatspreader :)

PS. This is now officially fully stable - not a single crash since Monday, and I did a lot of things with my computer.

 
I was aware of the CCX/L3 layout and the infinity fabric but I didn't think memory latency would be as high as going through another socket over QPI.
Most likely not as slow as going from socket to socket, but the software still needs to be NUMA aware to achieve the best performance. We already know that Ryzen (8 cores divided into 2 clusters) has some performance problems in consumer software (including games), because this software isn't NUMA aware. Intel consumer chips (since Nehalem) have had a big shared LLC for all cores, so consumer software programmers didn't need to care about this stuff, but now they do. Enterprise software has obviously been NUMA aware for a long time, and will continue to be, meaning that EPYC and Threadripper have no problems there.
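For completeness, a minimal Linux/libnuma sketch of what "NUMA aware" means in practice on EPYC/Threadripper (the node number and buffer size are illustrative): keep a worker's hot data on the node it runs on, rather than letting first-touch scatter it across dies.

```cpp
#include <numa.h>     // libnuma; link with -lnuma
#include <cstdio>

int main()
{
    if (numa_available() < 0) {
        std::puts("no NUMA support on this system");
        return 1;
    }
    const int node = 0;                              // assumed worker node
    numa_run_on_node(node);                          // keep this thread on that node
    void* buf = numa_alloc_onnode(64u << 20, node);  // 64 MB backed by node-local memory
    // ... do the node-local work on buf ...
    numa_free(buf, 64u << 20);
    return 0;
}
```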
 
That's a very nice overclock! I probably need better cooling if I want to run mine at 4.0GHz :)

Cooling = clocks with Ryzen.
On the AMD Spire RGB I couldn't get any reasonable stability above 3.9GHz even with 1.4V. A water AIO enabled an easy 4GHz, and today I will find out whether it can do more than that.
 