The Zen2 architecture already hints at such a move: unified AGU scheduler, wider load/store pipe, doubled micro-op cache, and so on. If AMD keeps widening the core, there's certainly room for two more threads. There's a recurring rumor about Zen3 being SMT4, at least for servers.
Adding a thread per core increases overall utilization, but also the heat, so wouldn't the top frequency be lower?
I have no doubt about the benefits on a server workload, but on desktops?
This leaves two options: keep the 4 threads per core on the Ryzen series, lowering top frequencies and losing ground in the gaming space, or drop back to the mainstream 2 threads/core while keeping the bigger overhead, latencies, and silicon that the server core's architecture demands.
On top of that, the OS scheduler must choose whether to put a thread on the fourth virtual core or alone on another chiplet, far away from its siblings.
And the OS scheduler hates making choices, especially in the morning.
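For what it's worth, you can take that choice away from the scheduler yourself. A minimal sketch, assuming Linux; the sibling set below is hypothetical, the real one comes from sysfs:

```python
import os

# Hypothetical SMT sibling set; on Linux the real one is listed in
# /sys/devices/system/cpu/cpu0/topology/thread_siblings_list
siblings = {0, 1}

# Pin the calling process to those siblings so the scheduler can't
# migrate it to another chiplet, away from its cache-sharing neighbors.
os.sched_setaffinity(0, siblings)
print("pinned to logical CPUs:", sorted(os.sched_getaffinity(0)))
```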
The new Milan chips will feature the 7nm+ node, a refreshed version of the current node with higher performance. The chips also have two threads per core, silencing the rather dubious rumors that AMD would switch to four threads per core (SMT4) as we see with some competing chips.
The next-gen Milan chips still feature the same nine-die arrangement as the current-gen Rome models, with eight compute dies and one I/O die, and eight cores assigned to each compute chiplet. The largely unchanged specifications, at least in key areas, imply that Milan is merely a "Tock"-equivalent, or just a move to the second generation of the 7nm node (7nm+).
However, AMD also disclosed that the company had made a significant alteration to the cache arrangement inside the chip, which indicates that there is significant work being done under the hood to improve instructions per cycle (IPC) throughput and reduce latency, both of which are key focus areas for AMD as it evolves its architecture. AMD currently splits its chiplets into two four-core Compute Complexes (CCX), each armed with 16MB of L3 cache. For Milan, that changes to eight cores connected to a unified 32MB slice of L3 cache, which should eliminate a layer of latency within the compute die.
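You can see that split from software, incidentally. A minimal sketch, assuming Linux, that groups logical CPUs by the L3 slice they share; on a Zen 2 part each 4-core CCX shows up as its own group, while a unified Milan-style L3 would list all eight cores of a chiplet together:

```python
from collections import defaultdict
import glob

# Group logical CPUs by the set of CPUs they share an L3 with
# (index3 is usually the L3 on x86; check the "level" file to be sure).
groups = defaultdict(list)
for path in glob.glob("/sys/devices/system/cpu/cpu[0-9]*/cache/index3/shared_cpu_list"):
    cpu = path.split("/")[5]  # e.g. "cpu12"
    with open(path) as f:
        groups[f.read().strip()].append(cpu)

for shared, cpus in sorted(groups.items()):
    print(f"L3 shared by CPUs {shared}: {len(cpus)} logical CPUs")
```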
A question that might be a bit stupid:
- If Zen's architecture is evolving to feed more threads per core than the current two, why go straight to twice the threads per core instead of adding just one more?
Is a 3-threaded core not feasible?
It's feasible, but a good deal of the development you'd have to do to enable SMT3 would cover SMT4 anyway: any thread-ID field wide enough to distinguish 3 values needs 2 bits, and 2 bits can already distinguish 4. That's not the only thing you'd have to worry about, of course, but it's a big part of why computer hardware tends to come in powers of two.
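The arithmetic, as a quick sketch:

```python
import math

# Bits needed to tag n hardware threads, and how many thread IDs
# that bit width can actually encode.
for n in (2, 3, 4):
    bits = math.ceil(math.log2(n))
    print(f"SMT{n}: {bits} ID bit(s), room for {2 ** bits} threads")
# SMT3 already pays for the 2-bit thread tag that SMT4 would use.
```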
I doubt it. If you look at the one change we now know for Milan, then at the workloads EPYC is weakest at (transactional DBs) versus the workloads SMT4 would actually help (I/O-heavy ones, like a transactional DB), AMD is already doing the right thing to improve performance while also helping general workloads far more than SMT4 ever will. While I would never recommend it in general (unless the VM is pinned), on Milan you could have 16-thread VMs and not have to worry about smashing the memory subsystem; currently you can only do 8, and generally you'd avoid going past 4 because of hypervisor scheduling issues. There's no mention of SMT4 in any AMD Milan document, so the rumor can be put aside.
It's interesting that it resurfaces every year; a sign that there's at least some level of discussion inside the design teams.
Yeah, the next "low-hanging fruit" in the Zen architecture is to improve the caching. This will in turn improve the biggest weak point of the Zen arch in server workloads, i.e. the database-heavy workloads.
[...]
What about changes to how the cache works? Would database workloads profit if AMD switched from a victim cache to an inclusive cache?
I remember reading that games (another area in which AMD is slightly behind Intel) prefer a large shared inclusive L3 cache over a smaller L3 victim cache (smaller because it's split between the two 4-core CCXs).
The reason given was that games frequently move data between cores, or access immutable world state from multiple threads at once each frame.
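That cross-CCX penalty is easy to probe from userspace, for what it's worth. A rough ping-pong sketch, assuming Linux; Python's interpreter overhead inflates the absolute numbers, but the gap between a same-CCX pair and a cross-CCX pair still shows. The CPU numbers here are hypothetical, so check your topology (e.g. with the sysfs sketch above) first:

```python
import multiprocessing as mp
import os
import time

def pong(cpu, flag, n):
    os.sched_setaffinity(0, {cpu})  # pin the responder
    for _ in range(n):
        while flag.value != 1:
            pass                     # spin until pinged
        flag.value = 0               # pong back

def one_way_ns(cpu_a, cpu_b, n=200_000):
    flag = mp.Value("i", 0, lock=False)  # shared int, no lock: we spin
    p = mp.Process(target=pong, args=(cpu_b, flag, n))
    p.start()
    os.sched_setaffinity(0, {cpu_a})     # pin the pinger
    t0 = time.perf_counter()
    for _ in range(n):
        flag.value = 1                   # ping
        while flag.value != 0:
            pass                         # wait for pong
    dt = time.perf_counter() - t0
    p.join()
    return dt / n / 2 * 1e9              # rough one-way hop, in ns

if __name__ == "__main__":
    # Hypothetical pairs: 0/1 same CCX, 0/4 different CCX on many Zen 2 parts.
    print("same CCX :", round(one_way_ns(0, 1)), "ns")
    print("cross CCX:", round(one_way_ns(0, 4)), "ns")
```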
The existing cache is perfect for datacenters, where you divvy a CPU up into 2-, 4-, or 8-core virtual instances. However, for larger instances, the separate nature of the L3s means higher-latency sharing.
When the I/O chip moves to a smaller feature size we might see a big shared cache there. A non-intrusive way (i.e. without changing the cache protocols) would be to implement it as a memory-side victim cache, similar to how Intel implemented the L4 in Crystal Well.
Cheers