AMD: Zen 3 Speculation, Rumours and Discussion

Status
Not open for further replies.
There's some reemerging rumor about Zen3 being SMT4, at least for servers.
Zen 2's architecture already hints at such a move: a unified AGU scheduler, a wider load/store pipe, a doubled micro-op cache, and so on. If AMD keeps widening the core, there's certainly room for two more threads.
 
Adding threads per core increases overall utilization, but the heat too, so wouldn't the top frequency be lower?
I have no doubt about the benefits for server workloads, but on desktops?
This leaves two options: keep the 4 threads per core on the Ryzen series, lowering top frequencies and losing ground in the gaming space, or drop back to the mainstream 2 threads/core while keeping the bigger overhead, latencies, and silicon that the server core's architecture demands.
On top of that, the OS scheduler must choose whether to put a thread on the fourth virtual core or all alone on another chiplet, far away from its siblings.
And the OS scheduler hates making choices, particularly in the morning.
 
Yes, the server SKUs will probably be the exclusive recipients of quad-way SMT for the time being. Database workloads and large-scale VM instances will definitely benefit much more from gobs of threads, combined with the massive I/O capabilities of the EPYC platform. AMD already tunes the Zen architecture for EPYC, with adjusted HW data prefetching that maps better to specific software environments.
Workstation and consumer markets would still prefer yet another generational boost in ST/IPC performance, while keeping the TDP in check.
 
A question that might be a bit stupid:

- If Zen's architecture is evolving to feed more threads from the current 2-threaded cores, why go straight to twice the threads per core instead of adding just one more thread?
Is a 3-threaded core not feasible?
 
Adding threads per core increases overall utilization, but the heat too, so wouldn't the top frequency be lower?
I have no doubt about the benefits for server workloads, but on desktops?
This leaves two options: keep the 4 threads per core on the Ryzen series, lowering top frequencies and losing ground in the gaming space, or drop back to the mainstream 2 threads/core while keeping the bigger overhead, latencies, and silicon that the server core's architecture demands.
On top of that, the OS scheduler must choose whether to put a thread on the fourth virtual core or all alone on another chiplet, far away from its siblings.
And the OS scheduler hates making choices, particularly in the morning.

If you get 20% higher utilization and you have to reduce clock speeds by 10% because of the extra power, it's a win, whether the workload is a web server or a game.

If your workload doesn't scale well and you don't get higher utilization, then you don't use much more power either, so you don't necessarily have to reduce clock speeds. The only significant drawback is the extra silicon, which matters to the manufacturer but not to the end user, except insofar as they may (or may not) have to pay for it.
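The arithmetic behind that trade-off is quick to sketch (the 20%/10% figures are the hypothetical ones from above, not measured numbers):

```python
# Hypothetical SMT4 trade-off: does higher utilization beat a clock penalty?
# To a first order, throughput scales with utilization x clock speed,
# ignoring memory effects.

def relative_throughput(utilization_gain: float, clock_penalty: float) -> float:
    """Throughput relative to the SMT2 baseline (1.0)."""
    return (1.0 + utilization_gain) * (1.0 - clock_penalty)

# 20% more utilization, 10% lower clocks -> still a net win:
print(relative_throughput(0.20, 0.10))  # ~1.08, i.e. +8% overall

# A workload that doesn't scale sees no utilization gain and, ideally,
# no clock penalty either, so it stays at the baseline:
print(relative_throughput(0.0, 0.0))    # 1.0
```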
 


The new Milan chips will feature the 7nm+ node, a refreshed version of the current node with higher performance. The chips also have two threads per core, silencing the rather dubious rumors that AMD would switch to four threads per core (SMT4) as we see with some competing chips.
The next-gen Milan chips still feature the same nine-die arrangement as the current-gen Rome models, with eight compute dies and one I/O die, along with eight cores assigned to each compute chiplet. The largely unchanged specifications, at least in key areas, imply that Milan is merely a "Tock"-equivalent, or just a move to the second gen of the 7nm node (7nm+).

However, AMD also disclosed that the company had made a significant alteration to the cache alignments inside the chip, which indicates that there is significant work being done under the hood to improve instruction per cycle (IPC) throughput and reduce latency, both of which are key focus areas for AMD as it evolves its architecture. AMD currently splits its chiplets into two four-core Compute Complexes (CCX), each armed with 16MB of L3 cache. For Milan, that changes to eight cores connected to a unified 32MB slice of L3 cache, which should eliminate a layer of latency within the compute die.

Source: https://www.tomshardware.com/news/a...noa-architecture-microarchitecture,40561.html
 
When I saw that diagram my brain leapt to the idea that it's a giant mobo-sized MCM with 8x Zen 2 sockets = 512 cores :runaway:
 
A question that might be a bit stupid:

- If Zen's architecture is evolving to feed more threads from the current 2-threaded cores, why go straight to twice the threads per core instead of adding just one more thread?
Is a 3-threaded core not feasible?

It's feasible, but a good deal of the development you'd have to do to enable SMT3 would also cover SMT4: any binary field that can distinguish 3 different values needs 2 bits, and 2 bits can just as well distinguish 4. That's not the only thing you'd have to worry about, of course, but it's why computer hardware tends to have many things in powers of two.

Beyond that, a good number of problems split more naturally into blocks of 4 threads, and Zen 3's designers would want the effort spent on extra threads to yield a significant gain, which is more likely with 4 threads than with 3.
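The powers-of-two point can be made concrete: the thread-ID field tagging entries in every per-thread structure costs the same number of bits for 3 threads as for 4.

```python
import math

def thread_id_bits(n_threads: int) -> int:
    """Bits needed to tag a structure entry with its owning thread."""
    return max(1, math.ceil(math.log2(n_threads)))

print(thread_id_bits(2))  # 1 bit  (SMT2)
print(thread_id_bits(3))  # 2 bits (SMT3)
print(thread_id_bits(4))  # 2 bits (SMT4) - no wider than SMT3
```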
 
It's feasible, but a good deal of the development you'd have to do to enable SMT3 would also cover SMT4: any binary field that can distinguish 3 different values needs 2 bits, and 2 bits can just as well distinguish 4. That's not the only thing you'd have to worry about, of course, but it's why computer hardware tends to have many things in powers of two.

Also, a lot of structures have to be sliced per thread: ROB, rename registers, store buffers, etc. Divvying by four is trivially easy in hardware; by three? Not so much.
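A rough illustration of why: slicing a buffer among a power-of-two thread count is just shifts on the entry index, while dividing by three needs a real divider and leaves a remainder to deal with (the entry count below is illustrative, not Zen's actual ROB size):

```python
ROB_ENTRIES = 256  # illustrative power-of-two buffer size, not AMD's figure

# Four threads: the partition falls out of shifts alone.
per_thread_4 = ROB_ENTRIES >> 2             # 64 entries each, exact
base_of_thread = lambda t: t * per_thread_4  # slice start = t * 64

# Three threads: real division, plus a leftover entry to special-case.
per_thread_3, leftover = divmod(ROB_ENTRIES, 3)

print(per_thread_4, base_of_thread(2))  # 64 128
print(per_thread_3, leftover)           # 85 1
```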

That said, I don't think we will see SMT4.

Cheers
 
There's no mention of SMT4 in any AMD Milan document, so the rumor can be put aside.
It's interesting that it resurfaces every year; a sign that there's at least some level of discussion within the design teams.
 
There's no mention of SMT4 in any AMD Milan document, so the rumor can be put aside.
It's interesting that it resurfaces every year; a sign that there's at least some level of discussion within the design teams.
Doubt it. If you look at the one change we now know about for Milan, at the workloads EPYC is weakest at (transactional DBs), and at the workloads SMT4 would actually help (I/O-heavy ones, like a transactional DB), then AMD is already doing the right thing to improve performance, and it helps general workloads far more than SMT4 ever would. While I would never recommend it in general (unless the VM is pinned), on Milan you could have 16-thread VMs and not have to worry about smashing the memory subsystem; currently you can only do 8, and you generally avoid going past 4 because of hypervisor scheduling issues.
 
Yeah, the next "low-hanging fruit" in the Zen architecture is to improve the caching. This will in turn improve the biggest weak point of the Zen arch in server workloads, i.e. the database-heavy workloads.
One way they get the performance they currently do is by throwing big gobs of cache at each core, i.e. the 16MB that is shared within each CCX. If they can make that 2 x 16 perform the same way as 1 x 32, they improve not only their weak points but also provide more cache to a single-threaded workload.

Of course this is a lot easier to say than it is to do.
Some options (i.e. guesses):
- a similar cache arrangement, but a move to 8-core CCXs
OR
- keep the 4-core CCX module, but modify the L3 cache to better serve the 8 cores / 2 CCXs
OR
- a bigger modification to the entire cache structure, e.g. a faster IF and a massive combined L3 or L4 cache on the IO die
(e.g. the chiplet gets even smaller and only contains the L1 and L2, the IF gets faster/wider, and then put a 512MB L3 on the IO die)
 
Yeah, the next "low-hanging fruit" in the Zen architecture is to improve the caching. This will in turn improve the biggest weak point of the Zen arch in server workloads, i.e. the database-heavy workloads.

[...]

What about changes to how the cache works? Would database workloads profit if AMD switched from a victim cache to an inclusive cache?

I remember reading that games (another area in which AMD is slightly behind Intel) prefer a large shared inclusive L3 cache over a smaller L3 victim cache (smaller because it's split between the two 4-core CCXs).

The reason given was that games frequently move data between cores or are accessing immutable world state from multiple threads at once per frame.
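The difference in fill policy can be sketched in a few lines; this is a toy model of the victim-vs-inclusive distinction only, with sets, ways, and coherence protocols all ignored:

```python
# Toy model: when does a line become visible in L3 to sibling cores?

l3_inclusive = set()
l3_victim = set()

def l2_fill(line):
    """On an L2 miss that fills from memory."""
    # An inclusive L3 is populated on the way IN: every L2 line is also
    # in L3, so a sibling core's lookup can hit L3 immediately.
    l3_inclusive.add(line)

def l2_evict(line):
    """On an L2 eviction."""
    # A victim L3 (Zen's scheme) is populated on the way OUT: a line
    # only reaches L3 after some core has evicted it from its L2.
    l3_victim.add(line)

l2_fill("world_state")                # core 0 pulls in shared game data
print("world_state" in l3_inclusive)  # True: other cores can hit L3 now
print("world_state" in l3_victim)     # False: not in L3 until evicted
l2_evict("world_state")
print("world_state" in l3_victim)     # True
```

Under that model it's plausible why data shared between cores each frame would favor the inclusive arrangement, since it skips the wait for an eviction.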
 
What about changes to how the cache works? Would database workloads profit if AMD switched from a victim cache to an inclusive cache?

I remember reading that games (another area in which AMD is slightly behind Intel) prefer a large shared inclusive L3 cache over a smaller L3 victim cache (smaller because it's split between the two 4-core CCXs).

The reason given was that games frequently move data between cores or are accessing immutable world state from multiple threads at once per frame.


Yeah, the structure of the cache could easily change too. However, IMHO a lot of the perf increase they got in Zen 2 was due to the existing cache structure, so they might not want to mess with it too much.
Having said that, a shared inclusive L3 on the IO die, of 512MB or more, might be an option. It would be amazing for server workloads; I'm not sure how well that sort of structure would scale down to 1- and 2-chiplet consumer CPUs though.
However, with an uber-small chiplet they might be able to run them a good bit faster? Also, if they did move the L3 to the IO die, they would probably need to increase the speed of the IF bus too, or otherwise suffer too much latency in getting data and instructions to the CPU cores.
 
The existing cache is perfect for datacenters, where you divvy a CPU into 2, 4 or 8 core virtual instances. However for larger instances, the separate nature of the L3s means higher latency sharing.

When the IO chip is moved to a smaller feature size we might see a big shared cache there. A non-intrusive way (ie. without changing cache protocols) would be to implement it as a memory side victim cache, similar to how Intel implemented L4 in Crystal Well.

Cheers
 
The existing cache is perfect for datacenters, where you divvy a CPU into 2, 4 or 8 core virtual instances. However for larger instances, the separate nature of the L3s means higher latency sharing.

When the IO chip is moved to a smaller feature size we might see a big shared cache there. A non-intrusive way (ie. without changing cache protocols) would be to implement it as a memory side victim cache, similar to how Intel implemented L4 in Crystal Well.

Cheers

I would guess a cache on each memory controller on the I/O die; they have a patent about it, but I can't find it right now.
 
The existing cache is perfect for datacenters, where you divvy a CPU into 2, 4 or 8 core virtual instances. However for larger instances, the separate nature of the L3s means higher latency sharing.

When the IO chip is moved to a smaller feature size we might see a big shared cache there. A non-intrusive way (ie. without changing cache protocols) would be to implement it as a memory side victim cache, similar to how Intel implemented L4 in Crystal Well.

Cheers

Since AMD can mix and match different processes with different chiplets, I guess including an eSRAM die of very large capacity to act as a shared L4 ought to be relatively easier than it would be on a more traditional design. I don't know whether it would be worth it, but it sure sounds tempting.
 