AMD: Zen 3 Speculation, Rumours and Discussion

Discussion in 'PC Industry' started by fehu, Sep 26, 2019.

  1. fehu

    Veteran Regular

    Joined:
    Nov 15, 2006
    Messages:
    1,480
    Likes Received:
    432
    Location:
    Somewhere over the ocean
    There's a reemerging rumor about Zen 3 being SMT4, at least for servers.
     
  2. fellix

    fellix Hey, You!
    Veteran

    Joined:
    Dec 4, 2004
    Messages:
    3,494
    Likes Received:
    405
    Location:
    Varna, Bulgaria
    The Zen 2 architecture already hints at such a move: unified AGU scheduler, wider load/store pipes, a doubled micro-op cache, etc. If AMD keeps widening the core, there's certainly room for two more threads.
     
  3. fehu

    Veteran Regular

    Joined:
    Nov 15, 2006
    Messages:
    1,480
    Likes Received:
    432
    Location:
    Somewhere over the ocean
    Adding threads per core increases overall utilization, but the heat too, so wouldn't the top frequency be lower?
    I have no doubt about the benefits in server workloads, but on desktops?
    That leaves two options: keep the 4 threads per core on the Ryzen series, lowering top frequencies and losing ground in gaming, or drop back to the mainstream 2 threads per core while still carrying the extra overhead, latency and silicon that the server core's architecture demands.
    On top of that, the OS scheduler must choose whether to put a thread on the fourth virtual core or all alone on another chiplet, far away from its siblings.
    And the OS scheduler hates making choices, especially in the morning.
     
  4. fellix

    fellix Hey, You!
    Veteran

    Joined:
    Dec 4, 2004
    Messages:
    3,494
    Likes Received:
    405
    Location:
    Varna, Bulgaria
    Yes, the server SKUs will probably be the exclusive recipients of quad-way SMT for the time being. Database workloads and large-scale VM instances would definitely benefit much more from gobs of threads, combined with the massive I/O capabilities of the EPYC platform. AMD already tailors the Zen architecture for EPYC, with adjusted HW data prefetching to better match specific software environments.
    The workstation and consumer markets would still rather have yet another generational boost in ST/IPC performance while keeping the TDP in check.
     
    Lightman likes this.
  5. ToTTenTranz

    Legend Veteran Subscriber

    Joined:
    Jul 7, 2008
    Messages:
    10,073
    Likes Received:
    4,651
    A question that might be a bit stupid:

    - If Zen's architecture is evolving to the point where it can feed more threads than its current 2-threaded cores do, why go straight for twice the threads per core instead of adding just one more thread?
    Is a 3-threaded core not feasible?
     
  6. Alexko

    Veteran Subscriber

    Joined:
    Aug 31, 2009
    Messages:
    4,499
    Likes Received:
    918
    If you get 20% higher utilization and you have to reduce clock speeds by 10% because of the extra power, it's a win, whether the workload is a web server or a game.

    If your workload doesn't scale well and you don't get higher utilization, then you don't get much more power draw either, so you don't necessarily have to reduce clock speeds; the only significant drawback is the extra silicon, which is a cost for the manufacturer but doesn't matter to the end user, except to the extent that they may (or may not) have to pay for it.
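
    A quick back-of-the-envelope sketch of that trade-off (Python; the clock and utilization factors are just the illustrative numbers from above, not measurements):

    Code:
    # Relative throughput ~ clock scaling x utilization scaling.
    def relative_throughput(clock_scale, utilization_scale):
        return clock_scale * utilization_scale

    baseline        = relative_throughput(1.00, 1.00)
    smt4_scales     = relative_throughput(0.90, 1.20)  # -10% clocks, +20% utilization
    smt4_no_scaling = relative_throughput(1.00, 1.00)  # workload doesn't scale, clocks untouched

    print(f"scaling workload:     {smt4_scales / baseline:.2f}x")      # ~1.08x, a net win
    print(f"non-scaling workload: {smt4_no_scaling / baseline:.2f}x")  # ~1.00x, no harm done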
     
  7. fellix

    fellix Hey, You!
    Veteran

    Joined:
    Dec 4, 2004
    Messages:
    3,494
    Likes Received:
    405
    Location:
    Varna, Bulgaria
    [Image: slide from the Tom's Hardware article linked below]

    Source: https://www.tomshardware.com/news/a...noa-architecture-microarchitecture,40561.html
     
    BRiT and Lightman like this.
  8. hoom

    Veteran

    Joined:
    Sep 23, 2003
    Messages:
    2,966
    Likes Received:
    512
    When I saw that diagram my brain leapt to the idea that it's a giant mobo-sized MCM with 8× Zen 2 sockets = 512 cores :runaway:
     
  9. Alexko

    Veteran Subscriber

    Joined:
    Aug 31, 2009
    Messages:
    4,499
    Likes Received:
    918
    It's feasible, but a good deal of the development you'd have to do to enable SMT3 would also cover SMT4: any binary field able to distinguish 3 different values needs 2 bits, and 2 bits can just as well distinguish 4. That's not the only thing you'd have to worry about, of course, but it's part of why computer hardware tends to come in powers of two.

    Beyond that, a good number of problems are easier to split into blocks of 4 threads, and Zen 3's designers would want the effort spent on extra threads to yield significant results, which is more likely to happen with 4 threads than with 3.
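
    A trivial illustration of the encoding argument (Python; this only covers the width of the thread-ID tag, which is of course just one of many structures involved):

    Code:
    import math

    def thread_id_bits(n_threads):
        # Bits needed to tag entries with a hardware thread ID.
        return max(1, math.ceil(math.log2(n_threads)))

    for n in (2, 3, 4):
        print(f"SMT{n}: {thread_id_bits(n)} bit(s) per thread-ID tag")
    # SMT2 -> 1 bit, SMT3 -> 2 bits, SMT4 -> 2 bits:
    # the tag storage you add for a third thread already accommodates a fourth.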
     
    ToTTenTranz likes this.
  10. Gubbi

    Veteran

    Joined:
    Feb 8, 2002
    Messages:
    3,533
    Likes Received:
    888
    Also, a lot of structures have to be sliced per thread: ROB, rename registers, store buffers, etc. Divvying by four is trivially easy in hardware; by three? Not so much.
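
    A toy sketch of the slicing argument (Python; the entry count is an illustrative power of two, not an actual Zen figure): a four-way split keeps every partition start at a simple shifted offset, while a three-way split is uneven and needs real arithmetic to decode.

    Code:
    ENTRIES = 256  # illustrative structure size (e.g. a ROB), chosen as a power of two

    def slice_for_thread(tid, threads):
        """Return (start, size) of the static partition owned by thread `tid`."""
        base, rem = divmod(ENTRIES, threads)
        start = tid * base + min(tid, rem)
        size = base + (1 if tid < rem else 0)
        return start, size

    for threads in (4, 3):
        print(threads, [slice_for_thread(t, threads) for t in range(threads)])
    # 4 [(0, 64), (64, 64), (128, 64), (192, 64)]  -> start = tid << 6, a pure bit shift
    # 3 [(0, 86), (86, 85), (171, 85)]             -> uneven sizes, awkward index decoding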

    That said, I don't think we will see SMT 4.

    Cheers
     
    TheAlSpark, Lightman and Alexko like this.
  11. fehu

    Veteran Regular

    Joined:
    Nov 15, 2006
    Messages:
    1,480
    Likes Received:
    432
    Location:
    Somewhere over the ocean
    There's no mention of SMT4 in any AMD Milan document, so the rumor can be put aside.
    It's interesting that it resurfaces every year, though; a sign that there's at least some level of discussion within the design teams.
     
  12. itsmydamnation

    Veteran Regular

    Joined:
    Apr 29, 2007
    Messages:
    1,301
    Likes Received:
    397
    Location:
    Australia
    Doubt it. If you look at the one change we now know about for Milan, at the workloads EPYC is weakest in (transactional DBs), and at the workloads SMT4 would actually help (I/O-heavy ones, like a transactional DB), AMD is already doing the right thing to improve performance there while also helping general workloads far more than SMT4 ever will. While I would never recommend it in general (unless the VM is pinned), on Milan you could have 16-thread VMs and not have to worry about smashing the memory subsystem; currently you can only do 8, and you generally want to avoid going past 4 because of hypervisor scheduling issues.
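
    For what it's worth, pinning a guest to a single CCD so its threads share an L3 is easy enough from the host side. A minimal sketch (Python on Linux, using os.sched_setaffinity; the CPU numbering below is hypothetical and should be checked with lscpu on the actual box):

    Code:
    import os

    # Hypothetical layout: physical cores 0-7 plus their SMT siblings 16-23 sit on one CCD.
    CCD0_CPUS = set(range(0, 8)) | set(range(16, 24))

    # Restrict the current process (e.g. a VM's worker process) to that CCD so its
    # threads stay behind the same L3 instead of bouncing across chiplets.
    os.sched_setaffinity(0, CCD0_CPUS)
    print("pinned to CPUs:", sorted(os.sched_getaffinity(0)))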
     
    Lightman likes this.
  13. vjPiedPiper

    Newcomer

    Joined:
    Nov 23, 2005
    Messages:
    65
    Likes Received:
    25
    Location:
    Melbourne Aus.
    Yeah, the next "low-hanging fruit" in the Zen architecture is to improve the caching. That will in turn address the biggest weak point of the Zen arch in server workloads, i.e. the database-heavy ones.
    One way they get the performance they currently do is by throwing big gobs of cache at each core, i.e. the 16MB shared within each CCX. If they can make those 2 x 16MB slices perform the way a single 1 x 32MB pool would, they improve not only their weak points but also provide more cache to a single-threaded workload.

    Of course this is a lot easier to say than it is to do.
    Some options (i.e. guesses):
    - a similar cache arrangement, but move to 8-core CCXs
    OR
    - keep the 4-core CCX module, but modify the L3 cache to better serve the 8 cores / 2 CCXs per chiplet
    OR
    - a bigger modification to the entire cache structure, e.g. a faster IF and a massive combined L3 or L4 cache on the IO die?
    (e.g. the chiplet gets even smaller and only contains the L1 and L2, the IF gets faster/wider, and then put a 512MB L3 on the IO die)
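
    A toy way to see why unifying the two L3 slices could pay off (Python; every latency and hit rate below is invented purely for illustration): a bigger shared pool is a bit slower to hit, but converts some DRAM trips into L3 hits.

    Code:
    # Average cost of an L2 miss = hit_rate * L3_hit_latency + (1 - hit_rate) * DRAM_latency
    def l2_miss_cost(l3_hit_rate, l3_hit_cycles, dram_cycles=250):
        return l3_hit_rate * l3_hit_cycles + (1 - l3_hit_rate) * dram_cycles

    split_16mb   = l2_miss_cost(l3_hit_rate=0.55, l3_hit_cycles=40)  # thread only sees its own CCX slice
    unified_32mb = l2_miss_cost(l3_hit_rate=0.65, l3_hit_cycles=46)  # bigger pool: slower hit, fewer misses

    print(f"2 x 16MB split L3:   ~{split_16mb:.0f} cycles per L2 miss")
    print(f"1 x 32MB unified L3: ~{unified_32mb:.0f} cycles per L2 miss")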
     
  14. Laniakea

    Newcomer

    Joined:
    Apr 16, 2019
    Messages:
    66
    Likes Received:
    83
    What about changes to how the cache works? Would database workloads profit if AMD switched from a victim cache to an inclusive cache?

    I remember reading that games (another area in which AMD is slightly behind Intel) prefer a large shared inclusive L3 cache over a smaller L3 victim cache (smaller because it's split between the two 4-core CCXs).

    The reason given was that games frequently move data between cores, or access immutable world state from multiple threads at once within a frame.
     
    BRiT likes this.
  15. vjPiedPiper

    Newcomer

    Joined:
    Nov 23, 2005
    Messages:
    65
    Likes Received:
    25
    Location:
    Melbourne Aus.

    Yeah, the structure of the cache could easily change too. However, imho a lot of the perf increase they got in Zen 2 was due to the existing cache structure, so they might not want to mess with it too much.
    Having said that, a shared inclusive L3 of 512MB or more on the IO die might be an option. It would be amazing for server workloads; not sure how well that sort of structure would scale down to 1- and 2-chiplet consumer CPUs, though.
    However, with an uber-small chiplet they might be able to run them a good bit faster? Also, if they did move the L3 to the IO die, they would probably need to increase the speed of the IF bus too, otherwise they'd suffer too much latency getting data and instructions to the CPU cores.
     
  16. Gubbi

    Veteran

    Joined:
    Feb 8, 2002
    Messages:
    3,533
    Likes Received:
    888
    The existing cache is perfect for datacenters, where you divvy a CPU up into 2-, 4- or 8-core virtual instances. For larger instances, however, the separate nature of the L3s means higher-latency sharing.

    When the IO chip moves to a smaller feature size, we might see a big shared cache there. A non-intrusive way (i.e. without changing the cache protocols) would be to implement it as a memory-side victim cache, similar to how Intel implemented the L4 in Crystal Well.
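
    A toy sketch of that memory-side victim idea (Python; capacities and policies are reduced to a caricature): the L4 is filled only by lines evicted from the chiplets' L3s and is probed on an L3 miss before going to DRAM, so the core-side coherence protocol never has to know it exists.

    Code:
    from collections import OrderedDict

    class MemorySideVictimCache:
        """Filled only by L3 evictions, probed on L3 misses."""
        def __init__(self, capacity):
            self.capacity = capacity
            self.lines = OrderedDict()  # address -> data, kept in LRU order

        def insert_victim(self, addr, data):
            self.lines[addr] = data
            self.lines.move_to_end(addr)
            if len(self.lines) > self.capacity:
                self.lines.popitem(last=False)  # silently drop the oldest victim

        def lookup(self, addr):
            if addr in self.lines:
                self.lines.move_to_end(addr)
                return self.lines[addr]
            return None  # miss -> issue the DRAM read as usual

    l4 = MemorySideVictimCache(capacity=4)
    l4.insert_victim(0x1000, "line evicted from CCD0's L3")
    print("L4 hit" if l4.lookup(0x1000) else "L4 miss -> DRAM")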

    Cheers
     
  17. itsmydamnation

    Veteran Regular

    Joined:
    Apr 29, 2007
    Messages:
    1,301
    Likes Received:
    397
    Location:
    Australia
    I would guess a cache on each memory controller on the I/O die; they have a patent about it, but I can't find it right now.
     
  18. Rootax

    Veteran Newcomer

    Joined:
    Jan 2, 2006
    Messages:
    1,195
    Likes Received:
    591
    Location:
    France
    Has AMD already used an L4 cache in a product?
     
  19. Alexko

    Veteran Subscriber

    Joined:
    Aug 31, 2009
    Messages:
    4,499
    Likes Received:
    918
    Since AMD can mix and match different processes across chiplets, I'd guess including a very large eSRAM die to act as a shared L4 ought to be easier than it would be on a more traditional monolithic design. I don't know whether it would be worth it, but it sure sounds tempting.
     
  20. fehu

    Veteran Regular

    Joined:
    Nov 15, 2006
    Messages:
    1,480
    Likes Received:
    432
    Location:
    Somewhere over the ocean
    Or maybe a boring 12nm IO chip for desktop, and a fabulous shiny 7nm IO chip for servers, with L4 and colored LEDs.
     
    Lightman and vjPiedPiper like this.