AMD RyZen CPU Architecture for 2017

Blazkowicz · Apr 12, 2015

I have the same concern about e.g. OS kernels and regular programs not made with two pools of memory in mind, we have the luxury of just one big fat pool of memory even on multicore or SMP systems. Though for about a decade we've had NUMA on the PC, so there is actually and already some infrastructure to deal with some memory that is faster/lower latency than some other memory.

AMD also did work on coherency between APU and dGPU : not sure if it's working, but I believe that from what they've announced over the years, a Carrizo APU and GCN 1.2 external GPU would be able to share common address space (with the caveat that because of latency, that will be useful in very limited cases)

Worst case for the APU with HBM and DDR4 : the HBM is only "GPU memory", and the DDR4 is "CPU memory". That should do for an old/unmodified OS, and GPU memory if full could spill into system memory (ddr4) the old way.
Better : special HSA programs would allocate memory in such and such pool depending on the needs.
Better still : you might actually want to run some CPU code in the HBM. If HBM is lower latency, then regular CPU code might be a good deal faster as memory latency is a bad old bottleneck that doesn't get better with each DDR generation.
You might want to run some game (if this is a desktop) on the DDR4, if performance is not too important (e.g. a Valve game at 1080p), or put the framebuffer in the HBM and textures in the DDR4 (perhaps that latter idea works in a legacy OS if the graphics driver is managing it).

So, the dual pools of memory are a complication or source of headaches, and may require a modified OS, modified applications, or both, plus, say, a BIOS setting to disable either pool.
But it could be a survivable complication.
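
To make the "worst case" and "better" cases above a bit more concrete, here is a minimal Python sketch of a two-pool allocator: each request names a preferred pool and spills into the other one when the preferred pool is full. The pool sizes, the `Pool` class and the `allocate` function are all hypothetical and do not model any real OS, driver or HSA runtime.

```python
from dataclasses import dataclass


@dataclass
class Pool:
    """A toy memory pool tracked only by capacity and bytes in use."""
    name: str
    capacity: int
    used: int = 0

    def free(self) -> int:
        return self.capacity - self.used


def allocate(size: int, preferred: Pool, fallback: Pool) -> str:
    """Place `size` bytes in the preferred pool, spilling to the fallback if full.

    Returns the name of the pool that satisfied the request.
    """
    for pool in (preferred, fallback):
        if pool.free() >= size:
            pool.used += size
            return pool.name
    raise MemoryError(f"no pool can hold {size} bytes")


GiB = 1 << 30
# Hypothetical capacities, chosen only for the example.
hbm = Pool("HBM", 4 * GiB)
ddr4 = Pool("DDR4", 16 * GiB)

# GPU-style allocations prefer HBM and spill into DDR4 once HBM is full.
placements = [allocate(3 * GiB, hbm, ddr4), allocate(2 * GiB, hbm, ddr4)]
# An unmodified CPU program only ever asks for DDR4.
placements.append(allocate(1 * GiB, ddr4, hbm))
print(placements)
```

A pool-aware program would simply pass a different `preferred` pool per allocation, while a legacy one would always pass DDR4, which is roughly the split described above.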

Raqia · Apr 12, 2015

I think you might want to use the GPU and HBM bandwidth for super-computing parts but a single server socket would need about a dozen threads of integer throughput. It isn't enough to saturate dual channel DDR4 and they might elect to use a very similar if not the same server socket as before, so I think it could be DDR4 only for server and workstation parts. EDIT: Don't forget ECC requirements for servers, I don't think HBM has that though I could be wrong.

I do hope the HBM acts as an L4-like cache in a unified hierarchy as that would be elegant programming wise, but my point is AMD needs to execute since it's operating on a meager budget, and a 4 tiered, unified memory hierarchy adds massive risk to their bottom line. (I could believe HBM being exclusive GPU memory though.) They can't afford another TLB bug recall disaster or a very late part like Llano (they're actually being sued over this). They're probably going to just have one piece of high end silicon that fits all those higher TDP needs, and it's going to be a GPU bolted to whatever CPU design they have. The HBM memory controller is either just fused off or only talks to the GPU.

pTmdfx · Apr 13, 2015

Blazkowicz said:
I have the same concern about e.g. OS kernels and regular programs not made with two pools of memory in mind, we have the luxury of just one big fat pool of memory even on multicore or SMP systems. Though for about a decade we've had NUMA on the PC, so there is actually and already some infrastructure to deal with some memory that is faster/lower latency than some other memory.

Video memory manager is already managing that pool, which gets better under WDDM 2.0. On Linux side, there is the Heterogeneous Memory Management project. Regular applications do need to explicitly make use of it, but it is already serving lots of applications today. Your DirectX textures and views, your OpenGL texture objects and framebuffers, your OpenCL buffers, and your array views in AMP. Now in DirectX 12 and similar low-level APIs, the pools are even exposed, and applications have a say in residency.

I wouldn't say there is nothing to worry about, but this is not an alien but working and evolving model.

AMD also did work on coherency between APU and dGPU : not sure if it's working, but I believe that from what they've announced over the years, a Carrizo APU and GCN 1.2 external GPU would be able to share common address space (with the caveat that because of latency, that will be useful in very limited cases)

No idea about this. But the GPU should at least support shared virtual memory and residency control, I suppose.

Worst case for the APU with HBM and DDR4 : the HBM is only "GPU memory", and the DDR4 is "CPU memory". That should do for an old/unmodified OS, and GPU memory if full could spill into system memory (ddr4) the old way.
Better : special HSA programs would allocate memory in such and such pool depending on the needs.

Speaking of HSA, I suggest giving the 1.0 final specification a read. You will find a feature called coarse-grain regions and allocations, which semantically fits "local memory".

That said, even if the capability is there, it is still up to developers to pick it up. Otherwise, the default allocation is always the cache-coherent system memory. HSA is strictly NUMA and does not try to hide it from applications. If anything hides it, it would be the higher-level runtime and libraries.

Better still : you might actually want to run some CPU code in the HBM. If HBM is lower latency, then regular CPU code might be a good deal faster as memory latency is a bad old bottleneck that doesn't get better with each DDR generation.
You might want to run some game (if this is a desktop) on the DDR4, if performance is not too important (e.g. a Valve game at 1080p), or put the framebuffer in the HBM and textures in the DDR4 (perhaps that latter idea works in a legacy OS if the graphics driver is managing it).

HBM is still ordinary DRAM after all. Bandwidth is higher, but lower latency is unlikely. Especially when you optimize it for maximum streaming bandwidth for the best possible GPU performance, latency usually suffers.

So, the dual pools of memory are a complication or source of headaches, and may require a modified OS, modified applications, or both, plus, say, a BIOS setting to disable either pool.
But it could be a survivable complication.

Just want to emphasize that "dual pool" is how our discrete graphics have worked for years. The model just had not evolved enough to take on the broader challenge of general-purpose computing until recently.

Blazkowicz · Apr 13, 2015

pTmdfx said:
Video memory manager is already managing that pool, which gets better under WDDM 2.0. On Linux side, there is the Heterogeneous Memory Management project. Regular applications do need to explicitly make use of it, but it is already serving lots of applications today.

Thanks for spelling it out in full words :smile:. It's the first time I've heard of this HMM.

Blazkowicz · Apr 13, 2015

3dilettante said:
There's nothing that screams "wrong" outright. It's a decent enough extrapolation from things that are ongoing, although a Greenland time-frame device with those bandwidth numbers might not be that great, unless Greenland is closer to being a Fiji variant rather than of the generation after next.

There are other things that just seem sloppy, like having an APU with "Zen" cores and "Greenland" stream processor (sic), or "1Gbit" (assuming Ethernet would be in that box?).

The 1Gb ethernet would be meaningful if it's the management interface, what is called IPMI on Intel server motherboards. (/edit : available on AMD G34 motherboards, which ironically use Intel ethernet chips)
The AMD Seattle SoC has that same 1Gb interface I believe (in addition to dual 10Gb ethernet).
I speculate the 1Gb ethernet interface is tied to the security processor in some way.

Though of course, it might be some bullshit rumor and the author playing smart.

aaronspink · Apr 14, 2015

Blazkowicz said:
I have the same concern about e.g. OS kernels and regular programs not made with two pools of memory in mind, we have the luxury of just one big fat pool of memory even on multicore or SMP systems. Though for about a decade we've had NUMA on the PC, so there is actually and already some infrastructure to deal with some memory that is faster/lower latency than some other memory.

They can likely do what Intel will be doing with KNL. DDR4 is the baseline memory and starts at 0x0. HBM/HMC is fast memory and is stacked down from the top of the physical memory space. Intel has already added Malloc support for this, btw.

Intel has a bunch of other memory models for HMC/DDR4 but I would think that this default one will find the majority of the actual use.
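
A minimal sketch of the layout described above, assuming a flat physical address space: DDR4 occupies addresses from 0x0 upward and the fast memory is stacked down from the top. The address width and pool sizes are made-up values for illustration, and `build_memory_map` and `pool_of` are hypothetical helpers, not anything from Intel's or AMD's software.

```python
def build_memory_map(phys_bits: int, ddr4_bytes: int, hbm_bytes: int) -> dict:
    """Return half-open [start, end) physical ranges for both pools.

    DDR4 grows up from address 0; the fast pool ends at the top of the
    physical address space and grows down from there.
    """
    top = 1 << phys_bits
    if ddr4_bytes + hbm_bytes > top:
        raise ValueError("pools do not fit in the physical address space")
    return {
        "DDR4": (0, ddr4_bytes),
        "HBM": (top - hbm_bytes, top),
    }


def pool_of(addr: int, memory_map: dict):
    """Name the pool backing a physical address, or None for the hole between pools."""
    for name, (start, end) in memory_map.items():
        if start <= addr < end:
            return name
    return None


GiB = 1 << 30
# Hypothetical machine: 40 physical address bits, 64 GiB DDR4, 16 GiB fast memory.
memory_map = build_memory_map(phys_bits=40, ddr4_bytes=64 * GiB, hbm_bytes=16 * GiB)

for name, (start, end) in memory_map.items():
    print(f"{name}: {start:#x} .. {end:#x}")
```

With a fixed split like this, an allocator that wants fast memory only has to hand out addresses from the upper range, which is one reason a default of this kind is easy for existing software to adopt.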

pTmdfx · Apr 14, 2015

aaronspink said:
They can likely do what Intel will be doing with KNL. DDR4 is the baseline memory and starts at 0x0. HBM/HMC is fast memory and is stacked down from the top of the physical memory space. Intel has already added Malloc support for this, btw.

If they are going to let the memory be "owned" by the GPU, just like the private video memory of today's APUs, it would be mapped into the system physical address space via a PCIe BAR, and it has day-one support from WDDM 2. Malloc still has to be done through runtime calls through HSA or anything higher in the stack though, and it is not as pageable as generic system memory per WDDM 2.

fellix · Apr 28, 2015

The first thing we can spot is that there is only one integer cluster in a Zen core rather than two as in the Excavator module on the left. These two integer clusters are what form the two separate CPU cores / threads in each Excavator module. Zen takes on a more traditional AMD CPU layout resembling that of Phenom and Athlon K series cores, featuring a single large integer cluster and one equally large floating point unit.

This is an important distinction because, in contrast, the Bulldozer family of cores achieved very high integer throughput but sacrificed floating point performance. That’s because each pair of cores shared one floating point unit. Although the floating point unit itself was larger and more capable than the one found in AMD’s previous K10 CPU core in the Phenom II line of chips, floating point performance was still lacking compared to integer, merely because the design was heavily weighted towards integer, as can be seen above.

Obviously, because Zen forgoes the CMT design of the Bulldozer family, we can see that AMD has returned to a single fetch and single decode unit on the front end, as opposed to the double decoders that were introduced with Steamroller, Excavator’s predecessor found in the 7000 series Kaveri APUs.

AMD Zen CPU Core Block Diagram Leaked

Pretty much ditches the whole CMT concept and backtracks to the pre-Bulldozer designs, this time with wider FP/SIMD pipes and 3-way (?) INT/AGU core.

Gubbi · Apr 28, 2015

fellix said:
Pretty much ditches the whole CMT concept and backtracks to the pre-Bulldozer designs, this time with wider FP/SIMD pipes and 3-way (?) INT/AGU core.

I wonder if the six integer pipelines (exe units) are 3 x AGU/LS/ALU pairs like in K8 and derivatives or they decoupled address generation and load/store from the ALU ops.

I hope it's the latter.

Cheers

sebbbi · Apr 28, 2015

Two arrows down from the decoder to int and fp pipes... I suppose this means that the core is running two independent hardware threads (hyperthreading). Would be first for AMD (all the other high performance CPU vendors have done this for ages).

Alexko · Apr 28, 2015

sebbbi said:
Two arrows down from the decoder to int and fp pipes... I suppose this means that the core is running two independent hardware threads (hyperthreading). Would be first for AMD (all the other high performance CPU vendors have done this for ages).

If this image is real and the arrows actually mean anything, it could just be that Zen can decode up to 4 instructions per cycle (which might be too little, actually).

fehu · Apr 28, 2015

It was rumored that Zen would drop CMT in favor of SMT.

sebbbi · Apr 28, 2015

Alexko said:
If this image is real and the arrows actually mean anything, it could just be that Zen can decode up to 4 instructions per cycle (which might be too little, actually).

That is also a possibility. It's not too bad, since four is twice as many as a Jaguar core (or a Piledriver/Excavator core) can do. It would be a clear improvement. Kind of a turbocharged Jaguar that is twice as wide and has twice as wide vector units. I would like to have a 16 core Zen in my console.

3dilettante · Apr 28, 2015

If the visual representation is consistent between the BD and Zen diagrams, I would think the dual arrows mean SMT. The FPU was considered to be SMT, and it wouldn't use those arrows to represent decode bandwidth because the FPU could not take 8 ops per cycle, nor was it restricted to half any given decoder's throughput.

Bulldozer didn't make the FMAC pipes exclusively FP, as integer FMAC was on one pipe, so this is possibly two ALU pipes and an FMISC type of situation.
As for the generic integer pipelines, I suppose we'll see what meat there is to them in another slide.

There's no longer a shared L2, which might lead to a less complex local memory hierarchy.
If Zen does widen its sustained throughput, it might lead to a buffed decoder or enhanced instruction fusion. At these widths and likely performance target range, I would wonder if AMD is going to adopt some kind of uop store or cache for power reasons.
Jaguar's branch prediction scheme has already rolled into the latest BD variants, so that might be elaborated here. Part of its benefit is the ability to predict two branches if they are in the right predictor, which might lead to a doubling of branch units for the sake of sustained performance, like Cyclone.

The rest of the diagram is so generic that there isn't an AMD core (or, for that matter, a core from many other designers) from the last few decades that wouldn't have the same basic set of rectangles.
Zen's rollout period is a little short in years for the time window for a clean-sheet design if we were to assume a reconstituted team was launched after Keller's hire, so there might be some outstanding projects or direct evolutions from existing units pulled in. I have not seen any notable complaint about the FPU schedulers, and perhaps an FP cycle or two could be shaved off now that it is no longer shared.
The BD front end already knows how to be shared by two threads.
The integer scheduler and OOE engine might be a bigger difference.

There's a lot of really important and potentially interesting stuff not covered with something I am not sure I would call a block diagram. It's almost a CPU caricature.

pTmdfx · Apr 28, 2015

sebbbi said:
Two arrows down from the decoder to int and fp pipes... I suppose this means that the core is running two independent hardware threads (hyperthreading). Would be first for AMD (all the other high performance CPU vendors have done this for ages).

Not really first for AMD. They did it in Bulldozer, but just for the SIMD pipelines. Moreover, the front-end was shared in the same way as common SMT implementations.

pTmdfx · Apr 28, 2015

3dilettante said:
If the visual representation is consistent between the BD and Zen diagrams, I would think the dual arrows mean SMT. The FPU was considered to be SMT, and it wouldn't use those arrows to represent decode bandwidth because the FPU could not take 8 ops per cycle, nor was it restricted to half any given decoder's throughput.

Bulldozer didn't make the FMAC pipes exclusively FP, as integer FMAC was on one pipe, so this is possibly two ALU pipes and an FMISC type of situation.
As for the generic integer pipelines, I suppose we'll see what meat there is to them in another slide.

There's no longer a shared L2, which might lead to a less complex local memory hierarchy.
If Zen does widen its sustained throughput, it might lead to a buffed decoder or enhanced instruction fusion. At these widths and likely performance target range, I would wonder if AMD is going to adopt some kind of uop store or cache for power reasons.
Jaguar's branch prediction scheme has already rolled into the latest BD variants, so that might be elaborated here. Part of its benefit is the ability to predict two branches if they are in the right predictor, which might lead to a doubling of branch units for the sake of sustained performance, like Cyclone.

The rest of the diagram is so generic that there isn't an AMD core (or, for that matter, a core from many other designers) from the last few decades that wouldn't have the same basic set of rectangles.
Zen's rollout period is a little short in years for the time window for a clean-sheet design if we were to assume a reconstituted team was launched after Keller's hire, so there might be some outstanding projects or direct evolutions from existing units pulled in. I have not seen any notable complaint about the FPU schedulers, and perhaps an FP cycle or two could be shaved off now that it is no longer shared.
The BD front end already knows how to be shared by two threads.
The integer scheduler and OOE engine might be a bigger difference.

There's a lot of really important and potentially interesting stuff not covered with something I am not sure I would call a block diagram. It's almost a CPU caricature.

http://www.planet3dnow.de/vbulletin/attachment.php?attachmentid=32395&d=1430230519

Here comes the second alleged slide. It suggests Zen is getting 512KB of dedicated L2 cache, together with an L3 cache that is shared by 4 cores and, from the context, doesn't look to be globally shared. Everything is going to be fully inclusive, and four cores form a building block (quad-core unit). Building blocks will be interconnected with a new fabric design (one they envisioned a few years ago).

Seems like they are not going to introduce a system-shared cache like Intel's LLC. So snooping for low quad-core unit counts, and likely a directory for NUMA, high unit counts or perhaps performance GPUs. I'm interested in knowing whether the GPU cache hierarchy is going to get some love too, say QuickRelease.
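
As a rough illustration of why snooping suits a low unit count and a directory suits a high one, here is a toy Python model that only counts probe messages per miss. It assumes broadcast snooping probes every other unit while a directory probes only the units it has recorded as sharers. `snoop_probes` and `directory_probes` are hypothetical names, and nothing here reflects how AMD's fabric actually works.

```python
def snoop_probes(num_units: int) -> int:
    """Broadcast snooping: a miss probes every unit except the requester."""
    return num_units - 1


def directory_probes(sharers: set, requester: int) -> int:
    """Directory: a miss probes only the recorded sharers other than the requester."""
    return len(sharers - {requester})


# Assume the requested line is held by exactly one other unit (unit 1)
# and the requester is unit 0.
for units in (2, 4, 16):
    print(units, snoop_probes(units), directory_probes({1}, requester=0))
```

The snoop count grows with the number of units while the directory count tracks the actual sharers, at the cost of storing and looking up the directory state.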

Alexko · Apr 28, 2015

sebbbi said:
That is also a possibility. It's not too bad, since four is twice as many as a Jaguar core (or a Piledriver/Excavator core) can do. It would be a clear improvement. Kind of a turbocharged Jaguar that is twice as wide and has twice as wide vector units. I would like to have a 16 core Zen in my console.

Bulldozer/Piledriver was limited to 4 decoded instructions per cycle and per module, and I think there was the additional constraint that the decoded instructions had to come from a single thread, so the decoders would alternate between threads, but don't quote me on that.

Steamroller and thus Excavator had one 4-wide decoder per core (thus two per module). While this may have been slightly overkill, it might not be sufficient if Zen is as wide as it appears to be, which is hard to say without knowing more about the pipelines.
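
The arithmetic behind that comparison can be sketched in a few lines of Python. This assumes the simplest possible model, where a shared decoder alternates evenly between the active threads (the part I said not to quote me on), and `per_thread_decode` is a hypothetical helper.

```python
from fractions import Fraction


def per_thread_decode(width: int, threads_sharing: int) -> Fraction:
    """Average decoded instructions per cycle per thread when a decoder of
    `width` alternates evenly between `threads_sharing` active threads."""
    return Fraction(width, threads_sharing)


# Bulldozer/Piledriver: one 4-wide decoder per module, two threads sharing it.
bulldozer = per_thread_decode(4, 2)
# Steamroller/Excavator: one 4-wide decoder per core.
steamroller = per_thread_decode(4, 1)
print(bulldozer, steamroller)
```

Under the same model, a 4-wide Zen decoder feeding two SMT threads would average the same per-thread figure as Bulldozer when both threads are busy, which is why the width matters.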

Still, while interpreting arrows as instructions is most natural (to me, anyway) I agree that SMT would make more sense, mostly for the reasons cited by 3dilettante. Even so, it's not an easy feature to implement, test and validate, and given AMD's past exploits, well, let's just say that I wouldn't be shocked if this turned out to be non-functional and fused-off upon release.

Oh, and I don't know about 16 Zen cores in a console, but 8 cores should be very doable, and between the increased IPC and the likely >3.0GHz clocks, you should get plenty of performance to play with.

fellix · Apr 28, 2015

pTmdfx said:
Seems like they are not going to introduce a system-shared cache like Intel's LLC. So snooping for low quad-core unit counts, and likely a directory for NUMA, high unit counts or perhaps performance GPUs. I'm interested in knowing whether the GPU cache hierarchy is going to get some love too, say QuickRelease.

Looks like scaled up Jaguar quad-core module. Probably with the same tight integration for macro-level implementation in server/APU SoCs.

Smells like prepping for another console generation. j/k

Deleted member 13524 · Apr 28, 2015

It does seem that Zen derives more from Bobcat/Jaguar than Bulldozer.
The same way Intel ditched Netburst back then and used Conroe as the basis for their higher-performing parts.

The curious thing is that AMD got stuck with Bulldozer for about as long as Intel got stuck with Netburst, which is 5 years.

512KB L2 cache per core, 8MB L3 cache per 4-core cluster.
Haswell-Ex has 256KB L2 cache per core, and unlocks 5MB L3 cache for each unlocked set of 2 cores.

3dilettante · Apr 28, 2015

pTmdfx said:
Here comes the second alleged slide. It suggests Zen is getting 512KB of dedicated L2 cache, together with an L3 cache that is shared by 4 cores and, from the context, doesn't look to be globally shared. Everything is going to be fully inclusive, and four cores form a building block (quad-core unit). Building blocks will be interconnected with a new fabric design (one they envisioned a few years ago).

Inclusion should simplify snooping from outside the CPU unit. It seems like it maintains a hierarchy of small numbers of local clients, which might mean local crossbars or something more complex than a ring bus. The global interconnect would be something else, perhaps.

Seems like they are not going to introduce a system-shared cache like Intel's LLC. So snooping for low quad-core unit counts, and likely a directory for NUMA, high unit counts or perhaps performance GPUs. I'm interested in knowing whether the GPU cache hierarchy is going to get some love too, say QuickRelease.

At least the QuickRelease proposal seems to give a shared L3, which with Zen appears more closely tied due to the inclusive nature and external interface. Whether AMD intends to re-divide the GPU memory hierarchy into read-only and write-only zones is an item of debate, although that split is not mandatory. There's no clear sign of this for the most recently released GCN revision.

Region-based coherence might fit with the separated CPU section. Perhaps some optimizations can be made for checking since the new L3 hierarchy means at most 1/8 (edit: 1/4, I was thinking of the wrong L2 size) of the cache could ever be shared between a CPU and GPU.
The presence of HBM, and whether the GPU maintains more direct control over it, might have some other effects.
Assigning HBM a fixed location in the memory space could do something like make only certain parts of the L3 require heterogeneous snooping.
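
One way to read the 1/8 versus 1/4 correction above is as the share of an inclusive L3 that is taken up by copies of the per-core L2 contents. That reading is an assumption on my part, and the sketch below uses the rumored 512KB L2 per core and 8MB L3 per 4-core unit from the slide; `l2_fraction_of_l3` is a hypothetical helper.

```python
from fractions import Fraction

KiB = 1024
MiB = 1024 * KiB


def l2_fraction_of_l3(l2_bytes: int, cores: int, l3_bytes: int) -> Fraction:
    """Fraction of an inclusive L3 occupied by copies of all the cores' L2 contents."""
    return Fraction(l2_bytes * cores, l3_bytes)


# Rumored figures: 512 KiB L2 per core, 4 cores, 8 MiB shared L3.
with_512k = l2_fraction_of_l3(512 * KiB, 4, 8 * MiB)
# The same calculation with a 256 KiB L2 gives the figure before the edit.
with_256k = l2_fraction_of_l3(256 * KiB, 4, 8 * MiB)
print(with_512k, with_256k)
```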

fellix said:
Looks like scaled up Jaguar quad-core module. Probably with the same tight integration for macro-level implementation in server/APU SoCs.

In some aspects, it does. Although if the Zen cores perform as they should, the number of misses in flight from the various levels of the CPU caches would need to ratchet higher. Bulldozer's number of outstanding misses from its L2 was over 20, while the Jaguar module's per-core L1 to L2 count was 8.
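
The link between outstanding misses and performance can be sketched with Little's law: sustained bandwidth is roughly the bytes in flight divided by the latency. The 64-byte line size is standard for these x86 cores, but the 100 ns latency below is a made-up figure, and applying the same latency to both counts is a simplification since they sit at different cache levels. `sustained_bandwidth_gbs` is a hypothetical helper.

```python
def sustained_bandwidth_gbs(outstanding_misses: int, line_bytes: int, latency_ns: float) -> float:
    """Little's law: bytes in flight divided by latency. Bytes per ns equals GB/s."""
    return outstanding_misses * line_bytes / latency_ns


LINE_BYTES = 64      # cache line size
LATENCY_NS = 100.0   # hypothetical miss latency, for illustration only

jaguar = sustained_bandwidth_gbs(8, LINE_BYTES, LATENCY_NS)
# "Over 20" for Bulldozer, so 20 is used as a lower bound.
bulldozer = sustained_bandwidth_gbs(20, LINE_BYTES, LATENCY_NS)
print(jaguar, bulldozer)
```

At a fixed latency the achievable bandwidth scales directly with the miss count, which is why a faster core needs more misses in flight to stay fed.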

Jaguar's inter-module coherence, or general lack of it, is something Zen has not inherited.

ToTTenTranz said:
It does seem that Zen derives more from Bobcat/Jaguar than Bulldozer.

At a core level, a lot might not come from Jaguar. The performance level in question has fetch requirements that cannot be satisfied by Jaguar, since some of its optimizations include a lack of predecode bits.
It seems like the cache hierarchy has had some more sanity injected into it, but there is no inheriting scalability from Jaguar.

The curious thing is that AMD got stuck with Bulldozer for about as long as Intel got stuck with Netburst, which is 5 years.

That's around the time it takes to make a new design. However, it should be noted that if Zen was a restart that began after Keller was hired, it's not 5 years for Zen's gestation.
