AMD RyZen CPU Architecture for 2017

Possibly, they want to integrate GPU cores into the count, making them available for general processing via HSA? Or won't it be necessary to assign a core# to them?
 
The more interesting part is the (older, but I didn't notice that patch at the time) confirmation that the LLC now lives among the cores, not on the NB, which likely has latency implications. I wonder if writes from a core preferentially go to its own L3 slice, or if they use a scheme similar to Intel's, where each cache line belongs to a specific L3 slice depending on its address and all cores access all of the L3 evenly.
Knowing the L3 ID seems less useful if all locations are hashed across all the L3s in a socket. It would be useful for exploiting locality in a non-uniform access setup, which hashing across L3 IDs would interfere with.
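To make the distinction concrete, here is a minimal sketch of the two mapping policies; the slice count and the hash function are made up for illustration, not the real designs:

```c
/* Minimal sketch (not AMD's or Intel's actual hash): two ways an L3 could
 * map a request to a slice.  "Hashed" spreads every address across all
 * slices in the socket; "local" prefers the requesting core's own slice. */
#include <stdint.h>
#include <stdio.h>

#define NUM_SLICES 8   /* hypothetical: one slice per core */

/* Toy hash over the cache-line address bits -- real designs XOR a larger,
 * carefully chosen set of physical address bits to balance the slices. */
static unsigned hashed_slice(uint64_t paddr)
{
    uint64_t line = paddr >> 6;               /* 64 B cache lines */
    return (unsigned)((line ^ (line >> 3) ^ (line >> 7)) % NUM_SLICES);
}

/* Locality-preserving alternative: a line allocated by a core lands in that
 * core's own slice, so later hits stay close by. */
static unsigned local_slice(unsigned requesting_core)
{
    return requesting_core % NUM_SLICES;
}

int main(void)
{
    uint64_t addr = 0x12345680;
    printf("hashed: slice %u\n", hashed_slice(addr));
    printf("local (core 3): slice %u\n", local_slice(3));
    return 0;
}
```

With hashing, a core's data ends up spread across every slice in the socket, so the slice ID of any one L3 tells you little; with a local policy, the ID is exactly what you'd want to know to exploit locality.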

Possibly, they want to integrate GPU cores into the count, making them available for general processing via HSA? Or won't it be necessary to assign a core# to them?
It doesn't seem too out of line for where the higher-count workstation/server CPUs are getting to.
I don't know about involving the CUs. The ID is being used in the context of the CPU socket and L3 hierarchy, and basing part of it on the APIC ID would mark a change, as CUs don't individually plug into that. The GPU as a whole might hook into things from an IO standpoint, but CUs usually hide behind layers of abstraction. There were indications that Polaris is introducing even more hardware scheduling, which might insulate individual CUs further.
 
So the L0 cache that appeared in other slides really is a micro-op cache after all, and its functionality resembles the micro-op cache Intel has been using since Sandy Bridge.

I am drawing a blank on slides for an L0 cache for Zen.
Historically, I've seen more about an L0 operand cache or stack cache, which is the wrong cache type for the context where the only L0 reference is listed.
As an Icache TLB, the L0 TLB might not be tied to a cache AMD calls an L0, since TLBs in prior CPUs have multiple levels hanging off of the L1.
It might be a micro-TLB used for way prediction or for reducing the number of times the core needs to check tags. That's where I've seen mentions of the term "microtag" that showed up in the expanded error reporting.
That might also be expanded to serve as a uop-cache hit check, if it shares Intel's constraint that the uop cache map to the L1.
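Roughly what I mean, as a hypothetical sketch; the microtag width, set layout, and lookup policy are guesses for illustration, not the documented design:

```c
/* Sketch of partial-tag ("microtag") way prediction, assuming a hypothetical
 * 8-way set that stores full tags plus a few low tag bits per way.  Only the
 * predicted way's full tag gets checked, instead of all eight. */
#include <stdint.h>
#include <stdbool.h>

#define WAYS       8
#define UTAG_BITS  8           /* hypothetical microtag width */
#define UTAG_MASK  ((1u << UTAG_BITS) - 1)

struct set {
    uint64_t tag[WAYS];        /* full tags */
    uint8_t  utag[WAYS];       /* partial tags checked first */
    bool     valid[WAYS];
};

/* Returns the matching way, or -1 on a predicted miss.  A microtag match
 * that then fails the full-tag compare is the kind of mismatch expanded
 * error reporting could flag. */
static int lookup(const struct set *s, uint64_t tag)
{
    uint8_t utag = (uint8_t)(tag & UTAG_MASK);
    for (int w = 0; w < WAYS; w++) {
        if (s->valid[w] && s->utag[w] == utag)      /* cheap first filter */
            return (s->tag[w] == tag) ? w : -1;     /* one full compare */
    }
    return -1;
}

int main(void)
{
    struct set s = {0};
    s.valid[2] = true;
    s.tag[2]   = 0xABCD;
    s.utag[2]  = 0xABCD & UTAG_MASK;
    return lookup(&s, 0xABCD) == 2 ? 0 : 1;
}
```

The payoff is a single full tag compare (and data way read) per lookup; the cost is that an aliased or stale microtag can misdirect the lookup, which is the sort of thing you'd want machinery to detect.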


The oddity with the FMA FP0+FP3 and FP1+FP3 pipe sharing was dropped, although why it was mentioned in the first place is curious. It seems too elaborate for a typo.
If FP0 and FP1 are FMA pipes, it might explain why IADD is possible on them as well as FP3.

I tried looking up the context of the error reporting for shadow tag ECC errors. At least one place I found them mentioned was a paper on non-uniform cache allocation, where shadow tags are used to determine which cores' workloads would benefit most from expanding the "local" allocation given to them in a last-level cache. That leads to lower latency within the local partition, and to not having to check across all the tags in the common case.
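The gist of that scheme, as a toy sketch; the structures and the rebalance policy here are invented for illustration, not taken from the paper or from AMD:

```c
/* Toy sketch of shadow-tag-driven partition sizing: per-core shadow tags
 * track what *would* have hit if that core's local LLC allocation were
 * larger, and spare capacity goes to whichever core would benefit most. */
#include <stdint.h>

#define CORES 4

struct core_monitor {
    uint64_t real_hits;      /* hits in the core's current local partition */
    uint64_t shadow_hits;    /* misses the shadow tags say would have hit
                                with a bigger allocation */
    unsigned alloc_ways;     /* current local allocation */
};

/* Called periodically: hand spare ways to the core whose shadow tags show
 * the largest number of recoverable misses, then start a new epoch. */
static void rebalance(struct core_monitor mon[CORES], unsigned spare_ways)
{
    unsigned best = 0;
    for (unsigned c = 1; c < CORES; c++)
        if (mon[c].shadow_hits > mon[best].shadow_hits)
            best = c;
    mon[best].alloc_ways += spare_ways;
    for (unsigned c = 0; c < CORES; c++)
        mon[c].shadow_hits = mon[c].real_hits = 0;
}

int main(void)
{
    struct core_monitor mon[CORES] = {
        { .real_hits = 900, .shadow_hits =  10, .alloc_ways = 4 },
        { .real_hits = 400, .shadow_hits = 300, .alloc_ways = 4 },
        { .real_hits = 800, .shadow_hits =  50, .alloc_ways = 4 },
        { .real_hits = 100, .shadow_hits =  20, .alloc_ways = 4 },
    };
    rebalance(mon, 2);               /* core 1 should end up with 6 ways */
    return mon[1].alloc_ways == 6 ? 0 : 1;
}
```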

The error reporting for multi-way hits in the L3 and L2 is something I am not familiar enough with to know whether it is peculiar to a segmented L3, or a possible transient state in the long memory pipeline that could always happen but has not previously been subject to reporting.

The error-reporting and checkpointing functionality in general seems to tie into Zen's server focus. The error reporting is something that either will not be present in all new CPUs, or can be toggled.
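As one example of error reporting that can be toggled: on x86, machine-check banks are enabled per error source through the MCi_CTL MSRs. A quick sketch using Linux's /dev/cpu/0/msr interface and the legacy MCA register layout; Zen's expanded machine-check architecture may well place its banks at different MSR addresses:

```c
/* Dump each machine-check bank's enable mask via /dev/cpu/0/msr
 * (requires root and the msr kernel module).  Uses the legacy MCA layout:
 * MCG_CAP reports the bank count, MCi_CTL enables reporting per bank. */
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <unistd.h>

#define IA32_MCG_CAP  0x179   /* low 8 bits: number of MCA banks */
#define IA32_MC0_CTL  0x400   /* bank i's control MSR is 0x400 + 4*i */

static int rdmsr(int fd, uint32_t msr, uint64_t *val)
{
    return pread(fd, val, sizeof *val, msr) == (ssize_t)sizeof *val ? 0 : -1;
}

int main(void)
{
    int fd = open("/dev/cpu/0/msr", O_RDONLY);
    if (fd < 0) { perror("open /dev/cpu/0/msr"); return 1; }

    uint64_t cap, ctl;
    if (rdmsr(fd, IA32_MCG_CAP, &cap)) { perror("rdmsr"); return 1; }

    unsigned banks = (unsigned)(cap & 0xff);
    for (unsigned i = 0; i < banks; i++) {
        if (rdmsr(fd, IA32_MC0_CTL + 4 * i, &ctl) == 0)
            printf("MC%u_CTL = %#llx%s\n", i, (unsigned long long)ctl,
                   ctl ? "" : "  (reporting disabled)");
    }
    close(fd);
    return 0;
}
```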
 
Hmm, so the GPU/HBM of the APU is confirmed as a separate die on the MCM package.
Well, that crushes my dream of using the HBM on the APU as a massive full-bandwidth L3/L4 à la Crystalwell :(

Unless the consumer version goes for an 8-core CPU on one half of the die and a GPU on the other half... but then that would probably not be a big enough GPU to justify HBM -> it would just use the system DDR, I guess?

Edit: on the other hand Crystalwell has 100GB/s too, though on a more direct link & higher clock.
Going off-chip to the GPU die and then off that again to the much lower-clocked HBM sounds like the latency would be pretty bad.
 
I was wondering what you were talking about, and that is a server product, not a consumer APU (announced in the link above).

It's rather huge, goes in dual-socket motherboards, and could easily cost $2000 as a ballpark figure.
If you want some Crystalwell-type memory, even a severely overpriced Broadwell-K or a Kaby Lake + eDRAM part will be cheaper still.

I also thought, what are they thinking? If the GPU is off-die, would AMD have to call that "Unfusion"? :)
But seeing that the 32-core MCM goes up to four sockets and the 16-core + GPU goes to two sockets, AMD must have designed some new interconnect that can be used between CPUs or between CPU and GPU; multiple links are likely used in the CPU + GPU MCM.

There doesn't seem to be a word on it, but here we have AMD's answer to NVLink.
 
AMD’s upcoming AM4 socket will be based on a µOPGA design with 1331 pins

AMD has been a devout supporter of Pin Grid Array socket types, and it looks like the AM4 will be no different. OPGA stands for Organic Pin Grid Array (the 'organic' refers to the plastic substrate the silicon die is attached to, out of which the pins protrude), and according to this report, the company is deploying a new standard called the µOPGA socket. The 'micro' in the term indicates that AMD will be using pins of a smaller diameter, which will presumably be weaker than OPGA-based pins. Going up from 940 pins to 1331 is an increase of approximately 40%, and it is implied that AMD will be decreasing the distance between the pins.

This means that while the µOPGA AM4 socket size will remain approximately the same, it will be much more fragile than previous OPGA-based iterations. AMD hopes to use this particular socket for all its mainstream and enthusiast platforms, including APUs. AMD's AM4 will combine the best points of the AM1, AM3+ and FM2 sockets. These will be deployed in everything from budget AIO motherboards to the integrated PCH designs of Bristol Ridge.
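A quick back-of-the-envelope check on those numbers, assuming the socket area really does stay the same and the pins sit on a square grid:

```c
/* Sanity-check the article's figures: the pin-count increase from 940 to
 * 1331, and how much tighter the pin pitch would have to get if the socket
 * area stays constant with pins on a square grid. */
#include <math.h>
#include <stdio.h>

int main(void)
{
    double old_pins = 940.0, new_pins = 1331.0;

    double increase    = (new_pins - old_pins) / old_pins;   /* ~41.6% */
    double pitch_ratio = sqrt(old_pins / new_pins);          /* ~0.84  */

    printf("pin count increase: %.1f%%\n", increase * 100.0);
    printf("pitch at constant area: %.0f%% of the old pitch\n",
           pitch_ratio * 100.0);
    return 0;
}
```

That works out to about a 42% increase in pin count and roughly 16% less pin pitch, so the "approximately 40%" and "decreasing the distance between the pins" claims both check out.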
 
With 1331 pins I guess we can say it's probably not a quad-channel capable platform.
 
Edit: on the other hand Crystalwell has 100GB/s too, though on a more direct link & higher clock.
Keep in mind, Intel's Xeon E7 line of quad-channel-equipped Haswells is rated at 102GB/sec of main memory throughput when using DDR4-1866 memory. If you stack up sockets with a NUMA-aware app, you can see benchmark numbers exceed 180GB/s...
The new SMIv2: http://www.anandtech.com/show/9193/the-xeon-e78800-v3-review/3

The benchmarks: http://www.anandtech.com/show/9193/the-xeon-e78800-v3-review/10

While the latency isn't likely as low as HBM, that's still an incredible amount of throughput available right now.
 
It might have 32 PCIe lanes or so (fewer could be enabled on lower-end APUs if it makes sense).
That would explain having a few more pins than desktop Intel.

I believe that otherwise we're expecting dual channel.
If there's a gigantic eight-channel socket, it would make sense to have a quad-channel socket in between (similar to Intel's LGA 2011), or at least a dual-socket, dual-channel (each) platform; the latter would replace Socket C32.
But then again, AMD lacks some means and C32 is hardly even known about, so I guess the plan is: you want high end? Fuck it, let's bring out the giant eight-channel socket (like a high-level Gillette executive would say).
 
AMD’s upcoming AM4 socket will be based on a µOPGA design with 1331 pins
Why are they sticking to pin grid sockets? Surely it must complicate high-speed signalling if you have more capacitance in your socket.

Or triple channel like LGA1366?
That'd be nice! :D I still have my i7 920-based rig running strong. That phat array of DIMMs lined up never ceases to be impressive IMO, hehe.
 
AM4 is desktop and replaces both AM3 and FM2+. Said to be like AM1 too, so the CPU should integrate a good part of the chipset (SATA and USB running from the socket).

The server parts use an unknown, huge socket. The competition is going to big sockets too, if we consider Intel's future socket for Skylake-EP/Skylake-EX, or even "OpenPOWER" stuff.

Lower-end x86 servers might use AM4, but with no special emphasis on it (roll your own for home or small business? Sure. Filling datacenter racks, I think not).
 