AMD RyZen CPU Architecture for 2017

Is there any indication AMD is attempting to accelerate AVX512 and similar instructions with a GPU?
Surely that'd be infeasible, as how would you avoid getting hit by hundreds of clocks of instruction latency when handing over AVX calculations to the GPU, due to the way GPUs are pipelined?

Wouldn't simply including a proper AVX512 unit in the CPU be the proper way to go?
 
At some point (Bulldozer), AMD nearly deprecated the role of the integrated FPU in favor of tighter GPU integration (HSA), but that didn't take off very well. What AMD did with Ryzen is a very balanced SIMD ISA implementation, with power/area efficiency in mind and attention to the existing code base.
On the other hand, the way Intel pumps out those new vector extensions, it's as if they want to turn the CPU into a sort of quasi-GPU, just not for rendering pretty games... yet.
 
There is no point in AMD trying to be a mini Intel. Spend the power and transistor budget on IPC, platform services (I want to see SoC-level AES/SHA/key management engines) and more cores rather than 512-bit SIMD; seriously, so much of the x86 market is never going to care about large vectors.
 
Surely that'd be infeasible, as how would you avoid getting hit by hundreds of clocks of instruction latency when handing over AVX calculations to the GPU, due to the way GPUs are pipelined?
That's really no different than scheduling a new thread and the pipeline isn't all that long. A single core assigned a subset of NCUs could bypass much of the CP work to avoid latency, talking to an ACE (the kernel agent, I think, in HSA terms) directly. It wouldn't be ideal for a one-off instruction, but that's already the case for SSE etc. So long as a thread stayed within the GPU's instruction set it could reside in the GPU with the added bandwidth and caching. That'd be the case with most math-heavy code blocks. That's the heterogeneous model AMD specifically has been working towards for a while. Shared memory space helps, and HBCC handling caching would be part of that.

It's AVX512-targeted code with large vectors and bandwidth requirements that would be offloaded. Not uncommon with image manipulation, audio processing, and spreadsheets/Matlab, where you're dealing with significant chunks of data. It's just a matter of automatically vectorizing the loops; LLVM is already doing this, with the latest C standards adding support.
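For instance, a loop along these lines is the bread-and-butter auto-vectorization case (a generic sketch, not code from any particular project; whether such a loop would ever be retargeted at a GPU is the speculative part):

// Scale a float buffer in place -- the kind of image/audio inner loop meant above.
// clang or gcc at -O3 will typically vectorize this on their own; with -mavx512f
// the compiler can emit 16-wide (512-bit) operations.
void scale(float* __restrict out, const float* __restrict in, float gain, int n) {
    for (int i = 0; i < n; ++i)
        out[i] = in[i] * gain;
}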

Wouldn't simply including a proper AVX512 unit in the CPU be the proper way to go?
Not necessarily, as the requirements are rather hefty. Memory bandwidth becomes a concern, so more channels are required to keep the hardware busy. Then there are all the ALUs and schedulers that the majority will never use alongside a discrete GPU. For Raven they would be on the same die, but Raven and Ryzen are distinct chips, and they are relatively small at around 200mm². HBM keeps getting dragged into the discussion because it's one of only a few practical ways to get the required bandwidth without building out the entire socket.
 
The two L3 clusters are virtually addressed as a single uniform space, regardless of thread allocation.
I'm under the impression that the L3 is like everything else past the TLBs and operates on physical addresses. On top of that, if they were addressed as a single space cores on one CCX could spill to the L3 in the other, which tests and architectural descriptions do not support.

At some point (Bulldozer), AMD nearly deprecated the role of the integrated FPU in favor of tighter GPU integration (HSA), but that didn't take off very well.
I haven't seen a clear indication of this. Bulldozer's FPU wasn't close to being deprecated. In terms of throughput it was relying on FMA and clocks to move above prior CPUs. AMD's core layout choice seemed to be based more on a misreading of physical trends and on multithreaded performance, with an emphasis on servers conflicting with the chip's other duties. Possibly there were other internal considerations related to the CMT proposal that Bulldozer in the end could not implement.
Given the delays in getting Bulldozer out (originally developed and intended for 45nm, with years of development work before that), HSA seems too recent to have been what AMD was counting on.
Since this supposed future still hasn't materialized in 2017, I seriously doubt AMD's engineers would have deprecated their FPU in the full knowledge that nothing they had besides a CPU would be acceptable for years or decades.

Did HSA fail, or is it only late but still on track to deliver on its promises?
The HSA foundation has been courting initiatives in China, and some of its members are trying to provide hardware and some way of interfacing with home-grown IP.
The main members have varying levels of hardware adoption of the HSA feature set, with the most notable members abstaining from significant chunks of the interoperability or software layer.

Progress appears to be very slow, and initial offerings like Kaveri potentially flawed. Maybe something could come out of the China efforts, or it's a way to entice a difficult to approach market rather than a computing paradigm worthy on its own merits.
It doesn't seem to be making that much of an impact outside of its glacial bubble.
 
Progress appears to be very slow, and initial offerings like Kaveri potentially flawed. Maybe something could come out of the China efforts, or it's a way to entice a difficult to approach market rather than a computing paradigm worthy on its own merits.
It doesn't seem to be making that much of an impact outside of its glacial bubble.
I'd consider it a work in progress, with the OpenCAPI initiative an obstacle to be overcome. The interconnect seems a stumbling block that Infinity/CAPI would address; however, we haven't seen much of them yet.
 
Some people are still thinking there will be RR APUs with HBM or a quad-channel memory controller.
Quad-channel was actually inside Kaveri, though unfortunately it was never used.
My guess is AMD took down the functionality when they realized OEMs would only be putting Kaveri (or any Bulldozer-based) APUs into bottom-of-the-barrel laptops or OEM PCs.
Since so many OEMs have resorted to using single-channel memory with AMD's APUs so far, just to spare a few bucks in PCB size and complexity, we can now say that was a good call.


As for Raven Ridge, I don't see the 15W FP5 version getting HBM2 or quad-channel, but I was indeed hoping for 128bit 4266MT/s LPDDR4 (~70GB/s) in some high-end implementations like a new Macbook Air or Surface Pro competitor. If that memory arrangement is good for smartphones and tablets, why wouldn't it be good for ultrabooks and 2-in-1s? However, those slides from 2016 indicate that only the tiny 4W Ryzen Mobile with 2 CPU cores and 3 NCUs will be getting LPDDR4 support in a 64bit width.

But FP5 supports CPUs up to 35W, and I could indeed see a top-end version getting that HBM stack. A single 2-Hi stack at 1400MT/s (2GB VRAM at 180GB/s) or even a 1-Hi stack at 1000MT/s (1GB at 128GB/s) using HBCC would do wonders for an APU that could ultimately get PS4 performance at a 35W TDP.
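For what it's worth, those bandwidth figures fall straight out of bus width times transfer rate; a quick check of the arithmetic (mine, not anything AMD has published):

#include <cstdio>

// GB/s = (bus width in bits / 8 bits per byte) * transfer rate in GT/s
constexpr double gbps(double bus_bits, double gtps) { return bus_bits / 8.0 * gtps; }

int main() {
    std::printf("128-bit LPDDR4-4266:     %.1f GB/s\n", gbps(128, 4.266));  // ~68 GB/s, the ~70GB/s figure
    std::printf("1024-bit HBM2 @ 1.4 GT/s: %.1f GB/s\n", gbps(1024, 1.4));  // 179.2 GB/s, the 180GB/s figure
    std::printf("1024-bit HBM2 @ 1.0 GT/s: %.1f GB/s\n", gbps(1024, 1.0));  // 128 GB/s
}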
But again, this would depend on big clients demanding these chips for high-end devices, like Apple for a new MacBook or Microsoft for a new Surface. Or HP+Dell+Asus+Acer getting together and placing orders to make new ultraportables that would get rid of the usual Core U + GeForce MX150 combo.
 
That's really no different than scheduling a new thread and the pipeline isn't all that long. A single core assigned a subset of NCUs could bypass much of the CP work to avoid latency, talking to an ACE (the kernel agent, I think, in HSA terms) directly. It wouldn't be ideal for a one-off instruction, but that's already the case for SSE etc. So long as a thread stayed within the GPU's instruction set it could reside in the GPU with the added bandwidth and caching.
While an interesting concept to mull over, there is no thread hopping between instruction sets. The ISAs, software model, memory model, and most everything else are poorly matched or non-equivalent. The latency of throwing execution context out of a core and across the chip is also likely being underestimated. GPU kernel launch latencies relative to in-core execution are completely out of the realm of sane.
In-CPU latency regimes are in the sub-nanosecond range, and CU communications are likely operating in microseconds in their dreams, and realistically are significant fractions of a millisecond to multiple milliseconds.

Given AMD's purported goals with chiplets, the separation is likely to get significantly worse.

I'd consider it a work in progress, with the OpenCAPI initiative an obstacle to be overcome. The interconnect seems a stumbling block that Infinity/CAPI would address; however, we haven't seen much of them yet.
The coherent interconnects and user-space abstracting should help, although OpenCAPI has non-member IBM hardware as a basis, and the biggest collaborator is Nvidia.

Currently, it seems like HSA was able to get a raft of vendors to agree to a general outline of how to make their hardware less stupid, but their next big step has been taking their less-stupid hardware and going their own way with it.
 
But FP5 supports CPUs up to 35W, and I could indeed see a top-end version getting that HBM stack. A single 2-Hi stack at 1400MT/s (2GB VRAM at 180GB/s) or even a 1-Hi stack at 1000MT/s (1GB at 128GB/s) using HBCC would do wonders for an APU that could ultimately get PS4 performance at a 35W TDP.
Larger stacks may make sense to do away with DIMMs completely. It would be an interesting configuration to see, as it would use less power and provide far more bandwidth than anything we've seen at that scale: smaller capacity, with an order of magnitude more bandwidth.

While an interesting concept to mull over, there is no thread hopping between instruction sets. The ISAs, software model, memory model, and most everything else are poorly matched or non-equivalent.
LLVM should be a start at covering those deficiencies. Abstracted high enough, the jumps to suitable hardware are plausible, transitioning logic to the CPU and math to the GPU. I'd consider the model in flux, but coming in the future. There's a good chance that's part of the goal, with most software companies transitioning towards LLVM.

The latency of throwing execution context out of a core and across the chip is also likely being underestimated. GPU kernel launch latencies relative to in-core execution are completely out of the realm of sane.
They don't need to move the whole context, though. Just treat them as separate threads and synchronize with limited communication, through atomics or a suitable locking mechanism. The workloads where SIMD acceleration is involved are relatively large.
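A minimal sketch of the kind of boundary-only synchronization I mean, with a second plain C++ thread standing in for the GPU side (the structure and names here are purely illustrative):

#include <atomic>
#include <cstdio>
#include <thread>
#include <vector>

int main() {
    // One large buffer handed off wholesale; synchronization only at the boundaries.
    std::vector<float> buffer(1 << 20, 1.0f);
    std::atomic<bool> ready{false}, done{false};

    std::thread worker([&] {                               // stand-in for the GPU-side thread
        while (!ready.load(std::memory_order_acquire)) {}  // wait for the handoff
        for (float& x : buffer) x *= 2.0f;                 // bulk, SIMD-friendly work
        done.store(true, std::memory_order_release);
    });

    ready.store(true, std::memory_order_release);          // publish the buffer once
    while (!done.load(std::memory_order_acquire)) {}       // single sync point on completion
    worker.join();
    std::printf("buffer[0] = %f\n", buffer[0]);
}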

In-CPU latency regimes are in the sub-nanosecond range, and CU communications are likely operating in microseconds in their dreams, and realistically are significant fractions of a millisecond to multiple milliseconds.
Time slicing within an OS, however, is on the upper end of that, around 10-20ms. Retaining a basic FPU, the CPU could run everything, farming threads off to the GPU when suitably large workloads are encountered: the typical embarrassingly parallel tasks involving media or batched operations.

The coherent interconnects and user-space abstracting should help, although OpenCAPI has non-member IBM hardware as a basis, and the biggest collaborator is Nvidia.
They are still based around PCIe last I checked, and that's the same for Infinity. Development is still underway, but AMD, possessing both the CPU and the accelerator, may have pushed ahead.
 
That's really no different than scheduling a new thread and the pipeline isn't all that long.
The difference is, the GPU CUs are across a big chip, or potentially on a different chip entirely. So you'll have to shuffle a bunch of data over there, and then shuffle a bunch of result data back again. Infinity Fabric is, from what I understand, basically renamed HyperTransport, which is akin to PCIe (serialized, packetized). Possibly with similar latency concerns as well? And CUs don't run AVX code, do they? So there has to be ISA transcoding hardware shoved in there too; more latency there.

Also, there's no mechanism, from what I understand, for interrupting a working CU and handing it new work, so you'd have to wait for it to finish a potentially lengthy shader program before it'll start chewing through your data. Even if you can interrupt it, you'd have to wait for cache flushes and whatnot, and if the GPU L2 is packed to the rafters, with memory latencies being what they are on GPUs (pretty horrendous), that'd probably be a hefty delay. We'd already be finished by now on the CPU AVX unit, and the GPU hasn't even started chewing through our data! :p

So, unless we keep a dedicated CU idle on hot standby for our AVX work, there's probably going to be cripplingly huge latency by offloading work, and if we're leaving CUs idling, then what the eff is the gain?!?!?!?! We're just overcomplicating something that doesn't need to be complicated! :D

So no, I remain rather unconvinced by this approach. Also, power: punting data back and forth has big implications these days as far as power use is concerned. So again, what's the point of all this? So we can save a few mm² of die space on an AVX unit, which might not be any genuine saving at all once you consider all the extra plumbing and glue needed to make this approach work? Paint me as extremely sceptical, in phosphorescent radium paint in fact! :D
 
LLVM should be a start at covering those deficiencies. Abstracted high enough, the jumps to suitable hardware are plausible.
The hardware isn't suitable for this. It's not giving LLVM the paths to implement this. This is handwaving away a host of thornier problems than the purportedly excessive thickness of trying to get draw calls to a DX11 GPU.

They don't need to move the whole context, though. Just treat them as separate threads and synchronize with limited communication, through atomics or a suitable locking mechanism. The workloads where SIMD acceleration is involved are relatively large.
What mechanisms are involved?
Zen's synchronization latencies range from 30 to over 100ns, depending on how far the traffic must travel within the chip.
This leverages the cache coherence protocol, strong memory model, and robust memory pipeline throughout the hierarchy.
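For a rough sense of where such numbers come from, the usual measurement is a ping-pong between two threads; a minimal, hypothetical harness (core placement is left to the OS scheduler, so pin the threads for stable results):

#include <atomic>
#include <chrono>
#include <cstdio>
#include <thread>

int main() {
    std::atomic<int> flag{0};
    constexpr int iters = 1000000;

    std::thread t([&] {
        for (int i = 0; i < iters; ++i) {
            while (flag.load(std::memory_order_acquire) != 1) {}  // wait for ping
            flag.store(0, std::memory_order_release);             // send pong
        }
    });

    auto start = std::chrono::steady_clock::now();
    for (int i = 0; i < iters; ++i) {
        flag.store(1, std::memory_order_release);                 // ping
        while (flag.load(std::memory_order_acquire) != 0) {}      // wait for pong
    }
    auto end = std::chrono::steady_clock::now();
    t.join();

    double ns = std::chrono::duration<double, std::nano>(end - start).count() / iters;
    std::printf("core-to-core round trip: %.1f ns\n", ns);
}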

CUs are barely coherent, and it requires uncached memory traffic, heavy cache invalidation, or interrupts to hop between the CPU and GPU domain. HSA doesn't resolve this, and assumes the shader-like execution and coordination model.

They are still based around PCIE last I checked and that's the same for Infinity. Development still underway, but AMD possessing CPU and accelerator may have pushed ahead.
Infinity Fabric is a superset of Hypertransport. The transmission layer is one level of the implementation of a protocol, and xGMI is a case where the transmission lines can be used to carry out the needs of the higher-level elements of the protocol. The specific wire configuration and signalling scheme are abstracted by design from the higher-level elements and the non-link elements of the fabric. Using copper traces on a PCB isn't what creates coherent user-space memory.
 
The difference is, the GPU CUs are across a big chip, or potentially on a different chip entirely. So you'll have to shuffle a bunch of data over there, and then shuffle a bunch of result data back again.
Why would you need to shuffle data in a unified memory model? The data should be rather stratified, with the bulky arrays not being shared anyway; along the lines of the CPU dealing with pointers and the GPU with the referenced data. Most tasks involving a SIMD are likely to flood or bypass a cache anyway, as you'll encounter arrays that exceed the cache with possibly little to no reuse.

CUs don't run AVX code, do they? So there has to be ISA transcoding hardware shoved in there too; more latency there.
They could; it just depends on how well those 512-bit registers (16x32-bit SIMD lanes) map to a 16-lane SIMD for vector operations. I'll take a look at all the AVX instructions later, but my cursory understanding is that it's fairly straightforward.
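To make the lane-count point concrete, here is what a single 512-bit operation looks like with the standard intrinsics; the GCN half of the mapping is the speculative part, so it stays in a comment:

#include <immintrin.h>  // compile with -mavx512f

// One AVX-512 instruction operates on 16 packed 32-bit floats -- the same width as
// a 16-lane GCN SIMD, which is why the mapping looks straightforward on paper.
void add16(float* out, const float* a, const float* b) {
    __m512 va = _mm512_loadu_ps(a);                  // 16 x 32-bit lanes
    __m512 vb = _mm512_loadu_ps(b);
    _mm512_storeu_ps(out, _mm512_add_ps(va, vb));    // a single vaddps on zmm registers
}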

Also, there's no mechanism, from what I understand, for interrupting a working CU and handing it new work, so you'd have to wait for it to finish a potentially lengthy shader program before it'll start chewing through your data. Even if you can interrupt it, you'd have to wait for cache flushes and whatnot, and if the GPU L2 is packed to the rafters, with memory latencies being what they are on GPUs (pretty horrendous), that'd probably be a hefty delay. We'd already be finished by now on the CPU AVX unit, and the GPU hasn't even started chewing through our data! :p
A CU can preempt and save/restore if desirable. GPU base memory latencies aren't all that much worse than a CPU's; it's just a question of caching, with the CPU attempting to predict what is needed to keep those latencies low. For a SIMD there will be less branching or even caching, due to the nature of the data: the dataset simply exceeds the cache in many cases or exhibits no reuse. Take an audio buffer, for example; manipulate it and move on.

So no, I remain rather unconvinced by this approach. Also, power: punting data back and forth has big implications these days as far as power use is concerned.
Again, most of that data shouldn't need to move or should already be in cache. It'd be on par with the CPU trying to manipulate framebuffer data.

What mechanisms are involved?
Zen's synchronization latencies range from 30 to over 100ns, depending on how far the traffic must travel within the chip.
...
CUs are barely coherent, and it requires uncached memory traffic, heavy cache invalidation, or interrupts to hop between the CPU and GPU domain. HSA doesn't resolve this, and assumes the shader-like execution and coordination model.
http://spectrum.ieee.org/semiconductors/processors/breaking-the-multicore-bottleneck

Not that far off from what GPUs have been doing, but with hardware queues to manage thread synchronization. At that point CPU and GPU threads could run independently with only the hardware needing to communicate. An ACE could probably be programmed to do that; no idea on the Zen equivalent.

Infinity Fabric is a superset of Hypertransport. The transmission layer is one level of the implementation of a protocol, and xGMI is a case where the transmission lines can be used to carry out the needs of the higher-level elements of the protocol. The specific wire configuration and signalling scheme are abstracted by design from the higher-level elements and the non-link elements of the fabric. Using copper traces on a PCB isn't what creates coherent user-space memory.
Wires aside, AMD would've been in a position to push their own protocol without waiting on a consortium to reach a conclusion. I'd imagine they adopt whatever the group decides in the future. Infinity, however, seems tied to PCIe revisions from a hardware standpoint. The protocol they would be able to standardize internally much faster, and their CPU and GPU already understand x86 addressing, which should simplify the engineering.
 
Why would you need to shuffle data in a unified memory model?
I'll answer your question with another question: what good is FPU data doing sitting over on the GPU? Once it has been processed by the GPU, you'd often want to process it further on the CPU. You don't send a bunch of float data through an FPU (whether it is on a CPU or a GPU) only to immediately throw away the results. Not every data processing task is going to be streaming a huge amount of data directly from and back to main memory...

Anyhow, any new hardware approach to enable replacing FPUs with GPUs that involves a paradigm shift in how we write software is going to be - pardon the French - fucking dead on arrival. Quad-core CPUs first appeared over ten years ago, and it's taken us until now to have everyday (in most cases this reads "games") software that reliably takes advantage of more cores than that.

We can't wait another ten years or more for shiny new FPU-less AMD CPUs with GPU offload to start performing as well as they possibly can, or even nearly as well as the competitor's offerings. AMD can't wait that long; they'd go under relying on a product like that. Ryzen is as big a hit now as it is because it targets software as it exists NOW, instead of going off on a crazy exotic tangent into the great blue beyond. And how are you going to make Intel get on this bandwagon? It's not in their interest. Anyone using Intel's compiler is going to get Intel-approved code spitting out the other end.

They could; it just depends on how well those 512-bit registers (16x32-bit SIMD lanes) map to a 16-lane SIMD for vector operations.
Yes they could, but would they? Even Intel itself probably doesn't use the same opcodes for its GPU shaders as for its AVX units. Does the AVX ISA fit graphics processing needs? Is the AVX instruction set inclusive of everything a GPU needs? Is it overly burdened with stuff CUs don't need? There's that aspect to consider as well!
 
A CU can preempt and save/restore if desirable. GPU base memory latencies aren't all that much worse than a CPU's; it's just a question of caching.
Average latencies for a CPU are likely in the tens of ns, with trips to DRAM in the 70-140ns range, 100ns or below for a single-chip Ryzen with fast enough memory.

CU L1s have very low residency periods, when they aren't obligated to write back for something like coherent accesses. GPU memory latencies are measured in hundreds of GPU cycles, and the loaded latencies can probably push things to the order of magnitude higher range.


http://spectrum.ieee.org/semiconductors/processors/breaking-the-multicore-bottleneck

Not that far off from what GPUs have been doing, but with hardware queues to manage thread synchronization.
I've noted who is involved in that research, and why they wouldn't care to let HSA and AMD in particular use it.
I'd prefer a more direct paper or other reference. They're focused on data, not kernel command movement, which HSA's AQL is about. It seems more like a controlled push model for a subset of memory traffic, which for compute GPUs have not been doing.

Wires aside, AMD would've been in a position to push their own protocol without waiting on a consortium to reach a conclusion. I'd imagine they adopt whatever the group decides in the future. Infinity, however, seems tied to PCIe revisions from a hardware standpoint.

Zen has PCIe interfaces that it can switch to xGMI; it also has multiple on-package GMI links, which are not PCIe, for EPYC integration. The transmission path isn't limiting the protocol or fabric.
 
Apparently the "dummy dies" in ThreadRipper are in fact real dies: https://videocardz.com/72555/there-are-no-dummy-dies-in-ryzen-threadripper

Why? I guess there must be something about the design that makes those dies necessary for the package to work at all. But it seems incredibly wasteful.

This leaves uncertainty as to what the situation is, given that outlets like ExtremeTech and OC3D relayed communications with AMD indicating they were inserts.

I would assume from a mechanical standpoint that there would be no closer match to a Zen die than another silicon die (and a Zen die to be extra specific), if one wanted to eliminate any possible mismatches. It's also potentially cheaper to reach into the bin of discarded dies rather than slicing up a $7k wafer just for that.

I suppose other items to check are whether the dead dies could have been conceivably active at some point, or if the package is set up such that they are cut off.
 
I'll answer your question with another question: what good is FPU data doing sitting over on the GPU? Once it has been processed by the GPU, you'd often want to process it further on the CPU. You don't send a bunch of float data through an FPU (whether it is on a CPU or a GPU) only to immediately throw away the results. Not every data processing task is going to be streaming a huge amount of data directly from and back to main memory...
With unified memory, "on the GPU" is a bit abstract. This would be vector data, not all FPU data. I can't think of very many cases where you would bulk-process data with a vector unit and then directly interact with the CPU, barring some sort of reduction. Vector processing more often than not is filling buffers to stream out. In the case of Ryzen, AVX2 is supported, but not AVX512, the latter being SIMD-sized bulk processing: data which could reside in separate pools, as the associated processor is likely the only one to use it. Filling buffers really doesn't require the CPU, just bandwidth and ILP. The real work for a CPU is addressing and logic, not heavy processing. All that prediction hardware and caching is useless when linearly processing giant buffers.

Anyhow, any new hardware approach to enable replacing FPUs with GPUs that involves a paradigm shift in how we write software is going to be - pardon the French - fucking dead on arrival. Quad-core CPUs first appeared over ten years ago, and it's taken us until now to have everyday (in most cases this reads "games") software that reliably takes advantage of more cores than that.
I'm not suggesting replacing the FPUs, but the larger vector units. The CPU would still process floating-point data. Vector instructions and data tend to live in their own world, with the CPU only accessing that data to make a decision in some instances. As mentioned above, said data is generally streamed somewhere.

And how are you going to make Intel get on this bandwagon? It's not in their interest. Anyone using Intel's compiler is going to get Intel-approved code spitting out the other end.
It's already in LLVM, and Intel's compiler isn't the only option.

Yes they could, but would they? Even Intel itself probably doesn't use the same opcodes for its GPU shaders as for its AVX units. Does the AVX ISA fit graphics processing needs? Is the AVX instruction set inclusive of everything a GPU needs? Is it overly burdened with stuff CUs don't need? There's that aspect to consider as well!
The opcodes could be translated easily enough in hardware, and the GPU's ISA is likely a superset of the instructions. AVX instructions would be wave-level instructions to a GPU, not unlike tensor cores with Nvidia. With an FPGA or fixed logic the conversion could be handled easily enough; it's not difficult to turn a 5 into a 9 with binary logic. It would be up to the compiler to support the different target, and there has been work on supporting multiple paths through the compiler. I'd have to go find a reference for that, but it's somewhat recent as of this year.

I checked the instructions in AVX and nearly all of them would translate directly to GCN ISA; the permute instructions would be a bit different. The easier solution, to start, would be to compile the vector instructions into GCN ISA as a parallel thread.
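My own rough pairing of a few of them, just to show the flavor of such a translation (this is not an official mapping, and the permute row is exactly where it gets awkward):

// One 512-bit AVX op over 16 floats vs. one GCN vector-ALU op across 16 SIMD lanes.
struct IsaPair { const char* avx512; const char* gcn; };
constexpr IsaPair kRoughMapping[] = {
    {"vaddps zmm",      "v_add_f32"},
    {"vmulps zmm",      "v_mul_f32"},
    {"vfmadd231ps zmm", "v_fma_f32"},
    {"vmaxps zmm",      "v_max_f32"},
    {"vpermps zmm",     "ds_bpermute_b32"},  // cross-lane shuffle: no clean 1:1 equivalent
};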

CU L1s have very low residency periods, when they aren't obligated to write back for something like coherent accesses. GPU memory latencies are measured in hundreds of GPU cycles, and the loaded latencies can probably push things to the order of magnitude higher range.
Part of that is down to the pipelining. Take HBM2 with pseudo-channel mode and the accompanying latency and power drop, then consider that the workload will mirror the current parallel one. When dealing with vectors on a GPU they will be sufficiently large that the overlapped latency is less of an issue. They don't need the lower latency, as they are focused on throughput, and graphics memory isn't all that different from system memory. The model will entail heterogeneous memory: low-latency, low-bandwidth alongside high-latency, high-bandwidth.

I've noted who is involved in that research, and why they wouldn't care to let HSA and AMD in particular use it.
AMD is on the board of OpenCAPI. Doesn't mean they can't play politics with it, but OpenCAPI should be agnostic of the HSA stack. Shared memory and low latency transactions are all that's required.

I'd prefer a more direct paper or other reference. They're focused on data, not kernel command movement, which HSA's AQL is about. It seems more like a controlled push model for a subset of memory traffic, which for compute GPUs have not been doing.
I'm not suggesting moving commands, but altering the threading model slightly. Not unlike how GPUs currently work, but with the CPU side waiting and more efficient hardware synchronization mechanisms added: mechanisms that operate on bit fields and significantly reduce the overhead of locking and synchronization.
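As a software-level analogy of what a bit-field mechanism like that could look like (purely my illustration, not the hardware being proposed):

#include <atomic>
#include <cstdint>

// Each worker sets one bit when its chunk is finished; the waiter polls a single
// 64-bit word instead of taking a lock per chunk.
std::atomic<std::uint64_t> done_mask{0};

void worker_finished(unsigned worker_id) {              // called once per finished chunk
    done_mask.fetch_or(1ull << worker_id, std::memory_order_release);
}

bool all_finished(unsigned num_workers) {               // cheap check on the waiting side
    const std::uint64_t all =
        (num_workers >= 64) ? ~0ull : ((1ull << num_workers) - 1);
    return (done_mask.load(std::memory_order_acquire) & all) == all;
}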

Zen has PCIe interfaces that it can switch to xGMI; it also has multiple on-package GMI links, which are not PCIe, for EPYC integration. The transmission path isn't limiting the protocol or fabric.
All I'm saying is that the signaling standards are the same, with Infinity being a superset of whatever PCIe standard gets adopted. In current chips this means the signals can be driven faster thanks to guarantees of shorter distances.
 