AMD RyZen CPU Architecture for 2017

3dilettante · Nov 30, 2016

I.S.T. said:
So, yeah, that has little to do with Intel dumbassness and more to do with Intel... not QAing enough.

It's definitely a very complex thing to implement, and possibly a point where Intel might have been served by being even more restrictive in what SKUs got TSX, or relegating it to a specific enterprise or customer set. The rumor that Skylake with AVX-512 might be deploying to specific customers might be how large changes with potentially large blowback due to errata could be handled.

Seriously, though. With this many issues, TSX has to be insanely complex to implement in the hardware.

Speculation within the core, particularly at the level of Intel's processors, is massively complex.
A high-performance cache subsystem is very complex.
The memory model implemented, the systems that depend on all of it, sharing/updating/interrupts, are complicated.
Making all of that appear to be handled atomically across multiple internal pipelines, which breaks some of the abstraction that isolated them from one another, is complex.

It's also the case where I think the fallback to microcode that is often used to protect against errata might have reached a limit.
Buggy instructions or problems within the OoO engine can be replaced by an equivalent sequence of uops by the microcode engine, but if the bug was outside of the core and involved something becoming externally visible, it might have been outside the reach of a fix.

Something like a timing issue with an interrupt, cache transition, power state change, snoop, or eviction, or a failure to update the Bloom filter that Intel possibly uses to let the read set spill out of the L1, could allow TSX transactions to appear non-atomic or to miss a really badly timed update that should have aborted them.
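
To make the Bloom-filter point concrete, here is a minimal sketch of how a read-set signature could behave. Everything in it (the `ReadSetSignature` name, the 64-bit size, the two hash functions, the 64-byte line) is an assumption for illustration, not a description of Intel's hardware. The property that matters is that such a filter can report conflicts that aren't real, but it never misses a line that was actually inserted, so the dangerous case would be an insert that failed to happen.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical read-set signature: a 64-bit Bloom filter over cache-line addresses.
struct ReadSetSignature {
    std::uint64_t bits = 0;

    // Assumed 64-byte cache lines.
    static std::uint64_t line_of(std::uint64_t addr) { return addr >> 6; }

    // Two cheap, made-up hash functions that each pick one of 64 bit positions.
    static unsigned h1(std::uint64_t line) { return static_cast<unsigned>(line % 64); }
    static unsigned h2(std::uint64_t line) {
        return static_cast<unsigned>((line * 0x9E3779B97F4A7C15ull) >> 58);
    }

    // Would be called when a line read by the transaction leaves the L1.
    void insert(std::uint64_t addr) {
        std::uint64_t line = line_of(addr);
        bits |= (1ull << h1(line)) | (1ull << h2(line));
    }

    // Would be called for an incoming invalidating snoop; true means "abort".
    bool may_contain(std::uint64_t addr) const {
        std::uint64_t line = line_of(addr);
        std::uint64_t mask = (1ull << h1(line)) | (1ull << h2(line));
        return (bits & mask) == mask;
    }
};
```

If `insert` is skipped for a spilled line because of some badly timed event, `may_contain` can return false for a line the transaction really did read, and a conflicting write would go unnoticed.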

AMD's TLB bug is a case where a badly timed cache transition could push out stale page data whose integrity is required for the functioning of the system, and which was effectively game-over for any environment that hit it.
And that was something that just involved an L2/L3 cache eviction for standard TLB functionality.

sebbbi said:
It is complex. IIRC AMD tried to bring similar extensions to Bulldozer, but no luck either. And no rumors that Zen is going to support it either.

AMD had a paper detailing a proposal for its Advanced Synchronization Facility, but no sign of that particular idea getting as far as Bulldozer.
I believe comp.arch had a discussion from someone involved in that proposal indicating that some of AMD's internal evaluations that might not have made it near Bulldozer had already walked back some of the provisions for ASF.

ASF did allow for more feedback on transactional state and failure counts/modes, which could allow for more intelligent handling and performance gains that a simple abort-only transactional model could not achieve. I hope I can find the discussion on it, since it seems like the walk-back on some of the paper's proposals caused it to lose some of its hoped-for improvements. However, that can also point to how something can look compelling from a conceptual or algorithmic standpoint but run into a reality where an implementation proves to be beyond the means of the manufacturer, or winds up compromising other things so much that it's a Pyrrhic victory. For example, for all the risk taken to implement it, its gains are still variable, and it is a form of speculation that trades heavily on memory transfers, which means more speculation on operations that come with a higher power cost than mis-speculated in-core ALU ops.

Possibly, that work was contemporaneous with one of the projects cancelled before Bulldozer. There are oddities to Bulldozer's memory subsystem and unexplained penalties in multithreaded scenarios that might point to there being "things" that could have been tried or hoped for, but the product was pretty much a "well, we've got to ship something" where most of the reasons for doing CMT were discarded and the penalties from it could not be fully mitigated.

IBM has great tech in this field. More complex memory versioning and huge EDRAM based LLC. IIRC Intel's implementation is L1 cache only, so transaction size must be very small. IIRC IBM also uses their tech for speculative execution. Way ahead of Intel.

IBM's also been willing to deploy architectures with transactional memory earlier, and presumably is more willing to adjust its architectures for more esoteric design points and memory subsystems than Intel was able/willing to do for a core that had to span so many markets. In addition, IBM has experience with an internally more baroque multiprocessor system and cache architecture (many disclosed sharing states and, if I recall correctly, hardware/system recovery for transitions that lead to failure).
It also likely helps that IBM has so much control over the platform and software stack, since it gives them more levers to pull should corner cases arise.

3dilettante · Nov 30, 2016

Sorry for the double-post, but I was able to google an archive of the discussion about ASF I referenced.
http://yarchive.net/comp/atomic_transactions.html

sebbbi · Nov 30, 2016

3dilettante said:
Sorry for the double-post, but I was able to google an archive of the discussion about ASF I referenced.
http://yarchive.net/comp/atomic_transactions.html

Great, they mention CompareTwoSwapTwo

Some years ago I did research on parallel data structure generation on GPUs, and came to the conclusion that if I had CompareTwoSwapTwo everything would be much easier. It is the minimal required atomic transaction. And a surprisingly powerful one. Almost as good as TSX. Please give me this instruction on both CPUs and GPUs.
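
For readers who haven't met CompareTwoSwapTwo (DCAS), a minimal sketch of its semantics follows. This is a lock-based reference model, not a hardware instruction: the single mutex stands in for the atomicity the hardware would have to provide, and other threads would need to read the words under the same lock. The names `dcas`, `Node`, and `insert_between` are hypothetical. The linked-list insert shows why two words are the interesting minimum: both neighbours' links have to change together.

```cpp
#include <cassert>
#include <cstdint>
#include <mutex>

// One global lock models the atomicity a real DCAS would provide in hardware.
std::mutex& dcas_mutex() {
    static std::mutex m;
    return m;
}

// Compare two words and, only if both match, store two new values.
// Returns true on success; on failure nothing is written.
bool dcas(std::uint64_t* a, std::uint64_t expect_a, std::uint64_t new_a,
          std::uint64_t* b, std::uint64_t expect_b, std::uint64_t new_b) {
    std::lock_guard<std::mutex> guard(dcas_mutex());
    if (*a != expect_a || *b != expect_b) return false;
    *a = new_a;
    *b = new_b;
    return true;
}

// Doubly linked list stored in an array, with indices used as links.
struct Node {
    std::uint64_t prev;
    std::uint64_t next;
};

// Insert nodes[n] between nodes[p] and its expected successor nodes[s].
// Fails without side effects on the list if p and s are no longer adjacent.
bool insert_between(Node* nodes, std::uint64_t p, std::uint64_t n, std::uint64_t s) {
    nodes[n].prev = p;
    nodes[n].next = s;
    return dcas(&nodes[p].next, s, n, &nodes[s].prev, p, n);
}
```

With single-word CAS the same insert needs a helper scheme or marker bits to cover the window where only one of the two links has been updated.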

I.S.T. · Nov 30, 2016

3dilettante said:
It's definitely a very complex thing to implement, and possibly a point where Intel might have been served by being even more restrictive in what SKUs got TSX, or relegating it to a specific enterprise or customer set. The rumor that Skylake with AVX-512 might be deploying to specific customers might be how large changes with potentially large blowback due to errata could be handled.

Just to clarify with regards to this part of your reply: my comment was intended to be an objective statement rather than an opinion. Something like this is very hard to QA pre-launch compared to a lot of the other things Intel does. Thus, not enough. I don't blame them, considering how hard this is to fix. Sometimes, shit just goes wrong and it's not your fault or anyone's fault per se. I feel this is one of those cases.

The rest of your reply as well as sebbbi's were quite informative, though. Thank you both.

xEx · Nov 30, 2016

sebbbi said:
It is complex. IIRC AMD tried to bring similar extensions to Bulldozer, but no luck either. And no rumors that Zen is going to support it either. IBM has great tech in this field. More complex memory versioning and huge EDRAM based LLC. IIRC Intel's implementation is L1 cache only, so transaction size must be very small. IIRC IBM also uses their tech for speculative execution. Way ahead of Intel.

Why can't AMD license the tech? AMD and IBM don't compete, or at least they could reach an agreement about it.

3dilettante · Nov 30, 2016

sebbbi said:
Great, they mention CompareTwoSwapTwo

Some years ago I did research on parallel data structure generation on GPUs, and came to the conclusion that if I had CompareTwoSwapTwo everything would be much easier. It is the minimal required atomic transaction. And a surprisingly powerful one. Almost as good as TSX. Please give me this instruction on both CPUs and GPUs.

The discussion does mention how difficult it can be to maintain ownership of two separate cache lines even for a handful of cycles, while maintaining the illusion of atomicity. Also, in the case of CPUs and GPUs having it, it might be even more of a problem if that were to happen in the context of an APU. Extended atomic functionality or transactions already threaten to break a number of illusions the CPU, cache, uncore, and data fabric use to hide a lot of their internal complexity. GPUs are an even weaker abstraction with an even more messy internal structure.
A DCAS situation that somehow manages to get a CPU and GPU contending with a deep divide in latency, architecture, and safety could be an even greater nightmare.

xEx said:
Why can't AMD license the tech? AMD and IBM don't compete, or at least they could reach an agreement about it.

It's conceptually possible, but it wouldn't be up to AMD. Also, it would be more about licensing specific elements rather than the overall implementation unless AMD creates copy-paste Z or POWER core clusters.
The overall concept is not owned by IBM, but its specific implementation is also tailored to IBM's needs, system stack, and architectural features, vast portions of which may not make sense with the architecture AMD has.

Also, the only reason AMD doesn't seem to compete is that it has virtually no presence in servers any longer--which Zen better change.
IBM is trying to extend Power down, and if Zen manages to do anything like get server presence and multi-socket capability, the possibility of competing exists.

sebbbi · Dec 1, 2016

3dilettante said:
The discussion does mention how difficult it can be to maintain ownership of two separate cache lines even for a handful of cycles, while maintaining the illusion of atomicity. Also, in the case of CPUs and GPUs having it, it might be even more of a problem if that were to happen in the context of an APU. Extended atomic functionality or transactions already threaten to break a number of illusions the CPU, cache, uncore, and data fabric use to hide a lot of their internal complexity. GPUs are an even weaker abstraction with an even more messy internal structure.
A DCAS situation that somehow manages to get a CPU and GPU contending with a deep divide in latency, architecture, and safety could be an even greater nightmare.

I am not personally interested in doing DCAS over CPU<->GPU coherent memory. I am talking about pure GPU compute only (just "garlic" bus transactions in AMD terminology). Currently it is difficult to generate and update complex data structures efficiently on GPU. 64 bit DCAS would make this both easier and faster. I am aware that full blown TSX (L1 cache) wouldn't work well on GPUs, since there are up to 40 waves (64 warps on Nvidia) fighting over the same L1 cache -> L1 cache line lifetime is very short. But DCAS is just a single instruction that modifies two cache lines. Much less risk of eviction than full blown TSX that might modify dozens of cache lines with dozens of instructions.

GPU manufacturers have recently added good primitives to help build massively parallel algorithms. Intel's PixelSync was adopted in DirectX 12 as Rasterizer Ordered Views. Nvidia has already adopted it, and I am sure AMD will follow suit (as it is a requirement for DX12.1). AMD GCN also has a similar feature to enforce scheduling order in compute shaders (DS_ORDERED_COUNT). It allows you to build a global prefix sum inside a single shader (no multipass through memory required). But these features limit scheduling. DCAS allows parallel programming with no synchronization. Obviously hardware needs to sometimes serialize two DCAS atomics (like it does with any atomics), but the data structures generated/updated by GPU tend to be huge, so there's a very small chance of a collision.
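
The single-shader prefix sum falls out of one property: if every wave performs an add-and-return-old-value on a shared counter, and those adds are applied in wave order, the value each wave gets back is its exclusive prefix. Below is a minimal CPU-side sketch of that idea. The plain loop stands in for the ordering the hardware would enforce, and `ordered_count` and `exclusive_offsets` are hypothetical names, not the interface of the GCN instruction.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>
#include <vector>

// Stand-in for an ordered counter op: add `value`, return the count before the add.
std::uint32_t ordered_count(std::atomic<std::uint32_t>& counter, std::uint32_t value) {
    return counter.fetch_add(value, std::memory_order_relaxed);
}

// wave_totals[i] is the number of items wave i wants to write.
// Returns the output offset at which each wave starts writing.
std::vector<std::uint32_t> exclusive_offsets(const std::vector<std::uint32_t>& wave_totals) {
    std::atomic<std::uint32_t> counter{0};
    std::vector<std::uint32_t> offsets;
    offsets.reserve(wave_totals.size());
    // The loop order models the wave order that the hardware would guarantee.
    for (std::uint32_t total : wave_totals) {
        offsets.push_back(ordered_count(counter, total));
    }
    return offsets;
}
```

Without the ordering guarantee the same `fetch_add` still hands out non-overlapping ranges, but their order depends on scheduling, so the result is an allocation rather than a prefix sum.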

I would also like everybody to support DS_ORDERED_COUNT. But this isn't easy either. A groupshared memory version of ordered count would also be awesome to have. Much more manageable to implement (basically a group sync barrier + an atomic that must be executed in wave order). But Kepler didn't even have hardware local memory atomics (software emulated - around 3x slower than GCN/Maxwell), so my hopes on this aren't really high. Hardware design seems to be harder than I think.

AlexV · Dec 1, 2016

sebbbi said:
It allows you to build global prefix sum inside a single shader (no multipass through memory required).

This can be done now, without DCAS support at the cost of some extra storage (not a ton of it though).

sebbbi · Dec 1, 2016

AlexV said:
This can be done now, without DCAS support at the cost of some extra storage (not a ton of it though).

This sentence was about DS_ORDERED_COUNT. I haven't even considered using DCAS for prefix sums.

AlexV · Dec 1, 2016

sebbbi said:
This sentence was about DS_ORDERED_COUNT. I haven't even considered using DCAS for prefix sums.

Apologies, I misread! Having said that, the underlying idea is still valid, I think: you can do single-pass prefix sums without DS_ORDERED_COUNT, in a relatively portable fashion, using only API level primitives, with minimal extra space. Which, while interesting, is probably out of scope for this thread, so I apologise for the derailment.

3dilettante · Dec 1, 2016

sebbbi said:
I am not personally interested in doing DCAS over CPU<->GPU coherent memory. I am talking about pure GPU compute only (just "garlic" bus transactions in AMD terminology). Currently it is difficult to generate and update complex data structures efficiently on GPU. 64 bit DCAS would make this both easier and faster. I am aware that full blown TSX (L1 cache) wouldn't work well on GPUs, since there are up to 40 waves (64 warps on Nvidia) fighting over the same L1 cache -> L1 cache line lifetime is very short. But DCAS is just a single instruction that modifies two cache lines. Much less risk of eviction than full blown TSX that might modify dozens of cache lines with dozens of instructions.

The way TSX works, using the L1, wouldn't work for GPUs like GCN because TSX monitors for snoops of its read and write set, and GCN's L1s are incoherent. GCN relies on the partitioned L2, where the atomic units work at a maximum granularity of 1 cache line and are limited to their local slice.
Existing atomic infrastructure would be insufficient since the L2 slices and their associated units work independently and do not currently have a need to defer or buffer for a window as long as the time it might take to independently load, check, coordinate, and write two L2 slice lines.

edit: Actually, even hitting one slice is going to require changes, since something would need to maintain atomicity over 2 cycles unless the L2 slice gains the ability to process double the number of atomics on two lines simultaneously.

DCAS allows parallel programming with no synchronization. Obviously hardware needs to sometimes serialize two DCAS atomics (like it does with any atomics), but the data structures generated/updated by GPU tend to be huge, so there's a very small chance of a collision.

If it's a DCAS in the same vein as in the ASF discussion, it's monitoring with respect to any accesses or system actions (evictions, invalidations, errors, GPU hardware events, etc.) that hit either location--generally the cache line if the scheme operates through the cache hierarchy.
It wouldn't be atomic if any observer could read a half-completed DCAS, or if one half of the DCAS commits and a write occurs. Having atomics that only hit one cache line helps maintain the illusion that a lot of this is actually happening simultaneously.
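
The half-completed case can be shown with a deterministic toy model: two independent "slices", each holding one word, and a DCAS that is applied as two separate single-slice writes. All names here are hypothetical and nothing models real GCN timing. The point is only that a read serviced between the two writes returns a combination that no atomic history allows.

```cpp
#include <cassert>
#include <cstdint>
#include <utility>

// One word per slice; each slice can only update its own word.
struct Slice {
    std::uint64_t word = 0;
};

struct TwoSliceMemory {
    Slice s0, s1;
    // What an observer reading both locations would get back right now.
    std::pair<std::uint64_t, std::uint64_t> observe() const {
        return {s0.word, s1.word};
    }
};

// A non-atomic "DCAS" that commits one slice at a time.
// `observed_mid` captures what another reader would see between the two commits.
bool split_dcas(TwoSliceMemory& m, std::uint64_t expect, std::uint64_t desired,
                std::pair<std::uint64_t, std::uint64_t>& observed_mid) {
    if (m.s0.word != expect || m.s1.word != expect) return false;
    m.s0.word = desired;         // first half commits
    observed_mid = m.observe();  // a read serviced here is torn
    m.s1.word = desired;         // second half commits
    return true;
}
```

Starting from (0, 0) and swapping in 7 for both words, the mid-point observation is (7, 0), which is neither the old state nor the new one.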

sebbbi · Dec 4, 2016

3dilettante said:
The way TSX works, using the L1, wouldn't work for GPUs like GCN because TSX monitors for snoops of its read and write set, and GCN's L1s are incoherent. GCN relies on the partitioned L2, where the atomic units work at a maximum granularity of 1 cache line and are limited to their local slice.
Existing atomic infrastructure would be insufficient since the L2 slices and their associated units work independently and do not currently have a need to defer or buffer for a window as long as the time it might take to independently load, check, coordinate, and write two L2 slice lines.

Yes. GCN handles atomics in L2 cache. I mentioned L1 because Intel TSX is L1 based, and L1 is definitely too small for TSX style transactions on GPU.

As a GPU programmer, I want better atomic primitives for modifying complex data structures. GPUs already have super fast (local and global) atomics compared to CPUs. IHVs know that atomics on GPUs are very important. Similarly, some kind of fine-grained memory transactions would be much more important for GPUs than for CPUs. GPUs require highly parallel code to run fast. DCAS would be far simpler to implement than real transactions, but would already get us a long way in the right direction.

pTmdfx · Dec 5, 2016

3dilettante said:
The way TSX works, using the L1, wouldn't work for GPUs like GCN because TSX monitors for snoops of its read and write set, and GCN's L1s are incoherent. GCN relies on the partitioned L2, where the atomic units work at a maximum granularity of 1 cache line and are limited to their local slice.
Existing atomic infrastructure would be insufficient since the L2 slices and their associated units work independently and do not currently have a need to defer or buffer for a window as long as the time it might take to independently load, check, coordinate, and write two L2 slice lines.

edit: Actually, even hitting one slice is going to require changes, since something would need to maintain atomicity over 2 cycles unless the L2 slice gains the ability to process double the number of atomics on two lines simultaneously.

But at least the GPU L2 cache as a whole is appearing as a single point of coherency. So in theory, you could forward these requests to a separate path in the black box.

3dilettante · Dec 5, 2016

sebbbi said:
Yes. GCN handles atomics in L2 cache. I mentioned L1 because Intel TSX is L1 based, and L1 is definitely too small for TSX style transactions on GPU.

The size is insufficient given the thread count, and the probability of hitting something that would abort the transaction. In GCN's case, it would be because anything coherent must miss the L1 and the L2 is automatically globally visible.
TSX may partly use the L2 if the read set spills (possibly the L3), so it is not totally limited to the L1. TSX is by design underpinned by the cache coherence protocol, which is where GCN's caches provide nothing, since they rely on a basic static line allocation to a channel to make "coherence" trivially fall out due to the inability to host a location in more than one globally visible cache.
An operation that could span two slices would find it has nothing to support it in the architecture.
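
As a rough illustration of why static line-to-channel allocation makes single-line atomics easy and two-line atomics hard, assume a simple interleaving where consecutive cache lines rotate across slices. The 64-byte line size, the 16-slice count, and the modulo mapping are assumptions for this sketch; nothing in this thread states GCN's actual address hash.

```cpp
#include <cassert>
#include <cstdint>

constexpr std::uint64_t kLineBytes = 64;  // assumed line size
constexpr std::uint64_t kNumSlices = 16;  // assumed slice count

// Every address has exactly one home slice, so no two caches can disagree about it.
constexpr std::uint64_t home_slice(std::uint64_t addr) {
    return (addr / kLineBytes) % kNumSlices;
}

// A single-line atomic never leaves its home slice; a two-address DCAS may.
constexpr bool dcas_spans_slices(std::uint64_t addr_a, std::uint64_t addr_b) {
    return home_slice(addr_a) != home_slice(addr_b);
}
```

Under this mapping, two words in the same line always share a slice, while two adjacent lines never do, so a DCAS on neighbouring list nodes would routinely need two slices that currently have no way to coordinate.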

GPUs require highly parallel code to run fast. DCAS would be far simpler to implement than real transactions, but would already get us a long way in the right direction.

It seems even for complex CPUs that an intermediate step like DCAS is frequently skipped and more complex operations or transactional memory are implemented instead.
The largest investment seems to be getting to the point of being able to make more than one location updatable atomically, and once the core and cache architecture is redesigned, having only two doesn't seem to save much.

pTmdfx said:
But at least the GPU L2 cache as a whole is appearing as a single point of coherency. So in theory, you could forward these requests to a separate path in the black box.

The L2 appears to be a single point of coherence because no operation can affect more than one slice. Normal operations only occupy a slice for one cycle, which an atomic that needs to somehow juggle two separate locations will need to exceed.
Any accesses that might hit the locations in question need to be guarded against, which means even with a separate path something must be done to probe L2 slices and then to continue to monitor the cache subsystem in case anything tries to hit the same address. Then, something must make the new values immediately visible in the L2.
Some operations do selectively clear out data from the L2, like Sony's L2 with volatile flag for compute writeback. It involves a global memory stall and hundreds of cycles to resolve, which would be the opposite of what is desired.

For GPUs, there are proposals for stronger memory consistency being implemented with timed cache invalidations. Perhaps something like that or some other form of global coordination can devote a time window where the caches can be sure they won't be servicing requests or the requests belong to a different period of validity.

pTmdfx · Dec 5, 2016

3dilettante said:
Any accesses that might hit the locations in question need to be guarded against, which means even with a separate path something must be done to probe L2 slices and then to continue to monitor the cache subsystem in case anything tries to hit the same address. Then, something must make the new values immediately visible in the L2.

It is just a novel idea anyway. But if the L1-L2 network has the capability of broadcasting, you could very well solve the probing issue by broadcasting the DCAS to the affected slices and also the DCAS state machine. Then the L2 slices, each of which presumably maintains a FIFO request queue, would have the necessary information to block accesses to those locations, while allowing other memory accesses to other cache lines to jump the queue. The DCAS state machine can still rely on the L2 slices to read and write data for data consistency, reply to the CU and release the corresponding L2 queue blockages only if the operation has completed. It could be a bit worse than normal atomics, but should be similarly slow when the cache is missed anyway. The cost of such a data path is another issue though.

fehu · Dec 5, 2016

http://wccftech.com/amd-zen-cpu-potential-trademarks-spotted-ryzen-threadripper/

Thread the ripper

Anarchist4000 · Dec 5, 2016

Needs more X's in the name to make it edgier. AI based trademarks make more sense for a GPU than just Zen, and I doubt Zen+Polaris is the go to combination for AI.

I.S.T. · Dec 5, 2016

Anarchist4000 said:
Needs more X's in the name to make it edgier. AI based trademarks make more sense for a GPU than just Zen, and I doubt Zen+Polaris is the go to combination for AI.

xx_Zen_xx

3dilettante · Dec 5, 2016

pTmdfx said:
It is just a novel idea anyway. But if the L1-L2 network has the capability of broadcasting, you could very well solve the probing issue by broadcasting the DCAS to the affected slices and also the DCAS state machine.

From the descriptions of GCN's caching, it would be a change to allow it to broadcast an operation pertaining to another channel to unrelated slices. There's already an atomicity issue in that the DCAS needs two separate cache line broadcasts in a network that only supports one request from a client at a time. The timing for when all entities receive notification of the DCAS attempt is uncertain, and doing so in parallel also potentially leaves the transaction without a global context like a transaction ID, cycle count, or time stamp.
Broadcasts where there is an asymmetry in which L2 slices are being hit with one or both halves of in-progress DCAS operations also pose a threat of overcommitting one of the entities after one or more of the remainder have been locked-in. That could be handled with a flurry of other broadcasts or storing additional state so that a tentative operation can be rolled back.

The tenor of other schemes like TSX, ASF, and possibly IBM's method is some level of centralization or global data. TSX gathers the sets ahead of time and fails if something impinges on them via a snoop; ASF introduces an external observer of traffic and introduces a new way to respond to a snoop, and IBM's involves versioned memory, which I think implies something is coordinating what version is assigned to an access.

Then the L2 slices, each of which presumably maintains a FIFO request queue, would have the necessary information to block accesses to those locations, while allowing other memory accesses to other cache lines to jump the queue.

Is this a FIFO of all traffic, or just ones assumed to be relevant to an atomic?
I'm not sure where in the hierarchy there is an ordering in a general sense. At least between slices, a FIFO local to each slice would be insufficient since what becomes visible with respect to each FIFO is not universal if there's a mismatch in timing or occupancy in the queue--unless something tags the requests.
What does this do if there are already requests for that location in a FIFO when a DCAS hits the slice?

Blocking accesses is also a choice that needs evaluation, as is the choice of letting other accesses bypass a blocked location.
Currently, non-atomic accesses wouldn't have a lightweight method to register a failure or a retry message.
Since these operations are going through a long pipeline, does a new DCAS that hits a location that leapfrogged an earlier DCAS suddenly promote that location to the blocked list? Also, if this is not centralized or tagged, the determination may not be consistent across all the involved entities.

The DCAS state machine can still rely on the L2 slices to read and write data for data consistency, reply to the CU and release the corresponding L2 queue blockages only if the operation has completed.

It wouldn't just be the CU issuing the DCAS, but also any other observer CU that might be issuing reads to one or both of its locations.
What you wouldn't want is one slice being able to service an access prior to the DCAS blocking further attempts while a different slice makes the final DCAS visible for the observer CU.
It usually points to some kind of additional coordination or context for the transactions, and this is also not delving into hardware events or pipeline behaviors that might cause a problem in between the success/failure of each half of the operation.

Alexko · Dec 12, 2016

If I'm not mistaken, there's some kind of event tomorrow, but is there any indication about when reviews might be available?
