AMD RyZen CPU Architecture for 2017

Discussion in 'PC Industry' started by fellix, Oct 20, 2014.

  1. 3dilettante

    Legend Alpha

    Joined:
    Sep 15, 2003
    Messages:
    8,122
    Likes Received:
    2,873
    Location:
    Well within 3d
    My understanding is that the ROB is a pretty bare-bones structure in the most recent high-end cores with just tracking status.
    The allocation and rename stage is what would get the rename register and LS entry, and that stage is in-order and prior to the scheduling stage.
    How would the scheduler portion of the pipeline be able to block dependent loads from taking up space when their resources are allocated earlier in the in-order portion of the pipeline? Given that it's a pipeline and the back-pressure is unpredictable, it may also be a variable amount of time before it would be known if the first load missed the cache.
     
  2. Gubbi

    Veteran

    Joined:
    Feb 8, 2002
    Messages:
    3,528
    Likes Received:
    862
    The dependent loads take up space in the ROB (and have renamed register resources attached). The dependent loads aren't dispatched to an execution port until their operands are ready. ROBs are big nowadays, so they can sit around for a while without blocking anything.

    Without a ROB, you have per-execution-unit scheduling queues. So you replace the 200-entry ROB in your ten-execution-unit CPU with ten scheduling queues of 20 entries each. Instructions are now stuffed into these queues right after fetch/decode/rename. With two LS units, you are now limited to 40 loads/stores in flight at any one time, where before you were effectively limited only by the ROB size.

    So the scheduling queues need to be deep, because filling any one of them up will halt the machine.
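The capacity trade-off can be put in toy numbers (the 200-entry ROB, ten units, 20-entry queues, and two LS units are the post's hypothetical figures, not any shipping design):

```python
# Toy capacity comparison using the hypothetical figures from the post.
ROB_ENTRIES = 200      # unified ROB in the hypothetical CPU
NUM_EXEC_UNITS = 10    # replaced by ten per-unit scheduling queues
QUEUE_DEPTH = 20       # entries per scheduling queue
NUM_LS_UNITS = 2       # load/store units

# Unified ROB: any mix of instruction types can occupy all slots,
# so loads/stores in flight are bounded only by the ROB size.
unified_ls_limit = ROB_ENTRIES

# Distributed queues: un-executed loads/stores can only sit in the
# two LS-unit queues, so the bound drops to 2 x 20 = 40.
distributed_ls_limit = NUM_LS_UNITS * QUEUE_DEPTH

print(unified_ls_limit, distributed_ls_limit)  # 200 40
```

The total entry count is the same (10 x 20 = 200), but partitioning it per unit means one full queue stalls the front end even while the others have room.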

    Cheers
     
    sebbbi likes this.
  3. itsmydamnation

    Veteran Regular

    Joined:
    Apr 29, 2007
    Messages:
    1,298
    Likes Received:
    396
    Location:
    Australia
    I'm just a simpleton so you will have to humor me :shocked:

    Sounds like you think the mapper doesn't have any feedback loop from the execution units, or any intelligence. Why? If the mapper knows that a uop has stalled in a queue (or, better, doesn't issue to a queue until the data is in the PRF), it just doesn't schedule the dependent instructions; they sit in the retire queue/PRF until ready. The mapper is the interface between dispatch, the per-lane schedulers, and the retire queue, so it should have all the interfaces it needs to globally track that kind of thing.

    Breaking scheduling into two parts could just be about splitting a critical path across two stages to buy more time?
     
  4. Gubbi

    Veteran

    Joined:
    Feb 8, 2002
    Messages:
    3,528
    Likes Received:
    862
    What you're describing is global scheduling from a central ROB.

    The retirement queue doesn't dispatch instructions for execution; the individual scheduling queues for each execution unit do. The retirement queue is just a list where results from already-executed instructions sit until the unspeculated PC catches up. It does not know anything about data dependencies and thus cannot dispatch/schedule for execution.

    The scheduling queues are fed directly after rename. When an instruction is completed, it is removed from the queue and the result is saved in the retirement queue.

    The problem is when you have a sequence of instructions that cannot start execution because of data dependencies. They sit in the scheduling queues until the dependency is met. If you have a couple of linked-list traversals and accessing the head misses all caches, you can quickly fill the queues up.

    The mapper dishes out instructions to scheduling queues according to instruction type, probably filling least full queues first.
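A "fill the least full queue first" heuristic can be sketched minimally; the queue names and the uop-type mapping below are invented for illustration, not AMD's actual routing logic:

```python
# Hypothetical mapper heuristic: send each uop to the least-full queue
# among those whose execution unit handles its type. Queue names and
# the type mapping are illustrative assumptions.
QUEUES = {"alu0": [], "alu1": [], "agu0": [], "agu1": []}
CAN_HANDLE = {
    "add":   ["alu0", "alu1"],
    "load":  ["agu0", "agu1"],
    "store": ["agu0", "agu1"],
}

def dispatch(uop_type, uop):
    """Place the uop in the least-full eligible queue."""
    target = min(CAN_HANDLE[uop_type], key=lambda q: len(QUEUES[q]))
    QUEUES[target].append(uop)
    return target

dispatch("load", "ld1")  # both AGU queues empty, ties break to agu0
dispatch("load", "ld2")  # agu1 is now the least full
```

A real mapper would likely weigh more than occupancy (e.g. how many queued entries already have their data ready, as discussed below in the thread), but occupancy alone shows the load-balancing idea.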

    Cheers
     
  5. Gubbi

    Veteran

    Joined:
    Feb 8, 2002
    Messages:
    3,528
    Likes Received:
    862
    Think of the individual scheduling queues as really big reservation stations in old-school Tomasulo out-of-order execution (with the retirement queue added).

    Cheers
     
  6. itsmydamnation

    Veteran Regular

    Joined:
    Apr 29, 2007
    Messages:
    1,298
    Likes Received:
    396
    Location:
    Australia
    I'm not saying the ROB issues instructions (but the retirement queue is central; that's the way AMD depicts it). I'm saying that the mapper, when it receives uops, populates the retire queue with the uop and only issues to a scheduler when ready. The thing is, as far as I'm aware, AMD hasn't detailed how this works anywhere; there aren't any public white papers or patents, and they didn't detail it at Hot Chips. They do have some white papers/patents around a multi-level retirement queue system; there might be something relevant in there.

    AMD stated that the execution stage is fully out of order, so why do you think the mapper is just spray and pray?
     
  7. tunafish

    Regular

    Joined:
    Aug 19, 2011
    Messages:
    542
    Likes Received:
    171
    I might be confused, but as far as I understand it, on today's Intel CPUs instructions are dispatched to the scheduler strictly in order. Jumping over the dependent instructions to independent ones is the job of the scheduler, not some other structure. In this case the Intel scheduler would fill up just like the AMD ones.
     
  8. Gubbi

    Veteran

    Joined:
    Feb 8, 2002
    Messages:
    3,528
    Likes Received:
    862
    AFAICT the mapper just receives up to six instructions per cycle and stuffs them into scheduling queues. It might buffer a few cycles' worth of instructions, but it does not do any dependency checking; it strictly stuffs instructions into the exec-unit scheduling queues according to type. In the case of multiple queues accepting the same type of instruction, there is probably a heuristic involved, like how full the queues are, how many entries are ready to go (have data ready), etc.

    Cheers
     
  9. Gubbi

    Veteran

    Joined:
    Feb 8, 2002
    Messages:
    3,528
    Likes Received:
    862
    In Intel parlance, dispatch is when an instruction is scheduled from the ROB to an execution unit (through a port). This is different from IBM (and old Motorola) parlance. I usually try to avoid the terms issue and dispatch; mea culpa.

    Instructions are inserted in-order into the ROB. Instructions are scheduled from the ROB to execution units when data is ready (or known to be ready real soon), i.e. out-of-order. The exec units themselves hold a few instructions and can take advantage of local bypass in many cases for low-latency operation.

    In Zen (and in ARM Cortex A9/A15/A57/A73), there is no ROB. Instead there is a retirement buffer/queue and per-execution-unit scheduling queues, where instructions sit until all operand dependencies are met; then the instruction is executed, the result is written to the allocated retirement queue slot, and the instruction is removed from the scheduling queue (leaving room for new ones). All un-executed instructions sit in the scheduling queues. If you have long dependency chains of instructions and the first one stalls, you may run out of scheduling queue entries, and the front end stalls because it has nowhere to insert new instructions.

    Cheers
     
  10. tunafish

    Regular

    Joined:
    Aug 19, 2011
    Messages:
    542
    Likes Received:
    171
    In Intel CPUs, there is a separate structure called the scheduler. Things get put into the ROB instantly from the decoders, but only enter the scheduler once there is space (and do so strictly in-order). On Skylake, there are 97 scheduler entries.

    This scheduler is the equivalent of the AMD split schedulers. In case of lots of dependent ops, it fills up just like the AMD ones.
     
    sebbbi likes this.
  11. 3dilettante

    Legend Alpha

    Joined:
    Sep 15, 2003
    Messages:
    8,122
    Likes Received:
    2,873
    Location:
    Well within 3d
    Current high-performance x86 cores from Intel have reduced the amount of context carried by the ROB. Register tracking is primarily handled by the allocation table, and the monitoring of address-based dependences involved with memory speculation/forwarding/disambiguation/ordering occurs in the L/S portion after a queue entry is allocated.
    The ROB's primary job is status tracking and retirement. It can do things like carry a renamed result until retirement, or did/does in cores that do not use a physical register file to remove the power cost of moving data out of the ROB at retirement.

    For x86 in particular, the ordering requirements, forwarding snooping, and speculation are beyond the ROB's ability to resolve or handle. Not allocating an L/S entry takes the operation out of view of the L/S pipeline, and it can make decisions that are invalid architecturally in the absence of that information.

    There would be schedulers, but since these are OoO schedulers calling them "queues" is something of a leading term.

    The thing is that Intel's cores do stall with sufficient load uops in flight, based on the number of load entries and not the ROB.
    https://software.intel.com/en-us/node/595220

    Zen's scheme allows for 72 loads outstanding and has two AGU schedulers with 14 entries each. A pure chain of pointer chasing loads can go 28 deep before the in-order front end must stall due to oversubscription of scheduling resources. If any non-dependent loads from a register allocation standpoint reach the schedulers during that run, the AGU schedulers would be able to calculate an address OoO and send it to the 72-entry load unit. In theory, SMT mode might get one of its predictors to block rename for the thread with the massive dependence chain before it starves the other of any forward progress.
    Some memory operations may fall out of this, since there is a stack engine and some kind of "memory file" for store forwarding that might be involved in eliding some redundant address/forwarding calculations.

    If 72 loads are in-flight, the core stalls on the next load op to hit the allocation stage.
    Skylake would stall after 72 loads. Its unified scheduler could exceed Zen after 28 dependent loads single-thread, although you'd be bottlenecked by the 72-load limit. The exact limit would need to be tested, however. The scheduler is unified, but the internal reality may be that some operations may weigh more heavily than the headline count.
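The stall arithmetic above can be sketched directly from the quoted figures (72-entry load queue, two 14-entry AGU schedulers); this is a rough model of a pure dependent-load chain, not a cycle-accurate claim:

```python
# Simplified model of the stall point for a pure chain of dependent
# loads, using the Zen figures quoted in the post. Rough sketch only.
LOAD_QUEUE_ENTRIES = 72   # loads outstanding in the load unit
AGU_SCHED_ENTRIES = 14    # entries per AGU scheduler
NUM_AGU_SCHEDS = 2

def dependent_loads_before_stall():
    # In a pure pointer-chasing chain, no load can execute until its
    # predecessor completes, so the loads pile up in the AGU schedulers.
    # The in-order front end stalls once both schedulers are full,
    # well before the 72-entry load queue is the binding limit.
    return min(LOAD_QUEUE_ENTRIES, NUM_AGU_SCHEDS * AGU_SCHED_ENTRIES)

print(dependent_loads_before_stall())  # 28
```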
     
    sebbbi, Gubbi and Lightman like this.
  12. Gubbi

    Veteran

    Joined:
    Feb 8, 2002
    Messages:
    3,528
    Likes Received:
    862
    Right, ROBs used to hold data and capture data from the result buses. Today, everybody uses a physical register file and the ROB only holds renamed register indices.


    There are two limits on both Intel and AMD CPUs: the number of loads executing and the number of loads decoded and waiting for execution. The number of loads executing is limited by load buffers on both Intel and AMD.

    The number of loads decoded and waiting for execution is a very different matter. On Intel the upper bound is the size of the ROB (>200). On Zen it is 14 per queue, 28 in all. That seems low to me.

    Traverse a tree for each item in a collection with known depth (i.e. perfectly predictable), get a cache miss on a non-leaf, and you pollute your AGU schedulers with dead loads; do this for a couple of items and you have a significant number of dead loads in your scheduling queues, lowering scheduling capacity for all the other work going on (and for your SMT sibling thread).
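The pollution effect can be pictured with a toy count (the tree depth and miss position are made-up parameters, not measured figures):

```python
# Toy sketch: traversing a tree of known depth issues one load per
# level. If the load at level `miss_level` misses all caches, every
# deeper load in that chain depends on it and cannot execute; those
# are "dead" loads occupying scheduler entries until the miss resolves.
def dead_loads_per_traversal(depth, miss_level):
    # Loads below the missing level sit in the AGU scheduling queues
    # as dead weight while the miss is serviced.
    return depth - miss_level

# Two traversals of a depth-10 tree, each missing at level 3
# (illustrative numbers):
dead = 2 * dead_loads_per_traversal(10, 3)
print(dead)  # 14 scheduler entries tied up by loads that cannot start
```

Against two 14-entry AGU schedulers, even this small example would consume half the scheduling capacity.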

    I'll make a prediction right now: AMD will increase the depth of each scheduling queue in Zen2. BTW, I agree with you that "scheduling buffer" would be more appropriate, but Anandtech uses "scheduling queue" in their Zen articles, and I guess the name originates from AMD.

    Cheers
     
    #572 Gubbi, Dec 16, 2016
    Last edited: Dec 16, 2016
    Lightman likes this.
  13. Gubbi

    Veteran

    Joined:
    Feb 8, 2002
    Messages:
    3,528
    Likes Received:
    862
    Sorry, I don't think this is correct. It wouldn't be a ROB (Re-Order Buffer) if it just pushed instructions forward in-order to the scheduler. In fact, there would be no point in having it at all, since it would just buffer (delay) instructions from decode to actual execution.

    I'd expect the scheduler to hold instructions which have data ready, or instructions where data will be ready with known latency (i.e. a FADD dependent on a FMUL, where the FMUL is already executing and known to be done in a maximum of 5 cycles). The ROB watches the renamed register indices of the result buses. When a ROB entry has all its operand dependencies met, the physical register file is accessed and the instruction goes to the scheduler. If the operands aren't ready yet but will be in known time, they will be caught using the bypass network in the scheduler.
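The FADD-on-FMUL case can be sketched as a toy timing calculation; the 5-cycle FMUL latency is the post's example, while the exact bypass timing is an illustrative assumption:

```python
# Sketch of "known latency" wakeup: a consumer can be picked before
# its producer writes back, timed so the operand arrives over the
# bypass network. Timing details here are illustrative assumptions.
FMUL_LATENCY = 5  # cycles, from the post's example

def consumer_dispatch_cycle(producer_dispatch):
    # Dispatch the dependent FADD so its execution lines up with the
    # cycle the FMUL result appears on the bypass network, instead of
    # waiting for the value to land in the register file first.
    return producer_dispatch + FMUL_LATENCY

print(consumer_dispatch_cycle(0))  # 5: FADD launches as the FMUL result bypasses
```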

    Cheers
     
    #573 Gubbi, Dec 16, 2016
    Last edited: Dec 16, 2016
  14. tunafish

    Regular

    Joined:
    Aug 19, 2011
    Messages:
    542
    Likes Received:
    171
    It has been noted that since Sandy Bridge, the ROB is not really a ROB. It's just the backing structure that contains all information needed to retire instructions, and into which instructions are placed to wait for the scheduler. There is no delay caused by the ROB -- if the scheduler has free entries, an instruction can enter the ROB and the scheduler on the same cycle.

    This is simply incorrect. Whether or not instructions have data ready is not known until they enter scheduler -- figuring out if data is ready or not is the job of the scheduler.
     
  15. Laurent06

    Regular

    Joined:
    Dec 14, 2007
    Messages:
    715
    Likes Received:
    33
    I wonder how Intel can achieve their up to 8 uops per cycle with a 97-entry scheduler. Even with a full-custom implementation, that looks extremely impressive. I guess this is one of the most critical blocks of the design.
     
  16. pTmdfx

    Newcomer

    Joined:
    May 27, 2014
    Messages:
    249
    Likes Received:
    129
    It is the same case for both AMD and Intel. The ROB (or Retirement Queue, as AMD calls it) tracks instructions in program order in parallel with the scheduler. Modern ROB entries do not hold data, instead relying on the PRF and register renaming to maintain the history.

    This is the responsibility of the scheduler queues.

    The Reorder Buffer is there to maintain the illusion of program order, and to commit or roll back the register state & store buffer as necessary. It shouldn't be involved in micro-op reordering in the execution domain.
     
    #576 pTmdfx, Dec 17, 2016
    Last edited: Dec 17, 2016
  17. pTmdfx

    Newcomer

    Joined:
    May 27, 2014
    Messages:
    249
    Likes Received:
    129
    I am more inclined to the idea that these are all parallel constructs, especially in AMD's architectures if you consider that a macro-op can be dispatched to both the integer pipeline and the FP cluster. It seems more apparent that the Dispatch Group Buffer is the gatekeeper that opens the gate for the oldest dispatch group only if there are sufficient resources.
     
  18. Gubbi

    Veteran

    Joined:
    Feb 8, 2002
    Messages:
    3,528
    Likes Received:
    862
    If that is the case, then Nehalem -> Sandy Bridge was a bigger change than just going to a physical register file.

    I'm not convinced though. My understanding is this:

    In every Pentium Pro-derived core up to and including Nehalem/Westmere, the ROB has been a data-capture scheduler. Results are broadcast to the ROB; each instruction slot snoops for results matching the renamed register indices of its operands and captures the value on a match (hence the name). Once all operands are ready, instructions are pushed to the scheduler (called the reservation station, RS, in PPRO days). The size of the ROB in PPRO was 40 entries; the size of the RS was 20 entries. In Nehalem this was increased to 128 ROB entries and 36 RS (scheduler) entries.

    In the PPRO days, two 32-bit results and one 32/80-bit result were broadcast to 40 ROB entries per cycle; in Nehalem, four 128-bit results could be broadcast to 128 ROB entries per cycle, roughly 11 times as much work done in the ROB.

    Going to 256-bit AVX would double the power spent in the ROB, untenable as Nehalem already burned a lot of power. The solution is to broadcast the renamed register indices in the ROB instead. Sandy Bridge sports 144 FP and 160 Int rename registers, so just 4 x 9 bits need to be broadcast to resolve four results. The fact that Sandy Bridge and its descendants read the register file when instructions transit from the ROB to the scheduler means the ROB needs to know when registers in the physical register file are valid; the ROB is thus acutely aware of the status of registers/results and is part of the OoO scheduling machinery.
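The "11 times" figure can be sanity-checked with back-of-the-envelope arithmetic (result bits broadcast per cycle times ROB entries snooping; this product is my own rough proxy for broadcast "work"):

```python
# Rough proxy for data-capture broadcast work: bits broadcast per cycle
# multiplied by the number of ROB entries snooping the result buses.
ppro_work    = (32 + 32 + 80) * 40   # 2x32-bit + 1x32/80-bit results, 40 entries
nehalem_work = (4 * 128) * 128       # 4x128-bit results, 128 entries
print(nehalem_work / ppro_work)      # ~11.4, matching the post's "11 times"

# Sandy Bridge broadcasts renamed register indices instead of data:
index_bits = 4 * 9                   # four 9-bit indices per cycle
print((4 * 128) / index_bits)        # ~14x fewer bits on the broadcast network
```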

    Cheers
     
    #578 Gubbi, Dec 19, 2016
    Last edited: Dec 19, 2016
  19. xEx

    xEx
    Regular Newcomer

    Joined:
    Feb 2, 2012
    Messages:
    939
    Likes Received:
    398
    I have a question.

    Can AMD implement some of the SenseMI technologies in their GPUs? That would be interesting.
     
  20. lanek

    Veteran

    Joined:
    Mar 7, 2012
    Messages:
    2,469
    Likes Received:
    315
    Location:
    Switzerland
    I had the (maybe wrong) feeling that they are largely a derivative of the GPUs' TDP control and boost technology. Maybe with some improvements in how they execute it, which we might then find on GPUs (Infinity Fabric, Pure Power).
     
  • About Us

    Beyond3D has been around for over a decade and prides itself on being the best place on the web for in-depth, technically-driven discussion and analysis of 3D graphics hardware. If you love pixels and transistors, you've come to the right place!

    Beyond3D is proudly published by GPU Tools Ltd.