AMD RyZen CPU Architecture for 2017

Discussion in 'PC Industry' started by fellix, Oct 20, 2014.

  1. 3dilettante

    Legend Alpha

    Joined:
    Sep 15, 2003
    Messages:
    8,122
    Likes Received:
    2,873
    Location:
    Well within 3d
    My understanding is that the ROB is a pretty bare-bones structure in the most recent high-end cores with just tracking status.
    The allocation and rename stage is what would get the rename register and LS entry, and that stage is in-order and prior to the scheduling stage.
    How would the scheduler portion of the pipeline be able to block dependent loads from taking up space when their resources are allocated earlier in the in-order portion of the pipeline? Given that it's a pipeline and the back-pressure is unpredictable, it may also be a variable amount of time before it would be known if the first load missed the cache.
     
  2. Gubbi

    Veteran

    Joined:
    Feb 8, 2002
    Messages:
    3,528
    Likes Received:
    862
    The dependent loads take up space in the ROB (and have renamed register resources attached). The dependent loads aren't dispatched to an execution port until their operands are ready. ROBs are big nowadays, so they can sit around for a while without blocking anything.

    Without a ROB, you have per-execution-unit scheduling queues. So you replace the 200-entry ROB in your ten-execution-unit CPU with ten scheduling queues of 20 entries each. Instructions are now stuffed into these queues right after fetch/decode/rename. With two LS units, you are now limited to 40 loads/stores in flight at any one time, where before you were effectively limited only by the ROB size.

    So the scheduling queues need to be deep, because filling any one of them up will halt the machine.
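The capacity trade-off can be put in toy numbers (the 200-entry ROB, ten units, 20-entry queues, and two LS units are the post's hypothetical figures, not any shipping design):

```python
# Toy capacity comparison using the hypothetical figures from the post.
ROB_ENTRIES = 200      # unified ROB in the hypothetical CPU
NUM_EXEC_UNITS = 10    # replaced by ten per-unit scheduling queues
QUEUE_DEPTH = 20       # entries per scheduling queue
NUM_LS_UNITS = 2       # load/store units

# Unified ROB: any mix of instruction types can occupy all slots,
# so loads/stores in flight are bounded only by the ROB size.
unified_ls_limit = ROB_ENTRIES

# Distributed queues: un-executed loads/stores can only sit in the
# two LS-unit queues, so the bound drops to 2 x 20 = 40.
distributed_ls_limit = NUM_LS_UNITS * QUEUE_DEPTH

print(unified_ls_limit, distributed_ls_limit)  # 200 40
```

The total entry count is the same (10 x 20 = 200), but partitioning it per unit means one full queue stalls the front end even while the others have room.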

    Cheers
     
    sebbbi likes this.
  3. itsmydamnation

    Veteran Regular

    Joined:
    Apr 29, 2007
    Messages:
    1,298
    Likes Received:
    396
    Location:
    Australia
    I'm just a simpleton so you will have to humor me :shocked:

    Sounds like you think the mapper doesn't have any feedback loop from the execution units, or any intelligence. Why? If the mapper knows that a uop has stalled in a queue (or, better, doesn't issue to a queue until the data is in the PRF), it just doesn't schedule the dependent instructions; they sit in the retire queue/PRF until ready. The mapper is the interface between dispatch, the per-lane schedulers, and the retire queue, so it should have all the interfaces it needs to globally track that kind of thing.

    Breaking scheduling into two parts could just be about splitting a critical path across two stages to buy more time?
     
  4. Gubbi

    Veteran

    Joined:
    Feb 8, 2002
    Messages:
    3,528
    Likes Received:
    862
    What you're describing is global scheduling from a central ROB.

    The retirement queue doesn't dispatch instructions for execution; the individual scheduling queues for each execution unit do. The retirement queue is just a list where results from already-executed instructions sit until the unspeculated PC catches up. It does not know anything about data dependencies and thus cannot dispatch/schedule for execution.

    The scheduling queues are fed directly after rename. When an instruction is completed, it is removed from the queue and the result is saved in the retirement queue.

    The problem is when you have a sequence of instructions that cannot start execution because of data dependencies. They sit in the scheduling queues until the dependency is met. If you have a couple of linked-list traversals and accessing the head misses all caches, you can quickly fill the queues up.

    The mapper dishes out instructions to scheduling queues according to instruction type, probably filling least full queues first.
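A "fill the least full queue first" heuristic can be sketched minimally; the queue names and the uop-type mapping below are invented for illustration, not AMD's actual routing logic:

```python
# Hypothetical mapper heuristic: send each uop to the least-full queue
# among those whose execution unit handles its type. Queue names and
# the type mapping are illustrative assumptions.
QUEUES = {"alu0": [], "alu1": [], "agu0": [], "agu1": []}
CAN_HANDLE = {
    "add":   ["alu0", "alu1"],
    "load":  ["agu0", "agu1"],
    "store": ["agu0", "agu1"],
}

def dispatch(uop_type, uop):
    """Place the uop in the least-full eligible queue."""
    target = min(CAN_HANDLE[uop_type], key=lambda q: len(QUEUES[q]))
    QUEUES[target].append(uop)
    return target

dispatch("load", "ld1")  # both AGU queues empty, ties break to agu0
dispatch("load", "ld2")  # agu1 is now the least full
```

A real mapper would likely weigh more than occupancy (e.g. how many queued entries already have their data ready, as discussed below in the thread), but occupancy alone shows the load-balancing idea.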

    Cheers
     
  5. Gubbi

    Veteran

    Joined:
    Feb 8, 2002
    Messages:
    3,528
    Likes Received:
    862
    Think of the individual scheduling queues as really big reservation stations in old-school Tomasulo out-of-order execution (with the retirement queue added).

    Cheers
     
  6. itsmydamnation

    Veteran Regular

    Joined:
    Apr 29, 2007
    Messages:
    1,298
    Likes Received:
    396
    Location:
    Australia
    I'm not saying the ROB issues instructions (but the retirement queue is central; that's the way AMD depicts it). I'm saying that the mapper, when it receives uops, populates the retire queue with the uop and only issues to a scheduler when ready. The thing is, as far as I'm aware, AMD hasn't detailed how this works anywhere; there aren't any public white papers or patents, and they didn't detail it at Hot Chips. They do have some white papers/patents around a multi-level retirement queue system; there might be something relevant in there.

    AMD stated that the execution stage is fully out of order, so why do you think the mapper is just spray and pray?
     
  7. tunafish

    Regular

    Joined:
    Aug 19, 2011
    Messages:
    542
    Likes Received:
    171
    I might be confused, but as far as I understand it, on today's Intel CPUs instructions are dispatched to the scheduler strictly in order. Jumping over the dependent instructions to independent ones is the job of the scheduler, not some other structure. In this case the Intel scheduler would fill up just like the AMD ones.
     
  8. Gubbi

    Veteran

    Joined:
    Feb 8, 2002
    Messages:
    3,528
    Likes Received:
    862
    AFAICT the mapper just receives up to six instructions per cycle and stuffs them into scheduling queues. It might buffer a few cycles' worth of instructions, but it does not do any dependency checking; it strictly stuffs instructions into the exec-unit scheduling queues according to type. In the case of multiple queues accepting the same type of instruction, there is probably a heuristic involved, like how full the queues are, how many entries are ready to go (have data ready), etc.

    Cheers
     
  9. Gubbi

    Veteran

    Joined:
    Feb 8, 2002
    Messages:
    3,528
    Likes Received:
    862
    In Intel parlance, dispatch is when an instruction is scheduled from the ROB to an execution unit (through a port). This is different from IBM (and old Motorola) parlance. I usually try to avoid the terms issue and dispatch; mea culpa.

    Instructions are inserted in-order into the ROB. Instructions are scheduled from the ROB to execution units when data is ready (or known to be ready real soon), i.e. out-of-order. The exec units themselves hold a few instructions and can take advantage of local bypass in many cases for low-latency operation.

    In Zen (and in ARM Cortex A9/A15/A57/A73), there is no ROB. Instead there is a retirement buffer/queue and per-execution-unit scheduling queues, where instructions sit until all operand dependencies are met; then the instruction is executed, the result is written to the allocated retirement queue slot, and the instruction is removed from the scheduling queue (leaving room for new ones). All un-executed instructions sit in the scheduling queues. If you have long dependency chains of instructions and the first one stalls, you may run out of scheduling queue entries, and the front end stalls because it has nowhere to insert new instructions.

    Cheers
     
  10. tunafish

    Regular

    Joined:
    Aug 19, 2011
    Messages:
    542
    Likes Received:
    171
    In Intel CPUs, there is a separate structure called the scheduler. Things get put into the ROB instantly from the decoders, but only enter the scheduler once there is space (and do so strictly in-order). On Skylake, there are 97 scheduler entries.

    This scheduler is the equivalent of the AMD split schedulers. In case of lots of dependent ops, it fills up just like the AMD ones.
     
    sebbbi likes this.
  11. 3dilettante

    Legend Alpha

    Joined:
    Sep 15, 2003
    Messages:
    8,122
    Likes Received:
    2,873
    Location:
    Well within 3d
    Current high-performance x86 cores from Intel have reduced the amount of context carried by the ROB. Register tracking is primarily handled by the allocation table, and the monitoring of address-based dependences involved with memory speculation/forwarding/disambiguation/ordering occurs in the L/S portion after a queue entry is allocated.
    The ROB's primary job is status tracking and retirement. It can do things like carry a renamed result until retirement, or did/does in cores that do not use a physical register file to remove the power cost of moving data out of the ROB at retirement.

    For x86 in particular, the ordering requirements, forwarding snooping, and speculation are beyond the ROB's ability to resolve or handle. Not allocating an L/S entry takes the operation out of view of the L/S pipeline, and it can make decisions that are invalid architecturally in the absence of that information.

    There would be schedulers, but since these are OoO schedulers calling them "queues" is something of a leading term.

    The thing is that Intel's cores do stall with sufficient load uops in flight, based on the number of load entries and not the ROB.
    https://software.intel.com/en-us/node/595220

    Zen's scheme allows for 72 loads outstanding and has two AGU schedulers with 14 entries each. A pure chain of pointer chasing loads can go 28 deep before the in-order front end must stall due to oversubscription of scheduling resources. If any non-dependent loads from a register allocation standpoint reach the schedulers during that run, the AGU schedulers would be able to calculate an address OoO and send it to the 72-entry load unit. In theory, SMT mode might get one of its predictors to block rename for the thread with the massive dependence chain before it starves the other of any forward progress.
    Some memory operations may fall out of this, since there is a stack engine and some kind of "memory file" for store forwarding that might be involved in eliding some redundant address/forwarding calculations.

    If 72 loads are in-flight, the core stalls on the next load op to hit the allocation stage.
    Skylake would stall after 72 loads. Its unified scheduler could exceed Zen after 28 dependent loads single-thread, although you'd be bottlenecked by the 72-load limit. The exact limit would need to be tested, however. The scheduler is unified, but the internal reality may be that some operations may weigh more heavily than the headline count.
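The stall arithmetic above can be sketched directly from the quoted figures (72-entry load queue, two 14-entry AGU schedulers); this is a rough model of a pure dependent-load chain, not a cycle-accurate claim:

```python
# Simplified model of the stall point for a pure chain of dependent
# loads, using the Zen figures quoted in the post. Rough sketch only.
LOAD_QUEUE_ENTRIES = 72   # loads outstanding in the load unit
AGU_SCHED_ENTRIES = 14    # entries per AGU scheduler
NUM_AGU_SCHEDS = 2

def dependent_loads_before_stall():
    # In a pure pointer-chasing chain, no load can execute until its
    # predecessor completes, so the loads pile up in the AGU schedulers.
    # The in-order front end stalls once both schedulers are full,
    # well before the 72-entry load queue is the binding limit.
    return min(LOAD_QUEUE_ENTRIES, NUM_AGU_SCHEDS * AGU_SCHED_ENTRIES)

print(dependent_loads_before_stall())  # 28
```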
     
    sebbbi, Gubbi and Lightman like this.
  12. Gubbi

    Veteran

    Joined:
    Feb 8, 2002
    Messages:
    3,528
    Likes Received:
    862
    Right, ROBs used to hold data and capture data from the result buses. Today, everybody uses a physical register file and the ROB only holds renamed register indices.


    There are two limits on both Intel and AMD CPUs: the number of loads executing and the number of loads decoded and waiting for execution. The number of loads executing is limited by load buffers on both Intel and AMD.

    The number of loads decoded and waiting for execution is a very different matter. On Intel the upper bound is the size of the ROB (>200). On Zen it is 14 per queue, 28 in all. That seems low to me.

    Traverse a tree for each item in a collection with known depth (i.e. perfectly predictable), get a cache miss on a non-leaf, and you pollute your AGU schedulers with dead loads; do this for a couple of items and you have a significant number of dead loads in your scheduling queues, lowering scheduling capacity for all the other work going on (and for your SMT sibling thread).
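The pollution effect can be pictured with a toy count (the tree depth and miss position are made-up parameters, not measured figures):

```python
# Toy sketch: traversing a tree of known depth issues one load per
# level. If the load at level `miss_level` misses all caches, every
# deeper load in that chain depends on it and cannot execute; those
# are "dead" loads occupying scheduler entries until the miss resolves.
def dead_loads_per_traversal(depth, miss_level):
    # Loads below the missing level sit in the AGU scheduling queues
    # as dead weight while the miss is serviced.
    return depth - miss_level

# Two traversals of a depth-10 tree, each missing at level 3
# (illustrative numbers):
dead = 2 * dead_loads_per_traversal(10, 3)
print(dead)  # 14 scheduler entries tied up by loads that cannot start
```

Against two 14-entry AGU schedulers, even this small example would consume half the scheduling capacity.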

    I'll make a prediction right now: AMD will increase the depth of each scheduling queue in Zen2. BTW, I agree with you that "scheduling buffer" would be more appropriate, but Anandtech uses "scheduling queue" in their Zen articles, and I guess the name originates from AMD.

    Cheers
     
    #572 Gubbi, Dec 16, 2016
    Last edited: Dec 16, 2016
    Lightman likes this.
  13. Gubbi

    Veteran

    Joined:
    Feb 8, 2002
    Messages:
    3,528
    Likes Received:
    862
    Sorry, I don't think this is correct. It wouldn't be a ROB (Re-Order Buffer) if it just pushed instructions forward in-order to the scheduler. In fact, there would be no point in having it at all, since it would just buffer (delay) instructions from decode to actual execution.

    I'd expect the scheduler to hold instructions which have data ready, or instructions where data will be ready with known latency (i.e. a FADD dependent on a FMUL, where the FMUL is already executing and known to be done in a maximum of 5 cycles). The ROB watches the renamed register indices of the result buses. When a ROB entry has all its operand dependencies met, the physical register file is accessed and the instruction goes to the scheduler. If the operands aren't ready yet but will be in known time, they will be caught using the bypass network in the scheduler.
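The FADD-on-FMUL case can be sketched as a toy timing calculation; the 5-cycle FMUL latency is the post's example, while the exact bypass timing is an illustrative assumption:

```python
# Sketch of "known latency" wakeup: a consumer can be picked before
# its producer writes back, timed so the operand arrives over the
# bypass network. Timing details here are illustrative assumptions.
FMUL_LATENCY = 5  # cycles, from the post's example

def consumer_dispatch_cycle(producer_dispatch):
    # Dispatch the dependent FADD so its execution lines up with the
    # cycle the FMUL result appears on the bypass network, instead of
    # waiting for the value to land in the register file first.
    return producer_dispatch + FMUL_LATENCY

print(consumer_dispatch_cycle(0))  # 5: FADD launches as the FMUL result bypasses
```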

    Cheers
     
    #573 Gubbi, Dec 16, 2016
    Last edited: Dec 16, 2016
  14. tunafish

    Regular

    Joined:
    Aug 19, 2011
    Messages:
    542
    Likes Received:
    171
    It has been noted that since Sandy Bridge, the ROB is not really a ROB. It's just the backing structure that contains all information needed to retire instructions, and into which instructions are placed to wait for the scheduler. There is no delay caused by the ROB -- if the scheduler has free entries, an instruction can enter the ROB and the scheduler on the same cycle.

    This is simply incorrect. Whether or not instructions have data ready is not known until they enter scheduler -- figuring out if data is ready or not is the job of the scheduler.
     
  15. Laurent06

    Regular

    Joined:
    Dec 14, 2007
    Messages:
    715
    Likes Received:
    33
    I wonder how Intel can achieve their up to 8 uops per cycle with a 97-entry scheduler. Even with a full-custom implementation, that looks extremely impressive. I guess this is one of the most critical blocks of the design.
     
  16. pTmdfx

    Newcomer

    Joined:
    May 27, 2014
    Messages:
    249
    Likes Received:
    129
    It is the same case for both AMD and Intel. The ROB (or Retirement Queue, as AMD calls it) tracks instructions in program order in parallel with the scheduler. Modern ROB entries do not hold data, instead relying on the PRF and register renaming to maintain the history.

    This is the responsibility of the scheduler queues.

    The Reorder Buffer is there to maintain the illusion of program order, and to commit or roll back the register state & store buffer as necessary. It shouldn't be involved in micro-op reordering in the execution domain.
     
    #576 pTmdfx, Dec 17, 2016
    Last edited: Dec 17, 2016
  17. pTmdfx

    Newcomer

    Joined:
    May 27, 2014
    Messages:
    249
    Likes Received:
    129
    I am more inclined to the idea that these are all parallel constructs, especially in AMD's architectures if you consider that a macro-op can be dispatched to both the integer pipeline and the FP cluster. It seems more apparent that the Dispatch Group Buffer is the gatekeeper that opens the gate for the oldest dispatch group only if there are sufficient resources.
     
  18. Gubbi

    Veteran

    Joined:
    Feb 8, 2002
    Messages:
    3,528
    Likes Received:
    862
    If that is the case, then Nehalem -> Sandy Bridge was a bigger change than just going to a physical register file.

    I'm not convinced though. My understanding is this:

    In every Pentium Pro-derived core up to and including Nehalem/Westmere, the ROB has been a data-capture scheduler. Results are broadcast to the ROB; each instruction slot snoops for results matching the renamed register indices of its operands and captures the value on a match (hence the name). Once all operands are ready, instructions are pushed to the scheduler (called the reservation station, RS, in PPRO days). The size of the ROB in PPRO was 40 entries; the size of the RS was 20 entries. In Nehalem this was increased to 128 ROB entries and 36 RS (scheduler) entries.

    In the PPRO days, two 32-bit results and one 32/80-bit result were broadcast to 40 ROB entries per cycle; in Nehalem, four 128-bit results could be broadcast to 128 ROB entries per cycle, roughly 11 times as much work done in the ROB.

    Going to 256-bit AVX would double the power spent in the ROB, untenable as Nehalem already burned a lot of power. The solution is to broadcast the renamed register indices in the ROB instead. Sandy Bridge sports 144 FP and 160 Int rename registers, so just 4 x 9 bits need to be broadcast to resolve four results. The fact that Sandy Bridge and its descendants read the register file when instructions transit from the ROB to the scheduler means the ROB needs to know when registers in the physical register file are valid; the ROB is thus acutely aware of the status of registers/results and is part of the OoO scheduling machinery.
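The "11 times" figure can be sanity-checked with back-of-the-envelope arithmetic (result bits broadcast per cycle times ROB entries snooping; this product is my own rough proxy for broadcast "work"):

```python
# Rough proxy for data-capture broadcast work: bits broadcast per cycle
# multiplied by the number of ROB entries snooping the result buses.
ppro_work    = (32 + 32 + 80) * 40   # 2x32-bit + 1x32/80-bit results, 40 entries
nehalem_work = (4 * 128) * 128       # 4x128-bit results, 128 entries
print(nehalem_work / ppro_work)      # ~11.4, matching the post's "11 times"

# Sandy Bridge broadcasts renamed register indices instead of data:
index_bits = 4 * 9                   # four 9-bit indices per cycle
print((4 * 128) / index_bits)        # ~14x fewer bits on the broadcast network
```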

    Cheers
     
    #578 Gubbi, Dec 19, 2016
    Last edited: Dec 19, 2016
  19. xEx

    xEx
    Regular Newcomer

    Joined:
    Feb 2, 2012
    Messages:
    939
    Likes Received:
    398
    I have a question.

    Can AMD implement some of the SenseMI technologies in their GPUs? That would be interesting.
     
  20. lanek

    Veteran

    Joined:
    Mar 7, 2012
    Messages:
    2,469
    Likes Received:
    315
    Location:
    Switzerland
    I had the (maybe wrong) feeling that they are largely a derivative of the GPUs' TDP control and boost technology. Maybe with some improvements in how they execute it, which we might then find on GPUs (Infinity Fabric, Pure Power).
     
  • About Us

    Beyond3D has been around for over a decade and prides itself on being the best place on the web for in-depth, technically-driven discussion and analysis of 3D graphics hardware. If you love pixels and transistors, you've come to the right place!

    Beyond3D is proudly published by GPU Tools Ltd.