AMD RyZen CPU Architecture for 2017

Discussion in 'PC Industry' started by fellix, Oct 20, 2014.

  1. Infinisearch

    Veteran Regular

    Joined:
    Jul 22, 2004
    Messages:
    739
    Likes Received:
    139
    Location:
    USA
    I also have a question for the more technically inclined here. How many ports do the physical register files of modern microprocessors have? Just looking at block diagrams you'd think they might have a lot of ports (6), but that's impossible, right?

    edit - that should say 16 ports not 6.
     
    #581 Infinisearch, Dec 19, 2016
    Last edited: Dec 20, 2016
  2. 3dilettante

    Legend Alpha

    Joined:
    Sep 15, 2003
    Messages:
    8,122
    Likes Received:
    2,873
    Location:
    Well within 3d
    The physical register file is integral to a fundamental change to the execution engine. The shift towards sourcing values via pointers is what makes the concept work, and it cannot be isolated from the accompanying changes to the ROB and the scheduler.

    I interpret the following differently:
    http://www.intel.com/content/www/us...-ia-32-architectures-optimization-manual.html

    My understanding is that the in-order front end works with the ROB and allocates a ROB entry, RS entry, and an LS entry if needed. The pipeline will stall if it can't.
    The ROB does serve as a possible data source, but it doesn't do the full monitoring of forwarding and readiness of operands. It's already handling exceptions, writes to the register file, and in-order retirement. The scheduler can be a destination for forwarded results if there are ops already pending, and the wording of the Intel guide indicates the reservation station does have entries before they have all their operands--which is the general point in other architectures that have reservation stations.
    Once an instruction begins execution, it can leave the RS, which is why the ROB can hold more operations in the later stages of a uop's lifespan.
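    The lifecycle described above can be sketched in a toy model (illustrative only, not Intel's implementation; sizes and names are made up): a uop holds its RS entry only until it begins execution, but keeps its ROB entry until in-order retirement, so the ROB can track many more in-flight uops than the RS at any moment.

```python
from collections import deque

class Uop:
    def __init__(self, name):
        self.name = name
        self.done = False      # finished execution

def simulate(names, rs_size):
    rob, rs = deque(), deque()
    retired = []
    for n in names:
        # Front end stops allocating (stalls) if the RS is full.
        if len(rs) >= rs_size:
            break
        u = Uop(n)
        rob.append(u)          # ROB entry allocated at rename/allocate...
        rs.append(u)           # ...and the RS entry in parallel
    while rob:
        if rs:
            u = rs.popleft()   # uop begins execution and frees its RS entry
            u.done = True
        # Retire strictly in order from the head of the ROB.
        while rob and rob[0].done:
            retired.append(rob.popleft().name)
    return retired

print(simulate(["a", "b", "c"], rs_size=2))  # only a, b get allocated
```

    A real allocator would stall and resume rather than stop, and execution takes multiple cycles; the point is only the different lifetimes of the RS and ROB entries.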

    There are performance event counters mentioned for Nehalem that do cite specific cases like a stall where there are no memory buffer entries available. There's a MOB for keeping accesses in order, especially with the memory speculation added with Core2 and going forward. The ROB is not in a good position to catch aliased addresses, and the memory pipeline cannot catch aliases if the ROB keeps memory operations invisible to it for an indeterminate period of time.

    The ROB and scheduler allocations occur in parallel. Since the ROB no longer serves as a data source, nor does it update the architectural register file directly, why would the scheduler need anything except some of the allocation information from the allocation stage?
     
  3. xEx

    xEx
    Regular Newcomer

    Joined:
    Feb 2, 2012
    Messages:
    939
    Likes Received:
    398
    from the GPUs thread:

    "Another new algorithm uses self-training neural network techniques to make smarter decisions on when to pre-fetch data or instructions for a core. That could result in systems that get modestly better over time at recurring tasks". Does this mean that my CPU will gain performance over time if I play the same game over and over? It seems like marketing overload to me, tbh.

    http://www.eetimes.com/document.asp?doc_id=1330981&page_number=1
     
  4. xEx

    xEx
    Regular Newcomer

    Joined:
    Feb 2, 2012
    Messages:
    939
    Likes Received:
    398
    Also, we have to think about how cheap these CPUs are to manufacture, since they don't need complex GPUs eating up die space, energy and heat budget. So I think AMD can make these CPUs, sell them at very good prices compared to Intel, and still make good money.

    Also, what will Intel's answer be? Will we finally be able to stop financing Intel's IGP department? I really hope so. The next year will be very interesting.
     
  5. 3dilettante

    Legend Alpha

    Joined:
    Sep 15, 2003
    Messages:
    8,122
    Likes Received:
    2,873
    Location:
    Well within 3d
    They can be significantly ported, depending on design choices. Commonly, an ALU consumes 2 read operands and produces 1 write per cycle, and superscalar CPU register files can be heavily engineered to match that aggregate throughput.
    Itanium, for example, had a lot of ports due to its width.
    However, a design can reduce the load on the file by relying on forwarding or a ROB. Prior to the shift to the current core architecture, Intel's cores could bottleneck on register port limits.
    Other tricks, besides heavy multiporting, include clustering or mirroring so that individual portions of the register file are not physically burdened by having to service the full operand demand.

    Similarly, having separate FP and Integer files helps divide the load, particularly since FP can have some very wide ports.

    Branch prediction and prefetching likely fall under that marketing umbrella. Recurring could just be a loop or frequently called code.
    One possibility within the core itself might be a way to record scheduling conflicts or hot registers in a loop that could be optimized.
     
  6. Infinisearch

    Veteran Regular

    Joined:
    Jul 22, 2004
    Messages:
    739
    Likes Received:
    139
    Location:
    USA
    Is there any relationship between the design process (metal layers maybe) and the maximum number of ports?
     
  7. 3dilettante

    Legend Alpha

    Joined:
    Sep 15, 2003
    Messages:
    8,122
    Likes Received:
    2,873
    Location:
    Well within 3d
    There's an interplay between the design target and what provides a practical benefit. CPU register files can be dominated by the access circuitry around them, since each port impacts the transistors per cell (area, power, voltage, capacity), the complexity of the access network (endpoints in the network, sense amps, decoders, power), and congestion in a timing-sensitive and already crowded portion of the chip.
    edit: A lot of these interact with each other in ways I am not sufficiently versed in to dive into more deeply (like speed).

    I would presume that a breakdown of the relative physical and engineering costs would list the process as a significant contributor, but not in a straightforward fashion. Back when CPU designers had more control over the process, there was a tradeoff between metal layers and engineering effort for a given design target. More layers would make hitting a certain area target feasible for a given amount of effort, and this is one area where Intel was able to use fewer layers for a comparable CPU.
     
    #587 3dilettante, Dec 19, 2016
    Last edited: Dec 19, 2016
  8. pTmdfx

    Newcomer

    Joined:
    May 27, 2014
    Messages:
    249
    Likes Received:
    129
    Bulldozer has a 10R6W FP PRF — two 3R1W pipelines, two 2R1W pipelines and two for LSU writebacks. The integer PRF should have a similar arrangement, i.e. 8R6W (note: AGLU can write PRF). But AFAICR the integer PRF is replicated with write broadcast for each pair of EX and AG, so it is actually two PRFs, probably 4R6W.
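    The arrangement described above can be tallied quickly. This just adds up the per-pipeline ports quoted in the post (the grouping into tuples is my own illustration):

```python
# (reads, writes) per port group, per the Bulldozer FP PRF description above
ports = [
    (3, 1),  # FP pipeline: 3 reads, 1 write
    (3, 1),  # FP pipeline: 3 reads, 1 write
    (2, 1),  # FP pipeline: 2 reads, 1 write
    (2, 1),  # FP pipeline: 2 reads, 1 write
    (0, 1),  # LSU writeback port (write-only)
    (0, 1),  # LSU writeback port (write-only)
]
reads = sum(r for r, w in ports)
writes = sum(w for r, w in ports)
print(f"{reads}R{writes}W")  # 10R6W
```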
     
  9. Laurent06

    Regular

    Joined:
    Dec 14, 2007
    Messages:
    715
    Likes Received:
    33
    One simple way to estimate it is to use the maximum IPC. If a CPU can sustain 3 adds per cycle then you need 6 read ports and 3 write ports.

    This might help: http://my.ece.msstate.edu/faculty/reese/EE8273/lectures/memarch/memarch.pdf
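    That rule of thumb is easy to write down. A sketch (the 2-read/1-write per-op default matches the simple ALU-op case above; real designs can undershoot this worst case via forwarding or banking, as mentioned earlier in the thread):

```python
def port_estimate(ops_per_cycle, reads_per_op=2, writes_per_op=1):
    """Worst-case (read, write) port count to sustain a given issue width."""
    return (ops_per_cycle * reads_per_op, ops_per_cycle * writes_per_op)

print(port_estimate(3))  # 3 adds/cycle -> (6, 3), i.e. 6R3W
```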
     
  10. Laurent06

    Regular

    Joined:
    Dec 14, 2007
    Messages:
    715
    Likes Received:
    33
    That's marketing BS at its best or worst :p Algorithms that work with training for branch prediction and data prefetching have been described many years ago. Look for perceptron for instance.
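    For the curious, a minimal perceptron predictor in the spirit of Jimenez and Lin's scheme looks like this. It is a sketch only: the table size, history length and training threshold here are arbitrary choices, not any shipping design's parameters.

```python
class PerceptronPredictor:
    def __init__(self, n_entries=64, hist_len=8, theta=16):
        self.theta = theta
        # One weight vector per table entry; index 0 is the bias weight.
        self.w = [[0] * (hist_len + 1) for _ in range(n_entries)]
        self.history = [1] * hist_len            # +1 taken, -1 not taken

    def predict(self, pc):
        w = self.w[pc % len(self.w)]
        y = w[0] + sum(wi * h for wi, h in zip(w[1:], self.history))
        return y >= 0, y                         # taken iff dot product >= 0

    def update(self, pc, taken):
        pred, y = self.predict(pc)
        t = 1 if taken else -1
        # Train only on a mispredict or a low-confidence output.
        if pred != taken or abs(y) <= self.theta:
            w = self.w[pc % len(self.w)]
            w[0] += t
            for i, h in enumerate(self.history):
                w[i + 1] += t * h
        self.history = self.history[1:] + [t]    # shift in the outcome

p = PerceptronPredictor()
for _ in range(50):          # an always-taken branch at a hypothetical PC
    p.update(0x40, True)
print(p.predict(0x40)[0])    # True: the weights have learned the bias
```

    The appeal in hardware is that the dot product over a +1/-1 history reduces to adds and subtracts, which is what makes "neural network" plausible at CPU clock speeds.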
     
  11. Alexko

    Veteran Subscriber

    Joined:
    Aug 31, 2009
    Messages:
    4,496
    Likes Received:
    910
    I could be wrong, but I think that xEx is under the impression that we're talking about something persistent, that if he plays Crysis 4 every day, the game will get faster day after day.
     
  12. xEx

    xEx
    Regular Newcomer

    Joined:
    Feb 2, 2012
    Messages:
    939
    Likes Received:
    398
    I'm just trying to deduce what AMD's statement, "uses self-training neural network techniques to make smarter decisions on when to pre-fetch data or instructions for a core. That could result in systems that get modestly better over time at recurring tasks", means for users, because the literal reading is that the CPU gets better over time when it's doing the same task. Whether that "over time" means milliseconds, seconds, minutes or hours we don't know just yet. It may mean that the first couple of ms are bad and then it gets good (like the first couple of frames taking longer to render than the rest, or the game gaining performance after some time playing).
     
  13. Malo

    Malo Yak Mechanicum
    Legend Veteran Subscriber

    Joined:
    Feb 9, 2002
    Messages:
    7,038
    Likes Received:
    3,108
    Location:
    Pennsylvania
    If the comment is even possibly related to games at all then any potential gains would flatten very quickly.
     
  14. Infinisearch

    Veteran Regular

    Joined:
    Jul 22, 2004
    Messages:
    739
    Likes Received:
    139
    Location:
    USA
    Metal layers are what the buses are made of, right? That's basically why I thought they would prove to be a limitation. Also, do you know where I can read about the interplay between design and layout? I've taken a course on basic microprocessor design but have always wondered how layout constraints affect design, and I'd like to read about it.

    Where did you get this information from? Can you post a link?
     
  15. 3dilettante

    Legend Alpha

    Joined:
    Sep 15, 2003
    Messages:
    8,122
    Likes Received:
    2,873
    Location:
    Well within 3d
    The demands of a high-clock CPU would keep the complexity of any "neural network" low, and the amount of context low. What a "task" is would be correspondingly simple.
    Without some kind of off-die and possibly non-volatile storage, the time frame would also need to be as short as the life span of a branch predictor or prefetch stream entry, which is closer to the lifetime of a hot code region--and then probably only for that time slice.

    The other constraint is the limited scope of what the CPU can perceive about the task it is running. Without outside storage and software, it sees a narrow range of values with no particular idea of what its "task" is globally.
    The branch predictor for AMD has been a perceptron for a while now. Prefetch logic is unclear, but it might be that a perceptron or similar limited neural net could be applied there for non-strided accesses.
    Within a core, maybe some limited level of prediction for scheduler allocation?
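    For contrast with the neural approaches speculated about above, the classic baseline is a stride prefetcher: it only catches constant-stride streams, which is exactly the gap a perceptron-style prefetcher for non-strided accesses would aim to fill. A sketch (table indexing and confidence handling are simplified; the addresses and PC below are made up):

```python
class StridePrefetcher:
    def __init__(self):
        self.table = {}   # pc -> (last_addr, stride, confidence)

    def access(self, pc, addr):
        """Record a demand access; return a prefetch address or None."""
        last, stride, conf = self.table.get(pc, (addr, 0, 0))
        new_stride = addr - last
        if new_stride == stride and stride != 0:
            conf = min(conf + 1, 3)   # same nonzero stride seen again
        else:
            conf = 0                  # pattern broken, reset confidence
        self.table[pc] = (addr, new_stride, conf)
        # Prefetch one stride ahead once the pattern has repeated.
        return addr + new_stride if conf >= 2 else None

pf = StridePrefetcher()
out = [pf.access(0x10, a) for a in (100, 164, 228, 292)]
print(out)  # locks on after the stride repeats: [None, None, None, 356]
```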
     
    Alexko, sebbbi and Laurent06 like this.
  16. xEx

    xEx
    Regular Newcomer

    Joined:
    Feb 2, 2012
    Messages:
    939
    Likes Received:
    398
    Yeah, I agree. You can make a "neural net" that trains itself and learns over time, but making something like that inside the die of a CPU is something completely different. It is very good marketing material for sure, though.
     
  17. 3dilettante

    Legend Alpha

    Joined:
    Sep 15, 2003
    Messages:
    8,122
    Likes Received:
    2,873
    Location:
    Well within 3d
    Metal layers provide the connections between transistors and units, and the impact of wire delay and congestion is a major factor. How much can be supplied to a limited area is a function of how many layers could be available, but it's not a straightforward limit on the number of register file ports without knowing what the design's target is, how much is invested in engineering the units for optimal routing, and what trade-offs like banking or clustering are in play.

    It's a scattering of site references that I try to remember over time.
    One item is rather old now, but the difficulties in getting significant improvement in wires have been well-proven since then.
    http://www.realworldtech.com/shrinking-cpu/4/

    Physical distance impacts delay and power, and items like the number of ports have implications for the size of the SRAM and the number of resources needed to access it (decoders for each register ID, sense amps for each new port, more access-network wiring to route the data, etc.). That's more wires, more distance, more complicated characterization of interference, and increasing congestion under the ceiling of the number of available layers.
    Register access speed and reliability trade off against density and drive strength--which perversely favors bigger transistors.
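    The port-cost story above follows from a common rule of thumb: each added port contributes roughly one wordline in one dimension of the bit cell and one bitline in the other, so cell area grows about quadratically with port count. A toy model (the constants are arbitrary; only the scaling trend is the point):

```python
def relative_cell_area(ports, base=2):
    """Rule-of-thumb bit-cell area: quadratic in port count.

    `base` loosely models the fixed overhead of the storage cell itself.
    """
    return (base + ports) ** 2

# Doubling the ports roughly triples the cell area in this model.
for p in (3, 6, 12):
    print(p, relative_cell_area(p) / relative_cell_area(3))
```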

    There are also comp.arch discussions including designers from AMD or elsewhere where they talk about how X unit or the forwarding distance would have expanded the physical distance traversed, and would have exceeded the time budget allotted by the design target.
     
  18. pTmdfx

    Newcomer

    Joined:
    May 27, 2014
    Messages:
    249
    Likes Received:
    129
    Probably this one for the FP PRF. I don't remember exactly which (IIRC both were ISSCC conference papers), but both were definitely available in IEEE Xplore.
    http://ieeexplore.ieee.org/abstract/document/5746227/

    Edit: This figure sums up those data points in the paper.
    http://pc.watch.impress.co.jp/img/pcw/docs/430/801/9.jpg
     
  19. sebbbi

    Veteran

    Joined:
    Nov 14, 2007
    Messages:
    2,924
    Likes Received:
    5,288
    Location:
    Helsinki, Finland
    Bobcat and Jaguar already had "neural network" branch predictors. Bulldozer didn't, but AMD later took Jaguar's branch predictor and added it to Piledriver. This was briefly mentioned in technical articles and some reviews. Now "neural network" has become a buzzword, and the marketing department uses it as well. We don't yet know how much AMD's neural network predictor has improved over Excavator's, if at all.
     
  20. fehu

    Veteran Regular

    Joined:
    Nov 15, 2006
    Messages:
    1,462
    Likes Received:
    395
    Location:
    Somewhere over the ocean
    neural networks on the cloud with b2b 2.0
     