AMD RyZen CPU Architecture for 2017

Discussion in 'PC Industry' started by fellix, Oct 20, 2014.

  1. Infinisearch

    Veteran Regular

    Joined:
    Jul 22, 2004
    Messages:
    739
    Likes Received:
    139
    Location:
    USA
    I also have a question for the more technically inclined here. How many ports do the physical register files of modern microprocessors have? Just looking at block diagrams you'd think they might have a lot of ports (6), but that's impossible, right?

    edit - that should say 16 ports not 6.
     
    #581 Infinisearch, Dec 19, 2016
    Last edited: Dec 20, 2016
  2. 3dilettante

    Legend Alpha

    Joined:
    Sep 15, 2003
    Messages:
    8,122
    Likes Received:
    2,873
    Location:
    Well within 3d
    The physical register file is integral to a fundamental change to the execution engine. The shift towards sourcing values via pointers is what makes the concept work, and it cannot be isolated from the accompanying changes to the ROB and the scheduler.

    I interpret the following differently:
    http://www.intel.com/content/www/us...-ia-32-architectures-optimization-manual.html

    My understanding is that the in-order front end works with the ROB and allocates a ROB entry, RS entry, and an LS entry if needed. The pipeline will stall if it can't.
    The ROB does serve as a possible data source, but it doesn't do the full monitoring of forwarding and readiness of operands. It's already handling exceptions, writes to the register file, and in-order retirement. The scheduler can be a destination for forwarded results if there are ops already pending, and the wording of the Intel guide indicates the reservation station does have entries before they have all their operands--which is the general point in other architectures that have reservation stations.
    Once an instruction begins execution, it can leave the RS, which is why the ROB can hold more operations in the later stages of a uop's lifespan.
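    The lifecycle described above can be sketched in a toy model (illustrative only, not Intel's implementation; sizes and names are made up): a uop holds its RS entry only until it begins execution, but keeps its ROB entry until in-order retirement, so the ROB can track many more in-flight uops than the RS at any moment.

```python
from collections import deque

class Uop:
    def __init__(self, name):
        self.name = name
        self.done = False      # finished execution

def simulate(names, rs_size):
    rob, rs = deque(), deque()
    retired = []
    for n in names:
        # Front end stops allocating (stalls) if the RS is full.
        if len(rs) >= rs_size:
            break
        u = Uop(n)
        rob.append(u)          # ROB entry allocated at rename/allocate...
        rs.append(u)           # ...and the RS entry in parallel
    while rob:
        if rs:
            u = rs.popleft()   # uop begins execution and frees its RS entry
            u.done = True
        # Retire strictly in order from the head of the ROB.
        while rob and rob[0].done:
            retired.append(rob.popleft().name)
    return retired

print(simulate(["a", "b", "c"], rs_size=2))  # only a, b get allocated
```

    A real allocator would stall and resume rather than stop, and execution takes multiple cycles; the point is only the different lifetimes of the RS and ROB entries.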

    There are performance event counters mentioned for Nehalem that do cite specific cases like a stall where there are no memory buffer entries available. There's a MOB for keeping accesses in order, especially with the memory speculation added with Core2 and going forward. The ROB is not in a good position to catch aliased addresses, and the memory pipeline cannot catch aliases if the ROB keeps memory operations invisible to it for an indeterminate period of time.

    The ROB and scheduler allocations occur in parallel. Since the ROB no longer serves as a data source, nor does it update the architectural register file directly, why would the scheduler need anything except some of the allocation information from the allocation stage?
     
  3. xEx

    xEx
    Regular Newcomer

    Joined:
    Feb 2, 2012
    Messages:
    939
    Likes Received:
    398
    from the GPUs thread:

    "Another new algorithm uses self-training neural network techniques to make smarter decisions on when to pre-fetch data or instructions for a core. That could result in systems that get modestly better over time at recurring tasks". Does this mean that my CPU will gain performance over time if I play the same game over and over? It seems like marketing overload to me, tbh.

    http://www.eetimes.com/document.asp?doc_id=1330981&page_number=1
     
  4. xEx

    xEx
    Regular Newcomer

    Joined:
    Feb 2, 2012
    Messages:
    939
    Likes Received:
    398
    Also, we have to think about how cheap these CPUs are to manufacture, since they don't need complex GPUs eating up die space, energy and heat budget. So I think AMD can make these CPUs, sell them at very good prices compared to Intel, and still make good money.

    Also, what will Intel's answer be? Will we finally be able to stop financing Intel's IGP department? I really hope so. The next year will be very interesting.
     
  5. 3dilettante

    Legend Alpha

    Joined:
    Sep 15, 2003
    Messages:
    8,122
    Likes Received:
    2,873
    Location:
    Well within 3d
    They can be significantly ported, depending on design choices. Commonly, an ALU consumes 2 read operands and produces 1 write per cycle, and superscalar CPU register files can be heavily engineered to match that aggregate throughput.
    Itanium, for example, had a lot of ports due to its width.
    However, a design can reduce the load on the file by relying on forwarding or a ROB. Prior to the shift to the current core architecture, Intel's cores could bottleneck on register port limits.
    Other tricks, besides heavy multiporting, include clustering or mirroring so that individual portions of the register file are not physically burdened by having to service the full operand demand.

    Similarly, having separate FP and Integer files helps divide the load, particularly since FP can have some very wide ports.

    Branch prediction and prefetching likely fall under that marketing umbrella. Recurring could just be a loop or frequently called code.
    One possibility within the core itself might be a way to record scheduling conflicts or hot registers in a loop that could be optimized.
     
  6. Infinisearch

    Veteran Regular

    Joined:
    Jul 22, 2004
    Messages:
    739
    Likes Received:
    139
    Location:
    USA
    Is there any relationship between the design process (metal layers maybe) and the maximum number of ports?
     
  7. 3dilettante

    Legend Alpha

    Joined:
    Sep 15, 2003
    Messages:
    8,122
    Likes Received:
    2,873
    Location:
    Well within 3d
    There's an interplay between the design target and what provides a practical benefit. CPU register files can be dominated by the access circuitry around them, since each port impacts the transistors per cell (area, power, voltage, capacity), the complexity of the access network (endpoints in the network, sense amps, decoders, power), and congestion in a timing-sensitive and already crowded portion of the chip.
    edit: A lot of these interact with each other in ways I am not sufficiently versed in to dive into more deeply (like speed).

    I would presume that a breakdown of the relative physical and engineering costs would list the process as a significant contributor, but not in a straightforward fashion. Back when CPU designers had more control over the process, there was a tradeoff between metal layers and engineering effort for a given design target. More layers would make hitting a certain area target feasible for a given amount of effort, and this is one area where Intel was able to use fewer layers for a comparable CPU.
     
    #587 3dilettante, Dec 19, 2016
    Last edited: Dec 19, 2016
  8. pTmdfx

    Newcomer

    Joined:
    May 27, 2014
    Messages:
    249
    Likes Received:
    129
    Bulldozer has a 10R6W FP PRF — two 3R1W pipelines, two 2R1W pipelines and two for LSU writebacks. The integer PRF should have a similar arrangement, i.e. 8R6W (note: AGLU can write PRF). But AFAICR the integer PRF is replicated with write broadcast for each pair of EX and AG, so it is actually two PRFs, probably 4R6W.
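    The arrangement described above can be tallied quickly. This just adds up the per-pipeline ports quoted in the post (the grouping into tuples is my own illustration):

```python
# (reads, writes) per port group, per the Bulldozer FP PRF description above
ports = [
    (3, 1),  # FP pipeline: 3 reads, 1 write
    (3, 1),  # FP pipeline: 3 reads, 1 write
    (2, 1),  # FP pipeline: 2 reads, 1 write
    (2, 1),  # FP pipeline: 2 reads, 1 write
    (0, 1),  # LSU writeback port (write-only)
    (0, 1),  # LSU writeback port (write-only)
]
reads = sum(r for r, w in ports)
writes = sum(w for r, w in ports)
print(f"{reads}R{writes}W")  # 10R6W
```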
     
  9. Laurent06

    Regular

    Joined:
    Dec 14, 2007
    Messages:
    715
    Likes Received:
    33
    One simple way to estimate it is to use the maximum IPC. If a CPU can sustain 3 adds per cycle then you need 6 read ports and 3 write ports.

    This might help: http://my.ece.msstate.edu/faculty/reese/EE8273/lectures/memarch/memarch.pdf
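    That rule of thumb is easy to write down. A sketch (the 2-read/1-write per-op default matches the simple ALU-op case above; real designs can undershoot this worst case via forwarding or banking, as mentioned earlier in the thread):

```python
def port_estimate(ops_per_cycle, reads_per_op=2, writes_per_op=1):
    """Worst-case (read, write) port count to sustain a given issue width."""
    return (ops_per_cycle * reads_per_op, ops_per_cycle * writes_per_op)

print(port_estimate(3))  # 3 adds/cycle -> (6, 3), i.e. 6R3W
```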
     
  10. Laurent06

    Regular

    Joined:
    Dec 14, 2007
    Messages:
    715
    Likes Received:
    33
    That's marketing BS at its best or worst :p Algorithms that work with training for branch prediction and data prefetching have been described many years ago. Look for perceptron for instance.
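    For the curious, a minimal perceptron predictor in the spirit of Jimenez and Lin's scheme looks like this. It is a sketch only: the table size, history length and training threshold here are arbitrary choices, not any shipping design's parameters.

```python
class PerceptronPredictor:
    def __init__(self, n_entries=64, hist_len=8, theta=16):
        self.theta = theta
        # One weight vector per table entry; index 0 is the bias weight.
        self.w = [[0] * (hist_len + 1) for _ in range(n_entries)]
        self.history = [1] * hist_len            # +1 taken, -1 not taken

    def predict(self, pc):
        w = self.w[pc % len(self.w)]
        y = w[0] + sum(wi * h for wi, h in zip(w[1:], self.history))
        return y >= 0, y                         # taken iff dot product >= 0

    def update(self, pc, taken):
        pred, y = self.predict(pc)
        t = 1 if taken else -1
        # Train only on a mispredict or a low-confidence output.
        if pred != taken or abs(y) <= self.theta:
            w = self.w[pc % len(self.w)]
            w[0] += t
            for i, h in enumerate(self.history):
                w[i + 1] += t * h
        self.history = self.history[1:] + [t]    # shift in the outcome

p = PerceptronPredictor()
for _ in range(50):          # an always-taken branch at a hypothetical PC
    p.update(0x40, True)
print(p.predict(0x40)[0])    # True: the weights have learned the bias
```

    The appeal in hardware is that the dot product over a +1/-1 history reduces to adds and subtracts, which is what makes "neural network" plausible at CPU clock speeds.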
     
  11. Alexko

    Veteran Subscriber

    Joined:
    Aug 31, 2009
    Messages:
    4,496
    Likes Received:
    910
    I could be wrong, but I think that xEx is under the impression that we're talking about something persistent, that if he plays Crysis 4 every day, the game will get faster day after day.
     
  12. xEx

    xEx
    Regular Newcomer

    Joined:
    Feb 2, 2012
    Messages:
    939
    Likes Received:
    398
    I'm just trying to deduce what AMD's statement, "uses self-training neural network techniques to make smarter decisions on when to pre-fetch data or instructions for a core. That could result in systems that get modestly better over time at recurring tasks", means for users, because the literal reading is that the CPU gets better over time when it's doing the same task. Whether that "over time" means milliseconds, seconds, minutes or hours we don't know just yet. It may mean that the first couple of ms are bad and then it gets good (like the first couple of frames taking longer to render than the rest, or the game gaining performance after some time playing).
     
  13. Malo

    Malo Yak Mechanicum
    Legend Veteran Subscriber

    Joined:
    Feb 9, 2002
    Messages:
    7,038
    Likes Received:
    3,108
    Location:
    Pennsylvania
    If the comment is even possibly related to games at all then any potential gains would flatten very quickly.
     
  14. Infinisearch

    Veteran Regular

    Joined:
    Jul 22, 2004
    Messages:
    739
    Likes Received:
    139
    Location:
    USA
    Metal layers are what the buses are made of, right? That's basically why I thought they would prove to be a limitation. Also, do you know where I can read about the interplay between design and layout? I've taken a course on basic microprocessor design but have always wondered how layout constraints affect design, and I'd like to read about it.

    Where did you get this information from? Can you post a link?
     
  15. 3dilettante

    Legend Alpha

    Joined:
    Sep 15, 2003
    Messages:
    8,122
    Likes Received:
    2,873
    Location:
    Well within 3d
    The demands of a high-clock CPU would keep the complexity of any "neural network" low, and the amount of context low. What a "task" is would be correspondingly simple.
    Without some kind of off-die and possibly non-volatile storage, the time frame would also need to be as short as the life span of a branch predictor or prefetch stream entry, which is closer to the lifetime of a hot code region--and then probably only for that time slice.

    The other constraint is the limited scope of what the CPU can perceive about the task it is running. Without outside storage and software, it sees a narrow range of values with no particular idea of what its "task" is globally.
    The branch predictor for AMD has been a perceptron for a while now. Prefetch logic is unclear, but it might be that a perceptron or similar limited neural net could be applied there for non-strided accesses.
    Within a core, maybe some limited level of prediction for scheduler allocation?
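    For contrast with the neural approaches speculated about above, the classic baseline is a stride prefetcher: it only catches constant-stride streams, which is exactly the gap a perceptron-style prefetcher for non-strided accesses would aim to fill. A sketch (table indexing and confidence handling are simplified; the addresses and PC below are made up):

```python
class StridePrefetcher:
    def __init__(self):
        self.table = {}   # pc -> (last_addr, stride, confidence)

    def access(self, pc, addr):
        """Record a demand access; return a prefetch address or None."""
        last, stride, conf = self.table.get(pc, (addr, 0, 0))
        new_stride = addr - last
        if new_stride == stride and stride != 0:
            conf = min(conf + 1, 3)   # same nonzero stride seen again
        else:
            conf = 0                  # pattern broken, reset confidence
        self.table[pc] = (addr, new_stride, conf)
        # Prefetch one stride ahead once the pattern has repeated.
        return addr + new_stride if conf >= 2 else None

pf = StridePrefetcher()
out = [pf.access(0x10, a) for a in (100, 164, 228, 292)]
print(out)  # locks on after the stride repeats: [None, None, None, 356]
```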
     
    Alexko, sebbbi and Laurent06 like this.
  16. xEx

    xEx
    Regular Newcomer

    Joined:
    Feb 2, 2012
    Messages:
    939
    Likes Received:
    398
    Yeah, I agree. You can make a "neural net" that trains itself and learns over time, but making something like that inside the die of a CPU is something completely different. It is very good marketing material for sure, though.
     
  17. 3dilettante

    Legend Alpha

    Joined:
    Sep 15, 2003
    Messages:
    8,122
    Likes Received:
    2,873
    Location:
    Well within 3d
    Metal layers provide the connections between transistors and units, and the impact of wire delay and congestion is a major factor. How much can be supplied to a limited area is a function of how many layers could be available, but it's not a straightforward limit on the number of register file ports without knowing what the design's target is, how much is invested in engineering the units for optimal routing, and what trade-offs like banking or clustering are in play.

    It's a scattering of site references that I try to remember over time.
    One item is rather old now, but the difficulties in getting significant improvement in wires have been well-proven since then.
    http://www.realworldtech.com/shrinking-cpu/4/

    Physical distance impacts delay and power, and items like the number of ports have implications for the size of the SRAM and the number of resources needed to access it (decoders for each register ID, sense amps for each new port, more access-network wiring to route the data, etc.). That's more wires, more distance, more complicated characterization of interference, and increasing congestion under the ceiling of the number of available layers.
    Register access speed and reliability trade off against density and drive strength--which perversely favors bigger transistors.
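    The port-cost story above follows from a common rule of thumb: each added port contributes roughly one wordline in one dimension of the bit cell and one bitline in the other, so cell area grows about quadratically with port count. A toy model (the constants are arbitrary; only the scaling trend is the point):

```python
def relative_cell_area(ports, base=2):
    """Rule-of-thumb bit-cell area: quadratic in port count.

    `base` loosely models the fixed overhead of the storage cell itself.
    """
    return (base + ports) ** 2

# Doubling the ports roughly triples the cell area in this model.
for p in (3, 6, 12):
    print(p, relative_cell_area(p) / relative_cell_area(3))
```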

    There are also comp.arch discussions including designers from AMD or elsewhere where they talk about how X unit or the forwarding distance would have expanded the physical distance traversed, and would have exceeded the time budget allotted by the design target.
     
  18. pTmdfx

    Newcomer

    Joined:
    May 27, 2014
    Messages:
    249
    Likes Received:
    129
    Probably this one for the FP PRF. I don't remember exactly which (IIRC both were ISSCC conference papers), but both were definitely available in IEEE Xplore.
    http://ieeexplore.ieee.org/abstract/document/5746227/

    Edit: This figure sums up those data points in the paper.
    http://pc.watch.impress.co.jp/img/pcw/docs/430/801/9.jpg
     
  19. sebbbi

    Veteran

    Joined:
    Nov 14, 2007
    Messages:
    2,924
    Likes Received:
    5,288
    Location:
    Helsinki, Finland
    Bobcat and Jaguar already had "neural network" branch predictors. Bulldozer didn't, but AMD later took Jaguar's branch predictor and added it to Piledriver. This was briefly mentioned in technical articles and some reviews. Now "neural network" has become a buzzword, and the marketing department uses it as well. We don't yet know how much AMD's neural network predictor has improved over Excavator's, if at all.
     
  20. fehu

    Veteran Regular

    Joined:
    Nov 15, 2006
    Messages:
    1,462
    Likes Received:
    395
    Location:
    Somewhere over the ocean
    neural networks on the cloud with b2b 2.0
     