I Can Hazwell?

A15 and Krait both have 3-wide decode and a 128-bit FPU, but that's pretty much where the similarities end. Krait has a shorter pipeline and a low-latency, really weird cache subsystem. In comparison, A15 will have higher latencies and higher clocks. Whether the power consumption will blow up when it's taken to those clocks is a whole other (and as yet unknown) issue.

I don't think that Krait is in any way a good proxy for A15 performance. In fact, I simply think that there isn't enough published data on A15 to make any sort of informed judgement yet.

There's a ton of published information on Cortex-A15; it's Krait that we know close to nothing about. Your information, taken from AnandTech, pretty much sums it up, where "short pipeline" and "low latency" are incredibly vague descriptions. In actuality we don't know what the fetch bandwidth is, we don't know what its integer execution unit resources are, we don't know if it can support simultaneous loads and stores, we don't know how deep its reordering capabilities are, we don't know what its branch prediction is like... these are all things we have pretty good descriptions of for Cortex-A15. Sure, it may be established that they both have 128-bit NEON units, but what's the latency like - are you going to take at face value that its SIMD is lower latency just because Anand says it has a smaller pipeline? There's way too much missing information, and there's definitely a lot of room where Cortex-A15 could outperform Krait - and unless ARM's estimates of how it'll perform vs A9 are totally unrealistic, it will outperform Krait.

Note that a lot of Cortex-A15's long pipeline is in a frontend that can be partially bypassed if code is running from the loop buffer.

Gubbi said:
No exact science was used in my estimate, I was going by the claimed 40% IPC improvement of A15 vs A9.

That 40% number only applies to Dhrystone. ARM gave numbers of 50% improvement on integer code and 100% improvement on FP (presumably SIMD, maybe also including integer SIMD?) and memory bound stuff. They also cited that they had an internal goal to improve typical IPC by 50% over Cortex-A9, which they feel they've met.

You have to consider that, aside from the benchmark being abject garbage, there's just less room to grow with Dhrystone. It all fits in L1 cache, uses pretty predictable branches, and spends a lot of time in library functions that can be hand optimized. So Cortex-A15's strengths aren't going to benefit it as much as they'll benefit real programs.
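To make the "easy problem" point concrete, here's a rough, purely illustrative C sketch of the flavour of work Dhrystone-like code does (this isn't the actual benchmark source, and the names are made up): small structs, short constant strings, trivially predictable branches, and most of the time spent in library string routines.

    #include <string.h>

    /* Everything touched fits comfortably in L1, the branch below goes the
     * same way almost every time, and the bulk of the cycles land in
     * strcpy/strcmp, which the C library can hand-optimize. */
    struct record { int kind; char name[32]; };

    static int dhrystone_like_iteration(struct record *r, int i)
    {
        strcpy(r->name, "DHRYSTONE-LIKE, SOME STRING");        /* constant source */
        r->kind = (i % 3 == 0) ? 1 : 2;                        /* predictable branch */
        return strcmp(r->name, "DHRYSTONE-LIKE, SOME STRING"); /* hot library call */
    }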
 
Based on what exactly? Why would you expect Krait to represent Cortex-A15 any better than A6? If anything the one released closer in time would be more likely to be representative wouldn't it?

Because the Scorpion core came out about a year before A9, and it wasn't a bad proxy.

By proxy I mean <20% difference. YMMV with this metric.
 
20% difference is good for a proxy? Seriously? Do you even have anything actually showing Scorpion vs A9 being typically <20% apart at the same clock speed?

Scorpion is much closer to A8 than A9, so making the comparison against the latter rather than the former seems totally disingenuous :/
 
20% difference is good for a proxy? Seriously?
Good enough for me. It's a pretty bad metric if you are doing an in-depth comparison, no doubt about that. But in terms of the user experience with actual apps, I think this much difference is not perceptible.
 
That 40% number only applies to Dhrystone. ARM gave numbers of 50% improvement on integer code and 100% improvement on FP (presumably SIMD, maybe also including integer SIMD?)

IMO, the 40% IPC increase in Dhrystone is the upper limit we will see for IPC improvements. Memory latency doesn't magically go away, so a real workload that busts out of cache is going to see less IPC improvement.

The 50% performance improvement is with frequency improvements, AFAICT (at a fixed power consumption level).

The 100% FP is only for SIMD code. The A9 doesn't track data dependencies on NEON registers; using NEON instructions thus effectively turns the A9 into an in-order processor. The A15 has two remap tables, one for ARM registers and one for NEON registers. That and the wider datapaths are going to improve SIMD code immensely, but much less for regular FP code.
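As a rough illustration of why NEON renaming matters (my sketch, using standard arm_neon.h intrinsics, not anything from ARM's documents): the two accumulator chains below are independent, so a core that renames NEON registers can keep issuing from one chain while the other is stalled, whereas a core that handles NEON effectively in order ends up serializing them.

    #include <arm_neon.h>

    /* Sums a[] and b[] with two independent NEON dependency chains.
     * Assumes n is a multiple of 4. With NEON register renaming (A15-style),
     * the adds from the two chains can overlap; without it (A9-style NEON),
     * they complete essentially in program order. */
    float32x4_t sum_two_chains(const float *a, const float *b, int n)
    {
        float32x4_t acc0 = vdupq_n_f32(0.0f);
        float32x4_t acc1 = vdupq_n_f32(0.0f);
        for (int i = 0; i < n; i += 4) {
            acc0 = vaddq_f32(acc0, vld1q_f32(a + i));  /* chain 0 */
            acc1 = vaddq_f32(acc1, vld1q_f32(b + i));  /* chain 1, independent */
        }
        return vaddq_f32(acc0, acc1);
    }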

Cheers
 
IMO, the 40% IPC increase in Dhrystone is the upper limit we will see for IPC improvements. Memory latency doesn't magically go away, so a real workload that busts out of cache is going to see less IPC improvement.

That's like saying that the performance difference between Cortex-A8 and Ivy Bridge is purely down to their cache and main memory latencies. There's a big continuum of performance opportunities based on how well you can a) extract parallelism and b) schedule to hide latency. Cortex-A15 makes big advances on both fronts. Without knowing the weaknesses of what you're starting with that's a pretty blind statement. Given that Cortex-A15 doesn't actually add much to the execution resources on the integer side, over Cortex-A8 and A9, I'd say it really is all about better management of said resources.

Besides that, Cortex-A9 implementations do tend to have relatively high L2 latency and relatively high main memory latency, so there's plenty of room for improvement; the former can actually be delivered by ARM since the L2 is tightly coupled with the CPUs again.

What really confuses me is how you can make this statement while simultaneously saying A6's CPU is higher performing - does it alone get to magically make latency go away?

The 50% performance improvement is with frequency improvements, AFAICT (at a fixed power consumption level).

No it isn't. There's no ambiguity in what ARM said.

The 100% FP is only for SIMD code. The A9 doesn't track data dependencies on NEON registers; using NEON instructions thus effectively turns the A9 into an in-order processor. The A15 has two remap tables, one for ARM registers and one for NEON registers. That and the wider datapaths are going to improve SIMD code immensely, but much less for regular FP code.

The 100% number was NOT just given for SIMD.

Everything you said about OoO applies to scalar VFP in Cortex-A9 vs Cortex-A15 just as much as it applies to NEON. The word isn't back yet, but it's also possible that there are two "real work" VFP pipes (i.e., 2x scalar FMADDs).
 
That's like saying that the performance difference between Cortex-A8 and Ivy Bridge is purely down to their cache and main memory latencies. There's a big continuum of performance opportunities based on how well you can a) extract parallelism and b) schedule to hide latency. Cortex-A15 makes big advances on both fronts.

Without knowing the weaknesses of what you're starting with that's a pretty blind statement. Given that Cortex-A15 doesn't actually add much to the execution resources on the integer side, over Cortex-A8 and A9, I'd say it really is all about better management of said resources.

The A9 has 2-wide instruction decode and retirement, a 24 entry reorder window and 4 dispatch ports.

The A15 has 3-wide decode and retirement and a 40+ entry reorder window. None of the material I've seen is more detailed than "40+" ROB entries, and none says how many dispatch ports or execution units it has. The A15 is designed for higher operating frequency. Higher operating frequency generally increases latency to the memory subsystem measured in cycles; if ARM combats this with a new cache architecture, fine, but it still means that the number of ROB entries per peak instruction throughput per cycle roughly stays the same, so don't expect more than a 50% IPC increase.
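Roughly, the arithmetic behind that expectation (my working, using the figures quoted in this post) is that the window measured in cycles of peak issue barely moves:

\[
\frac{24\ \text{entries}}{2\ \text{inst/cycle}} = 12\ \text{cycles (A9)}, \qquad \frac{40\ \text{entries}}{3\ \text{inst/cycle}} \approx 13\ \text{cycles (A15)},
\]

so if memory latency in cycles stays the same, the extra issue width (+50%) is about the only headroom left.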

What really confuses me is how you can make this statement while simultaneously saying A6's CPU is higher performing - does it alone get to magically make latency go away?

It's not my claim. The A6 enjoys a 60% IPC increase vs A9 as per Anandtech's tests. ARM claims a 40% (or 50%) IPC increase of A15 over A9. I merely tried to connect the dots.

We don't know anything about the A6 other than it has a kickass memory subsystem. Where does the performance come from? Is it 4-wide? Does it have a multi-ported D$? How big is the reorder buffer? Does it have memory disambiguation?

Cheers
 
The A9 has 2-wide instruction decode and retirement, a 24 entry reorder window and 4 dispatch ports.

Where did you read that it has a 24 entry reorder window? Or 4 dispatch ports for that matter?

This is the best Cortex-A9 reference I've seen: http://www.docstoc.com/docs/73399229/Cortex-A9-Processor-Microarchitecture

When they say "3+1" dispatch all diagrams would suggest that's either referring to the third port being capable of going to LS vs NEON/VFP, or the separate branch resolution. It's not a real quad dispatch either way.

There's no official documentation on the issue queue, but the diagram draws 6 squares, so the best guess is that it's 6 entries. Everything else about it suggests a unified scheduler. Given that ARM themselves say that 8 scheduler slots were pushing the upper limit of feasibility in their design constraints for Cortex-A15, it'd be awfully strange if Cortex-A9 had 24, although I suppose it's possible given that they were designed by two totally different teams.

The A15 has 3-wide decode and retirement and a 40+ entry reorder window. None of the material I've seen is more detailed than "40+" ROB entries, and none says how many dispatch ports or execution units it has. The A15 is designed for higher operating frequency. Higher operating frequency generally increases latency to the memory subsystem measured in cycles; if ARM combats this with a new cache architecture, fine, but it still means that the number of ROB entries per peak instruction throughput per cycle roughly stays the same, so don't expect more than a 50% IPC increase.

You're not looking very hard for information. http://www.arm.com/files/pdf/AT-Exploring_the_Design_of_the_Cortex-A15.pdf

A15 has 8 issue queues (to each execution pipeline) in 5 clusters, each with 8 slots. That's 64 entries total. It can dispatch to each of the 8 pipelines each cycle. The pipelines are 2x simple ALU, 1x branch, 1x MUL, 1x load, 1x store, and 2x NEON/VFP. Note that the ALUs bring back parallel shift + op execution, which was moved to separate stages in A9.
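Restating that layout as a sketch (the names are mine; the sizes and pipeline mix are from the document above):

    /* Cortex-A15 out-of-order back end as described above:
     * 8 execution pipelines, one 8-slot issue queue per pipeline,
     * 8 x 8 = 64 issue-queue entries total, grouped into 5 clusters. */
    #define SLOTS_PER_QUEUE 8

    enum a15_pipe {
        PIPE_ALU0, PIPE_ALU1,   /* 2x simple ALU (parallel shift + op)    */
        PIPE_BRANCH,            /* 1x branch                              */
        PIPE_MUL,               /* 1x multiply                            */
        PIPE_LOAD, PIPE_STORE,  /* 1x load + 1x store, issued in parallel */
        PIPE_NEON0, PIPE_NEON1, /* 2x NEON/VFP                            */
        NUM_PIPES               /* = 8                                    */
    };

    struct issue_queue { unsigned used; unsigned op[SLOTS_PER_QUEUE]; };
    struct a15_backend { struct issue_queue q[NUM_PIPES]; };  /* 64 entries */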

But there's way more to the comparison than just execution window, execution width, and latency to the memory subsystem. I don't think I really need to start listing things.

It's not my claim. The A6 enjoys a 60% IPC increase vs A9 as per Anandtech's tests. ARM claims a 40% (or 50%) IPC increase of A15 over A9. I merely tried to connect the dots.

Are you aware that A6 runs at up to 1.3GHz and therefore was probably running at that clock speed during Anand's tests?

We don't know anything about the A6 other than it has a kickass memory subsystem. Where does the performance come from? Is it 4-wide? Does it have a multi-ported D$? How big is the reorder buffer? Does it have memory disambiguation?

True, we don't know those things, but it seems you don't know a lot about Cortex-A15 either... 4-wide seems pretty outrageous for a phone chip.

Anyway, back to the original claim - regardless of what you think the maximum improvement Cortex-A15 can bring is, why would you think Dhrystone would be representative of the upper limit? Dhrystone is relatively static, predictable, small, and the test is designed so that you can spend a lot of the time in hand-tuned ASM. In other words, an easy problem. A lot of the hardware in Cortex-A15, quite possibly the majority of it, is designed for problems harder than Dhrystone.
 
Where did you read that it has a 24 entry reorder window? Or 4 dispatch ports for that matter?

This is the best Cortex-A9 reference I've seen: http://www.docstoc.com/docs/73399229/Cortex-A9-Processor-Microarchitecture

Reorder window size inferred from 56 rename entries with 32 needed for architected state (int+fp).
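Spelled out, the arithmetic behind that inference is just

\[
56\ \text{rename entries} - 32\ \text{entries holding architected state (int + fp)} = 24\ \text{renamed results in flight.}
\]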

Dispatch: Page 6 here

Although the diagram is confusing, it does say up to FOUR dispatches per cycle.

You're not looking very hard for information. http://www.arm.com/files/pdf/AT-Exploring_the_Design_of_the_Cortex-A15.pdf

A15 has 8 issue queues (to each execution pipeline) in 5 clusters, each with 8 slots. That's 64 entries total. It can dispatch to each of the 8 pipelines each cycle. The pipelines are 2x simple ALU, 1x branch, 1x MUL, 1x load, 1x store, and 2x NEON/VFP. Note that the ALUs bring back parallel shift + op execution, which was moved to separate stages in A9.

The amount of instructions in issue queues doesn't say anything about the rename capacity and hence the size of reorder window.

When an instruction is renamed, it is allocated an entry in the commit queue. The only time I've seen the size of the commit queue mentioned was in comp.arch on usenet two years ago, where the number 40 was mentioned.

Are you aware that A6 runs at up to 1.3GHz and therefore was probably running at that clock speed during Anand's tests?

No, I wasn't aware of that. I'm surprised Apple doesn't market it as a 1.3GHz processor then.

True, we don't know those things, but it seems you don't know a lot about Cortex-A15 either... 4-wide seems pretty outrageous for a phone chip.

Nobody, outside of ARM, knows much about A15.

Anyway, back to the original claim - regardless of what you think the maximum improvement Cortex-A15 can bring is, why would you think Dhrystone would be representative of the upper limit? Dhrystone is relatively static, predictable, small, and the test is designed so that you can spend a lot of the time in hand-tuned ASM. In other words, an easy problem. A lot of the hardware in Cortex-A15, quite possibly the majority of it, is designed for problems harder than Dhrystone.

Dhrystone runs close to the maximum of what the execution core of the CPU is capable of. A real workload is not fully contained in D$ and you then have to contend with memory latencies.

The A15 can execute 50% more instructions per cycle. That also implies that the latency of a memory operation grows by 50% measured in instructions, even if the number of cycles stays the same. In order to get a perfect 50% speedup you'd need to reduce main memory latency to 66% of what it is today.
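To make that back-of-the-envelope argument explicit (my formulation of the reasoning above): if a miss costs \(L\) cycles and the core sustains \(I\) instructions per cycle, the miss costs roughly \(L \times I\) instruction slots. For the relative cost not to get worse when IPC grows by 50%,

\[
L_{\text{A15}} \times 1.5\, I_{\text{A9}} \le L_{\text{A9}} \times I_{\text{A9}} \quad\Longrightarrow\quad L_{\text{A15}} \le \tfrac{2}{3}\, L_{\text{A9}} \approx 66\%\ \text{of the A9 latency.}
\]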

Can the A15 do that? Possibly, the tests I've seen of A9 shows a 200ns main memory latency, so there is certainly room for improvement.

Also, the datapaths are twice as wide, so that'll buy you a lot on throughput workloads (FP and media). The extra bandwidth can also be used for more aggressive prefetch, where you effectively trade bandwidth for lower latency.

Cheers
 
Dhrystone? It's an extremely old benchmark that not only can be abused with compiler optimisation (thanks Wikipedia) but will also typically fit entirely in L1. Nowadays mobile CPUs have become like the PCs of the past 15 years, with a hierarchy of L1, L2 and memory with huge relative latency, so you're not testing real performance and don't even have an excuse for it.
 
Dhrystone? It's an extremely old benchmark that not only can be abused with compiler optimisation (thanks Wikipedia) but will also typically fit entirely in L1. Nowadays mobile CPUs have become like the PCs of the past 15 years, with a hierarchy of L1, L2 and memory with huge relative latency, so you're not testing real performance and don't even have an excuse for it.

Part of my point. ARM claiming a 50% IPC increase in Dhrystone tells you nothing.

Cheers
 
Reorder window size inferred from 56 rename entries with 32 needed for architected state (int+fp).

Dispatch: Page 6 here

Although the diagram is confusing, it does say up to FOUR dispatches per cycle.

Size of physical register file/rename capability is not the same as reordering capability. Sandy Bridge, for instance, has an instruction window based on the size of its ROB (168 entries), not its integer PRF (144 entries) or floating point PRF (160 entries). You could have zero register renaming whatsoever and still provide reordering.

The document I linked is much more detailed than yours, and makes it pretty clear it doesn't have any true capability to dispatch four things in one cycle. The comment is probably counting folded branch resolution as dispatch, which is fair in the sense that it correlates to an instruction that was decoded and issued, but still not what most would consider true dispatch. But this is really nit-picking over details.

The amount of instructions in issue queues doesn't say anything about the rename capacity and hence the size of reorder window.

Sure it does. It's the issue queues that are scanned for instructions to dispatch each cycle. It is literally the pool from which eligible instructions are chosen and when it's full you can't add to the reordering capacity. Maybe you're confused by it being called a "queue." These queues are analogous to ROBs in other processors. ARM makes it very clear in the article I linked that the instruction window is dictated by the size and quantity of these queues.

Of course, since they don't have a unified scheduler, you generally won't come that close to actually utilizing the full reordering capacity; in practice it'll probably be < 40 instructions.

When an instruction is renamed, it is allocated an entry in the commit queue. The only time I've seen the size of the commit queue mentioned was in comp.arch on usenet two years ago, where the number 40 was mentioned.

You will see that the instructions are issued to the issue queues after renaming. The number 40 probably came from someone multiplying the 5 clusters by 8 instead of the 8 pipelines (the document I linked indicates that this is the partitioning of the queues)

No, I wasn't aware of that. I'm surprised Apple doesn't market it as a 1.3GHz processor then.

Since when has Apple ever marketed the MHz of anything?

Nobody, outside of ARM, knows much about A15.

Did you even read the document I linked? It's far more detailed than any Cortex-A9 document out there! It's also more detailed than most descriptions Intel or AMD has given for their CPUs. You can find some more information in the publicly visible TRM (like various buffer/cache sizes/associativities).

Dhrystone runs close to the maximum of what the execution core of the CPU is capable of. A real workload is not fully contained in D$ and you then have to contend with memory latencies.

Yes, we both agree on this.

The A15 can execute 50% more instructions per cycle. That also implies that the latency of a memory operation grows by 50% measured in instructions, even if the number of cycles stays the same. In order to get a perfect 50% speedup you'd need to reduce main memory latency to 66% of what it is today.

It can decode 50% more instructions per cycle. It can fetch 100% more instructions per cycle. It can dispatch at least 100% more instructions per cycle. Its general branch misprediction penalty is larger but its mispredict rate is better. Its loop buffer lets it bypass fetch and most of decode stages, and is probably more capable than Cortex-A9's (larger, can handle two forward branches with unknown predict capability). It can execute loads and stores in parallel. It has wider reordering capability. It has better prefetchers. It can predict indirect branches better than by just using the last thing in the BTB. It can perform shifts and ALU operations in parallel. If I'm reading things right, the load-use latency is generally one cycle where Cortex-A9 is often two. Its L2 is more tightly coupled meaning lower latency in addition to twice the interface width. It has a bigger TLB hierarchy and new partitioning to include both load and store DTLBs.
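One concrete example of the "shifts and ALU operations in parallel" point (my example, not from either document): code like the function below typically compiles to a single instruction with a shifted operand, e.g. add r0, r1, r2, lsl #2. Cortex-A15 brings back executing the shift and the add in one pass through the ALU pipe, whereas Cortex-A9 splits them across stages, adding latency.

    /* base + index*4, the classic shifted-operand pattern. */
    static inline int scaled_add(int base, int index)
    {
        return base + (index << 2);
    }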

Taking all that and putting it into a simplistic equation saying that it must need 66% lower MAIN memory latency to achieve 50% better perf/clock on average is a total farce. I don't know what you're doing here. You find out the performance by benchmarking it, but right now the best thing to go on is ARM's claim that it'll get 50% better performance.

Can the A15 do that? Possibly, the tests I've seen of A9 shows a 200ns main memory latency, so there is certainly room for improvement.

You will find that the numbers vary a lot based on which SoC we're talking about, which makes sense since the processor isn't responsible for the rest of the memory interface.

You'll also find that despite some SoCs having main memory latencies over 50% better than others, they don't usually get a huge boost in performance. Cortex-A15 is less sensitive to main memory latency than Cortex-A9 (I'm not claiming how much, but it's definitely less). Do I have to explain why?

Part of my point. ARM claiming a 50% IPC increase in Dhrystone tells you nothing.

I feel like you're not listening to me. ARM claimed 40% higher Dhrystone scores at the same MHz. They claimed 50% higher integer performance in general, again at the same MHz. The latter was not about Dhrystone. They haven't explained it further but some other charts imply this number is from SPEC.
 
Size of physical register file/rename capability is not the same as reordering capability. Sandy Bridge, for instance, has an instruction window based on the size of its ROB (168 entries), not its integer PRF (144 entries) or floating point PRF (160 entries). You could have zero register renaming whatsoever and still provide reordering.

The number of rename entries determines how many results you can rename and thus how many instructions you can have in flight. I never claimed the physical register file size had anything to do with it, other than that you need rename entries to map to non-speculated state.

The document I linked is much more detailed than yours, and makes it pretty clear it doesn't have any true capability to dispatch four things in one cycle. The comment is probably counting folded branch resolution as dispatch, which is fair in the sense that it correlates to an instruction that was decoded and issued, but still not what most would consider true dispatch. But this is really nit-picking over details.

The document you linked clearly states, on page 7, that four instructions can be dispatched per cycle, and the diagram clearly shows 4 arrows to exec pipes: two integer, one LS and one FP/NEON. On page 14 the diagram shows three arrows and one to the branch unit, so you may very well be right. To me, it isn't clear at all.

Sure it does. It's the issue queues that are scanned for instructions to dispatch each cycle. It is literally the pool from which eligible instructions are chosen and when it's full you can't add to the reordering capacity. Maybe you're confused by it being called a "queue." These queues are analogous to ROBs in other processors. ARM makes it very clear in the article I linked that the instruction window is dictated by the size and quantity of these queues.

That would make the issue queues equivalent to reservation stations/local ROBs like we know from OOO x86 CPUs.

Without a global scheduler, the OOO capabilities are much more limited than in an equivalent x86 implementation. A simple integer-rich workload with a few D$-missing loads sprinkled in could effectively limit the number of instructions in flight to the size of the integer issue queues: two simple-ALU queues × 8 slots = 16 entries.

AFAICT, if you're right, the only way to get anywhere near the maximum number of instructions in flight is FP/NEON code. There's always a surprising amount of integer work in FP code, and that way most of the issue queues could be filled (or at least see some action).

You will see that the instructions are issued to the issue queues after renaming. The number 40 probably came from someone multiplying the 5 clusters by 8 instead of the 8 pipelines (the document I linked indicates that this is the partitioning of the queues)

Since all instructions except branches and nops produce a result (branches do too in ARM, since the PC is a general register, but I expect it to be special-cased), the number of instructions in flight is limited by the number of entries in the commit queue, where results sit until speculated state is resolved (branches). That queue has 40 entries (read in an ARM document linked to in a usenet post in November 2010; the ARM document is now nowhere to be found).

Did you even read the document I linked? It's far more detailed than any Cortex-A9 document out there! It's also more detailed than most descriptions Intel or AMD has given for their CPUs. You can find some more information in the publicly visible TRM (like various buffer/cache sizes/associativities).

I did. It is not only far more detailed than any Cortex-A9 document, it is also much more confusing than any document detailing microarchitecture I've ever seen from AMD or Intel.

The commit queue looks like a data-full ROB, but it claims to be a PRF OOO implementation. The OOO capabilities look to be ample, except that they are limited by the issue queue sizes.

BTW, this is off topic for this thread. Move it?

Cheers
 
Intel to Merge Xeon and Itanium in 2015-2017

Ivy Bridge (Core i3/i5/i7) debuted in 2012
Haswell (Core i3/i5/i7) will debut in early 2013
Ivy Bridge-EP (Xeon E3/E5) should arrive in mid-2013
Ivy Bridge-E (Core i7) debuts in late 2013
Ivy Bridge-EX for critical servers (Xeon E7) debuts in late 2013
Broadwell (Core i3/i5/i7) should ship in early 2014
Haswell-EP (Xeon E3, E5) should ship by mid 2014
Haswell-E (Core i7) debuts in late 2014
Haswell-EX (Xeon E7) is planned for late 2014
Broadwell-EP (Xeon E3 / E5) is planned for mid 2015
Broadwell-E (Core i7) arrives in late 2015
Broadwell-EX (Xeon E7) is planned for late 2016


The new socket could be the one you already know - according to some sources, Intel plans to re-wire the LGA-2011 for Haswell/Broadwell, making it incompatible with Sandy Bridge/Ivy Bridge-based products. The rewiring isn't being done to support new architectures, but rather to provide more power - according to documents we saw, Intel plans to introduce 150W and up to 180W parts when the Haswell and Broadwell architectures enter the cutthroat server business.

Hmm, sounds very nice. :p 180 W CPU, I want for my desktop machine. :mrgreen:
 
Merging them is highly inaccurate. Merging the support system (socket, perhaps chipset, etc.) is accurate. We won't be seeing Itaniums on our PCs, and for good reason.
 
Merging them is highly inaccurate. Merging the support system (socket, perhaps chipset, etc.) is accurate. We won't be seeing Itaniums on our PCs, and for good reason.
I haven't read the linked article (yet), but I assume this would be preparation for a move to (relatively) painlessly kill off Itanium, since that product is dead anyway.

So, the day Intel finally pulls the plug on Itanium, customers could drop in x86 chips there instead.
 
Frankly I don't know why it took Intel so long. Back in 2006, roadmaps suggested that Xeons and Itanics would use the same chipsets in the future and that ultimately boards could support both chips (I dunno what happened with the "same chipsets", but up to now at least the sockets obviously ended up different). Remember QuickPath was initially known as CSI ("Common System Interface").
 
Intel has always done the minimum regarding socket compatibility: they had three generations of Socket 370 and four of Socket 775, and each time the motherboards were backwards compatible but never forward compatible (millions of computers are stuck with a Pentium 4 and can't get a Core 2 Celeron).
Or there's Socket 1156 and 1155, where everyone has already forgotten what the new socket brought to the table.

Intel is opportunistic; they won't care about breaking compatibility if it means the CPU will use 1% less power or something. They are also good at pushing a new platform into the distribution channels. They care more about deadlines and such.
 