Reorder window size inferred from 56 rename entries with 32 needed for architected state (int+fp).
Dispatch: Page 6, here
Although the diagram is confusing, it does say up to FOUR dispatches per cycle.
Size of physical register file/rename capability is not the same as reordering capability. Sandy Bridge, for instance, has an instruction window based on the size of its ROB (168 entries), not its integer PRF (144 entries) or floating point PRF (160 entries). You could have zero register renaming whatsoever and still provide reordering.
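To make the distinction concrete, here's a toy model of the point: the in-flight window is bounded by whichever resource fills first, and rename registers only bind for instructions that actually write that register class. The ROB and PRF sizes are the Sandy Bridge figures above; the writer fractions and free-register counts are made-up assumptions for illustration.

```python
# Toy model: the reorder window is capped by whichever resource runs out
# first. Rename registers only constrain instructions that write that
# register class, so the PRF limits are scaled by usage fraction.
def window_limit(rob, prf_free, writer_fraction):
    limits = [rob]  # every in-flight instruction occupies a ROB entry
    for free, frac in zip(prf_free, writer_fraction):
        if frac > 0:
            # how many in-flight instructions this PRF can sustain
            limits.append(free / frac)
    return min(limits)

# Sandy Bridge ROB = 168. Assume (made-up) ~128 int and ~144 fp physical
# registers are free for speculation, and a mix where 60% of instructions
# write an int register and 20% write an fp register.
print(window_limit(168, [128, 144], [0.6, 0.2]))  # the ROB binds, not a PRF
```

In this (invented) mix the PRFs could cover 200+ in-flight instructions, so the 168-entry ROB is what defines the window, which is the point: rename capacity and reordering capacity are different limits.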
The document I linked is much more detailed than yours, and makes it pretty clear the core doesn't have any true capability to dispatch four things in one cycle. The comment is probably counting folded branch resolution as dispatch, which is fair in the sense that it correlates to an instruction that was decoded and issued, but it's still not what most would consider true dispatch. But this is really nit-picking over details.
The amount of instructions in issue queues doesn't say anything about the rename capacity and hence the size of reorder window.
Sure it does. It's the issue queues that are scanned for instructions to dispatch each cycle. It is literally the pool from which eligible instructions are chosen and when it's full you can't add to the reordering capacity. Maybe you're confused by it being called a "queue." These queues are analogous to ROBs in other processors. ARM makes it very clear in the article I linked that the instruction window is dictated by the size and quantity of these queues.
Of course, since they don't have a unified scheduler, you generally won't come that close to actually utilizing the full reordering capacity; in practice it'll probably be < 40 instructions.
When an instruction is renamed, it is allocated an entry in the commit queue. The only time I've seen the size of the commit queue mentioned was in comp.arch on usenet two years ago, where the number 40 was mentioned.
You will see that the instructions are issued to the issue queues after renaming. The number 40 probably came from someone multiplying 8 entries by the 5 clusters instead of the 8 pipelines (the document I linked indicates that this is how the queues are partitioned).
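A minimal sketch of that arithmetic, for clarity (the 8-entries-per-queue figure is an assumption drawn from this discussion, not a confirmed number):

```python
# With distributed issue queues, the reordering pool is the sum of the
# individual queue capacities. The per-queue entry count is assumed.
ENTRIES_PER_QUEUE = 8
print(5 * ENTRIES_PER_QUEUE)  # multiplying by the 5 clusters gives 40
print(8 * ENTRIES_PER_QUEUE)  # multiplying by the 8 pipelines gives 64
```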
No, I wasn't aware of that. I'm surprised Apple doesn't market it as a 1.3GHz processor then.
Since when has Apple ever marketed the MHz of anything?
Nobody, outside of ARM, knows much about A15.
Did you even read the document I linked? It's far more detailed than any Cortex-A9 document out there! It's also more detailed than most descriptions Intel or AMD has given for their CPUs. You can find some more information in the publicly visible TRM (like various buffer/cache sizes/associativities).
Dhrystone runs close to the maximum of what the execution core of the CPU is capable of. A real workload is not fully contained in D$ and you then have to contend with memory latencies.
Yes, we both agree on this.
The A15 can execute 50% more instructions per cycle. That also implies that the latency of a memory operation grows by 50% measured in instructions, even if the number of cycles stays the same. In order to get a perfect 50% speedup you'd need to reduce main memory latency to 66%.
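The arithmetic behind that 66% figure can be sketched with a toy model, assuming execution time splits cleanly into core-bound cycles and non-overlapped memory stall cycles (all counts below are invented for illustration):

```python
# Toy execution-time model: core cycles plus non-overlapped main-memory
# stall cycles. All numbers are made-up illustrations, not measurements.
def exec_cycles(instructions, ipc, misses, mem_latency_cycles):
    return instructions / ipc + misses * mem_latency_cycles

base = exec_cycles(1_000_000, 1.0, 10_000, 200)       # A9-like strawman
core_only = exec_cycles(1_000_000, 1.5, 10_000, 200)  # 50% higher IPC alone
print(base / core_only)   # well short of 1.5x: the stalls didn't shrink

faster_mem = exec_cycles(1_000_000, 1.5, 10_000, 200 * 2 / 3)
print(base / faster_mem)  # exactly 1.5x once latency drops to 66%
```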
It can decode 50% more instructions per cycle. It can fetch 100% more instructions per cycle. It can dispatch at least 100% more instructions per cycle. Its general branch misprediction penalty is larger but its mispredict rate is better. Its loop buffer lets it bypass fetch and most of decode stages, and is probably more capable than Cortex-A9's (larger, can handle two forward branches with unknown predict capability). It can execute loads and stores in parallel. It has wider reordering capability. It has better prefetchers. It can predict indirect branches better than by just using the last thing in the BTB. It can perform shifts and ALU operations in parallel. If I'm reading things right, the load-use latency is generally one cycle where Cortex-A9 is often two. Its L2 is more tightly coupled meaning lower latency in addition to twice the interface width. It has a bigger TLB hierarchy and new partitioning to include both load and store DTLBs.
Taking all that and putting it into a simplistic equation saying that it must need 66% lower MAIN memory latency to achieve 50% better perf/clock on average is a total farce. I don't know what you're doing here. You find out the performance by benchmarking it, but right now the best thing to go on is ARM's claim that it'll get 50% better performance.
Can the A15 do that? Possibly, the tests I've seen of A9 shows a 200ns main memory latency, so there is certainly room for improvement.
You will find that the numbers vary a lot based on which SoC we're talking about, which makes sense since the processor isn't responsible for the rest of the memory interface.
You'll also find that despite some SoCs having main memory latencies over 50% better than others, they don't usually get a huge boost in performance. Cortex-A15 is less sensitive to main memory latency than Cortex-A9 (I'm not claiming how much, but it's definitely less). Do I have to explain why?
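The mechanism can be sketched with a toy memory-level-parallelism model (the miss counts and overlap factors are invented for illustration): a wider reorder window keeps more independent misses in flight at once, so the same latency costs fewer stall cycles.

```python
# Toy model: stall time scales with latency divided by how many misses the
# out-of-order window can keep in flight (MLP). Numbers are invented.
def stall_cycles(misses, latency, mlp):
    return misses * latency / mlp

narrow = stall_cycles(10_000, 200, 1)  # A9-like: little overlap
wide = stall_cycles(10_000, 200, 2)    # A15-like: deeper window overlaps pairs
print(narrow, wide)  # the wider core loses half as many cycles to the
                     # same 200-cycle latency, i.e. it is less sensitive
```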
Part of my point. ARM claiming a 50% IPC increase in Dhrystone tells you nothing.
I feel like you're not listening to me. ARM claimed 40% higher Dhrystone scores at the same MHz. They claimed 50% higher integer performance in general, again at the same MHz. The latter was not about Dhrystone. They haven't explained it further but some other charts imply this number is from SPEC.