ipod touch 4G

It's a selling point for geeks.

But there really haven't been apps which require that kind of performance, unless a generally faster UI helps sell a given platform.

Responsive UI though may be more about software optimization than brute force HW resources.

Video and games could be differentiators for dual-core CPUs, but I think most people would be content with YouTube-quality videos (in which case network performance would be more pertinent than CPU/GPU). As for games, the most popular smart phone games are very simple, like Angry Birds.

No matter how good that Epic Citadel demo looked, I don't see quasi-console quality shooters taking off on smart phone platforms either, especially if they're sold at higher prices than the most popular iOS games, which go for a couple of bucks.
 
I hope there are at least some apps that do benefit. I hope dual core helps push that sort of thing. I'm sure it'll still be a high-end feature only, but it'll still be there.

Video should be moot, I mean, the fixed function hardware should have that covered still.

Really, any argument against dual core should work as an argument against ever increasing single-threaded performance capabilities of smart phones, but that hasn't stopped them from increasing rapidly and it's still a selling point, even for the non-hardcore. If anything, dual core helps better distribute the power consumption at lower clocks for properly threaded software.
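To put rough numbers on the "lower clocks" point, here is a minimal sketch using the textbook CMOS dynamic power estimate P = C * V^2 * f. The capacitance, voltages and clocks are hypothetical, picked only to show the shape of the trade-off; it ignores leakage and assumes the workload threads perfectly across both cores.

```python
def dynamic_power(capacitance, voltage, frequency_hz):
    """Textbook CMOS dynamic power estimate: P = C * V^2 * f."""
    return capacitance * voltage ** 2 * frequency_hz

# Hypothetical values, not measurements of any real SoC.
C_EFF = 1e-9  # effective switched capacitance per core, in farads (assumed)

# One core at 1 GHz needing 1.2 V, versus two cores at 500 MHz at 0.9 V.
single = dynamic_power(C_EFF, 1.2, 1.0e9)
dual = 2 * dynamic_power(C_EFF, 0.9, 0.5e9)

print(f"single core: {single:.2f} W")
print(f"dual core:   {dual:.2f} W")
print(f"dual/single: {dual / single:.2f}")
```

The saving comes entirely from the assumed voltage drop at the lower clock; at equal voltage the two configurations come out the same in this model.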
 
Okay, good, I'm on the same page then ^^

I think 8x60 should be on a pretty level playing field against OMAP4 and Tegra 2, even if it's in-order; the clock speed advantage probably helps negate that a little. Hopefully dual core is going to be the real selling point for these platforms.

Scorpion is actually partially OoOE. It's just very restrictive about which instruction combos can be dispatched OoO. Dhrystone puts it at around 2.1-2.2 DMIPS/MHz which, while not as high as the A9, is a bit above the A8.
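For anyone unfamiliar with the unit, a quick sketch of how a DMIPS/MHz figure is derived: the raw Dhrystones-per-second score is divided by 1757 (the VAX 11/780 score conventionally taken as 1 DMIPS), then by the clock in MHz. The run below is hypothetical and is not a measurement of Scorpion or any other core.

```python
VAX_11_780_DHRYSTONES_PER_SEC = 1757  # conventional baseline for 1 DMIPS

def dmips_per_mhz(dhrystones_per_sec, clock_mhz):
    """Convert a raw Dhrystone score into DMIPS per MHz."""
    dmips = dhrystones_per_sec / VAX_11_780_DHRYSTONES_PER_SEC
    return dmips / clock_mhz

# Hypothetical run: 3,700,000 Dhrystones/s on a core clocked at 1000 MHz.
score = dmips_per_mhz(3_700_000, 1000)
print(f"{score:.2f} DMIPS/MHz")
```

Note how small the gap between "2.0" and "2.2" is in raw terms, which is well within what a compiler or C library change could move.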
 
I've heard that claim, but I've also heard Intel claim that Atom was partially OoO because it could execute integer and floating point in parallel, like most other designs (including Cortex-A8 and the original Pentium). Without really knowing anything else about it, one can't say for sure how genuine the claim is. To me the biggest strengths of OoOE are being able to execute ahead of loads and their immediate dependencies (especially in the event of L1 misses) and speculative execution/register renaming.

I take Dhrystone figures from different vendors with a grain of salt too. Especially when they're this close. There are a lot of small improvements that could be made to Cortex-A8 without making it out of order in any sense.

Of course, I welcome any further information if you have it.
 
I believe Dhrystone is entirely integer bound. That being said, I agree that it shouldn't be the final word. I'm not sure if anyone's done native code testing of a Scorpion device; it should be relatively easy to see, based on instruction ordering, whether any operation reordering is occurring.
 
Of course, "integer bound" really just means "no floating point", which is fine and all.. it may also mean that it runs entirely within L1 cache, which should be the case for Dhrystone. But it definitely still makes memory accesses, pretty extensively, so some simple parameters like load-use penalty, dual issue on loads/stores, address generation interlock.. things like that, can make a difference in performance. Also branch prediction.

A problem with Dhrystone is that it's a C program, so the performance is heavily dependent on the compiler and runtime used. Even if both CPUs use the same instruction set (as is the case with Cortex-A8 and Scorpion) they could still be using different compiler versions with different compiler options applied. Even if they're using the same version and options, the compiler can still be generating better code for one platform than the other, for instance if GCC's scheduler for Scorpion is somehow superior to the Cortex-A8 scheduler. For (mostly?) in-order processors, compiler scheduling is very important.
 
Well yes, by "integer bound" I mean it doesn't use the FPU. The integer and load/store usually share a dispatch queue whereas FP (including FP load/store) is its own separate entity.

My point is that Dhrystone is supposed to (don't know how well it does) flood the processor with little memory interaction. Now, it may fit in L1 or in L2, but how well the CPU schedules from those and masks read latencies is part of the processor's charm, if you will.

I'm going to go ahead and trust that ARM -- with their compiler -- publishes the best Dhrystone numbers they could tweak out of it for their A8 publications. I don't think Qualcomm has their own compiler team so I assume they either used GCC (which, IIRC, does not generate very good ARM code) or more likely just compiled the program to target an A8; I don't recall seeing an option in the armasm for Scorpion.

Now, you're right that many different things can account for a slightly higher DMIPS/MHz. However, it does mean the claim of partial OoOE isn't out of the question.
 
Scorpion is actually partially OoOE. It's just very restrictive about which instruction combos can be dispatched OoO. Dhrystone puts it at around 2.1-2.2 DMIPS/MHz which, while not as high as the A9, is a bit above the A8.

The fact that anyone anywhere is comparing anything using Dhrystone in this day and age is just shockingly sad. Dhrystone is effectively completely useless for comparing performance for any application written in the past 15 years. It's really quite bad, and bad form to even use it anymore. It's like comparing performance using bogomips or setting your locker combination to 1234; only an idiot would do it. <insert spaceballs joke here>
 
I couldn't agree more. One huge dependency Dhrystone has is the performance of the string functions in the C library. Also the rule that forbids function inlining is utterly stupid.

Even CoreMark, though less dumb, doesn't seem that meaningful.
 
I'm trying to google more on this.

The only really root source I could find was Anand's comment:

"Qualcomm claims the ability to do some things out of order, but by and large the pipeline is in order which ultimately keeps it out of the A9 classification."

This can, of course, mean just about anything.. but I think we should take that "by and large" comment as detracting from its pertinence.
 
OoOE is a varying scale. Even the A9 isn't OoOE in the sense used by desktop processors. There is no rename register (albeit there is a WB buffer), and stalls halt entire sub-pipes.

Scorpion is obviously a lot closer to strict in-order than out-of-order (at least, based on its relative performance to the A8). But we don't know how much of its performance is due to design similarity and how much is due to trade-offs; it has a longer pipeline, which should theoretically, even with a better branch predictor, lead to penalties in branchy code.
 
OoOE is a varying scale. Even the A9 isn't OoOE in the sense used by desktop processors. There is no rename register (albeit there is a WB buffer), and stalls halt entire sub-pipes.

I'm not sure if you're referring to something specific to the load-store pipe by "a rename register", or if you merely mean register renaming which Cortex-A9 TRM indicates:

"The register renaming scheme facilitates out-of-order execution in Write-after-Write (WAW) and Write-after-Read (WAR) situations for the general purpose registers and the flag bits of the Current Program Status Register (CPSR).

The scheme maps the 32 ARM architectural registers to a pool of 56 physical 32-bit registers, and renames the flags (N, Z, C, V, Q, and GE) of the CPSR using a dedicated pool of eight physical 9-bit registers."
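As a rough illustration of what that TRM passage describes, here is a toy renamer. It is a sketch only, not the A9's actual mechanism: it uses a hypothetical 4 architectural and 8 physical registers rather than 32 and 56, ignores the CPSR flags, and never returns physical registers to the free pool (a real core reclaims them at retirement). The hazard counter is also naive, checking every earlier/later pair.

```python
def rename(instructions, num_arch=4, num_phys=8):
    """Map (dest, src_a, src_b) architectural registers to physical ones.

    Every write gets a fresh physical register, so only true
    read-after-write (RAW) dependencies survive.
    """
    mapping = {r: r for r in range(num_arch)}   # arch reg -> current phys reg
    free = list(range(num_arch, num_phys))      # unused physical registers
    renamed = []
    for dest, a, b in instructions:
        pa, pb = mapping[a], mapping[b]         # look up sources first
        pd = free.pop(0)                        # fresh register for the write
        mapping[dest] = pd
        renamed.append((pd, pa, pb))
    return renamed

def hazards(instrs):
    """Count (RAW, WAR, WAW) pairs between earlier and later instructions."""
    raw = war = waw = 0
    for i, (d1, a1, b1) in enumerate(instrs):
        for d2, a2, b2 in instrs[i + 1:]:
            raw += d1 in (a2, b2)   # later reads what earlier wrote
            war += d2 in (a1, b1)   # later overwrites what earlier read
            waw += d1 == d2         # both write the same register
    return raw, war, waw

program = [
    (1, 2, 3),  # r1 = r2 op r3
    (2, 0, 0),  # r2 = r0 op r0   (WAR on r2 against the first instruction)
    (1, 2, 0),  # r1 = r2 op r0   (WAW on r1, RAW on r2)
]

print("before:", hazards(program))
print("after: ", hazards(rename(program)))
```

After renaming, the WAR and WAW counts drop to zero while the one real RAW dependency remains, which is why the second and third instructions are no longer forced to wait behind the first.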

Writeback buffering has been a feature of ARM processors since ARM9, for what it's worth (many other pretty old small/embedded designs have it too)

I take it the long pole ramification of stalls blocking the sub-pipe would suggest an inability to reorder loads and provide, for instance, hit under miss. I figured Cortex-A9 did have this capability since ARM11 did, but I can't find any mention of it. The product brief does say this:

"Dependent load-store instructions can be forwarded for resolution within the memory system -
further reduces pipeline stalls and significantly accelerating the execution of high level code accessing complex data structures or invoking C++ functions."

But I have no idea if that has any actual relevance.

Scorpion is obviously a lot closer to strict in-order than out-of-order (at least, based on its relative performance to the A8). But we don't know how much of its performance is due to design similarity and how much is due to trade-offs; it has a longer pipeline, which should theoretically, even with a better branch predictor, lead to penalties in branchy code.

We don't even have good benchmarks, so how are we comparing them? Dhrystone's branches probably predict well, even if you ignore all the other flaws in using its results. We also don't really know anything about its pipeline outside of it being "longer" (and I've seen mention of differently sized pipelines, à la ARM11, which is quite likely where the OoO aspects come from), and that doesn't even really tell you the mispredict penalty.
 
I'm not sure if you're referring to something specific to the load-store pipe by "a rename register", or if you merely mean register renaming which Cortex-A9 TRM indicates:

"The register renaming scheme facilitates out-of-order execution in Write-after-Write (WAW) and Write-after-Read (WAR) situations for the general purpose registers and the flag bits of the Current Program Status Register (CPSR).

The scheme maps the 32 ARM architectural registers to a pool of 56 physical 32-bit registers, and renames the flags (N, Z, C, V, Q, and GE) of the CPSR using a dedicated pool of eight physical 9-bit registers."

Interesting. For some reason I was under the impression it handled WAWs and WARs with a large write-back buffer instead of renaming registers. My mistake.

Writeback buffering has been a feature of ARM processors since ARM9, for what it's worth (many other pretty old small/embedded designs have it too)

Having a write-back stage, yes. Using it as your OoOE data window, with early-out support of your pipeline, no.

I take it the long pole ramification of stalls blocking the sub-pipe would suggest an inability to reorder loads and provide, for instance, hit under miss.

Re-ordering can occur, but depending on the conditions for dispatch into the execute pipe, a load can still stall the entire execute sub-pipe, for instance if it does not return data in the time expected. Waiting for data availability before dispatching can avoid this, but you lose performance in the common case.
 