Next-Gen iPhone & iPhone Nano Speculation

An 800MHz dual A9 seems to be able to parse 99% of the web pages out there in under a second. While I don't own an iPad 2, I've used them extensively and I honestly cannot think of a time when I was not bottlenecked by the network, even on WiFi.
I'm not sure it makes sense to say you're completely bottlenecked by the network. HTTP works as a back-and-forth mechanism: you need to do some amount of parsing (usually small, but potentially much higher with JavaScript/AJAX) to determine what you need to ask the web server to send next (e.g. images, dynamic content, etc.), and once you've got everything you need after one or more round trips you need to do the final parsing (though it's usually possible to display something before then). I'm also not sure whether Android is slightly slower than iOS, but if it is, that's what SoCs should be judged on.
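As a concrete sketch of that round-trip dependency: the browser can only discover a page's sub-resources by parsing the HTML it has already received, so parsing gates the next network request. The page markup and resource names below are invented for the example.

```python
from html.parser import HTMLParser

class ResourceFinder(HTMLParser):
    """Collect sub-resource URLs that only parsing can reveal."""
    def __init__(self):
        super().__init__()
        self.resources = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag in ("img", "script") and "src" in attrs:
            self.resources.append(attrs["src"])
        elif tag == "link" and attrs.get("rel") == "stylesheet":
            self.resources.append(attrs["href"])

# Hypothetical first response from the server (round trip #1).
page = """<html><head>
<link rel="stylesheet" href="style.css">
<script src="app.js"></script>
</head><body><img src="logo.png"></body></html>"""

finder = ResourceFinder()
finder.feed(page)
# Only after parsing does the browser know what round trip #2 must fetch.
print(finder.resources)  # ['style.css', 'app.js', 'logo.png']
```

With scripts that inject further resources, the same parse-then-fetch cycle can repeat several times before the page is complete.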

Still, a few quick tests with my 4S on WiFi do indicate that things load faster than I remembered, so you're probably right in practice.

Isn't the A7 supposed to be fairly quick? Faster than the A8, and scaling to around 1GHz? Two of those would seem sufficient and would pretty much match the iPad 2 of today, which, as I've pointed out, does not lack in browser performance.
A7 is faster than A8 for web browsing per clock, but clocks significantly lower. The idea is that on 28nm High-K it will be able to clock at more than 1GHz which seems very believable (and more aggressive licensees like Broadcom might get it up to 1GHz on 40nm), but on a comparable process you might get the Cortex-A9 at 1.5GHz or more, so the real performance difference versus the A9 on the same process is at least 2x. I'm also pretty sure that in practice, an A7@1GHz would be slower than an A9@800MHz (although I'm not sure by how much). It's still an extremely good standalone core though, since in many cases two A7s could be cheaper, faster, and lower power than a single A9.

That point has been addressed. Very few consumer-level use cases, let alone mobile ones, will be able to spread their workload evenly across 4 threads.
Completely agreed although I'm curious how important that is in practice if you have sufficiently fast power gating. Let's say you have two threads that take 700MHz and 300MHz to complete. Instead of running one core at 1GHz, you could run two cores at 700MHz, and power gate the 2nd core 4/7th of the time. In practice I'm somewhat skeptical OS schedulers will be smart enough to make that work efficiently... And if you need to do that many times a second because it's real-time, you've got quite a bit of power gating overhead (might you still come ahead in terms of power efficiency with clock gating? Hmm, maybe, maybe not)
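The arithmetic in that scenario can be sketched with a toy dynamic-power model. The assumptions are loud ones: power scales as f*V^2, voltage is taken as proportional to frequency, and gating overhead and leakage are ignored (the very costs the paragraph above questions).

```python
def dynamic_power(freq_ghz):
    """Toy model: P ~ f * V^2 with V assumed proportional to f, so P ~ f^3."""
    return freq_ghz ** 3  # arbitrary units

T = 1.0  # wall-clock duration of the workload, arbitrary units

# Option A: one core at 1GHz runs both threads (0.7 + 0.3 of its capacity).
energy_single = dynamic_power(1.0) * T

# Option B: two cores at 700MHz. The second core's 300MHz-worth of work
# finishes in 3/7 of the time; it is power-gated for the remaining 4/7.
energy_dual = dynamic_power(0.7) * T + dynamic_power(0.7) * (3 / 7) * T

print(energy_single)          # 1.0
print(round(energy_dual, 2))  # 0.49
```

Under those idealized assumptions the two-core option uses roughly half the energy; real gating overhead and leakage eat into that margin.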

In a perfect world, I'd also have my own volcano lair and the requisite disproportionately hot lair chick.

And a unicorn.
Disproportionately hot... compared to the volcano? Wow! :) Also, is the unicorn for you or the disproportionately hot chick?

sebbbi said:
Usually quad-core CPUs do not consume twice as much energy as dual cores at the same clocks, so we could conclude that the power efficiency is generally slightly better for quad cores (on highly threaded loads).
I'm not sure why you think 4 cores will take less than 2x the power of 2 cores at the same clocks. By definition it should scale linearly - the only thing that's lower power is 4 cores at significantly lower clock speeds (but still >0.5x) than 2 cores, and that's exclusively where the power efficiency benefit comes from.
 
How frustrating. I typed out nearly a full reply, a long one complete with links and quotes, then my tab crashed and I lost it. Grrrr.

Here we go again....

Translation: Asus cannot provide an explanation for the poor battery performance during web browsing, and has promised to ship them a new model

Lol, love how you went to the effort of translating it, lol. I know there's Google Translate.. but still..
Yeah, I did wonder about that; that's why I linked that updated review.

I'm not sure where you got the idea that 4 cores lead to a smoother base UI or application experience. That certainly doesn't hold true on any desktop platform (OSX, Windows, Linux, etc.), which would actually stress both single- and multi-core CPUs.

Just as on desktop computers, 2 cores (or to a slightly lesser extent HT) lead to a significant increase in how smoothly an OS runs compared to a single core. More cores than that typically lead to imperceptible improvements upon that nebulous experience.

Unless you run heavily threaded (and hence power-hungry) applications, you will never notice the benefits of a 4-core versus 2-core CPU. And if you are pushing 4 cores enough for the impact to be noticeable, prepare for your battery life to drop like a stone. Quad-core CPUs have been available in the consumer market for almost half a decade now. Despite that, there is still very little advantage to having one outside of a few speciality applications and a game or two.

So I'm not sure how it's an automatic win in the mobile space, where battery life is important and pushing 4 cores enough for the impact to be noticeable will make it drop like crazy.

Regards,
SB

Well, comparing desktops I suppose isn't exactly fair, as the single-threaded performance and clock speeds are far out of proportion to current mobile designs. That's why I used my N270 nettop for comparison: it is based on the same processor Anand compared in his Medfield review and is comparable to mobile performance. And as I stated, that level of IPC and clock speed DOES get bogged down very easily, even with a few tabs or other things. ICS brings multi-tabbed Chrome and also schedules better for multi-core uarchs, which does lend credence to the notion that that is where we are heading. Do you disagree with that assumption?

About the performance and smoothness: I did originally find and copy some quotes and links from an article on The Verge before my tab crashed, and despite 30 minutes of searching I can't seem to re-find it. Oh well. It did state that whilst page loads were no better, the general 'smoothness' was improved.

That point has been addressed. Very few consumer-level use cases, let alone mobile ones, will be able to spread their workload evenly across 4 threads. Most can't even manage 2. Moreover, mobile browsers kill Flash instances that are in the background, so even your heaviest laptop use case, with one of the most resource-hogging runtimes out there, won't carry over to mobile.

Yes, in a perfect world, spreading a workload evenly across 4x 500MHz A9s will be far better than running it on 2x 1GHz A9s. In a perfect world, I'd also have my own volcano lair and the requisite disproportionately hot lair chick.

And a unicorn.

I concede that in 'normal' mode all 4 cores do not run at full speed all the time. I also concede that on most occasions all 4 threads will not be used to their full potential, except in gaming, some specialised apps, and of course multitasking and multi-tabbed web browsing.

I really think you are underplaying the multi-core aspect and the battery life savings that come from it. With ICS this will only get better, and it means that you are never CPU starved. That will be especially true when 4x Krait @ 1.5-2.0GHz turns up; we will never worry about such things again.
While there's no major performance gain when it comes to loading most web pages, the difference is that parts of the workload are spread across more cores, allowing each of the cores to run at a lower frequency and thus voltage.

[cores.jpg: graph of per-core CPU utilization from the review]
I stand by my original assessment of the Prime's performance. The place you notice the additional CPU cores the most is when multitasking.

Note that even running in Normal mode and allowing all four cores to run at up to 1.3GHz, Tegra 3 is able to post better battery life than Tegra 2. I suspect this is because NVIDIA is able to parallelize some of the web page loading process across all four cores, delivering similar performance to the original Transformer but at lower frequency/voltage settings across all of the cores.
-Anand
http://www.anandtech.com/show/5175/asus-transformer-prime-followup/4

As expected, finding applications and usage models to task all four cores is pretty difficult. That being said, it's not hard to use the tablet in such a way that you do stress more than two cores. You won't see 100% CPU utilization across all four cores, but there will be a tangible benefit to having more than two.

''The bigger benefit I saw to having four cores vs. two is that you're pretty much never CPU limited in anything you do when multitasking''
http://www.anandtech.com/show/5163/asus-eee-pad-transformer-prime-nvidia-tegra-3-review/2

I had better links than those originally, which also mentioned the improved smoothness. My point has been proven with proper hands-on accounts and actual cases of benefit, both in battery life and multitasking abilities.
 
I'm not sure it makes sense to say you're completely bottlenecked by the network. HTTP works as a back-and-forth mechanism: you need to do some amount of parsing (usually small, but potentially much higher with JavaScript/AJAX) to determine what you need to ask the web server to send next (e.g. images, dynamic content, etc.), and once you've got everything you need after one or more round trips you need to do the final parsing (though it's usually possible to display something before then). I'm also not sure whether Android is slightly slower than iOS, but if it is, that's what SoCs should be judged on.

Still, a few quick tests with my 4S on WiFi do indicate that things load faster than I remembered, so you're probably right in practice.

I'm speaking from the user experience. On WiFi, I do not notice any websites slow to render on either Android or iOS. The only time a slow-down is ever noticeable is something like an ad server that takes seconds to respond. I think until web pages become far richer and more complex -- and there's no indication that they ever will, outside of games or heavy apps such as office suites (let's hope) -- there are very few use cases where 2x 800MHz A9 (or 2x-4x 1GHz A7) won't deliver a perfectly suitable experience.

A7 is faster than A8 for web browsing per clock, but clocks significantly lower. The idea is that on 28nm High-K it will be able to clock at more than 1GHz which seems very believable (and more aggressive licensees like Broadcom might get it up to 1GHz on 40nm), but on a comparable process you might get the Cortex-A9 at 1.5GHz or more, so the real performance difference versus the A9 on the same process is at least 2x. I'm also pretty sure that in practice, an A7@1GHz would be slower than an A9@800MHz (although I'm not sure by how much). It's still an extremely good standalone core though, since in many cases two A7s could be cheaper, faster, and lower power than a single A9.

Sure, but what is the power like at 28nm HPL/HPM for an A7 at 1GHz? Is it lower than a comparably performing A9 at 28nm LP/HPM/HPL at 800MHz? It's definitely smaller, so you can squeeze a multitude of those suckers in there and still have room for a behemoth A15 or Krait. My point is, this whole race to QUAD MONSTER CPUs is entirely lopsided. At best you need 2 of those and most likely 1. The concentration should be on a cluster of A7s to handle most tasks -- touch, UI, network stack, etc.

Completely agreed although I'm curious how important that is in practice if you have sufficiently fast power gating. Let's say you have two threads that take 700MHz and 300MHz to complete. Instead of running one core at 1GHz, you could run two cores at 700MHz, and power gate the 2nd core 4/7th of the time. In practice I'm somewhat skeptical OS schedulers will be smart enough to make that work efficiently... And if you need to do that many times a second because it's real-time, you've got quite a bit of power gating overhead (might you still come ahead in terms of power efficiency with clock gating? Hmm, maybe, maybe not)

Power-gating overhead depends on the size of the core. Also keep in mind the cost of reloading the cache, restoring architectural state, etc. On something the size of an A15, for instance, ramping the voltage rail up can take hundreds of microseconds.

But what if the task split is even more lopsided? What if a secondary, low-utilization task only requires a processor to run at 100MHz? Many, many background tasks simply pop up for a quick poll of memory and then stop. The overhead of waking up another core, not to mention the leakage associated with it, would far outweigh the benefit of running the first core 100MHz higher.
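That trade-off can be put into rough numbers. Everything below is an assumed ballpark figure, not measured data; only the shape of the comparison matters.

```python
# Ballpark, assumed numbers throughout; only the shape of the comparison matters.
WAKE_ENERGY = 200e-6  # J: rail ramp (hundreds of us), state restore, cold caches
LEAKAGE_W   = 0.02    # W leaked while the extra core is powered up
TASK_S      = 0.005   # the job is worth "100MHz for 5 ms" of CPU time
C_EFF       = 0.5     # toy effective switching-capacitance coefficient
V           = 1.0     # assume the 100MHz bump fits in the current voltage step

# Option A: wake core 2 at its 300MHz floor; it finishes in a third of the time.
wake = WAKE_ENERGY + (C_EFF * 0.3 * V**2 + LEAKAGE_W) * (TASK_S / 3)

# Option B: run core 1 100MHz higher at the same voltage for the full interval.
bump = C_EFF * 0.1 * V**2 * TASK_S

print(wake > bump)  # True with these numbers: waking the second core costs more
```

The fixed wake cost dominates for short tasks; for a long-running background thread the comparison could easily flip.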

Disproportionately hot... compared to the volcano? Wow! :) Also, is the unicorn for you or the disproportionately hot chick?

You can't own unicorns, man. They're magical.

sebbbi said:
Usually quad-core CPUs do not consume twice as much energy as dual cores at the same clocks, so we could conclude that the power efficiency is generally slightly better for quad cores (on highly threaded loads).

Quite the opposite. I'm not sure why you think twice the number of cores running at the same frequency would use less than twice the power but it's usually more since efficiency drops as you have to supply more current to the circuit region. Add to that the fact that shared components such as the coherency unit, exclusive monitor, L2 cache, prefetcher, bus interface, etc. all have to work more when there are 4 extra cores -- to keep track of ordering and consistency of access -- and it actually can take significantly more power to run 4 cores.

Add to *that* the fact that supplying higher instantaneous current almost always takes a hit on the efficiency of the power regulator.

Add to *that* the fact that it means you'll need thicker power rails in the SoC to supply all 4 cores, thus either reducing routing efficiency (which can majorly impact power) or introducing higher voltage drops (which means you'll have to ramp up voltage to maintain the same frequency).
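A toy package-power model along those lines (all coefficients are made-up, order-of-magnitude values) shows how a linear per-core term plus growing uncore and regulator terms pushes a quad above 2x a dual at identical clocks.

```python
def package_power(n_cores, freq_ghz, v_volts):
    """Toy model with made-up coefficients: per-core dynamic power plus
    uncore (snoop/L2/bus) work and regulator loss that grow with core count."""
    core_w   = n_cores * 0.5 * freq_ghz * v_volts ** 2  # C*V^2*f, C = 0.5
    uncore_w = n_cores * 0.05                           # coherency traffic per core
    reg_eff  = 0.90 - 0.01 * n_cores                    # efficiency drops with current
    return (core_w + uncore_w) / reg_eff

dual = package_power(2, 1.0, 1.0)
quad = package_power(4, 1.0, 1.0)
print(quad / dual > 2.0)  # True: more than twice the power at the same clocks
```

The quad only wins on efficiency if it can drop frequency, and with it voltage, enough to overcome these overheads.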
 
Well, comparing desktops I suppose isn't exactly fair, as the single-threaded performance and clock speeds are far out of proportion to current mobile designs. That's why I used my N270 nettop for comparison: it is based on the same processor Anand compared in his Medfield review and is comparable to mobile performance. And as I stated, that level of IPC and clock speed DOES get bogged down very easily, even with a few tabs or other things. ICS brings multi-tabbed Chrome and also schedules better for multi-core uarchs, which does lend credence to the notion that that is where we are heading. Do you disagree with that assumption?

I doubt ICS or any future revision of Android will do full background tabs like desktop web browsers do. I seriously doubt that's a common use case as well. Most background pages are rather static. The exception, of course, are pages with Flash. But Flash will bog down any system, even a desktop quad.

I concede that in 'normal' mode all 4 cores do not run at full speed all the time. I also concede that on most occasions all 4 threads will not be used to their full potential, except in gaming, some specialised apps, and of course multitasking and multi-tabbed web browsing.

I really think you are underplaying the multi-core aspect and the battery life savings that come from it. With ICS this will only get better, and it means that you are never CPU starved. That will be especially true when 4x Krait @ 1.5-2.0GHz turns up; we will never worry about such things again.

I think you're overplaying it and you're really taking speculation from journalists as gospel. For instance:


Even this graph shows that utilization barely goes above 2 cores. And even then, the workload isn't spread evenly. Let me put it to you this way, if you had a choice of waking up a core -- with leakage and cache activity and all -- and running it at 300MHz (minimum clock) to handle a job that could've easily been taken care of by clocking core 1 100MHz higher, you're wasting power.

I had better links than those originally, which also mentioned the improved smoothness. My point has been proven with proper hands-on accounts and actual cases of benefit, both in battery life and multitasking abilities.

Because you accounted for variables such as a completely new version of an OS (ICS), a low power companion core (that's actually quite capable), a new method of handling touch (bypassing the touch controller and having the companion core process user input directly) and, most dramatically, an LCD color/luma modulation scheme that reduces power of the display dramatically?

But no, it must be the quad core....
 
Quite the opposite. I'm not sure why you think twice the number of cores running at the same frequency would use less than twice the power but it's usually more since efficiency drops as you have to supply more current to the circuit region. Add to that the fact that shared components such as the coherency unit, exclusive monitor, L2 cache, prefetcher, bus interface, etc. all have to work more when there are 4 extra cores -- to keep track of ordering and consistency of access -- and it actually can take significantly more power to run 4 cores.
It of course depends on what kind of work you are doing. In the worst case the shared parts need to work twice as hard (and the scaling is likely slightly worse than linear). But in the best case, the shared parts only need to work slightly harder. For example, when the threads access the same memory regions, L2 utilization is great and the number of outgoing memory accesses stays pretty much the same. In this case you will see major efficiency gains. Of course, in the opposite case none of the threads share any data through L2 and the memory accesses scale by more than 4x (as the cache is shared and there's much more thrashing).

The big question is, how much does the shared-part energy consumption scale up/down dynamically (with low power states / enabled core count)? If the majority of the shared parts are always active, the increased load from four cores doesn't increase the shared-part energy consumption that much. A dual core already needs all of the shared parts, so fully disabling some areas is not possible. Down-clocking the shared parts (based on active core count) is of course possible, but how much is this done in current mobile chips?
 
Quite the opposite. I'm not sure why you think twice the number of cores running at the same frequency would use less than twice the power but it's usually more since efficiency drops as you have to supply more current to the circuit region. Add to that the fact that shared components such as the coherency unit, exclusive monitor, L2 cache, prefetcher, bus interface, etc. all have to work more when there are 4 extra cores -- to keep track of ordering and consistency of access -- and it actually can take significantly more power to run 4 cores.

Metafor, you just stumbled upon something without realizing it.

One of the bigger energy drains on a PC CPU is considered to be the L2 cache.

This is the reason why power usage does not always scale up the same way when going from dual to quad core.

You expect that a quad core uses x watts * 2? The problem is, in a lot of cases, you get:

- A dual core with 2MB cache uses x watts.
- A quad core with 3MB cache uses z watts, where z is NOT x * 2.

It also depends on whether the cache is split as 2 * 1.5MB (the quad core, in other words, is just two dual cores stuck together) or whether it is one shared pool.

I agree with french toast's statement: in general, a quad core made on the same process at the same speed will have lower power usage, because they made some changes to the design to deal with power usage / heat dissipation. One of the oldest tricks in the book to lower power usage is to have a smaller amount of L2 cache than what 2 * dual core would have.

french toast said:
Lol, love how you went to the effort of translating it, lol. I know there's Google Translate.. but still..
Yeah, I did wonder about that; that's why I linked that updated review.

It's just a few short sentences. Easy to translate. Google Translate is not bad, but sometimes it can alter the meaning of the text. One advantage of knowing Dutch :D

In general I don't like articles like that. They know there are flaws in their review. Battery life is one of the bigger questions people have about smartphones/tablets/laptops, and yet the article is never updated (and probably never will be, as it's not a "hot topic" anymore).

I don't always agree with Anandtech's review conclusions, but I applaud them for updating their reviews when there turn out to be flaws or problems with the original review.
 
One of the oldest tricks in the book to lower power usage is to have a smaller amount of L2 cache than what 2 * dual core would have.
But that results in more cache misses, and thus more requests to main memory. And getting data from further away requires more energy. Reducing the cache size reduces CPU power usage, but the energy used in the memory controller, bus, and memory chips will likely increase by more than the amount saved.

And it will reduce the performance, so the CPU needs to be in the high power state for a longer time to process the required tasks. Better to get to the idle state as soon as possible.
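The miss-rate side of that trade-off can be sketched with assumed per-access energies (the values below are illustrative orders of magnitude, not measurements).

```python
# Assumed per-access energies (illustrative orders of magnitude only):
E_L2   = 0.05e-9   # J per L2 hit
E_DRAM = 2.0e-9    # J per access that misses and goes out to main memory

def memory_energy(accesses, miss_rate):
    """Total energy spent serving `accesses` requests at a given miss rate."""
    return accesses * (1 - miss_rate) * E_L2 + accesses * miss_rate * E_DRAM

N = 1_000_000
big_cache   = memory_energy(N, miss_rate=0.02)  # larger L2: fewer misses
small_cache = memory_energy(N, miss_rate=0.06)  # smaller L2: more misses

# Suppose shrinking the L2 saves this much leakage/dynamic energy over the run:
L2_SAVINGS = 0.03e-3  # J (assumed)

print(small_cache - big_cache > L2_SAVINGS)  # True: extra DRAM traffic costs more
```

Because a DRAM access costs tens of times more energy than an L2 hit, even a few percentage points of extra miss rate can swamp the savings from a smaller array.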
 
Even this graph shows that utilization barely goes above 2 cores. And even then, the workload isn't spread evenly. Let me put it to you this way, if you had a choice of waking up a core -- with leakage and cache activity and all -- and running it at 300MHz (minimum clock) to handle a job that could've easily been taken care of by clocking core 1 100MHz higher, you're wasting power.

More great theories; however, at least I have provided some apples-to-apples comparisons, even if I didn't get the variety I originally typed. You keep posting theories that go against the actual reviewer and the tests...

Because you accounted for variables such as a completely new version of an OS (ICS), a low power companion core (that's actually quite capable), a new method of handling touch (bypassing the touch controller and having the companion core process user input directly) and, most dramatically, an LCD color/luma modulation scheme that reduces power of the display dramatically?

But no, it must be the quad core....

Well, I have stated I don't like the Tegra way of silly 'shadow' cores and much prefer a proper version like Krait will do.
But that doesn't discount that 'more cores' does allow that, whereas you are just saying 'duel core'. So what setup are you proposing for your 'duel core'? 1x 1.3GHz A9 and a 'shadow core'?? Once again performance would become even worse. No matter how you swing it, having ''more cores'' and spreading the load across them is more efficient than loading up any implementation of 'duel', whilst providing a better experience and offering up more powerful app potential.

Skimming the top 30-40% off 2 processors and loading it onto separate cores will help performance and power consumption, yes; that is what I am saying.

Because Nvidia coded to the cores better is a reason why you think the strategy isn't valid?... That's because it has those extra cores to thread to. That's my whole point: a 'duel core' setup doesn't have that advantage, and besides, Krait should be able to scale frequency properly right from the bottom, removing the need for a 'shadow core'.

You seem to be missing the point that the reviews I linked DID NOT HAVE ICS loaded, yet still outperformed the duel core. When newer versions of the OS get perfected, more of the load will start to get multi-threaded and power consumption will further decrease... maybe not by a huge margin, but considering all the extra performance on tap ANY improvement is welcome.

I'm not going to keep bothering to quote more stuff when there is no counter-input coming in, so here are a few glowing links that back up what I said about improved multitasking, improved smoothness, and better battery life (note some have ICS on board, some don't):
http://www.phonearena.com/reviews/Asus-Transformer-Prime-Review_id2946/page/2
http://www.pcpro.co.uk/reviews/tablets/371776/asus-eee-pad-transformer-prime
http://reviews.cnet.co.uk/ipad-and-tablets/asus-transformer-prime-review-50006423/
http://www.t3.com/reviews/asus-eee-pad-transformer-prime-review
http://www.slashgear.com/asus-transformer-prime-review-02199429/
and one explaining how vSMP works in tegra...
http://www.slashgear.com/nvidia-det...brain-of-quad-core-mobile-computing-20181062/
 
It's just a few short sentences. Easy to translate. Google Translate is not bad, but sometimes it can alter the meaning of the text. One advantage of knowing Dutch :D

In general I don't like articles like that. They know there are flaws in their review. Battery life is one of the bigger questions people have about smartphones/tablets/laptops, and yet the article is never updated (and probably never will be, as it's not a "hot topic" anymore).

I don't always agree with Anandtech's review conclusions, but I applaud them for updating their reviews when there turn out to be flaws or problems with the original review.

Ha, so you actually do speak Dutch.. impressive! ;)

Yeah, I trust Anand the most; however, I first had some doubts with the Medfield review. Not that I think he is wrong as such, just the optimistic conclusions about the architecture vs ARM and how Medfield would have 'dominated' Android last year.

Still, it is my fav 'go to' site.
 
More great theories; however, at least I have provided some apples-to-apples comparisons, even if I didn't get the variety I originally typed. You keep posting theories that go against the actual reviewer and the tests...

No, you didn't. Apples to apples means you control the other parameters and adjust one. You've compared two entirely different tablets running different software versions and using different displays, and you've forgotten about some of the biggest changes in the SoC when it comes to the very factors you're talking about. Namely:

1. Tegra 3 moves the touch control straight onto the SoC, dramatically improving input response compared to Tegra 2. This comes from nVidia's PR themselves.
2. Tegra 3 has NEON, something drastically lacking in Tegra 2 and when it comes to software rendering, helps tremendously.
3. Tegra 3 tablets modulate the LCD chroma/luma to dramatically reduce power consumption of the display -- the biggest power draw in a tablet.

You've conveniently ignored all of this and instead attributed slightly better battery life to OMG-QUAD-CORE. Hell, the power reduction in the LCD alone should've made a huge improvement in battery life, yet we see battery life is only slightly better.

Please stop using site reviews with blind speculation from tech writers to back up anything. This isn't the engadget boards.
 
I logged in just to post this. For Pete's sake, can you call it a dual core and not a "duel" core?? :rolleyes: It's been driving me nuts..

Secondly, another reason T3 is smoother than T2 is that it has higher memory bandwidth and the GPU is a lot faster (especially in ICS, where GPU acceleration is used system-wide). Also, manufacturers tend to focus on software for new products more than for old products, so Asus probably spent a lot more time optimizing software for the Transformer Prime than for the Transformer. Have any of the reviews compared the Transformer Prime to the original Transformer with the update?

Also, I want a volcano lair and a disproportionately hot chick too (a unicorn would be nice as well..)
 
It of course depends on what kind of work you are doing. In the worst case the shared parts need to work twice as hard (and the scaling is likely slightly worse than linear). But in the best case, the shared parts only need to work slightly harder. For example, when the threads access the same memory regions, L2 utilization is great and the number of outgoing memory accesses stays pretty much the same. In this case you will see major efficiency gains. Of course, in the opposite case none of the threads share any data through L2 and the memory accesses scale by more than 4x (as the cache is shared and there's much more thrashing).

If they all access the same region of memory -- unless it's purely reads -- they'll snoop the hell out of each other, which isn't just bad for power, it's bad for performance. Add to that the fact that -- at least on A9, I forget for A15 -- L2 cache access is done over the AXI bus, and you can see how it isn't just 2x the scaling but more, as you run into issues of bus contention.

The big question is, how much the shared part energy consumption scales up/down dynamically (with low power states / enabled core count)? If majority of the shared parts are always active, the increased load from four cores doesn't increase the shared part energy consumption that much. Dual core already needs all of the shared parts, so fully disabling some areas is not possible. Down clocking the shared parts (based on active core count) is of course possible, but how much is this done in the current mobile chips?

Those shared parts will have to grow in size to accommodate more cores. And often, due to the inter-locking nature of shared components such as snoop controllers and barrier support, the energy consumption grows exponentially. It's the same thing with the L2 cache itself. Granted nVidia chose to keep the L2 cache size the same with their quad-implementation, but to get equivalent performance out of twice the number of cores, you really need a far bigger L2 cache, which both eats up die area and adds significantly to leakage power.

benjiro said:
You expect that a quad core uses x watts * 2? The problem is, in a lot of cases, you get:

- A dual core with 2MB cache uses x watts.
- A quad core with 3MB cache uses z watts, where z is NOT x * 2.

It also depends on whether the cache is split as 2 * 1.5MB (the quad core, in other words, is just two dual cores stuck together) or whether it is one shared pool

You'd also end up with lower performance per dual-core cluster since you've effectively reduced its L2 cache size. I consider nVidia's choice better. They stuck with the same L2 cache size that they used for their dual-core and just implemented 2 more cores. Now, granted the benefits are marginal at best but as Arun pointed out, the actual compute core for a Cortex A9 is pretty damn small in size, so why not add it. But the key is that they manage to keep performance the same in most cases that use 1-2 cores while keeping die area and power consumption low by having a relatively small L2 cache.

Come time for Cortex A15 SoC's, the trade-offs for "let's bolt on 2 more cores for marginal benefits and marketing" won't be as favorable.

Because they made some changes to the design to deal with power usage / heat dissipation.

You could easily make those same changes in the design on a dual-core and end up with an even lower power part. Also, you're pretty bound by physics.
 
If they all access the same region of memory -- unless it's purely reads -- they'll snoop the hell out of each other which isn't just bad for power, it's bad for performance.
No sane programmer would simultaneously read and write to the same cache line from two cores. That wouldn't simply be bad for performance; it would be horrific :)
 
No sane programmer would simultaneously read and write to the same cache line from two cores. That wouldn't simply be bad for performance; it would be horrific :)

The reads and writes don't have to be simultaneous, the coherency traffic will be generated on writes so long as the data is in the L1 cache of multiple cores. That is, for your scenario where two cores access the same regions where it resides in L2.
 
The reads and writes don't have to be simultaneous, the coherency traffic will be generated on writes so long as the data is in the L1 cache of multiple cores.
Given that the protocol is MOESI, if a core writes to a line that is shared, the coherency traffic will be to just ask other cores to flag the cache line as invalid. That also means if another core wants to read the data again, the line will have to be transferred from the core that wrote that line, something you definitely don't want to happen too often :smile:

It's the reason why pieces of shared data with different use cases should not live in the same cache line.
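A minimal sketch of that rule in C (the struct and field names are illustrative, not from any real codebase): two logically independent counters that share a cache line will ping-pong that line between cores on every write, while padding each one out to its own 64-byte line avoids the coherency traffic entirely.

```c
/* Typical cache line size on the Cortex-A9 and most modern CPUs. */
#define CACHE_LINE 64

/* Bad: both counters live in the same cache line, so a write by one
 * core invalidates the line in the other core's L1 (false sharing),
 * even though the counters are logically unrelated. */
struct counters_shared {
    long a;     /* written by core 0 */
    long b;     /* written by core 1 */
};

/* Good: each counter is padded out to its own cache line, so writes
 * from different cores never touch the same line. */
struct counters_padded {
    long a;
    char pad_a[CACHE_LINE - sizeof(long)];
    long b;
    char pad_b[CACHE_LINE - sizeof(long)];
};
```

The cost is a little wasted memory (128 bytes instead of 16 here), which is almost always worth it for data that two cores hammer on independently.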
 
IMHO A7 will be at least a quad-core (four cores exposed to the OS). Just 2x Cortex-A15 in 2013 doesn't make sense for Apple if they stick with just one new SoC for all their iOS devices. It's gonna be at least 2x Cortex-A7 plus 2x Cortex-A15.

I forgot about the Cortex-A7s, largely because I expect them to only be in use when the system is idle doing some background tasks when the A15s are completely turned off. I do not expect Apple to use both the A7 and A15s at the same time during typical use.
 
The reads and writes don't have to be simultaneous, the coherency traffic will be generated on writes so long as the data is in the L1 cache of multiple cores. That is, for your scenario where two cores access the same regions where it resides in L2.
There are many ways to efficiently share the L2 between cores and minimize memory reads/writes. The easiest case is when both are reading the same memory areas and writing to their own designated areas: big gains, and no synchronization needed. Another good example is when X cores are generating items into a queue and X cores are processing items from the queue. As long as the items are large enough (several cache lines of aligned data) and the counters are not updated too often (for example, processing occurs in 64-item chunks), the data is always evicted from the L1 before another core needs it. It just requires some profiling to get the parameters right. With manual eviction, fine-tuning is of course easier (but that's not possible on all consumer platforms, especially in user mode).
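A single-threaded C sketch of that queue layout (the names and sizes are my own; real cross-core use would additionally need memory barriers or atomics on the head/tail updates — this only illustrates the cache-line layout idea):

```c
#include <stdint.h>
#include <stddef.h>

#define CACHE_LINE 64
#define QSIZE 256   /* must be a power of two */

/* Items padded to a full cache line, so a consumer reading one item
 * never pulls in a line the producer is still writing to. */
typedef struct {
    uint32_t value;
    char pad[CACHE_LINE - sizeof(uint32_t)];
} qitem;

/* Head and tail live on separate cache lines: each core writes only
 * its own counter, so updates don't invalidate the other's L1 line. */
typedef struct {
    size_t head;                            /* written by producer */
    char pad1[CACHE_LINE - sizeof(size_t)];
    size_t tail;                            /* written by consumer */
    char pad2[CACHE_LINE - sizeof(size_t)];
    qitem items[QSIZE];
} spsc_queue;

static int spsc_push(spsc_queue *q, uint32_t v) {
    if (q->head - q->tail == QSIZE) return 0;   /* full */
    q->items[q->head & (QSIZE - 1)].value = v;
    q->head++;                                  /* publish the item */
    return 1;
}

static int spsc_pop(spsc_queue *q, uint32_t *v) {
    if (q->head == q->tail) return 0;           /* empty */
    *v = q->items[q->tail & (QSIZE - 1)].value;
    q->tail++;
    return 1;
}
```

The point of the padding is that the producer only ever writes `head` and the consumer only ever writes `tail`, so neither core's writes invalidate a line the other one is polling.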

For example, a properly optimized multicore LSD radix sorter scales almost perfectly with the core count (if the sorted items plus the work buffer fit in L2). The memory traffic is not increased at all, but the sorting time drops to 1/4. The 32 kB of L1 (in the A4/A5) in each core is enough to fully store the local histograms and the top cache line of each of the 256 bins (so L2 usage isn't increased much either). There are only four synchronization points in a 32-bit (8 bits per pass) LSD radix sort (a barrier between the four passes). This is a good example of an algorithm that runs more efficiently on a four-core CPU (with an equal-sized shared L2).
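For reference, the per-pass structure being described looks like this in single-threaded form (a sketch, not the multicore version; the parallel variant would build per-core histograms and hit a barrier between the four passes):

```c
#include <stdint.h>
#include <stddef.h>

/* LSD radix sort of 32-bit keys, 8 bits per pass (4 passes, 256 bins).
 * 'tmp' must be an equally sized work buffer; because the pass count
 * is even, the sorted result ends up back in 'keys'. */
static void radix_sort_u32(uint32_t *keys, uint32_t *tmp, size_t n) {
    for (int pass = 0; pass < 4; pass++) {
        size_t hist[256] = {0};
        int shift = pass * 8;

        /* Histogram the current byte of every key. */
        for (size_t i = 0; i < n; i++)
            hist[(keys[i] >> shift) & 0xFF]++;

        /* Exclusive prefix sum turns counts into start offsets. */
        size_t sum = 0;
        for (int b = 0; b < 256; b++) {
            size_t c = hist[b];
            hist[b] = sum;
            sum += c;
        }

        /* Stable scatter into the work buffer. */
        for (size_t i = 0; i < n; i++)
            tmp[hist[(keys[i] >> shift) & 0xFF]++] = keys[i];

        /* Swap buffer roles for the next pass. */
        uint32_t *t = keys; keys = tmp; tmp = t;
    }
}
```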
 
No, you didn't. Apples to apples means you control the other parameters and adjust one. You've compared two entirely different tablets, running not only different software versions and using different displays, but you've also forgotten about some of the biggest changes in the SoC when it comes to the very factors you're talking about. Namely:

1. Tegra 3 moves the touch control straight onto the SoC, dramatically improving input response compared to Tegra 2. This comes from nVidia's PR themselves.
2. Tegra 3 has NEON, something drastically lacking in Tegra 2 and when it comes to software rendering, helps tremendously.
3. Tegra 3 tablets modulate the LCD chroma/luma to dramatically reduce power consumption of the display -- the biggest power draw in a tablet.

You've conveniently ignored all of this and instead attributed slightly better battery life to OMG-QUAD-CORE. Hell, the power reduction in the LCD alone should've made a huge improvement in battery life, yet we see battery life is only slightly better.

Please stop using site reviews with blind speculation from tech writers to backup anything. This isn't the engadget boards.

Where did I link different software versions?? Anand's review, which I originally linked, used Android 3.x in BOTH versions; just to add some variety I also linked some other ICS versions.

Also, if you bothered to read, you would understand that 'normal mode' doesn't use the power-saving LCD features that you bang on about, and that's what Anand referred to.

What you seem to discount is that the A9s are also clocked 30-40% higher, and likely have a faster 2D processor AND a higher-clocked GPU, which IS used to render web pages in Honeycomb.
...Besides, at least I have bothered, rather than just coming back with insults...

It may not be completely apples to apples, but it's pretty darn close, don't you think? Same manufacturer of tablet, same reviewer, same software, same tests, same resolution, same manufacturing process, likely same screen manufacturer (MAYBE slightly better p/c), same RAM, same battery, and same SoC manufacturer...

We are comparing two different-generation products, so with that in mind the above isn't that bad, now is it? What the hell do you expect??

I said I believe that going multi-core, up to say 4 cores over 2, is a good thing: it gives more options, at worst doesn't impact battery life, and at best can actually improve it. I have provided some decent, if not perfect, links, whilst you come back with nothing.

If you actually provided an apples-to-apples comparison that showed it didn't consume less power going multi-core, then I would accept that, as I'm obviously not as ignorant as you.
So instead of blindly picking apart my comments (inaccurately, I might add), why don't you come back with some links and evidence of your own??

In some cases you haven't even bothered to read my post or links before coming back and posting a negative comment to disagree!? :rolleyes:

*Note: sorry about the spelling of dual, I do slip from time to time.*

EDIT: I'm not carrying this on as I have nothing more to add and it's wayyy off topic.
 
Here is something new to discuss:

http://www.patentlyapple.com/patent...cture-will-it-take-ios-to-the-next-level.html

The macroscalar processor addresses this problem in a new way: at compile-time it generates contingent secondary instructions so when a data-dependent loop completes the next set of instructions are ready to execute. In effect, it loads another pipeline for, say, completing a loop, so the pipeline remains full whether the loop continues or completes. It can also load a set of sequential instructions that run within or between loops, speeding execution as well.
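To make that concrete, here is the kind of data-dependent loop such a design is aimed at (illustrative C, not taken from the patent): the exit iteration isn't known until runtime, so a conventional pipeline keeps predicting "loop again" and eats a misprediction flush when the loop finally exits, whereas the macroscalar idea is for the compiler to emit the post-loop instructions as a contingent stream so the pipeline stays full either way.

```c
#include <stddef.h>

struct node { int value; struct node *next; };

/* A data-dependent loop: the trip count depends on the list contents,
 * so the hardware can't know in advance at which iteration the exit
 * branch will finally be taken. */
int list_sum(const struct node *p) {
    int sum = 0;
    while (p) {              /* exit iteration unknown until runtime */
        sum += p->value;
        p = p->next;
    }
    /* The instructions from here on are what the patent's "contingent
     * secondary instructions" would keep queued while the loop runs. */
    return sum;
}
```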

From a user perspective, the technology could support faster performance and lower power consumption, something Apple would definitely be interested in pursuing for its mobile devices.

It's an interesting concept, that's for sure.
 