Intel Atom Z600

Well, I can confirm with my own hands-on tests with System Tuner Pro that the 4 threads in the Galaxy S3 do get used very often: sometimes with all 4 cores running at 200MHz only, to keep things smooth and efficient; sometimes with just 1 core active and the rest power-gated; all 4 cores @ 1.4GHz in 'race to sleep' mode; and every variation in between.
Well, if you have 4 cores running at 200MHz then I bet you wouldn't really notice any difference from a 2-core device running at a somewhat higher frequency. Granted, if you use all 4 cores at a high frequency that's something different, but I still think that has to happen quite rarely.

Well, I'm pretty sure that early manufacturing over at TSMC is not on HKMG, and they have certainly had some serious problems with that process, as Snapdragons have been delayed. Intel's 32nm is very mature now and their HKMG technology is several generations along, so I can't really see a scenario where it wouldn't be better, with the exception of die size.
You're right, the Snapdragon S4 chip appears to be non-HKMG. TSMC offers an HKMG option for all of its 28nm nodes, but apparently Qualcomm opted not to use it for this chip; the statement I found said there was "too much risk" associated with it.

Other quad cores that might be out this year include the Exynos 5450 (quad A15). Tegra 4 is scheduled for Q4?
Tegra 4 is Q1/2013. I doubt the Exynos 5450 (which I forgot about) will really be seen in 2012 though, and even if it is, just like Tegra 4 it looks to be a tablet-only chip.

I'm very sceptical that any of these quad A15s will be seen in a smartphone anytime soon.
 
You might well be right about the A15s; certainly if they do arrive this year it will be Q4. Quad Kraits will be here soon though, and that in my opinion will be the apex of performance for Android.

I have had a good look at the thread scheduling with Exynos, and I can say that when loading up a heavy Flash web page, combined with either other tabs or several things going at once, all 4 cores will spool up to 1.4GHz to load things very fast and keep the system running smoothly.

I mentioned all 4 running at 200MHz to point out the efficiency of it: with loads of small background processes running, instead of thrashing 1 core it will schedule the tasks across 4 cores at minimal voltage, thus saving precious power and leaving plenty of processor headroom to keep things running when you fire up the web browser or a game.

Likewise it can run 3 cores at 500MHz and one at 1GHz+, or just power 3 down and have 1 running at 200MHz if needed. That's efficiency for you.

You will have to have a go on one to really see what I'm talking about. I went into developer options and turned off the animations in TouchWiz, and I have to say it made loading in and out of settings, screens, gallery etc. lightning fast. No amount of opening and closing anything can provoke a significant slowdown; the only slight pause comes when you have opened too many apps and run out of RAM, and processes start to get tombstoned. Other than that, performance is a non-issue.

-As a side note, rumours are circulating that a Korean version will indeed have the full 2GB of RAM. I wish they had shipped that with mine; it really is needed!

-Oh, and the phone boots from cold in 18-20 seconds... that's got to be a record!
 
You might well be right about the A15s; certainly if they do arrive this year it will be Q4. Quad Kraits will be here soon though, and that in my opinion will be the apex of performance for Android.

I have had a good look at the thread scheduling with Exynos, and I can say that when loading up a heavy Flash web page, combined with either other tabs or several things going at once, all 4 cores will spool up to 1.4GHz to load things very fast and keep the system running smoothly.

I'd be curious as to what site you actually loaded. Anand has done CPU utilization charts on Tegra 3 for the browser: up to 3 cores were used, the 3rd one really only clocked up to the minimum frequency, and the second was utilized at about ~25%. I don't doubt that there is some corner case where you can push 4 cores up to full utilization, but I'd argue those cases are not only rare but, in a smartphone, would have to be the result of very inefficient software (Flash).

I mentioned all 4 running at 200MHz to point out the efficiency of it: with loads of small background processes running, instead of thrashing 1 core it will schedule the tasks across 4 cores at minimal voltage, thus saving precious power and leaving plenty of processor headroom to keep things running when you fire up the web browser or a game.

This is not necessarily true. There is significant overhead in waking up a core. Not only do you in essence double the leakage, but power gates take a significant amount of time (on the order of ~100µs) to source supply to a core. Unless you run that core and do some work for a significant amount of time, the overall power cost of waking and power-gating a core will outweigh any benefit you gain.

Not only that, the efficiency of 1 CPU vs 2 only holds at a certain point on the voltage/frequency curve. A core running at 400MHz will likely not need any more voltage than one running at 200MHz. So having 2 A9s split the work that 1 core running at 400MHz could do will actually use more power, due to the leakage of the second core.

If we're talking about running one core at 1.5GHz compared to two at 750MHz, then the story changes, as it requires significantly higher voltage to run an A9 at 1.5GHz than at 750MHz. So there are significant power savings in that scenario.
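To put rough numbers on that, dynamic power scales roughly as f·V², while leakage scales with the number of powered-on cores. A back-of-the-envelope sketch in C (every voltage, the leakage figure and the activity constant below are illustrative assumptions, not measured A9 values):

```c
/* Rough f*V^2 comparison of one fast core vs. two half-speed cores.
 * All numbers are illustrative assumptions, not measured values. */
#include <stdio.h>

/* relative power of n cores at freq f (MHz) and voltage v (volts),
 * plus a fixed per-core leakage term */
static double power(int n, double f_mhz, double v, double leak_w)
{
    const double k = 1e-3; /* arbitrary switching-activity constant */
    return n * (k * f_mhz * v * v + leak_w);
}

int main(void)
{
    const double leak = 0.05; /* assumed leakage per powered-on core, W */

    /* Low end: 200MHz and 400MHz sit at the same minimum voltage,
     * so splitting the work only adds a second core's leakage. */
    printf("1 core  @ 400MHz, 0.9V: %.3f W\n", power(1, 400.0, 0.9, leak));
    printf("2 cores @ 200MHz, 0.9V: %.3f W\n", power(2, 200.0, 0.9, leak));

    /* High end: 1.5GHz needs much more voltage than 750MHz, so two
     * slower cores win despite the extra leakage. */
    printf("1 core  @ 1.5GHz, 1.3V: %.3f W\n", power(1, 1500.0, 1.3, leak));
    printf("2 cores @ 750MHz, 1.0V: %.3f W\n", power(2, 750.0, 1.0, leak));
    return 0;
}
```

With these made-up numbers, 2x200MHz comes out worse than 1x400MHz, while 2x750MHz comes out well ahead of 1x1.5GHz, which is exactly the shape of the argument above.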

Likewise it can run 3 cores at 500MHz and one at 1GHz+, or just power 3 down and have 1 running at 200MHz if needed. That's efficiency for you.

I'd be curious what workload you can generate that loads 3 cores at 500MHz for long enough that the power governor feels the need to wake them up.
 
btw, the S4 Pro parts for smartphones will likely be the dual-core variants. Quad Krait/A15 does not belong in a smartphone.
 
This is not necessarily true. There is significant overhead in waking up a core. Not only do you in essence double the leakage, but power gates take a significant amount of time (on the order of ~100µs) to source supply to a core. Unless you run that core and do some work for a significant amount of time, the overall power cost of waking and power-gating a core will outweigh any benefit you gain.
That is not true anymore; take the 4412 for example:
Another engineer from NVIDIA, where the digital solution of a "4-PLUS-1" architecture is being used to manage power/performance in Tegra-3 application processors, was curious to know how long it takes to switch cores from the inactive to active states. The initial answer: "in the micro-second" range. Once again, the questioner insisted on a more detailed answer "a couple of microseconds, or tens of microseconds", to which Dr. Yang replied "more than a couple".

Additionally, the governor is tuned with very aggressive hotplugging thresholds, for example a minimum frequency of 500MHz and sampling periods of 50ms.
 
That is not true anymore; take the 4412 for example:

That's, again, not detailed enough. ARM provides many "inactive" states similar to x86's C-states. The latency involved in entering and exiting them grows almost exponentially with the depth of the state. At its most basic (C1), wakeup can occur on the order of tens of clocks.

A deep state (C7, or P0 in the ARM world) involves a lot more. There's also a distinction between rail gating and power gating. Needless to say, take public answers to this question with a grain of salt. I can tell you that for at least 2 of the SoCs in modern smartphones, it's not <10µs. Unless Samsung did some really amazing magic with their power grid design (and their regulator) on 32nm HKMG -- and they very well may have -- I'd say he's referring to a C4-style state.

It also varies wildly depending on the core and its MP configuration. It takes a lot to bring up a large core.

Additionally, the governor is tuned with very aggressive hotplugging thresholds, for example a minimum frequency of 500MHz and sampling periods of 50ms.

500MHz is actually quite high, but it does make sense for an A9, as I'd expect that to be around the threshold where multi-core makes sense. My point is that something ridiculously low like 200MHz is not worth waking up a core for. Of course, that depends on leakage levels and what kind of state you're waking up from.
 
I don't know about Tegra 3 as I have not used it (only briefly). I did read the AnandTech article; indeed I linked that article to yourself and Exophase when discussing this very subject.

Tegra 3 has gone a different way about it, in my limited understanding: much less per-core optimisation, with loads of work put onto that shadow core and the whole thing custom-tuned by Nvidia separately from Android to use its resources accordingly. Exynos, similar to Krait, is a 'proper' quad core: no shadow-core nonsense, and scheduled by ICS/TouchWiz (?)

What I can confirm is that all 4 cores are fully utilised at different frequencies depending on the workload. I have witnessed various ranges: 1 or 2 cores active, 4 or 3 cores at one frequency with another core on a separate plane. The only thing I have not seen myself is a workload that pits all 4 cores running completely separately from one another, aka the Krait 4x 720p demo, although that could be because I'm just throwing normal media scenarios at it, which wouldn't require such a processor state.

All 4 cores power up quite regularly in what appears to me like a 'race to sleep' scenario, where peak load is dealt with swiftly and then the cores settle down. Usually the most likely scenario when doing very little is 1 or 2 cores moderately clocked, but all 4 cores do spool up to load pages such as Engadget, plus a few other system strains. It is definitely easier to fire up all 4 cores than what Anand indicated in his review, but that could be down to the limited way in which Tegra can use its cores; I don't know.

I have taken some screenshots of the cores at separate loads/frequencies, but I can't log onto Beyond3D with the phone, grr.
I can also record a process log over, say, a 6-12 hour time frame, but to be honest I'm new to this software. I'll see if I can get at least some screenshots, maybe with the task manager in frame.
 
500MHz is actually quite high, but it does make sense for an A9, as I'd expect that to be around the threshold where multi-core makes sense. My point is that something ridiculously low like 200MHz is not worth waking up a core for. Of course, that depends on leakage levels and what kind of state you're waking up from.
One more thing: I may have misread what you guys mean about the 500MHz limit, but I have definitely seen all 4 cores run at 200MHz. I may have taken a screenshot, as I didn't expect to see that; if I haven't, I may not be able to replicate it again.
 
I don't know about Tegra 3 as I have not used it (only briefly). I did read the AnandTech article; indeed I linked that article to yourself and Exophase when discussing this very subject.

Tegra 3 has gone a different way about it, in my limited understanding: much less per-core optimisation, with loads of work put onto that shadow core and the whole thing custom-tuned by Nvidia separately from Android to use its resources accordingly. Exynos, similar to Krait, is a 'proper' quad core: no shadow-core nonsense, and scheduled by ICS/TouchWiz (?)

The OS schedules the workload no matter what. Tegra 3 transparently swaps to the shadow core only when just 1 core is in use and the requested performance level is below some threshold (IIRC 500MHz). That performance level is entirely up to the OS to indicate, however.

For example, if Android tells the processor "I'll only need one core active at 500MHz", Tegra 3 swaps to the shadow core in the background. How much is loaded in a multi-core scenario is entirely up to the OS.
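So the swap rule can be thought of as something like the following sketch (an illustration of the behaviour as described above, not NVIDIA's actual logic; the 500MHz threshold is the IIRC figure from the post):

```c
/* Illustrative sketch of the Tegra 3 companion-core rule described
 * above -- not NVIDIA's actual code. The OS only ever requests "n
 * cores at frequency f"; the swap to the low-power core is invisible. */
#include <stdio.h>

enum cluster { MAIN_CORES, COMPANION_CORE };

#define COMPANION_MAX_KHZ 500000 /* the IIRC threshold from the post */

static enum cluster pick_cluster(int cores_requested, int freq_khz)
{
    /* only swap to the shadow core for a single slow core; any
     * multi-core request stays on the main quad */
    if (cores_requested == 1 && freq_khz <= COMPANION_MAX_KHZ)
        return COMPANION_CORE;
    return MAIN_CORES;
}

int main(void)
{
    printf("1 core  @ 300MHz -> %s\n",
           pick_cluster(1, 300000) == COMPANION_CORE ? "companion" : "main");
    printf("2 cores @ 300MHz -> %s\n",
           pick_cluster(2, 300000) == COMPANION_CORE ? "companion" : "main");
    return 0;
}
```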

What I can confirm is that all 4 cores are fully utilised at different frequencies depending on the workload. I have witnessed various ranges: 1 or 2 cores active, 4 or 3 cores at one frequency with another core on a separate plane. The only thing I have not seen myself is a workload that pits all 4 cores running completely separately from one another, aka the Krait 4x 720p demo, although that could be because I'm just throwing normal media scenarios at it, which wouldn't require such a processor state.

What I'm asking for is total utilization ratios. Just because all 4 cores are active doesn't mean they're active in a way that is optimal compared to having only 2 cores. More importantly, just because there is some instantaneous point at which they are used in the ideal way you describe doesn't mean that, in aggregate for that one task, the situation comes up often enough to make a noticeable difference.

Hell, for a given task (let's say loading some page), are those 4 cores utilized at, say, 500MHz or above for longer than 10% of the total processing time?

All 4 cores power up quite regularly in what appears to me like a 'race to sleep' scenario, where peak load is dealt with swiftly and then the cores settle down

What is the utilization in those cases? 100% on 4 cores? 50% on 4 cores? Do they go back to sleep afterwards (as in, the task was finished), or was it 4 cores for 50ms and then 1 core for 500ms? This matters because it tells us how often 4 cores are actually utilized in a way that is efficient for 4 cores. Just saying "they wake up" doesn't say much.
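For what it's worth, those aggregate ratios are exactly what Linux already exposes through /proc/stat: per-core busy and idle tick counters. Sampling them before and after a task gives usage over the whole task rather than an instantaneous snapshot. A minimal sketch (standard /proc/stat format assumed; the 5-second sleep stands in for the workload of interest):

```c
/* Aggregate per-core utilization over an interval, from /proc/stat.
 * Sample before and after a task: busy/(busy+idle) per core gives the
 * aggregate figure asked for above, not an instantaneous snapshot. */
#include <ctype.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

#define NUM_CORES 4

static void sample(long long busy[], long long idle[])
{
    FILE *f = fopen("/proc/stat", "r");
    char line[256];

    while (f && fgets(line, sizeof line, f)) {
        int cpu;
        long long u, n, s, id, io, irq, sirq;
        /* per-core lines: "cpuN user nice system idle iowait irq softirq ..."
         * (skip the aggregate "cpu " line; note that hotplugged-off cores
         * drop out of /proc/stat entirely, so their counters stay stale) */
        if (!strncmp(line, "cpu", 3) && isdigit((unsigned char)line[3]) &&
            sscanf(line, "cpu%d %lld %lld %lld %lld %lld %lld %lld",
                   &cpu, &u, &n, &s, &id, &io, &irq, &sirq) == 8 &&
            cpu < NUM_CORES) {
            busy[cpu] = u + n + s + irq + sirq;
            idle[cpu] = id + io;
        }
    }
    if (f)
        fclose(f);
}

int main(void)
{
    long long b0[NUM_CORES] = {0}, i0[NUM_CORES] = {0};
    long long b1[NUM_CORES] = {0}, i1[NUM_CORES] = {0};

    sample(b0, i0);
    sleep(5);                    /* run the workload of interest here */
    sample(b1, i1);

    for (int c = 0; c < NUM_CORES; c++) {
        long long busy = b1[c] - b0[c];
        long long total = busy + (i1[c] - i0[c]);
        printf("cpu%d: %.1f%% utilized over the interval\n",
               c, total > 0 ? 100.0 * busy / total : 0.0);
    }
    return 0;
}
```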

Usually the most likely scenario when doing very little is 1 or 2 cores moderately clocked, but all 4 cores do spool up to load pages such as Engadget, plus a few other system strains. It is definitely easier to fire up all 4 cores than what Anand indicated in his review, but that could be down to the limited way in which Tegra can use its cores; I don't know.

It shouldn't be; it's entirely up to the OS. However, I'd far more willingly believe that Samsung has optimized their browser to be more aggressive in threading.

I have taken some screenshots of the cores at separate loads/frequencies, but I can't log onto Beyond3D with the phone, grr.
I can also record a process log over, say, a 6-12 hour time frame, but to be honest I'm new to this software. I'll see if I can get at least some screenshots, maybe with the task manager in frame.

I'd be curious about the data collected here, particularly since Samsung's browser seems to have been significantly altered compared to the stock Android browser, at least if we're to believe the SunSpider and BrowserMark scores of the GS3.

However, again, just showing some instantaneous moment where "look, 4 cores are used" doesn't really say much. Aggregate usage over the time it took to complete the task is what matters.
 
How annoying. I have pulled a load of screenshots off, saved them onto my email as an attachment, and unzipped them onto my netbook, and now I'm stuck, as I can't attach anything and it won't copy and paste... grrr.

I could email them to you?...
 
That's, again, not detailed enough. ARM provides many "inactive" states similar to x86's C-states. The latency involved in entering and exiting them grows almost exponentially with the depth of the state. At its most basic (C1), wakeup can occur on the order of tens of clocks.

A deep state (C7, or P0 in the ARM world) involves a lot more. There's also a distinction between rail gating and power gating. Needless to say, take public answers to this question with a grain of salt. I can tell you that for at least 2 of the SoCs in modern smartphones, it's not <10µs. Unless Samsung did some really amazing magic with their power grid design (and their regulator) on 32nm HKMG -- and they very well may have -- I'd say he's referring to a C4-style state.
Samsung doesn't have any traditional C-states; there are only 3 options: AFTR, which is mysterious in what it does and disabled on the 4412; LPA, which is core clock gating and only works when the screen is off; and complete core power gating/hotplugging. Nothing else.

The 500MHz threshold is the frequency condition at which a core is supposed to kick online from its offline state; once online it can go back down to a lower frequency, and the kick-off frequency is 200MHz, which is the minimum frequency on stock. The hotplugging is actually fairly complex, based on thread runqueues: it injects a monitoring thread into the runqueue of each CPU, measures the time spent in the queue, and compares that against a threshold queue with different values for each core jump. Here's the governor with all the scaling logic of the 4412; I think it is very well thought through.
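As a rough illustration of that scheme, here is a simplified sketch reconstructed from the description above (not Samsung's actual 4412 governor code; the kick-on/kick-off/sampling figures are the ones quoted in this thread, and the runqueue ladder values are invented for illustration):

```c
/* Simplified sketch of a runqueue-based hotplug governor like the one
 * described above -- NOT Samsung's actual 4412 code. The 500MHz
 * kick-on, 200MHz kick-off and 50ms sampling figures are from this
 * thread; the runqueue ladder values are invented for illustration. */
#include <stdio.h>

#define SAMPLE_MS        50
#define KICK_ON_KHZ  500000 /* consider waking a core above this */
#define KICK_OFF_KHZ 200000 /* consider unplugging below this    */

struct core_state {
    int cur_khz;   /* current frequency of the online cores      */
    int rq_depth;  /* runqueue depth seen by a monitoring thread  */
};

/* deeper runqueues are required to justify each additional core */
static const int rq_threshold[5] = { 0, 0, 2, 3, 4 };

static int target_online(const struct core_state *c, int online_now)
{
    if (online_now < 4 && c->cur_khz >= KICK_ON_KHZ &&
        c->rq_depth >= rq_threshold[online_now + 1])
        return online_now + 1; /* busy and queued up: wake one more */

    if (online_now > 1 && c->cur_khz <= KICK_OFF_KHZ)
        return online_now - 1; /* demand at the floor: drop a core */

    return online_now; /* re-evaluated every SAMPLE_MS */
}

int main(void)
{
    struct core_state c = { .cur_khz = 800000, .rq_depth = 3 };
    printf("2 cores online, busy -> target %d\n", target_online(&c, 2));
    c.cur_khz = 150000;
    printf("2 cores online, idle -> target %d\n", target_online(&c, 2));
    return 0;
}
```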

What I can confirm is that all 4 cores are fully utilised at different frequencies depending on the workload. I have witnessed various ranges: 1 or 2 cores active, 4 or 3 cores at one frequency with another core on a separate plane. The only thing I have not seen myself is a workload that pits all 4 cores running completely separately from one another, aka the Krait 4x 720p demo, although that could be because I'm just throwing normal media scenarios at it, which wouldn't require such a processor state.
What are you using for monitoring this?
 
I'm using System Tuner Pro; there is a free version on the Play Store with near-identical features if you want to check it out. It's the best one-stop piece of task manager/process checker/phone management software I have come across: you can monitor all 4 cores, overall CPU load (I haven't found a way to monitor each thread in depth like Metafor would like), and the usual RAM resources, right down to every system process, including the kernel.

You can record a log over a time period - but I have not got the hang of that yet. You have options to change the governor and set CPU frequencies manually (root). The only thing missing that I can see is some rooted GPU control; other than that, this is the single best monitoring software on the Play Store IMO. (The widget is also very, very good.)

Yeah, I see what you mean about the 500MHz: when the cores wake up they shoot to 500MHz, and from there they can settle to any frequency in 100MHz steps.
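For anyone wanting to reproduce these observations without an app: a tool like System Tuner Pro is essentially polling the kernel's cpufreq/hotplug sysfs entries. A minimal sketch of that polling (assuming the standard Linux sysfs paths and a 4-core device; running it on Android needs shell/adb access):

```c
/* Minimal per-core frequency/online poller, a sketch of what a
 * monitoring app does under the hood. Assumes standard Linux cpufreq
 * sysfs paths. */
#include <stdio.h>

#define NUM_CORES 4 /* Exynos 4412: 4 Cortex-A9 cores */

static int read_long(const char *path, long *out)
{
    FILE *f = fopen(path, "r");
    if (!f)
        return -1; /* offline cores may have no cpufreq directory */
    int ok = fscanf(f, "%ld", out) == 1;
    fclose(f);
    return ok ? 0 : -1;
}

int main(void)
{
    char path[128];
    for (int cpu = 0; cpu < NUM_CORES; cpu++) {
        long online = 1, khz = 0;

        /* cpu0 is always online and usually has no "online" file */
        snprintf(path, sizeof path,
                 "/sys/devices/system/cpu/cpu%d/online", cpu);
        read_long(path, &online);

        snprintf(path, sizeof path,
                 "/sys/devices/system/cpu/cpu%d/cpufreq/scaling_cur_freq",
                 cpu);
        if (online && read_long(path, &khz) == 0)
            printf("cpu%d: online, %ld MHz\n", cpu, khz / 1000);
        else
            printf("cpu%d: offline (power-gated)\n", cpu);
    }
    return 0;
}
```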
 
The point I was trying to make was that Intel is already on the most advanced process node with Medfield,

???

Intel's latest node is 22nm, and Medfield came out as Intel was moving its main line from 32nm to 22nm, i.e. Medfield is at the tail end of the 32nm cycle. That puts Medfield almost 2 nodes behind where Intel could have it, which is at the heart of Intel's plans to accelerate node adoption for their handheld SoCs.
 
Yes, I know, but you are not getting what I am saying: forget 22nm for a moment, their 32nm is already the best process. For a process that mature, Medfield doesn't provide either the performance or the battery life that it should.

Of course a new design on 22nm will be the real deal, but they need that technology to enable multi-core Atoms. They don't have it yet, so instead they make false statements to discredit technology that they can't provide, is all I'm saying.
 
Yes, I know, but you are not getting what I am saying: forget 22nm for a moment, their 32nm is already the best process. For a process that mature, Medfield doesn't provide either the performance or the battery life that it should.

Of course a new design on 22nm will be the real deal, but they need that technology to enable multi-core Atoms. They don't have it yet, so instead they make false statements to discredit technology that they can't provide, is all I'm saying.

That's not what you said; you said Medfield was on their most advanced node, which it clearly is not.
Also, if it had come out at the START of their 32nm process, it would likely have been on a par with anything else available in the same timeframe.

It appears that your new argument is that at any process point, Intel's SoCs are not in the same class as ARM SoCs. That is likely true. However, Intel would argue that this is irrelevant: if they can produce SoCs that are class-equalling or class-leading, then it doesn't matter how they do it (whether it's process or architecture). And the surprisingly decent performance of Medfield is evidence that they may well be able to produce class-leading SoCs in 12-18 months' time, something many would have suggested was impossible 2 years ago. One wonders whether the architectural improvements of the redesigned Atom (is it called Airmont?) will bring minor or significant efficiency benefits.
 
Sorry, but you have misread the post of mine that you are referring to.

I clearly said THE best process, meaning the best process around in smartphone SoCs right now; this was in the context of Medfield and its current competitors this year. Silvermont is the new design on 22nm and is 12-18 months away, although you're right that that will be the real deal.

The reason why I mention Medfield's process is that its performance - whilst better than almost everyone thought - is not competitive with 2012 designs outside of single-threaded SunSpider. Remember the whole package is built on that class-leading 32nm, including baseband and GPU.

The GPU is weak and it has no LTE; despite that, battery life is only comparable to a 2011 Exynos 4210 @ 45nm, meaning, as I read it, that the single Saltwell core is consuming the power budget disproportionately.

Yes, we all look forward to Silvermont, but in 18 months ARM won't be stagnant either: technologies like big.LITTLE, 20nm and FD-SOI will be in play or not far off, not to forget that ARMv8 is on the horizon, which will be revolutionary.
 
Sorry, but you have misread the post of mine that you are referring to.

I clearly said THE best process, meaning the best process around in smartphone SoCs right now; this was in the context of Medfield and its current competitors this year. Silvermont is the new design on 22nm and is 12-18 months away, although you're right that that will be the real deal.

Indeed, but the only reason your statement is true is that Intel decided the mobile SoC market was not important. Clearly, if their priorities had been different, enough resources could have been put into SoC development that a "Medfield" could have launched in mid-2010, when their main line went to 32nm. Such a chip would have been class-equalling, and in terms of graphics it would have been class-leading (Samsung was already using the same graphics core in its S5PC110 chip, but only at 200MHz due to the 45nm process). Instead they quietly launched their 2-chip, 45nm Moorestown with redundant DX compliance in the graphics core, and shortly after that fired the VP in charge of the handheld division.

In other words, they had 90% of the current IP in 2010 (identical CPU, identical GPU core, and identical video decode to what is in use today), but on purely business grounds decided not to put that IP on 32nm in 2010, likely because they were not really interested in the market segment, could not figure out how to make money out of it, and/or had some uncertainty as to which OS to aim at.

In contrast, I think the ARM SoC developers have (correctly) been making the best use of all the tech and the best process available in any given development timeslot.

Yes, we all look forward to Silvermont, but in 18 months ARM won't be stagnant either: technologies like big.LITTLE, 20nm and FD-SOI will be in play or not far off, not to forget that ARMv8 is on the horizon, which will be revolutionary.

Indeed, at which point, assuming Intel sticks to its public roadmap, their handheld SoCs will be launching on the latest process, and we'll be able to compare the best of both worlds when both are trying their best.
 
Samsung doesn't have any traditional C-states; there are only 3 options: AFTR, which is mysterious in what it does and disabled on the 4412; LPA, which is core clock gating and only works when the screen is off; and complete core power gating/hotplugging. Nothing else.

Core power states are architecturally required. For the Cortex-A9, there are 3 that are relevant: standby, dormant, and shutdown.

The big difference between dormant and shutdown is that in dormant the SRAM arrays are held in a retention state. This means that the L1 cache doesn't need to be flushed, and the processor state can be stored in L1 instead of needing to be written back to memory. This is a significant part of the core power-cycle latency.

The 500MHz threshold is the frequency condition at which a core is supposed to kick online from its offline state; once online it can go back down to a lower frequency, and the kick-off frequency is 200MHz, which is the minimum frequency on stock. The hotplugging is actually fairly complex, based on thread runqueues: it injects a monitoring thread into the runqueue of each CPU, measures the time spent in the queue, and compares that against a threshold queue with different values for each core jump. Here's the governor with all the scaling logic of the 4412; I think it is very well thought through.

It looks like they're taking a runtime average of the frequency and only kicking a core off when that average is below 200MHz. Clever, but that can result in a core being up for much longer than it needs to be. I assume the history length is set to something reasonable?
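The lag that averaging introduces is easy to see in a toy model. A sketch of the kick-off rule as described (HISTORY_LEN stands in for the "history length" being asked about; all values are illustrative):

```c
/* Toy model of the kick-off rule described above: keep a sliding
 * window of frequency samples and only unplug when the AVERAGE drops
 * below the 200MHz floor. HISTORY_LEN stands in for the "history
 * length" asked about; all values are illustrative. */
#include <stdio.h>

#define HISTORY_LEN  8
#define KICK_OFF_KHZ 200000

static int freq_hist[HISTORY_LEN];
static int hist_pos;

static int should_kick_off(int cur_khz)
{
    long sum = 0;

    freq_hist[hist_pos] = cur_khz;
    hist_pos = (hist_pos + 1) % HISTORY_LEN;
    for (int i = 0; i < HISTORY_LEN; i++)
        sum += freq_hist[i];
    return sum / HISTORY_LEN <= KICK_OFF_KHZ;
}

int main(void)
{
    /* start with a busy history, then let demand collapse at sample 3 */
    int demand[12] = { 1400000, 1400000, 1400000, 100000, 100000, 100000,
                       100000,  100000,  100000,  100000, 100000, 100000 };

    for (int i = 0; i < HISTORY_LEN; i++)
        freq_hist[i] = 1400000;

    for (int t = 0; t < 12; t++)
        printf("sample %2d: freq %7d kHz, kick off core? %s\n",
               t, demand[t], should_kick_off(demand[t]) ? "yes" : "no");
    return 0;
}
```

With a 50ms sample period, demand in this toy run collapses at sample 3 but the core is only flagged for kick-off at sample 10, roughly 350ms later, which is exactly the "up for longer than it needs to be" concern.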


How annoying. I have pulled a load of screenshots off, saved them onto my email as an attachment, and unzipped them onto my netbook, and now I'm stuck, as I can't attach anything and it won't copy and paste... grrr.

I could email them to you?...

I'd rather not give that out. You can use a file hosting service like Dropbox or DepositFiles.
 
Fair enough. Just as well, as I tried emailing them to another member and they wouldn't send. I have a new Dropbox account but I'm unsure how it works; PM me your Dropbox details and how to send, and I will.
 
You can't take an Intel CPU in isolation.
Something Intel really does have is FPU/SIMD, for example.
It's funny how an irrelevant benchmark becomes relevant depending on who wins.
Medfield is tiny, barely bigger than a Tegra 2, on a 32nm process.

This whole 'Intel sucks, x86 sucks' thing bores me; I'm leaving the discussion, it's pointless.
Too bad you left; you'd have learned that Atom derivatives are not that good at FPU/SIMD compared to other x86 parts. I ran Linpack (as compiled by gcc) on both a Tegra 2 and an N270; with both running at 1GHz, I got 100 MFlop/s on the Atom and 134 on the Tegra 2.
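For reference, the MFlop/s figure in a Linpack-style run is just floating-point operations counted against wall-clock time. A minimal sketch of that measurement (a generic daxpy-style loop, not the actual Linpack source; 1 multiply + 1 add = 2 flops per element):

```c
/* Minimal MFlop/s measurement in the spirit of Linpack's daxpy inner
 * loop. A generic sketch, not the actual Linpack source; build with
 * e.g. "gcc -O2". */
#include <stdio.h>
#include <time.h>

#define N     100000
#define ITERS 1000

double x[N], y[N];

int main(void)
{
    for (int i = 0; i < N; i++) {
        x[i] = 1.0;
        y[i] = 2.0;
    }

    clock_t t0 = clock();
    for (int it = 0; it < ITERS; it++)
        for (int i = 0; i < N; i++)
            y[i] += 3.14159 * x[i];       /* 2 flops per element */
    double secs = (double)(clock() - t0) / CLOCKS_PER_SEC;

    /* print y[0] so the compiler can't discard the loop */
    printf("%.1f MFlop/s (y[0]=%g)\n",
           2.0 * N * ITERS / secs / 1e6, y[0]);
    return 0;
}
```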
 