Intel Atom Z600

Too bad you left, you'd have learned that Atom derivatives are not that good at FPU/SIMD compared to other x86. I ran Linpack (as compiled by gcc) on both T2 and N270; when both run at 1 GHz I got 100 MFlop/s on the Atom and 134 on Tegra2.

If you look at Anand's HTC One X AT&T article, he runs Linpack single- and multithreaded.

Shipping Medfield pulls around 90 MFLOPS; since it only has Hyper-Threading, it posts roughly the same result multithreaded, whereas Krait scales nearly linearly with 2 full cores for around 210 MFLOPS.

This may be an area mobile Atom struggles with going forward if they stick with HT; 4 Krait cores will be hitting 1 GFLOPS.

Although with onboard GPUs, I'm not sure how much of an advantage that would be.
 
Too bad you left, you'd have learned that Atom derivatives are not that good at FPU/SIMD compared to other x86. I ran Linpack (as compiled by gcc) on both T2 and N270; when both run at 1 GHz I got 100 MFlop/s on the Atom and 134 on Tegra2.
I don't think the SIMD unit is that weak on Atom; after all, it has a physically 128-bit-wide unit (except the multiplier), which is more than what you got with Athlon 64 before Phenom...
In some areas, though, it is indeed very weak: divisions, for instance, take forever (and block everything), non-SSE x87 code suffers compared to SSE scalar code, and probably most notably packed doubles are a pure non-pipelined disaster (avoid them if you want performance; always use scalar code).
If you used a single-precision Linpack benchmark, the results might be more reasonable (just a guess...). For doubles the SIMD/FPU unit is weak, no question about it, but otherwise it might not be that bad.
 
Atom derivatives are not that good at FPU/SIMD compared to other x86.
Atom can actually execute two 128 bit SIMD instructions per cycle (peak). x87 FPU is a disaster, and you should instead just use single lane SIMD and get much better performance out of it. Doubles are bad no matter where you run them (but why use doubles on mobile phone software in the first place?).

Atom is much better than competing architectures in SIMD. Cortex A9s run SIMD at half rate (64 bit per cycle), and so does Bobcat.
 
Atom can actually execute two 128 bit SIMD instructions per cycle (peak). x87 FPU is a disaster, and you should instead just use single lane SIMD and get much better performance out of it. Doubles are bad no matter where you run them (but why use doubles on mobile phone software in the first place?).

Atom is much better than competing architectures in SIMD. Cortex A9s run SIMD at half rate (64 bit per cycle), and so does Bobcat.

Unfortunately, the JIT likes doubles. It also loves divides when it could easily multiply by 0.5. Given that the Linpack benchmark runs on the JIT and that its working data set is larger than most L2 caches in the mobile space and you can pretty much forget it as a comparison of FP performance.
 
Fair enough. Just as well, as I tried emailing them to another member and they wouldn't send. I have a new Dropbox account but I'm unsure how it works; PM me your Dropbox details and how to send, and I will.

Dropbox works by you uploading something to your account and then pasting the public link. IIRC, you can't upload to someone else's dropbox. Your GS3 should've even come with it integrated (with 50GB of free storage for buying the GS3).
 
Dropbox works by you uploading something to your account and then pasting the public link. IIRC, you can't upload to someone else's dropbox. Your GS3 should've even come with it integrated (with 50GB of free storage for buying the GS3).

Well, for some reason when I signed up I only got 2.5GB? I'll have to have a word...

http://db.tt/zJeOJAzp

There's my Dropbox link to the screenshots; not sure what useful information you can get from that, but it's interesting nonetheless.
 
Unfortunately, the JIT likes doubles. It also loves divides when it could easily multiply by 0.5. Given that the Linpack benchmark runs on the JIT and that its working data set is larger than most L2 caches in the mobile space and you can pretty much forget it as a comparison of FP performance.
And that's the thing I have never understood in mobile phone design. Why would you want to run your software on JIT instead of running native optimized code? Battery sizes are limited, processing capacity is limited and memory bus is limited (doubles take 8 bytes each). Why would you want to waste resources on a platform like this?

Apple understood this problem, and their phones run very well on 1.0 GHz processors (4 = single core, 4S = dual core; both are smooth as butter). Recent Android phones are quad core, and clocks are at 1.5 GHz already, but the software still isn't perfectly smooth. Androids have twice as much memory as well (garbage-collected higher-level languages have poor peak memory usage). These higher-level managed languages are great for PCs that are plugged into the wall and have a huge amount of untapped performance, but I don't see the point in running this kind of unoptimized code on devices that are size-, battery- and performance-constrained.
 
I'd be curious as to what site you actually loaded. Anand has done CPU utilization charts on Tegra 3 for the browser. Up to 3 cores were used and the 3rd one really only clocked up to minimum frequency. The second one was utilized to about ~25%. I don't doubt that there is some corner case that you can push 4 cores up to full utilization. But I'd argue those cases are not only rare but in a smartphone, would have to be the result of very inefficient software (Flash).

I was under the impression that Tegra 3's cores were synchronously clocked. nVidia at least seemed to be arguing in favor of this. If this is true then you'd be a lot more conservative about turning on those cores.

Atom can actually execute two 128 bit SIMD instructions per cycle (peak). x87 FPU is a disaster, and you should instead just use single lane SIMD and get much better performance out of it. Doubles are bad no matter where you run them (but why use doubles on mobile phone software in the first place?).

Atom is much better than competing architectures in SIMD. Cortex A9s run SIMD at half rate (64 bit per cycle), and so does Bobcat.

Sure, at its core Atom has quite respectable execution resources for SIMD, although the competition isn't as dire as you make it sound: NEON on Cortex-A8/A9 can execute most (non-multiply) integer operations at 128-bit in parallel with a 128-bit load/store/permute. In my mind the real limitation is the ISA, where being stuck with x86-32 SSSE3 (on Medfield) is a real hindrance for typical SSE code with in-order execution. I'd much rather take NEON's 16x128/32x64 register layout with three-address execution if forced in-order. There are a number of other ISA advantages as well, IMO, in addition to some disadvantages of course. Atom does lack some of the more fundamental/crippling latency problems of NEON on A8/A9, though. But unlike A8/A9 it's really awful at 64-bit integer SIMD.
 
And that's the thing I have never understood in mobile phone design. Why would you want to run your software on JIT instead of running native optimized code? Battery sizes are limited, processing capacity is limited and memory bus is limited (doubles take 8 bytes each). Why would you want to waste resources on a platform like this?

Apple understood this problem, and their phones run very well on 1.0 GHz processors (4 = single core, 4S = dual core; both are smooth as butter). Recent Android phones are quad core, and clocks are at 1.5 GHz already, but the software still isn't perfectly smooth. Androids have twice as much memory as well (garbage-collected higher-level languages have poor peak memory usage). These higher-level managed languages are great for PCs that are plugged into the wall and have a huge amount of untapped performance, but I don't see the point in running this kind of unoptimized code on devices that are size-, battery- and performance-constrained.

I don't think the "smoothness" of the UI is really down to Dalvik. That's probably indicative of some other design problems...

Performance-critical code can (and very often does) use the NDK. You probably still pay some price for the Java glue going through the Android interfaces, but I doubt it's that bad.

However, when Android was released there was no NDK, and there wasn't even a JIT for Dalvik, leading me to pose the same questions you are about it. Nonetheless, if it weren't for Dalvik, Intel and MIPS vendors would probably have a harder time pushing phones than they are now.
 
Atom can actually execute two 128 bit SIMD instructions per cycle (peak).
My understanding is that this applies only to integer SIMD instructions (and not all of them). That doesn't matter for Linpack, which needs FP multiplication, which isn't even fully pipelined on Atom.

From Agner Fog microarchitecture manual:
The four units ALU0, ALU1, FP0 and FP1 probably have one integer ALU each, though it
cannot be ruled out that there are only two integer ALUs, which are shared between ALU0
and FP0 and between ALU1 and FP1, respectively. There is one multiply unit which is
shared between ALU0 and FP0, and one division unit shared between ALU1 and FP1.

The SIMD integer adders and shift units have full 128-bit widths and a one clock latency.
The floating point adder has full 128-bit capability for single precision vectors, but only 64-bit
capability for double precision. The multiplier and the divider are 64-bits wide.

The floating point adder has a latency of 5 clock cycles and is fully pipelined to give a
throughput of one single precision vector addition per clock cycle. The multiplier is partially
pipelined with a latency of 4 clocks and a throughput of one single precision multiplication
per clock cycle. Double precision and integer multiplications have longer latencies and a
lower throughput. The time from one multiplication starts till the next multiplication can start
varies from 1 clock cycle in the most favorable cases, to 2 clock cycles or more in less
favorable cases. Double precision vector multiplication and some integer multiplications
cannot overlap in time.

Even the 64-bit-wide NEON unit in A9 can issue 2 single-precision MACs per cycle. So I insist that the Atom FPU/SIMD unit stinks :)
 
Yes, Android is a resource hog and a bloated mess; you guys would know more about why than me, but I will say even this Galaxy S3 can be coaxed into the occasional stutter or slowdown, as unreal as that sounds considering the bandwidth, processing, GPU, RAM and software it runs.
If this hardware and 4 years of optimisation can't do it, then they may want to scrap it and start from scratch, or at least ban UI skinning.

The amount of hidden, mysterious background processes running at any one time, completely unexplained, unstoppable, and having access to all your data is not just a resource hog; it's worrying.
- Why is it that every time I install any app, I have to consent to sharing my data at any time, even to the extent of 'recording phone calls'? Bizarre...

iOS 6 no doubt runs like butter, but it's boring, limited and looks dated.

Windows Phone 8, for me, will be the best mobile operating system for non-geeks. Intuitive, lightning fast, resource-friendly, really good-looking and modern.

As long as they pack in decent SoC support (a dual-core S4 Pro) and link the Metro app store across all Windows platforms, that is going to be real competition IMO.

When comparing SIMD engines, why are you guys comparing Atom to A9? I was under the impression Krait holds the crown...
 
Even the 64-bit wide NEON unit in A9 can issue 2 single precision mac per cycle. So I insist that Atom FPU/SIMD unit stinks :)

Sure, but with ~9 cycle latency and a bunch of hazards (some which are sort of documented, others which are not) it's really hard to actually sustain this throughput.

You're misreading Agner's comments. When he says "one single precision output" he means packed single precision, or four scalars. Look at his instruction timing tables, addps has a throughput of one per cycle and mulps one per two cycles, although there are apparently cases where it can achieve one per cycle (but he doesn't elaborate on this). Furthermore, FP adds and multiplies can be co-issued. You can see this allows a higher theoretical FMADD throughput even without an FMADD instruction. Cortex-A9 doesn't even have A8's NEON co-issuing capabilities, which is a major drawback for hand optimized NEON code.

Atom's SIMD is really pretty good within the limitations of its ISA. Perhaps a little too good, since its utilization is going to be pretty low in Android apps.
 
I was under the impression that Tegra 3's cores were synchronously clocked. nVidia at least seemed to be arguing in favor of this. If this is true then you'd be a lot more conservative about turning on those cores.

All 4 cores run at the same clock, yes. However, when CPU utilization is low -- and the OS schedules it in batches -- the clock root can be opportunistically shut off. For most mobile CPUs, this takes only a few cycles (the C1 state in the x86 world). This can happen automatically via the WFI/WFE instruction that is typically used when a thread goes to sleep.

It can optionally be used when there is (in the case of mobile SoCs) a very-long-latency memory read and the pipeline is backed up waiting for that load.

This effectively accomplishes much of the same advantages as downclocking. Though obviously, all else being equal, the higher-clocked CPU will still consume more power. Just not as much as you'd think.


And that's the thing I have never understood in mobile phone design. Why would you want to run your software on JIT instead of running native optimized code? Battery sizes are limited, processing capacity is limited and memory bus is limited (doubles take 8 bytes each). Why would you want to waste resources on a platform like this?

Android wasn't created with such a specific aim in mind. Remember that this was in the days of Palm and PDAs. Most of them actually ran Java ME. It made development easy -- this was before the age of app stores -- and the choice of hardware flexible.

It wasn't until iOS that UI "smoothness" was even a factor in most manufacturers' minds. Android originally wasn't even supposed to be touch-based.

While it is a perception thing, I have to wonder how important being absolutely stutter-free is. I've played around with the One S and there is nothing about the UI that I find lacking in terms of use. Yes, you can push it to the point where a scroll list may skip a frame, but I think we've gotten to the point where it's not really a hindrance to usability.

As for efficiency and battery life, the meager processing power required for most applications -- even with the JIT -- is absolutely dwarfed by the amount consumed by the screen, Wi-Fi, GPS and cell radios.
 
metafor said:
While it is a perception thing, I have to wonder how important being absolutely stutter-free is. I've played around with the One S and there is nothing about the UI that I find lacking in terms of use. Yes, you can push it to the point where a scroll list may skip a frame, but I think we've gotten to the point where it's not really a hindrance to usability.
It's mostly a matter of the delight factor. Nothing will prevent you from reading emails better on one or the other, but there's nothing wrong with finding these little flaws distracting if you're used to better. With the One X & friends that I've tried, it seems to have come to the point where it's pretty good now. It only took 5 years. ;)


As for efficiency and battery life, the meager processing power required for most applications -- even with the JIT -- is absolutely dwarfed by the amount consumed by the screen, Wi-Fi, GPS and cell radios.
 
Well, for some reason when I signed up I only got 2.5GB? I'll have to have a word...

http://db.tt/zJeOJAzp

There's my Dropbox link to the screenshots; not sure what useful information you can get from that, but it's interesting nonetheless.

So I finally got some time to look at this. What's odd is that Exynos clocks each CPU at similar frequencies (except for core 0, for some reason), but instantaneous CPU utilization is still ~24%. Did Samsung go with symmetrical MP à la nVidia?

I assume the 4 bars are OS utilization of each core. Correct me if I'm wrong, but it looks like one core does about twice as much total work as the second most used. A third core looks like it does again half that amount, and the 4th half of that.
 
Well, I'm not as informed as yourself; however, I would say that apart from core 1 and core 4, which seem to be able to be power-gated off or clocked independently of the rest (I think), any other scenario seems to be groupings of 2 or 3 cores at the same frequency.

I don't think it is as limited as nVidia's setup; on the other hand, I don't think it's quite as optimised as Qualcomm's.

Another prominent member has suggested that maybe they are sharing 2 voltage rails; perhaps that's correct, although I may have seen 3 clock at different frequencies. Can't be sure though. (Unlikely.)
 
You're misreading Agner's comments. When he says "one single precision output" he means packed single precision, or four scalars. Look at his instruction timing tables, addps has a throughput of one per cycle and mulps one per two cycles, although there are apparently cases where it can achieve one per cycle (but he doesn't elaborate on this). Furthermore, FP adds and multiplies can be co-issued. You can see this allows a higher theoretical FMADD throughput even without an FMADD instruction.
I indeed misread it; thanks for correcting my mistake. I'll try to rerun Linpack, but unless gcc has improved, I'm afraid it won't properly use Atom SIMD instructions (though in the last few years it seems some Intel employees have pushed some Atom-specific optimizations).
 
I indeed misread it; thanks for correcting my mistake. I'll try to rerun Linpack, but unless gcc has improved, I'm afraid it won't properly use Atom SIMD instructions (though in the last few years it seems some Intel employees have pushed some Atom-specific optimizations).
Can you get single-precision numbers? Obviously they aren't comparable, but I'm curious how much faster Atom would be...
 