Intel Atom Z600

Discussion in 'Mobile Devices and SoCs' started by liolio, May 6, 2010.

  1. french toast

    french toast Veteran

    If you look at Anand's HTC One X AT&T article, he does Linpack single- and multithreaded.

    Shipping Medfield pulls around 90 MFLOPS, and since it only has Hyper-Threading it pulls roughly the same result multithreaded, whereas Krait scales near-linearly with 2 full cores for around 210 MFLOPS.

    This may be an area mobile Atom struggles with going forward if they continue with HT; 4 Krait cores will be hitting 1 GFLOPS.

    Although with on-board GPUs, I'm not sure how much of an advantage that would be.
     
  2. mczak

    mczak Veteran

    I don't think the SIMD unit is that weak on Atom; after all, it has a physically 128-bit wide unit (except the multiplier), which is more than what you got with the Athlon 64 before Phenom...
    In some areas, though, it is indeed very weak: divisions, for instance, take forever (and block everything), non-SSE FPU code suffers compared to SSE scalar code, and probably most notably, packed doubles are a pure non-pipelined disaster (avoid them if you want performance; always use scalar code).
    If you used a single-precision Linpack benchmark, results might be more reasonable (just a guess...). For doubles the SIMD/FPU unit is weak, no question about it, but otherwise it might not be that bad.
     
  3. sebbbi

    sebbbi Veteran

    Atom can actually execute two 128-bit SIMD instructions per cycle (peak). The x87 FPU is a disaster; you should instead just use single-lane SIMD and get much better performance out of it. Doubles are bad no matter where you run them (but why use doubles in mobile phone software in the first place?).

    Atom is much better than competing architectures at SIMD. Cortex-A9s run SIMD at half rate (64 bits per cycle), and so does Bobcat.
     
  4. metafor

    metafor Regular

    Unfortunately, the JIT likes doubles. It also loves divides when it could easily multiply by 0.5. Given that the Linpack benchmark runs on the JIT, and that its working data set is larger than most L2 caches in the mobile space, you can pretty much forget it as a comparison of FP performance.
     
  5. metafor

    metafor Regular

    Dropbox works by you uploading something to your account and then pasting the public link. IIRC, you can't upload to someone else's Dropbox. Your GS3 should even have come with it integrated (with 50GB of free storage for buying the GS3).
     
  6. french toast

    french toast Veteran

    Well, for some reason when I signed up I only got 2.5GB? I'll have to have a word...

    http://db.tt/zJeOJAzp

    There's my Dropbox link to the screenshots; what useful information you can get from that I'm not sure, but it's interesting nonetheless.
     
  7. sebbbi

    sebbbi Veteran

    And that's the thing I have never understood in mobile phone design. Why would you want to run your software on a JIT instead of running natively optimized code? Battery sizes are limited, processing capacity is limited and the memory bus is limited (doubles take 8 bytes each). Why would you want to waste resources on a platform like this?

    Apple understood this problem, and their phones run very well on 1.0 GHz processors (iPhone 4 = single core, 4S = dual core, both are smooth as butter). Recent Android phones are quad core, and clocks are at 1.5 GHz already, but the software still isn't perfectly smooth. Androids have twice as much memory as well (garbage-collected higher-level languages have poor peak memory usage). These higher-level managed languages are great for PCs that are plugged into the wall and have a huge amount of untapped performance, but I don't see the point of running this kind of unoptimized code on devices that are size, battery and performance constrained.
     
  8. Exophase

    Exophase Veteran

    I was under the impression that Tegra 3's cores were synchronously clocked. nVidia at least seemed to be arguing in favor of this. If this is true then you'd be a lot more conservative about turning on those cores.

    Sure, at its core Atom has quite respectable execution resources for SIMD - although it's not as dire as you make it sound, NEON on Cortex-A8/A9 can execute most (non-multiply) integer operations at 128-bit in parallel with a 128-bit load/store/permute. In my mind the real limitation is in ISA, where being stuck with x86-32 SSSE3 (on Medfield) is a real hindrance for typical SSE code with in-order execution. I'd much rather take the 16x128/32x64 register layout with three-address execution in NEON if forced in-order. There are a number of other ISA advantages as well, IMO, in addition to some disadvantages of course. Atom does lack some of the more fundamental/crippling latency problems of NEON on A8/A9, though. But unlike A8/A9 it's really awful at 64-bit integer SIMD.
     
  9. Exophase

    Exophase Veteran

    I don't think the "smoothness" of the UI is really down to Dalvik. That's probably indicative of some other design problems...

    Performance critical code can (and very often does) use NDK. You probably still pay some price for the Java glue going to the Android interfaces but I doubt it's that bad.

    However, when Android was released there was no NDK, and there wasn't even a JIT for Dalvik, which led me to pose the same questions you are posing now. Nonetheless, if it weren't for Dalvik, Intel and the MIPS vendors would probably have a harder time pushing phones than they do now.
     
  10. Laurent06

    Laurent06 Veteran

    My understanding is that this applies only to integer SIMD instructions (and not all of them). That doesn't matter for Linpack, which needs FP multiplication, which isn't even fully pipelined on Atom.

    From Agner Fog's microarchitecture manual:
    Even the 64-bit wide NEON unit in the A9 can issue 2 single-precision MACs per cycle. So I insist that Atom's FPU/SIMD unit stinks :)
     
  11. french toast

    french toast Veteran

    Yes, Android is a resource hog and a bloated mess; you guys would know more about why than me, but I will say even this Galaxy S3 can be coaxed into the occasional stutter or slowdown, as unreal as that sounds considering the bandwidth, processing, GPU, RAM and software it runs.
    If this hardware and 4 years of optimisation can't do it, then they may want to scrap it and start from scratch, or at least ban UI skinning.

    The number of hidden, mysterious background processes running at any one time, completely unexplained, unstoppable, and with access to all your data, is not just a resource hog; it's worrying.
    Why is it that every time I install any app, I have to consent to sharing my data at any time, even to the extent of 'recording phone calls'? Bizarre...

    iOS 6 no doubt runs like butter, but it's boring, limited and looks dated.

    Windows Phone 8, for me, will be the best mobile operating system for non-geeks: intuitive, lightning fast, resource friendly, really good looking and modern.

    As long as they pack in decent SoC support (a dual-core S4 Pro) and link the Metro app store across all Windows platforms, that is going to be real competition IMO.

    When comparing SIMD engines, why are you guys comparing Atom to the A9? I was under the impression Krait holds the crown...
     
  12. Laurent06

    Laurent06 Veteran

    My point is that Atom has about the same speed as A9 for most tasks (when run at about the same frequency). Krait obviously is faster.
     
  13. Exophase

    Exophase Veteran

    Sure, but with ~9 cycle latency and a bunch of hazards (some which are sort of documented, others which are not) it's really hard to actually sustain this throughput.

    You're misreading Agner's comments. When he says "one single precision output" he means packed single precision, or four scalars. Look at his instruction timing tables, addps has a throughput of one per cycle and mulps one per two cycles, although there are apparently cases where it can achieve one per cycle (but he doesn't elaborate on this). Furthermore, FP adds and multiplies can be co-issued. You can see this allows a higher theoretical FMADD throughput even without an FMADD instruction. Cortex-A9 doesn't even have A8's NEON co-issuing capabilities, which is a major drawback for hand optimized NEON code.

    Atom's SIMD is really pretty good within the limitations of its ISA. Perhaps a little too good, since its utilization is going to be pretty low in Android apps.
     
  14. metafor

    metafor Regular

    All 4 cores run at the same clock, yes. However, when CPU utilization is low -- and the OS schedules it in batches -- the clock root can be opportunistically shut off. For most mobile CPUs, this requires a few cycles (the C1 state in the x86 world). This can happen automatically via the WFI/WFE instruction that is typically used when a thread goes to sleep.

    It can optionally also be used when there is a (in the case of mobile SoCs) very-long-latency memory read and the pipeline is backed up waiting for that load.

    This effectively accomplishes much of the same advantages as downclocking. Though obviously, all else being equal, the higher-clocked CPU will still consume more power. Just not as much as you'd think.


    Android wasn't created with such a specific aim in mind. Remember that this was in the days of Palm and PDAs. Most of them actually ran Java ME. It made development easy -- this was before the age of app stores -- and the choice of hardware flexible.

    It wasn't until iOS that UI "smoothness" was even a factor in most manufacturers' minds. Android originally wasn't even supposed to be touch-based.

    While it is a perception thing, I have to wonder how important being absolutely stutter-free is. I've played around with the One S and there is nothing about the UI that I find lacking in terms of use. Yes, you can push it to the point where a scroll list may skip a frame, but I think we've gotten to the point where it's not really a hindrance to usability.

    As for efficiency and battery life, the meager processing power required for most applications -- even with the JIT -- is absolutely dwarfed by the amount consumed by the screen, wifi, gps and cell radios.
     
    Last edited by a moderator: Jun 12, 2012
  15. silent_guy

    silent_guy Veteran Subscriber

    It's mostly a matter of the delight factor. Nothing will prevent you from reading emails on one or the other, but there's nothing wrong with finding these little flaws distracting if you're used to better. With the One X & friends that I've tried, it seems to have come to the point where it's pretty good now. It only took 5 years. :wink:

     
  16. rpg.314

    rpg.314 Veteran

    The changes you suggest are quite likely forbidden by the language spec.
     
  17. metafor

    metafor Regular

    So I finally got some time to look at this. What's odd is that the Exynos clocks each CPU at similar frequencies (except for core 0, for some reason), but instantaneous CPU utilization is still ~24%. Did Samsung go with symmetrical MP a la nVidia?

    I assume the 4 bars are OS utilization of each core. Correct me if I'm wrong, but it looks like one core does about twice as much total work as the second most used. A third core looks like it does again half that amount of total work, and the 4th half of that.
     
  18. french toast

    french toast Veteran

    Well, I'm not as informed as yourself; however, I would say that apart from core 1 and core 4, which seem to be able to be power-gated off or clocked independently of the rest (I think), any other scenario seems to be groupings of 2 or 3 cores at the same frequency.

    I don't think it is as limited as nVidia's setup; on the other hand, I don't think it's quite as optimised as Qualcomm's.

    Another prominent member has suggested that maybe they are sharing 2 voltage rails; perhaps that's correct, although I may have seen 3 clock at different frequencies, can't be sure though. (Unlikely.)
     
  19. Laurent06

    Laurent06 Veteran

    I indeed misread it, thanks for correcting my mistake. I'll try to rerun Linpack, but unless gcc has improved, I'm afraid it won't properly use Atom SIMD instructions (though in the last few years it seems some Intel employees have pushed some Atom-specific optimizations).
     
  20. mczak

    mczak Veteran

    Can you get single-precision numbers? Obviously they aren't comparable, but I'm curious how much faster Atom would be...
     