Intel Silvermont (next-gen OoO Atom)

If Intel gets traction in the mobile market, why would devs use a compiler that performs worse than Intel's just because the worse-performing compiler is available on other platforms?
Then why isn't every commercial app on Windows or Linux using icc?

My personal experience with icc is mixed, to say the least: when it's faster than gcc it's by a few percent, but most of the time the speed is very similar. Basically, if your code doesn't vectorize, or doesn't look like a benchmark that icc has specifically targeted, then there's no point in using it; just use the most widely used compiler on your platform (that is, VS for Windows and gcc for Linux/Android).
 
Thanks for the explanation, I was ignorant on the matter :)
 
Even if everybody were using icc, it is questionable whether you would see the same kind of differences. Now maybe icc indeed makes more of a difference on Atoms compared to "normal" x86 CPUs (and yes, I'd be interested in seeing results for less-widely-known code compiled with it), but the kind of optimization Intel is doing here smells like being targeted exclusively at this benchmark. So this isn't really a fair comparison, since no one is using a special gcc targeting benchmarks.
Also, it would help if the people developing such benchmarks disclosed the toolchain they are using up front, as it plays such a crucial role.
 
Even better... release the source code.
Now that would be neat - kinda like SPEC CPU, where you could certainly run your own numbers using your own toolchain.
Though given that at least the CPU portion just seems to be nbench, someone could try that instead...
 
There's a nbench app on the Android store ;)
With published results even...
Seems to be single-threaded, though, in contrast to AnTuTu. In any case Atoms look sort of OK there, though there are only desktop Atoms in the list, and the ARM SoCs listed aren't quite the fastest (the best I could spot was the MSM8960, a dual-core 1.5 GHz Krait, in the Motorola MB886).
 
Yes, not many results, in particular no Android x86 results :(
 
Various parts of a CPU pipeline can be in-order or out-of-order, depending on the goals of the particular architecture and there's nothing strange about it. For instance, P4's instruction fetch and decode phase was an in-order 8-stage pipeline organization.
 
I was not implying this is a strange choice, just an interesting design decision (if true).
Obviously, this would not be possible with a Core i design (as that has a unified scheduler), but since it is known that Silvermont has separate schedulers, this is indeed quite possible. However, Silvermont architecture articles (like this one: http://www.realworldtech.com/silvermont/4/) certainly implied the SIMD unit is fully OoO too.
Though I would definitely qualify the P4 design as interesting AND strange overall :).
 
Interestingly, Heise is claiming there are hints that the chip is not completely an out-of-order design. In particular, they say the SIMD unit could be in-order: http://www.heise.de/ct/artikel/Prozessorgefluester-1921728.html - I've never heard of this and there are no other sources there, but that's quite interesting.

Intel says it unambiguously in the software optimization guide.

The IEC schedulers pick the oldest ready instruction from each of its RSVs while the MEC and the FPC schedulers only look at the oldest instruction in their respective RSVs. Even though the MEC and FPC clusters employ in-order schedulers, a younger instruction from a particular FPC RSV can execute before an older instruction in the other FPC RSV for example (or the IEC or MEC RSVs).
http://www.intel.com/content/www/us...-ia-32-architectures-optimization-manual.html


I've actually never heard of a CPU that had out-of-order fetch or decode, have you? It seems like there wouldn't be value, since there aren't really dependencies between fetch/decode of different instructions, although I guess it'd qualify as out of order if you can fetch from later blocks that are in the icache when earlier ones aren't (not sure if this is really done either).
 
Oh, I missed that. In fact it even says so in the Silvermont microarchitecture overview too: "While floating-point and memory instructions schedule from their respective queues in program order, integer execution instructions schedule from their respective queues out of order."
Makes me wonder, though, whether it would really have been much effort to go "full" out-of-order in the FPU cluster. Granted, it does make the RSVs simpler, but that seems to be about it. I also wonder how allocation to the RSVs works, i.e. whether an instruction can be scheduled on either FP RSV (deciding this was a quite sore point on K8 CPUs). If the wrong one is picked, it potentially has to wait behind some instruction whose data isn't ready, even if it could be executed just fine (which could be a somewhat worse problem than it was on K8, as there picking the "wrong" RSV essentially only meant picking a busier execution pipe).
So I guess the SIMD unit is all in all really quite weak (don't forget the multiplier, as well as the divide unit, is also only 2x32-bit, and it still has the terrible non-pipelined microcoded horizontal instructions inherited from Bonnell - and not just horizontal ones: some very nice other instructions, like pmulld and pshufb, are like that as well). Kabini's SIMD unit is in a completely different class by comparison (not quite sure about A15, as the instructions are obviously different, but I suspect it's much better on paper as well).
 
http://www.anandtech.com/show/7263/intel-teases-baytrail-performance-with-atom-z3770-cinebench-score

[Attached image: d8RTgjf.png]
 
I had made a guess over at the AnandTech forums that the 2W SDP Silvermont Z3770 would come within 15% of the 15W Kabini A4-5000. It appears that my already generous assumption wasn't generous enough.

I doubt that we'll see numbers quite this high when Silvermont lands in the real world, but I think it's pretty safe to say that Silvermont is going to be a slam dunk from a performance standpoint.
 

Cinebench is scalar floating point, so Silvermont isn't penalized for its 64-bit SIMD integer/FP multiply pipe. It'd be interesting to see Silvermont in a diverse set of benchmarks. Also, next year should bring Jaguar's successor, which will bring Connected Standby and probably a good non-deterministic Richland-type turbo on the CPU side.
 