I understand the appeal of big/little, but isn't that a theoretical option at the moment? Is big/little available as a solution now or is actual silicon still a year out?
http://www.eetimes.com/electronics-...g-little--no-Haswell--Project-Denver-at-ISSCC
It'd be interesting to see how much power you could save (*if* it's on the same process) just by synthesizing for a lower clock. Tegra 3 can't provide any insight into that.
It's almost certainly also using a lower Vt rather than only a different synthesis target. The problem on 40LP is that even Low Vt wasn't that fast and wasn't as power efficient as 40G High Vt, and there was also a big difference in nominal voltage (1.1V vs 0.9V). I think 28HPL has a much higher baseline performance (such that you'd presumably want a lot more logic on High Vt rather than standard Vt), and the nominal voltage difference isn't as big (1.0V vs 0.9V).
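To put those nominal-voltage numbers in perspective, here's a back-of-the-envelope sketch (my own, using the usual P ≈ C·V²·f rule of thumb and ignoring leakage entirely):

```python
# Back-of-the-envelope only: dynamic power scales roughly with the square of the
# supply voltage (P_dyn ~ C * V^2 * f), so the nominal-voltage gaps quoted above
# matter quite a lot on their own.
for label, v_hi, v_lo in (("40nm pair (1.1V vs 0.9V)", 1.1, 0.9),
                          ("28nm pair (1.0V vs 0.9V)", 1.0, 0.9)):
    print("%s -> ~%.2fx dynamic power at the same frequency" % (label, (v_hi / v_lo) ** 2))
```

So roughly a ~1.5x dynamic power penalty for the 40nm voltage gap versus ~1.2x for the 28nm one, before you even get to Vt and synthesis differences.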
I really expected them to use 28HPM since it felt like the natural replacement for 28HPL, but I suppose they concluded they didn't need 28HPM's Ultra-High-Vt option and that the higher performance wasn't worth the cost, or maybe it's just a timing issue since 28HPM is lagging quite a lot (and already was before TSMC delayed it).
What do you mean by 'with both sets active'? That both the big and the little cores of the same pair are running their own independent code, so you're basically running 8 CPUs? That's crazy.
Is this what ARM is currently promoting?
Yes. In fact they've promoted it from the start; see the "big.LITTLE MP Use Model" section of this paper:
http://www.arm.com/files/downloads/big.LITTLE_Final.pdf
And they're promoting an asymmetric number of cores as a real option for the Cortex-A5x generation:
http://images.anandtech.com/doci/6420/Screen Shot 2012-10-30 at 12.22.25 PM.png (although it remains to be seen whether the software will be ready by then, specifically the Linux kernel).
What's the performance difference between big and little anyway?
I'd say ~2x in ILP and ~1.5x in clock speed for A15 vs A7, so ~3x at a given voltage. The power efficiency advantage at a given voltage is obviously massively higher than that...
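Just to make the composition of those estimates explicit (illustrative arithmetic only, using the rough figures above):

```python
# Rough composition of the ~2x ILP and ~1.5x clock estimates above;
# these are ballpark figures from the post, not measured data.
ilp_advantage = 2.0     # per-clock (IPC) advantage assumed for A15 over A7
clock_advantage = 1.5   # clock-speed advantage assumed at a given voltage
print("A15 vs A7 at iso-voltage: ~%.1fx performance" % (ilp_advantage * clock_advantage))
```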
I think this is a developer's headache more than anything else, not something most people will actively experience as a reduction in image quality.
I disagree. You can minimise the problem by setting your depth range properly (and games/engines designed on other hardware and not properly tested on Tegra might not bother), but even then there are always some scenes where 16-bit isn't enough and you *will* have artifacts. You could make an argument that many people might not notice that, but then they might not notice the higher graphics performance either...
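For anyone who hasn't run into this: here's a rough sketch (my own numbers, a standard perspective projection, with near=1 and far=1000 picked purely for illustration) of how coarse the last depth step gets at the far plane:

```python
# Rough illustration (not from the thread): size of the final depth-buffer step
# near the far plane for a standard perspective projection with near plane n
# and far plane f. The non-linear depth mapping is what makes 16-bit fall apart.
def last_step_span(n, f, bits):
    steps = 2 ** bits - 1
    z_w = 1.0 - 1.0 / steps                    # last representable value below 1.0
    z_eye = (f * n) / (f - z_w * (f - n))      # invert z_w = f/(f-n) * (1 - n/z_eye)
    return f - z_eye                           # eye-space distance covered by that step

for bits in (16, 24):
    print("%d-bit: last step spans ~%.3f world units" % (bits, last_step_span(1.0, 1000.0, bits)))
```

With a 1000:1 depth range that's roughly 15 world units per step at the far plane for 16-bit versus ~0.06 for 24-bit; tightening the near plane (the "depth range" point above) shrinks it, but there's only so far you can go.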
Furthermore, the power efficiency cost of an uncompressed depth buffer is significant due to the higher bandwidth, and 24-bit with framebuffer compression would almost certainly be more power efficient than 16-bit without compression. It's really just a way to save area, and I don't think it makes any sense whatsoever for this generation of hardware. So hopefully they at least bothered with compression this time, even if it's basically the same architecture...
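As a rough idea of the bandwidth side (purely illustrative numbers: 1280x800, an assumed average depth read+write factor of 3, 60 fps, no compression):

```python
# Illustrative only: raw, uncompressed depth traffic for a hypothetical scene.
pixels = 1280 * 800
traffic_factor = 3    # assumed average depth reads+writes per pixel
fps = 60
for name, bytes_per_px in (("16-bit depth", 2), ("24-bit (stored as 32)", 4)):
    gbps = pixels * bytes_per_px * traffic_factor * fps / 1e9
    print("%s: ~%.2f GB/s" % (name, gbps))
```

Halving the per-pixel size does halve the raw traffic, but a compressed 24-bit buffer gets you the bandwidth back without giving up precision, which is the point above.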
But more importantly: again, as a user, I don't think it matters for the vast majority of games that are currently out there. Cut the Rope, Angry Birds, board games, etc. are all at the top of the sales charts. They have cute graphics and they'd better be smooth these days, but 16-bit Z-fighting is not going to be a concern.
Agreed, although you'd be surprised by how many 3D games are in the Top 10 of the iOS sales chart nowadays (Android is obviously still lagging behind).
Anyway, the problem here is that if you want high CPU performance, the SoCs usually have high GPU performance as well, which you clearly don't need outside of 3D games. There's an interesting question of what the "mainstream" of the market will move to a few years from now, given that 3D performance is only a differentiator for part of the market, and realistically a browser doesn't need a quad-core OoOE monster either.
Personally I can see something like 1xA57+4xA53 with a fairly cheap GPU being a very attractive solution in the 20nm timeframe if the industry is willing to take the risk.
I haven't seen many people rave about the 2x faster GPU on the iPad 4. It's just not very noticeable.
I'd argue the benefit to Apple isn't people raving about how fast it is, but rather preventing people/competitors raving about how fast other products are. If you told me 5 years ago that NVIDIA wouldn't even talk about relative GPU performance compared to an in-house Apple SoC because it's clearly lagging behind, I'd have laughed you out of the room.
I understand the first order bandwidth implications of an IMR, but it strikes me that the effects are less pronounced in the real world than what you'd expect them to be. Take Tegra 3: it has the same 32-bit wide MC as Tegra 2, needs to feed twice the CPUs and a more demanding GPU, yet the results are much better than you'd expect, with a pretty small die size to boot.
I don't know if I'd say "same 32-bit wide MC" given that the memory clock speed increased significantly and the highest-performance products like the Prime Infinity use DDR3-1600 vs LPDDR2-667 for Tegra 2, but yeah, Tegra 3's performance is still surprisingly good with LPDDR2-800.
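For reference, the peak numbers on a 32-bit (4-byte) interface work out roughly as follows (simple peak arithmetic, ignoring real-world efficiency):

```python
# Peak theoretical bandwidth on a 32-bit (4-byte) memory interface.
bus_bytes = 4
for name, mt_per_s in (("LPDDR2-667 (Tegra 2)", 667),
                       ("LPDDR2-800 (Tegra 3)", 800),
                       ("DDR3-1600 (Prime Infinity)", 1600)):
    print("%s: ~%.1f GB/s peak" % (name, mt_per_s * bus_bytes / 1000.0))
```

So the "same 32-bit MC" spans anywhere from roughly 2.7 to 6.4 GB/s peak depending on the memory fitted.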
Take the A5X (http://www.anandtech.com/show/5688/apple-ipad-2012-review/15, bottom of page): it's a ridiculous 160mm2 vs 80mm2. And the GPU-only area ratio should be even more out of whack, yet at equivalent resolutions it's only 2.5x faster. The A5X not only has way more external BW, the disparity is even higher once you take the on-chip RAMs into account. I'm really not impressed by this 2.5x and I don't understand why it's so little.
Part of the reason has to be that the clock speed is much lower, very likely both because Tegra's RTL is targeted at surprisingly high clock speeds for a handheld product (remember the GPU is on the LP voltage rail, not the G one) and because Apple focused the synthesis/voltage choice much more on power consumption than NVIDIA. So the question is whether the A5X is indeed more power efficient and I think the anecdotal evidence is that it is, so if you targeted the same power efficiency on the same process (not 45 vs 40) then it might be quite a bit faster.
Even then Tegra might still be more area efficient than SGX. All I'm going to say on the matter is this based strictly on public information: SGX supports a branch granularity of 1 (true MIMD scheduling although with Vec4 ALUs for peak performance) with FP32 ALUs while Tegra doesn't support control flow at all in the pixel shader (predication only) and uses FP24 ALUs. Unfortunately benchmarks (and many workloads) do not benefit from MIMD/FP32 and it should be obvious that you can achieve much better area efficiency in the shader core with NVIDIA's trade-offs...
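A toy cost model (my own sketch, with made-up cycle counts) of why that branch-granularity difference only pays off when shaders actually branch:

```python
# Toy cost model, illustrative numbers only: with predication every pixel pays
# for both sides of a branch, while per-pixel (MIMD) scheduling only pays for
# the side each pixel actually takes.
def predicated_cost(if_cycles, else_cycles, frac_if):
    return if_cycles + else_cycles                          # both paths always executed

def mimd_cost(if_cycles, else_cycles, frac_if):
    return frac_if * if_cycles + (1.0 - frac_if) * else_cycles

# Hypothetical shader: an expensive path taken by 10% of pixels.
print("predication-only: %.1f cycles/pixel" % predicated_cost(20, 4, 0.1))
print("per-pixel MIMD  : %.1f cycles/pixel" % mimd_cost(20, 4, 0.1))
```

If nothing in the workload branches, the two models cost the same, which is exactly why benchmarks don't reward the MIMD/FP32 hardware and why NVIDIA's trade-offs look so area efficient there.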