Next-Gen iPhone & iPhone Nano Speculation

If the workload is parallel enough to spread over 4 atomic cores, latency shouldn't be an issue.

Inter-tile parallelism won't help you hide latency unless the GPU has multiple tiles in flight per core.

If you are doing one tile per core at a time, then having lots of embarrassingly parallel tiles doesn't help with latency.
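
To put rough numbers on that, here's a toy utilization model (Python; the 300-cycle latency and 100 cycles of shading work per tile are made-up illustrative figures, not PowerVR specifics):

```python
# Toy model: why hiding latency needs multiple tiles in flight per core.
# With one tile in flight, the core stalls for the full fetch latency;
# with N tiles, it can shade other tiles while a fetch is outstanding.
FETCH_LATENCY = 300   # cycles until a tile's data arrives (illustrative)
WORK_PER_TILE = 100   # cycles of shading work once the data is resident

def utilization(tiles_in_flight: int) -> float:
    overlap = (tiles_in_flight - 1) * WORK_PER_TILE  # work that can cover the wait
    stall = max(0, FETCH_LATENCY - overlap)
    return WORK_PER_TILE / (WORK_PER_TILE + stall)

for n in (1, 2, 4):
    print(f"{n} tile(s) in flight per core: ~{utilization(n):.0%} busy")
```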
 
It's not 2x and I doubt MP4 will be 4x in most realistic cases. But let's be clear, it is still an order of magnitude faster than T3 and just about every SoC out there. Likely will be ahead of most that will be released this year.
 
Pixel fill should be almost exactly the same both ways. Geometry performance might, at most, be slightly reduced going with the multi-core approach.
 
Maybe Apple will be able to move the SoC to 32nm in time for the iPhone 5 (or whatever they will call it, probably the "new iPhone") later this year.

Or put out devices with an A6 SoC, as some are speculating, later this year rather than next year.
 
It's not 2x and I doubt MP4 will be 4x in most realistic cases. But let's be clear, it is still an order of magnitude faster than T3 and just about every SoC out there. Likely will be ahead of most that will be released this year.

Well, if Kishonti isn't going to change to GLBenchmark 2.5 anytime soon, I'd expect the MP4 to be a good notch over 2x faster than T3 in 720p Egypt. So yes, since the "order of magnitude" in both your case and mine also includes a "2", we agree.

Pixel fill should be almost exactly the same both ways. Geometry performance might, at most, be slightly reduced going with the multi-core approach.

Hmmm, wait. I thought that, according to their claims, it loses roughly 5% going from single- to multi-core. I hope that doesn't also mean you lose another 5% of geometry performance when you scale cores further (in this case, from 2 to 4).

Single SGX543@200MHz = 35 Mio Tris/s
SGX543MP2@200MHz = 66.5 Mio Tris/s
SGX543MP4@200MHz = 133 Mio Tris/s

If I'm not misunderstanding the geometry scaling, then theoretical peak geometry rates (assuming the same frequency for MP2 and MP4) should be twice as high on the MP4 as on the MP2.

SGX543@250MHz = 41.56 Mio Tris/s (per core, with the ~5% multi-core overhead applied)
SGX543MP2@250MHz = 83.13 Mio Tris/s
SGX543MP4@250MHz = 166.25 Mio Tris/s
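
If that reading is right, the whole thing reduces to simple arithmetic. A minimal sketch of the implied model, assuming ~35 Mio Tris/s per core at 200MHz, a one-time ~5% multi-core loss, and linear scaling with core count and clock (all taken from the figures above):

```python
# Peak geometry scaling model implied by the figures above:
# ~35 Mio Tris/s per SGX543 core at 200MHz, a one-time ~5% loss for going
# multi-core, then linear scaling with core count and clock.
BASE_MTRIS_200MHZ = 35.0   # single SGX543 @ 200MHz
MP_OVERHEAD = 0.95         # ~5% loss once you go multi-core

def peak_mtris(cores: int, mhz: float) -> float:
    scale = 1.0 if cores == 1 else MP_OVERHEAD
    return BASE_MTRIS_200MHZ * cores * scale * (mhz / 200.0)

for cores, mhz in [(1, 200), (2, 200), (4, 200), (2, 250), (4, 250)]:
    label = "SGX543" if cores == 1 else f"SGX543MP{cores}"
    print(f"{label}@{mhz}MHz = {peak_mtris(cores, mhz):g} Mio Tris/s")
# -> 35, 66.5, 133, 83.125, 166.25
```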
 
I thought all GPUs did this... doesn't seem to make sense not to.

Let's say you have ~100-pixel tiles on a 3M-pixel screen; that's pretty low to begin with, but if you want, say, 4 tiles/core to hide latency, then you are rapidly approaching desktop-class multithreading, which is not exactly ideal in mobile.
 
GLBenchmark is pretty old at this point. Lots of SoCs are exceeding 60fps. They really need to upgrade the benchmark, or at least upgrade the assets used.
 
Let's say you have ~100-pixel tiles on a 3M-pixel screen; that's pretty low to begin with, but if you want, say, 4 tiles/core to hide latency, then you are rapidly approaching desktop-class multithreading, which is not exactly ideal in mobile.

Why wouldn't it be ideal for mobile? Are there really that many interdependencies between tiles for a rendering engine? Are there any at all? If not, then I don't really see why a cascading fetch with a decently sized SRAM pool couldn't hide the latency.

I'm not sure what pixel formats are used in PVR, but let's say RGBA at 4 bytes/pixel. So that's 400 bytes for a 100-pixel tile. Let's say we have a 300-cycle wait between fetch and first-byte memory return (rather high, especially at GPU clocks, where it's arguably a few orders of magnitude too high, but let's say we're counting in CPU clocks here).

So to effectively hide the access-to-first-byte latency, you'd want enough tile fetches in flight to cover those 300 cycles: 300 × 400 bytes = 120,000 bytes, or roughly a 128KB SRAM pool. That doesn't seem too far-fetched.
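
As a back-of-the-envelope script, using the same assumptions (RGBA8, 100-pixel tiles, 300 cycles of latency, one tile's worth of data kept in flight per cycle to be covered):

```python
# Back-of-the-envelope SRAM sizing for hiding access-to-first-byte latency,
# using the assumptions from the post above.
BYTES_PER_PIXEL = 4      # RGBA8
PIXELS_PER_TILE = 100
LATENCY_CYCLES = 300

bytes_per_tile = BYTES_PER_PIXEL * PIXELS_PER_TILE   # 400 bytes
sram_bytes = LATENCY_CYCLES * bytes_per_tile          # 120,000 bytes
print(f"{sram_bytes} bytes ≈ {sram_bytes / 1024:.0f} KB; "
      "call it a 128KB SRAM pool")
```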
 
Why wouldn't it be ideal for mobile? Are there really that many interdependencies between tiles for a rendering engine? Are there any at all? If not, then I don't really see why a cascading fetch with a decently sized SRAM pool couldn't hide the latency.

I'm not sure what pixel formats are used in PVR, but let's say RGBA at 4 bytes/pixel. So that's 400 bytes for a 100-pixel tile. Let's say we have a 300-cycle wait between fetch and first-byte memory return (rather high, especially at GPU clocks, where it's arguably a few orders of magnitude too high, but let's say we're counting in CPU clocks here).

So to effectively hide the access-to-first-byte latency, you'd want enough tile fetches in flight to cover those 300 cycles: 300 × 400 bytes = 120,000 bytes, or roughly a 128KB SRAM pool. That doesn't seem too far-fetched.

You have considered only fragment color so far; there's more to it than that. Let's assume no transparencies to simplify things. 4 tiles/core and 100 pixels/tile (which is awfully low, really awfully low, for multi-million-pixel displays, btw) gets you 400 threads per core. Scoreboarding, shader context, etc. balloon rapidly. And don't forget that G80 was capped at 512 threads/block in CUDA mode, and that's a GPU which is unmatched to date in terms of features.

Throw in more realistic numbers for tile sizes, say 20 bytes per thread for shader context (which is still on the low side), add HDR rendering, and the memory requirements go through the roof. Far better to use a larger tile size and rely on intra-tile parallelism, IMO.
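
Roughly quantified below; the 4 tiles/core and 20 bytes/thread figures are from the paragraph above, while the larger tile sizes and the FP16 HDR target (8 bytes/pixel) are illustrative assumptions of mine, not SGX specifics:

```python
# Rough illustration of how per-core buffering grows with tile size:
# 4 tiles in flight per core, 20 bytes of shader context per thread,
# plus an assumed FP16 HDR colour target (8 bytes/pixel). Scoreboarding
# and other bookkeeping would come on top of this.
TILES_PER_CORE = 4
CONTEXT_BYTES_PER_THREAD = 20
HDR_BYTES_PER_PIXEL = 8   # RGBA16F, assumed for illustration

for pixels_per_tile in (100, 16 * 16, 32 * 32):
    threads = TILES_PER_CORE * pixels_per_tile
    total = threads * (CONTEXT_BYTES_PER_THREAD + HDR_BYTES_PER_PIXEL)
    print(f"{pixels_per_tile:>4} px/tile -> {threads:>5} threads, "
          f"~{total / 1024:.0f} KB per core")
```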
 
Ah, I see. And while not directly related, I'll change my guess on the app processor: it should still be Sammy's 45nm, which is why they might not have bumped the GPU clocks at all, or not by much.

Hmm, I'm not so sure they would have stuck with the 45nm process. Die size would be quite high in that case and could possibly even be approaching the limits of packaging. Surely it has to be on 32nm; I mean, even if 32nm was a little while away, it would make sense to wait and launch then.

Besides, it's not unprecedented for Apple to launch first on a new Samsung process. Even when the original iPad launched in 2010, the A4 was the first chip to ship on Samsung's 45nm process; Samsung's own Hummingbird SoC only shipped a month or more later.
 
Samsung delivered a 45nm SoC in 2010. If they cannot deliver a shrink in 2012, then obviously Apple wouldn't use them as a foundry. It's Moore's Law: shrink every 2 years or die.
 
Samsung delivered a 45nm SoC in 2010. If they cannot deliver a shrink in 2012, then obviously Apple wouldn't use them as a foundry. It's Moore's Law: shrink every 2 years or die.

Tell that to AMD/GF ;) (No, seriously... there really isn't any clear information on when they are moving to 22/20nm. AMD's roadmap so far indicates that they'll continue on 32/28nm through 2013. Intel will be close to 14nm by that point.)

Anyway, what other choice does Apple have currently? TSMC has more than enough demand to deal with at the moment.
 
You have considered only fragment color so far; there's more to it than that. Let's assume no transparencies to simplify things. 4 tiles/core and 100 pixels/tile (which is awfully low, really awfully low, for multi-million-pixel displays, btw) gets you 400 threads per core. Scoreboarding, shader context, etc. balloon rapidly. And don't forget that G80 was capped at 512 threads/block in CUDA mode, and that's a GPU which is unmatched to date in terms of features.

Throw in more realistic numbers for tile sizes, say 20 bytes per thread for shader context (which is still on the low side), add HDR rendering, and the memory requirements go through the roof. Far better to use a larger tile size and rely on intra-tile parallelism, IMO.

Fair enough. How well do the front-ends handle partially fetched tiles? I imagine the metadata can be fetched fairly quickly (and perhaps cached on its own) while the vertex information is being fetched.

Moreover, why would relying on intra-tile parallelism with fewer parallel front-ends suffer from latency, if the latency can be masked with fetch pipelining?
 
Apple would've been planning for a sub-45nm app processor; I can imagine it both ways at this point. And I assume the new screen might account for the extra battery capacity pretty much all by itself.

We'll know soon enough.

Either way, the next iPhone will certainly be on the new process in my estimation, and I wonder if they'll still reduce the clocks comparatively this time. Apple seems more comfortable giving each iOS product its own SoC performance profile, though.
 
So do I understand correctly that it has basically the same graphics chip as the Vita, and that the clocks are the only unknown here in determining how far apart they are?

And is the CPU basically the same as the Vita's as well, just with 2 cores instead of 4? Except that we don't know yet whether it has NEON, or what the "+" in the Vita's stands for in terms of what Sony customised (or do we know?). EDIT: Wiki says that the A5X has NEON, but the "+" is still unknown.

If so, I think the comments about the iPad's performance versus a 360 are hugely overblown. It won't even reach 25% of the 360, I reckon? And it has that many more pixels to push...

I'm not 100% up to speed on what's in the iPad tech, so I would love to learn some more details about relative performance.
 
Apple would've been planning for a sub-45nm app processor; I can imagine it both ways at this point. And I assume the new screen might account for the extra battery capacity pretty much all by itself.

We'll know soon enough.

Either way, the next iPhone will certainly be on the new process in my estimation, and I wonder if they'll still reduce the clocks comparatively this time. Apple seems more comfortable giving each iOS product its own SoC performance profile, though.

Found this article about the Retina Display. It notes that the display requires twice as many LEDs for the backlight, which may explain some of the need for the increased battery capacity. Obviously LTE would be a big draw on the increased capacity too.

http://gigaom.com/apple/the-science-behind-the-new-ipads-display/
 
Both are 543MP4. The iPad's is clocked at least 50 MHz higher. The "+" customization on the Vita's is said to be a relatively minor addition to feature support.

While the iPad has good memory performance, the Vita has its dedicated video RAM. The iPad has more total RAM, but apps are more limited in how they can use it.

There's far less abstraction in the API for the Vita and less overhead from the OS.

As for how a 543MP4 stacks up in general:
16 Vec4 all-purpose ALUs in a design yielding ~28.8 GFLOPS @ 200MHz
8 TMUs, so 1.6 Gtex/sec @ 200MHz
64 Z units, so 12.8 Gpix/sec Z/stencil @ 200MHz
Rated for ~130M+ tri/sec @ 200MHz
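
Those peaks are just unit count × clock. A quick sanity check, assuming each Vec4 ALU is counted at 9 flops/clock (a Vec4 MAD plus one extra op), which is what reproduces the ~28.8 GFLOPS figure:

```python
# Peak-rate sanity check for a 543MP4 at 200MHz.
# Assumption: 9 flops per Vec4 ALU per clock (Vec4 MAD = 8 flops + 1 extra op).
CLOCK_HZ = 200e6
ALUS, FLOPS_PER_ALU_CLK = 16, 9
TMUS = 8
Z_UNITS = 64

print(f"ALU: {ALUS * FLOPS_PER_ALU_CLK * CLOCK_HZ / 1e9:.1f} GFLOPS")    # ~28.8
print(f"TEX: {TMUS * CLOCK_HZ / 1e9:.1f} Gtex/sec")                      # 1.6
print(f"Z:   {Z_UNITS * CLOCK_HZ / 1e9:.1f} Gpix/sec Z/stencil")         # 12.8
```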

Performance has to be considered in the context of a TBDR, so the benefits of a tile buffer and 100% efficiency for texel rate also apply.

The Vita and the iPad both have NEON on the CPU side.
 
Thanks for the article on the iPad's new display tech.

I figured LTE support would be a bit of a drain, especially from a 45nm baseband, yet battery life over WiFi isn't being rated for much improvement either. The significance of LTE's power draw here seems to be drowned out by the tug-of-war over battery life between the higher-capacity battery and the backlighting/display.
 