[Article] Handheld CPUs: Past, Present & Future

Thanks for a very nice, easy and informative read.

Perhaps the article could have mentioned the PSP2 and Nintendo 3DS? Also, NVIDIA is becoming a player with aspirations of hitting the smartphone market with its Tegra 2, and finally I am sure AMD is positioning Bobcat for shrinks to counter not only Atom but also ARM.
 
Very nice article.
Not too complicated, so even people who aren't fluent readers of English shouldn't have problems with it, and yet it's very informative: it gives us the information needed to know the major differences between the various CPUs and what to expect next. Good job Arun! :)

Can we expect something similar about mobile GPUs? If so, when?
 
Oh I think I found one more favorite author. :)

EDIT:
First of all, Intel's 45nm process is not very dense compared to TSMC's 40nm process (no immersion lithography and more restrictive design rules) - it's very different for Intel 32nm vs TSMC 28nm, as TSMC introduced restrictive design rules while Intel uses dual patterning lithography and TSMC apparently doesn't.

Wait, TSMC doesn't use double patterning? Really? What about their IEDM presentation? Intel has actually used DP since 65nm; it's not like 32nm changes that.

There are also TDP figures for the Lincroft chip in Moorestown and Oak Trail.

Oak Trail: 3W @ 1.6GHz
Moorestown: 1.3W @ 900MHz/2.2W @ 1.5GHz
 
And finally, the next-generation 28nm MSM8960 with LTE support and a completely new CPU architecture is expected to sample in early 2011

I had thought that Qualcomm had simply stated that this chip was going to sample "in 2011", which usually means it will be late in the year :cry:. Have you heard something different about the timeline for sampling?
 
Thanks for the kind words. On when you can expect something similar for mobile GPUs: mid-March if you're lucky, but then again I was hoping for early/mid-December for this article so that doesn't mean much. I also remember I promised an article about Icera 'early next week' nearly 2 years ago(!) so maybe I'll try to finally deliver on that first! :)

I think the timing of anything I write about mobile GPUs also depends on the announcement times for the next-generation architectures (the CUDA-capable GPU in Tegra3, IMG's PowerVR Series 6, Qualcomm's next-gen, etc.) - I don't need to be able to do in-depth analysis of any of them, but at least some public info and quick intelligent speculation based on that would help. So we'll see what happens there.

Tahir2 said:
Perhaps the article could have mentioned the PSP2 and Nintendo 3DS? Also NVIDIA is becoming a player with aspirations of hitting the smartphone market with its Tegra2 and finally I am sure AMD is positioning Bobcat for shrinks to counter not only Atom but ARM.
I was thinking of mentioning the PSP2, but it wasn't official when I wrote 99% of the article, and I forgot to make a last-minute addition. So I suppose that's what this thread is for: do you think the PSP2 being quad-core will help legitimise quad-core as an advantage in handhelds as well? As for NVIDIA, obviously except for Project Denver (which I mention on Page 2 and 4), they're using the same ARM11/Cortex-A9/Cortex-A15 cores that are described on Page 1.

DavidC said:
Wait, TSMC doesn't use double patterning? Really? What about their IEDM presentation? Intel actually used DP since 65nm, its not like 32nm changes that.
Little-known fact, I know! Mind you it's not the only one in the article ;) Here's the source for my original claim:
http://www.edn.com/blog/Practical_Chip_Design/37542-No_TSMC_28_nm_is_not_late_they_are_on_the_record.php said:
Another aspect of conservatism is that TSMC has stayed with single-pattern lithography even at the most critical layers. They are using the latest ASML 1950i immersion steppers on some of these layers, however.
This article dates from August 2009 and TSMC could have changed plans since then, but I believe they haven't, because it's really quite expensive (GlobalFoundries once singled it out as one of their most expensive additions at 28nm, IIRC) and TSMC has emphasised in their financial conference calls that 28nm was barely more capital intensive than 40nm. So yes, I'm pretty confident they are indeed reserving double patterning for 20nm.

DavidC said:
There's also TDP figures for Lincroft chip in Moorestown and Oak Trail.
Yes, unfortunately those are for the full chip including multimedia and other things, so they're even less comparable than the numbers for the original Atom chip. I think the synthesis is basically identical anyway, so I doubt the numbers would be noticeably different.

convergedw said:
I had thought that Qualcomm had simply stated that this chip was going to sample "in 2011", which usually means it will be late in the year :cry:. Have you heard something different about the timeline for sampling?
Hmm, FWIW, some time ago I had heard 'first half' for Qualcomm's first 28nm chip sampling, and I can find several websites that mention 'early 2011' so I had assumed they were just repeating what Qualcomm said, but it turns out they might just be speculating so maybe my info is outdated. Personally I still think it makes sense they'd be sampling to lead customers in Q2 2011, so I won't edit the article just yet, but we'll see if Qualcomm says anything about the timing at MWC. Here's hoping they do more than that and we actually get some architectural info!
 
Hmm, FWIW, some time ago I had heard 'first half' for Qualcomm's first 28nm chip sampling, and I can find several websites that mention 'early 2011' so I had assumed they were just repeating what Qualcomm said, but it turns out they might just be speculating so maybe my info is outdated. Personally I still think it makes sense they'd be sampling to lead customers in Q2 2011, so I won't edit the article just yet, but we'll see if Qualcomm says anything about the timing at MWC. Here's hoping they do more than that and we actually get some architectural info!

The initial announcement for 8960 was a January "sampling" as announced at the investor's meeting. The actual date for commercial sampling is yet to be announced.
 
I was thinking of mentioning the PSP2, but it wasn't official when I wrote 99% of the article, and I forgot to make a last-minute addition. So I suppose that's what this thread is for: do you think the PSP2 being quad-core will help legitimise quad-core as an advantage in handhelds as well?
In a word, no, but then a quad-core processor may be more power- and performance-efficient than a dual-core A15, as you mention in your article. I believe the GPU will make the difference, however.

As for NVIDIA, obviously except for Project Denver (which I mention on Page 2 and 4), they're using the same ARM11/Cortex-A9/Cortex-A15 cores that are described on Page 1.

They also have Tegra 2, which is the development platform for Honeycomb, but that advantage may be short-lived.
 
The article doesn't mention anything further than Intel's 32nm. I expect 22nm for Atom would be a tipping point where ARMv7 and x86 reach the same power and performance. Do you think so too? And with Intel having a much more aggressive node-shrink cadence than TSMC, shouldn't running a smaller node one year earlier than other manufacturers provide enough benefits/advantages to an Atom SoC?

On mobile GPUs: it is interesting that TI announced their OMAP 5 for 2H 2012, and we are still not seeing PowerVR Series 6! It is annoying! The Duke Nukem of hardware?

I would love to know the software side of things. It is quite a well-known fact that software for graphics matters a lot more than hardware. As NVIDIA famously said, they have many more software engineers than hardware guys.

That is mainly because drivers make or break the GPU. You could have a decent GPU but poor drivers, application and gaming support (Matrox and S3).

Do Qualcomm, ARM, IMG provide drivers for every platform?
 
Great read.

Regarding the limited OOO capability of Marvell's PJ4 and Qualcomm's Scorpion: I suspect these use skewed/delayed execution, where ALU ops are delayed a number of cycles in relation to load ops. This reduces the apparent latency of loads at the cost of a higher branch mispredict penalty. Since loads are performed earlier than ALU ops, you can call this out-of-order execution.

It's a fairly common trick and it is also used in the Cortex A8.

The person posting as Wilco over at Realworldtech.com's forums has worked on the A8. He mentioned he regretted not skewing the pipes one more cycle to reduce the apparent latency of loads from one cycle to zero. That way the A8 could issue a load+ALU op pair without stalling if the load hit the data cache.

A more skewed pipeline, which results in virtually zero latency on L1 hits, could offer an explanation for why PJ4 and Scorpion perform so well compared to the A8.
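
To make the load-use pattern concrete, here's a minimal C sketch (my own illustration, not something from Wilco or the article). The compiled inner loop is essentially a load immediately followed by a dependent ALU op, which is exactly the pairing an extra cycle of skew would let issue without a stall on a cache hit.

Code:
#include <stdint.h>

/* Each iteration loads a word and immediately uses it in an ALU op.
   The inner loop compiles to roughly "LDR r3, [r0], #4" followed by a
   dependent "ADD r2, r2, r3". With the A8's current skew the ADD sees
   one cycle of apparent load latency; one more cycle of skew would hide
   it entirely on L1 hits. */
uint32_t sum_words(const uint32_t *p, int n)
{
    uint32_t sum = 0;
    for (int i = 0; i < n; i++)
        sum += p[i];
    return sum;
}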

Cheers
 
Do Qualcomm, ARM, IMG provide drivers for every platform?
I can't speak for QCT or ARM, but we provide licensed 'reference' drivers for a given core and OS which the licensee is encouraged to use in their products, and we push really hard for a standard of quality, correctness and performance to help sell the IP.
 
I've read in other places (MDR) that A15 supposedly has 2 symmetrical NEON pipelines. I wonder if this means it can perform 2x128-bit arithmetic instructions or whether each pipeline does 64-bit and can be used for either LD/ST or arithmetic.

Do you have a link handy for where you read that there were symmetrical units?

We know for sure that it's vec4 FP32 and can multi-issue something - these two things are probably not mutually exclusive, so I would presume it doesn't achieve vec4 by dual-issue. This probably means that it can at least do a vec4 FP32 FMADD and 128-bit load, store, or permute in the same cycle. There could be some more redundancy, but going for full symmetry seems like overkill and isn't something you see in other SIMD architectures (afaik)

I also think it'd be very unlikely that you'd see the capability of two independent loads per cycle on NEON when the integer core doesn't support such a thing, and it's not really the typical use case for SIMD (unless they added gather/scatter, which they didn't)
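
To illustrate the sort of pairing being discussed, here's a hypothetical SAXPY-style inner loop using the standard arm_neon.h intrinsics (my sketch, not anything ARM has published about the A15). Each iteration wants two 128-bit loads, one vec4 multiply-accumulate and one 128-bit store, so being able to issue a vec4 FMADD alongside a 128-bit load/store every cycle is exactly what loops like this would exploit.

Code:
#include <arm_neon.h>

/* y[i] += a * x[i], four lanes at a time; assumes n is a multiple of 4.
   vmlaq_f32 is today's chained vec4 multiply-accumulate; on a core that
   can pair it with a 128-bit load or store each cycle, the memory
   accesses here largely come for free. */
void saxpy_neon(float *y, const float *x, float a, int n)
{
    float32x4_t va = vdupq_n_f32(a);
    for (int i = 0; i < n; i += 4) {
        float32x4_t vx = vld1q_f32(x + i);
        float32x4_t vy = vld1q_f32(y + i);
        vy = vmlaq_f32(vy, va, vx);    /* vy += a * vx */
        vst1q_f32(y + i, vy);
    }
}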
 
Do you have a link handy for where you read that there were symmetrical units?

We know for sure that it's vec4 FP32 and can multi-issue something - these two things are probably not mutually exclusive, so I would presume it doesn't achieve vec4 by dual-issue. This probably means that it can at least do a vec4 FP32 FMADD and 128-bit load, store, or permute in the same cycle. There could be some more redundancy, but going for full symmetry seems like overkill and isn't something you see in other SIMD architectures (afaik)

I also think it'd be very unlikely that you'd see the capability of two independent loads per cycle on NEON when the integer core doesn't support such a thing, and it's not really the typical use case for SIMD (unless they added gather/scatter, which they didn't)

I was going by Jim Turley's write-up on Eagle at mdronline.

"the twin FP/Neon units are equipped to perform up to eight SIMD multiply/divide operations in parallel—four in each execution unit."

I don't know where that information came from; it could just be speculation. But if both pipelines can handle multiply/divide, that's quite a bit of FP muscle.
 
The slides imply a 4-way fused FMAD; the divide is in the integer unit.

John.

I'm doubting 2-way 128-bit arithmetic as well, but they specifically listed 4-way VFMAs. If one NEON pipe isn't restricted to just LD/ST, there could be 8-way ADD/SUB for F32 without a lot of modification to the datapath.

OTOH, 8-way F32 multiplies are a bit more difficult.
 
I was going by Jim Turley's write-up on Eagle at mdronline.

"the twin FP/Neon units are equipped to perform up to eight SIMD multiply/divide operations in parallel—four in each execution unit."

I don't know where that information came from; it could just be speculation. But if both pipelines can handle multiply/divide, that's quite a bit of FP muscle.

Thanks, but it looks like I have to pay $900 to see it. I'm not sure I trust Linley's information 100% at face value anyway.

NEON can currently issue a streamed FMUL + FADD as a single instruction at nearly double the latency, but of course this doesn't qualify as fused in either performance or precision terms. I wouldn't be very surprised if ARM did move to a fused FMADD implementation using the existing instructions, which would make it a good forward-thinking strategy.

NEON already has reciprocal approximation support. There are two instructions: an initial estimate that gives a value accurate to about 8 bits (I don't know precisely what they use to get this, maybe a LUT), and an iteration step which performs (2 - (op1 * op2)) and should be used in conjunction with an FMUL to perform a Newton-Raphson iteration. The latter is functionally equivalent to an FMADD, so it has the same issue and latency characteristics.
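
For reference, here's what that sequence looks like with the standard arm_neon.h intrinsics - a minimal sketch, with vrecpeq_f32 mapping to the initial estimate (VRECPE) and vrecpsq_f32 to the (2 - op1*op2) step (VRECPS):

Code:
#include <arm_neon.h>

/* Approximate 1/d per lane: VRECPE gives a rough initial estimate, and
   each VRECPS + FMUL pair is one Newton-Raphson refinement, roughly
   doubling the number of correct bits. */
float32x4_t recip_f32(float32x4_t d)
{
    float32x4_t x = vrecpeq_f32(d);          /* ~8-bit estimate   */
    x = vmulq_f32(vrecpsq_f32(d, x), x);     /* 1st NR refinement */
    x = vmulq_f32(vrecpsq_f32(d, x), x);     /* 2nd NR refinement */
    return x;
}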

So if FMADD latency is improved I imagine the reciprocal step latency would improve too; I can't see any reason not to continue using the same pipelines for it. On the chance that there are really 2x vec4 FMADD pipelines you should be able to do reciprocal steps and normal FMADDs in parallel. I wouldn't expect that they'd have two reciprocal initial approximation pipelines. What seems most likely is that there'll be one vec4 FMADD pipeline and one vec4 reciprocal estimation pipeline and you'll be able to run those in parallel.

What I strongly doubt is that we'll be seeing a pipelined full-precision divide instruction that runs 4x in parallel. The current approach is a good way to balance performance and reuse the FMADD pipeline - fully pipelined divides would take a lot of real estate and probably wouldn't offer much better latency than using the NR steps.
 
Thanks, but it looks like I have to pay $900 to see it. I'm not sure I trust Linley's information 100% at face value anyway.

NEON can currently issue a streamed FMUL + FADD as a single instruction at nearly double the latency, but of course this doesn't qualify as fused in either performance or precision terms. I wouldn't be very surprised if ARM did move to a fused FMADD implementation using the existing instructions, which would make it a good forward-thinking strategy.

They didn't. ARMv7-A LPA adds VFMA, VFMS, VFNMA and VFNMS for fused multiply-add. VFMA and VFMS exist in both NEON and VFP variants, whereas VFNMA and VFNMS are VFP-only.

Existing VMLA/VMLS/VNMLA/VNMLS instructions will stay chained. And I would imagine optimization guides will discourage the use of the chained instructions.
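
For anyone following along in C, the chained and fused forms are exposed as separate intrinsics - a small sketch, assuming a compiler/target with the VFPv4 fused units (__ARM_FEATURE_FMA defined):

Code:
#include <arm_neon.h>

/* Chained multiply-accumulate: maps to VMLA, i.e. the multiply is
   rounded before the add (two roundings). */
float32x4_t mac_chained(float32x4_t acc, float32x4_t a, float32x4_t b)
{
    return vmlaq_f32(acc, a, b);
}

#if defined(__ARM_FEATURE_FMA)
/* Fused multiply-accumulate: maps to the new VFMA, rounded only once,
   so results can differ from the chained form in the last bit. */
float32x4_t mac_fused(float32x4_t acc, float32x4_t a, float32x4_t b)
{
    return vfmaq_f32(acc, a, b);
}
#endif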

So if FMADD latency is improved I imagine the reciprocal step latency would improve too; I can't see any reason not to continue using the same pipelines for it. On the chance that there are really 2x vec4 FMADD pipelines you should be able to do reciprocal steps and normal FMADDs in parallel. I wouldn't expect that they'd have two reciprocal initial approximation pipelines. What seems most likely is that there'll be one vec4 FMADD pipeline and one vec4 reciprocal estimation pipeline and you'll be able to run those in parallel.

Yes, but the main question is how flexible the issue queue will be. If the second pipeline contains VRECPE as well as VLD/ST logic along with other simpler arithmetic ops, that will be quite a bloated pipeline. I guess we won't know until this thing can be benchmarked.

Are there really enough scenarios out there where one pipeline can be saturated with FMAC while the other needs to handle miscellaneous FP?

What I strongly doubt is that we'll be seeing a pipelined full-precision divide instruction that runs 4x in parallel. The current approach is a good way to balance performance and reuse the FMADD pipeline - fully pipelined divides would take a lot of real estate and probably wouldn't offer much better latency than using the NR steps.

There isn't an instruction for it. Just the legacy FDIV for VFP.
 