[Article] Handheld CPUs: Past, Present & Future

Discussion in 'Mobile Devices and SoCs' started by Arun, Feb 7, 2011.

  1. Arun

    Arun Unknown.
    Moderator Legend Veteran

    Joined:
    Aug 28, 2002
    Messages:
    5,023
    Likes Received:
    299
    Location:
    UK
    The article is finally up - enjoy! And don't be frightened by its size, it's not as if anyone would notice if you skipped a paragraph or two. Just don't forget to leave feedback here!

    Handheld CPUs: Past, Present & Future
     
  2. Florin

    Florin Merrily dodgy
    Veteran

    Joined:
    Aug 27, 2003
    Messages:
    1,625
    Likes Received:
    171
    Location:
    The colonies
    Ah sweet, let's have a look see
     
  3. Tahir2

    Veteran

    Joined:
    Feb 7, 2002
    Messages:
    2,978
    Likes Received:
    86
    Location:
    Earth
    Thanks, a very nice, easy and informative read.

    Perhaps the article could have mentioned the PSP2 and Nintendo 3DS? Also, NVIDIA is becoming a player with aspirations of hitting the smartphone market with its Tegra2, and finally I am sure AMD is positioning Bobcat for shrinks to counter not only Atom but ARM.
     
  4. Wishmaster

    Newcomer

    Joined:
    Nov 16, 2008
    Messages:
    238
    Likes Received:
    0
    Location:
    Warsaw, Poland
    Very nice article.
    Not too complicated, so even people who aren't fluent in English shouldn't have problems reading it, and yet very informative: it gives us the information needed to understand the major differences between the various CPUs and what to expect next. Good job Arun! :)

    Can we expect something similar about mobile GPUs? If so, when?
     
  5. DavidC

    Regular

    Joined:
    Sep 26, 2006
    Messages:
    347
    Likes Received:
    24
    Oh I think I found one more favorite author. :)

    EDIT:
    Wait, TSMC doesn't use double patterning? Really? What about their IEDM presentation? Intel has actually used DP since 65nm; it's not as if 32nm changes that.

    There are also TDP figures for the Lincroft chip in Moorestown and Oak Trail.

    Oak Trail: 3W @ 1.6GHz
    Moorestown: 1.3W @ 900MHz/2.2W @ 1.5GHz
     
    #5 DavidC, Feb 7, 2011
    Last edited by a moderator: Feb 7, 2011
  6. convergedw

    Newcomer

    Joined:
    Sep 2, 2008
    Messages:
    22
    Likes Received:
    0
    I had thought that Qualcomm had simply stated that this chip was going to sample "in 2011", which usually means it will be late in the year :sad:. Have you heard something different about the timeline for sampling?
     
  7. Arun

    Arun Unknown.
    Moderator Legend Veteran

    Joined:
    Aug 28, 2002
    Messages:
    5,023
    Likes Received:
    299
    Location:
    UK
    Thanks for the kind words. On when you can expect something similar for mobile GPUs: mid-March if you're lucky, but then again I was hoping for early/mid-December for this article so that doesn't mean much. I also remember I promised an article about Icera 'early next week' nearly 2 years ago(!) so maybe I'll try to finally deliver on that first! :)

    I think when I decide to write anything about mobile GPUs also depends on the announcement times for the next-generation architectures (the CUDA-capable GPU in Tegra3, IMG's PowerVR Series 6, Qualcomm's next-gen, etc.) - I don't need to be able to do in-depth analysis of any of them, but at least some public info and quick intelligent speculation based on that would help. So we'll see what happens there.

    I was thinking of mentioning the PSP2, but it wasn't official when I wrote 99% of the article, and I forgot to make a last-minute addition. So I suppose that's what this thread is for: do you think the PSP2 being quad-core will help legitimise quad-core as an advantage in handhelds as well? As for NVIDIA, obviously except for Project Denver (which I mention on Page 2 and 4), they're using the same ARM11/Cortex-A9/Cortex-A15 cores that are described on Page 1.

    Little-known fact, I know! Mind you, it's not the only one in the article ;) Here's the source for my original claim:
    This article dates from August 2009 and TSMC could have changed plans since then, but I believe they haven't because it's really quite expensive (GlobalFoundries once singled it out as one of their most expensive additions on 28nm, iirc) and TSMC has emphasised in their financial CCs that 28nm was barely more capital intensive than 40nm. So yes, I'm pretty confident they are indeed reserving double immersion for 20nm.

    Yes, unfortunately that's for the full chip including multimedia and other things, so it's even less comparable than the numbers for the original Atom chip. I think the synthesis is basically identical anyway, so I doubt the numbers would be noticeably different.

    Hmm, FWIW, some time ago I had heard 'first half' for Qualcomm's first 28nm chip sampling, and I can find several websites that mention 'early 2011' so I had assumed they were just repeating what Qualcomm said, but it turns out they might just be speculating so maybe my info is outdated. Personally I still think it makes sense they'd be sampling to lead customers in Q2 2011, so I won't edit the article just yet, but we'll see if Qualcomm says anything about the timing at MWC. Here's hoping they do more than that and we actually get some architectural info!
     
  8. metafor

    Regular

    Joined:
    May 26, 2010
    Messages:
    463
    Likes Received:
    0
    The initial announcement for 8960 was a January "sampling" as announced at the investor's meeting. The actual date for commercial sampling is yet to be announced.
     
  9. Tahir2

    Veteran

    Joined:
    Feb 7, 2002
    Messages:
    2,978
    Likes Received:
    86
    Location:
    Earth
    In a word, no, but then a quad-core processor may be more power and performance efficient than an A15 dual core, as you mention in your article. I believe the GPU will make the difference, however.

    They also have Tegra2, which is the development platform for Honeycomb, but that advantage may be short-lived.
     
  10. iwod

    Newcomer

    Joined:
    Jun 3, 2004
    Messages:
    179
    Likes Received:
    1
    The article doesn't mention anything beyond Intel's 32nm. I expect 22nm Atom to be a tipping point where ARMv7 and x86 reach the same power and performance. Do you think so too? And with Intel on a much more aggressive node-shrink schedule than TSMC, shouldn't running on a smaller node a year earlier than other manufacturers give Atom SoCs enough of an advantage?

    For mobile GPUs, it is interesting that TI announced OMAP 5 for 2H 2012, and we are still not seeing PowerVR 6! It is annoying! The Duke Nukem of hardware?

    I would love to know more about the software side of things. It is quite well known that, for graphics, software matters a lot more than hardware. As Nvidia famously said, they have many more software engineers than hardware guys.

    That is mainly because drivers make or break a GPU. You could have a decent GPU but poor drivers and poor application and gaming support (Matrox and S3).

    Do Qualcomm, ARM, IMG provide drivers for every platform?
     
  11. Gubbi

    Veteran

    Joined:
    Feb 8, 2002
    Messages:
    3,503
    Likes Received:
    828
    Great read.

    Regarding the limited OOO capability of Marvell's PJ4 and Qualcomm's Scorpion: I suspect these use skewed/delayed execution, where ALU ops are delayed a number of cycles in relation to load ops. This reduces the apparent latency of loads at the cost of a higher branch mispredict penalty. Since loads are performed earlier than ALU ops, you can call this out-of-order execution.

    It's a fairly common trick and it is also used in the Cortex A8.

    The person posting as Wilco over at Realworldtech.com's forums has worked on the A8. He mentioned he regretted not skewing the pipes one more cycle to reduce the apparent latency of loads from one to zero cycles. That way the A8 could issue a load+ALU op pair without stalling if the load hit the data cache.

    A more skewed pipeline, which results in virtually zero latency for L1 hits, could explain why PJ4 and Scorpion perform so well compared to A8.
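
    To make the trade-off concrete, here is a minimal toy model of a single-issue, in-order pipeline. It is only a sketch: the function names, the 10-cycle base mispredict penalty and the instruction streams are made up for illustration and do not describe any real core. The only idea taken from the post is that each cycle of skew removes one cycle of apparent load-to-use latency and adds one cycle to the branch mispredict penalty.

```python
def apparent_load_latency(load_to_use, skew):
    """Stall cycles a dependent ALU op sees after a load that hits L1."""
    return max(0, load_to_use - skew)


def count_cycles(ops, load_to_use, skew, base_mispredict_penalty):
    """Count cycles for a stream of ops on a single-issue, in-order toy core.

    ops is a list of tuples (hypothetical encoding):
      ("load", dst)      load into register dst, always an L1 hit
      ("alu", dst, src)  ALU op reading src and writing dst
      ("miss",)          a mispredicted branch
    """
    stall = apparent_load_latency(load_to_use, skew)
    ready = {}  # register -> first cycle in which a consumer may issue
    cycle = 0
    for op in ops:
        kind = op[0]
        if kind == "load":
            ready[op[1]] = cycle + 1 + stall
        elif kind == "alu":
            # Wait until the source operand is available.
            cycle = max(cycle, ready.get(op[2], 0))
            ready[op[1]] = cycle + 1
        elif kind == "miss":
            # Skewing the ALU stages later lengthens the refill after a miss.
            cycle += base_mispredict_penalty + skew
        cycle += 1  # one issue slot per op
    return cycle


# Four back-to-back load + dependent ALU pairs, then one mispredicted branch.
load_use = [("load", "r0"), ("alu", "r1", "r0")] * 4
branchy = load_use + [("miss",)]

for skew in (0, 1):
    print(skew,
          count_cycles(load_use, 1, skew, 10),
          count_cycles(branchy, 1, skew, 10))
```

    In this model the extra cycle of skew lets every load+ALU pair issue without a stall, while each mispredicted branch costs one cycle more, so whether the skew pays off depends on the mix of load-use pairs and mispredicts in real code.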

    Cheers
     
  12. Rys

    Rys AMD RTG
    Moderator Veteran Alpha

    Joined:
    Oct 9, 2003
    Messages:
    4,140
    Likes Received:
    1,338
    Location:
    Beyond3D HQ
    I can't speak for QCT or ARM, but we provide licensed 'reference' drivers for a given core and OS which the licensee is encouraged to use in their products, and we push really hard for a standard of quality, correctness and performance to help sell the IP.
     
  13. GZ007

    Regular

    Joined:
    Jan 22, 2010
    Messages:
    416
    Likes Received:
    0
  14. metafor

    Regular

    Joined:
    May 26, 2010
    Messages:
    463
    Likes Received:
    0
    I've read in other places (MDR) that A15 supposedly has 2 symmetrical NEON pipelines. I wonder if this means it can perform 2x128-bit arithmetic instructions or whether each pipeline does 64-bit and can be used for either LD/ST or arithmetic.
     
  15. Exophase

    Veteran

    Joined:
    Mar 25, 2010
    Messages:
    2,406
    Likes Received:
    429
    Location:
    Cleveland, OH
    Do you have a link handy for where you read that there were symmetrical units?

    We know for sure that it's vec4 FP32 and can multi-issue something - these two things are probably not mutually exclusive, so I would presume it doesn't achieve vec4 by dual-issue. This probably means that it can at least do a vec4 FP32 FMADD and 128-bit load, store, or permute in the same cycle. There could be some more redundancy, but going for full symmetry seems like overkill and isn't something you see in other SIMD architectures (afaik)

    I also think it'd be very unlikely that you'd see the capability of two independent loads per cycle on NEON when the integer core doesn't support such a thing, and it's not really the typical use case for SIMD (unless they added gather/scatter, which they didn't)
     
  16. metafor

    Regular

    Joined:
    May 26, 2010
    Messages:
    463
    Likes Received:
    0
    I was going by Jim Turley's write-up on Eagle at mdronline.

    "the twin FP/Neon units are equipped to perform up to eight SIMD multiply/divide operations in parallel—four in each execution unit."

    I don't know where that information came from; it could just be speculation. But if both pipelines can handle multiply/divide, that's quite a bit of FP muscle.
     
  17. JohnH

    Regular

    Joined:
    Mar 18, 2002
    Messages:
    586
    Likes Received:
    2
    Location:
    UK
    The slides imply 4 way fused FMAD, the divide is in the integer unit.

    John.
     
  18. metafor

    Regular

    Joined:
    May 26, 2010
    Messages:
    463
    Likes Received:
    0
    I doubt 2-way 128-bit arithmetic as well, but they specifically listed 4-way VFMAs. If one NEON pipe isn't restricted to just LD/ST, there can be 8-way ADD/SUB for F32 without a lot of modification to the datapath.

    OTOH, 8-way F32 multiplies are a bit more difficult.
     
  19. Exophase

    Veteran

    Joined:
    Mar 25, 2010
    Messages:
    2,406
    Likes Received:
    429
    Location:
    Cleveland, OH
    Thanks, but it looks like I have to pay $900 to see it. I'm not sure I trust Linley's information 100% at face value anyway.

    NEON can currently issue a streamed FMUL + FADD in a single instruction at nearly double the latency, but of course this doesn't qualify as fused in either performance or precision terms. I wouldn't be very surprised if ARM moved to a fused FMADD implementation using the existing instructions, which would make it a good forward-thinking strategy.

    NEON already has reciprocal approximation support. There are two instructions: an initial step that gives a value with 8 bits (I don't know precisely what they use to get these, maybe a LUT) and an iteration step that performs (2 - (op1 * op2)) and should be used in conjunction with an FMUL to perform a Newton-Raphson iteration. The latter is functionally equivalent to an FMADD, so it has the same issue and latency characteristics.
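
    For anyone who wants to see the refinement in action, here is a small scalar sketch of the estimate + step + multiply sequence in Python. The function names are hypothetical, and the initial estimate is faked by rounding the true reciprocal to about 8 bits, which is an assumption made for illustration rather than how the hardware produces it. It also runs in double precision rather than FP32.

```python
import math


def recip_estimate(d):
    """Stand-in for the initial-estimate instruction (roughly 8 good bits).

    Assumption: emulated by rounding the mantissa of 1/d to 8 bits.
    d must be non-zero.
    """
    m, e = math.frexp(1.0 / d)  # m lies in [0.5, 1) in magnitude
    return math.ldexp(round(m * 256) / 256, e)


def recip_step(d, x):
    """The iteration step described above: 2 - (op1 * op2)."""
    return 2.0 - d * x


def reciprocal(d, iterations=2):
    """Estimate, then refine: x_next = x * (2 - d * x)."""
    x = recip_estimate(d)
    for _ in range(iterations):
        x = x * recip_step(d, x)  # the multiply is the separate FMUL
    return x


for n in range(3):
    approx = reciprocal(3.0, n)
    print(n, approx, abs(3.0 * approx - 1.0))
```

    If the estimate has relative error e, one iteration leaves an error of e squared, so the number of good bits roughly doubles per step: about 8, then 16, then 32, which is why two iterations are enough for single precision.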

    So if FMADD latency is improved I imagine the reciprocal step latency would improve too; I can't see any reason not to continue using the same pipelines for it. On the chance that there are really 2x vec4 FMADD pipelines you should be able to do reciprocal steps and normal FMADDs in parallel. I wouldn't expect that they'd have two reciprocal initial approximation pipelines. What seems most likely is that there'll be one vec4 FMADD pipeline and one vec4 reciprocal estimation pipeline and you'll be able to run those in parallel.

    What I strongly doubt is that we'll be seeing a pipelined full-precision divide instruction that runs 4x in parallel. The current approach is a good way to balance performance and reuse the FMADD pipeline. Fully pipelined divides would take a lot of real estate and probably wouldn't have much better latency than using the NR steps.
     
  20. metafor

    Regular

    Joined:
    May 26, 2010
    Messages:
    463
    Likes Received:
    0
    They didn't. ARMv7-A LPA adds VFMA, VFMS, VFNMA and VFNMS for fused multiply-add. VFMA and VFMS exist in both NEON and VFP variants, whereas VFNMA and VFNMS are VFP only.

    Existing VMLA/VMLS/VNMLA/VNMLS instructions will stay chained. And I would imagine optimization guides will discourage the use of the chained instructions.
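
    As a rough illustration of why chained and fused differ in precision terms, here is a sketch in Python. It uses doubles rather than FP32 and emulates the fused operation with exact rational arithmetic, so the helper names and operand values are purely illustrative and say nothing about the actual hardware datapath.

```python
from fractions import Fraction


def chained_mla(a, b, c):
    """Multiply and round, then add and round again (two roundings)."""
    product = a * b
    return product + c


def fused_mla(a, b, c):
    """Exact a*b + c with a single final rounding, emulated via Fraction."""
    return float(Fraction(a) * Fraction(b) + Fraction(c))


# Operands chosen so the exact product is 1 - 2**-60, which rounds to 1.0.
a = 1.0 + 2.0 ** -30
b = 1.0 - 2.0 ** -30
c = -1.0

print(chained_mla(a, b, c))  # the low bits of the product are lost before the add
print(fused_mla(a, b, c))    # the low bits survive until the single rounding
```

    The chained version rounds the product to 1.0 before the add and returns zero, while the fused version keeps the -2**-60 term. The same cancellation effect is what separates a chained multiply-accumulate from a fused one at any precision.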

    Yes, but the main question is how flexible the issue queue will be. If the second pipeline contains VRECPE as well as VLD/ST logic along with other simpler arithmetics, that will be quite a bloated pipeline. I guess we won't know until this thing can be benchmarked.

    Are there really enough scenarios out there where one pipeline can be saturated with FMAC while the other needs to handle miscellaneous FP?

    There isn't an instruction for it. Just the legacy FDIV for VFP.
     