|
|
#1 |
|
Unknown.
Join Date: Aug 2002
Location: UK
Posts: 4,877
|
The article is finally up - enjoy! And don't be frightened by its size, it's not as if anyone would notice if you skipped a paragraph or two. Just don't forget to leave feedback here!
Handheld CPUs: Past, Present & Future |
|
|
|
|
|
#2 |
|
Merrily dodgy
Join Date: Aug 2003
Location: The colonies
Posts: 1,398
|
Ah sweet, let's have a look see
__________________
"A man generally has two reasons for doing a thing. One that sounds good, and a real one." - J.P. Morgan |
|
|
|
|
|
#3 |
|
Itchy
Join Date: Feb 2002
Location: United Queendom
Posts: 2,858
|
Thanks, a very nice, easy and informative read.
Perhaps the article could have mentioned the PSP2 and Nintendo 3DS? Also, NVIDIA is becoming a player with aspirations of hitting the smartphone market with its Tegra2, and finally I am sure AMD is positioning Bobcat for shrinks to counter not only Atom but ARM.
__________________
Time is an illusion. Lunchtime doubly so - Douglas Adams |
|
|
|
|
|
#4 |
|
Member
Join Date: Nov 2008
Location: Warsaw, Poland
Posts: 219
|
Very nice article.
Not too complicated, so even people who aren't fluent in English shouldn't have problems reading it, and yet very informative: it gives us the information needed to know what the major differences between different CPUs are and what to expect next. Good job Arun! Can we expect something similar about mobile GPUs? If so, when? |
|
|
|
|
|
#5 | |
|
Member
Join Date: Sep 2006
Posts: 273
|
Oh I think I found one more favorite author.
EDIT: Quote:
There are also TDP figures for the Lincroft chip in Moorestown and Oak Trail.
Oak Trail: 3W @ 1.6GHz
Moorestown: 1.3W @ 900MHz / 2.2W @ 1.5GHz
Last edited by DavidC; 07-Feb-2011 at 18:52. |
|
|
|
|
|
|
#6 | |
|
Junior Member
Join Date: Sep 2008
Posts: 19
|
Quote:
|
|
|
|
|
|
|
#7 | |||||
|
Unknown.
Join Date: Aug 2002
Location: UK
Posts: 4,877
|
Thanks for the kind words. On when you can expect something similar for mobile GPUs: mid-March if you're lucky, but then again I was hoping for early/mid-December for this article so that doesn't mean much. I also remember I promised an article about Icera 'early next week' nearly 2 years ago(!) so maybe I'll try to finally deliver on that first!
I think when I decide to write anything about mobile GPUs also depends on the announcement times for the next-generation architectures (the CUDA-capable GPU in Tegra3, IMG's PowerVR Series 6, Qualcomm's next-gen, etc.) - I don't need to be able to do in-depth analysis of any of them, but at least some public info and quick intelligent speculation based on that would help. So we'll see what happens there. Quote:
Quote:
Quote:
Quote:
Quote:
__________________
Focusing on non-graphics projects in 2013 (but I still love triangles) "[...]; the kind of variation which ensues depending in most cases in a far higher degree on the nature or constitution of the being, than on the nature of the changed conditions." |
|||||
|
|
|
|
|
#8 | |
|
Member
Join Date: May 2010
Posts: 463
|
Quote:
|
|
|
|
|
|
|
#9 | ||
|
Itchy
Join Date: Feb 2002
Location: United Queendom
Posts: 2,858
|
Quote:
Quote:
||
|
|
|
|
|
#10 |
|
Member
Join Date: Jun 2004
Posts: 168
|
The article doesn't mention anything further than Intel 32nm. I expect 22nm for Atom would be a tipping point where ARMv7 and x86 reach the same power and performance. Do you think so too? And with Intel having a much more aggressive node shrink schedule than TSMC, shouldn't running a smaller node one year earlier than other manufacturers provide enough of an advantage to Atom SoCs?
For mobile GPUs, it is interesting that TI announced their OMAP 5 for 2H 2012, and we are still not seeing PowerVR 6! It is annoying! The Duke Nukem of hardware?
I would love to know the software side of things. It is quite a well-known fact that software for graphics matters a lot more than hardware. As Nvidia famously said, they have many more software engineers than hardware guys. That is mainly because drivers make or break the GPU. You could have a decent GPU but poor drivers, application and gaming support (Matrox and S3). Do Qualcomm, ARM and IMG provide drivers for every platform? |
|
|
|
|
|
#11 |
|
Senior Member
Join Date: Feb 2002
Posts: 2,570
|
Great read.
Regarding the limited OOO capability of Marvell's PJ4 and Qualcomm's Scorpion: I suspect these use skewed/delayed execution, where ALU ops are delayed a number of cycles in relation to load ops. This reduces the apparent latency of loads at the cost of a higher branch mispredict penalty. Since loads are performed earlier than ALU ops, you can call this out-of-order execution.
It's a fairly common trick and it is also used in the Cortex A8. The person posting as Wilco over at Realworldtech.com's forums has worked on the A8. He mentioned he regretted not skewing the pipes one more cycle to reduce the apparent latency of loads from one to zero cycles. That way the A8 could issue a load+ALU op pair without stalling if the load hit the data cache.
A more skewed pipeline, which results in virtually zero latency for L1 hits, could offer an explanation for why PJ4 and Scorpion perform so well compared to A8.
Cheers
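To put rough numbers on the skew trade-off described above, here is a toy cycle model in Python. It is only a sketch: the function names, the instruction mix and every latency figure are made up for illustration and do not describe the A8, PJ4 or Scorpion. The only idea taken from the post is that each cycle of skew hides one cycle of load-to-use latency and adds one cycle to the branch mispredict penalty.

```python
def apparent_load_latency(load_to_use: int, skew: int) -> int:
    """Stall cycles seen by an ALU op that consumes a load result.

    Delaying ALU ops by `skew` cycles hides that many cycles of the
    load-to-use latency (it cannot go below zero).
    """
    return max(0, load_to_use - skew)


def estimated_cycles(instrs: int, dependent_loads: int, mispredicts: int,
                     load_to_use: int, base_branch_penalty: int,
                     skew: int) -> int:
    """Very rough cycle count for a single-issue pipeline (hypothetical model)."""
    load_stalls = dependent_loads * apparent_load_latency(load_to_use, skew)
    # Branches resolve in the (delayed) ALU stage, so each cycle of skew
    # is assumed to add one cycle to the mispredict penalty.
    branch_stalls = mispredicts * (base_branch_penalty + skew)
    return instrs + load_stalls + branch_stalls


# Hypothetical workload: 1000 instructions, 200 load-use pairs, 10 mispredicts.
less_skewed = estimated_cycles(1000, 200, 10, load_to_use=3,
                               base_branch_penalty=10, skew=2)
more_skewed = estimated_cycles(1000, 200, 10, load_to_use=3,
                               base_branch_penalty=10, skew=3)
print(less_skewed, more_skewed)
```

With this invented mix the extra cycle of skew wins because load-use pairs are far more common than mispredicts; a branch-heavy workload would tip the balance the other way.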
__________________
I'm pink, therefore I'm spam |
|
|
|
|
|
#12 |
|
Tiled
Join Date: Oct 2003
Location: Kings Langley, UK
Posts: 2,675
|
I can't speak for QCT or ARM, but we provide licensed 'reference' drivers for a given core and OS which the licensee is encouraged to use in their products, and we push really hard for a standard of quality, correctness and performance to help sell the IP.
__________________
A major redesign of the core ALU pineapple boomerang fortress. |
|
|
|
|
|
#13 |
|
Member
Join Date: Jan 2010
Posts: 416
|
Interesting A15 info from ARM http://www.arm.com/files/pdf/AT-Expl...Cortex-A15.pdf.
EDIT from Arun: This was in the OMAP5 thread, but architectural discussion on the A15 fits better in this thread. |
|
|
|
|
|
#14 | |
|
Member
Join Date: May 2010
Posts: 463
|
Quote:
|
|
|
|
|
|
|
#15 | |
|
Senior Member
Join Date: Mar 2010
Location: Cleveland, OH
Posts: 1,579
|
Quote:
We know for sure that it's vec4 FP32 and can multi-issue something. These two things are probably not mutually exclusive, so I would presume it doesn't achieve vec4 by dual-issue. This probably means that it can at least do a vec4 FP32 FMADD and a 128-bit load, store, or permute in the same cycle. There could be some more redundancy, but going for full symmetry seems like overkill and isn't something you see in other SIMD architectures (afaik).
I also think it'd be very unlikely that you'd see the capability of two independent loads per cycle on NEON when the integer core doesn't support such a thing, and it's not really the typical use case for SIMD (unless they added gather/scatter, which they didn't). |
|
|
|
|
|
|
#16 | |
|
Member
Join Date: May 2010
Posts: 463
|
Quote:
"the twin FP/Neon units are equipped to perform up to eight SIMD multiply/divide operations in parallel—four in each execution unit." I don't know where that information came from; it could just be speculation. But if both pipelines can handle multiply/divide, that's quite a bit of FP muscle. |
|
|
|
|
|
|
#17 |
|
Member
Join Date: Mar 2002
Location: UK
Posts: 570
|
The slides imply 4 way fused FMAD, the divide is in the integer unit.
John. |
|
|
|
|
|
#18 | |
|
Member
Join Date: May 2010
Posts: 463
|
Quote:
OTOH, 8-way F32 multiplies are a bit more difficult. |
|
|
|
|
|
|
#19 | |
|
Senior Member
Join Date: Mar 2010
Location: Cleveland, OH
Posts: 1,579
|
Quote:
NEON can currently issue a streamed FMUL + FADD in a single instruction at nearly double the latency, but of course this doesn't qualify as fused in either performance or precision terms. I wouldn't be very surprised if ARM did move to a fused FMADD implementation using the existing instructions, which makes it a good forward-thinking strategy.
NEON already has reciprocal approximation support. There are two instructions: an initial step that gives a value with 8 bits of precision (I don't know precisely what they use to get these, maybe a LUT) and an iteration step which performs (2 - (op1 * op2)) and should be used in conjunction with an FMUL to perform a Newton-Raphson iteration. The latter is functionally equivalent to an FMADD, so it has the same issue and latency characteristics. So if FMADD latency is improved I imagine the reciprocal step latency would improve too; I can't see any reason not to continue using the same pipelines for it.
On the chance that there are really 2x vec4 FMADD pipelines, you should be able to do reciprocal steps and normal FMADDs in parallel. I wouldn't expect that they'd have two reciprocal initial approximation pipelines. What seems most likely is that there'll be one vec4 FMADD pipeline and one vec4 reciprocal estimation pipeline and you'll be able to run those in parallel.
What I strongly doubt is that we'll be seeing a pipelined full-precision divide instruction that runs 4x in parallel. The current approach is a good way to balance performance and reuse the FMADD pipeline. Fully pipelined divides would take a lot of real estate and probably wouldn't offer much better latency than using the NR steps. |
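For anyone who wants to see the estimate-plus-refinement scheme above in action, here is a scalar Python sketch. `recip_estimate` is a hypothetical stand-in for the initial approximation step: it simply quantises 1/mantissa to 8 fractional bits and is not how the hardware derives its estimate. `recip_step` computes the (2 - op1 * op2) term described in the post, and the multiply in `reciprocal` plays the role of the accompanying FMUL. Positive inputs only.

```python
import math


def recip_estimate(d: float) -> float:
    """Crude estimate of 1/d, good to roughly 8-9 bits (illustrative only)."""
    mantissa, exponent = math.frexp(d)      # d = mantissa * 2**exponent, mantissa in [0.5, 1)
    q = round((1.0 / mantissa) * 256) / 256  # keep 8 fractional bits of 1/mantissa
    return math.ldexp(q, -exponent)


def recip_step(d: float, x: float) -> float:
    """The Newton-Raphson correction factor (2 - d * x)."""
    return 2.0 - d * x


def reciprocal(d: float, iterations: int = 2) -> float:
    """Refine the estimate; each iteration roughly doubles the number of correct bits."""
    x = recip_estimate(d)
    for _ in range(iterations):
        x = x * recip_step(d, x)  # the separate multiply that goes with each step
    return x


for n in range(3):
    approx = reciprocal(3.0, iterations=n)
    print(n, approx, abs(approx * 3.0 - 1.0))
```

If x = (1 - e)/d, then one iteration gives (1 - e*e)/d, which is why two steps on top of an 8-bit estimate are already more than enough for single precision.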
|
|
|
|
|
|
#20 | |||
|
Member
Join Date: May 2010
Posts: 463
|
Quote:
Existing VMLA/VMLS/VNMLA/VNMLS instructions will stay chained. And I would imagine optimization guides will discourage the use of the chained instructions. Quote:
Are there really enough scenarios out there where one pipeline can be saturated with FMAC while the other needs to handle miscellaneous FP? Quote:
|
|||
|
|
|
|
|
#21 |
|
Epsilon plus three
Join Date: Feb 2002
Location: Chania
Posts: 7,768
|
Outstanding work Arun! I think it's slowly high time for a hand-held GPU article...hm?
__________________
People are more violently opposed to fur than leather; because it's easier to harass rich ladies than motorcycle gangs. |
|
|
|
|
|
#22 |
|
Unknown.
Join Date: Aug 2002
Location: UK
Posts: 4,877
|
I moved the A15 NEON architecture discussion here since it's much more on-topic here than in the OMAP5 thread.
The one thing I would point out is that the ARM A15 presentation indicates that one of the challenges on the NEON side is "late accumulator source operand for MAC operations" and the ISA extension slide mentions "Fused MAC [...] New instructions to complement current chained multiply+add". So it's very clear that it still supports the chained multiply+add and it's still implemented with a dedicated MAC FIFO (as mentioned for the Cortex-A9 on Page 12 of this presentation). I think they could probably let that MAC FIFO optionally store a non-rounded result to implement a 'fused MAC' rather than add another multiplier in the same pipeline. It seems by far most likely that there's still only a 4-way multiplier otherwise they'd brag about it; I've read all the public A15 coverage (so that doesn't include the subscription-only MPR article) and I haven't seen anyone else mention symmetric pipelines. On the other hand, even if the fused MAC uses both the MUL and ADD pipelines, maybe the processor can also co-issue ADDs and MULs (unlike the A8) especially as even NEON now supports Out-of-Order Issue (so it's less likely to be wasting a dual-issue opportunity with a pending ADD-part-of-MAC). In theory it's also possible that the MUL pipeline supports ADD but FP-wise can do fused FMACs (+MUL/ADDs) and not non-fused FMACs so you still need the dedicated MAC FIFO. I don't think it's very likely though (and neither are a wide variety of other options, many of which are still far from impossible though).
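To make the chained versus fused distinction concrete, here is a small Python sketch of the precision difference only; it says nothing about how the A15 implements either form. The helper `f32` and the input values are my own choices for illustration. The chained form rounds the product to single precision before the add, while the fused form rounds once at the end. Note the assumption: the fused path is emulated with a double-precision intermediate, which is exact for these particular inputs but is not a general-purpose FMA emulation.

```python
import struct


def f32(x: float) -> float:
    """Round a Python float (binary64) to the nearest binary32 value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


def chained_mac(a: float, b: float, c: float) -> float:
    """Multiply, round to binary32, then add and round again."""
    product = f32(a * b)
    return f32(product + c)


def fused_mac(a: float, b: float, c: float) -> float:
    """Single rounding at the end (double intermediate is exact for the inputs below)."""
    return f32(a * b + c)


# Inputs chosen so the exact product 1 + 2**-11 + 2**-24 is not representable
# in binary32 and loses its lowest bit when rounded.
a = b = f32(1.0 + 2.0 ** -12)
c = f32(-(1.0 + 2.0 ** -11))

print("chained:", chained_mac(a, b, c))
print("fused:  ", fused_mac(a, b, c))
```

The chained result cancels to zero because the low bit of the product was rounded away before the add, while the fused result keeps it.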
|
|
|
|
|
#23 | |||
|
Member
Join Date: May 2010
Posts: 463
|
Quote:
For chained MAC, the extra accumulate leg can simply be tied to 0 and the result written to a MAC FIFO (or forwarded to a separate ADD instruction). Quote:
Quote:
There are also other miscellaneous NEON operations (logicals, shift, recpe, etc.) that pretty much require a separate pipeline. I wonder if the issue logic is capable of sharing a subpipe. For instance, issue0 could issue to FMAC or Vlogical, and issue1 could issue to Vlogical or Vld. |
|||
|
|
|
|
|
#24 |
|
Unknown.
Join Date: Aug 2002
Location: UK
Posts: 4,877
|
Gah yes, I'm being stupid, I meant another addition unit in the same pipeline. Also good point about a fused MAC being cheaper to implement than a chained one.
So here's one possibility: Pipe1 does MUL/ADD/Integer-MAC/Fused-FMAC, Pipe2 does ADD/ALU, and Pipe3 does LD/ST/Misc - and you can issue to two of these pipelines per cycle. Do you think that makes sense? It's far from the only possibility and I still think it's more likely they're reusing Pipe2's ADD for all MACs including Fused-FMACs if that's possible, but I want to make sure we're on the same page.
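To check we mean the same thing, the hypothetical arrangement above can be written down as a tiny Python model. The pipe contents are exactly the speculative ones from this post and the op names are just labels I picked; nothing here is confirmed A15 behaviour. Two ops can co-issue if they can be assigned to two different pipes.

```python
from itertools import permutations

# Speculative pipe capabilities from the post above (not confirmed hardware).
PIPES = {
    "pipe1": {"MUL", "ADD", "IMAC", "FMAC"},
    "pipe2": {"ADD", "ALU"},
    "pipe3": {"LD", "ST", "MISC"},
}


def can_co_issue(op_a: str, op_b: str) -> bool:
    """True if op_a and op_b can go down two distinct pipes in the same cycle."""
    return any(op_a in PIPES[p] and op_b in PIPES[q]
               for p, q in permutations(PIPES, 2))


for pair in [("MUL", "ADD"), ("ADD", "ADD"), ("FMAC", "LD"),
             ("MUL", "FMAC"), ("LD", "ST")]:
    print(pair, can_co_issue(*pair))
```

Under this layout MUL+ADD, ADD+ADD and FMAC+LD pair up, while MUL+FMAC and LD+ST conflict because each pair needs the same pipe.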
|
|
|
|
|
#25 | ||
|
Member
Join Date: May 2010
Posts: 463
|
Quote:
You just need a 3-input adder as opposed to a 2-input adder. When forming the Wallace tree, this just means the use of 4-2 compressors instead of 3-2 compressors. And the circuit design of a 4-2 is actually simpler and faster than a 3-2 due to the ability to simply decode the input into one-hot pass-gate enables. Quote:
The reason this can't be done with chained MAC is because the multiply result needs to be rounded/saturated and normalized. So a separate adder is needed at the end of the multiply operation. If there is to be a dedicated adder separate from the multiply tree, then I suppose having it in a separate pipe makes sense. So the scenario you listed above is perfectly feasible. But if there are 3 subpipes, keeping the LD/ST exclusively LD/ST would probably be a better option as that'd cut down significantly on the pipeline registers, controls and routing necessary. The slides do seem to imply, however, that there are only 2 subpipes... Another option is to split the chained MAC instruction into two separate MUL/ADD instructions and send them down the pipeline separately. This allows the reuse of both the multiply with accumulate leg disabled as well as the adder used in the partial product tree. Additionally, ADD micro-op could be staggered along a separate pipeline such that the main arithmetic pipe can still perform other operations during the second pass. |
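For readers who haven't met compressors before, here is a word-level Python sketch of the arithmetic behind the 3-2 and 4-2 compressors mentioned in this post, and of how the accumulate operand can ride along as one more row in the partial product reduction. It is a simplification under my own assumptions: unsigned integers only, a sequential reduction order rather than a real Wallace tree layout, and a 4-2 built from two 3-2 stages, so it says nothing about circuit speed or the pass-gate design.

```python
def compress_3_2(a: int, b: int, c: int) -> tuple[int, int]:
    """Carry-save add: three rows in, two rows out, same total."""
    sum_row = a ^ b ^ c
    carry_row = ((a & b) | (a & c) | (b & c)) << 1
    return sum_row, carry_row


def compress_4_2(a: int, b: int, c: int, d: int) -> tuple[int, int]:
    """Four rows in, two rows out (modelled here as two chained 3-2 stages)."""
    s, carry = compress_3_2(a, b, c)
    return compress_3_2(s, carry, d)


def partial_products(a: int, b: int) -> list[int]:
    """One shifted copy of a for every set bit of b."""
    return [a << i for i in range(b.bit_length()) if (b >> i) & 1]


def fused_integer_mac(a: int, b: int, c: int) -> int:
    """a * b + c with the accumulator fed in as an extra partial product row."""
    rows = partial_products(a, b) + [c]
    while len(rows) > 2:
        x, y, z = rows.pop(), rows.pop(), rows.pop()
        rows.extend(compress_3_2(x, y, z))
    return sum(rows)  # the single carry-propagate add at the end


print(compress_3_2(5, 6, 7), compress_4_2(5, 6, 7, 9))
print(fused_integer_mac(13, 11, 7))
```

The accumulator costs only one extra row in the reduction, and only one carry-propagate add is needed at the very end.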
||
|
|
|
|