[Article] Handheld CPUs: Past, Present & Future

Outstanding work Arun! I think it's just about high time for a hand-held GPU article... hm? ;)
 
I moved the A15 NEON architecture discussion here since it's much more on-topic here than in the OMAP5 thread.

The one thing I would point out is that the ARM A15 presentation indicates that one of the challenges on the NEON side is "late accumulator source operand for MAC operations" and the ISA extension slide mentions "Fused MAC [...] New instructions to complement current chained multiply+add". So it's very clear that it still supports the chained multiply+add and it's still implemented with a dedicated MAC FIFO (as mentioned for the Cortex-A9 on Page 12 of this presentation).

I think they could probably let that MAC FIFO optionally store a non-rounded result to implement a 'fused MAC' rather than add another multiplier in the same pipeline. It seems by far most likely that there's still only a 4-way multiplier otherwise they'd brag about it; I've read all the public A15 coverage (so that doesn't include the subscription-only MPR article) and I haven't seen anyone else mention symmetric pipelines. On the other hand, even if the fused MAC uses both the MUL and ADD pipelines, maybe the processor can also co-issue ADDs and MULs (unlike the A8) especially as even NEON now supports Out-of-Order Issue (so it's less likely to be wasting a dual-issue opportunity with a pending ADD-part-of-MAC).

In theory it's also possible that the MUL pipeline supports ADD but FP-wise can do fused FMACs (+MUL/ADDs) and not non-fused FMACs so you still need the dedicated MAC FIFO. I don't think it's very likely though (and neither are a wide variety of other options, many of which are still far from impossible though).
 
I think they could probably let that MAC FIFO optionally store a non-rounded result to implement a 'fused MAC' rather than add another multiplier in the same pipeline.

It's not necessary to add a separate multiplier for fused MAC. But you do need to expand your current multiplier's partial product tree to include another accumulate leg. That's one of the advantages of fused MAC over chained; it's actually easier and faster to implement.

For chained MAC, the extra accumulate leg can simply be tied to 0 and the result written to a MAC FIFO (or forwarded to a separate ADD instruction).

It seems by far most likely that there's still only a 4-way multiplier otherwise they'd brag about it; I've read all the public A15 coverage (so that doesn't include the subscription-only MPR article) and I haven't seen anyone else mention symmetric pipelines.

It would seem improbable to me as well. But I wouldn't rule out the second NEON pipeline being able to handle various arithmetic operations that can be issued in parallel with the "main" arithmetic pipeline.

On the other hand, even if the fused MAC uses both the MUL and ADD pipelines, maybe the processor can also co-issue ADDs and MULs (unlike the A8) especially as even NEON now supports Out-of-Order Issue (so it's less likely to be wasting a dual-issue opportunity with a pending ADD-part-of-MAC).

That's what I think is more likely. A multiply requires 2x the width of the datapath to form the partial product tree, which is basically a giant adder (reduced through Booth encoding and compressors). This means you naturally have a 256-bit ADD datapath already. You'd need more complex control logic as well as muxes to control the different modes (adding to timing problems), but it's possible to have enough of a datapath for 8x FP32/INT32 ADDs without adding extra width to the datapath.
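As a sanity check on the width claim (just the standard product-width bound, nothing A15-specific), an N-bit by N-bit multiply can need up to 2N result bits, which is where the 256-bit figure for a 128-bit vector comes from:

```python
n = 128
max_product = (2**n - 1) ** 2    # largest unsigned n x n product
print(max_product.bit_length())  # 256: the partial product tree spans 2n bits
```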

There are also other miscellaneous NEON operations (logicals, shifts, recpe, etc.) that pretty much require a separate pipeline. I wonder if the issue logic is capable of sharing a subpipe - for instance, issue0 could issue to FMAC or Vlogical, and issue1 to Vlogical or Vld.
 
It's not necessary to add a separate multiplier for fused MAC.
Gah yes, I'm being stupid, I meant another addition unit in the same pipeline. Also good point about a fused MAC being cheaper to implement than a chained one.

So here's one possibility: Pipe1 does MUL/ADD/Integer-MAC/Fused-FMAC, Pipe2 does ADD/ALU, and Pipe3 does LD/ST/Misc - and you can issue to two of these pipelines per cycle. Do you think that makes sense? It's far from the only possibility and I still think it's more likely they're reusing Pipe2's ADD for all MACs including Fused-FMACs if that's possible, but I want to make sure we're on the same page.
 
Gah yes, I'm being stupid, I meant another addition unit in the same pipeline. Also good point about a fused MAC being cheaper to implement than a chained one.

Even that's not really needed :)

You just need a 3-input adder as opposed to a 2-input adder. When forming the Wallace tree, this just means using 4-2 compressors instead of 3-2 compressors. And the circuit design of a 4-2 is actually simpler and faster than a 3-2 due to the ability to simply decode the input into one-hot, pass-gate enables.
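To make the compressor idea concrete, here's a minimal software sketch (purely illustrative - real hardware does this in parallel gate logic) of a 3:2 carry-save compressor, the building block being discussed. It reduces three addends to two without propagating carries, which is why feeding one more leg into the tree is cheap:

```python
def csa(a: int, b: int, c: int):
    """3:2 carry-save compressor: reduce three addends to two.

    Each bit position acts as an independent full adder; no carry
    ripples across bits, which is what makes stacking these in a
    Wallace tree fast.
    """
    s = a ^ b ^ c                               # bitwise sum
    carry = ((a & b) | (a & c) | (b & c)) << 1  # majority bits, shifted up
    return s, carry

# Folding an accumulator into a multiplier's partial-product tree just
# means feeding it in as one more input to a compressor level:
x, y, acc = 1234, 5678, 42
s, cy = csa(x, y, acc)
print(s + cy == x + y + acc)  # True: one final carry-propagate add finishes
```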

So here's one possibility: Pipe1 does MUL/ADD/Integer-MAC/Fused-FMAC, Pipe2 does ADD/ALU, and Pipe3 does LD/ST/Misc - and you can issue to two of these pipelines per cycle. Do you think that makes sense? It's far from the only possibility and I still think it's more likely they're reusing Pipe2's ADD for all MACs including Fused-FMACs if that's possible, but I want to make sure we're on the same page.

Like I said, fused MAC really doesn't lend itself to reusing a separate adder. Essentially, a fused MAC is a single-pass operation where the addition is built into the multiply circuit's partial product tree.

The reason this can't be done with chained MAC is because the multiply result needs to be rounded/saturated and normalized. So a separate adder is needed at the end of the multiply operation.
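The numerical consequence of that intermediate rounding is easy to demonstrate in software (a sketch of my own, emulating FP32 by rounding a double through a 4-byte round trip; the operand values are chosen purely to expose the rounding):

```python
import struct

def round_f32(x: float) -> float:
    """One rounding step to IEEE FP32 (via a 4-byte round trip)."""
    return struct.unpack('f', struct.pack('f', x))[0]

a = round_f32(1.0 + 2.0**-12)  # exactly representable in FP32
c = -1.0

# Chained MAC: the product is rounded to FP32, *then* the add is rounded.
chained = round_f32(round_f32(a * a) + c)

# Fused MAC: the full-precision product feeds the adder; one final rounding.
# (a * a is exact in Python's double here, standing in for the wide tree.)
fused = round_f32(a * a + c)

print(chained == fused)  # False: the chained path lost the low product bits
```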

If there is to be a dedicated adder separate from the multiply tree, then I suppose having it in a separate pipe makes sense. So the scenario you listed above is perfectly feasible. But if there are 3 subpipes, keeping the LD/ST exclusively LD/ST would probably be a better option as that'd cut down significantly on the pipeline registers, controls and routing necessary.

The slides do seem to imply, however, that there are only 2 subpipes...

Another option is to split the chained MAC instruction into two separate MUL/ADD instructions and send them down the pipeline separately. This allows the reuse of both the multiply with accumulate leg disabled as well as the adder used in the partial product tree.

Additionally, the ADD micro-op could be staggered along a separate pipeline such that the main arithmetic pipe can still perform other operations during the second pass.
 
You just need a 3-input adder as opposed to a 2-input adder. When forming the Wallace tree, this just means using 4-2 compressors instead of 3-2 compressors. And the circuit design of a 4-2 is actually simpler and faster than a 3-2 due to the ability to simply decode the input into one-hot, pass-gate enables.
Okay, I'll be honest and admit that part definitely went over my head :) Turns out there are disadvantages to writing about hardware as a software-centric person, who could have known, hehe. Don't waste your time trying to explain it in simpler terms though, I think I'll survive fine anyhow, and one day I'll hopefully have the time to look into it in some more depth.

Like I said, fused MAC really doesn't lend itself to reusing a separate adder. Essentially, a fused MAC is a single-pass operation where the addition is built into the multiply circuit's partial product tree.

The reason this can't be done with chained MAC is because the multiply result needs to be rounded/saturated and normalized. So a separate adder is needed at the end of the multiply operation.
Hmm, I think I understand that at least. But even if it's a waste to use a fully separate adder for a Fused MAC, couldn't it still be done in theory if you stored more bits in your MAC FIFO to optionally store the multiplication's result before rounding/saturation and normalisation? Mind you that seems neither practical nor efficient and it makes more sense for ARM to do it as you suggested. Also, is it even possible to reuse the ADD from the Fused-MAC for ADD-only instructions then? If not then I don't understand what NVIDIA does in Fermi, bah. Just making sure we're on the same page again.

If there is to be a dedicated adder separate from the multiply tree, then I suppose having it in a separate pipe makes sense. So the scenario you listed above is perfectly feasible. But if there are 3 subpipes, keeping the LD/ST exclusively LD/ST would probably be a better option as that'd cut down significantly on the pipeline registers, controls and routing necessary.
The reason I partitioned it that way is that the Cortex-A8 does all arithmetic operations on one port and Load/Store/Permute/etc. on the other port, iirc. But agreed, it easily could have changed.

The slides do seem to imply, however, that there are only 2 subpipes...
I don't think it says anything about the number of pipes inside the unit, only that there are 2 issue ports. There's nothing fundamentally wrong with sharing 3 pipelines between 2 ports (e.g. Port 1 can issue to Pipe 1&2, Port 2 can issue to Pipe 2&3). However it is more complicated and I agree it might make sense to essentially unify what I describe as Pipe2&3.

Another option is to split the chained MAC instruction into two separate MUL/ADD instructions and send them down the pipeline separately. This allows the reuse of both the multiply with accumulate leg disabled as well as the adder used in the partial product tree.
I think that's what they must be doing for chained anyway (with a MAC FIFO), otherwise the presentation wouldn't mention "late accumulator source operand for MAC operations" (unless I'm misunderstanding that as well)
 
Okay, I'll be honest and admit that part definitely went over my head :) Turns out there are disadvantages to writing about hardware as a software-centric person, who could have known, hehe. Don't waste your time trying to explain it in simpler terms though, I think I'll survive fine anyhow, and one day I'll hopefully have the time to look into it in some more depth.

Fair enough :) The thing I was trying to explain is that to do fused MAC requires literally about 10% more hardware added on top of regular stand-alone multiply and isn't any slower. In fact, from a timing margins perspective, it's actually faster.

Hmm, I think I understand that at least. But even if it's a waste to use a fully separate adder for a Fused MAC, couldn't it still be done in theory if you stored more bits in your MAC FIFO to optionally store the multiplication's result before rounding/saturation and normalisation? Mind you that seems neither practical nor efficient and it makes more sense for ARM to do it as you suggested.

Well yes, it's theoretically possible. But I can't fathom a reason why. For integer MAC it would be fairly simple, but FP is a different beast altogether.

In order to maintain the precision necessary for IEEE compliance, you'd need the MAC fifo to capture at least 184 bits of the result from the initial multiply (23 bits for each FP32 fraction, doubled to reproduce the dynamic range of the multiply's partial product tree) along with the result of the exponent (before saturation/rounding, so you'd need to pad bits there too).

Then you'd need a second 184-bit adder and second exponent adder to take the results and add them. Only at the end can you truncate, shift, round and saturate.

You'd also have to add the multiply latency on top of the add latency to get total latency of the instruction.

Compare this to adding about 10% more area to a multiplier and same latency for the one-pass solution. You basically lose on all 4 fronts: more power, more latency, more area, more complexity in design.
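For what it's worth, the 184-bit figure falls straight out of the lane arithmetic (assuming four FP32 lanes in a 128-bit vector, 23 fraction bits each, doubled by the multiply as described above):

```python
lanes = 4        # a 128-bit NEON vector holds 4 x FP32
frac_bits = 23   # explicit fraction bits in an FP32 significand
product_bits_per_lane = 2 * frac_bits  # the multiply doubles the fraction width
print(lanes * product_bits_per_lane)   # 184
```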

Also, is it even possible to reuse the ADD from the Fused-MAC for ADD-only instructions then? If not then I don't understand what NVIDIA does in Fermi, bah. Just making sure we're on the same page again.

It's possible and I believe most sane designs do this. And as I pointed out earlier, the fused MAC's partial product tree is double the width of the data type, so you can theoretically do two separate 4xFP32 ADDs with that hardware.

I was accounting for the separate adder for the case of either a chained MAC or the case where two arithmetic instructions could be issued. With a separate adder, a chained MAC wouldn't take up 2 separate slots in the pipeline (which would effectively halve chained MAC throughput). And since the multiply portion needs the entire partial product tree, sharing the adder would mean you couldn't issue an ADD and a Mult/MAC in the same cycle - it would be either 2 ADDs or a Mult/MAC.

The reason I partitioned it that way is that the Cortex-A8 does all arithmetic operations on one port and Load/Store/Permute/etc. on the other port, iirc. But agreed, it easily could have changed.

Oh, my bad. I understood "misc FP" as things like logicals and conversions and such. I absolutely agree that permute would go in there, especially given how ARM's various VLD instructions have permute built in.

I don't think it says anything about the number of pipes inside the unit, only that there are 2 issue ports. There's nothing fundamentally wrong with sharing 3 pipelines between 2 ports (e.g. Port 1 can issue to Pipe 1&2, Port 2 can issue to Pipe 2&3). However it is more complicated and I agree it might make sense to essentially unify what I describe as Pipe2&3.

I think that's what they must be doing for chained anyway (with a MAC FIFO), otherwise the presentation wouldn't mention "late accumulator source operand for MAC operations" (unless I'm misunderstanding that as well)

True. The original topic was simply how the two pipelines (which I had assumed there were) were partitioned in terms of what they performed. But dispelling that initial (I admit, rather unfounded) assumption, the possibilities are:

1. 2 subpipes and 2 issue ports, one subpipe being LD/ST/Perm, and the other being arithmetic.
2. 3 subpipes and 2 issue ports, LD/ST/Perm, MAC/Mult/ADD, ADD/DIV/Misc.
3. 2 subpipes and 2 issue ports, LD/ST/Perm/ADD/DIV/Misc, and MAC/Mult/ADD.
4. Something completely different and creative and, dare I say, magical. But Apple doesn't release white papers so we're out of luck.

In cases 2 and 3, one could either have a separate ADD block in the subpipe without MAC, or do a weird cross-pipe datapath where it reuses half the double-width add circuit in the MAC subpipe. Doing so would mean an ADD couldn't be issued at the same time as a Mult/FMAC, but it would save you a 128-bit adder.
 
[...] to do fused MAC requires literally about 10% more hardware added on top of regular stand-alone multiply and isn't any slower. In fact, from a timing margins perspective, it's actually faster.
[...]
In order to maintain the precision necessary for IEEE compliance, you'd need the MAC fifo to capture at least 184 bits of the result from the initial multiply
[...]
Compare this to adding about 10% more area to a multiplier and same latency for the one-pass solution. You basically lose on all 4 fronts: more power, more latency, more area, more complexity in design.
[...]
It's possible and I believe most sane designs do this.
Okay, that makes things much clearer, thanks! And woah, a 184-bit intermediate result... Yeah, turns out what's theoretically possible and what makes any sense whatsoever aren't always the same thing :)

With a separate adder, a chained MAC wouldn't take up 2 separate slots in the pipeline (which would effectively halve chained MAC throughput).
The way Cortex-A8/A9 does it is you can never co-issue a MUL and ADD, so the ADD pipeline is often free and can be reused for the ADD in a chained MAC. So throughput is usually as good as with a true MAC pipeline - but of course with true dual-issue the trick wouldn't be free as frequently.

True. The original topic was simply how the two pipelines (which I had assumed there were) were partitioned in terms of what they performed. But dispelling that initial (I admit, rather unfounded) assumption, the possibilities are:

1. 2 subpipes and 2 issue ports, one subpipe being LD/ST/Perm, and the other being arithmetic.
2. 3 subpipes and 2 issue ports, LD/ST/Perm, MAC/Mult/ADD, ADD/DIV/Misc.
3. 2 subpipes and 2 issue ports, LD/ST/Perm/ADD/DIV/Misc, and MAC/Mult/ADD.
4. Something completely different and creative and, dare I say, magical. But Apple doesn't release white papers so we're out of luck.

In cases 2 and 3, one could either have a separate ADD block in the subpipe without MAC, or do a weird cross-pipe datapath where it reuses half the double-width add circuit in the MAC subpipe. Doing so would mean an ADD couldn't be issued at the same time as a Mult/FMAC, but it would save you a 128-bit adder.
I think that's a very good summary! And now I find myself hoping Apple is indeed working on a custom processor core ;)

My current guess is very similar to Case 3 with a separate ADD block, but I'm not sure I see much evidence besides the MPR article that NEON will even support division at all (there will still be a separate VFP afaik, and the FP registers are still shared, so a fast FDIV could be enough). I also think the 'Misc' part (i.e. logicals/conversions/etc.) might be sufficiently cheap that it's worth having it on both pipelines just to simplify the scheduling logic.

So another possibility which I think is probably as likely as your first three but not quite magical either:
5. 2 subpipes with 2 issue ports, 1) MAC/MUL/ADD/Misc 2) ADD/Misc + LD/ST/Permute (+separate VFP sharing one or both issue ports?)

This is a very good discussion, although I really should reply to a few other posts too (especially Gubbi's suggestion wrt skewed execution for Qualcomm/Marvell - fwiw at first glance I think that's very possible in Qualcomm's case but not enough to explain Marvell's performance), I'll do that tomorrow.
 
Okay, that makes things much clearer, thanks! And woah, a 184-bit intermediate result... Yeah, turns out what's theoretically possible and what makes any sense whatsoever aren't always the same thing :)

Ya, multiplies are hardware hogs. Compared to FP/Int MACs and MULTs, the other instructions are a blip on the radar as far as power and area go.

The way Cortex-A8/A9 does it is you can never co-issue a MUL and ADD, so the ADD pipeline is often free and can be reused for the ADD in a chained MAC. So throughput is usually as good as with a true MAC pipeline - but of course with true dual-issue the trick wouldn't be free as frequently.

Yup. I believe the way A8/A9 implements it is with a separate adder from the multiply. I would guess this is why ADD latencies are smaller than MULT latencies.

This effectively means a MAC can be truly pipelined, but it also means you're using an extra adder. But I guess since the A8/A9 only implements a 64-bit datapath, that isn't much of a big deal.

I think that's a very good summary! And now I find myself hoping Apple is indeed working on a custom processor core ;)

They didn't hire an ARM micro-architect for no reason :)

My current guess is very similar to Case 3 with a separate ADD block, but I'm not sure I see much evidence besides the MPR article that NEON will even support division at all (there will still be a separate VFP afaik, and the FP registers are still shared, so a fast FDIV could be enough). I also think the 'Misc' part (i.e. logicals/conversions/etc.) might be sufficiently cheap that it's worth having it on both pipelines just to simplify the scheduling logic.

I don't know about that. My objection to a combined LD/ST/Arith pipeline isn't the added complexity of the arith portions but rather how different the structure of a LD/ST pipeline is compared to an arithmetic one.

A LD/ST usually involves a queue (especially for NEON) that aligns data and writes to the RF and also handles read-modify-writes. The first stage really shouldn't involve two 128-bit operand registers at all.

Arithmetic pipelines, on the other hand, are usually pretty regular in structure. There are two 128-bit operand latches in the first stage, followed by computation logic, followed by another stage with two 128-bit operands or a result buffer. It wouldn't make sense in the logic structure to try to combine these two.

It could be that there are 3 subpipes, but only the second issue port can issue to pipes 1 and 2 while the first issue port can only issue to pipe 0.

So another possibility which I think is probably as likely as your first three but not quite magical either:
5. 2 subpipes with 2 issue ports, 1) MAC/MUL/ADD/Misc 2) ADD/Misc + LD/ST/Permute (+separate VFP sharing one or both issue ports?)

Whoops, you beat me to it :)

That is very likely and probably (IMO) close to ideal design-wise. Although I don't know if they're planning on having a separate VFP pipeline anymore. It doesn't take much to adapt a NEON channel to perform VFP operations as well, now that they're both pipelined. And the logic that can be shared would save a whole lot of area.

But I suppose if ARM wants to offer both VFP-only and VFP+NEON versions, it'd make sense to have them separate.

This is a very good discussion, although I really should reply to a few other posts too (especially Gubbi's suggestion wrt skewed execution for Qualcomm/Marvell - fwiw at first glance I think that's very possible in Qualcomm's case but not enough to explain Marvell's performance), I'll do that tomorrow.

Heh, always nice to bounce ideas back and forth. To be honest, this has given me a few ideas about my next design :)
 
Outstanding work Arun! I think it's just about high time for a hand-held GPU article...hm?
Cheers! And as I said, hopefully sometime in March, but then again I hoped this article would be published in December and it depends a lot on when different companies announce at least minimal information on their next-gen architectures. I wouldn't want to reveal anything I shouldn't know by mistake ;) (okay, I'm just teasing! and mostly kidding hehe) - and of course it also depends on how much time I have, and maybe I'd want to publish something on Icera first. We'll see. But I'm sure it will happen (be glad I didn't add "when it's ready" :p)

Also thanks to you and rpg.314 for proofreading, and Exophase for a lot of great feedback on the technical aspects of past/present ARM architectures. Given the discussion about Cortex-A15 NEON here, I'm surprised nobody's talking about the fact Cortex-A8/A9 NEON really is 128-bit for non-MUL integer instructions!

The article doesn't mention anything beyond Intel 32nm; I expect 22nm for Atom would be a tipping point where ARMv7 and x86 reach the same power and performance. Do you think so too? And with Intel having a much more aggressive node shrink cadence than TSMC, shouldn't running a smaller node one year earlier than other manufacturers provide enough benefits/advantages for Atom SoCs?
I think 32nm High-K Medfield vs 40nm SiON SoCs is probably as big and long-lasting a process advantage as Intel will ever get, so that's a constant at best going forward. One interesting dynamic wrt power consumption is that as IPC goes up in the Cortex-A15, power efficiency will also go down a bit, which closes the gap with Intel somewhat. But if Intel wants to be performance-competitive with the A15 (which they are certainly not with the current Atom architecture) they might have to increase IPC too at the cost of a bit of power efficiency, getting us right back to square one. The architectural details and the quality of implementation will probably be the most important points going forward.

iwod said:
For mobile GPUs it is interesting that TI announced their OMAP5 for 2H 2012, and we are still not seeing PowerVR Series 6! It is annoying! The Duke Nukem of hardware?
No, they announced OMAP5 end-products for 2H12 and sampling in 2H11. Did you really expect Series 6 barely two years after the first SGX540 products, despite the existence of SGX543MP? If so, that wasn't very realistic in the first place.

Great read.

Regarding the limited OOO capability of Marvell's PJ4 and Qualcomm's Scorpion: I suspect these use skewed/delayed execution, where ALU ops are delayed a number of cycles in relation to load ops. This reduces the apparent latency of loads at the cost of a higher branch mispredict penalty. Since loads are performed earlier than ALU ops, you can call this out-of-order execution.

[...] The person, posting as Wilco, over at Realworldtech.com's forums has worked on the A8. He mentioned he regretted not skewing the pipes one more cycle to reduce the apparent latency of loads from one cycle to zero. [...] A more skewed pipeline, which results in virtually zero latency on L1 hits, could offer an explanation for why PJ4 and Scorpion perform so well compared to the A8.
Thanks! And that's a very good point, and a strong possibility in Qualcomm's case - I just edited the article to mention it. It wouldn't be 'Out-of-Order Issue', but it does seem to fit very well and might be enough to explain Qualcomm's statements and the 2.1 DMIPS/MHz versus 2.0 DMIPS/MHz (especially as Dhrystone fits completely into L1).

However, it doesn't make sense in the case of the Marvell PJ4, and anyway I don't think it's enough to explain 2.41 DMIPS/MHz even with a significantly shorter pipeline. The problem is that the PJ4 has a 6-9 stage variable-length pipeline, so you can't just put your main ALU stage after your load stages. It just doesn't work - and even if it did, any extra stage would be much harder to justify with such a short pipeline.

With the basic Out-of-Order Issue with instruction window scheme I am proposing, you get the full benefit of skewing with practically none of the disadvantages. It does cost more silicon, but I believe it's worth the cost in the PJ4's case given that they already have a ROB.
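To illustrate what skewing buys, here's a toy in-order timing model (my own illustration, not any vendor's actual pipeline): delaying the ALU's operand-read stage relative to the load stage hides load-use latency on a load-then-dependent-ALU pattern:

```python
def total_cycles(alu_skew: int, n_pairs: int = 4, load_lat: int = 2) -> int:
    """Issue n_pairs of (independent load, dependent ALU op) pairs.

    An ALU op reads its operands alu_skew cycles after its own issue,
    so with enough skew it can issue right behind the load it depends
    on without stalling.
    """
    t = 0  # issue cycle of the next instruction
    for _ in range(n_pairs):
        load_ready = t + load_lat  # cycle the load result is available
        t += 1
        # dependent ALU: its operand read happens alu_skew cycles after issue
        t = max(t, load_ready - alu_skew) + 1
    return t

# With no skew the ALU stalls on each load; skewed, the bubbles vanish.
print(total_cycles(alu_skew=0), total_cycles(alu_skew=2))  # 12 8
```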

That is very likely and probably (IMO) close to ideal design-wise. Although I don't know if they're planning on having a separate VFP pipeline anymore. It doesn't take much to adapt a NEON channel to perform VFP operations as well, now that they're both pipelined. And the logic that can be shared would save a whole lot of area.
:) I think your description is better than mine though (3 subpipes, but only the second issue port can issue to pipes 1 and 2 while the first issue port can only issue to pipe 0). And now that I've thought about it, I also agree when it comes to sharing the VFP. I think the easiest and most logical implementation would put all VFP ALU operations on Pipe 0.

I'm not sure if the A15 does have a VFP-only variant, but either way everyone on the handheld/tablet side (including NVIDIA) is using NEON this time around AFAIK.
 
Just to get the conversation started again: any opinions on heterogeneous cores and SMT? These seem like the most exciting things we are likely to get after the A15.
 
Just to get the conversation started again: any opinions on heterogeneous cores and SMT? These seem like the most exciting things we are likely to get after the A15.
SMT has often been the classic case of good intentions paving the road to $ hell. In this regard, the improvements suggested in the article* are paramount to SMT getting a strong foothold in the handheld sector. Cache partitioning by thread, in particular, sounds like a natural evolution of the SMT concept and, barring any prohibitive implementation complexities, would be nice to have, IMHO.

* Great read, BTW! I read it on publication day and then forgot to comment (but recommended it to peers nevertheless).
 
Just to get the conversation started again: any opinions on heterogeneous cores and SMT? These seem like the most exciting things we are likely to get after the A15.

Considering the 2.5GHz quad core Qualcomm just announced, I think it is clear that heterogeneous cores are upon us, even if that chip does not have an A5/A15 kind of split personality.
 
Arun, sorry to bring an old thread out of the grave, but I liked this old article, so I was wondering whether there is an update planned for it soon? Krait is out now and we are on the verge of finally seeing powerful Cortex-A15 devices take the lead, so it looks like the next phase of the mobile race is just about to start!
 