I think they could probably let that MAC FIFO optionally store a non-rounded result to implement a 'fused MAC' rather than add another multiplier in the same pipeline.
It seems by far most likely that there's still only a 4-way multiplier; otherwise they'd brag about it. I've read all the public A15 coverage (which doesn't include the subscription-only MPR article) and I haven't seen anyone else mention symmetric pipelines.
On the other hand, even if the fused MAC uses both the MUL and ADD pipelines, maybe the processor can also co-issue ADDs and MULs (unlike the A8), especially as even NEON now supports out-of-order issue (so it's less likely to be wasting a dual-issue opportunity on a pending ADD-part-of-MAC).
> It's not necessary to add a separate multiplier for fused MAC.

Gah yes, I'm being stupid - I meant another addition unit in the same pipeline. Also good point about a fused MAC being cheaper to implement than a chained one.
So here's one possibility: Pipe1 does MUL/ADD/Integer-MAC/Fused-FMAC, Pipe2 does ADD/ALU, and Pipe3 does LD/ST/Misc - and you can issue to two of these pipelines per cycle. Do you think that makes sense? It's far from the only possibility, and I still think it's more likely they're reusing Pipe2's ADD for all MACs (including Fused-FMACs) if that's possible, but I want to make sure we're on the same page.
> You just need a 3-input adder as opposed to a 2-input adder. When forming the Wallace tree, this just means using 4-2 compressors instead of 3-2 compressors. And the circuit design of a 4-2 is actually simpler and faster than a 3-2, due to the ability to simply decode the input into one-hot pass-gate enables.

Okay, I'll be honest and admit that part definitely went over my head. Turns out there are disadvantages to writing about hardware as a software-centric person - who could have known, hehe. Don't waste your time trying to explain it in simpler terms though; I think I'll survive fine anyhow, and one day I'll hopefully have the time to look into it in some more depth.
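For readers in the same software-centric boat: the compressor arithmetic above can be modelled in a few lines of Python. This is a bit-parallel sketch of the math, not the gate-level design (a real 4:2 cell is a single optimised structure, not literally two chained full adders), but it shows why adding the accumulator into the partial-product tree is just one more input row:

```python
def csa_3to2(a, b, c):
    # 3:2 compressor (carry-save adder): three addends in, two out,
    # with the total preserved: a + b + c == s + t.
    s = a ^ b ^ c                            # bitwise sums, no carry propagation
    t = ((a & b) | (b & c) | (a & c)) << 1   # carries, shifted into position
    return s, t

def csa_4to2(a, b, c, d):
    # 4:2 compressor modelled as two 3:2 stages: four addends reduced
    # to two, with a + b + c + d == s + t.
    s1, t1 = csa_3to2(a, b, c)
    return csa_3to2(s1, t1, d)

# A fused MAC feeds the accumulator in as one more row of the tree;
# only the final two outputs ever see a carry-propagating adder.
s, t = csa_4to2(123456, 789, 4242, 99)
assert s + t == 123456 + 789 + 4242 + 99
```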
> Like I said, fused MAC really doesn't lend itself to reusing a separate adder. Essentially, a fused MAC is a single-pass operation where the addition is built into the multiply circuit's partial product tree.
>
> The reason this can't be done with a chained MAC is that the multiply result needs to be rounded/saturated and normalized, so a separate adder is needed at the end of the multiply operation.

Hmm, I think I understand that at least. But even if it's a waste to use a fully separate adder for a fused MAC, couldn't it still be done in theory if you stored more bits in your MAC FIFO, to optionally keep the multiplication's result before rounding/saturation and normalisation? Mind you, that seems neither practical nor efficient, and it makes more sense for ARM to do it as you suggested. Also, is it even possible to reuse the ADD from the fused MAC for ADD-only instructions then? If not, then I don't understand what NVIDIA does in Fermi, bah. Just making sure we're on the same page again.
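The rounding distinction above can be demonstrated numerically. In this sketch Python's double precision stands in for the unrounded intermediate product (it happens to have enough bits to hold this particular product exactly), and `struct` is used to round to binary32 the way a single-precision unit would:

```python
import struct

def f32(x):
    # Round a Python float (double) to IEEE-754 binary32, like a
    # single-precision FP unit rounding its result.
    return struct.unpack('f', struct.pack('f', x))[0]

# Inputs chosen so the product needs more significand bits than binary32 keeps.
a = b = 1.0 + 2.0**-23          # exactly representable in binary32
c = -(1.0 + 2.0**-22)           # exactly representable in binary32

# Chained MAC: the multiply result is rounded/normalised first, then added.
chained = f32(f32(a * b) + c)   # rounding discards the 2**-46 term -> 0.0

# Fused MAC: the unrounded product feeds the add; only one final rounding.
fused = f32(a * b + c)          # the 2**-46 term survives
```

Here `chained` comes out as exactly `0.0`, while `fused` keeps the residual `2**-46` - the bits a chained datapath throws away at the mid-pipeline rounding step.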
> If there is to be a dedicated adder separate from the multiply tree, then I suppose having it in a separate pipe makes sense. So the scenario you listed above is perfectly feasible. But if there are 3 subpipes, keeping the LD/ST pipe exclusively LD/ST would probably be a better option, as that'd cut down significantly on the pipeline registers, control and routing necessary.

The reason I partitioned it that way is that the Cortex-A8 does all arithmetic operations on one port and Load/Store/Permute/etc. on the other port, iirc. But agreed, it easily could have changed as well.
> The slides do seem to imply, however, that there are only 2 subpipes...

I don't think it says anything about the number of pipes inside the unit, only that there are 2 issue ports. There's nothing fundamentally wrong with sharing 3 pipelines between 2 ports (e.g. Port 1 can issue to Pipes 1 & 2, Port 2 can issue to Pipes 2 & 3). However, it is more complicated, and I agree it might make sense to essentially unify what I describe as Pipes 2 & 3.
> Another option is to split the chained MAC instruction into two separate MUL/ADD instructions and send them down the pipeline separately. This allows the reuse of both the multiplier (with the accumulate leg disabled) and the adder used in the partial product tree.

I think that's what they must be doing for chained MACs anyway (with a MAC FIFO), otherwise the presentation wouldn't mention "late accumulator source operand for MAC operations" (unless I'm misunderstanding that as well).
> [...] to do fused MAC requires literally about 10% more hardware added on top of a regular stand-alone multiply, and isn't any slower. In fact, from a timing-margins perspective, it's actually faster.
>
> [...]
>
> In order to maintain the precision necessary for IEEE compliance, you'd need the MAC FIFO to capture at least 184 bits of the result from the initial multiply.
>
> [...]
>
> Compare this to adding about 10% more area to a multiplier and the same latency for the one-pass solution. You basically lose on all 4 fronts: more power, more latency, more area, more complexity in design.
>
> [...]
>
> It's possible and I believe most sane designs do this.

Okay, that makes things much clearer, thanks! And woah, a 184-bit intermediate result... Yeah, turns out what's theoretically possible and what makes any sense whatsoever aren't always the same thing.
> Compared to having a separate adder, though, a chained MAC would now take up 2 separate slots in the pipeline -- effectively halving chained MAC throughput.

The way the Cortex-A8/A9 does it, you can never co-issue a MUL and an ADD, so the ADD pipeline is often free and can be reused for the ADD in a chained MAC. So throughput is usually as good as with a true MAC pipeline - but of course with true dual-issue the trick wouldn't be free as frequently.
> True. The original topic was simply how the two pipelines (I had assumed there were two) were partitioned in terms of what they performed. But dispelling that initial (I admit, rather unfounded) assumption, the possibilities are:
>
> 1. 2 subpipes and 2 issue ports, one subpipe being LD/ST/Perm, and the other being arithmetic.
> 2. 3 subpipes and 2 issue ports: LD/ST/Perm, MAC/Mult/ADD, ADD/DIV/Misc.
> 3. 2 subpipes and 2 issue ports: LD/ST/Perm/ADD/DIV/Misc, MAC/Mult/ADD.
> 4. Something completely different and creative and, dare I say, magical. But Apple doesn't release white papers, so we're out of luck.
>
> In cases 2 and 3, one could either have a separate ADD block in the subpipe without MAC, or do a weird cross-pipe datapath that reuses half the double-width add circuit in the MAC subpipe. Doing so would mean an ADD couldn't be issued at the same time as a Mult/FMAC, but it would save you a 128-bit adder.

I think that's a very good summary! And now I find myself hoping Apple is indeed working on a custom processor core.
My current guess is very similar to Case 3 with a separate ADD block, but I'm not sure I see much evidence besides the MPR article that NEON will even support division at all (there will still be a separate VFP afaik, and the FP registers are still shared, so a fast FDIV could be enough). I also think the 'Misc' part (i.e. logicals/conversions/etc.) might be sufficiently cheap that it's worth having it on both pipelines just to simplify the scheduling logic.
So another possibility which I think is probably as likely as your first three but not quite magical either:
5. 2 subpipes with 2 issue ports: 1) MAC/MUL/ADD/Misc, 2) ADD/Misc + LD/ST/Permute (+ a separate VFP sharing one or both issue ports?)
This is a very good discussion, although I really should reply to a few other posts too (especially Gubbi's suggestion wrt skewed execution for Qualcomm/Marvell - fwiw, at first glance I think that's very possible in Qualcomm's case, but not enough to explain Marvell's performance). I'll do that tomorrow.
> Outstanding work Arun! I think it's slowly high time for a hand-held GPU article... hm?

Cheers! And as I said, hopefully sometime in March - but then again, I hoped this article would be published in December, and it depends a lot on when different companies announce at least minimal information on their next-gen architectures. I wouldn't want to reveal anything I shouldn't know by mistake (okay, I'm just teasing! and mostly kidding, hehe) - and of course it also depends on how much time I have, and maybe I'd want to publish something on Icera first. We'll see. But I'm sure it will happen (be glad I didn't add "when it's ready").
> The article doesn't mention anything beyond Intel 32nm. I expect 22nm for Atom would be a tipping point where ARMv7 and x86 reach the same power and performance. Do you think so too? And with Intel having a much more aggressive node-shrink schedule than TSMC, shouldn't running a smaller node one year earlier than other manufacturers provide enough benefits/advantages to Atom SoCs?

I think 32nm High-K Medfield vs 40nm SiON SoCs is probably as big and long a process advantage as Intel will ever get, so that's a constant at best going forward. One interesting dynamic wrt power consumption is that as IPC goes up in the Cortex-A15, power efficiency will also go down a bit, so that closes the gap with Intel somewhat. But if Intel wants to be performance-competitive with the A15 (which they are certainly not with the current Atom architecture), they might have to increase IPC too, at the cost of a bit of power efficiency, thus getting us right back to square one. The architectural details and the quality of implementation will probably be the most important points going forward.
iwod said:
> For mobile GPUs, it is interesting that TI announced their OMAP 5 for 2H 2012 and we are still not seeing PowerVR Series 6! It is annoying! Duke Nukem of hardware?

No, they announced OMAP5 for end-products in 2H12 and sampling in 2H11. Did you really expect Series 6 barely two years after the first SGX540 products, despite the existence of the SGX543MP? If so, that wasn't very realistic in the first place.
> Great read.
>
> Regarding the limited OOO capability of Marvell's PJ4 and Qualcomm's Scorpion: I suspect these use skewed/delayed execution, where ALU ops are delayed a number of cycles relative to load ops. This reduces the apparent latency of loads at the cost of a higher branch-mispredict penalty. Since loads are performed earlier than ALU ops, you can call this out-of-order execution.
>
> [...] The person posting as Wilco over at Realworldtech.com's forums has worked on the A8. He mentioned he regretted not skewing the pipes one more cycle to reduce the apparent latency of loads from one cycle to zero. [...] A more skewed pipeline, which results in virtually zero latency for L1 hits, could offer an explanation for why the PJ4 and Scorpion perform so well compared to the A8.

Thanks! And that's a very good point, and a strong possibility in Qualcomm's case - I just edited the article to mention it. It wouldn't be 'out-of-order issue', but it does seem to fit very well, and might be enough to explain Qualcomm's statements and the 2.1 DMIPS/MHz versus 2.0 DMIPS/MHz (especially as Dhrystone fits completely into L1).
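The skewed-execution trade-off described above fits in two one-liners. This is a toy model, not any real core's pipeline: `load_latency` is the load-use latency seen without skew, and `base` is a hypothetical mispredict penalty before skewing.

```python
def load_use_stall(load_latency, skew):
    # Visible stall cycles between a load and a dependent ALU op when
    # the ALU execute stage is placed `skew` cycles later in the pipe.
    return max(0, load_latency - skew)

def mispredict_penalty(base, skew):
    # The cost of the trick: branches resolving in the skewed ALU stage
    # flush a correspondingly longer pipeline.
    return base + skew
```

With a 1-cycle L1 load-use latency, one extra cycle of skew takes the apparent latency to zero - `load_use_stall(1, 1) == 0` - which is exactly the extra cycle of skew Wilco is quoted as regretting not adding to the A8.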
> That is very likely and probably (IMO) close to ideal design-wise. Although I don't know if they're planning on having a separate VFP pipeline anymore. It doesn't take much to adapt a NEON channel to perform VFP operations as well, now that they're both pipelined. And the logic that can be shared would save a whole lot of area.

I think your description is better than mine though (3 subpipes, but only the second issue port can issue to pipes 1 and 2, while the first issue port can only issue to pipe 0). And now that I've thought about it, I also agree when it comes to sharing the VFP. I think the easiest and most logical implementation would put all VFP ALU operations on pipe 0.
> Just to get the conversation started again: any opinions on heterogeneous cores and SMT? These seem like the most exciting things we are likely to get after the A15.

SMT has often been the classic case of good intentions paving the road to hell. In this regard, the improvements suggested in the article are paramount to SMT getting a strong foothold in the handheld sector. Cache partitioning by threads, in particular, sounds like a natural evolution of the SMT concept and, barring any prohibitive implementation complexities, would be nice to have, IMHO.