Old 07-Feb-2011, 13:24   #1
Arun
Unknown.
 
Join Date: Aug 2002
Location: UK
Posts: 4,877
Default [Article] Handheld CPUs: Past, Present & Future

The article is finally up - enjoy! And don't be frightened by its size, it's not as if anyone would notice if you skipped a paragraph or two. Just don't forget to leave feedback here!

Handheld CPUs: Past, Present & Future
Old 07-Feb-2011, 17:46   #2
Florin
Merrily dodgy
 
Join Date: Aug 2003
Location: The colonies
Posts: 1,398
Default

Ah sweet, let's have a look see
__________________
"A man generally has two reasons for doing a thing. One that sounds good, and a real one." - J.P. Morgan
Old 07-Feb-2011, 17:51   #3
Tahir2
Itchy
 
Join Date: Feb 2002
Location: United Queendom
Posts: 2,858
Default

Thanks, a very nice, easy and informative read.

Perhaps the article could have mentioned the PSP2 and Nintendo 3DS? Also NVIDIA is becoming a player with aspirations of hitting the smartphone market with its Tegra2 and finally I am sure AMD is positioning Bobcat for shrinks to counter not only Atom but ARM.
__________________
Time is an illusion. Lunchtime doubly so - Douglas Adams
Old 07-Feb-2011, 18:22   #4
Wishmaster
Member
 
Join Date: Nov 2008
Location: Warsaw, Poland
Posts: 219
Default

Very nice article.
Not too complicated, so even people who aren't fluent in English shouldn't have problems reading it, and yet very informative: it gives us the information needed to understand the major differences between different CPUs and what to expect next. Good job Arun!

Can we expect something similar about mobile GPUs? If so, when?
Old 07-Feb-2011, 18:37   #5
DavidC
Member
 
Join Date: Sep 2006
Posts: 273
Default

Oh I think I found one more favorite author.

EDIT:
Quote:
First of all, Intel's 45nm process is not very dense compared to TSMC's 40nm process (no immersion lithography and more restrictive design rules - it's very different for Intel 32nm vs TSMC 28nm as TSMC introduced restrictive design rules while Intel uses dual patterning lithography and TSMC apparently doesn't
Wait, TSMC doesn't use double patterning? Really? What about their IEDM presentation? Intel has actually used DP since 65nm, it's not like 32nm changes that.

There are also TDP figures for the Lincroft chip in Moorestown and Oak Trail.

Oak Trail: 3W @ 1.6GHz
Moorestown: 1.3W @ 900MHz/2.2W @ 1.5GHz

Last edited by DavidC; 07-Feb-2011 at 18:52.
Old 07-Feb-2011, 21:38   #6
convergedw
Junior Member
 
Join Date: Sep 2008
Posts: 19
Default

Quote:
And finally, the next-generation 28nm MSM8960 with LTE support and a completely new CPU architecture is expected to sample in early 2011
I had thought that Qualcomm had simply stated that this chip was going to sample "in 2011", which usually means it will be late in the year. Have you heard something different about the timeline for sampling?
Old 07-Feb-2011, 22:44   #7
Arun
Unknown.
 
Join Date: Aug 2002
Location: UK
Posts: 4,877
Default

Thanks for the kind words. On when you can expect something similar for mobile GPUs: mid-March if you're lucky, but then again I was hoping for early/mid-December for this article so that doesn't mean much. I also remember I promised an article about Icera 'early next week' nearly 2 years ago(!) so maybe I'll try to finally deliver on that first!

I think when I decide to write anything about mobile GPUs also depends on the announcement times for the next-generation architectures (the CUDA-capable GPU in Tegra3, IMG's PowerVR Series 6, Qualcomm's next-gen, etc.) - I don't need to be able to do in-depth analysis of any of them, but at least some public info and quick intelligent speculation based on that would help. So we'll see what happens there.

Quote:
Originally Posted by Tahir2
Perhaps the article could have mentioned the PSP2 and Nintendo 3DS? Also NVIDIA is becoming a player with aspirations of hitting the smartphone market with its Tegra2 and finally I am sure AMD is positioning Bobcat for shrinks to counter not only Atom but ARM.
I was thinking of mentioning the PSP2, but it wasn't official when I wrote 99% of the article, and I forgot to make a last-minute addition. So I suppose that's what this thread is for: do you think the PSP2 being quad-core will help legitimise quad-core as an advantage in handhelds as well? As for NVIDIA, obviously except for Project Denver (which I mention on Page 2 and 4), they're using the same ARM11/Cortex-A9/Cortex-A15 cores that are described on Page 1.

Quote:
Originally Posted by DavidC
Wait, TSMC doesn't use double patterning? Really? What about their IEDM presentation? Intel has actually used DP since 65nm, it's not like 32nm changes that.
Little-known fact, I know! Mind you, it's not the only one in the article. Here's the source for my original claim:
Quote:
Originally Posted by http://www.edn.com/blog/Practical_Chip_Design/37542-No_TSMC_28_nm_is_not_late_they_are_on_the_record.php
Another aspect of conservatism is that TSMC has stayed with single-pattern lithography even at the most critical layers. They are using the latest ASML 1950i immersion steppers on some of these layers, however.
This article dates from August 2009 and TSMC could have changed plans since then, but I believe they haven't because it's really quite expensive (GlobalFoundries once singled it out as one of their most expensive additions on 28nm, iirc) and TSMC has emphasised in their financial CCs that 28nm was barely more capital intensive than 40nm. So yes, I'm pretty confident they are indeed reserving double immersion for 20nm.

Quote:
Originally Posted by DavidC
There are also TDP figures for the Lincroft chip in Moorestown and Oak Trail.
Yes, unfortunately that's for the full chip including multimedia and other things, so it's even less comparable than the numbers for the original Atom chip. I think the synthesis is basically identical anyway, so I doubt the numbers would be noticeably different.

Quote:
Originally Posted by convergedw
I had thought that Qualcomm had simply stated that this chip was going to sample "in 2011", which usually means it will be late in the year. Have you heard something different about the timeline for sampling?
Hmm, FWIW, some time ago I had heard 'first half' for Qualcomm's first 28nm chip sampling, and I can find several websites that mention 'early 2011' so I had assumed they were just repeating what Qualcomm said, but it turns out they might just be speculating so maybe my info is outdated. Personally I still think it makes sense they'd be sampling to lead customers in Q2 2011, so I won't edit the article just yet, but we'll see if Qualcomm says anything about the timing at MWC. Here's hoping they do more than that and we actually get some architectural info!
__________________
Focusing on non-graphics projects in 2013 (but I still love triangles)
"[...]; the kind of variation which ensues depending in most cases in a far higher degree on the nature or constitution of the being, than on the nature of the changed conditions."
Old 07-Feb-2011, 22:56   #8
metafor
Member
 
Join Date: May 2010
Posts: 463
Default

Quote:
Originally Posted by Arun
Hmm, FWIW, some time ago I had heard 'first half' for Qualcomm's first 28nm chip sampling, and I can find several websites that mention 'early 2011' so I had assumed they were just repeating what Qualcomm said, but it turns out they might just be speculating so maybe my info is outdated. Personally I still think it makes sense they'd be sampling to lead customers in Q2 2011, so I won't edit the article just yet, but we'll see if Qualcomm says anything about the timing at MWC. Here's hoping they do more than that and we actually get some architectural info!
The initial announcement for 8960 was a January "sampling" as announced at the investor's meeting. The actual date for commercial sampling is yet to be announced.
Old 07-Feb-2011, 23:09   #9
Tahir2
Itchy
 
Join Date: Feb 2002
Location: United Queendom
Posts: 2,858
Default

Quote:
I was thinking of mentioning the PSP2, but it wasn't official when I wrote 99% of the article, and I forgot to make a last-minute addition. So I suppose that's what this thread is for: do you think the PSP2 being quad-core will help legitimise quad-core as an advantage in handhelds as well?
In a word, no, but then a quad-core processor may be more power and performance efficient than an A15 dual core, as you mention in your article. I believe the GPU will make the difference, however.

Quote:
As for NVIDIA, obviously except for Project Denver (which I mention on Page 2 and 4), they're using the same ARM11/Cortex-A9/Cortex-A15 cores that are described on Page 1.
They also have Tegra2 which is the development platform for Honeycomb but that advantage may be short lived.
__________________
Time is an illusion. Lunchtime doubly so - Douglas Adams
Old 08-Feb-2011, 08:01   #10
iwod
Member
 
Join Date: Jun 2004
Posts: 168
Default

The article doesn't mention anything further than Intel 32nm. I expect 22nm for Atom would be a tipping point where ARMv7 and x86 reach the same power and performance. Do you think so too? And with Intel having a much more aggressive node shrink schedule than TSMC, would running a smaller node one year earlier than other manufacturers provide enough of an advantage to Atom SoCs?

For mobile GPUs, it is interesting that TI announced their OMAP 5 for 2H 2012, and we are still not seeing PowerVR 6! It is annoying! The Duke Nukem of hardware?

I would love to know the software side of things. It is quite a well-known fact that software matters a lot more than hardware for graphics. As Nvidia famously said, they have many more software engineers than hardware guys.

That is mainly because drivers make or break the GPU. You could have a decent GPU but poor drivers, application and gaming support (Matrox and S3).

Do Qualcomm, ARM, IMG provide drivers for every platform?
Old 08-Feb-2011, 09:57   #11
Gubbi
Senior Member
 
Join Date: Feb 2002
Posts: 2,570
Default

Great read.

Regarding the limited OOO capability of Marvell's PJ4 and Qualcomm's Scorpion: I suspect these use skewed/delayed execution, where ALU ops are delayed a number of cycles in relation to load ops. This reduces the apparent latency of loads at the cost of a higher branch mispredict penalty. Since loads are performed earlier than ALU ops, you can call this out-of-order execution.

It's a fairly common trick and it is also used in the Cortex A8.

The person posting as Wilco over at Realworldtech.com's forums has worked on the A8. He mentioned he regretted not skewing the pipes one more cycle to reduce the apparent latency of loads from one cycle to zero. That way the A8 could issue a load+ALU op pair without stalling if the load hit the data cache.

A more skewed pipeline, which results in virtually zero latency for L1 hits, could explain why PJ4 and Scorpion perform so well compared to the A8.
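
To make the trade-off concrete, here is a minimal sketch of the skewing idea in Python. The function names and all cycle counts are illustrative assumptions, not measured figures for the A8, PJ4 or Scorpion. It only captures the first-order effect: each cycle of skew hides one cycle of load-to-use latency and adds one cycle to the branch mispredict penalty.

```python
def apparent_load_latency(raw_load_use_cycles, skew_cycles):
    """Stall cycles seen by an ALU op that consumes the load issued just before it.

    raw_load_use_cycles: stall with no skew at all (hypothetical value).
    skew_cycles: how many stages the ALU pipe is delayed relative to the load pipe.
    """
    return max(0, raw_load_use_cycles - skew_cycles)


def mispredict_penalty(base_penalty_cycles, skew_cycles):
    """Branches resolve in the (delayed) ALU pipe, so every skew cycle costs one more."""
    return base_penalty_cycles + skew_cycles


if __name__ == "__main__":
    # Hypothetical machine: 3 cycles of raw load-use latency, 10-cycle base penalty.
    for skew in range(5):
        print(
            f"skew={skew}  load-use stall={apparent_load_latency(3, skew)}  "
            f"mispredict penalty={mispredict_penalty(10, skew)}"
        )
```

In this toy model, going from a skew of 2 to a skew of 3 is the "one more cycle" case: the stall drops from one cycle to zero, at the price of a slightly longer mispredict penalty.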

Cheers
__________________
I'm pink, therefore I'm spam
Old 08-Feb-2011, 15:10   #12
Rys
Tiled
 
Join Date: Oct 2003
Location: Kings Langley, UK
Posts: 2,675
ImgTec

Quote:
Originally Posted by iwod
Do Qualcomm, ARM, IMG provide drivers for every platform?
I can't speak for QCT or ARM, but we provide licensed 'reference' drivers for a given core and OS which the licensee is encouraged to use in their products, and we push really hard for a standard of quality, correctness and performance to help sell the IP.
__________________
A major redesign of the core ALU pineapple boomerang fortress.
Old 08-Feb-2011, 16:14   #13
GZ007
Member
 
Join Date: Jan 2010
Posts: 416
Default

Interesting A15 info from ARM http://www.arm.com/files/pdf/AT-Expl...Cortex-A15.pdf.

EDIT from Arun: This was in the OMAP5 thread, but architectural discussion on the A15 fits better in this thread.
Old 08-Feb-2011, 16:29   #14
metafor
Member
 
Join Date: May 2010
Posts: 463
Default

Quote:
Originally Posted by GZ007
I've read in other places (MDR) that A15 supposedly has 2 symmetrical NEON pipelines. I wonder if this means it can perform 2x128-bit arithmetic instructions or whether each pipeline does 64-bit and can be used for either LD/ST or arithmetic.
Old 08-Feb-2011, 16:39   #15
Exophase
Senior Member
 
Join Date: Mar 2010
Location: Cleveland, OH
Posts: 1,579
Default

Quote:
Originally Posted by metafor
I've read in other places (MDR) that A15 supposedly has 2 symmetrical NEON pipelines. I wonder if this means it can perform 2x128-bit arithmetic instructions or whether each pipeline does 64-bit and can be used for either LD/ST or arithmetic.
Do you have a link handy for where you read that there were symmetrical units?

We know for sure that it's vec4 FP32 and can multi-issue something - these two things are probably not mutually exclusive, so I would presume it doesn't achieve vec4 by dual-issue. This probably means that it can at least do a vec4 FP32 FMADD and 128-bit load, store, or permute in the same cycle. There could be some more redundancy, but going for full symmetry seems like overkill and isn't something you see in other SIMD architectures (afaik)

I also think it'd be very unlikely that you'd see the capability of two independent loads per cycle on NEON when the integer core doesn't support such a thing, and it's not really the typical use case for SIMD (unless they added gather/scatter, which they didn't)
Old 08-Feb-2011, 16:44   #16
metafor
Member
 
Join Date: May 2010
Posts: 463
Default

Quote:
Originally Posted by Exophase
Do you have a link handy for where you read that there were symmetrical units?

We know for sure that it's vec4 FP32 and can multi-issue something - these two things are probably not mutually exclusive, so I would presume it doesn't achieve vec4 by dual-issue. This probably means that it can at least do a vec4 FP32 FMADD and 128-bit load, store, or permute in the same cycle. There could be some more redundancy, but going for full symmetry seems like overkill and isn't something you see in other SIMD architectures (afaik)

I also think it'd be very unlikely that you'd see the capability of two independent loads per cycle on NEON when the integer core doesn't support such a thing, and it's not really the typical use case for SIMD (unless they added gather/scatter, which they didn't)
I was going by Jim Turley's write-up on Eagle at mdronline.

"the twin FP/Neon units are equipped to perform up to eight SIMD multiply/divide operations in parallel—four in each execution unit."

I don't know where that information came from; it could just be speculation. But if both pipelines can handle multiply/divide, that's quite a bit of FP muscle.
Old 08-Feb-2011, 17:20   #17
JohnH
Member
 
Join Date: Mar 2002
Location: UK
Posts: 570
Default

The slides imply a 4-way fused FMAD; the divide is in the integer unit.

John.
Old 08-Feb-2011, 17:51   #18
metafor
Member
 
Join Date: May 2010
Posts: 463
Default

Quote:
Originally Posted by JohnH
The slides imply a 4-way fused FMAD; the divide is in the integer unit.

John.
I'm doubting 2-way 128-bit arithmetics as well, but they specifically listed 4-way VFMA's. If one NEON pipe isn't restricted to just LD/ST, there can be 8-way ADD/SUB for F32 without a lot of modification to the datapath.

OTOH, 8-way F32 multiplies are a bit more difficult.
Old 08-Feb-2011, 18:01   #19
Exophase
Senior Member
 
Join Date: Mar 2010
Location: Cleveland, OH
Posts: 1,579
Default

Quote:
Originally Posted by metafor
I was going by Jim Turley's write-up on Eagle at mdronline.

"the twin FP/Neon units are equipped to perform up to eight SIMD multiply/divide operations in parallel—four in each execution unit."

I don't know where that information came from; it could just be speculation. But if both pipelines can handle multiply/divide, that's quite a bit of FP muscle.
Thanks, but it looks like I have to pay $900 to see it. I'm not sure I trust Linley's information 100% at face value anyway.

NEON can currently issue a streamed FMUL + FADD in a single instruction at nearly double the latency, but of course this doesn't qualify as fused in either performance or precision terms. I wouldn't be very surprised if ARM did move to a fused FMADD implementation using the existing instructions, which would make it a good forward-thinking strategy.

NEON already has reciprocal approximation support. There are two instructions: an initial step that gives a value with 8 bits of precision (I don't know precisely what they use to get these, maybe a LUT) and an iteration step that computes (2 - (op1 * op2)) and should be used in conjunction with an FMUL to perform a Newton-Raphson iteration. The latter is functionally equivalent to an FMADD, so it has the same issue and latency characteristics.
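
For anyone who hasn't seen the sequence, here is a rough Python sketch of the estimate-then-refine scheme described above. The function names are made up for illustration, and `recip_estimate` is only a stand-in that truncates the true reciprocal to about 8 significant bits; it is an assumption, not the estimate the hardware actually produces. It also assumes a positive, finite input.

```python
import math


def recip_estimate(d, bits=8):
    """Stand-in for the initial estimate step: 1/d truncated to `bits` significant bits."""
    mantissa, exponent = math.frexp(1.0 / d)  # mantissa in [0.5, 1) for d > 0
    scale = 1 << bits
    return math.ldexp(math.floor(mantissa * scale) / scale, exponent)


def recip_step(d, x):
    """The iteration step from the post: 2 - (op1 * op2)."""
    return 2.0 - d * x


def reciprocal(d, iterations=2):
    """Newton-Raphson refinement: x <- x * (2 - d * x), one multiply per step."""
    x = recip_estimate(d)
    for _ in range(iterations):
        x = x * recip_step(d, x)
    return x


if __name__ == "__main__":
    d = 3.0
    for n in range(4):
        approx = reciprocal(d, iterations=n)
        print(f"iterations={n}  approx={approx!r}  relative error={abs(approx * d - 1.0):.3e}")
```

Each iteration roughly squares the relative error, which is why a couple of step + multiply pairs are enough to take an 8-bit estimate to about single precision.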

So if FMADD latency is improved I imagine the reciprocal step latency would improve too; I can't see any reason not to continue using the same pipelines for it. On the chance that there are really 2x vec4 FMADD pipelines you should be able to do reciprocal steps and normal FMADDs in parallel. I wouldn't expect that they'd have two reciprocal initial approximation pipelines. What seems most likely is that there'll be one vec4 FMADD pipeline and one vec4 reciprocal estimation pipeline and you'll be able to run those in parallel.

What I strongly doubt is that we'll be seeing a pipelined full-precision divide instruction that runs 4x in parallel. The current approach is a good way to balance performance and reuse the FMADD pipeline.. fully pipelined divides would take a lot of real estate and probably not much better latency than using the NR steps.
Old 08-Feb-2011, 18:12   #20
metafor
Member
 
Join Date: May 2010
Posts: 463
Default

Quote:
Originally Posted by Exophase
Thanks, but it looks like I have to pay $900 to see it. I'm not sure I trust Linley's information 100% at face value anyway.

NEON can currently issue a streamed FMUL + FADD in a single instruction at nearly double the latency, but of course this doesn't qualify as fused in either performance or precision terms. I wouldn't be very surprised if ARM did move to a fused FMADD implementation using the existing instructions, which would make it a good forward-thinking strategy.
They didn't. ARMv7-A LPA adds VFMA, VFMS, VFNMA and VFNMS for fused multiply-add. VFMA and VFMS exist in both NEON and VFP variants, whereas VFNMA and VFNMS are VFP only.

Existing VMLA/VMLS/VNMLA/VNMLS instructions will stay chained. And I would imagine optimization guides will discourage the use of the chained instructions.
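
As a quick illustration of why chained and fused results can differ, here is a small Python sketch. It emulates single precision by rounding doubles to float32 with the standard `struct` module; the helper names are hypothetical, and the fused version is only an approximation of real hardware (the product of two float32 values is exact in a double, but the final add could in rare cases be rounded twice).

```python
import struct


def f32(x):
    """Round a Python float (a double) to the nearest float32 value."""
    return struct.unpack("f", struct.pack("f", x))[0]


def chained_mac(a, b, c):
    """Chained multiply-add: the product is rounded to float32 before the add."""
    return f32(f32(a * b) + c)


def fused_mac(a, b, c):
    """Fused multiply-add (emulated): the product is kept unrounded, one rounding at the end."""
    return f32(a * b + c)


if __name__ == "__main__":
    a = b = 1.0 + 2.0 ** -12     # exactly representable in float32
    c = -(1.0 + 2.0 ** -11)      # exactly representable in float32
    # Exact product is 1 + 2**-11 + 2**-24; the last term is lost when rounding to float32.
    print("chained:", chained_mac(a, b, c))
    print("fused:  ", fused_mac(a, b, c))
```

With these inputs the chained version cancels to zero while the fused version keeps the 2**-24 residue, so code that relies on one behaviour will not get bit-identical results from the other.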

Quote:
So if FMADD latency is improved I imagine the reciprocal step latency would improve too; I can't see any reason not to continue using the same pipelines for it. On the chance that there are really 2x vec4 FMADD pipelines you should be able to do reciprocal steps and normal FMADDs in parallel. I wouldn't expect that they'd have two reciprocal initial approximation pipelines. What seems most likely is that there'll be one vec4 FMADD pipeline and one vec4 reciprocal estimation pipeline and you'll be able to run those in parallel.
Yes, but the main question is how flexible the issue queue will be. If the second pipeline contains VRECPE as well as VLD/ST logic along with other simpler arithmetics, that will be quite a bloated pipeline. I guess we won't know until this thing can be benchmarked.

Are there really enough scenarios out there where one pipeline can be saturated with FMAC while the other needs to handle miscellaneous FP?

Quote:
What I strongly doubt is that we'll be seeing a pipelined full-precision divide instruction that runs 4x in parallel. The current approach is a good way to balance performance and reuse the FMADD pipeline.. fully pipelined divides would take a lot of real estate and probably not much better latency than using the NR steps.
There isn't an instruction for it. Just the legacy FDIV for VFP.
Old 08-Feb-2011, 18:58   #21
Ailuros
Epsilon plus three
 
Join Date: Feb 2002
Location: Chania
Posts: 7,768
Default

Outstanding work Arun! I think it's slowly high time for a hand-held GPU article...hm?
__________________
People are more violently opposed to fur than leather; because it's easier to harass rich ladies than motorcycle gangs.
Old 08-Feb-2011, 18:59   #22
Arun
Unknown.
 
Join Date: Aug 2002
Location: UK
Posts: 4,877
Default

I moved the A15 NEON architecture discussion here since it's much more on-topic here than in the OMAP5 thread.

The one thing I would point out is that the ARM A15 presentation indicates that one of the challenges on the NEON side is "late accumulator source operand for MAC operations" and the ISA extension slide mentions "Fused MAC [...] New instructions to complement current chained multiply+add". So it's very clear that it still supports the chained multiply+add and it's still implemented with a dedicated MAC FIFO (as mentioned for the Cortex-A9 on Page 12 of this presentation).

I think they could probably let that MAC FIFO optionally store a non-rounded result to implement a 'fused MAC' rather than add another multiplier in the same pipeline. It seems by far most likely that there's still only a 4-way multiplier otherwise they'd brag about it; I've read all the public A15 coverage (so that doesn't include the subscription-only MPR article) and I haven't seen anyone else mention symmetric pipelines. On the other hand, even if the fused MAC uses both the MUL and ADD pipelines, maybe the processor can also co-issue ADDs and MULs (unlike the A8) especially as even NEON now supports Out-of-Order Issue (so it's less likely to be wasting a dual-issue opportunity with a pending ADD-part-of-MAC).

In theory it's also possible that the MUL pipeline supports ADD but FP-wise can do fused FMACs (+MUL/ADDs) and not non-fused FMACs so you still need the dedicated MAC FIFO. I don't think it's very likely though (and neither are a wide variety of other options, many of which are still far from impossible though).
__________________
Focusing on non-graphics projects in 2013 (but I still love triangles)
"[...]; the kind of variation which ensues depending in most cases in a far higher degree on the nature or constitution of the being, than on the nature of the changed conditions."
Old 08-Feb-2011, 19:28   #23
metafor
Member
 
Join Date: May 2010
Posts: 463
Default

Quote:
Originally Posted by Arun
I think they could probably let that MAC FIFO optionally store a non-rounded result to implement a 'fused MAC' rather than add another multiplier in the same pipeline.
It's not necessary to add a separate multiplier for fused MAC. But you do need to expand your current multiplier's partial product tree to include another accumulate leg. That's one of the advantages of fused MAC over chained; it's actually easier and faster to implement.

For chained MAC, the extra accumulate leg can simply be tied to 0 and the result written to a MAC FIFO (or forwarded to a separate ADD instruction).
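
A toy integer model of that "extra accumulate leg" idea, in Python. This is only a sketch under simplifying assumptions: unsigned operands, no Booth encoding, plain 3:2 carry-save adders rather than the 4-2 compressors a real design might use, and function names invented for the example. The accumulate operand is just one more row fed into the same reduction tree as the partial products.

```python
def carry_save_add(a, b, c):
    """3:2 compressor over whole words: three addends in, (sum, carry) pair out."""
    partial_sum = a ^ b ^ c
    carry = ((a & b) | (a & c) | (b & c)) << 1
    return partial_sum, carry


def mac_carry_save(x, y, acc=0, width=16):
    """Compute x * y + acc by reducing partial products plus the accumulate row.

    With acc tied to 0 this is a plain multiply, matching the chained case.
    """
    rows = [x << i for i in range(width) if (y >> i) & 1]  # one row per set bit of y
    rows.append(acc)                                       # the extra accumulate leg
    while len(rows) > 2:
        a, b, c = rows.pop(), rows.pop(), rows.pop()
        rows.extend(carry_save_add(a, b, c))               # 3 rows become 2
    return sum(rows)                                       # final carry-propagate add


if __name__ == "__main__":
    print(mac_carry_save(13, 11))       # plain multiply, accumulate leg tied to 0
    print(mac_carry_save(13, 11, 7))    # multiply-accumulate in a single pass
```

The point of the sketch is that the accumulate costs one more row in the tree rather than a second full adder after the multiply.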

Quote:
It seems by far most likely that there's still only a 4-way multiplier otherwise they'd brag about it; I've read all the public A15 coverage (so that doesn't include the subscription-only MPR article) and I haven't seen anyone else mention symmetric pipelines.
It would seem improbable to me as well. But I wouldn't rule out the second NEON pipeline being able to handle various arithmetic operations that can be issued in parallel with the "main" arithmetic pipeline.

Quote:
On the other hand, even if the fused MAC uses both the MUL and ADD pipelines, maybe the processor can also co-issue ADDs and MULs (unlike the A8) especially as even NEON now supports Out-of-Order Issue (so it's less likely to be wasting a dual-issue opportunity with a pending ADD-part-of-MAC).
That's what I think is more likely. A multiply requires 2x the width of the datapath to form the partial product tree, which is basically a giant adder (reduced through Booth encoding and compressors). This means that you naturally have a 256-bit ADD datapath already. You'd need more complex control logic as well as muxes to control the different modes (adding to timing problems), but it's possible to have enough of a datapath for 8x FP32/INT32 ADDs without adding extra width to the datapath.

There are also other miscellaneous NEON operations (logicals, shift, recpe, etc.) that pretty much require a separate pipeline. I wonder if the issue logic is capable of sharing a subpipe. For instance, issue0 could issue to FMAC or Vlogical, and issue1 could issue to Vlogical or Vld.
Old 08-Feb-2011, 19:43   #24
Arun
Unknown.
 
Join Date: Aug 2002
Location: UK
Posts: 4,877
Default

Quote:
Originally Posted by metafor
It's not necessary to add a separate multiplier for fused MAC.
Gah yes, I'm being stupid, I meant another addition unit in the same pipeline. Also good point about a fused MAC being cheaper to implement than a chained one.

So here's one possibility: Pipe1 does MUL/ADD/Integer-MAC/Fused-FMAC, Pipe2 does ADD/ALU, and Pipe3 does LD/ST/Misc - and you can issue to two of these pipelines per cycle. Do you think that makes sense? It's far from the only possibility and I still think it's more likely they're reusing Pipe2's ADD for all MACs including Fused-FMACs if that's possible, but I want to make sure we're on the same page.
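
To pin down what that arrangement would and wouldn't allow, here is a small Python sketch of the issue rule. The pipe names and operation labels come straight from the scenario above, which is speculation about the A15 rather than a documented layout; the only rule modelled is "two ops can co-issue if they can be placed on two different pipes".

```python
# Hypothetical pipe assignment from the scenario above (speculative, not from ARM docs).
PIPES = {
    "pipe1": {"MUL", "ADD", "IMAC", "FMAC"},
    "pipe2": {"ADD", "ALU"},
    "pipe3": {"LD", "ST", "MISC"},
}


def pipes_for(op):
    """Names of the pipes that can execute the given operation type."""
    return {name for name, ops in PIPES.items() if op in ops}


def can_dual_issue(op_a, op_b):
    """True if the two ops can be placed on two different pipes in the same cycle."""
    return any(pa != pb for pa in pipes_for(op_a) for pb in pipes_for(op_b))


if __name__ == "__main__":
    for pair in [("ADD", "ADD"), ("MUL", "ADD"), ("MUL", "FMAC"), ("FMAC", "LD"), ("LD", "ST")]:
        print(pair, can_dual_issue(*pair))
```

Under these assumptions two ADDs or a MUL plus an ADD can pair up, while two multiplies (or a load and a store) cannot, which is the main behavioural difference from a fully symmetric design.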
__________________
Focusing on non-graphics projects in 2013 (but I still love triangles)
"[...]; the kind of variation which ensues depending in most cases in a far higher degree on the nature or constitution of the being, than on the nature of the changed conditions."
Old 08-Feb-2011, 20:10   #25
metafor
Member
 
Join Date: May 2010
Posts: 463
Default

Quote:
Originally Posted by Arun
Gah yes, I'm being stupid, I meant another addition unit in the same pipeline. Also good point about a fused MAC being cheaper to implement than a chained one.
Even that's not really needed

You just need a 3-input adder as opposed to a 2-input adder. When forming the Wallace tree, this just means using 4-2 compressors instead of 3-2 compressors. And the circuit design of a 4-2 is actually simpler and faster than that of a 3-2, due to the ability to simply decode the input into one-hot pass-gate enables.

Quote:
So here's one possibility: Pipe1 does MUL/ADD/Integer-MAC/Fused-FMAC, Pipe2 does ADD/ALU, and Pipe3 does LD/ST/Misc - and you can issue to two of these pipelines per cycle. Do you think that makes sense? It's far from the only possibility and I still think it's more likely they're reusing Pipe2's ADD for all MACs including Fused-FMACs if that's possible, but I want to make sure we're on the same page.
Like I said, fused MAC really doesn't lend itself to reusing a separate adder. Essentially, a fused MAC is a single-pass operation where the addition is built into the multiply circuit's partial product tree.

The reason this can't be done with chained MAC is because the multiply result needs to be rounded/saturated and normalized. So a separate adder is needed at the end of the multiply operation.

If there is to be a dedicated adder separate from the multiply tree, then I suppose having it in a separate pipe makes sense. So the scenario you listed above is perfectly feasible. But if there are 3 subpipes, keeping the LD/ST exclusively LD/ST would probably be a better option as that'd cut down significantly on the pipeline registers, controls and routing necessary.

The slides do seem to imply, however, that there are only 2 subpipes...

Another option is to split the chained MAC instruction into two separate MUL/ADD instructions and send them down the pipeline separately. This allows the reuse of both the multiply with accumulate leg disabled as well as the adder used in the partial product tree.

Additionally, ADD micro-op could be staggered along a separate pipeline such that the main arithmetic pipe can still perform other operations during the second pass.