Cortex A15 fp64 peak?

codedivine

I am trying to find out the exact fp64 peak of a Cortex A15, for a small report I am writing. I am assuming an implementation similar to Exynos 5 Dual.

My understanding so far is that it should be 2 fp64 MACs/cycle = 4 fp64 flops/cycle.

Is this correct?
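To make the arithmetic explicit, here is the estimate as a sketch; the 1.7 GHz clock of the Exynos 5 Dual is an assumption on my part:

```python
# Sketch of the estimate in the question, assuming (hypothetically)
# 2 FP64 fused MACs/cycle and the Exynos 5 Dual's 1.7 GHz clock.
macs_per_cycle = 2      # assumed FP64 MAC issue rate
flops_per_mac = 2       # one multiply + one add
clock_hz = 1.7e9        # Exynos 5 Dual Cortex-A15 frequency (assumed)

peak_gflops = macs_per_cycle * flops_per_mac * clock_hz / 1e9
print(f"{peak_gflops:.1f} GFLOPS FP64 peak")  # 6.8 GFLOPS
```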
 
ARMv7a's NEON doesn't have FP64 support, so you only get scalar operations via VFP. Although NEON is mentioned as having some level of dual-issue capability, and that should apply to VFP as well, I'm skeptical that it'll contain two independent FMA units, let alone two FP64 ones. The demand for FP64 on the platform just isn't high enough to justify such a thing.
 
Thanks. I was under the mistaken assumption that the Advanced SIMDv2 instructions contain fp64 instructions, but it looks like they do not. Also, ARM has mentioned "quad-MAC" capability on SIMDv2, which I had assumed was for fp32, and correspondingly assumed a double-MAC unit for fp64. But it looks like I was wrong about the presence of fp64.

So we are left with VFPv4 for double precision. VFPv4 does have a fused MAC instruction for FP64, but I am not sure whether the implementation can sustain a throughput of 1 MAC/cycle.

edit: Actually, not sure about quad-MAC on just the SIMDv2 implementation either. Looks like it is quad-MAC for fp32 for VFPv4 + SIMDv2 units combined.
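If VFPv4 does turn out to sustain 1 scalar FP64 MAC/cycle (which is exactly the open question), the revised estimate would look like this; the clock speed is again an assumption:

```python
# Revised FP64 peak, assuming (hypothetically) that the scalar VFPv4
# unit sustains 1 fused MAC/cycle, with no FP64 SIMD available.
macs_per_cycle = 1      # scalar VFP only -- sustained rate unconfirmed
flops_per_mac = 2       # one multiply + one add
clock_hz = 1.7e9        # Exynos 5 Dual clock, assumed as before

peak_gflops = macs_per_cycle * flops_per_mac * clock_hz / 1e9
print(f"{peak_gflops:.1f} GFLOPS FP64 peak")  # 3.4 GFLOPS
```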
 
Architecture (Advanced SIMDv2, VFPv4, etc) doesn't say anything about how many execution units are in a particular implementation. I don't think ARM has said anything about quad MAC for SIMDv2, but has said Cortex-A15 will have it, which also happens to have the new NEON. And I'm sure that it doesn't mean what you get in combined throughput with VFP.

Incidentally, even where 32-bit and 64-bit are both present in vector units you shouldn't assume that quad-32 operation implies dual-64 (either for floating point or integer). Multiplication resources, for instance, grow quadratically, not linearly. So when Intel, AMD, and NVIDIA had 1:2 FP64 to FP32 ratios it meant they were spending some amount of extra silicon for it (compare with the older AMD GPUs that had 1:4 ratios).
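The quadratic-growth point can be made concrete with a toy cost model; the 16x16 base-multiplier granularity is just an assumption for illustration:

```python
# Toy partial-product cost model: building an NxN multiply out of
# 16x16 multiplier blocks takes (N/16)^2 blocks, so doubling the
# operand width quadruples the multiplier array.
def pp_blocks(width_bits, base_bits=16):
    k = width_bits // base_bits
    return k * k

# Four 32-bit lanes vs two 64-bit lanes over the same 128-bit vector:
print(4 * pp_blocks(32))  # 16 blocks for quad 32x32
print(2 * pp_blocks(64))  # 32 blocks for dual 64x64 -- twice the array
```

So dual-64 costs roughly twice the multiplier area of quad-32 despite processing half as many elements, under this model.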
 
Integer multiplication doesn't cause quite that much extra area for doubling the width for a relatively sane implementation. Floating point, yes.
 
Integer multiplication doesn't cause quite that much extra area for doubling the width for a relatively sane implementation. Floating point, yes.

Yeah, I'm only looking at the area for partial product generation, assuming it's implemented by summing the results of several parallel generated partial products. No idea what a native N-bit vs 2N-bit would be like, but NEON has at least so far worked with an array of 8 8x16 multipliers (and I suspect most recent ARM integer implementations work with 16x16 multipliers).

I wonder if Cortex-A15 will have higher integer multiplication throughput on NEON. 1-cycle like on Cortex-A8/A9 is kind of weak, I hope it at least grows to stay within 1:2 of FP32.
 
Yeah, I'm only looking at the area for partial product generation, assuming it's implemented by summing the results of several parallel generated partial products. No idea what a native N-bit vs 2N-bit would be like, but NEON has at least so far worked with an array of 8 8x16 multipliers (and I suspect most recent ARM integer implementations work with 16x16 multipliers).

Even the partial product tree isn't increased all that much for a base-16 multiplier. But more importantly, the actual tree can be mostly reused and shared by 4xINT32 without much waste.

FP adds a lot of waste, but not in the partial product tree itself; the shifter for alignment is the biggest hot area.

I wonder if Cortex-A15 will have higher integer multiplication throughput on NEON. 1-cycle like on Cortex-A8/A9 is kind of weak, I hope it at least grows to stay within 1:2 of FP32.

Do you mean 1-cycle issue? For a single 128-bit NEON, that's pretty good.

IIRC, they have two separate NEON pipes; each 64-bit wide. There was some speculation that some VFP instructions could be issued to each pipe in a dual-issue configuration. Obviously this would be limited to instructions that would only use half the data pipe (so not FP64). It also means NEON doubles can be co-issued for higher integer throughput.
 
Even the partial product tree isn't increased all that much for a base-16 multiplier. But more importantly, the actual tree can be mostly reused and shared by 4xINT32 without much waste.

Yes, I was saying that the partial product multipliers would take up 4x the space, though of course not a 4x increase in total area (i.e. 5x total). That's for the same performance; you can of course sequence it over multiple cycles.

Do you mean 1-cycle issue? For a single 128-bit NEON, that's pretty good.

No, I mean the throughput is only 1 32-bit MAC per cycle. Of course you can't use NEON to perform a single 32-bit MAC, so it's 2x32 over 2 cycles or 4x32 over 4 cycles.
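As a sketch, the cycle cost at that sustained rate (one 32-bit MAC per cycle is the figure being discussed, not something I can confirm) works out to:

```python
# Issue-cycle model for NEON integer MAC at a sustained rate of
# one 32-bit MAC per cycle (the Cortex-A8/A9 figure quoted above).
def vmla_cycles(lanes, macs_per_cycle=1):
    return lanes // macs_per_cycle

print(vmla_cycles(2))  # 2 -> a 2x32 D-register MAC occupies 2 cycles
print(vmla_cycles(4))  # 4 -> a 4x32 Q-register MAC occupies 4 cycles
```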

IIRC, they have two separate NEON pipes; each 64-bit wide. There was some speculation that some VFP instructions could be issued to each pipe in a dual-issue configuration. Obviously this would be limited to instructions that would only use half the data pipe (so not FP64). It also means NEON doubles can be co-issued for higher integer throughput.

Have you heard any source that confirms that the two NEON pipes are only 64-bit wide on Cortex-A15? The only place I've heard this is your speculation many months ago. When they talk about both quad-MAC and some level of dual issue I don't get the impression that the former is only achieved via the latter.
 
No, I mean the throughput is only 1 32-bit MAC per cycle. Of course you can't use NEON to perform a single 32-bit MAC, so it's 2x32 over 2 cycles or 4x32 over 4 cycles.

Yes but you can interleave ADD/SH and other non-MUL instructions in there. I realize that's not as flexible or as preferable.

Have you heard any source that confirms that the two NEON pipes are only 64-bit wide on Cortex-A15? The only place I've heard this is your speculation many months ago. When they talk about both quad-MAC and some level of dual issue I don't get the impression that the former is only achieved via the latter.

The public information is the block diagram I've seen in various places:

http://pc.watch.impress.co.jp/video/pcw/docs/513/347/p11.pdf

For instance. There are two "complex clusters" with two issue ports and they look symmetrical based on both pipeline length and type of operations they appear to support. Unless you think they implemented a 256-bit datapath (dual issue 128-bit vecs) which is insanely unlikely, dual-issue doubles is the most sensible.
 
The public information is the block diagram I've seen in various places:

http://pc.watch.impress.co.jp/video/pcw/docs/513/347/p11.pdf

For instance. There are two "complex clusters" with two issue ports and they look symmetrical based on both pipeline length and type of operations they appear to support. Unless you think they implemented a 256-bit datapath (dual issue 128-bit vecs) which is insanely unlikely, dual-issue doubles is the most sensible.

I don't think Hiroshige Goto is using any more information than what was made available in ARM's presentations. None of it says what types of operations those two pipelines support outside of "NEON" and "VFP." I wouldn't say that anything implies that they're symmetric, definitely not completely so (for instance would you expect both to have division capabilities?).

Compare it with Cortex-A8. All of the NEON pipelines are the same length, although they offer forwarding at different points. And it can dual issue to the arithmetic and load/store/permute pipelines simultaneously, with both offering 128-bit operations. Why couldn't Cortex-A15 be offering something similar, or possibly extended? For instance being able to dual issue 128-bit integer and 128-bit FP operations, or 128-bit add with 128-bit multiply.

Making two symmetric 64-bit pipelines, where a large number of 128-bit operations are supported but by splitting them over both pipes, seems like a less than ideal balancing of resources. I'd only expect them to do this if they were quite asymmetric, but they'd at least both have to handle FMAs, so that implies a fair bit of redundancy.
 
I don't think Hiroshige Goto is using any more information than what was made available in ARM's presentations. None of it says what types of operations those two pipelines support outside of "NEON" and "VFP." I wouldn't say that anything implies that they're symmetric, definitely not completely so (for instance would you expect both to have division capabilities?).

Actually, there are dual dividers in A15. I wouldn't expect it to be completely symmetrical, but the things I wouldn't expect to be symmetrical would be FP64 MUL and MAC. Perhaps VFP can't be dual-issued either.

Compare it with Cortex-A8. All of the NEON pipelines are the same length, although they offer forwarding at different points. And it can dual issue to the arithmetic and load/store/permute pipelines simultaneously, with both offering 128-bit operations. Why couldn't Cortex-A15 be offering something similar, or possibly extended? For instance being able to dual issue 128-bit integer and 128-bit FP operations, or 128-bit add with 128-bit multiply.

It could. And I agree public information doesn't say one or the other. But I'll assert that none of those configurations outside of the ld/st/permute + arithmetic pipe make sense. And ld/st/permute looks a lot different than an arithmetic pipe to the point where they'd have nowhere near the same latencies. Also, I believe A15 integrates its ld/st operations across NEON, VFP and ARM standard, so all the ld/st operations will be handled in the CPU's ldst pipeline.

You rarely see integer SIMD mixed with FP SIMD in a tight loop...it really just doesn't happen. Separating ADD vs MUL has problems when using the two for MAC -- which A15 does. It can be done, but scheduling and forwarding would be a nightmare and certainly would add to the critical path for a relatively high-frequency design.

Making two symmetric 64-bit pipelines, where a large number of 128-bit operations are supported but by splitting them over both pipes, seems like a less than ideal balancing of resources. I'd only expect them to do this if they were quite asymmetric, but they'd at least both have to handle FMAs, so that implies a fair bit of redundancy.

Why wouldn't it be ideal? In the worst case, you have the same throughput as having a single 128-bit pipe. In the best case, you have twice the throughput when doubles are used. For less intensive VFP operations (i.e. not FP64), you can even dual-issue singles, which accounts for the vast majority of code out there...

Note that the subpipe need not be completely separated. But for NEON operations, even FMA channels aren't that dependent on one another. Not to mention A15 grafts FMA onto its chained implementation anyway...
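The trade-off can be sketched with a toy ops-per-cycle model, assuming ideal issue and that a wide op can be split across pipes:

```python
import math

# Ops/cycle for an op of a given width on a given pipe configuration,
# assuming ideal dual issue and that wide ops split across pipes.
def ops_per_cycle(op_bits, pipe_bits, n_pipes):
    slots = max(1, math.ceil(op_bits / pipe_bits))  # pipes consumed per op
    return n_pipes / slots

print(ops_per_cycle(128, 128, 1))  # 1.0: one 128-bit pipe
print(ops_per_cycle(128, 64, 2))   # 1.0: worst case, 2x64 matches it
print(ops_per_cycle(64, 128, 1))   # 1.0: half the wide pipe idles
print(ops_per_cycle(64, 64, 2))    # 2.0: best case, doubles dual-issue
```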
 
Actually, there are dual dividers in A15. I wouldn't expect it to be completely symmetrical, but the things I wouldn't expect to be symmetrical would be FP64 MUL and MAC. Perhaps VFP can't be dual-issued either.

Are there independent dual dividers or a 2x SIMD one? Are they really referring to dividers? The sources on this don't give good enough detail.

It could. And I agree public information doesn't say one or the other. But I'll assert that none of those configurations outside of the ld/st/permute + arithmetic pipe make sense. And ld/st/permute looks a lot different than an arithmetic pipe to the point where they'd have nowhere near the same latencies. Also, I believe A15 integrates its ld/st operations across NEON, VFP and ARM standard, so all the ld/st operations will be handled in the CPU's ldst pipeline.

Okay, so hopefully it could at least dual-issue a 128-bit NEON load or store with a 128-bit operation no matter what. Not being able to do 128-bit operation + permute would still be a tangible loss though, especially if the NEON permute is used to do load de-interleaving which it most likely would (instead of the load/store unit directly).

You rarely see integer SIMD mixed with FP SIMD in a tight loop...it really just doesn't happen.

True.

Separating ADD vs MUL has problems when using the two for MAC -- which A15 does. It can be done, but scheduling and forwarding would be a nightmare and certainly would add to the critical path for a relatively high-frequency design.

Maybe for FP, but what about integer? Surely the integer MAC pipeline is separate from other integer ALU operations.

Look at, for instance, Atom's SSE issue capabilities: it can dual-issue pretty freely to int MUL, 2x int ALU, FMUL, and FADD pipelines. The FMUL is just 64-bit though.

In the current NEON implementation there's a separate FADD and FMUL pipeline which is chained for FMADD, and according to you Cortex-A15 still chains for both this and FMA (I'll take your word on it since I don't think there's a public source for it). Why would this necessarily change for Cortex-A15? If it could freely dual issue to separate FADD and FMUL pipelines wouldn't it be able to handle the FMADD by issuing the FMUL and FADD to both pipelines? And the FMA, if the pipelines are able to output and receive more precision, which you seem to be saying it is. This would mean you can't dual issue an FMADD/FMA with a separate FADD or FMUL, but is it really a major scheduling/critical path problem?

Of course, if the latency of FADD and FMUL are individually even close to that 10 cycle value listed then FMADD is going to have one hell of a long latency. I would like to think the 10 cycles is for FMADDs (and similar like reciprocal steps), which, if there are two up to 10 stage pipelines, would support the idea of there being two 64-bit FMA pipes. However, given the original source for the pipeline description (www.arm.com/files/pdf/AT-Exploring_the_Design_of_the_Cortex-A15.pdf), it doesn't say that there are two 2-10 cycle NEON pipelines, but that there's a 2-10 cycle NEON pipeline that has dual-issue capability. So it could easily mean that it only hits that cycle when the two pipelines are chained.
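A quick sanity check on the chained interpretation; the individual latencies here are pure guesses, chosen only to show how chaining could land on the quoted figure:

```python
# Hypothetical chained-FMADD latency: if the FMUL and FADD pipelines
# are chained end to end, ARM's quoted 2-10 cycle NEON range could be
# describing the chained case rather than either unit alone.
fmul_latency = 5   # assumed, not a published number
fadd_latency = 5   # assumed, not a published number

chained_fmadd = fmul_latency + fadd_latency
print(chained_fmadd)  # 10 -> consistent with the top of ARM's range
```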

Why wouldn't it be ideal? In the worst case, you have the same throughput as having a single 128-bit pipe. In the best case, you have twice the throughput when doubles are used. For less intensive VFP operations (i.e. not FP64), you can even dual-issue singles, which accounts for the vast majority of code out there...

Yes, so it could be significantly faster for code that doesn't use NEON well. Is this what ARM wants to target? Dual 128-bit op + 128-bit permute alone could be more useful for some highly optimized code, and not having that would be a regression from Cortex-A8 (even though there isn't a lot of 128-bit permute capability there).

Maybe I'm just looking at this too much from the point of view of peak attainable performance for code that's well optimized for that platform, but I find 2x mostly redundant 64-bit units disappointing. But I don't really know what the costs are like for 2 64-bit vs 1 128-bit, just that it seems like it'd be significant.
 
Are there independent dual dividers or a 2x SIMD one? Are they really referring to dividers? The sources on this don't give good enough detail.

Divides are only available in scalar mode; there isn't a SIMD divide instruction.

Okay, so hopefully it could at least dual-issue a 128-bit NEON load or store with a 128-bit operation no matter what. Not being able to do 128-bit operation + permute would still be a tangible loss though, especially if the NEON permute is used to do load de-interleaving which it most likely would (instead of the load/store unit directly).

I'm not sure how permute would be handled with regards to forwarding to the arithmetic path. I realize that A15's SIMD/FP pipeline is far more integrated to the other pipes than it was in A9 but whether they're able to do next-cycle-use forwarding from the LD/ST pipe is unknown. I suppose they'd have to or they'd kill SIMD performance.

Maybe for FP, but what about integer? Surely the integer MAC pipeline is separate from other integer ALU operations.

Look at, for instance, Atom's SSE issue capabilities: it can dual-issue pretty freely to int MUL, 2x int ALU, FMUL, and FADD pipelines. The FMUL is just 64-bit though.

IIRC, Atom's SSE pipes are 64-bits wide; there are two of them. And more so, only one can take SIMD instructions while the other is for scalar instructions only.

If you're asking whether A15's two "complex" pipes can handle separate scalar instructions, I'd say likely. Which is the configuration I said would be the most beneficial. In fact, with the exception of things like dual FP64 and dual IntMUL64 -- too costly, IMO -- it looks surprisingly like Atom in terms of dispatch rules. Or so I speculate.

In the current NEON implementation there's a separate FADD and FMUL pipeline which is chained for FMADD, and according to you Cortex-A15 still chains for both this and FMA (I'll take your word on it since I don't think there's a public source for it). Why would this necessarily change for Cortex-A15? If it could freely dual issue to separate FADD and FMUL pipelines wouldn't it be able to handle the FMADD by issuing the FMUL and FADD to both pipelines? And the FMA, if the pipelines are able to output and receive more precision, which you seem to be saying it is. This would mean you can't dual issue an FMADD/FMA with a separate FADD or FMUL, but is it really a major scheduling/critical path problem?

Are we talking scalar or SIMD? For scalar, I agree absolutely that it makes sense to allow FADD in a separate pipe than FMUL. But that, again, is inline with what I proposed above: separate 64-bit data paths.

For SIMD, there are issues of pipeline resources and writeback resources. For example, most of the SIMD FP pipe can share the same pipeline registers (and/or latches) which come at a significant cost. More importantly, they can also share some of the same routing and placement resources of those registers -- again, doesn't come cheap.

Write-back is another issue. Register files can only have so many write ports. The more that are added, the more crowding amongst the input signals as well as more decoding logic amongst the rows and columns. This impacts timing and can increase write-back latency by quite a bit. That doesn't necessarily hurt performance in terms of forwarding but it certainly increases the flush penalty, not to mention the added power and area.

What you're proposing would essentially require either the second FP pipe share the write-back ports with the ld/st pipe (which severely limits the permutation of operations in a full throughput scenario) or that there be 3x128-bit write ports in the FP register file which isn't all that desirable until you get up to a class of processor far higher than what A15 is aiming for.

Of course, if the latency of FADD and FMUL are individually even close to that 10 cycle value listed then FMADD is going to have one hell of a long latency. I would like to think the 10 cycles is for FMADDs (and similar like reciprocal steps), which, if there are two up to 10 stage pipelines, would support the idea of there being two 64-bit FMA pipes. However, given the original source for the pipeline description (www.arm.com/files/pdf/AT-Exploring_the_Design_of_the_Cortex-A15.pdf), it doesn't say that there are two 2-10 cycle NEON pipelines, but that there's a 2-10 cycle NEON pipeline that has dual-issue capability. So it could easily mean that it only hits that cycle when the two pipelines are chained.

That's open to interpretation but 10 cycles for a chained implementation is pretty low, especially for the frequency targets A15 is going for. It's pretty high for any other operation though. So either:

1. The FMUL really is pretty slow (as in, long latency)
2. Their "chained" implementation has a separate, dedicated adder after the multiplier to cut down on MACC/FMA latency.

Yes, so it could be significantly faster for code that doesn't use NEON well. Is this what ARM wants to target? Dual 128-bit op + 128-bit permute alone could be more useful for some highly optimized code, and not having that would be a regression from Cortex-A8 (even though there isn't a lot of 128-bit permute capability there).

I don't think permute + op is going away simply by the fact that permute is likely handled in the ld/st pipeline.

Maybe I'm just looking at this too much from the point of view of peak attainable performance for code that's well optimized for that platform, but I find 2x mostly redundant 64-bit units disappointing. But I don't really know what the costs are like for 2 64-bit vs 1 128-bit, just that it seems like it'd be significant.

Not really. We are talking SIMD here, which are separate operations anyway....
 
Divides are only available in scalar mode; there isn't a SIMD divide instruction.

I know this, what I meant is to say we don't know that it isn't referring to the reciprocal approximation instructions.

Can you really confirm that there are dual full dividers? I can't think of any other major general purpose CPUs that do this for a three-wide part with no more than two of anything in particular.

I'm not sure how permute would be handled with regards to forwarding to the arithmetic path. I realize that A15's SIMD/FP pipeline is far more integrated to the other pipes than it was in A9 but whether they're able to do next-cycle-use forwarding from the LD/ST pipe is unknown. I suppose they'd have to or they'd kill SIMD performance.

So do you expect this permute to happen in a separate pipe or as a previous pipeline stage..? And you think that SIMD performance needs single-cycle load-use? Why exactly?

IIRC, Atom's SSE pipes are 64-bits wide; there are two of them. And more so, only one can take SIMD instructions while the other is for scalar instructions only.

If you're asking whether A15's two "complex" pipes can handle separate scalar instructions, I'd say likely. Which is the configuration I said would be the most beneficial. In fact, with the exception of things like dual FP64 and dual IntMUL64 -- too costly, IMO -- it looks surprisingly like Atom in terms of dispatch rules. Or so I speculate.

Atom is what I'm describing it as. Read Agner Fog's timing tables, read Intel's optimization guide, or try it yourself. It's only 64-bit for FMUL; FADD and FMUL are independent, and simple integer ALU operations can dual-issue at 128-bit. What you're describing is nothing like Atom.

What I'm expecting is more of a natural progression of Cortex-A8, which makes sense since Cortex-A15 appears to have been designed by the Cortex-A8 team and shares a lot of other things in common with it instead of A9.

Are we talking scalar or SIMD? For scalar, I agree absolutely that it makes sense to allow FADD in a separate pipe than FMUL. But that, again, is inline with what I proposed above: separate 64-bit data paths.

For SIMD, there are issues of pipeline resources and writeback resources. For example, most of the SIMD FP pipe can share the same pipeline registers (and/or latches) which come at a significant cost. More importantly, they can also share some of the same routing and placement resources of those registers -- again, doesn't come cheap.

I don't think I really understand why going wider with SIMD changes everything, why would what you're suggesting not also apply to scalar?

Write-back is another issue. Register files can only have so many write ports. The more that are added, the more crowding amongst the input signals as well as more decoding logic amongst the rows and columns. This impacts timing and can increase write-back latency by quite a bit. That doesn't necessarily hurt performance in terms of forwarding but it certainly increases the flush penalty, not to mention the added power and area.

How does it need more register ports to have 2x128-bit than 2x64-bit? Don't you mean wider ones?

What you're proposing would essentially require either the second FP pipe share the write-back ports with the ld/st pipe (which severely limits the permutation of operations in a full throughput scenario) or that there be 3x128-bit write ports in the FP register file which isn't all that desirable until you get up to a class of processor far higher than what A15 is aiming for.

What about the load pipeline forwarding to the NEON pipes for 128-bit loads (which would then also need to contain the permute instructions), with the load forwarding to the NEON pipes or committing during its writeback? Like it does in Cortex-A8: since the address of a load is not typically on the critical path of NEON operations, it seems like they can get away with having it a few cycles later; in Cortex-A8's case, having it several cycles later didn't seem to really hurt.

That's open to interpretation but 10 cycles for a chained implementation is pretty low, especially for the frequency targets A15 is going for. It's pretty high for any other operation though. So either:

1. The FMUL really is pretty slow (as in, long latency)
2. Their "chained" implementation has a separate, dedicated adder after the multiplier to cut down on MACC/FMA latency.

Is 10 cycles really low, even at A15 frequencies? It's already higher than Cortex-A8's latency. You wouldn't expect it to be the full sum of FADD + FMUL, but even then we're talking about an average of 5 cycles each. Is that really so low?

If FMUL is 10 cycles that's going to really suck.

I don't think permute + op is going away simply by the fact that permute is likely handled in the ld/st pipeline.

So you think that most integer loads don't need the whole 4 of those pipeline stages? You'd think ARM would say something to that effect when describing that pipeline. Do you think the permutes happen in a stage that also does sign extension for integer loads?

Not really. We are talking SIMD here, which are separate operations anyway....

Sorry but I don't consider it separate operations, I consider it the same operation over more pieces of data. The implications on control and scheduling are totally different. Surely the overhead of having more independent narrower SIMDs is significant or no one would be using SIMD in the first place. Of course it hits diminishing returns but ARM is already pretty interested in 128-bit, and I doubt they want to stay behind Qualcomm here..
 
I know this, what I meant is to say we don't know that it isn't referring to the reciprocal approximation instructions.

Can you really confirm that there are dual full dividers? I can't think of any other major general purpose CPUs that do this for a three-wide part with no more than two of anything in particular.

IIRC, most x86-space processors do divide as part of their pipeline. I'm obviously not going to confirm anything, but divides in ARM space are more of a separate iterative-sequencer deal. Those take up a lot less space, but it does mean divide throughput isn't very high, hence the desire for dual.

So do you expect this permute to happen in a separate pipe or as a previous pipeline stage..? And you think that SIMD performance needs single-cycle load-use? Why exactly?

Well, outside of very densely packed matrix manipulation, SIMD workloads tend to be load-bound from what I've seen. Single cycle turnaround between ld-compute-store frees up precious ld/st queues.

Atom is what I'm describing it as. Read Agner Fog's timing tables, read Intel's optimization guide, or try it yourself. It's only 64-bit for FMUL; FADD and FMUL are independent, and simple integer ALU operations can dual-issue at 128-bit. What you're describing is nothing like Atom.

Erm, according to the optimization guide:

http://www.intel.com/content/dam/doc/manual/64-ia-32-architectures-optimization-manual.pdf

The only instructions with a throughput of 1 or higher that can issue from either port independently are logicals (AND/OR/BT/XOR/etc).

All others either have a throughput of 0.5, require both ports, or can only issue from one of the ports.

I don't think I really understand why going wider with SIMD changes everything, why would what you're suggesting not also apply to scalar?

Going wider with SIMD requires extra resources obviously. And not a non-trivial amount for anything outside of logicals and perhaps int ADD.

How does it need more register ports to have 2x128-bit than 2x64-bit? Don't you mean wider ones?

No. You can pack 2x64-bit in the write-back buffer. There are only so many architectural registers that you can either write-cancel, write-override, or pack. That obviously depends on your OoOE implementation but I can tell you at least one design does this.

What about the load pipeline forwarding to the NEON pipes for 128-bit loads (which would then also need to contain the permute instructions), with the load forwarding to the NEON pipes or committing during its writeback? Like it does in Cortex-A8: since the address of a load is not typically on the critical path of NEON operations, it seems like they can get away with having it a few cycles later; in Cortex-A8's case, having it several cycles later didn't seem to really hurt.

Forwarding can happen independent of the physical register file. Muxes come a lot cheaper than PRF write ports.

Is 10 cycles really low, even at A15 frequencies? It's already higher than Cortex-A8's latency. You wouldn't expect it to be the full sum of FADD + FMUL, but even then we're talking about an average of 5 cycles each. Is that really so low?

If FMUL is 10 cycles that's going to really suck.

Eh? I see A8's FMAC latency as 18-21 cycles for FP32 and 19-26 for FP64:

http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.ddi0344k/Babbgjhi.html

Is there something I'm missing? FMUL takes 10-12 for FP32 and 11-17 for FP64.

So you think that most integer loads don't need the whole 4 of those pipeline stages? You'd think ARM would say something to that effect when describing that pipeline. Do you think the permutes happen in a stage that also does sign extension for integer loads?

Eh? Integer loads (especially something like INTADD) should take all of 2 cycles. I'm surprised logicals don't take a single cycle. As for sign extension, that's mainly done in expansion instructions, so I would think it'd be somewhere in stage 3 or so of the complex pipe.

Sorry but I don't consider it separate operations, I consider it the same operation over more pieces of data. The implications on control and scheduling are totally different. Surely the overhead of having more independent narrower SIMDs is significant or no one would be using SIMD in the first place. Of course it hits diminishing returns but ARM is already pretty interested in 128-bit, and I doubt they want to stay behind Qualcomm here..

The ISA will still be issuing 128-bit vectors for quads; the explicit parallelism is still there and defined. Just because there are separate issue ports for narrower computation doesn't mean that parallelism is thrown away. Scheduling and control don't need to vary all that much either if you're smart about it.

Of course, if you want to be able to issue doubles in parallel, the OoOE engine would need to work a bit harder.
 
IIRC, most x86 space processors do divide as part of their pipeline. I'm obviously not going to confirm anything but divides in ARM space is more of a separate iterative sequencer deal. Those take up a lot less space but it does mean divide throughput isn't very high, hence the desire for dual.

I have never seen an x86 processor whose divide instructions have anything less than a several-cycle reciprocal throughput. Usually it's almost as high as the latency. And most of them use successive-subtraction-based algorithms, although usually generating more than 1 bit per cycle (2, 3, or even 4). So while they can operate independently of everything else, they aren't really pipelined.
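As a rough model of why an iterative divider caps throughput (the per-cycle radix and the fixed overhead here are assumptions, not published numbers):

```python
import math

# Iterative (non-pipelined) divider model: producing r quotient bits
# per cycle, an n-bit divide occupies the unit for ~n/r cycles plus
# some fixed overhead (the 4-cycle overhead is an assumption).
def div_cycles(n_bits, bits_per_cycle, overhead=4):
    return math.ceil(n_bits / bits_per_cycle) + overhead

print(div_cycles(32, 1))  # 36 -> why divide throughput stays low
print(div_cycles(32, 4))  # 12 -> higher radix helps, still far from 1/cycle
```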

Well, outside of very densely packed matrix manipulation, SIMD workloads tend to be load-bound from what I've seen. Single cycle turnaround between ld-compute-store frees up precious ld/st queues.

Quite a broad statement. There's a lot of room between densely packed matrix manipulation (n^3 time vs n^2 space) and "load-bound." Much less "read-modify-write" bound!

Erm, according to the optimization guide:

http://www.intel.com/content/dam/doc/manual/64-ia-32-architectures-optimization-manual.pdf

The only instructions with a throughput of 1 or higher that can issue from either port independently are logicals (AND/OR/BT/XOR/etc).

All others either have a throughput of 0.5, require both ports, or can only issue from one of the ports.

You are reading the table incorrectly: "throughput" clearly means reciprocal throughput, i.e. how many cycles must pass before another instance of the same instruction can issue. Notice how simple ALU integer instructions have 0.5 throughput against a register or immediate, but 1.0 against memory? And those are the ones that are listed as capable of issuing from either port? Figure it out.

So first up, we see that 128-bit packed integer (not just logical, ALSO add and subtract, abs, avg, cmp, min, max, sign) can execute on both ports. So not only does it have two 128-bit SSE ports but they can co-issue.

Second, 128-bit addps has a throughput of 1, meaning it's also 128-bit, and executes on port 1, meaning it can co-issue with stuff on port 0. And 128-bit mulps has a throughput of 2 and executes on port 0, meaning it can co-issue with adds.

Finally, although I didn't mention them at all first time around, 128-bit integer multiplies and MADs are single cycle and execute on port 0. Even the 32-bit ones.

Nothing I said is contradicted by the manual. Your claims about Atom having 64-bit SIMD pipes or inability to co-issue SIMD are incorrect. The only thing 64-bit about Atom's SSE are its floating point multiplies.

Going wider with SIMD obviously requires extra resources. And a non-trivial amount for anything outside of logicals and perhaps integer ADD.

But the stuff you were saying was still about the control path and scheduling, wasn't it? And how many extra execution resources do you need for a 128-bit FADD and 128-bit FMUL vs 2 64-bit FADDs and 2 64-bit FMULs, which is what you think it'll have?

metafor said:
No. You can pack 2x64-bit in the write-back buffer. There are only so many architectural registers that you can either write-cancel, write-override, or pack. That obviously depends on your OoOE implementation but I can tell you at least one design does this.

So we're talking about write-back buffers now, not register file ports? Can you please answer, how do you perform writes of 2x64-bit that can go to any registers (not just adjacent ones) without having two register file write ports?

metafor said:
Forwarding can happen independent of the physical register file. Muxes come a lot cheaper than PRF write ports.

Exactly, so what's wrong with Cortex-A8's approach of forwarding from the load/store pipes to the NEON units - why would they need a separate PRF write port instead of using the NEON unit's writeback and forwarding?

metafor said:
Eh? I see A8's FMAC latency as 18-21 cycles for FP32 and 19-26 for FP64:

http://infocenter.arm.com/help/index.../Babbgjhi.html

Is there something I'm missing? FMUL takes 10-12 for FP32 and 11-17 for FP64.

.. seriously?

Look again at either the NEON performance of 2x32 FADD or 2x32 FMUL on Cortex-A8, or look at the VFP or NEON performance of those on Cortex-A9. Cortex-A8 has a hopelessly crippled VFP unit which is not pipelined, and has much worse latency AND throughput than the NEON unit or the VFP unit on Cortex-A8. You should already know this. Obviously I was talking about the NEON latencies.

metafor said:
Eh? Integer loads (especially something like INTADD) should take a whole of 2 cycles. I'm surprised logicals don't take a single cycle. As for sign extension, that's mainly done in expansion instructions, so I would think that it'd be somewhere in stage 3 or so of the complex pipe....

Sign extension is done every time you do an ldrsh... and if it's anything like previous ARM designs it handles rotation to extract bytes/halfwords (and possibly words as well) in a separate stage. I don't understand the difference between an "integer load" and a "logical load", or do you mean when the load unit makes it available to the respective ALU? Why would this change load-use?

metafor said:
The ISA will still be issuing 128-bit vectors for quads; the explicity parallelism is still there and defined. Just because there are separate issue ports for narrower computation itself doesn't mean that parallelism is thrown away. Scheduling and control doesn't need to vary all that much either if you're smart about it.

Of course, if you want to be able to issue doubles in parallel, the OoOE engine would need to work a bit harder.

I don't see what that has to do with what I said, which was that I don't consider 128-bit to be four separate operations. And whether or not the front end deals with 128-bit instructions, if they're split into separate 64-bit operations over separate pipes (pipes that can normally handle completely different operations), that's more overhead than if they went to one pipe. Although I don't know if register allocation and scheduling would be done before or after the split.

If you can't issue two separate 64-bit operations in parallel then we're not even talking about 2x64-bit, are we?
 
Look again at either the NEON performance of 2x32 FADD or 2x32 FMUL on Cortex-A8, or look at the VFP or NEON performance of those on Cortex-A9. Cortex-A8 has a hopelessly crippled VFP unit which is not pipelined, and has much worse latency AND throughput than the NEON unit or the VFP unit on Cortex-A8. You should already know this. Obviously I was talking about the NEON latencies.
Guess you meant VFP unit on Cortex-A9 ;)
 