IIRC, most processors in the x86 space do divide as part of their pipeline. I'm obviously not going to confirm anything, but divides in the ARM space are more of a separate iterative sequencer deal. Those take up a lot less space, but it does mean divide throughput isn't very high, hence the desire for dual.
I have never seen an x86 processor whose divide instructions have anything less than a reciprocal throughput of several cycles. Usually it's almost as high as the latency. And most of them use successive-subtraction-based algorithms, although usually generating more than 1 bit per cycle (2, 3, or even 4). So while they can operate independently of everything else, they aren't really pipelined.
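To make the "more than 1 bit per cycle" point concrete, here's a rough software model of what a radix-4 successive-subtraction divider does each step. This is just my own sketch, not any particular vendor's design: each loop iteration stands in for one cycle, retiring 2 quotient bits; radix-8/16 designs retire 3 or 4 bits by comparing against more multiples of the divisor.

    #include <stdint.h>

    /* Sketch of a radix-4 restoring divider: 2 quotient bits per iteration.
       Real hardware does the 1d/2d/3d comparisons in parallel each cycle.
       Assumes d != 0. */
    static uint32_t div_radix4(uint32_t n, uint32_t d, uint32_t *rem)
    {
        uint64_t r = 0;   /* partial remainder */
        uint32_t q = 0;   /* quotient, built 2 bits at a time */

        for (int i = 30; i >= 0; i -= 2) {
            r = (r << 2) | ((n >> i) & 3);          /* shift in 2 dividend bits */
            uint32_t digit = 0;
            if      (r >= (uint64_t)d * 3) { r -= (uint64_t)d * 3; digit = 3; }
            else if (r >= (uint64_t)d * 2) { r -= (uint64_t)d * 2; digit = 2; }
            else if (r >= d)               { r -= d;               digit = 1; }
            q = (q << 2) | digit;
        }
        if (rem) *rem = (uint32_t)r;
        return q;   /* 16 iterations instead of 32 at 1 bit per cycle */
    }

Each step depends on the previous partial remainder, which is exactly why this kind of unit sits off to the side as an iterative sequencer and can't be pipelined cheaply.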
Well, outside of very densely packed matrix manipulation, SIMD workloads tend to be load-bound from what I've seen. Single cycle turnaround between ld-compute-store frees up precious ld/st queues.
Quite a broad statement. There's a lot of room between densely packed matrix manipulation (n^3 time vs n^2 space) and "load-bound." Much less "read-modify-write" bound!
Erm, according to the optimization guide:
http://www.intel.com/content/dam/doc/manual/64-ia-32-architectures-optimization-manual.pdf
The only instructions with a throughput of 1 or higher that can issue from either port independently are logicals (AND/OR/BT/XOR/etc).
All others either have a throughput of 0.5, require both ports, or can only issue from one of the ports.
You are reading the table incorrectly: "throughput" there clearly means reciprocal throughput, i.e. how many cycles elapse between back-to-back independent instances of an instruction. Notice how simple ALU integer instructions have 0.5 throughput against register or immediate operands (two per cycle), but 1.0 against memory? And those are the ones that are listed as capable of issuing from either port? Figure it out.
So first up, we see that 128-bit packed integer ops (not just logicals, ALSO add and subtract, abs, avg, cmp, min, max, sign) can execute on both ports. So not only does it have two 128-bit SSE ports, they can also co-issue.
Second, 128-bit addps has a throughput of 1, meaning the FP add datapath is also a full 128 bits wide, and it executes on port 1, meaning it can co-issue with stuff on port 0. And 128-bit mulps has a throughput of 2 and executes on port 0, meaning it can co-issue with adds.
Finally, although I didn't mention them at all the first time around, 128-bit integer multiplies and MADs have single-cycle throughput and execute on port 0. Even the 32-bit ones.
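For what it's worth, here's the kind of mix I'm describing, written with plain SSE intrinsics. The port assignments in the comments are the ones I'm reading out of the manual's Atom tables; whether any two of these actually co-issue on a given cycle is obviously up to the scheduler, and the kernel itself is just a toy:

    #include <stdint.h>
    #include <immintrin.h>

    /* Toy kernel: keeps independent 128-bit FP add, FP multiply and
       packed-integer add chains in flight at the same time.
       Assumes n is a multiple of 4. */
    void mix(const float *a, const float *b, float *c, int32_t *d, int n)
    {
        for (int i = 0; i < n; i += 4) {
            __m128  va = _mm_loadu_ps(a + i);
            __m128  vb = _mm_loadu_ps(b + i);
            __m128i vd = _mm_loadu_si128((const __m128i *)(d + i));

            __m128  sum  = _mm_add_ps(va, vb);         /* addps: port 1, throughput 1 */
            __m128  prod = _mm_mul_ps(va, vb);         /* mulps: port 0, throughput 2 */
            vd = _mm_add_epi32(vd, _mm_set1_epi32(1)); /* paddd: either port */

            _mm_storeu_ps(c + i, _mm_add_ps(sum, prod));
            _mm_storeu_si128((__m128i *)(d + i), vd);
        }
    }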
Nothing I said is contradicted by the manual. Your claims about Atom having 64-bit SIMD pipes or being unable to co-issue SIMD are incorrect. The only thing 64-bit about Atom's SSE is its floating-point multiplies.
Going wider with SIMD obviously requires extra resources. And not a trivial amount for anything outside of logicals and perhaps integer ADD.
But the stuff you were saying was still about the control path and scheduling, wasn't it? And how much extra do you need in execution resources for a 128-bit FADD and a 128-bit FMUL vs. two 64-bit FADDs and two 64-bit FMULs, which is what you think it'll have?
metafor said:
No. You can pack 2x64-bit in the write-back buffer. There are only so many architectural registers that you can either write-cancel, write-override, or pack. That obviously depends on your OoOE implementation but I can tell you at least one design does this.
So we're talking about write-back buffers now, not register file ports? Can you please answer: how do you perform 2x64-bit writes that can go to any registers (not just adjacent ones) without having two register file write ports?
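To put the question in structural terms (just a sketch of the picture I have in mind, not anyone's RTL): writing two 64-bit results to arbitrary, non-adjacent registers in the same cycle means two independent address decoders and two write paths, i.e. two write ports. Packing into a write-back buffer only helps when both halves land in the same entry.

    #include <stdint.h>

    /* A register file that accepts two writes per cycle to arbitrary
       entries is, by definition, a two-write-port register file. */
    typedef struct { uint64_t regs[32]; } prf_t;

    static void prf_write2(prf_t *p, unsigned idx0, uint64_t v0,
                                     unsigned idx1, uint64_t v1)
    {
        p->regs[idx0] = v0;   /* write port 0 */
        p->regs[idx1] = v1;   /* write port 1 */
    }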
metafor said:
Forwarding can happen independent of the physical register file. Muxes come a lot cheaper than PRF write ports.
Exactly, so what's wrong with Cortex-A8's approach of forwarding from the load/store pipes to the NEON units - why would they need a separate PRF write port instead of using the NEON unit's writeback and forwarding?
metafor said:
Eh? I see A8's FMAC latency as 18-21 cycles for FP32 and 19-26 for FP64:
http://infocenter.arm.com/help/index.../Babbgjhi.html
Is there something I'm missing? FMUL takes 10-12 for FP32 and 11-17 for FP64.
... seriously?
Look again at the NEON performance of 2x32 FADD or 2x32 FMUL on Cortex-A8, or look at the VFP or NEON performance of those on Cortex-A9. Cortex-A8 has a hopelessly crippled VFP unit which is not pipelined and has much worse latency AND throughput than its NEON unit or than the VFP unit on Cortex-A9. You should already know this. Obviously I was talking about the NEON latencies.
metafor said:
Eh? Integer loads (especially something like INTADD) should take a whole 2 cycles. I'm surprised logicals don't take a single cycle. As for sign extension, that's mainly done in expansion instructions, so I would think that it'd be somewhere in stage 3 or so of the complex pipe....
Sign extension is done every time you do an ldrsh... and if it's anything like previous ARM designs, it handles rotation to extract bytes/halfwords (and possibly words as well) in a separate stage. I don't understand the difference between an "integer load" and a "logical load", or do you mean when the load unit makes the value available to the respective ALU? Why would this change the load-use latency?
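Just to spell out what I mean by the extension being part of the load: on ARM a plain widening load of a signed halfword is a single ldrsh, not a load followed by a separate extend op (the C function name here is mine, purely for illustration):

    #include <stdint.h>

    /* Typically compiles to a single "ldrsh r0, [r0]" on ARM: the
       halfword extraction and sign extension happen inside the load
       pipe, not as an extra ALU op afterwards. */
    int32_t load_signed_halfword(const int16_t *p)
    {
        return *p;
    }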
metafor said:
The ISA will still be issuing 128-bit vectors for quads; the explicit parallelism is still there and defined. Just because there are separate issue ports for the narrower computation itself doesn't mean that parallelism is thrown away. Scheduling and control don't need to vary all that much either if you're smart about it.
Of course, if you want to be able to issue doubles in parallel, the OoOE engine would need to work a bit harder.
I don't see what that has to do with what I said, which is that I don't consider 128-bit to be four separate operations. And whether or not the front end deals with 128-bit instructions, if they're split into separate 64-bit operations over separate pipes (pipes that can normally handle completely different operations), that's more overhead than if they went to one pipe. Although I don't know whether register allocation and scheduling would be done before or after the split.
If you can't issue two separate 64-bit operations in parallel then we're not even talking about 2x64-bit, are we?