arjan de lumens said: 6-9 instructions per clock? That sounds like the maximum number of instructions the processor can complete within 1 clock cycle, seen over all its units at the same time; usually there are other steps in the pipeline that hold back maximum IPC and still other factors that limit actual IPC.
Both Pentium4 (Northwood and Prescott alike) and Athlon64 are able to fetch and decode a maximum of 3 instructions per clock in the best case, for a max theoretical IPC of 3. IIRC, for comparison, the PowerPC G5 can do 5 and Itanium can do 6.
AFAIK, actual IPC is usually around 1.0 for Athlon64/Opteron/Pentium-M processors and somewhat less for Pentium4 processors, except for carefully hand-tuned code. In practice, there are additional factors that can reduce actual IPC as well, such as:
- Usually, a processor has different functional units, each of which can only handle a small subset of the full instruction set, with multiple units being able to operate in parallel. For example, Athlon64 has only one unit that can do the FMUL (floating-point multiply) instruction, so if you run a long sequence of FMULs, the Athlon64 cannot do more than 1 instruction per clock. Having too many functional units that do the same thing can hurt clock speed.
- Some instructions may also require the use of more than 1 functional unit or use a unit for several clock cycles. For example, x86 has an instruction that fetches an operand from memory and adds the result to a register. This instruction requires 2 decode slots in Pentium4 (out of 3), so it cannot execute more than 1.5 such instructions per clock cycle. The same instruction requires only 1 decode slot in Athlon64, so the Athlon64 can execute 3 of them per clock.
- Instruction dependencies and latencies. If a second instruction depends on the result of a first instruction, then it cannot execute before the first instruction has completed. For example, in Athlon64 an FMUL has a latency of 4 clock cycles, so if you feed the processor a long string of dependent FMULs, it will execute 1 instruction every 4 clock cycles for an actual IPC of 0.25. Higher clock speeds usually imply larger instruction latencies, thus trading off effective IPC against clock speed.
- Cache misses. When you get a cache miss, you cannot do any useful work until the missing data have been loaded into the cache, which can take hundreds of clock cycles. During these cycles, IPC is 0. The integrated memory controller of Athlon64 greatly reduces the number of clock cycles it stalls when it gets a cache miss; larger caches, such as the ones in Dothan and Itanium, reduce the number of actual cache misses.
- Branch mispredicts. When the processor mispredicts the result of a branch instruction, it must cancel all instructions it has fetched into its pipeline and restart fetching from the correct branch address. This usually causes about 5-30 cycles where the processor doesn't do any useful work. The mispredict penalty is proportional to the number of pipeline steps; here too, you can trade off IPC against clock speed by adjusting the number of pipeline steps.
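To put rough numbers on the list above, here is a minimal back-of-the-envelope sketch in Python. The function names are made up for illustration, and the stall example at the end uses invented counts (5 misses of 200 cycles each), not measurements; the decode-slot and FMUL latency figures are the ones quoted in the list.

```python
def decode_limited_ipc(decode_width, slots_per_instruction):
    """Upper bound on IPC when every instruction uses the same number of decode slots."""
    return decode_width / slots_per_instruction


def dependent_chain_ipc(latency_cycles):
    """IPC of a long chain where each instruction waits for the previous result."""
    return 1 / latency_cycles


def ipc_with_stalls(base_ipc, instructions, stall_events, cycles_per_stall):
    """Effective IPC once idle cycles (cache misses, mispredicts) are added on top.

    Simplifying assumption: stalls do not overlap with useful work at all.
    """
    busy_cycles = instructions / base_ipc
    total_cycles = busy_cycles + stall_events * cycles_per_stall
    return instructions / total_cycles


# Load-and-add example: 2 of 3 decode slots vs. 1 of 3 decode slots.
print(decode_limited_ipc(3, 2))   # 1.5
print(decode_limited_ipc(3, 1))   # 3.0

# Long string of dependent FMULs with a 4-cycle latency.
print(dependent_chain_ipc(4))     # 0.25

# Hypothetical: 1000 instructions at IPC 1.0, plus 5 stalls of 200 cycles each.
print(ipc_with_stalls(1.0, 1000, 5, 200))  # 0.5
```

Even in this crude model, a handful of long stalls halves the effective IPC, which is why the memory controller and cache size matter so much in practice.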
The number of registers in the processor doesn't directly impact IPC; rather, having many registers means that you need to execute fewer instructions in order to carry out a given task. The fewer registers you have, the more instructions you need to swap data in and out of the stack, and the lower your performance at doing actual work will be. If you do actual IPC measurements, it wouldn't surprise me if Athlon64 in 64-bit mode achieves both lower IPC AND better clock-for-clock performance at the same time, compared to operating in 32-bit mode.
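A quick arithmetic sketch of how lower IPC and better clock-for-clock performance can coexist. The instruction counts and IPC values below are invented purely for illustration; they are not measurements of any Athlon64.

```python
def cycles_needed(instruction_count, ipc):
    """Clock cycles to finish a task, given how many instructions it takes and the IPC."""
    return instruction_count / ipc


# Hypothetical 32-bit build: more instructions (extra stack spills), higher IPC.
cycles_32 = cycles_needed(1200, 1.0)

# Hypothetical 64-bit build: fewer instructions thanks to more registers, lower IPC.
cycles_64 = cycles_needed(840, 0.875)

print(cycles_32)  # 1200.0
print(cycles_64)  # 960.0
```

With these made-up numbers the 64-bit build has the lower IPC (0.875 vs. 1.0) yet finishes the same task in fewer cycles, because what counts is instructions divided by IPC, not IPC alone.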
Interestingly enough, the P4 can actually only decode one full x86 instruction from memory per cycle. It makes up for the weaker decode by being able to pull the rest from the trace cache. For most code, the trace cache usually fills up with often-repeated code sequences.
I've read some posts on the IPC of K8, and it seems that the average estimate is closer to 0.8 instructions per cycle. Also, if a K8 is fed a long string of dependent IMULs, IPC will only suffer if those IMULs are all that the processor is given, which probably won't happen too often. Good compilers and the internal buffers hopefully allow some independent instructions to execute in the meantime.
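Here is a toy cycle-by-cycle model of that point, as a sketch only. It assumes an idealized core: a fixed issue width of 3, one dependent multiply chain, independent instructions that each take 1 cycle, and no limits on functional units or buffer sizes. The 4-cycle latency is borrowed from the FMUL figure quoted earlier in the thread as a stand-in; the function name and instruction counts are hypothetical.

```python
def simulate(chain_length, chain_latency, independent_count, issue_width):
    """Return the cycle count to finish one dependent chain plus independent 1-cycle ops."""
    cycle = 0
    chain_issued = 0
    chain_ready_at = 0            # earliest cycle the next chain op may issue
    independent_left = independent_count
    finish = 0                    # cycle at which the last result is available

    while chain_issued < chain_length or independent_left > 0:
        slots = issue_width
        # Issue the next chain op only once the previous result is ready.
        if chain_issued < chain_length and cycle >= chain_ready_at:
            chain_issued += 1
            chain_ready_at = cycle + chain_latency
            finish = max(finish, chain_ready_at)
            slots -= 1
        # Fill the remaining issue slots with independent work.
        issued = min(slots, independent_left)
        independent_left -= issued
        if issued:
            finish = max(finish, cycle + 1)
        cycle += 1
    return finish


for independent in (0, 200, 1100):
    cycles = simulate(100, 4, independent, 3)
    print(independent, cycles, (100 + independent) / cycles)
# 0 400 0.25
# 200 400 0.75
# 1100 400 3.0
```

In this model the chain alone gives an IPC of 0.25, but the same 400 cycles can absorb independent instructions for free, so IPC climbs as soon as there is other work to interleave. A real K8 has a finite scheduling window and limited units, so the actual gain would be smaller.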
Having operands in registers when you need them is pretty important, beyond just reducing the number of memory operands. For example, the Athlon and Athlon64 cores have an L1 cache latency of 3 cycles. Northwood has 2 cycles, while Prescott has 4. I don't think IPC is going to go down with more registers just because there are fewer register-swap instructions, since all the other useful work gets done faster. One poorly placed load could stall a particular set of instructions for three cycles.