Intel vs. AMD

arjan de lumens said:
6-9 instructions per clock? That sounds like the maximum number of instructions the processor can complete within 1 clock cycle, seen over all its units at the same time; usually there are other steps in the pipeline that hold back maximum IPC and still other factors that limit actual IPC.

Both Pentium4 (Northwood and Prescott alike) and Athlon64 are able to fetch and decode a maximum of 3 instructions per clock in the best case, for a max theoretical IPC of 3. IIRC, for comparison, the PowerPC G5 can do 5 and Itanium can do 6.

In practice, there are additional factors that can reduce actual IPC as well, such as:
  • Usually, a processor has different functional units, each of which can only handle a small subset of the full instruction set, with multiple units being able to operate in parallel. For example, the Athlon64 has only one unit that can execute the FMUL (floating-point multiply) instruction, so if you run a long sequence of FMULs, the Athlon64 cannot do more than 1 instruction per clock. Having too many functional units that do the same thing can hurt clock speed.
  • Some instructions may also require the use of more than 1 functional unit, or use a unit for several clock cycles. For example, x86 has an instruction that fetches an operand from memory and adds it to a register. This instruction requires 2 decode slots (out of 3) in the Pentium4, so the Pentium4 cannot execute more than 1.5 such instructions per clock cycle. The same instruction requires only 1 decode slot in the Athlon64, so the Athlon64 can execute 3 of them per clock.
  • Instruction dependencies and latencies. If a second instruction depends on the result of a first instruction, then it cannot execute before the first instruction has completed. For example, in the Athlon64 an FMUL has a latency of 4 clock cycles, so if you feed the processor a long string of dependent FMULs, it will execute 1 instruction every 4 clock cycles, for an actual IPC of 0.25 (see the C sketch after this list). Higher clock speeds usually imply larger instruction latencies, thus trading off effective IPC against clock speed.
  • Cache misses. When you get a cache miss, you cannot do any useful work until the missing data have been loaded into the cache, which can take 100s of clock cycles. During these cycles, IPC is 0. The integrated memory controller of the Athlon64 greatly reduces the number of clock cycles it stalls on a cache miss; larger caches, such as the ones in Dothan and Itanium, reduce the number of actual cache misses.
  • Branch mispredicts. When the processor mispredicts the result of a branch instruction, it must cancel all instructions it has fetched into its pipeline and restart fetching from the correct branch address. This usually costs about 5-30 cycles during which the processor doesn't do any useful work. The mispredict penalty is proportional to the number of pipeline steps; here too, you can trade off IPC against clock speed by adjusting the number of pipeline steps.
AFAIK, actual IPC is usually around 1.0 for Athlon64/Opteron/Pentium-M processors and somewhat less for Pentium4 processors, except for carefully hand-tuned code.
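To make the dependency/latency point above concrete, here is a minimal C sketch (mine, not from the post; the loop counts are arbitrary and the latency figures are only assumed, per the K8 FMUL number quoted above). Both loops do the same number of multiplies, but the first is one long dependent chain while the second keeps four independent chains in flight:

/* Dependent vs. independent floating-point multiply chains. */
#include <stdio.h>

#define N 100000000L

int main(void)
{
    volatile double seed = 1.0000001;          /* defeat constant folding */
    double x = seed, a = seed, b = seed, c = seed, d = seed;

    /* One dependent chain: each multiply waits for the previous
     * result, so throughput is limited by the FMUL latency.      */
    for (long i = 0; i < N; i++)
        x *= 1.0000001;

    /* Four independent chains: multiplies from different chains
     * can overlap inside the pipelined multiplier.               */
    for (long i = 0; i < N; i += 4) {
        a *= 1.0000001;
        b *= 1.0000001;
        c *= 1.0000001;
        d *= 1.0000001;
    }

    printf("%g %g\n", x, a * b * c * d);       /* keep results live */
    return 0;
}

On a core with a 4-cycle, fully pipelined multiplier, the second loop should run close to 4x faster per multiply, which is exactly the IPC 0.25 vs. 1.0 contrast described above.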

The number of registers in the processor doesn't directly impact IPC; rather, having many registers means that you need to execute fewer instructions in order to carry out a given task. The fewer registers you have, the more instructions you need to swap data in and out of the stack, and the worse your performance at doing actual work will be. If you do actual IPC measurements, it wouldn't surprise me if Athlon64 in 64-bit mode achieves both lower IPC AND better clock-for-clock performance at the same time, compared to operating in 32-bit mode.
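As a hedged illustration of that register-count point (again my own sketch): the second function below emulates a compiler that has run out of registers by round-tripping intermediates through a stack array. Both functions compute the same value; the "spilled" one simply executes extra loads and stores that do no actual work, and the volatile qualifier forces that memory traffic to really happen:

#include <stdio.h>

/* Enough registers: every intermediate stays in a register. */
static double poly_registers(double v)
{
    double t1 = v * v;
    double t2 = t1 * v;
    double t3 = t2 * v;
    return t1 + 2.0 * t2 + 3.0 * t3;
}

/* Too few registers: intermediates are stored to the stack and
 * reloaded, emulating compiler-generated spill code.           */
static double poly_spilled(double v)
{
    volatile double spill[3];
    spill[0] = v * v;
    spill[1] = spill[0] * v;
    spill[2] = spill[1] * v;
    return spill[0] + 2.0 * spill[1] + 3.0 * spill[2];
}

int main(void)
{
    printf("%f %f\n", poly_registers(1.5), poly_spilled(1.5));
    return 0;
}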

Interestingly enough, P4 can actually only decode one full x86 instruction from memory per cycle. It makes up for the weaker decode by being able to pull the rest from the trace cache. For most code, the trace cache usually fills up with often repeated code sequences.

I've read some posts on the IPC of K8, and it seems that the average estimate is closer to .8 instructions per cycle. Also, if a K8 is fed a long string of dependent IMULs, IPC will only suffer if those IMULs are all that the processor is given, which probably won't happen too often. Good compilers and the internal buffers hopefully allow some independent instructions to execute in the meantime.

Having operands in registers when you need them is pretty important beyond just reducing memory operands. For example, the Athlon and Athlon64 cores have an L1 cache latency of 3 cycles; Northwood has 2 cycles, while Prescott has 4. IPC isn't necessarily going to go down with more registers just because there are fewer register-swap instructions, because all the other useful work gets done faster. One poorly placed load could stall three cycles' worth of a particular set of instructions.
 
3dilettante said:
Interestingly enough, P4 can actually only decode one full x86 instruction from memory per cycle. It makes up for the weaker decode by being able to pull the rest from the trace cache. For most code, the trace cache usually fills up with often repeated code sequences.

All x86 CPUs are slower on the first fetch, because of a combination of cache misses and instruction boundaries. So to me, this does not prove that the P4 has 'weaker decode' than other x86 CPUs at all. In fact, most of the time you're running the code from cache, so the first fetch (the only time the P4 actually decodes x86 code) is not all that relevant to performance.

I've read some posts on the IPC of K8, and it seems that the average estimate is closer to .8 instructions per cycle. Also, if a K8 is fed a long string of dependent IMULs, IPC will only suffer if those IMULs are all that the processor is given, which probably won't happen too often. Good compilers and the internal buffers hopefully allow some independent instructions to execute in the meantime.

On the contrary: IPC is mainly below 1 precisely because, most of the time, you cannot execute independent instructions in the meantime. If you could, there's no reason why IPC couldn't get close to the theoretical maximum of 3 in many situations.
A common example is matrix multiplication. You will be doing a lot of dependent multiply operations, and very little else to put in between.
This is exactly the sort of situation that makes HyperThreading such an interesting concept. It effectively allows you to run 2 streams of code 'asynchronously': they don't have to be rolled into a single loop/function, since they each have their own program counter.
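A quick sketch of that idea, assuming a HyperThreading-capable CPU and POSIX threads (my example, not from the post): each thread below is one long dependent multiply chain that would leave a single hardware thread mostly idle between results, but together they give the core two independent streams, each with its own program counter, to interleave:

#include <pthread.h>
#include <stdio.h>

#define N 100000000L

/* One long loop-carried dependency chain per thread. */
static void *chain(void *arg)
{
    double x = *(double *)arg;
    for (long i = 0; i < N; i++)
        x *= 1.0000001;
    *(double *)arg = x;
    return NULL;
}

int main(void)
{
    double a = 1.0, b = 1.0;
    pthread_t t1, t2;

    /* Two 'asynchronous' streams of code, as described above. */
    pthread_create(&t1, NULL, chain, &a);
    pthread_create(&t2, NULL, chain, &b);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);

    printf("%g %g\n", a, b);
    return 0;
}

On an SMT core, the second chain can fill the latency slots the first one leaves empty, so the pair should finish in well under twice the single-chain time (build with -pthread).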

Having operands in registers when you need them is pretty important beyond just reducing memory operands. For example, the Athlon and Athlon64 cores have an L1 cache latency of 3 cycles; Northwood has 2 cycles, while Prescott has 4. IPC isn't necessarily going to go down with more registers just because there are fewer register-swap instructions, because all the other useful work gets done faster. One poorly placed load could stall three cycles' worth of a particular set of instructions.

This is especially a problem on the Athlon, which AFAIK cannot reschedule load or store operations, while the PIII/P4/P-M can.
 
3dilettante said:
Having operands in registers when you need them is pretty important beyond just reducing memory operands. For example, the Athlon and Athlon64 cores have an L1 cache latency of 3 cycles; Northwood has 2 cycles, while Prescott has 4. IPC isn't necessarily going to go down with more registers just because there are fewer register-swap instructions, because all the other useful work gets done faster. One poorly placed load could stall three cycles' worth of a particular set of instructions.
Stack reads are normally rather cheap in out-of-order processors - they consume instruction slots, but their latency is easy to hide.

It is possible to more or less hide the latency of an L1 load unless you modify one of the address registers immediately before the load. Consider a hypothetical in-order processor with 3-cycle L1 load latency and a pipeline that looks as follows:
  • Fetch
  • Decode
  • Load1
  • Load2
  • Load3
  • Execute
In this processor, you can have a memory operand on every instruction without any loss of max IPC (unless you change an address register or need to flush the pipeline for a branch mispredict). If you have an out-of-order processor, it should schedule operations at least as efficiently as this hypothetical in-order processor, and since the stack pointer usually doesn't change very often in the middle of a function, the latency will be hidden.
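A toy timing model of this hypothetical pipeline (the program and cycle accounting are invented for illustration) shows the claim numerically: loads through a stable address register cost nothing extra, while a load issued right after the address register is written must stall until the writer reaches Execute:

#include <stdio.h>

struct insn {
    int writes_addr;  /* writes the address register (in Execute, stage 6)   */
    int loads_addr;   /* loads through it (address needed in Load1, stage 3) */
};

int main(void)
{
    /* One write to the address register, then seven loads through it. */
    struct insn prog[] = {
        {1, 0}, {0, 1}, {0, 1}, {0, 1},
        {0, 1}, {0, 1}, {0, 1}, {0, 1},
    };
    int n = sizeof prog / sizeof prog[0];
    int addr_ready = 0;   /* cycle at which the address register is valid */
    int cycle = 0;        /* cycle at which an insn enters Load1          */

    for (int i = 0; i < n; i++) {
        cycle = cycle + 1;                /* in-order: 1 issue per clock    */
        if (prog[i].loads_addr && cycle < addr_ready)
            cycle = addr_ready;           /* stall until the Execute result */
        if (prog[i].writes_addr)
            addr_ready = cycle + 3;       /* Execute is 3 stages past Load1 */
        printf("insn %d enters Load1 at cycle %d\n", i, cycle);
    }
    printf("IPC = %.2f\n", (double)n / cycle);
    return 0;
}

Only the first load after the write actually stalls; once the address register is stable again, one instruction with a memory operand completes per clock, just as argued above.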
 
Scali said:
3dilettante said:
Interestingly enough, P4 can actually only decode one full x86 instruction from memory per cycle. It makes up for the weaker decode by being able to pull the rest from the trace cache. For most code, the trace cache usually fills up with often repeated code sequences.

All x86 CPUs are slower on the first fetch, because of a combination of cache misses and instruction boundaries. So to me, this does not prove that the P4 has 'weaker decode' than other x86 CPUs at all. In fact, most of the time you're running the code from cache, so the first fetch (the only time the P4 actually decodes x86 code) is not all that relevant to performance.

I didn't mean P4's decoding scheme was inferior, only that it dedicates fewer resources to initial decoding. The P4 has a single decode unit capable of decoding a single x86 instruction per clock from memory. Athlon dedicates something along the lines of three parallel decoders, but winds up redecoding a lot of what the P4 trace cache would hold.

I believe there are some rumors to the effect that K9 (now K10) may have a trace cache or a level-0 cache to help offset the 3-cycle L1.
 
Decoders aren't good for clock rate; this is why Intel took them out of the critical path and kept only one.
 
Look at where the solid P4 emphasis on clock rate has led Intel now. Perhaps a proper IPC to clock rate balance is necessary after all. ;)
 
3dilettante said:
I didn't mean P4's decoding scheme was inferior, only that it dedicates fewer resources to initial decoding. The P4 has a single decode unit capable of decoding a single x86 instruction per clock from memory. Athlon dedicates something along the lines of three parallel decoders, but winds up redecoding a lot of what the P4 trace cache would hold.

Read what I said again. The Athlon has to decode x86 at every pass, but the first pass is always slower. It may have 3 decoders instead of P4's single decoder, but because the first pass is slower due to instruction boundary finding and cache issues and such, there's no reason to believe that the first pass on an Athlon is at all faster, based on this information.
In subsequent passes, the P4's trace cache is obviously not slower than the Athlon's x86 decoders, since they can both effectively issue 3 instructions per clock. The difference is that the P4 can ALWAYS issue those 3 instructions, because it is no longer dependent on weird complex x86 instructions, while for the Athlon 3 instructions is the best case.
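A back-of-the-envelope model of this argument in C (every figure here is invented; only the shape of the result matters): a loop body of B instructions executed K times, where a P4-like core pays a slow 1-instruction-per-clock decode once and then streams 3 uops per clock from the trace cache, while an Athlon-like core decodes on every pass at an assumed average somewhat below its 3-per-clock best case:

#include <stdio.h>

int main(void)
{
    double B = 100.0, K = 1000.0;          /* loop size and trip count    */

    double p4_first = B / 1.0;             /* slow initial x86 decode     */
    double p4_rest  = (K - 1.0) * B / 3.0; /* trace cache: 3 uops/clock   */
    double athlon   = K * B / 2.5;         /* decode every pass, <3/clock */

    printf("P4-like decode/issue cycles:     %.0f\n", p4_first + p4_rest);
    printf("Athlon-like decode/issue cycles: %.0f\n", athlon);
    return 0;
}

With these made-up numbers, the trace cache wins once the loop repeats often enough, and the gap grows with the trip count; for straight-line code executed only once, the relation reverses.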
 
Luminescent said:
Look at where the solid P4 emphasis on clock rate has led Intel now. Perhaps a proper IPC to clock rate balance is necessary after all. ;)

I disagree. I think that a P4 with integrated memory controller can beat an Athlon64.
Additionally, an integrated memory controller would call for a new socket spec, which means they can increase the amount of power drawn from the socket. This allows Intel to continue ramping up the clockspeed, while AMD will still be stuck at about 2.6 GHz with their outdated Athlon architecture.

I wouldn't be surprised at all if the next Athlon borrows heavily from the P4's ideas and adopts things like trace cache, double-pumped ALUs, HyperThreading, and of course a longer pipeline to increase clockspeed.

I think both the P4 and Athlon64 have potential, but the Athlon64 needs a redesign of the core, while the P4 needs relatively simple modifications to move ahead.
 
Scali said:
I disagree. I think that a P4 with integrated memory controller can beat an Athlon64.
Additionally, an integrated memory controller would call for a new socket spec, which means they can increase the amount of power drawn from the socket. This allows Intel to continue ramping up the clockspeed, while AMD will still be stuck at about 2.6 GHz with their outdated Athlon architecture.

You honestly believe Intel can implement an integrated controller in the current P4 design while maintaining 3.6 GHz and minimizing heat levels? Or would a clock speed reduction be in order?
 
Scali said:
I disagree. I think that a P4 with integrated memory controller can beat an Athlon64.
Additionally, an integrated memory controller would call for a new socket spec, which means they can increase the amount of power drawn from the socket.
AFAIK, the LGA775 socket can easily deliver enough power to supply a dual-core Prescott (!); the problems with Prescott's power consumption and clock speed are not a result of inadequate socket design and are not solved by a new socket.
This allows Intel to continue ramping up the clockspeed, while AMD will still be stuck at about 2.6 GHz with their outdated Athlon architecture.
Both Intel and AMD have been quite slow at ramping their clock speeds over the last 2 or so years - why would things change for Intel and not AMD?
I wouldn't be surprised at all if the next Athlon borrows heavily from the P4's ideas and adopts things like trace cache, double-pumped ALUs, HyperThreading, and of course a longer pipeline to increase clockspeed.
The Prescott ALUs apparently are not double-pumped anymore; at least according to Intel documents, there are no longer any ALU instructions that execute in 1/2 cycle. AFAIK, the design of the double-pumped ALUs didn't lend itself to 64-bit operation. Hyperthreading is very tricky to get right, and tends to cause problems with the cache (2 threads => the working set of data doubles => more cache misses => a much smaller performance increase than one might expect).
I think both the P4 and Athlon64 have potential, but the Athlon64 needs a redesign of the core, while the P4 needs relatively simple modifications to move ahead.
 
Scali said:
I think it's naive to think that Intel didn't try to improve the performance per clock. Obviously Intel has a team of highly qualified professionals, who don't leave a stone unturned when it comes to improving performance.

I don't think anyone is questioning the technical expertise of NetBurst's designers. But CPU designers must work with preliminary (i.e., educated-guess) performance models of their targeted fabrication process; they plan the microarchitecture around a certain set of 'process design rules.' By the time the design team reaches the 'tape-out' milestone, the foundry guys are just ramping up pilot runs and sample silicon; everyone crosses their fingers and hopes the development models aren't too different from actual lab silicon.

Nothing is worse than finding out that your ALU pipeline, which performed to the limits of your design goals in simulation, suddenly burns 2X the power, misses the clock target by 20%, etc. "But all our practice wafers with SRAM cells performed so accurately!" With the insane number of manually placed-and-routed structures in a modern CPU, it's too late to redo the architecture, and even simple point tweaks are a real chore.

If Intel's designers knew back then what they know now, I doubt they'd have invested so much effort in lengthening the Netburst pipeline (from 20 to 31 stages) -- the CPU's power dissipation limits the max frequency anyway.

Obviously the P4 is designed in the way that Intel thought would give the best performance. They just happened to choose a different compromise in terms of clockspeed and IPC than AMD. Actually, AMD uses the same compromise as Intel used for their PIII, so AMD is still behind on developments. It just happens to work out to their advantage at this time.

With Intel canning the 4GHz Prescott, it's clear even Intel feels a frequency-oriented microarchitecture (Netburst) has hit a brick wall. Maybe a future fab process (45nm?!?) with better clock scalability will once again tip the favor toward high-clock, deep-pipeline architectures. But for the foreseeable future, it looks like *both* vendors are pursuing performance boosts through alternative means (dual-core). I don't expect AMD to pursue the road Intel just abandoned!
 
trinibwoy said:
You honestly believe Intel can implement an integrated controller in the current P4 design while maintaining 3.6 GHz and minimizing heat levels? Or would a clock speed reduction be in order?

I don't see why not. The memory controller is relatively small compared to the rest of the CPU, and it doesn't have to run at the full clockspeed.
Perhaps they could even trade in some L2 cache for the memory controller.
 
arjan de lumens said:
AFAIK, the LGA775 socket can easily deliver enough power to supply a dual-core Prescott (!); the problems with Prescott's power consumption and clock speed are not a result of inadequate socket design and are not solved by a new socket.

Do you have a source for this?

Both Intel and AMD have been quite slow at ramping their clock speeds over the last 2 or so years - why would things change for Intel and not AMD?

I already answered that. P4's scaling is not limited by the architecture itself. Just use a big PSU and overclock a Prescott yourself, and you'll see how far they can go if you ignore socket specs.
 
asicnewbie said:
If Intel's designers knew back then what they know now, I doubt they'd have invested so much effort in lengthening the Netburst pipeline (from 20 to 31 stages) -- the CPU's power dissipation limits the max frequency anyway.

Like I said, perhaps they overshot their goal. But sticking to PIII was not an option either.

With Intel canning the 4GHz Prescott, it's clear even Intel feels a frequency-oriented microarchitecture (Netburst) has hit a brick wall.

Is it? That depends on the reasons why they canned the 4 GHz Prescott, doesn't it?
Also, what do you make of the fact that they chose Netburst as their base for dual-core desktop CPUs?
According to Intel, they favour the move to multi-core over the pursuit of higher clockspeeds. It's a matter of channeling your resources.
This doesn't mean it hit a brick wall; it simply means that at this time they think it's a better investment to go for multi-core. Which makes sense of course: at 90 nm you can put a lot of transistors on a chip. I wouldn't be surprised if there will be a 4 GHz dual-core P4 in the not-so-distant future.
They just opted for a 3.x GHz multicore before a 4 GHz model, single or multi.
See: http://www.pcworld.com/news/article/0,aid,118165,00.asp

I don't expect AMD to pursue the road Intel just abandoned!

Intel didn't abandon anything; they're still using the Netburst architecture, they're just going to use 2 of them per chip. And if I'm right about the addition of the memory controller to the P4 architecture beating the Athlon64, then AMD will wish they had taken Intel's road.
 
Scali said:
3dilettante said:
I didn't mean P4's decoding scheme was inferior, only that it dedicates fewer resources to initial decoding. The P4 has a single decode unit capable of decoding a single x86 instruction per clock from memory. Athlon dedicates something along the lines of three parallel decoders, but winds up redecoding a lot of what the P4 trace cache would hold.

Read what I said again. The Athlon has to decode x86 at every pass, but the first pass is always slower. It may have 3 decoders instead of P4's single decoder, but because the first pass is slower due to instruction boundary finding and cache issues and such, there's no reason to believe that the first pass on an Athlon is at all faster, based on this information.
In subsequent passes, the P4's trace cache is obviously not slower than the Athlon's x86 decoders, since they can both effectively issue 3 instructions per clock. The difference is that the P4 can ALWAYS issue those 3 instructions, because it is no longer dependent on weird complex x86 instructions, while for the Athlon 3 instructions is the best case.

On initial decode, is x86 code always filled with instructions of unknown boundaries and indeterminate decode/extension status?

I thought that with properly tuned compilers the more common case was groups of relatively simple instructions, with a higher chance of two (or, rarely, three) being decoded at the same time on an Athlon.

I guess it doesn't matter too much in the end, as neither P4 nor Athlon goes back to a full redecode after the first pass. P4 has the trace cache, while Athlon stores predecode bits in the L1 cache. The Athlon method isn't as elegant or effective as a trace cache, but it reduces the relative performance delta.
 
3dilettante said:
On initial decode, is x86 code always filled with instructions of unknown boundaries and indeterminate decode/extension status?

Yes, that's the main problem of x86 code. Virtually all other modern CPUs have a fixed instruction size, so the CPU doesn't need to handle instruction size or alignment or anything. On x86 there's no alternative: you HAVE to be able to handle all code.

I thought that with properly tuned compilers the more common case was groups of relatively simple instructions, with a higher chance of two (or, rarely, three) being decoded at the same time on an Athlon.

Yes, but this only affects the subsequent passes. As I stated before, the first pass has other things to worry about. So claiming the P4 is slower just because it only uses one decoder during this pass is an over-simplification. Perhaps Intel figured that because of the memory/cache/instruction-boundary/etc. bottlenecks, a single decoder would be sufficient for maximum decoding speed.

I guess it doesn't matter too much in the end, as neither P4 nor Athlon goes back to a full redecode after the first pass. P4 has the trace cache, while Athlon stores predecode bits in the L1 cache. The Athlon method isn't as elegant or effective as a trace cache, but it reduces the relative performance delta.

It does matter, though. Without the trace cache, the P4 would be much less efficient than it is now. And without a trace cache, you'll probably never get any x86 over ~3 GHz; at least the Athlon design won't make it.
 
Scali said:
It does matter, though. Without the trace cache, the P4 would be much less efficient than it is now. And without a trace cache, you'll probably never get any x86 over ~3 GHz; at least the Athlon design won't make it.

I was agreeing with you that the first-pass consideration I had been discussing was not particularly important compared to the performance that can be gained afterwards.

I didn't mean that the trace cache was meaningless, just that in a comparison between the two architectures, the lead that the P4 would have had if the Athlon didn't store predecode bits would have been much greater.

I still can't tell where and when exactly the Athlon calculates its predecode bits.

Whether or not the K8 can make it to 3 GHz on 90 nm is something I can't tell. It might be possible, but probably only barely.

edit: I just read that there is a very hefty predecode unit in K8 that grabs 16 bytes and brute-force calculates the instruction boundaries in parallel, taking 2 cycles. These predecoded instructions are then passed to the 3 decoders.

This occurs in the first 2 fetch cycles of the pipeline, meaning that the K8's decoders could potentially decode a peak of 3 x86 instructions even on the first pass.
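Here is a toy version of that predecode step (my own sketch: real x86 length decoding involves prefixes, ModRM/SIB bytes and variable immediates, and the real K8 unit evaluates all 16 start positions in parallel rather than scanning serially as done here). It walks a 16-byte fetch window using a tiny invented opcode subset and marks the boundary bits that the 3 decoders would then consume:

#include <stdio.h>
#include <stdint.h>

/* Length of the (grossly simplified) instruction starting at p. */
static int toy_len(const uint8_t *p)
{
    switch (*p) {
    case 0x90:            return 1;  /* nop                  */
    case 0x40: case 0x48: return 1;  /* inc/dec eax (32-bit) */
    case 0xB8:            return 5;  /* mov eax, imm32       */
    case 0x04:            return 2;  /* add al, imm8         */
    default:              return 1;  /* unknown: pretend 1   */
    }
}

int main(void)
{
    uint8_t window[16] = { 0x90, 0xB8, 1, 0, 0, 0, 0x04, 7,
                           0x48, 0x90, 0xB8, 2, 0, 0, 0, 0x90 };
    uint8_t boundary[16] = { 0 };

    /* Mark a start-of-instruction bit at each boundary. */
    for (int i = 0; i < 16; i += toy_len(&window[i]))
        boundary[i] = 1;

    for (int i = 0; i < 16; i++)
        printf("%d", boundary[i]);
    printf("  <- boundary bits for the 16-byte fetch window\n");
    return 0;
}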
 
Scali said:
3dilettante said:
On initial decode, is x86 code always filled with instructions of unknown boundaries and indeterminate decode/extension status?

Yes, that's the main problem of x86 code. Virtually all other modern CPUs have a fixed instruction size, so the CPU doesn't need to handle instruction size or alignment or anything. On x86 there's no alternative: you HAVE to be able to handle all code.

I thought that with properly tuned compilers the more common case was groups of relatively simple instructions, with a higher chance of two (or, rarely, three) being decoded at the same time on an Athlon.

Yes, but this only affects the subsequent passes. As I stated before, the first pass has other things to worry about. So claiming the P4 is slower just because it only uses one decoder during this pass is an over-simplification. Perhaps Intel figured that because of the memory/cache/instruction-boundary/etc. bottlenecks, a single decoder would be sufficient for maximum decoding speed.

I guess it doesn't matter too much in the end, as neither P4 nor Athlon goes back to a full redecode after the first pass. P4 has the trace cache, while Athlon stores predecode bits in the L1 cache. The Athlon method isn't as elegant or effective as a trace cache, but it reduces the relative performance delta.

It does matter, though. Without the trace cache, the P4 would be much less efficient than it is now. And without a trace cache, you'll probably never get any x86 over ~3 GHz; at least the Athlon design won't make it.

You sure the Athlon design won't make it? AMD has the FX-55 at 2.6 GHz, which I've seen overclocked stably to 2.8-2.9 GHz on stock cooling.
 
OC + boot + some gaming != stable. OCing doesn't actually tell you that much about a clock ceiling unless we're talking huge sample sizes, with everything stock (including voltage) and every system going through extensive testing.
 
For 130 nm, I'd agree it's not going to happen, but AMD's 90 nm process hasn't really matured yet, so it still might be possible to squeeze the K8 core past 3 GHz within the next year.

If not, they'll probably make a few K8 derivatives at 65 nm as their version of a Duron when the next-gen stuff comes out. There's a pretty good chance that they can cross it then.
 