Pentium4 Editorial ....... The "Replay" Feature

Cowboy X

This is the third article in a series at Xbit on the P4 and the now out-of-vogue NetBurst architecture. This third article raises some interesting questions about a feature called replay, and points to what may be one of the reasons why the P4 loses to other processors running at a lower clock speed. Replay, according to what I have read of the article so far, causes a cache miss on the P4 to be far worse than would already be expected with heavy pipelining: it results in the entire micro-op having to be rerun from the beginning ... replay. This can happen several times, which causes significant performance hits. Anyway:

http://www.xbitlabs.com/articles/cpu/display/replay.html
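
To put rough, entirely made-up numbers on how a miss gets amplified, here is a quick Python sketch of the effect the article describes; the latencies and loop length below are invented for illustration, not taken from the article.

Code:
# Back-of-the-envelope sketch (all numbers invented for illustration):
# dependent uops caught behind a missed load keep circling the replay
# loop, burning an execution slot on every trip until the data arrives.

miss_latency   = 140   # cycles to fetch the data from memory (illustrative)
replay_loop    = 12    # cycles per trip around the replay loop (illustrative)
dependent_uops = 4     # uops in the dependency chain behind the load

trips        = -(-miss_latency // replay_loop)   # ceiling division
wasted_slots = trips * dependent_uops            # executions that produced nothing

print(f"{trips} replay trips, {wasted_slots} wasted execution slots")
# A plain stall would wait the same ~140 cycles but would not re-execute anything.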
 
There must be some greater point to this replay; Intel's engineers aren't stupid, let alone stupid enough to put in something that reduces speed and efficiency.
 
ANova said:
There must be some greater point to this replay; Intel's engineers aren't stupid, let alone stupid enough to put in something that reduces speed and efficiency.

It's part of the speculation mechanisms in P4s. The core speculates that a load will "hit" in the d$ and can schedule dependent instructions earlier than if the schedulers had to wait for it to be a known fact that the load hits the data cache. If the load misses, the dependent instructions are rescheduled once the data is ready (replay).

This shortens the apparent schedule-execute latency that is so important for performance. The downside is that it requires extra execution resources (like having two double-pumped ALUs) and wastes power.
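
In code terms, the idea looks roughly like the little Python toy below. To be clear, the latencies and the loop structure are assumptions for illustration only, not the actual scheduler.

Code:
# Toy model of speculative scheduling + replay (illustrative only; the
# numbers and loop structure are assumptions, not the real P4 scheduler).

L1_HIT_LATENCY  = 2    # cycles the scheduler *assumes* a load takes
L1_MISS_LATENCY = 18   # cycles the load really takes if it has to go to L2
REPLAY_LOOP_LEN = 12   # cycles for a rejected uop to come back around

def load_plus_dependent(load_hits_l1: bool):
    """Finish cycle and wasted slots for: r1 = load [mem]; r2 = r1 + 1."""
    wasted = 0
    add_exec_cycle = L1_HIT_LATENCY     # the add is dispatched assuming a hit
    data_ready = L1_HIT_LATENCY if load_hits_l1 else L1_MISS_LATENCY

    # Each time the add reaches the ALU before its input is ready, it is
    # rejected and sent around the replay loop to try again later.
    while add_exec_cycle < data_ready:
        wasted += 1                        # an execution slot burned for nothing
        add_exec_cycle += REPLAY_LOOP_LEN  # next attempt, one loop later

    return add_exec_cycle + 1, wasted      # +1 cycle for the add itself

print(load_plus_dependent(load_hits_l1=True))   # hit: no replays, short latency
print(load_plus_dependent(load_hits_l1=False))  # miss: replays burn ALU slots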

Cheers
Gubbi
 
Gubbi said:
ANova said:
There must be some greater point to this replay; Intel's engineers aren't stupid, let alone stupid enough to put in something that reduces speed and efficiency.

It's part of the speculation mechanisms in P4s. The core speculates that a load will "hit" in the d$ and can schedule dependent instructions earlier than if the schedulers had to wait for it to be a known fact that the load hits the data cache. If the load misses, the dependent instructions are rescheduled once the data is ready (replay).

This shortens the apparent schedule-execute latency that is so important for performance. The downside is that it requires extra execution resources (like having two double-pumped ALUs) and wastes power.

Cheers
Gubbi

Right, and this is why they implemented hyperthreading, correct? To take advantage of its long pipelines that sit waiting for the next set of instructions.
 
Exactly, ANova.

The thing is, going from Willy to Northwood the IPC went up a fair bit and the performance was quite good, to say the least. I wonder why they didn't keep going in that direction. It's sad: towards the end of Northwood, before the Prescott details surfaced, people were expecting a significant refinement of Northwood, one which would further push the clock rate but, moreover, push the IPC much higher.
 
ANova said:
Right, and this is why they implemented hyperthreading, correct? To take advantage of its long pipelines that sit waiting for the next set of instructions.

Not really; don't confuse pipeline length with data-dependency latencies. When speculatively scheduled, instructions experience back-to-back latencies as low as ½ a cycle, and the load-to-use d$ latency for Northwood was a record-breaking 2 cycles at 3.2 GHz. Pipeline length only increases the branch mispredict penalty.
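
To put approximate numbers on why depth mostly shows up at mispredicts, a quick sketch; the stage counts below are the commonly quoted figures and should be treated as rough.

Code:
# Rough illustration: dependency latency comes from the bypass/scheduler,
# while a mispredicted branch costs roughly a whole pipeline refill.
# Stage counts are the commonly quoted figures; treat them as approximate.

clock_ghz = 3.2
for name, stages in [("Northwood, ~20 stages", 20), ("Prescott, ~31 stages", 31)]:
    penalty_ns = stages / clock_ghz      # refill cost in wall-clock time
    print(f"{name}: ~{stages} cycles = {penalty_ns:.1f} ns per mispredict")

# A dependent ALU op still sees ~0.5-cycle back-to-back latency and a
# ~2-cycle load-to-use regardless of how deep the pipeline is.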

They added Hyperthreading because it was of limited cost: no extra entries were added to the ROB, and no extra rename registers were added. So it was a kind of "free" capability in terms of silicon real estate (or at least very cheap) and could add latency tolerance through added TLP. The reason it was of limited benefit (or downright harmful) in the P4 prior to Prescott was that the ROB, the rename registers and various buffers (like the write-combine buffers) would all be split between the two contexts, thereby lowering single-thread performance.

Prescott doubled the number of rename buffers, increased various buffer sizes, etc.
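
As a concrete, purely illustrative picture of that partitioning cost; the sizes below are commonly quoted ballpark figures, not exact ones.

Code:
# Toy picture of the partitioning cost described above. Sizes are the
# commonly quoted ballpark figures; treat them as illustrative.

def per_thread(entries: int, active_contexts: int) -> int:
    """Entries one thread gets when a structure is split between contexts."""
    return entries // active_contexts

NORTHWOOD_RENAME = 128
PRESCOTT_RENAME  = 256   # "doubled the number of rename buffers"

for name, size in [("Northwood", NORTHWOOD_RENAME), ("Prescott", PRESCOTT_RENAME)]:
    print(name,
          "single thread:", per_thread(size, 1),
          "both contexts busy:", per_thread(size, 2))
# Northwood with HT loaded ends up with half the rename space per thread,
# hence the single-thread hit; Prescott's bigger buffers soften it.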

Cheers
Gubbi
 