A better way to do OoOE.

nelg

Veteran
Abstract
RingScalar is a complexity-effective microarchitecture for
out-of-order superscalar processors, that reduces the area,
latency, and power of all major structures in the instruction
flow. The design divides an N-way superscalar into N columns
connected in a unidirectional ring, where each column
contains a portion of the instruction window, a bank of
the register file, and an ALU. The design exploits the fact that
most decoded instructions are waiting on just one operand to
use only a single tag per issue window entry, and to restrict
instruction wakeup and value bypass to only communicate
with the neighboring column. Detailed simulations of four-issue
single-threaded machines running SPECint2000 show
that RingScalar has IPC only 13% lower than an idealized
superscalar, while providing large reductions in area, power,
and circuit latency.
Thoughts?
 

I haven't read it yet, but at first glance it sounds more like an attempt at a more effective way to implement superscalar processing, given that the chip is already out-of-order (OoO).

OoO itself isn't necessarily too painful until the chip becomes superscalar.
From what you've quoted, they appear to have made the result bypass, register access, and instruction issue hardware scale more linearly with respect to width.

This can be helpful, since something like a bypass network scales quadratically in delay with issue width.
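
To put rough numbers on that, here is a minimal sketch that counts forwarding paths rather than delay, as a crude proxy for wiring cost. It assumes a conventional full bypass network in which every ALU result can reach both operand inputs of every ALU, and a ring in which each column forwards one result to its neighbor only. The function names and the two-operand assumption are mine, not the paper's.

```python
def full_bypass_paths(width: int, operands_per_alu: int = 2) -> int:
    """Paths in a full bypass network (assumed model): every ALU output
    can be forwarded to every operand input of every ALU."""
    return width * width * operands_per_alu


def ring_bypass_paths(width: int) -> int:
    """Paths in a unidirectional ring (assumed model): each column
    forwards its result to the next column only."""
    return width


if __name__ == "__main__":
    # Compare how the two counts grow as issue width doubles.
    for n in (2, 4, 8):
        print(f"width={n}: full={full_bypass_paths(n)}, ring={ring_bypass_paths(n)}")
```

Under these assumptions, doubling the width quadruples the full-bypass count but only doubles the ring count. It says nothing about the IPC cost of restricting where results can go.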

Clustered execution isn't a new idea, nor is dividing a processor into lanes behind the front end.
The K8 has a lane concept that is similar for instruction issue and retirement, but it doesn't touch the bypass network or register accesses, nor was it quite so restrictive in ALU association.

What this paper posits seems to be a degenerate case of clustered execution units, where each cluster is the bare minimum needed to execute an instruction.

There are costs to variable bypass delay and inter-lane communication. I think POWER5 had an issue where it took 2 cycles to bypass results, which cost something like 2-5% in performance.
With the other restrictions here, the penalty could be a bit higher.

I'm also wondering how this delay in result propagation could affect register renaming, since conceivably the front end could rename a register before a neighboring lane knows about it (probably only minor synchronization is needed; I'll see).

I'll report back on whether I'm too far off once I've read the PDF.
 