That makes sense. So it ignores data dependencies and speculates on control dependencies to load as much as possible. It'll be interesting to see if it's actually a win, since it's effectively using two threads to do one thread's work.
As long as memory latencies keep falling further behind clock frequencies, it can be a big win in a lot of areas. OO cores can speculate past branches, but only as far as they have space in their load/store buffers and queues. If the processor just fires off loads for the cache to pick up, its speculation reaches much farther than an OO core that goes for maybe 20 instructions and then sits there.
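(A rough software analogue of the same idea: if the addresses can be computed ahead of the demand stream, you can fire off prefetches so several misses are in flight at once instead of being serialized behind a full load/store queue. A minimal sketch, assuming GCC/Clang's __builtin_prefetch; the kernel and the lookahead distance are made up for illustration.)

    #include <stddef.h>

    double gather_with_scout(const double *table, const int *idx, size_t n)
    {
        const size_t DIST = 64;  /* how far ahead the "scout" prefetches run */
        double sum = 0.0;

        for (size_t i = 0; i < n; i++) {
            if (i + DIST < n)
                __builtin_prefetch(&table[idx[i + DIST]], 0, 1);  /* warm the cache ahead of demand */
            sum += table[idx[i]];  /* the demand load hopefully hits the warmed line */
        }
        return sum;
    }

The point isn't the prefetch intrinsic itself; it's that the scout's work needs no architectural result, so it can run arbitrarily far ahead of the real computation.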
From your description it's clear it'll help single-thread performance in most cases, since the actual executing thread will have a higher hit rate in the caches than it would without the scout. The only exception would be if the scout speculates off on a wild goose chase, wasting bandwidth that would otherwise have been used to load needed data (i.e. it only hurts when the memory bus is saturated).
In a single-threaded instance, both the OO core and the scouted core would be speculating past a long-latency event like a cache miss, and both would be waiting on a critical memory access to go through. The difference is that the OO core is going to stop speculating and warming the cache much sooner. The OO core will cover the misses that are serviced from cache; the scout won't add much there. But OO can't beat the memory wall.
A naive scout would trash the cache, but a good implementation would capture a lot of the low-hanging memory-level parallelism that OO cores harness. An OO core would probably get minimal benefit, since it already does a lot of what the scout does, but a simpler, higher-clocked in-order core or a weaker OO core would benefit tremendously.
A ROB itself should scale linearly with size. A bigger one will of course be slower, and running it at the same latency costs more power. But look at two wildly different approaches:
1.) P4 with one big, fat 128-entry instruction ROB.
2.) K8 with multiple smaller ROBs.
In a standard Tomasulo OO core, the ROB may scale linearly in terms of rename registers and even remain fixed in terms of register ports and result buses.
What does not scale linearly is the cost of dependency checking, which can be done with hardware coupled closely to the ROB or in the scheduling hardware. That scales quadratically: N^2 - N is the trend in the number of necessary checks, though real designs shave it down by some constant factor.
OO is brute-force: every entry in the ROB must check every other entry for register dependencies. That is a lot of wires, a lot of silicon, and a lot of switching in the critical loop. This is why modern cores have stagnated in terms of ROB size.
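To put rough numbers on that N^2 - N growth, here's a throwaway sketch; the window sizes are just illustrative, not any particular chip's figures (though 128 matches the P4 example above):

    #include <stdio.h>

    int main(void)
    {
        /* Pairwise register-dependency checks for an N-entry window: roughly
         * N^2 - N, before whatever constant-factor tricks a real design uses. */
        const long long sizes[] = { 24, 72, 128, 256 };
        for (int i = 0; i < 4; i++) {
            long long n = sizes[i];
            printf("%3lld entries -> %6lld pairwise checks\n", n, n * n - n);
        }
        return 0;
    }

Going from 24 to 128 entries is a bit over a 5x bigger window but roughly 30x more checks, which is why the window sizes stalled.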
K8 is particularly interesting IMO. It groups instructions in threes, and thereby effectively has 3 smaller ROBs in parallel feeding its global scheduler.
Originally, K7 would stall if an instruction got stuck in the wrong slot. K8 dedicates an entire pipeline stage to avoiding this, which is part of why the pipeline is longer and why instruction scheduling sits in the critical timing path.
This is the same technique used by IBM in the Power4/5 and PPC970, which have 5 (4 + branch) instructions in a group. From the global scheduler, instructions are issued to the int and fp schedulers, which are smaller and faster.
Not exactly; the instruction grouping scheme saves on scheduling resources by tracking instruction groups instead of individual instructions, as long as the instruction stream meets the rules for full issue. When it doesn't, the chip either stalls or inserts some kind of empty slot.
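To make that concrete, here's a loose sketch of what per-group tracking might look like; the structure, field names, and group size are assumptions for illustration, not IBM's or AMD's actual design:

    #include <stdbool.h>
    #include <stdint.h>

    #define GROUP_SIZE 5   /* e.g. 4 instructions + a branch slot, Power4-style */

    /* Hypothetical group-tracking entry: completion, exception, and retirement
     * bookkeeping is done once per group rather than once per instruction. */
    struct group_entry {
        uint8_t valid_slots;      /* which of the GROUP_SIZE slots are filled;
                                     empty slots are the holes inserted when the
                                     grouping rules can't be met */
        uint8_t completed_slots;  /* set as each slot's instruction finishes */
        bool    fault;            /* any fault/mispredict flushes the whole group */
    };

    /* A group retires only when every filled slot has completed. */
    static bool group_can_retire(const struct group_entry *g)
    {
        return !g->fault &&
               (g->completed_slots & g->valid_slots) == g->valid_slots;
    }

The tracking logic touches one entry per group instead of one per instruction, which is where the savings come from; the price is the wasted slots whenever the grouping rules force a hole.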
There's no reason this principle of hierarchical ROBs couldn't be extended (indefinitely, as we see with caches).
Besides physics and mathematics, sure.