In a modern data-capture scheduler, the number of register-tag comparators scales with N.
But store queues would be a problem.
Cheers
Mea culpa.
I had a brain failure in that post. The N^2-N scaling applies specifically to the width of a superscalar processor, not to whether it is OoO. I mixed up some stuff in my head with that one. That's why nothing but VLIW or similarly explicitly parallel processors has gone beyond a width of (sometimes) 5.
OoO complexity should be linear, or at least not strictly quadratic, for a processor with issue width N. It's superscalar processing that mucks things up quadratically, since issue width is the factor that multiplies the number of reservation stations/schedulers, result buses, and register ports.
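A quick back-of-the-envelope sketch of the kind of growth I mean (Python, purely illustrative; real designs vary in the exact constants):

def bypass_paths(issue_width):
    # Each of the N results produced in a cycle can, in the worst case, be
    # forwarded to any of the other N-1 instructions in flight that cycle,
    # so the forwarding/comparator network grows roughly as N^2 - N.
    return issue_width * (issue_width - 1)

for n in (2, 4, 8):
    print(n, bypass_paths(n))  # 2 -> 2, 4 -> 12, 8 -> 56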
BTW, for those who are interested, I found a publicly accessible (non-$$$ IEEE) version of the Out of Order Commit (OoOC) paper:
http://www.cs.utah.edu/classes/cs7940-010-rajeev/spr04/papers/oooc.pdf
I'd be interested in hearing 3dilettante's opinion on this, and perhaps a combination of this with TLP/scouting architectures.
I think the idea's interesting, though I wonder whether Out of Order Commit is really the right name for it. Perhaps Deferred Commit or Checkpointed Commit would fit better. Commit is still in order; it just isn't done on a per-instruction basis.
I think the idea itself is interesting, but they were a little light on the details of their theoretical implementation, and a little heavy on the assumptions with regard to performance. The paper measures cost in a few added bits, but ignores the fact that the actions those bits trigger also count when cycle time is involved. I'll be rereading it to make sure I didn't miss some of the points, because there were some confusingly worded portions of the paper.
To sum up my thoughts for those who don't want to read the rest of this post:
I fear it may be more of a niche than it appears at first glance. Having a lot of instructions in flight is hard to handle fast in silicon, though easy on paper or in a simulator. Extra speculation isn't always better; it has to have a good chance of being correct. And speculation in general can hurt performance when more than one chip is involved.
Long post ahead...
The ROB concept is heavy-handed in the way it works: it maintains precise exceptions even when they aren't needed. As long as nothing goes wrong, a checkpointed processor should conceptually do just as well.
What I'm curious about is whether the ROB itself is the limiting factor in performance. I'll have to look into it, but the fact that it represents a limit on the instruction window may just mean that it is only as big as it needs to be to service other, more critical portions of the OoO pipeline.
For the most part, the ROB only needs to worry about a total of N instructions and their registers per cycle (N being the issue width of the processor). It just commits up to N instructions at the top of the ROB. It could commit more if designed to, but it never will, because the most it can ever expect to get per cycle is the N that were issued.
That's a fixed timing cost, and it doesn't really change whether the ROB is five, ten, or 6000 entries long. The ROB may be a problem at larger sizes, but other things get in the way sooner.
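A minimal sketch of what I mean, assuming a simple circular-buffer ROB (the structure and names here are mine, not anything from the paper):

from collections import deque

ISSUE_WIDTH = 4  # N

def commit_cycle(rob):
    # Commit only ever examines the N oldest entries at the head of the ROB,
    # so the per-cycle work is O(N) regardless of how long the ROB is.
    committed = 0
    while committed < ISSUE_WIDTH and rob and rob[0]["done"]:
        entry = rob.popleft()
        # here the entry's result would be written to architectural state
        committed += 1
    return committed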
The CAM, for example, has always been a timing-critical headache for designers of OoO processors. Register allocation requires looking up at least two values for every register in a standard CAM: the logical mapping and the valid/free bit. The addition of a future free bit looks like only 1 extra bit, so no big deal in theory. But it means the CAM's workload per cycle has increased significantly, perhaps by several times in bad cases.
A register rename no longer involves just that one register; it has to search the entire register list in the CAM to find, out of the many physical registers mapped to that logical register, the one that doesn't yet have its future free bit set.
Perhaps this value can be cached, but it still means something like a doubling or tripling of the work the CAM must do, and that work looks very serial in nature.
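To illustrate the kind of scan I'm worried about, here's my reading of it in simulator terms (my names and structure, not the paper's hardware):

def rename_lookup(cam_entries, logical_reg):
    # A conventional rename is one associative match: logical -> current physical.
    # With a future free bit, the rename must also pick out, among every physical
    # register ever mapped to this logical register, the one whose bit is still clear.
    candidates = [e for e in cam_entries if e["logical"] == logical_reg]
    live = [e for e in candidates if not e["future_free"]]
    return live[0] if live else None

In silicon that's a wider match plus a uniqueness/priority check, which is where the extra, serial-looking work comes from.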
The next problem, which the theoretical chip could handle well but which isn't covered in the paper, is that rename registers are quickly consumed and not released until a checkpoint is reached. Tomasulo OoO is a big boon in cases where register pressure becomes a problem (x86 really likes register renaming), but the checkpointed processor as defined in the paper stands to fill its rename space much more quickly.
It brings back register pressure in situations where a ROB would free up entries very quickly. They mention some ways of conserving rename registers, but not in detail, and without weighing the costs in hardware and wire delay.
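Rough arithmetic on why this worries me (made-up numbers, just to show the shape of it):

def phys_regs_tied_up(reg_writes_per_cycle, cycles_per_interval):
    # Every register-writing instruction grabs a fresh physical register, and
    # none of them can be freed until the whole checkpoint interval retires.
    return reg_writes_per_cycle * cycles_per_interval

print(phys_regs_tied_up(3, 100))  # ~300 physical registers pinned by one long interval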
The paper also ignores the problem of load/store queues being cycle-critical, which is one of the big problems facing OoO chips. Even the Commit processor must stall if its queues fill up, and it can fill its queues just as easily as another chip can fill its ROB.
The paper worries more about the register renaming and commit phases, but it is vague on the implementation details. There are some problem cases I have ideas about handling, but they were not mentioned in the paper.
Having that many in-flight instructions means that, even without the ROB, there are still reservation stations and result buses involved (maybe there aren't, but the paper said nothing at all about them).
A thousand in-flight instructions means at least 1000 reservation station entries. These were not explained away by the paper. Registers are allocated, but it doesn't discuss where the knowledge of what will use them goes; I'm assuming some kind of reservation station or scheduling unit exists at each execution unit or cluster thereof.
If the processor has something like 10 reservation stations, that's 100 entries per station, or around 200 rename register tags at two source operands each, that must be searched every time a result comes down the (undefined in this paper) result bus(es), and every time the station looks to send an instruction into an execution unit.
If reservation stations aren't cycle critical before, they will be after.
They could be removed, but that means instruction latencies go waaaaayyy up.
If the processor is superscalar, that becomes 200*N checks per station in the worst case. Tagged buses and other measures can be used instead of brute-force forwarding, but the sheer number of possible targets for an updated value is going to be a problem from a logic, timing, and electrical standpoint. I don't know what their ideal implementation includes. They may have thought of this, but they didn't talk about it.
A theoretical processor in a simulator can just say commit(value) or update.reservation.station(register,value).
That's one software call, and the human running the simulator can easily assume it's just as easy for silicon to do the same thing. It hardly ever is.
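To make that concrete, here is roughly what a single innocuous-looking update stands for in hardware terms (a Python sketch of the wakeup loop I'm describing; the sizes and structures are my assumptions, not the paper's):

def broadcast_result(stations, result_tag, value):
    # One "update" per result, but in hardware every waiting source operand in
    # every reservation station compares its tag against the result, every cycle.
    comparisons = 0
    for station in stations:               # e.g. 10 stations
        for entry in station:              # e.g. 100 entries each
            for operand in entry["srcs"]:  # e.g. 2 source tags per entry
                comparisons += 1
                if not operand["ready"] and operand["tag"] == result_tag:
                    operand["value"] = value
                    operand["ready"] = True
    return comparisons  # ~2000 tag compares for one result, times N results per cycle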
Also, the size of the checkpoint queue entries may be underestimated. I didn't see mention of the instruction pointer at a checkpoint, system registers, or the values of the registers. Those will be needed in the case of a context switch or a procedure call, in which the entire visible register set is supposed to be copied out to memory.
Without storing those values, any kind of context switch becomes a lot more painful.
Either the write-out involves multiple lookups in the CAM, or the processor may violate software permissions or virtualization by copying out data that isn't supposed to be seen.
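Just to spell out what I think a checkpoint entry would have to carry to make context switches sane (my guess at the contents; the paper doesn't specify them):

from dataclasses import dataclass, field

@dataclass
class Checkpoint:
    pc: int                 # instruction pointer at the checkpoint
    rename_map: dict        # logical -> physical register mapping at that point
    arch_values: list       # architectural register values, or a cheap way to rebuild them
    flags: int              # condition codes / status register
    system_state: dict = field(default_factory=dict)  # privilege level, control regs, etc.

If the architectural values aren't reachable without walking the CAM, every context switch or spill of the visible register set gets painful.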
Processor flags need to be kept up to date (flags are one of the reasons x86 is such a pain to do OoO; parts of the flags register get renamed to cope with it). A bad flag assumption is of course the same as a mispredict. The odds of finding one mis-set flag in 1000 instructions are pretty high, but it won't become clear until every instruction in a checkpoint interval indicates it is done.
I'm also curious as to how many instructions in the processor are invalidated due to a mispredict or other issue. They didn't say exactly how much of their IPC was meaningful. I assume all of it, since they should have factored the invalidated work out, but I feel this is a potential deal-breaker on a lot of workloads.
Beyond a certain point, most of the in-flight instructions will be mispredicted.
The instruction window may be larger, but it won't be more right. This is probably part of the reason why ROBs aren't pushed much further in size: the returns just aren't there without better prediction (which is also cycle-critical and not getting much better).
Without an amazing branch predictor, it may be better in the long run to just scout and not try to commit at all.
In the same vein, it looks like the overall latency of execution is going to rise. The average number of mispredicts and exceptions per checkpoint interval is going to add a certain amount of average latency to each instruction.
This factor will also be influenced by the average size of the intervals. Since exceptions are not detected until every instruction in an interval is done, a lot of work is going to be done before it is known something is wrong.
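A crude way to eyeball that cost (the numbers are made up; this just shows how the overhead scales with interval size):

def added_latency_per_instr(interval_size, p_fault_per_instr, replay_cost_per_instr=1.0):
    # Chance that at least one instruction in the interval faults or mispredicts,
    # in which case (roughly) the whole interval is rolled back and redone.
    p_interval_bad = 1 - (1 - p_fault_per_instr) ** interval_size
    wasted_cycles = p_interval_bad * interval_size * replay_cost_per_instr
    return wasted_cycles / interval_size  # amortized cost per instruction

print(added_latency_per_instr(64, 0.001))    # ~0.06 extra per instruction
print(added_latency_per_instr(1024, 0.001))  # ~0.64 extra: most of the work gets redone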
Worse, the chip won't know what went wrong. It will have to go back through each instruction and pay attention this time, which may complicate the pipeline. (There are ways, I think, this could be handled better than what the paper describes.)
Worst of all, memory operations are the most complicated when it comes to exceptions, and they can take forever. It may be that load 10 of 15 has an access fault, but the chip won't know until all instructions, including loads 1 through 15, are done. Then it will have to run the loads again to find the one that went wrong. The replay should be somewhat faster, since the data will likely be in cache. Or maybe not: heavy speculation can thrash a cache, and some streaming loads explicitly avoid it.
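Roughly the recovery flow I'm picturing, in simulator terms (my sketch of a rollback-and-replay, not the mechanism the paper actually spells out):

def replay_interval(checkpoint_state, interval):
    # checkpoint_state: architectural state saved at the checkpoint (a dict)
    # interval: list of (name, op) pairs, where op(state) returns False on a fault
    state = dict(checkpoint_state)         # roll back to the checkpoint
    for index, (name, op) in enumerate(interval):
        if not op(state):                  # re-execute carefully, one op at a time
            return ("fault", index, name)  # only now do we know *which* op faulted
    return ("clean", None, None)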
This is a natural side effect of relaxing commit timing. If you make things fuzzy, they can possibly be faster, but if things go wrong, reality gets harder on you. Unless correctness is thrown out the window, the picking-up-the-pieces case gets harder.
If the number of errors is small and the interval size is small, the overall time may only be slightly longer than that of an ROB chip with equal timings.
Keeping intervals small, though, is contrary to a lot of optimizations that try to lengthen branch intervals, and to how a lot of the memory subsystem and fetch logic tries to work.
I'm still looking at the part about slow-lane decode. It has some parallels to what Intel did internally with the P4's replay logic (even though that was something internal to execution, not issue).
A possible issue with that is the same thing that may have really hurt the P4 with long replay queues: corner cases lead to a storm of pointless execution clogging the pipeline (or the issue pipeline in this case). I'm not sure; I need to digest that part of the paper more.
It seems to me an OoOC with a checkpoint table size of 1 entry and no load/store queue/commit is sort of like the scouting case. (all commits are rolled back to checkpoint)
I think for the most part it is, since none of the memory traffic or register alteration is kept.
1) if all loads of interest have succeeded since last checkpoint, and you have executed some more loads after a new checkpoint, and you rollback to last checkpoint, then you have effectively "scouted" forward.
It's the same thing with ROB chips; they just don't go quite as far as the idea in the paper. That isn't necessarily a bad thing if, in all likelihood, the path of execution is going to be incredibly wrong a thousand instructions out (and it likely will be).
In which case, I'd go for a scout instead and make my chip's cycle time shorter by dumping the extra complicated commit logic.
3) and for a bonus, if you allow *software control* of where to begin/end transaction boundaries in addition to the automatic heuristic mode described in the OoOC paper, one can gain Software Transactional Memory up to the limit of the load/store queue size.
I think the chip might actually need software control if it is going to handle weird code.
Already, there is sort of a user control over this, as exceptions/traps cause a rollback and retry since the last checkpoint. However, imagine adding 2 new instructions: CHK_BEGIN (force a new checkpoint), CHK_COMMIT valid_register (ok to commit memory writes when all loads are finished since last checkpoint, or abort, depending on value of register).
It sounds kind of like expanded predication, since it also uses a register's value to invalidate instructions.
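To paraphrase how I read the proposed pair of instructions, here's their behavior in simulator terms (my interpretation; the paper doesn't define these instructions):

class CheckpointedCore:
    def __init__(self):
        self.memory = {}            # architecturally visible memory
        self.pending_writes = {}    # stores buffered since the last checkpoint

    def chk_begin(self):
        # Force a new checkpoint: from here on, stores are buffered, not visible.
        self.pending_writes = {}

    def store(self, addr, value):
        self.pending_writes[addr] = value

    def chk_commit(self, valid_register):
        # Commit the buffered writes if the register says so, otherwise drop them.
        if valid_register != 0:
            self.memory.update(self.pending_writes)
        self.pending_writes = {}
        return valid_register != 0

That gives you something transaction-like up to whatever the load/store queue can hold, which is the STM-ish angle being suggested.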
Might be workable on a single core where you might have a unified load/store queue between threads, but if you have an SMP situation, you'd need to snoop across buses, so you'd still need to fall back on the more expensive techniques.
The big problem for any kind of aggressive speculation and scouting is that a lot of traffic out of the chip is going to bog down a multiprocessor setup. If the chips need to play nice, it may be better if they restrict their efforts to instructions they know have a good chance of working out right.
Perhaps some switch-on-event multithreading to go between threads, then leave them when things get more uncertain.
Quality vs. quantity concerns arise if the chips need to share resources.
I haven't quite thought through the full implications of scouting and checkpointing on software locking logic in a multiprocessor setup. I think there may be headaches, not that there aren't already a ton of them.
I'd better wrap this up; the size of this post may already kill most of the discussion in the thread. I have some other ideas that could be discussed, if anyone likes.