Future console CPUs: will they go back to OoOE, and other questions.

I was just browsing to see if I could find some SpecINT figures for Cell

Doubt they'll post any. SPECint tests single-threaded integer performance, and the code can't be modified, so Cell is unlikely to do very well.

That said, SPEC tests a lot more than the processor; it's common to see libraries included in tests which cost more than the processor being tested! SPEC-friendly compiler tweaks are also not exactly unknown.

The SPEC 2006 figures are interesting though: they haven't been "cracked" by compiler developers yet, and they don't fit in caches (why do you think Core 2 does so well?). Currently they show there's not a lot of difference between the processors listed (reference systems and the iMac excepted).
 
That's really taking the p*ss on page 29. They may as well just say "we hate Windows PCs" and be done with it!

They are just quoting the IBM blurb about the "software Cell" or whatever IBM called it. The idea was that the SPE would be a secure, protected sandbox in which code could run. This would allow networked devices to download and run network-based mini-programs in SPE object code, in much the same way as a JVM runs Java bytecode today. Hence the talk about Cell-powered mobile phones (presumably as a 3D media player/GPU), TVs, etc. (Cell here meaning SPEs, not the PS3 Cell chip).

Hence the terms Immersive, 3D interactivity, Distributed, User data accessible
everywhere, Device-agnostic, Wireless, etc.
 
The SPEC 2006 figures are interesting though: they haven't been "cracked" by compiler developers yet, and they don't fit in caches (why do you think Core 2 does so well?). Currently they show there's not a lot of difference between the processors listed (reference systems and the iMac excepted).

There's not a lot of difference between them because all the submitted results are for either Opteron or P4 (Xeon) systems. The only exceptions are the iMac (which is a Core Duo, not a Core 2 Duo) and a few low-scoring Sun UltraSPARC systems (low-end servers and blades).

Cheers
 

It is telling, however, that the only in-order processors currently in the benchmark results are at the very bottom of the list (UltraSPARC II, III, and IIIi).

SPECint2006: http://www.spec.org/cpu2006/results/cint2006.html
SPECfp2006: http://www.spec.org/cpu2006/results/cfp2006.html
 
Um DiGuru, I never said anything about instruction sets or the number of logical registers exposed, and I'm not specifically saying out-of-order isn't needed, just that the ILP-centric OoOE architecture is showing its age and is not an optimal design for throughput.

I was trying to say that perhaps we can have the best of both worlds: a CPU that reuses existing on-chip resources to deal with memory loads/stores and branches in an optimal fashion depending on whether workloads are dominated by ILP or TLP, and that perhaps we can even get some hardware support for transactional memory in the process.

You will never get rid of out-of-order execution; it is inherent in any workload that must do I/O. Even databases load blocks from disk out of order, depending on opportunistic prefetching, because it costs too much to wait for the disk to spin to position X when you know you'll need X through X+5 anyway.

Essentially, any code in which some operations can potentially take orders of magnitude more cycles than others can benefit from OoO techniques or scouting+TLP. That's because the chip needs to be able to move forward and do work even as it is waiting for past instructions to complete, as well as predict and prefetch/initiate future I/O before it is needed.

All of this is a form of pipelining to overlap operations which have a fixed latency, and thus you want the latency (where nothing is being done but waiting) to overlap other work.
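As a rough software-side analogy of that overlap (a minimal sketch only; __builtin_prefetch is a GCC/Clang builtin, and the prefetch distance of 8 is a made-up number, not tuned for any real machine):

Code:
#include <stddef.h>

/* Sketch: start a long-latency load early so the fixed memory latency
 * overlaps useful work instead of being paid as a stall. */
double sum_with_prefetch(const double *a, size_t n)
{
    double sum = 0.0;
    for (size_t i = 0; i < n; i++) {
        if (i + 8 < n)
            __builtin_prefetch(&a[i + 8]); /* kick off a future load now */
        sum += a[i];                       /* useful work hides the latency */
    }
    return sum;
}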

I guess what I was trying to explain is that current OoO approaches can't extract much more than 4-way ILP, and at enormous silicon cost, while the media-rich applications of tomorrow are going to have a lot more TLP. So the chips of tomorrow will have to be good at both ILP and TLP, and while the brute-force "multicore" method may work, perhaps there's a more elegant solution that uses less silicon for nearly the same performance.

Hardware scouting is way way cheaper than OoOE for warming caches and predictors, and it can scout thousands of instructions ahead, whereas OoOE designs are typically limited to small windows and enormous silicon cost.

However, hardware scouting throws away the work done decoding instructions ahead. It could instead compute results, store them, and commit them out of order, but that requires out-of-order commit techniques.

However, TLP, scouting, out of order commit, and transactional memory all use checkpointing techniques, so there is some natural symmetry between these features which may be exploited.
 
It is telling, however, that the only in-order processors currently in the benchmark results are at the very bottom of the list (UltraSPARC II, III, and IIIi).

It certainly is.

But to be honest the scores are rather fishy; there are faster US IIIi cores out there (1.6GHz vs 1.2GHz), so why benchmark old CPUs with a brand spanking new benchmark (and one that has higher requirements in general than the old SpecCPU)?

I think Sun is planning to use this in a push to sell new SPARC-64 and Opteron based servers, and needs the low score to tell their existing customers:

"look our new SUN server range is up to 3 times faster than your old Sun kit" :)

I might be paranoid though.

Cheers
 
...whereas OoOE designs are typically limited to small windows and enormous silicon cost.

Sorry, that "and" should be an or.

The problem is that the distribution of data-dependency latencies in any given program on a modern processor looks something like this:
Code:
  cycles   |  share of latencies
-----------+---------------------
    0-10   |        70%
   10-50   |        25%
  50-100   |         0%
 100-500   |         0%
 500-5000  |         5%
(This is from memory; I don't have the bookmark for the paper these figures came from to hand, so they are likely to be off, but you get the idea.)

The bottom bucket is of course main memory accesses.

A ROB can be built at modest silicon cost to deal with the latencies below 50 cycles. The problem is of course that in order to absorb the latencies of main memory accesses the ROB would have to be at least an order of magnitude larger, and you get a very modest return from gradually increasing your ROB past the 50-cycle mark (really the highest latency of the on-die cache hierarchy) until it can absorb main memory latency.
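To put rough numbers on that (a back-of-envelope sketch only; the issue width and latencies below are assumed, not taken from any particular chip):

Code:
#include <stdio.h>

/* Back-of-envelope: to keep issuing during a stall of L cycles at issue
 * width W, roughly W * L instructions must be in flight, so the window
 * (and ROB) must be able to hold them. Numbers are illustrative only. */
int main(void)
{
    int width  = 4;    /* assumed issue width                   */
    int on_die = 50;   /* assumed worst on-die cache latency    */
    int dram   = 400;  /* assumed main-memory latency in cycles */
    printf("entries to cover on-die latency: %d\n", width * on_die); /* 200  */
    printf("entries to cover DRAM latency:   %d\n", width * dram);   /* 1600 */
    return 0;
}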

Trying to extract ILP on this scale looks like an extraordinary waste of time, especially considering that it would require an oracle of a branch predictor to work anyway.

That is what motivates people to look for other types of parallelism. TLP is an obvious candidate because threads are by nature independent entities with clearly defined synchronization points.

What you suggest sounds more like a microthread architecture (still threads, but not in the OS sense), where m-threads are spawned and merged all the time; e.g. an independent m-thread could be spawned for each iteration of a loop. The main challenge here is of course to merge state in an effective manner (especially when dealing with memory).
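As a very loose software analogy for that per-iteration spawning (OpenMP used only as an analogy; a real micro-threaded CPU would spawn and merge far below the OS-thread level, and the hard hardware state merge is exactly what OpenMP hides here):

Code:
#include <omp.h>

/* Analogy only: one logical "thread" per loop iteration, with an implicit
 * merge of all iterations at the end of the parallel region. */
void scale_all(float *v, int n, float k)
{
    #pragma omp parallel for
    for (int i = 0; i < n; i++)
        v[i] *= k;
}   /* implicit join/merge of every iteration's work happens here */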

Cheers
 
But to be honest the scores are rather fishy; there are faster US IIIi cores out there (1.6GHz vs 1.2GHz), so why benchmark old CPUs with a brand spanking new benchmark (and one that has higher requirements in general than the old SpecCPU)?

Well, it looks like the Ultra 2 is the baseline for 2006. I believe it was an Ultra 5 for CPU2000.
 
In a modern data-capture scheduler the number of register-tag comparators scales with N.

But store queues would be a problem.

Cheers

Mea culpa.

I had a brain failure in that post. The N^2-N scaling applies specifically to the width of a superscalar processor, not to whether it is OoO; I mixed some things up in my head with that one. That's the reason why nothing but VLIW or similarly explicitly parallel processors has gone beyond a width of (sometimes) 5.

OoO complexity should be linear, or at least not strictly quadratic, in the issue width N. It's superscalar processing that mucks things up quadratically, since width is another factor that controls the number of reservation stations/schedulers, result buses, and register ports.

BTW, for those who are interested, I found a publicly accessible (non-paywalled) version of the Out of Order Commit (OoOC) paper: http://www.cs.utah.edu/classes/cs7940-010-rajeev/spr04/papers/oooc.pdf

I'd be interested in hearing 3dilettante's opinion on this, and perhaps a combination of this with TLP/scouting architectures.

I think the idea's interesting, though I wonder whether Out of Order Commit is really the right name for it. Perhaps Deferred Commit or Checkpointed Commit: the instruction commit is still in order, it just isn't done on a per-instruction basis.

I think the idea itself is interesting, but I think they were a little light on the details of their theoretical implementation, and a little heavy on the assumptions with regard to performance. The paper measures costs in a few added bits, but ignores the fact that actions also count when cycle time is involved. I'll be rereading it to make sure I didn't miss some of the points involved, because there were some confusingly worded portions of the paper.

To sum up my thoughts for those who don't want to read the rest of this post:
I fear it may be more of a niche than it appears at first glance. Having a lot of instructions in flight is hard to handle fast in silicon, though easy on paper or in a simulator. Extra speculation isn't always better; it has to have a good chance of being correct. And speculation in general can hurt performance when more than one chip is involved.

Long post ahead...

The ROB concept is heavy-handed in the way it works, as it maintains precise exceptions, even when they aren't needed. As long as nothing goes wrong, a checkpointed processor should do just as well, conceptually.

What I'm curious about is whether the ROB itself is the limiting factor in performance. I'll have to look into it, but the fact that it represents a limit on the instruction window may just mean that it is only as big as it needs to be to service other, more critical portions of the OoO pipeline.
For the most part, the ROB only needs to worry about a total of N instructions and their registers per cycle (N being the issue width of the processor). It just commits up to N instructions at the top of the ROB. It could commit more if designed to do so; it just won't, because the most it can ever expect to get are the N that are issued.

That's a fixed timing cost, and it doesn't really change whether the ROB is five, ten, or 6,000 entries long. The ROB may be a problem at larger sizes, but other things get in the way sooner.
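A toy model of that fixed per-cycle commit cost (the structure and names are invented purely for illustration):

Code:
#include <stdbool.h>

/* Toy ROB commit: each cycle only the first N (= issue width) entries at
 * the head are even looked at, so per-cycle work is fixed no matter how
 * many entries the ROB has in total. */
typedef struct { bool done; } rob_entry;

int commit_cycle(rob_entry *rob, int rob_size, int head, int width)
{
    int committed = 0;
    for (int i = 0; i < width; i++) {
        rob_entry *e = &rob[(head + i) % rob_size];
        if (!e->done)
            break;          /* in-order: stop at the first incomplete entry */
        committed++;        /* this instruction retires */
    }
    return committed;
}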

The CAM, for example, has always been a timing-critical headache for designers of OoO processors. Register allocation requires looking up at least two values for every register in a standard CAM: the logical mapping and the valid/free bit. The addition of a future-free bit only looks like one extra bit, so no big deal in theory. But it means the CAM's workload per cycle increases significantly, perhaps by many times in bad cases.
A register rename no longer involves just that one register; it has to search the entire register file list in the CAM to find the one matching logical register, out of many, that doesn't yet have its future-free bit set.
This value could perhaps be cached, but it still means something like a doubling or tripling of the work a CAM must do, and that work looks very serial in nature.
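A toy sketch of why that lookup turns into a scan (the entry layout is invented for illustration, not taken from the paper):

Code:
#include <stdbool.h>

/* Toy rename CAM entry. With only a valid bit, renaming touches one
 * mapping; with a future-free bit, finding the *live* mapping for a
 * logical register means walking every physical-register entry. */
typedef struct {
    int  logical;       /* architectural register this entry maps        */
    bool valid;
    bool future_free;   /* will be freed when an older checkpoint retires */
} rename_entry;

int find_live_mapping(const rename_entry *cam, int n_phys, int logical_reg)
{
    for (int p = 0; p < n_phys; p++) {      /* inherently serial scan */
        if (cam[p].valid && cam[p].logical == logical_reg && !cam[p].future_free)
            return p;                       /* the current mapping */
    }
    return -1;                              /* not currently mapped */
}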

The next problem, one the theoretical chip might handle well but which isn't covered in the paper, is that rename registers are quickly consumed and not released until a checkpoint is reached. Tomasulo-style OoO is a big boon in cases where register pressure becomes a problem (x86 really likes register renaming), but the checkpointed processor as defined in the paper stands to fill its rename space much more quickly.
It brings back register pressure in situations where a ROB would free up entries very quickly. They mention some ways of conserving rename registers, but not in detail, and not with the costs in hardware and wire delay considered.

The paper also ignores the problem of load/store queues being cycle-critical, which is one of the big problems facing OoO chips. Even the Commit processor must stall if its queues fill up, and it can fill its queues just as easily as another chip can fill its ROB.

The paper worries more about register renaming and the commit phases, but it is vague on the implementation details. There are some problem cases I have ideas about handling, but they were not mentioned in the paper.
Having that many in-flight instructions means that even without the ROB, there are still reservation stations and result buses involved (there may not be, but the paper didn't say anything at all about them).

A thousand in-flight instructions means at least 1000 reservation station entries. These were not explained away by the paper. Registers are allocated, but the paper does not discuss where the knowledge of what will use them goes; I'm assuming some kind of reservation station or scheduling unit exists at each execution unit or cluster thereof.

If the processor has something like 10 reservation stations, that's 200 rename registers that must be searched through every time a result comes down the (undefined in this paper) result bus(es), and every time it looks to send an instruction to an execution unit.
If reservation stations weren't cycle-critical before, they will be after.
They could be removed, but that means instruction latencies go way, way up.

If the processor is superscalar, it becomes 200*N checks in the worst case. Tagged buses and other measures can be used instead of brute-force forwarding, but the sheer number of possible targets for an updated value is going to be a problem from a logic, timing, and electrical standpoint. I don't know what their ideal implementation includes. They may have thought of this, but they didn't talk about it.

A theoretical processor in a simulator can just say commit(value) or update.reservation.station(register, value).
That's one software call, and the human running the simulator can easily assume it's just as easy for silicon to do the same thing. It hardly ever is.

Also, the size of the checkpoint queue entries may be underestimated. I didn't see mention of the instruction pointer at a checkpoint, system registers, or the values of the registers. Those will be needed in the case of a context switch or a procedure call, in which the entire visible register set is supposed to be copied out to memory.
Without storing those values, any kind of context switch becomes a lot more painful.
Either the write-out involves multiple lookups in the CAM, or the processor may violate software permissions or virtualization by copying out data that isn't supposed to be seen.

Processor flags need to be kept up to date (flags are one of the reasons why x86 is such a pain to do OoO; implementations rename parts of the flag register to cope). A bad flag assumption will of course be the same as a mispredict. The odds of finding at least one mis-set flag in 1000 instructions are pretty high, but it won't become clear until every instruction in a checkpoint interval indicates it is done.
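Some illustrative arithmetic on that (the per-instruction failure rate below is a made-up number, just to show how quickly it compounds over a 1000-instruction interval):

Code:
#include <stdio.h>
#include <math.h>

int main(void)
{
    double p = 0.002;  /* assumed chance any one instruction spoils the interval */
    int    n = 1000;   /* instructions per checkpoint interval                   */
    /* Probability the whole interval survives with no bad flag/branch/exception */
    printf("P(interval survives) = %.3f\n", pow(1.0 - p, n));  /* about 0.14 */
    return 0;
}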

I'm also curious as to how many instructions in the processor are invalidated due to a mispredict or other issue. They didn't say exactly how much of their IPC was meaningful. I assume all of it, since they should have factored it out, but I feel this is a potential deal breaker on a lot of workloads.
Beyond a certain point, most of the in-flight instructions will be mispredicted.

The instruction window may be larger, but it won't be more right. This is probably part of the reason why ROBs aren't pushed much further in size: the returns just aren't there without better prediction (which is also cycle-critical and not getting much better).
It may be better in the long run to just scout and not try to commit at all, unless you have an amazing branch predictor.

In the same vein, it looks like the overall latency of execution is going to rise. The average number of mispredicts and exceptions per checkpoint interval is going to add a certain amount of average latency to each instruction.

This factor will also be influenced by the average size of the intervals. Since exceptions are not detected until every instruction in an interval is done, a lot of work is going to be done before it is known that something is wrong.
Worse, the chip won't know what went wrong. It will have to go back and step through each instruction, paying attention this time, which may make the pipeline complicated. (There are ways, I think, this could be handled better than how the paper states.)

Worst of all, memory operations are the most complicated when it comes to exceptions, and they can take forever. So it may be that load 10 of 15 has an access fault, but the chip won't know until all instructions, including loads 1 through 15, are done. Then it will have to run the loads again to find the one that went wrong. They should be somewhat faster to run since they will likely be in cache... or not: a lot of speculation can thrash a cache, and some streaming loads explicitly avoid the cache.

This is a natural side-effect of relaxing commit timing. If you make things fuzzy they can possibly go faster, but when things go wrong, reality gets harder on you. Unless correctness is thrown out the window, the clean-up case gets harder.

If the number of errors is small and the interval size is small, the overall time may only be slightly longer than an ROB chip of equal timings.
This is contrary to a lot of optimizations that try to lengthen branch intervals, and it is contrary to how a lot of the memory subsystem and fetch logic tries to work.

I'm still looking at the part about slow-lane decode. It has some parallels to what Intel did internally with the P4's replay logic (even though that was internal to execution, not issue).
A possible issue with it is the same thing that may have really hurt the P4 when it came to long replay queues: corner cases would lead to a storm of pointless execution clogging the pipeline (or the issue pipeline in this case). I'm not sure; I need to digest that part of the paper more.

It seems to me an OoOC with a checkpoint table size of 1 entry and no load/store queue/commit is sort of like the scouting case. (all commits are rolled back to checkpoint)

I think for the most part it is, since none of the memory traffic or register alteration is kept.

1) if all loads of interest have succeeded since last checkpoint, and you have executed some more loads after a new checkpoint, and you rollback to last checkpoint, then you have effectively "scouted" forward.
It's the same thing with ROB chips, they just don't go quite as far as the idea in the paper. This isn't necessarily a bad thing if in all likelihood the path of execution is going to be incredibly wrong a thousand instructions out (and it likely will be).

In which case, I'd go for a scout instead and make my chip's clock cycles faster by dumping the extra complicated commit logic.

3) and for a bonus, if you allow *software control* of where to begin/end transaction boundaries in addition to the automatic heuristic mode described in the OoOC paper, one can gain Software Transactional Memory up to the limit of the load/store queue size.

I think the chip might actually need software control if it is going to handle weird code.

Already, there is sort of a user control over this, as exceptions/traps cause a rollback and retry since the last checkpoint. However, imagine adding 2 new instructions: CHK_BEGIN (force a new checkpoint), CHK_COMMIT valid_register (ok to commit memory writes when all loads are finished since last checkpoint, or abort, depending on value of register).
It sounds like a kind of expanded predication, since it also uses a register's value to make an instruction invalid.
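A minimal sketch of how such software-controlled checkpoints might be used (chk_begin/chk_commit are purely hypothetical stand-ins for the CHK_BEGIN/CHK_COMMIT instructions above, stubbed out here so the sketch compiles; real hardware would discard the speculative stores on an aborted commit):

Code:
#include <stdbool.h>

static void chk_begin(void)        { /* hypothetical: force a new checkpoint        */ }
static bool chk_commit(bool valid) { return valid; /* hypothetical: commit or roll back */ }

/* All stores between the two marks stay speculative until chk_commit. */
bool push(int *stack, int *top, int limit, int value)
{
    chk_begin();
    bool ok = (*top < limit);
    if (ok)
        stack[(*top)++] = value;   /* speculative until committed */
    return chk_commit(ok);         /* an abort would throw the store away */
}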

Might be workable on a single core where you might have a unified load/store queue between threads, but if you have an SMP situation, you'd need to snoop across buses, so you'd still need to fallback to the more expensive techniques.

The big problem for any kind of aggressive speculation and scouting is that a lot of traffic out of the chip is going to bog down a multiprocessor setup. If the chips need to play nice, it may be better if the chips restrict their efforts to instructions they know are going to have a good chance of working right.
Perhaps some switch-on-event multithreading to go between threads, then leave them when things get more uncertain.
Quality vs. quantity concerns arise if the chips need to share resources.

I haven't quite thought through the full implications of scouting and checkpointing on
software locking logic in a multiprocessor setup. I think there may be headaches, not that there aren't already a ton of them.


I'd better wrap this post up; I think the size of this post may already kill most of the discussion in the thread. I have some other ideas that could be discussed, if anyone likes.
 
Believe it or not, I've read your whole post. I must admit I had to read some paragraphs twice or thrice before I understood some concepts. There are a number of terms (ROB, mispredicts, exceptions, CAM, etc.) that are still unknown to me. You are an engineer, aren't you? :smile:

Your opinions make for a good and interesting read. I for one appreciate the effort put into this post.

Btw, back on topic, I firmly believe MS wanted an OoO chip mainly because of backwards compatibility (it would make it an easier task, imo)... In-order chips like Xenon can emulate OoO ones perfectly; I can't say the same the other way round, though. Imho, Xenos flops are unreachable for OoO, finite processors.

 
Believe it or not, I've read your whole post. I must admit I had to read some paragraphs twice or thrice before I understood some concepts. There are a number of terms (ROB, mispredicts, exceptions, CAM, etc.) that are still unknown to me.

I can give some definitions for those.
I don't know how much you already know, so I'm going to set out some basic concepts, as far as I know them. Forgive me if it's too basic, I just want to help out anyone who wants to know.

ROB (ReOrder Buffer): In an out of order processor, the chip fetches, decodes, and tries to issue instructions so that they begin executing as soon as their data is available.
Because data doesn't always arrive in the order that the program's instructions are found, this leads to instructions executing out of program order, or Out of Order.

However, the results still have to make sense, and should look to the program like the chip followed its directions. The ROB is the part of the chip that takes the results and commits them (that is, makes them part of the system's externally visible state) in order.

Until an instruction's result is committed, it is considered speculative, because something could still happen that makes the processor discard it.

*******
Mispredicts: In pretty much all but the lowest-power processors, the chip doesn't just work on one instruction at a time. If it did, a lot of its hardware would go unused most of the time.

In a basic theoretical chip, every instruction goes through a cycle of fetch-decode-execute-writeback(commit). Each stage takes a set amount of time, or tick on a clock.

If the chip only worked on one instruction at a time, then the hardware for all but one of the stages would be idle, and three-quarters of the time would be spent waiting.
With pipelining, the chip tries to run things so that there is an instruction in each of those stages at all times. That way, the chip overlaps the work of multiple instructions, so they can be spit out one after the other.

However, there's a risk involved. Branch instructions change the direction of execution, which means that the next instruction after a branch is uncertain until that branch has completed execution.
A chip could just sit and wait, but that opens up a lot of idle time, and pipelining has a performance penalty if the pipeline is not kept well-fed.

So what is usually done is the chip picks a direction using some kind of branch prediction, and it hopes that it picks well. The pipeline then fills up with instructions that the chip assumes are the correct ones, but they won't be confirmed to be right until the branch is finished.

Depending on how long a pipeline is, and how many instructions are in any given stage, that can amount to a lot of instructions the chip hopes are the right ones.

With good hardware/software prediction, the chip is usually right about the path it takes.

A mispredict occurs when the path chosen was wrong, and now the processor must discard all the work put into every instruction that followed the mispredicted branch. This wastes time, energy, and incurs a significant performance penalty.
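Some illustrative arithmetic on what that costs on average (all three numbers below are assumed for the example, not measured from any real chip):

Code:
#include <stdio.h>

int main(void)
{
    double branch_rate   = 0.20;  /* assume 1 in 5 instructions is a branch        */
    double mispredict    = 0.05;  /* assume the predictor is right 95% of the time */
    double flush_penalty = 20.0;  /* assume 20 cycles to refill the pipeline       */
    /* Average cycles wasted per instruction just on bad guesses */
    printf("wasted cycles per instruction: %.2f\n",
           branch_rate * mispredict * flush_penalty);  /* 0.20 */
    return 0;
}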

****
Exceptions: The dirty secret about all instructions is that it isn't just branches that can change program order.
An exception is just another way of saying something has gone wrong, and execution cannot continue the way it was expected.
Bad memory accesses or a divide by zero are examples of exceptions.
When those are encountered, the processor has to try to handle them, which usually involves branching to special code that picks up the pieces and possibly throws an ugly error message on the screen. Either that or it locks up or destroys data.

Everything that follows an exception in the pipeline can't be trusted, so those have to be junked as well.
It isn't always necessary to catch every little error, or even to care exactly when it happens. This is imprecise exception handling, and when a chip is designed to allow it, it saves a lot of checking that would otherwise have to be crammed into a clock cycle.

Programs usually like precise exceptions, since it helps to know when exactly something goes wrong and where it went wrong as soon as possible.

****
CAM - Content Addressable Memory

The most popular way to execute instructions out of order is based on Tomasulo's algorithm, which basically takes an instruction, quietly renames the registers it refers to, and places the instruction and its renamed registers in a holding area called a reservation station that sits in front of an execution unit such as an adder or multiplier.
Every cycle, the reservation station checks which instructions have all their renamed registers filled and tries to send them to the execution unit. After executing, the results are fed to the ROB and also back to the reservation stations to update other renamed registers.

Register renaming helps get around a lot of the stumbling blocks created by software, which by definition is set in a given order and rarely changes. If the register names given in software are not changed, then the first instruction that happens to reuse a register name will force the processor to stop execution and wait, even if the data being passed around doesn't conflict.

The idea that it's the actual data (or Content) in a register and not its address that is important leads to the CAM. The CAM has the difficult job of renaming registers so that false dependencies based on register name are avoided.

There are other ways to rename registers, but the general idea is the same.
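A tiny illustration of the false dependencies that renaming removes (ordinary C variables standing in for architectural registers):

Code:
/* "r1" is reused for two unrelated values. Without renaming, the second
 * write to r1 would have to wait behind the first use (a name-only,
 * "false" dependency). With renaming, the second r1 silently gets a
 * different physical register and both halves can proceed in parallel. */
int false_dependency_demo(int a, int b, int c, int d)
{
    int r1;
    r1 = a * b;        /* long-latency multiply writes r1              */
    int x = r1 + 1;    /* true dependency: really needs the multiply   */
    r1 = c + d;        /* reuses only the *name* r1, not its old value */
    int y = r1 * 2;
    return x + y;
}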

You are an engineer, aren't you? :smile:

I'm in the computer field, but I'm not an engineer, at least not yet.
I'm still working on that.

Your opinions make for a good and interesting read. I for one appreciate the effort put into this post.
Thanks. I might go back and try to reorganize my thoughts, they were a bit disjoint above.

Btw, back on topic, I firmly believe MS wanted an OoO chip mainly because of backwards compatibility (it would make it an easier task, imo)...

There are a lot of reasons why OoO is desirable. MS has a lot of PC developers who would be a great supply of talent for making games, but the PC realm hasn't really dealt with the nitty-gritty stuff that console devs have worried about for a long time. PC guys just waited for Intel to put out a better chip that would handle code well, even if it wasn't really good code.
In-order chips can perform well, but they are a lot more picky. Unpredictable latencies and other inconvenient "uh-ohs" are often handled better if the chip can bend the rules a little.

Game developers have a lot of stuff they need to work on, so it would have been convenient and cost-effective if they could have gotten a design that allowed them to save time and lots of cash by making it easier to get performance out of the chip.
Compilers are not yet that perfect, and optimizations are not always friendly to stability or to meeting deadlines.

Multi-threading is also new to a lot of PC devs, so the in-order stuff is a double-whammy for the less elite companies out there.

In-order chips like Xenon can emulate OoO ones perfectly; I can't say the same the other way round, though.

The fun part about computational machines is that any machine can be made to simulate another machine. The only question is how much time you have to waste, and whether you want to get sued.

Imho, Xenos flops are unreachable for OoO, finite processors.
It's unreachable for current processors, and probably will be for many, many years. Being special-purpose allows Xenos to focus its resources on only a few tasks. General-purpose chips have to worry about a lot of other workloads and problems that Xenos never needs to worry about.
 
3dilettante said:
Game developers have a lot of stuff they need to work on, so it would have been convenient and cost-effective if they could have gotten a design that allowed them to save time and lots of cash by making it easier to get performance out of the chip.
I disagree with the notion of saving 'lots of cash' - why would there be any dramatic impact on costs from having lower upper-tier performance early on? Assuming your team isn't led by idiots, all it should mean is a somewhat scaled-back (or worse-performing, especially if you're EA) game.

People still carry this stigma from the early PS2 days, when the myth was conveniently attached to supposedly "uber-complex" hardware. Early projects on PS2 drove up costs (aside from the next-gen transition driving costs up in the first place) because there was no startup software to work with, not because the hardware was difficult to use.

If 360/PS3 SDK came packaged with nothing but a GCC compiler and Memory Card library, then you would have a point. :p
 
I'm probably just getting old but as much as I like playing with esoteric hardware, I'd rather just have something that works when it comes to production.

It's probably because I consider technology to be a service to the people who build the games, and because I don't consider it the core problem anymore, but I think these esoteric architectures distract focus from the game itself.

Bigger teams make it harder to control code quality, and architectures that are very sensitive to the way code is written will be under exploited.

I'm working on a PC product currently and some of the code in it is just plain scary, but it's amazing how well the system just copes. A lot of the really bad stuff will be fixed before we ship, a lot of it will have to be, but at some level it's nice to get decent performance while you're still experimenting.

Don't get me wrong: if it's just me writing code, I want to work with CELL or with Xenos, but if I'm shipping a game with a big team, I'm all for easing the technical burden as far as possible.

And I hate working on PC :p
 
ERP said:
Bigger teams make it harder to control code quality, and architectures that are very sensitive to the way code is written will be under exploited.
I agree - it'll take longer to exploit properly even in top titles, and in average apps it will be under-exploited, but I was just arguing that that in itself doesn't increase the time or cost to develop.

Don't get me wrong: if it's just me writing code, I want to work with CELL or with Xenos, but if I'm shipping a game with a big team, I'm all for easing the technical burden as far as possible.
I can relate to that, but isn't there some massive paradigm shift required here on the software side with the advent of parallel computing? If we continue working primarily in C++, easing the technical burden will only become MORE difficult as time goes by.
 
I disagree with the notion of saving 'lots of cash' - why would there be any dramatic impact on costs from having lower upper-tier performance early on? Assuming your team isn't led by idiots, all it should mean is a somewhat scaled-back (or worse-performing, especially if you're EA) game.

I'm speaking from the point of view that Microsoft wanted to bring in the broadest possible developer base from the PC side, so it could build up a game library to buffer the PS3's arrival. This would involve the smaller teams with little multi-threading experience and little low-level scheduling experience.

The transition to multicore is a challenge itself, having to worry about performance optimizations just to get acceptable (not good, not great) performance is another one.

There are a lot of gotchas that will lead to missed deadlines and an overall drop in the quality and stability of next-gen games, since the rest of the system (deadlines, asset production, publishing) shows little sign of adjusting.
Or we could just hope for a lot more of Madden *insert number here*.
 
ERP said:
I'm probably just getting old but as much as I like playing with esoteric hardware, I'd rather just have something that works when it comes to production.
Distinct lack of testosterone, practical, negative attitude to effort - yup, getting old. :)

It's probably because I consider technology to be a service to the people who build the games, and because I don't consider it the core problem anymore, but I think these esoteric architectures distract focus from the game itself.
Beware, down that path lieth the Wii.
Seriously, there is a reason why the architectures are changing. One large factor is that lithographic scaling isn't bringing the benefits it used to, and, while not imminent, the eventual tapering off of lithographic scaling itself is actually in sight. Another large factor is that extracting higher single-thread IPC is costing an increasing number of transistors for decreasing dividends. Taken together, you have a choice: a drastic slowdown of computational performance development, or embracing adapted methods for expressing algorithms.

The choice is simple in my book; nobody promised C++ forever (it didn't even exist when I wrote my first lines of code). I can easily understand the inertia in a production environment, but stability in this case equals stagnation. And is it ever realistic to assume that old skills won't be continuously obsoleted, adapted, and added to? Particularly in a technology field, and over a lifetime? Isn't it, from a personal-satisfaction point of view, even preferable? Hell, pottery still evolves today, so why should software engineers be exempt from adaptation?

I'm seriously middle-aged, and I have no problem understanding the pressures that tight schedules put on developers, or that they may be compelled to make compromises. Time and money are both limited resources. But I have also seen in writing a number of statements that I feel indicate something more. (Not pointing at you here, ERP.) A conservatism, a lack of readiness to adapt, a wish to hold the wave back rather than learn to surf. And I find this hostility to having to learn new skills a bit disturbing coming from people many years younger than me who work in a technology field.
 
Distinct lack of testosterone, practical, negative attitude to effort - yup, getting old. :)

i'm with ERP on this one - pretty much everything he said pertains to me too - i guess we're in the same age group ; )

Beware, down that path lieth the Wii.

beware? - you can't imagine what joy the above brings to an old man's heart : )
 
Distinct lack of testosterone, practical, negative attitude to effort - yup, getting old. :)

He isn't showing a negative attitude toward effort. But I think the key is that ERP and others are game developers, who want to make games. At some point the hardware becomes an abstraction, and the only thing that truly matters at the end of the day is the game. Obviously this is going to ruffle feathers on a technical forum, but the reality is this:

Where would ERP rather focus his time: developing the game or working on the hardware? One could argue that the more time spent understanding and exploiting esoteric designs, the LESS time there is for the GAME.

Anyhow, I didn't read anything in his comment about a lack of effort, just a desire to refocus effort on where it counts for gamers and game developers: the game. ;) And obviously, based on a number of big developers' comments, they do feel that the hardware designs don't necessarily solve many of the problems they have/envision.
 
I can relate to that, but isn't there some massive paradigm shift required here on the software side with the advent of parallel computing? If we continue working primarily in C++, easing the technical burden will only become MORE difficult as time goes by.

While I agree in principle, I'm trying to ship games today, and I think we could have had comparable experiences on narrower architectures.

Parallel computing is coming, no question about it (hey, me and a lot of the academic community predicted it 20 years ago). What it looks like when we get to 100 or 1000 hardware threads I don't know, but I'd be willing to bet that the techniques required to leverage massively parallel processors will be radically different from those required to exploit CELL or the X360 processors.

I saw a presentation recently predicting 256 hardware threads in PCs in under 5 years (not that I agree, but still). You can't use even close to the same approaches to exploit 256 hardware threads that you use to exploit 8.

With 8 you run a job model; with 256 you have to start looking at pretty fine-grained parallelism to exploit the resources.
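Roughly the contrast being described, as a sketch (OpenMP is used only to show the shape of each approach; the job names are invented stubs):

Code:
#include <omp.h>

static void update_physics(void) { /* stub */ }
static void update_ai(void)      { /* stub */ }
static void update_audio(void)   { /* stub */ }

/* A handful of hardware threads: hand out a few coarse jobs. */
void job_model_for_few_threads(void)
{
    #pragma omp parallel sections
    {
        #pragma omp section
        update_physics();
        #pragma omp section
        update_ai();
        #pragma omp section
        update_audio();
    }
}

/* Hundreds of hardware threads: the work has to be split much more finely. */
void fine_grained_for_many_threads(float *v, int n, float k)
{
    #pragma omp parallel for
    for (int i = 0; i < n; i++)
        v[i] *= k;
}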

As I've said before, I don't think the X360 or Cell are bad processor designs; they were clearly extremely heavily optimized for one thing at the expense of good general performance.
 