Bob Colwell (PPro designer) interview

Raqia

Here's a fascinating interview with Bob Colwell, one of the principal architects of the revolutionary and long-lived P6 architecture:

http://newsletter.sigmicro.org/sigmicro-oral-history-transcripts/Bob-Colwell-Transcript.pdf

23:58 BC: Yeah it kind of feels like that. That fundamental way of looking at computation was new to us at the time. When it first came up I wondered if this paradigm would reasonably handle all possible instruction flows, because it damn well better, because it's the only one we're going to have, but it's the first machine I'd ever seen that lacked a central controller. Most machines including the Pentium or 486/386, you could identify a central agent somewhere in that machine as a finite state machine of some sort that knew what the state of the machine was, knew what all the pipelines were doing, knew what interrupts were being requested; it was that finite state machine that was running the whole show. But it was a centralized resource, and everyone had to report into it from all of the corners of the chip. P6 could not do that; in P6 we said "we're going to split these machine instructions up into constituent atoms (micro ops) and then we're going to watch them find their own way through the machine." The first thing that should jump to your mind is, wait a minute what if one gets lost? That would stall everything wouldn't it? The answer is yep, if you lose one you're toast. We did have bugs like that when we first started simulating the machine and implementing it. But that was actually a good thing. Because what it meant was if you made a mistake, of any sort, the machine stopped right where the mistake was. So it basically led you by the nose and said "look right here, this is the one that's busted" which was a good thing. What that led to however, I'm kind of getting away from patents, but on both the P6 and Pentium IV there was the question "what are the odds that there is a remaining latent bug in the micro engine, such that some obscure combination of instructions and micro ops causes one of them to get stuck somewhere, you get a deadly embrace, or you live lock or something and the machine stops making forward progress, what are the odds?" 
And if you can't say that that's literally impossible, then what should we do about it? The answer might be to validate some more until you are 100 per cent sure, but if you become convinced that you can never be 100 per cent sure then you need another plan. And so our plan was to take advantage of the fact that there are no other paths through the machine longer than some maximum number of cycles, so if we had a facility in there that knew how many cycles it had been since the last successfully retired micro op had occurred, we'd have a dead man timer. It would be like "uh, oh we're stuck" and we’d proceed to flush the speculative state. 26:34 PE: _____. 26:36 BC: Yeah, exactly, this is not good and let's roll back the speculative state to the last point where the machine was definitely known to be on track. You already have to be able to do that, that's not a new requirement, because that's how you recover from mis-predicted branches. If we did this rollback stratagem, we could work around those bugs and sure enough, that facility is in all of the P6 and Pentium 4 chips and at least at the time I was there it occasionally engaged, occasionally it would do its job. I don't know what the statistics are anymore or what it still does whether they still have it. But I just remember thinking when I first, when we first started doing this I didn't like it, it smelled funny to me. It was like if you really knew what you were doing would you have to do this, are we just doing this to cover our own ignorance. But at the end of the day you have a schedule, you have to get the machine out, you do what it takes and that's what we did. 27:27 PE: Well they're deterministic machines but our ability to understand them is limited. 27:35 BC: Yep and heaven help you when you do the cross product of all the intrinsic complexity times power downs, and you start trying to save power and the unit's not even powered up when you needed it, oh man. 
And then of course, you start peppering it with interrupts, traps, break points, faults; it gets real exciting really fast. The complexity is enormous.
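The dead-man-timer idea Colwell describes is simple enough to sketch in a few lines of Python. This is purely illustrative: the names, the cycle bound, and the flush interface are all made up here, and the real mechanism is of course hardware, not software.

```python
# Toy model of a retirement watchdog ("dead man timer"): count cycles since
# the last successfully retired micro-op; if the count exceeds the longest
# legal path through the machine, assume a micro-op is stuck and flush
# speculative state back to the last known-good retired point (the same
# rollback machinery used for mispredicted branches).

MAX_STALL_CYCLES = 64  # assumed bound on the longest path; illustrative only


class RetirementWatchdog:
    def __init__(self, max_stall: int = MAX_STALL_CYCLES):
        self.max_stall = max_stall
        self.cycles_since_retire = 0
        self.flushes = 0  # how many times the workaround engaged

    def tick(self, retired_this_cycle: bool) -> bool:
        """Advance one cycle. Returns True if the watchdog fired, i.e. the
        core should flush speculative state and resume from the last
        retired (architecturally committed) state."""
        if retired_this_cycle:
            self.cycles_since_retire = 0
            return False
        self.cycles_since_retire += 1
        if self.cycles_since_retire > self.max_stall:
            # "Uh oh, we're stuck" -- roll back and try again.
            self.flushes += 1
            self.cycles_since_retire = 0
            return True
        return False


# Simulate a micro-op getting "lost": retirement stops at cycle 10 and the
# machine makes no further forward progress until the watchdog fires.
wd = RetirementWatchdog()
fired_at = None
for cycle in range(200):
    retired = cycle < 10  # forward progress stops at cycle 10
    if wd.tick(retired):
        fired_at = cycle
        break
```

In this toy run the counter starts climbing at cycle 10 and crosses the 64-cycle bound at cycle 74, at which point the flush engages, mirroring Colwell's description of the facility occasionally doing its job in shipping chips.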
 
There are more than a few interesting talks by Colwell from the past decade or so that can be found on YouTube. He also has a book that covers his time as a project lead (The Pentium Chronicles?) that I've yet to read. He seems pretty outspoken, and given the less-than-perfect end to his employment with Intel, his talks make for some pretty candid reflections on the company and industry.

At a layman's level, reflecting on the shift to out-of-order execution and the natural byproducts of increased complexity always made me smile inwardly. While we dream about the future sci-fi possibilities of AI, or of computers building themselves to the degree that the maker can no longer fully grasp their moment-to-moment behavior (or even about the fact that one could justifiably use a word like "behavior" to describe an operating computer chip), in a lot of ways we've already transitioned through those changes, albeit unceremoniously.
 