Full transcript of John Carmack's QuakeCon 2005 Keynote

scooby_dooby said:
So any "optimizations", ie loop unrolling, you do to the code will speed it up on both a OOO and IO processor, since it's just faster code period. But it will just have a greater impact, and is much more necessary on an IO processor.

As an FYI, a programmer per se does not do loop unrolling. Loop unrolling is generally done by a compiler that is weighing all the various trade-offs for the design, such as code size expansion. Loop unrolling is helpful for both OOO and IO, and in turn also harmful to the L1 cache footprint.

The impacts of the various code optimization strategies will generally be roughly equivalent between IO and OOO uArchs.

Aaron Spink
speaking for myself inc.
 
scooby_dooby said:
Is there no way to queue threads so that they are poised to execute if the CPU stalls waiting for a memory request?

Depends on the program and on the design of the threading hardware.

Once you devs have nailed down exactly the best ways to parallelize game code, is it not fairly straightforward after that?

Depends. How many threads? It is somewhat reasonable to figure out ways to get to, say, 3 threads with a lot of rewriting, but beyond that, sub-threading threads, it can get very complicated. That's not even getting into the issues of destructive interference.

There are some workloads that thread very well (transaction DBs being an example), because they have a large amount of well-ordered data and instruction independence. This allows you to thread them at an operational level (i.e., each DB transaction is an independent thread) with minimal cross-communication.

I'm just not convinced that games are one of these.
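To make that "operational level" threading concrete, here is a minimal sketch (hypothetical names, not from the thread) where each transaction runs as its own thread on its own data, with no cross-communication beyond the final join:
Code:
#include <pthread.h>
#include <stdio.h>

#define NUM_TRANSACTIONS 4

/* Each "transaction" writes only its own slot of the results
 * array, so the threads need no locks and no communication. */
static float results[NUM_TRANSACTIONS];

static void *run_transaction(void *arg)
{
    int id = (int)(long)arg;
    results[id] = id * 2.0f;  /* stand-in for independent work */
    return NULL;
}

int main(void)
{
    pthread_t threads[NUM_TRANSACTIONS];

    for (long i = 0; i < NUM_TRANSACTIONS; i++)
        pthread_create(&threads[i], NULL, run_transaction, (void *)i);
    for (int i = 0; i < NUM_TRANSACTIONS; i++)
        pthread_join(threads[i], NULL);

    for (int i = 0; i < NUM_TRANSACTIONS; i++)
        printf("transaction %d -> %f\n", i, results[i]);
    return 0;
}
Game code rarely decomposes this cleanly, which is the point being argued.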

Aaron Spink
speaking for myself inc.
 
That would be a context switch. Consider the number of registers, then their size, then how much bandwidth would be needed to save all that; then consider where the state would be stored, and double it all, because you have to load the new thread's state too. So for a short stall: ouch.
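(For a rough sense of scale, assuming a PPC-like core with 32 64-bit GPRs, 32 64-bit FPRs, and 32 128-bit vector registers: 32×8 + 32×8 + 32×16 = 1024 bytes to store, and another ~1 KB to load for the incoming thread, on every switch.)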
 
Saem said:
That would be a context switch. Consider the number of registers, then their size, then how much bandwidth would be needed to save all that; then consider where the state would be stored, and double it all, because you have to load the new thread's state too. So for a short stall: ouch.

Absolutely true, but you do get the first and second thread on each core for "free", so you can have two threads try to fill in each other's stalls.
 
inefficient said:
Just in case it has not been repeated enough: there is no such thing as code written for "Out-of-Order CPUs." Both CPUs run the same code. Being able to execute out of order is an OPTIMIZATION!

They run exactly the same code; the OOO CPU will just be slightly less sensitive to BAD code.

Even if you tuned the hell out of your code so that it would run well on an in-order CPU, the same code would still run far, far faster on the CPU that can execute out of order.

The ONLY reason these console CPUs are not OOO is to save transistors.

Fewer transistors = smaller core = cheaper to make + less heat dissipated + higher clock speeds possible.

They traded off a very good optimization technique to improve clock rate and make room for a few more cores.

Realtime code, like the game's 3D renderer, is already going to have been highly optimized, especially when written by an experienced programmer like Carmack. When Carmack says his code is running at 1/2 the speed on an in-order CPU, it's not because he "has not optimized it" for in-order use. It is simply because in-order CPUs are slower, period.

And this is no shocker. Both MS and Sony knew this. But they are hoping that giving developers several slow CPUs will give a better price/performance ratio than a single fast CPU. Theoretically, this should be possible. But in reality, some people disagree, including Carmack, apparently.


Thank you!!! To add to what you said: OoO was not introduced to the Intel x86 line until the Pentium Pro.
 
Branching: How are you going to get around branching? What? Predicated execution? Funny!
Depends on the type of branch. There are the cases you can't get around, but there are the kinds that you can inline out and do a conditional move after the fact. I've already seen cases where two- or three-way branches, even around sizeable amounts of computation, run twice or thrice as fast on the 360 just by running through all the paths and doing a conditional move at the end (which you can get with a ternary).
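As a concrete (hypothetical) sketch of that pattern: run both sides of the branch unconditionally, then pick the result with a ternary, which the compiler can lower to a conditional move (fsel on the 360's PowerPC cores) instead of a jump.
Code:
/* The two sides of the branch, standing in for real computation. */
static float path_a(float x) { return x * x + 1.0f; }
static float path_b(float x) { return x * 0.5f - 2.0f; }

/* Branchy: a data-dependent branch the predictor may get wrong. */
float eval_branchy(float x)
{
    if (x < 0.0f)
        return path_a(x);
    return path_b(x);
}

/* Branchless: run all the paths, then select with a ternary,
 * which can compile to a conditional move rather than a jump. */
float eval_branchless(float x)
{
    float a = path_a(x);
    float b = path_b(x);
    return (x < 0.0f) ? a : b;
}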

Memory Access: What, you don't think memory access hurts on modern PC processors? The access latencies in cycles are as high, if not higher, on the PC side versus the new console processors. If anything, memory accesses hurt MORE on current PC processors.
Yeah, but the larger caches and speculative auto-prefetches reduce the miss rate so that you don't actually hit memory as often. Small caches shared by multiple cores, plus explicit prefetching, will prove to be a headache. It's not about how many cycles each cache miss costs you, but about the mean CPI cost.

Jumping around in memory: Not sure how this is any different from memory access, but I would assert that this hurts as badly if not worse on modern PC processors.
I was more referring to the whole AOS vs. SOA point here. When you're doing things that don't reference everything in a struct but loop through a whole lot of them, the difference between AOS and SOA shows itself much more apparently if each access costs you that much more on average. On the PC, the difference is still there, but it doesn't strike you until you're on a platform where the difference is practically night and day.
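The AOS vs. SOA distinction in (hypothetical) code: when a loop only touches x, the SOA layout uses every byte it fetches, while the AOS layout drags y, z, and w through the cache for nothing.
Code:
#define N 1024

/* AOS: all fields of one element are adjacent in memory. */
struct ParticleAOS { float x, y, z, w; };
static struct ParticleAOS aos[N];

/* SOA: each field is its own contiguous array. */
static struct ParticlesSOA {
    float x[N], y[N], z[N], w[N];
} soa;

float sum_x_aos(void)
{
    float s = 0.0f;
    for (int i = 0; i < N; i++)
        s += aos[i].x;  /* uses 4 of every 16 bytes fetched */
    return s;
}

float sum_x_soa(void)
{
    float s = 0.0f;
    for (int i = 0; i < N; i++)
        s += soa.x[i];  /* uses every byte of every line fetched */
    return s;
}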
 
You can still predict branches in software ... for something like writing signal processing software in assembly, I'd take that over x86 branch prediction any day. It is bloody annoying to know for a fact that branches whose direction you know well ahead of time are going to be mispredicted (not helped by the lack of a functional loop instruction on x86 either).
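A small aside: on compilers like GCC you can at least pass the known direction along with __builtin_expect, which steers code layout (and any static-prediction hints the target offers), even though it can't override the dynamic predictor. A sketch, assuming GCC:
Code:
/* GCC extension: tell the compiler which way a branch almost
 * always goes, so the cold path is moved out of line. */
#define LIKELY(x)   __builtin_expect(!!(x), 1)
#define UNLIKELY(x) __builtin_expect(!!(x), 0)

float accumulate(const float *a, int n)
{
    float s = 0.0f;
    for (int i = 0; i < n; i++) {
        if (UNLIKELY(a[i] < 0.0f))
            continue;  /* known-rare case: kept off the hot path */
        s += a[i];
    }
    return s;
}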
 
Yeah, well... x86 is... x86.

Ultimately, I guess what I'm getting at is that if everything in the world of code was just made up of ginormous linear sets of pure computation that never set foot outside of cache, that would basically be the only thing that performs well on these chips. The further away you are from that is exactly how far away you will be from "fast."
 
ShootMyMonkey said:
Depends on the type of branch. There are the cases you can't get around, but there are the kinds that you can inline out and do a conditional move after the fact. I've already seen cases where two- or three-way branches, even around sizeable amounts of computation, run twice or thrice as fast on the 360 just by running through all the paths and doing a conditional move at the end (which you can get with a ternary).

And I've seen cases where conditional moves actually slow things down. All modern CPUs have full support for conditional moves, and the cases in which they are effective are few and far between.

Yeah, but the larger caches and speculative auto-prefetches reduce the miss rate so that you don't actually hit memory as often. Small caches shared by multiple cores, plus explicit prefetching, will prove to be a headache. It's not about how many cycles each cache miss costs you, but about the mean CPI cost.

Larger caches? Um, nope.

Auto-prefetches? Um, nope.

The issues exist for both. The issues cost the same for both.


I was more referring to the whole AOS vs. SOA point here. When you're doing things that don't reference everything in a struct but loop through a whole lot of them, the difference between AOS and SOA shows itself much more apparently if each access costs you that much more on average. On the PC, the difference is still there, but it doesn't strike you until you're on a platform where the difference is practically night and day.

I don't see why the difference would be practically night and day.

Aaron Spink
speaking for myself inc.
 
aaronspink said:
As an FYI, a programmer per se does not do loop unrolling. Loop unrolling is generally done by a compiler that is weighing all the various trade-offs for the design, such as code size expansion. Loop unrolling is helpful for both OOO and IO, and in turn also harmful to the L1 cache footprint.
Although, in general, I do agree with you, there are cases where manual loop unrolling is beneficial.

(Now, I'm going to assume we've profiled the code and first found that an inner loop is expensive because, as Hoare and Knuth said, "Premature optimisation is the root of all evil")

If you have something like
Code:
void myfunc(float array1[], float array2[], int n /* ... */)
{
    /* ... */

    for (int i = 0; i < n; i++)
    {
        array1[i] *= array2[i];
    }
    /* ... */
}

Assuming you always know that array1 and array2 are physically separate, you can do much better by unrolling the loop a little and using some temps, e.g.
Code:
    for (int i = 0; i < n; i += 2)  /* assumes n is even */
    {
        float tempa, tempb;
        tempa = array2[i];
        tempb = array2[i+1];

        array1[i]   *= tempa;
        array1[i+1] *= tempb;
    }
This allows the compiler to overlap the multiplies. Without the unrolling and temps, it can't tell whether the arrays overlap, and so can't start the second multiply until the first has completed. (Obviously you can use pointers etc. to get rid of the indexing if you want.)
 
Simon F said:
This allows the compiler to overlap the multiplies. Without the unrolling and temps, it can't tell whether the arrays overlap, and so can't start the second multiply until the first has completed. (Obviously you can use pointers etc. to get rid of the indexing if you want.)

Incidentally, you can tag your parameters with the new __restrict keyword in the latest VC++, which explicitly indicates to the compiler that they will never alias. I would expect that this would let the compiler figure out that it can do the loop unroll for you in this particular case.
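Something like this, presumably (a sketch reusing the earlier example; __restrict is the MSVC spelling, while C99 spells it restrict):
Code:
/* With the pointers marked __restrict, the compiler may assume
 * array1 and array2 never alias, so it is free to unroll the
 * loop and overlap the multiplies on its own. */
void myfunc(float * __restrict array1,
            const float * __restrict array2,
            int n)
{
    for (int i = 0; i < n; i++)
        array1[i] *= array2[i];
}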
 
aaaaa00 said:
"VC++". I think that sums it up. I work with multi-platform and as ANSI standard as possible.

I suppose you could stay ANSI by locally creating "const" aliases to the data, but I'd still have my doubts about how well that would work - I've found that declaring local scalars and making "copies" of the data tends to work best. The compiler often eliminates the actual copy.
 
Simon F said:
"VC++". I think that sums it up. I work with multi-platform and as ANSI standard as possible.

I suppose you could stay ANSI by locally creating "const" aliases to the data, but I'd still have my doubts about how well that would work - I've found that declaring local scalars and making "copies" of the data tends to work best. The compiler often eliminates the actual copy.

Const actually doesn't work, because it only applies to one reference. Because of the vagaries of C/C++ you really can't assume that things that are const don't alias (yes, it is stupid). That is why the C standard added the restrict keyword, which tells the compiler there is no aliasing. It is supported in ICC, VC++, and GCC, plus many others.

Yes, `const' doesn't promise that things won't be aliased. The next revision of the C standard, C9X, will have a new keyword "restrict" for that purpose, although it will probably be at least a few years before many C or C++ compilers really support it.
 
ShootMyMonkey said:
Ultimately, I guess what I'm getting at is that if everything in the world of code was just made up of ginormous linear sets of pure computation that never set foot outside of cache, that would basically be the only thing that performs well on these chips. The further away you are from that is exactly how far away you will be from "fast."
As soon as you go outside of cache you are screwed, period ... you cannot make the instruction window large enough to make a dent in the latency anymore.
 
aaronspink said:
Const actually doesn't work, because it only applies to one reference.
Yes, you are quite right. Should have thought a bit more before posting. Guess that explains why I haven't done that in the past :)
 
All modern CPUs have full support for conditional moves, and the cases in which they are effective are few and far between.
Yeah, and about zero compilers do even a halfway decent job with them. As long as the cost of the branch is less than the cost of the computation within it, a conditional move is pretty useless. The chances that the branch will be worse rest pretty much on the misprediction rate and on whether you need to branch outside of i-cache. Either way, it's pretty much a matter of bad luck, but that's the only kind of luck there is.

Also, in game code, at least in computational tasks, the amount of computation done in the body of each branch is rarely *that* big. On top of that, there are all kinds of miscellaneous little insertions that are usually meant to *avoid* computation.

The issues exist for both. The issues cost the same for both.
So you're saying larger caches and/or speculative prefetches do nothing for miss rate? Keep in mind that in the 360's case, the 1 MB is shared among 3 cores, so you either make the cache effectively smaller (partitioning), or you endure the increase in miss rate caused by multiple threads mapping whatever they're accessing to many of the same sets. Again, bad luck, but there isn't any other kind. If you have 2 MB for 2 cores vs. 1 MB for 3 cores vs. 512k for 1 main core (the rest are not really cache), guess who has the highest miss rate?

If your cache miss rate is 2% and your memory latency is 800 cycles, that's still lower mean CPI cost than a miss rate of 5% for a memory latency of 600 cycles.
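(Worked out: 0.02 × 800 = 16 cycles of expected miss cost per access, versus 0.05 × 600 = 30.)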

I don't see why the difference would be practically night and day.
When you get right down to it, they're pretty much made for stream computing; the most significant fraction of their computing power is there. It's a matter of how frequently madame bad luck rears her head. And the thing is that with these next-gen CPUs, the performance loss per moment of any type of bad luck is pretty linear all the way.
 
Oof, massively parallel compilers as a general case. I remember seeing the statement of the problem in some computer science classes as an undergrad. Needless to say, it's a horrendous headache that went way beyond me, as there are a lot of highbrow technical mathematical no-go theorems that are hard to bypass.
 
John Carmack

Hello,
I actually don't play video games, but I know John. I just wanted to answer one question that was posed (he mentioned he gave a fairly windy speech).
Does John use a teleprompter?
No, in truth you can ask John questions on a frigging variety of topics so wide it is truly impressive, and he will give you good information (or he will tell you clearly: I don't know).
I am signing off, have work to do elsewhere.
To the gentleman who transcribed the speech, my thanks. It was an interesting read.

Laters,
Nemisis
 