Interview: Nick Baker on the Xbox 360 Architecture

Well, that's quite a different argument from Pipo's query about tools for next-gen consoles, which is what I was talking about (and not Nick's comments on parallelism).

I don't think it is that different. Game development has the same fundamental constraints as the rest of the software industry; the only difference is that it places a different emphasis on the various constraints. Schedule is by far the most important constraint everywhere. Game development probably has more emphasis on performance in general, but not in all cases. Cost is also an important consideration.

Cost is directly related to schedule, so time-to-market matters. Having the proper tools for the job is essential to succeed.

But still, if multicore happens, tools will happen. Inevitably. Intel and AMD and IBM will create their own if there's no universal standard, because they are creating these processors and they will need development tools to work with them!

I just don't share your optimism. IBM certainly didn't have a clue as to how to program CELL initially, presenting three different programming models (inter-communicating SPEs, batch jobs, and managed code) before ending up with the only one that is feasible in the long run (the batch-job one). AMD hasn't provided anything to help the anaemic performance you get from an X2 when you have one thread that is bounced between two cores, causing massive amounts of L1+L2 cache misses.

Relying on hardware companies to come up with a software paradigm shift is optimistic IMO.

Cheers
 
Relying on hardware companies to come up with a software paradigm shift is optimistic IMO.
As I mentioned, they're more a 'last resort'. If no-one else does, they'll offer something. I expect proper software developers, especially MS, to develop the necessary tools.

On the flip side, what if they don't? What if the tools never materialise? Are people going to buy multi-core machines where most of the cores sit idle? Surely without the tools, the whole progress of computing is going to crash, as the hardware advances will result in no tangible benefits? And that in itself is a guarantee solutions will appear! MS and Intel and AMD aren't going to pack up their businesses in 10 years' time and say 'well, we couldn't get any more progress than those monolithic x86s of yesterday, so we're all resigning'! I don't see it as optimism so much as the inevitable demand for a solution inspiring human endeavour and creativity. People always find solutions to problems. One will come from somewhere. It may be ten years in the making (looking at how long higher-level language development took on the first computers) before parallel programming is as commonplace as monolithic-core programming, but it's going to happen eventually.
 
I think if there was going to be a software programming paradigm shift this generation, we would have already seen hints of it out there, at least in the academic community. Right now there is nothing really promising out there that I know of in either academic or commercial research.

There is a paper here summarizing the problem and the recent attempts at tackling it. http://www.eecs.berkeley.edu/Pubs/TechRpts/2006/EECS-2006-183.pdf
 
AMD hasn't provided anything to help the anaemic performance you get from an X2 when you have one thread that is bounced between two cores, causing massive amounts of L1+L2 cache misses.

How about setting thread affinity?

Bouncing threads back and forth between cores is determined by the application and OS scheduler.
What good is having the hardware second-guess the OS?
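
For what it's worth, a minimal sketch of doing exactly that on Win32 (illustrative only; error handling kept to the bare minimum):

```cpp
#include <windows.h>

int main() {
    // Bit 0 set => this thread may only run on logical processor 0,
    // so the scheduler can no longer bounce it between cores.
    DWORD_PTR previous = SetThreadAffinityMask(GetCurrentThread(), 1);
    if (previous == 0)
        return 1;  // the call failed; the thread keeps its old affinity

    // ... run the cache-sensitive workload here ...
    return 0;
}
```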
 
But surely this is just a matter of time? It seems to me that with experience, developers and the industry will learn more ways of effectively breaking down problems into parallelised solutions, and as more time is spent doing so, new methods of enhancing and optimising these techniques will grow and evolve.

As a developer (not of games), I'd say the hard part with multi-core development isn't breaking down a problem into multiple threads executing in parallel. What is hard is making sure the solution you came up with works all the time. When things happen sequentially, you have predictability. If something works correctly once, it'll work correctly always (provided that other variables like uninitialized pointers are eliminated). When things happen in parallel, as with today's technology, you lose a lot of that predictability. Just because something happens one way once doesn't mean it'll happen the same way another time; some timing difference could change the situation. In short, instead of knowing you have the correct answer empirically (i.e. seeing that it works), you have to prove it analytically, a priori. It's a much bigger mental challenge. Unless our brains are getting bigger, I doubt the situation will improve much over time.
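
A minimal illustration of that loss of predictability (a hypothetical example, using C++11 threads for brevity): two threads increment a shared counter without synchronisation, and the result differs from run to run.

```cpp
#include <iostream>
#include <thread>

int counter = 0;  // shared and unsynchronised: this is the bug

void bump() {
    for (int i = 0; i < 100000; ++i)
        ++counter;  // read-modify-write race: increments can be lost
}

int main() {
    std::thread a(bump), b(bump);
    a.join();
    b.join();
    // Usually prints less than 200000, and a different value each run,
    // yet the code "works" whenever the threads happen not to collide.
    std::cout << counter << '\n';
}
```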

As much as the Itanium processor is ill-spoken of, I still think the design path is correct. The task of finding parallelism should fall on the CPU and compiler. Having multiple independent cores is a dead-end IMHO.
 
He explained the x86 thing in the interview; maybe you didn't watch that far. Basically the top two priorities were shipping in 2005 and reasonable cost. As he said, those drove a lot of the decisions, and a single x86 core became too expensive compared to multiple smaller cores.

According to Takahashi's book, Microsoft asked Intel for Tejas but at a price point that was unacceptable. Given that that processor died and the whole P4 line went to the grave with it, one can say that MS made the right choice. In retrospect maybe Intel should have yielded on the IP issue, since it would have ceased to matter ;)
 
Just because something happens one way once doesn't mean it'll happen the same way another time; some timing difference could change the situation. In short, instead of knowing you have the correct answer empirically (i.e. seeing that it works), you have to prove it analytically, a priori. It's a much bigger mental challenge. Unless our brains are getting bigger, I doubt the situation will improve much over time.
I don't know that that's quite the wording I'd use, but that's about the size of it. That's mainly why a lot of the early use cases will be problems that are trivial to parallelize (in the sense that they work on independent data blocks). Of course, the problems that still arise even with these otherwise trivial cases come up when subsequent tasks depend on them and you don't know whether the data will be ready in time or not. I mean, we have things like locks and fences for just that purpose, but they kind of defeat the purpose of multiprocessing in the first place when you're talking about performance-minded development.

For the same reasons, though, automating the process at the compiler or CPU level is also a losing battle. If human brain power is too wee to figure it out on individual apps, how can one expect to generalize the problem across all conceivable software? The best we can possibly ever hope to do is to provide some interface for demarcating candidates for the issuing of jobs/threads and hope things turn out all right, or reach a point where hardware in its raw power performs so well that it doesn't really matter, which to me is the crappiest solution.

As much as the Itanium processor is ill-spoken of, I still think the design path is correct. The task of finding parallelism should fall on the CPU and compiler. Having multiple independent cores is a dead-end IMHO.
Itanium's big problem has more to do with its genesis and the fact that HP wanted everything in a single package. While you can argue that ILP and TLP are not inherently superior to one another, the fact remains that ILP costs you a massive transistor budget. I mean, getting 4x the sustained ILP of CPUs from 15 years ago has cost us 11x the transistors. Comparatively speaking, allowing scaling for more TLP is easy... on the hardware side, anyway.

I don't think multiple independent cores are a dead-end by any means. I think just favoring ILP or TLP specifically is a dead end -- the signs were already showing for ILP, and so a lot of architectures ended up moving towards TLP only. Ultimately, the ideal architecture is one that somehow strikes a balance between ILP and TLP (and there's really no one universal answer to this balance), while also having programming tools that actually enable us to make use of it.

I don't dispute your lack of faith in the human ability to make use of multiprocessing effectively, but I happen to have even less faith that taking away human control is a path to victory over the unclimbable mountain.
 
The fundamental problem with TLP solutions is synchronisation: if you can't trivially encapsulate the synchronisation, it can be a real world of hurt.

We've moved systems not originally designed for multithreaded operation to be asynchronous with a minimal set of locks. That much is fairly trivial (although error prone). However, what tends to happen is that the locks are too broad, resulting in significant serialisation and smaller than expected speed-ups, so you have to make a second iteration to make the locks finer grained, increasing the number of synchronisation primitives and, as a result, the sources of errors. You quickly reach a point where the cost of maintenance can outweigh the value, and you either have to re-engineer or live with the sub-optimal solution.

It's easy to paint yourself into a corner in multithreaded design; everyone understands the principles, but the execution is not simple. My current recommendation to people is: if your design requires mutexes or semaphores, redesign it so it doesn't. If all your synchronisation occurs in small, well-defined data structures (say, a thread-safe queue, which can trivially be written lock-free) then you limit the multithreaded maintenance to those trivial problems, and the surrounding systems aren't burdened with it.
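
To make that concrete, here is a rough sketch of such a structure: a single-producer/single-consumer ring buffer, lock-free in the sense described above (C++11 atomics; the names and fixed power-of-two capacity are illustrative, not a production implementation).

```cpp
#include <atomic>
#include <cstddef>

template <typename T, size_t N>  // N must be a power of two
class SpscQueue {
public:
    bool push(const T& v) {
        size_t head = head_.load(std::memory_order_relaxed);
        size_t next = (head + 1) & (N - 1);
        if (next == tail_.load(std::memory_order_acquire))
            return false;                 // full
        buf_[head] = v;
        head_.store(next, std::memory_order_release);
        return true;
    }
    bool pop(T& out) {
        size_t tail = tail_.load(std::memory_order_relaxed);
        if (tail == head_.load(std::memory_order_acquire))
            return false;                 // empty
        out = buf_[tail];
        tail_.store((tail + 1) & (N - 1), std::memory_order_release);
        return true;
    }
private:
    T buf_[N];
    std::atomic<size_t> head_{0};  // written only by the producer thread
    std::atomic<size_t> tail_{0};  // written only by the consumer thread
};
```

All the multithreaded subtlety lives in those ~25 lines; the systems on either side of the queue never see a lock.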
 
As much as the Itanium processor is ill-spoken of, I still think the design path is correct. The task of finding parallelism should fall on the CPU and compiler. Having multiple independent cores is a dead-end IMHO.
I strongly disagree. ILP is nice but it will never be able to scale as well as TLP, and that will become even more apparent when the 8-core x86 CPUs hit the road and there is software fully using those cores. I think no one will look back; in the end it is the raw performance that counts.

Going multi-core introduces challenges for the software architects, but when the right model has been found the gains are dramatic. Some TLP code may be generated by the compiler, but those cases are fairly rare if you look at code in general; it can usually be done much more efficiently by hand. Perhaps the compiler can be helped by adding additional information through some language extensions, such as constraints for variable loop count limits etc., but that will still leave a lot of responsibility with the programmer, and the old saying "a fool with a tool is still a fool" will remain valid.
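
One existing example of that kind of extension (my illustration, not something the post names): an OpenMP pragma by which the programmer asserts that a loop's iterations are independent, leaving thread creation and scheduling to the compiler and runtime.

```cpp
// Built with an OpenMP-capable compiler (e.g. -fopenmp).
void scale(float* data, int n, float k) {
    // The pragma is the "additional information": no cross-iteration
    // dependencies exist, so the iterations may be split across cores.
    #pragma omp parallel for
    for (int i = 0; i < n; ++i)
        data[i] *= k;
}
```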

I am really curious about how well the upcoming PC games such as UT3 and Crysis will be able to scale performance with the number of cores in the CPU, that may give us a hint about how good the developers have become at using TLP.
 
How about setting thread affinity?

Bouncing threads back and forth between cores is determined by the application and OS scheduler.
What good is having the hardware second-guess the OS?

Erh, I don't want the hardware involved in this. I want somebody to patch the scheduler on Win XP (and Linux for that matter) so that a single process/thread isn't bounced back and forth. My point was that if AMD can't be relied upon for something that simple (gaining access to MS's scheduler might not be simple, but....), then they can't be relied upon to provide state-of-the-art development tools.

Cheers
 
Erh, I don't want the hardware involved in this. I want somebody to patch the scheduler on Win XP (and Linux for that matter) so that a single process/thread isn't bounced back and forth. My point was that if AMD can't be relied upon for something that simple (gaining access to MS's scheduler might not be simple, but....), then they can't be relied upon to provide state-of-the-art development tools.
How big a deal do you think that bouncing really is? Don't you think that each thread switch (including doing some work in between) will trash a large part of the L1 and L2 caches anyhow?
 
On the flip side, what if they don't? What if the tools never materialise?

We'll be stuck with programming threads the way we are now, which is time consuming, error prone and in many cases non-optimal: you want to make sure you break your program down into enough threads that all cores are busy, while at the same time keeping overhead low.

As a start I could imagine CSP primitives becoming native to a programming language. You'd then use them everywhere. The compiler or runtime could then decide which chunks should be run in their own context and which chunks should be collapsed into serial execution in one context.
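
As a rough sketch of what such primitives might look like (a hypothetical API, modelled here with C++11 threads; a real implementation would let the runtime decide which tasks get their own context):

```cpp
#include <condition_variable>
#include <iostream>
#include <mutex>
#include <queue>
#include <thread>

// A CSP-style channel: tasks communicate by sending values, never by
// sharing state. Built on a mutex and condition variable for brevity.
template <typename T>
class Channel {
public:
    void send(T v) {
        std::lock_guard<std::mutex> lock(m_);
        q_.push(std::move(v));
        cv_.notify_one();
    }
    T receive() {  // blocks until a value is available
        std::unique_lock<std::mutex> lock(m_);
        cv_.wait(lock, [this] { return !q_.empty(); });
        T v = std::move(q_.front());
        q_.pop();
        return v;
    }
private:
    std::queue<T> q_;
    std::mutex m_;
    std::condition_variable cv_;
};

int main() {
    Channel<int> ch;
    std::thread producer([&] { for (int i = 0; i < 5; ++i) ch.send(i); });
    for (int i = 0; i < 5; ++i)
        std::cout << ch.receive() << '\n';  // consumer runs inline
    producer.join();
}
```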

The main point is that I as a developer don't waste time with low-level nitty-gritty inter-thread communication (and b0rking it up in the process).

Cheers
 
How big a deal do you think that bouncing really is? Don't you think that each thread switch will trash a large part of the L1 and L2 caches anyhow?

*Off topic*

It's *huge*. In some games it's something like a 15-20% slowdown (Counter-Strike: Source, for example). Games usually score very high hit rates in caches; I've profiled games in the past, and most have hit rates in the D$ above 97%.

There are programs that can start another program (like a game) and pin it to a specific core with an affinity mask.
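
A sketch of what such a launcher boils down to (Win32; the game path is made up): start the process suspended, restrict it to core 0, then let it run.

```cpp
#include <windows.h>

int main() {
    STARTUPINFO si = { sizeof(si) };
    PROCESS_INFORMATION pi = {};
    if (CreateProcess(TEXT("C:\\Games\\game.exe"), NULL, NULL, NULL,
                      FALSE, CREATE_SUSPENDED, NULL, NULL, &si, &pi)) {
        SetProcessAffinityMask(pi.hProcess, 1);  // bit 0 => core 0 only
        ResumeThread(pi.hThread);                // now start it running
        CloseHandle(pi.hThread);
        CloseHandle(pi.hProcess);
    }
    return 0;
}
```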

I manually set affinity in Steam games after startup, not really because I can feel a difference (I don't play that many games anymore), but because knowing a program is running sub-optimally drives me nuts.

I'd imagine that future multithreaded games will set their own affinity masks, which would solve the problem. It's just "old" single-threaded games that are the problem.

Cheers
 
I'd imagine that future multithreaded games will set their own affinity masks, which would solve the problem. It's just "old" single-threaded games that are the problem.
I see your point, thanks. That makes sense for the old games, and if you assume the new programs will keep fairly static threads on each core.
 
I strongly disagree. ILP is nice but it will never be able to scale as well as TLP...
To be accurate about borowki's argument, I don't think it's a matter of scale so much as of attainable performance. If your 8-core CPU has 5x the peak performance of your fat ILP-focused core, but the programmers can't get more than one core working effectively most of the time, the ILP design is the more efficient one in practice. As SMM points out, to date ILP hasn't got us much, but at the same time we've hardly touched broad TLP. When we do get 80-core architectures, will the performance from their TLP actually be extracted? Or will the hardware be seriously hampered by the code run on it, such that a same-sized chip focused on ILP would get better efficiency?

I think one problem with the development of tools is the speed at which this is happening. Developing monolithic cores was a decades-long process, with tools evolving all the time. Then suddenly, in a few years, there's multicore to worry about. The hardware can progress at a huge rate with the software side still barely awake to the problem! For that reason tools may be lacking for quite a while. I have a lot of faith in human ingenuity though!
 
I think one problem with the development of tools is the speed at which this is happening. Developing monolithic cores was a decades-long process, with tools evolving all the time. Then suddenly, in a few years, there's multicore to worry about. The hardware can progress at a huge rate with the software side still barely awake to the problem! For that reason tools may be lacking for quite a while. I have a lot of faith in human ingenuity though!

If you just limit the view to gaming consoles and PCs I agree with the above, but multi-core solutions are conceptually no different from discrete multi-CPU solutions, and those have been around for far more than just a few years. I also have faith in human ingenuity, and I certainly believe PC and game console programmers will get a grip on how to efficiently use multiple cores in the same way as mainframe and HPC programmers have done for decades.
 
Let's also not forget that when we significantly enhance the throughput or processing power of one part of a system, we still eventually have to deal with the other subsystems as well. Otherwise we are just moving the bottlenecks around without any real gain in performance.

For example, scaling the number of cores from the order of 4 to 100 requires extracting massively more TLP from our code. And even if we were able to do that, we would then have to keep all 100 cores fed with instructions + data - something no memory subsystem could possibly do today (not to mention that the CPU/memory performance gap has been generally getting wider over time, not smaller). Even assuming THAT was solved, we'd still need a massively wide and fast bus to transport all that data, too. And on and on...
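
To put rough, purely illustrative numbers on that: if each of 100 cores running at 3 GHz consumed just one byte of operand data per cycle, that would demand on the order of 100 x 3 GB/s = 300 GB/s of sustained bandwidth, where a contemporary dual-channel DDR2 bus peaks at roughly 10 GB/s.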

My point is simply that focusing solely on CPU processing or code parallelization will ultimately be unproductive; you need to look at systems as a whole.
 
My point is simply that focusing solely on CPU processing or code parallelization will ultimately be unproductive; you need to look at systems as a whole.
That's another thing that I don't have much faith in. Which is to say, I don't have faith that people will actually do this, mainly because there's more concern with the constant marketability of the individual functional components. I can only see an endless vicious cycle of constantly advancing raw CPU power, then realizing memory performance isn't enough, so they'll just patch up the problem with chipset support for newer memory platforms which are still too immature to make any difference.

Consoles have a 1-in-5 chance of seeing some minuscule tinge of effort towards this ideal because they have the advantage of being a fixed platform (and, more importantly, because of centralization), but even then, I have my doubts that memory platforms will advance far enough that efforts to design around the system as a whole won't be in vain.
 
Wow, they were planning a single x86 core with 8-16 mini array cores for floating point. I think this confirms that a PPC CPU was never the first choice for MS, considering their history with x86. The interviewer is a kiss-ass, and keeps fluffing this guy instead of asking real questions.

I recall reading in "Xbox 360 Uncloaked", Microsoft's CPU guys thought that they could get 8 to 16 cores on a single chip, circa 2002.

Baker and Andrews estimated they could fit eight or 16 cores on one chip.
 
That was definitely a possibility. Cell has 8 cores. Lose the LS and you could fit 16 cores. Memory management would kill you, though. And if they went with 8 cores, they may as well have gone with Cell!
 