Simple(ish) questions regarding the Xbox 360

Ken2012

Newcomer
Right, hopefully you kind folk here will be able to shed some light on my insane ramblings :) Bear with me... I'm taking the time to ponder this not out of any urge towards self-aggravation, simply out of pure interest:-

1) In the transcript of John Carmack's QuakeCon 2005 speech, he talked about how a game optimised/coded for Intel/AMD CPU architecture, specifically Quake 4, would run at around half the framerate on the 360's custom triple-core IBM architecture. And lo and behold, I've been hearing reports of framerate drops in areas of Q4 that would run much faster on sufficient PC hardware.

OK, why exactly is this? Going by this X-bit labs article, Quake 4 is already multi-threaded: http://www.xbitlabs.com/articles/cpu/display/28cpu-games_6.html. What exactly made this port so difficult to accomplish, whereas NFS: Most Wanted runs at a higher framerate, with fractionally higher image quality (marginally nicer-looking textures/materials, I believe), on the '360 than its PC version does? If a single-threaded (?) engine works well on both platforms, why doesn't a dual-threaded engine designed for dual-core architecture run as well or better on a 6-threaded CPU?

I assume it all has to do with the discrete differences between the different (Intel-compatible and non-Intel-compatible) instruction sets... Lack of programmability and all that. That, and NFS:MW is no doubt a more 'GPU-dependent' engine than Q4 anyway, so the R500's acclaimed unified shader architecture is probably paying off...

2) Carmack also talked a lot more about multi-core/threading, how it's been around for a looong time and how it's always been difficult/challenging to fully harness the raw performance on offer. He praises the current crop of multi-GPU solutions (that is, nVidia's SLI and ATI's Crossfire) as basically the most successful incarnation of mainstream parallel processing to date. If that's the case, however, shouldn't we already be seeing much better performance from multi-GPU than we currently do? Not that it doesn't deliver at all (I own dual 7800GTXs and I love 'em :D), but surely current hardware-intensive PC games like F.E.A.R. should/could benefit a lot more from the sheer power of a second GPU/framebuffer (in conjunction with the right CPU, granted). Unreal Engine 3.0/UT2007 is rumoured to take more specific advantage of SLI/Crossfire; is there any truth in this, do you think?

Thanks.
 
The triple-core CPU in Xenon is an in-order CPU, which makes it more difficult to program for.

I also don't buy that multi-threading is something the industry already knows how to do well. Many companies claim to be multi-threading, but they are probably just throwing audio, decompression and the like onto the other cores.

I don't believe timing-critical, game-engine multithreading is being done yet. Anywhere, probably. Not on PC either.
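To make the distinction concrete, here's a minimal sketch (modern C++ threads, purely illustrative, and not how any particular engine actually does it) of the 'easy' kind of multithreading being described: self-contained jobs like audio mixing or asset decompression get pushed onto spare cores, while the timing-critical game loop stays on one thread.

// "Easy" multithreading: hand self-contained work (audio, decompression)
// to other cores, keep the timing-critical game loop on one thread.
// Purely an illustrative sketch, not any real engine's architecture.
#include <cstdio>
#include <thread>
#include <atomic>

std::atomic<bool> running{true};

void audioMixer() {                    // independent job, touches no game state
    while (running.load())
        std::this_thread::yield();     // mix/stream audio buffers here
}

void decompressAssets() {              // another independent, one-shot job
    std::printf("decompressing assets on a spare core...\n");
}

int main() {
    std::thread audio(audioMixer);     // fire-and-forget style workers
    std::thread loader(decompressAssets);

    for (int frame = 0; frame < 3; ++frame) {
        // The hard part -- splitting THIS loop (AI, physics, render setup)
        // across cores every frame with correct timing -- is what hardly
        // anyone is doing yet.
        std::printf("frame %d: update + render on the main thread\n", frame);
    }

    running.store(false);
    audio.join();
    loader.join();
    return 0;
}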

NFS:MW is said by the INQ to run better at 1920x1200 on a PC with a 512 GTX than it does on the Xbox 360 at 720p! That said, the PC NFS:MW benchies I saw were not that impressive (40-60 FPS even at lower resolutions). However, a better example of a game that runs better on the 360 would be Call of Duty 2, which virtually every review has said looks as good or better, and runs better, than it does on even the highest-end PCs.

But basically you can think of the Xbox CPU as having massive raw power, but being very difficult to develop for. So early games will suffer.
 
I think this more or less boils down to how much time developers had with final HW (or HW comparable to final) before launch, more than anything else.
 
FWIW I don't think a lot of developers were really prepared for how much difference the in-order design would make.

The issue is that the CPU basically makes no attempt to hide memory and instruction latencies, leaving that job to the compiler; an OO core can do a lot of things a compiler can't.

Programming for these cores (or the PS3's PPE) has more in common with programming for the PS2 than with programming for a Mac or a PC, where the OO design and cache bail you out. You can get good performance out of them, but you need to avoid the things they cannot do well.
 
ERP said:
FWIW I don't think a lot of developers were really prepared for how much difference the in-order design would make.

The issue is that the CPU basically makes no attempt to hide memory and instruction latencies, leaving that job to the compiler; an OO core can do a lot of things a compiler can't.

Programming for these cores (or the PS3's PPE) has more in common with programming for the PS2 than with programming for a Mac or a PC, where the OO design and cache bail you out. You can get good performance out of them, but you need to avoid the things they cannot do well.

I'm not familiar with the terms OO and PPE; what do they mean? I know that OO is object-oriented, something to do with putting code into 'manageable sections' as opposed to one huge list of things to do. I'm no coder, as you can tell :oops:
 
ERP wasn't referring to object-oriented, he was referring to out-of-order design vs. in-order design.

PPE refers to one of the cores in the PS3's CELL. The PS3's CELL has 1 PPE core and 7 SPE cores. The Xbox 360's CPU, on the other hand, has 3 symmetrical cores that are very similar to the PPE in CELL.
 
Ken2012 said:
I'm not familiar with the terms OO and PPE; what do they mean? I know that OO is object-oriented, something to do with putting code into 'manageable sections' as opposed to one huge list of things to do. I'm no coder, as you can tell :oops:
You're right about OO in terms of software, but it has a different meaning here. OO (or OOE or OOOE) in reference to processors is 'Out-of-order Execution'. It allows a processor to rearrange the order of the instructions it has to process, so if, for example, it's waiting for one instruction to fetch something from memory, it can get on with other instructions in the meantime. This keeps the processor busy.

XB360's CPU is an in-order execution design, which is typical for consoles, and processes instructions in the order it gets them from the program. This means that for a developer to make efficient use of the processor's resources, they have to pay careful attention to the way they feed it instructions. In-order is harder for the developers to work with, but it makes the CPU simpler, which, in the case of next-gen, allows several cores to be used.
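A tiny, contrived C++ fragment (nothing to do with any real game code) may help illustrate what the reordering buys you:

// Illustration only: how an out-of-order core can hide a cache miss
// where an in-order core cannot. Arrays and sizes are arbitrary.
#include <cstdio>

struct Node { int value; Node* next; };

// Dependent chain: each step needs the previous load's result, so neither
// kind of core can overlap much here -- every hop eats the full latency.
int walkList(const Node* n) {
    int sum = 0;
    while (n) {
        sum += n->value;
        n = n->next;
    }
    return sum;
}

// Mixed work: the multiply-add on 'b' does not depend on the load from 'a'.
int mixedWork(const int* a, const int* b, int n) {
    int sumA = 0, sumB = 0;
    for (int i = 0; i < n; ++i) {
        sumA += a[i];       // may miss in cache and stall
        sumB += b[i] * 3;   // independent work that could fill the wait
    }
    return sumA + sumB;
}

int main() {
    int a[4] = {1, 2, 3, 4}, b[4] = {5, 6, 7, 8};
    std::printf("%d %d\n", walkList(nullptr), mixedWork(a, b, 4));
    return 0;
}

An out-of-order core will slide the independent b[i] work underneath an a[i] cache miss on its own; on an in-order core like Xenon's, the compiler or programmer has to do that interleaving by hand, or the core simply waits.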
 
Shifty Geezer said:
In-order is harder for the developers to work with, but it makes the CPU simpler, which, in the case of next-gen, allows several cores to be used.

All this would explain why we won't be seeing quad-core OoO architectures from AMD/Intel for some time... More difficult to design, so credit must be given to current dual-core PC CPUs as a massive technological breakthrough, it seems.

But a new question must now be addressed: what is it that makes a (multi-core) IO CPU easier to manufacture than an OoOE one? Broad question, I know, but I don't understand the transition between the primal hardware design of a modern microprocessor (largely a collection of electronic switches arranged into dedicated areas; this transistor does that, that transistor does this) and the implementation of the architecture, that is, how exactly certain areas handle different situations (whether the chip is CISC or RISC, whether it uses SIMD, MIMD, superscalar, etc.)... Or are we talking about the same thing here? Am I just confusing myself further? ;) I do understand the concept of reverse engineering; is this the usual way in which a processor is designed?

Seeing as we're in the 90nm era and transistor size/stepping is only getting smaller on the horizon... Manufacturing yields have to be getting lower... Or does this only apply to clock frequency/raw overall performance?
 
Are there any examples of a simple good and a simple bad code snippet? I recall that in the move from the Pentium Pro to the Pentium II a lot of code began to perform worse; that is, the speed would fluctuate and be unpredictable. Usually that was typical, well-optimised, structured code that simply performed well because of its simplicity. Personally I didn't like this behaviour.
 
Ken2012 said:
But a new question must now be addressed: what is it that makes a (multi-core) IO CPU easier to manufacture than an OoOE one?

I think a large number of transistors has to be dedicated to things like branch prediction, a large cache and so on in OoOE CPUs. I suppose those transistors would have to be duplicated for each core.

Broad question, I know, but I don't understand the transition between the primal hardware design of a modern microprocessor (largely a collection of electronic switches arranged into dedicated areas; this transistor does that, that transistor does this) and the implementation of the architecture, that is, how exactly certain areas handle different situations (whether the chip is CISC or RISC, whether it uses SIMD, MIMD, superscalar, etc.)... Or are we talking about the same thing here? Am I just confusing myself further? ;) I do understand the concept of reverse engineering; is this the usual way in which a processor is designed?

I remember reading a paper saying that CISC and RISC are not really relevant terms anymore, with RISC CPUs gaining more and more instructions, and CISC CPUs translating their machine code into a relatively small microcode instruction set. I would tend to think that processors are designed with generic goals in mind: scalability (ramping up clock speeds, multiple cores), performance (which can be FLOPS, IPC...), cost... After that, the engineering team comes up with a specific design that is supposed to be able to accomplish those goals, sometimes with great success, sometimes with abject failure (remember how the P4 was supposed to ramp easily into 4+GHz territory?). Of course, reuse of past engineering achievements is paramount to reducing the cost.

Seeing as we're in the 90nm era and transistor size/stepping is only getting smaller on the horizon... Manufacturing yields have to be getting lower... Or does this only apply to clock frequency/raw overall performance?

You still have a limited transistor budget, which means choices will still have to be made. Going IOE probably gives the opportunity to reach high clocks, multiple cores, and the FLOPS rating that is so important to Internet community-based e-penis measurement. :)
 
Ken2012 said:
All this would explain why we won't be seeing quad-core OoO architectures from AMD/Intel for some time... More difficult to design, so credit must be given to current dual-core PC CPUs as a massive technological breakthrough, it seems.

But a new question must now be addressed: what is it that makes a (multi-core) IO CPU easier to manufacture than an OoOE one?
Actually, a major reason we haven't seen a quad-core CPU from AMD or Intel is that developers are still not using dual core very well, so it'll just be wasted die space.

The answer to your second question is simple. Cost. The logic that decides what order to most efficiently execute instructions is very complicated, and occupies a lot of silicon. CELL wouldn't come close to having this much processing power if they had to squeeze this logic in there.

Actually this ties into the quad-core question as well. In order cores are a lot smaller, so you can fit more in the same space. However, Intel and AMD have to live with the x86 instruction set that's used in all but the smallest fraction of the market.
 
Mintmaster said:
The answer to your second question is simple. Cost. The logic that decides what order to most efficiently execute instructions is very complicated, and occupies a lot of silicon. CELL wouldn't come close to having this much processing power if they had to squeeze this logic in there.
It's worth clarifying that the limiting factor is manufacturing technology, AFAIK. Processor cores are printed onto silicon wafers with a certain number of errors per unit area. Bigger cores are thus more error-prone, and above a certain size you'd be guaranteed at least one error per core, making everything produced useless. At the moment 300 million transistors seems to be about the limit, and there are parts of a chip designed to take a hit and become disabled with the rest of the processor still able to work. In the case of Cell for PS3, an error in one of the SPEs doesn't stop the chip being used. That's, I think, about 235 million transistors for an 8-SPE Cell, and to get suitable yields (usable processors) Sony are allowing one SPE to be defective. A dual-core Athlon 64 is also about 235 million transistors. So a quad-core Athlon would be around 500 million transistors, and you'd probably be lucky to get a few fault-free chips off a 300mm wafer. At something like $10,000 a wafer, those would be very expensive chips!
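As a rough, back-of-the-envelope illustration of why die area hurts yield so much, here's a quick calculation using the simple Poisson yield model, yield = exp(-area * defect density). The defect density and die sizes below are purely made-up numbers to show the trend, not real process data.

// Back-of-the-envelope yield estimate using the simple Poisson model:
//   yield = exp(-die_area * defect_density)
// All numbers below are hypothetical, chosen only to show the trend.
#include <cmath>
#include <cstdio>

int main() {
    const double defect_density = 0.5;               // defects per cm^2 (made up)
    const double die_areas_cm2[] = {1.0, 2.0, 4.0};  // small die, doubled, quadrupled

    for (double area : die_areas_cm2) {
        double yield = std::exp(-area * defect_density);
        std::printf("die area %.1f cm^2 -> expected yield %.0f%%\n",
                    area, yield * 100.0);
    }
    // Doubling the area doesn't halve the yield, it squares the loss:
    // roughly 61% -> 37% -> 14% with these made-up numbers.
    return 0;
}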
 
Ken2012 said:
All this would explain why we won't be seeing quad-core OoO architectures from AMD/Intel for some time... More difficult to design, so credit must be given to current dual-core PC CPUs as a massive technological breakthrough, it seems.

This is a yes and no kind of situation. Dual core at its simplest is no different from having dual sockets on a motherboard. Just copy a core many times and tie it to a bus or interconnect.

More complex versions that have some kind of arbitration or system request queue involve more work, but can lead to real gains. The biggest limiting factor for AMD and Intel multicores is the size of the chips at a given manufacturing process node.

But a new question must now be addressed: what is it that makes a (multi-core) IO CPU easier to manufacture than an OoOE one?
Multicore can be as easy as copying a core many times over on a chip. It's even easier if the core is small and cool enough to be copied many times. OOE does not lend itself to being small or cool, though this is heavily influenced by how aggressively out-of-order and speculative the processor is.

OOE cores fetch instructions in chunks and store them in a buffer. Every cycle, the chip scans through the buffered instructions and picks out the ones that have their data available, regardless of where they are in the buffer (subject to issue width). Unfortunately, this involves checking each instruction for a dependency on every other instruction in the buffer. More ready instructions can be found if the number of instructions being juggled by the core, its instruction window, is larger. However, larger windows mean more checking, usually on the order of N^2-N checks. This is very hard to do quickly with any non-trivial number of instructions. It also requires some serious scheduling logic and long data bus lines.

In effect, design complexity and power draw scale quadratically with window size (a lot of hardware design features have this problem). However, the performance gains are only linear. OOE for the Alpha processor gave about a 30-50% average improvement in performance.

An OOE core can be four or more times larger and hotter than an in-order core, but only offer a 50-100% gain in performance--usually on code that could have been scheduled better (all else being equal). On well-tuned code, the advantage is usually much less. In fact, on well-tuned code, an in-order core will most likely be able to significantly outclock an OO core and have room left over for cache, execution units, or more cores.
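To make that N^2-N figure concrete, here's a toy software model (real hardware does this in parallel with content-addressable memories, but the amount of bookkeeping grows the same way) of the wakeup check an out-of-order scheduler performs each cycle: every entry in the window compares its source registers against the destination of every other entry, so an N-entry window needs on the order of N*(N-1) comparisons, and more if each instruction has several sources.

// Toy model of an out-of-order scheduler's wakeup/dependency check.
// Purely illustrative: the point is the N^2 - N growth in comparisons.
#include <cstdio>
#include <vector>

struct Instr {
    int dest = 0;        // register this instruction writes
    int src[2] = {0, 0}; // registers it reads
};

// Count the source-vs-destination comparisons one scheduling pass needs.
long wakeupComparisons(const std::vector<Instr>& window) {
    long comparisons = 0;
    for (size_t i = 0; i < window.size(); ++i) {
        for (size_t j = 0; j < window.size(); ++j) {
            if (i == j) continue;
            for (int s = 0; s < 2; ++s) {
                ++comparisons;
                // a real scheduler would mark source s of instruction i
                // ready here once instruction j produces its result
                (void)(window[i].src[s] == window[j].dest);
            }
        }
    }
    return comparisons;  // about 2 * (N^2 - N)
}

int main() {
    for (int n : {16, 32, 64, 128}) {
        std::vector<Instr> window(n);
        std::printf("window of %3d entries -> %6ld comparisons per pass\n",
                    n, wakeupComparisons(window));
    }
    return 0;
}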

Broad question, I know, but I don't understand the transition between the primal hardware design of a modern microprocessor (largely a collection of electronic switches arranged into dedicated areas; this transistor does that, that transistor does this) and the implementation of the architecture, that is, how exactly certain areas handle different situations (whether the chip is CISC or RISC, whether it uses SIMD, MIMD, superscalar, etc.)... Or are we talking about the same thing here? Am I just confusing myself further? ;) I do understand the concept of reverse engineering; is this the usual way in which a processor is designed?

RISC is a design philosophy in chip Instruction Set Architecture that boils down to keeping the instructions a processor crunches through as easy to consume as possible. CISC is a disparaging label the creators of RISC put on every design that wasn't theirs.

Usually, a Reduced Instruction Set Computer design has instructions that are fixed in size and orthogonal, coupled with simpler memory addressing schemes. In most cases, a task that can be broken down into simpler components will not be assigned an instruction unless there is a compelling performance gain (a number of RISC architectures didn't have a full integer divide instruction, for example). Computation is limited to operands stored in registers, meaning that data must be explicitly loaded in by a load instruction, which simplifies pipelining and instruction scheduling immensely (these are called load/store architectures). Code size is bigger, but this is a fixed burden that has essentially been wiped away in all but the most confined systems.

Non-RISC architectures like x86 had/have instructions of variable length, often burdened by a wild set of optional prefix codes, exceptions, and duplication. These are all a pain in the ass when a chip's decoder has to figure out which portion of a stream of binary numbers is an instruction, a prefix, or another instruction; the clock cycle doesn't wait. Complicated memory addressing often made pipelining very difficult, as it either added points in execution that could maddeningly screw up later or earlier instructions, or it forced the core to deal with horribly unpredictable main memory latency. Instructions that were register-memory or memory-memory threw hardware designers for another loop.

What the designers of modern x86 chips did was devote a significant number of transistors on the critical execution path to translating x86 instructions into an internal load/store format. So long as transistors are available, this means RISC and CISC designs are much closer in performance than they would be otherwise. x86 was lucky; essentially no other legacy architecture had the market dominance needed to make this a worthwhile investment. In those cases, RISC designs were without a doubt superior.

Superscalar processing means designing a chip that can execute more than one instruction at a time. This is a conceptually easy way to increase performance. However, it is not without cost. Superscalar processors must check that the two or more instructions up for issue are not part of some dependency chain. This gets harder to do quickly with wider issue, and it lies right on the critical path of execution. (Other concerns such as precise exceptions and result forwarding also become a problem; forwarding results requires a bypass network whose latency scales quadratically with issue width.)
This has diminishing returns: a 2-wide processor will almost double performance in most cases compared to a scalar one, but a 4-wide will not be double that of a 2-wide.

RISC made widespread pipelining implementable, which took advantage of Instruction Level Parallelism to increase performance.
It takes extra design and transistors, but essentially no design goes without it unless power is ultra-important.

Superscalar took advantage of ILP as well, at the price of complexity and power. Some complexity can be moved to software. The Itanium uses a version of Very Long Instruction Word computing to make wide issue much easier on hardware. In some cases, this is a real win.

OOE is used to get around a number of the stumbling blocks in sequential program code that hide ILP, at the price of complexity and power.

SIMD and MIMD are ways of defining instructions that can better take advantage of data parallelism. In theory, they allow a processor to take bigger chunks of specially prepared data and waste less time repeating a long chain of simpler instructions. It isn't very RISC-like in that respect, but it saves a lot on overhead and can lead to much higher resource utilization.
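For a simplified picture of what the SIMD part buys you, here's a scalar loop next to the same work done four floats at a time using x86 SSE intrinsics; the same idea applies to the VMX units in Xenon and Cell, just with different intrinsics (this is an illustrative sketch, not tuned code).

// Scalar vs. SIMD versions of the same work: c[i] = a[i] + b[i].
// Uses x86 SSE intrinsics purely as an example; Xenon/Cell would use
// their own VMX/AltiVec equivalents. Assumes n is a multiple of 4.
#include <xmmintrin.h>   // SSE
#include <cstdio>

void add_scalar(const float* a, const float* b, float* c, int n) {
    for (int i = 0; i < n; ++i)
        c[i] = a[i] + b[i];                        // one add per iteration
}

void add_simd(const float* a, const float* b, float* c, int n) {
    for (int i = 0; i < n; i += 4) {
        __m128 va = _mm_loadu_ps(a + i);           // load 4 floats at once
        __m128 vb = _mm_loadu_ps(b + i);
        _mm_storeu_ps(c + i, _mm_add_ps(va, vb));  // 4 adds in one instruction
    }
}

int main() {
    float a[8] = {1, 2, 3, 4, 5, 6, 7, 8};
    float b[8] = {8, 7, 6, 5, 4, 3, 2, 1};
    float c[8];
    add_simd(a, b, c, 8);
    for (float v : c) std::printf("%.0f ", v);     // prints "9" eight times
    std::printf("\n");
    add_scalar(a, b, c, 8);                        // same result, 8 scalar adds
    return 0;
}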

Multithreading goes for Thread Level Parallelism (TLP), where each thread is assumed by the hardware to be totally independent and safe to run concurrently. In cases where threads interact, it is the software's job to find the conflicts; hardware will not devote any more resources to what is an exponentially more difficult task. This is easier on hardware, but harder on software. Debugging is much, much harder, and transient system behavior can do strange things that would not affect a single-threaded program.
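A minimal sketch of why "it is the software's job to find conflicts" is such a burden (modern C++ used purely for brevity): two threads bumping the same counter without synchronisation lose updates unpredictably, which is exactly the kind of transient behaviour that makes multithreaded debugging painful. The hardware runs both versions below with equal enthusiasm; only the software knows one of them is wrong.

// Classic lost-update race: the hardware happily runs both threads,
// and it is entirely up to the software to notice the conflict.
#include <cstdio>
#include <thread>
#include <mutex>

int counter = 0;          // shared, unprotected
int safe_counter = 0;     // shared, protected by a mutex
std::mutex m;

void bump_unsafe() {
    for (int i = 0; i < 100000; ++i)
        ++counter;                         // read-modify-write, not atomic
}

void bump_safe() {
    for (int i = 0; i < 100000; ++i) {
        std::lock_guard<std::mutex> lock(m);
        ++safe_counter;                    // serialised, always correct
    }
}

int main() {
    std::thread t1(bump_unsafe), t2(bump_unsafe);
    t1.join(); t2.join();

    std::thread t3(bump_safe), t4(bump_safe);
    t3.join(); t4.join();

    // 'counter' will usually come out below 200000, and differ run to run.
    std::printf("unsafe: %d   safe: %d (expected 200000)\n",
                counter, safe_counter);
    return 0;
}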

Seeing as we're in the 90nm era and transistor size/stepping is only getting smaller on the horizon... Manufacturing yields have to be getting lower... Or does this only apply to clock frequency/raw overall performance?

Yields are low at the beginning of a process node's life. With time and practice, they climb up to be comparable to the mature yield rates of earlier nodes. At the same time, it takes ever more money and effort to achieve this, which is why fabs are multi-billion-dollar investments and you don't see a P4 being made in a garage.

Transistor size, unfortunately, cannot drop forever. With portions like gate dielectrics being shaved down to less than 5 atoms thick, we are running out of things to cut off. Within 20 years we will need a better way than what we do with silicon.
 
Thank you all, interesting reads from everyone who's posted so far.

But I have to reiterate my original question about SLI/Crossfire/multi-GPU solutions used in PC architecture, specifically how they compare with the R500 and its stand-out unified memory & unified shader architectures used in the '360 (and the PS3's graphics subsystem would be nice too :)).

I'm not 100% sure and don't know the facts and figures, but common sense tells you that a dual-core graphics board (with a framebuffer for each of the two GPUs) will have slightly less latency than a dual-graphics-board setup, but even with that solution one will not see "double the FPS" or anything like it, even when coupled with the latest whiz-bang CPU and system memory. Is it just a case of current PC games (such as F.E.A.R., Quake 4, CoD2 etc.) being so CPU-bound that the vertices the GPU(s) need to render aren't being fed fast enough by the CPU, with execution time eaten up by AI & physics calculations, malware infestation etc... Or is it that game engines are not being optimised correctly for multi-GPU? Carmack talks about how one can get "all the transistors firing", but are they? As I said, rumour on the grapevine is that Unreal 3.0 takes more specific advantage of multi-GPU...
 
No EDIT function, sorry:-

In short, do the same core architecture principles apply to the GPU, specifically the modern 'programmable' (implying that it is general-purpose) GPU? Do the same principles of multi-core CPU design apply when developing for & utilising multi-GPU?
 
Ken2012 said:
Do the same principles of multi-core CPU design apply when developing for & utilising multi-GPU?
No, a GPU (even a "general-purpose" GPU) functions very differently to a CPU, and has a tremendous number of restrictions on what kind of code it is able to run, and particularly run efficiently. A GPU is primarily designed to fill polygons, after all.
 
Well, that's what I've always assumed: that the GPU will only 'do what it's told to do', so to speak, by the programmer during development or by the CPU at runtime.

I must ask, although the Internet is a wonderful source of information, can anyone recommend any books that cover any or all of these subjects? I've just ordered a copy of Real-Time Rendering, Third Edition off amazon.co.uk; hopefully this will aid me in my quest to gain a deeper knowledge of this stuff, at least in the graphics department.
 
I suppose you could get information to your heart's content over at www.gpgpu.org, if you're able to program yourself. :) It seems to be THE resource for non-3D-rendering information/programming of GPUs on the web. There might also be some info on NV and ATi's developer sites, but gpgpu.org is probably your best option, seeing as it's a free resource.
 
Guden Oden said:
I suppose you could get information to your heart's content over at www.gpgpu.org, if you're able to program yourself. :) It seems to be THE resource for non-3D-rendering information/programming of GPUs on the web. There might also be some info on NV and ATi's developer sites, but gpgpu.org is probably your best option, seeing as it's a free resource.

No, I meant the previous topics I've been talking about, other than the thing about using a GPU for non-rendering :); CPU & GPU design, architecture and implementation, how it all ties together in running a real-time game environment. And no, I'm not a programmer. I simply have a passion for a) games and b) the technology they use/run on, and wish to know everything there is to know, or something similar ;).
 