Can the PC compete with Next Gen consoles?

Well, I think it is. There is not much that has to be changed on the PS3 to get your stuff up and running, except for rewriting the functions that handle the calculations, splitting them into batches and recompiling everything. With the Xbox 360, there are a whole lot of things that have to be changed, and it all has to run multi-threaded in an efficient way. While I suppose that most things will run on an Xbox 360 after some minor changes and recompiling, you would only use one core and no VMX units when you do that.

And I think that is more or less what happened at E3.
 
DiGuru said:
Well, I think it is. There is not much that has to be changed on the PS3 to get your stuff up and running, except for rewriting the functions that handle the calculations, splitting them into batches and recompiling everything. With the Xbox 360, there are a whole lot of things that have to be changed, and it all has to run multi-threaded in an efficient way. While I suppose that most things will run on an Xbox 360 after some minor changes and recompiling, you would only use one core and no VMX units when you do that.

And I think that is more or less what happened at E3.

Oh I wasn't disagreeing!
 
london-boy said:
Killer-Kris said:
suryad said:
I was reading that the PS3 supposedly has 2.18 TFLOPS of processing power. I doubt our top-of-the-line P4s and GPUs combined can even break the 1 TFLOPS barrier!!! I think PCs have their work cut out for them!! I ran the Java applet Linpack benchmark, and my CPU, which is a P4 Prescott @ 3.4 GHz in my laptop, can barely hit 220 GFLOPS IIRC.

But don't forget that if you throw even moderately branchy code at the PS3, even Prescott will look like an incredibly good design.

Sure? I thought Cell could handle branching fairly well. I don't know, genuine question.

Well I could be wrong, and there are probably ways around it, but being in-order and lacking branch prediction definitely hurts. If your code is branchy enough to show a difference in performance between a P4 and an A64, the gap between those and Cell is going to be absolutely enormous.

Though that is all conjecture, based on Cell being in-order, having no branch prediction (actually it might; I was just assuming, since they aimed to keep the design simple), and that comment from a developer about some code running at a fraction of the speed it does on PC CPUs.
 
Killer-Kris said:
Well I could be wrong, and there are probably ways around it, but being in-order and lacking branch prediction definitely hurts. If your code is branchy enough to show a difference in performance between a P4 and an A64, the gap between those and Cell is going to be absolutely enormous.

Though that is all conjecture, based on Cell being in-order, having no branch prediction (actually it might; I was just assuming, since they aimed to keep the design simple), and that comment from a developer about some code running at a fraction of the speed it does on PC CPUs.

If the processor itself doesn't do branch prediction, you leave it to the compiler. Generally, it means that the CPU assumes that loops are taken and that branches are not. The compiler will generate code accordingly, unrolling loops and restructuring branches where needed.

Branch prediction on Pentium-class CPUs is mostly important to speed up legacy code that isn't compiled with that in mind. If the compiler handles it, you get nearly the same hit/miss ratio as when the CPU does the same thing.

And having only a single CPU core without branch prediction far outweighs the stalling and synchronizing of multiple threads running on multiple CPUs.

Edit: the same goes for in-order processing; lots of (mostly RISC) CPUs leave that to the compiler as well.
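To make that concrete, here is a rough sketch of what compiler-side branch handling looks like from the programmer's end. This assumes a GCC-style compiler; __builtin_expect is a real GCC/Clang extension, while the function itself is just an invented example:

```cpp
// Sketch of compiler-assisted branch handling (GCC/Clang style hints).
// The hint tells the compiler which way a branch usually goes, so it can
// lay the common path out as straight-line code and keep the cold path
// out of the way, instead of relying on hardware prediction.
#include <cstddef>

#define LIKELY(x)   __builtin_expect(!!(x), 1)
#define UNLIKELY(x) __builtin_expect(!!(x), 0)

double sum_valid(const double* data, std::size_t n) {
    double sum = 0.0;
    for (std::size_t i = 0; i < n; ++i) {   // loop branch: assumed taken
        if (UNLIKELY(data[i] < 0.0))        // hint: the bad-value case is rare
            continue;
        sum += data[i];
    }
    return sum;
}
```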
 
DiGuru said:
Killer-Kris said:
Well I could be wrong, and there are probably ways around it, but being in-order and lacking branch prediction definitely hurts. If your code is branchy enough to show a difference in performance between a P4 and an A64, the gap between those and Cell is going to be absolutely enormous.

Though that is all conjecture, based on Cell being in-order, having no branch prediction (actually it might; I was just assuming, since they aimed to keep the design simple), and that comment from a developer about some code running at a fraction of the speed it does on PC CPUs.

If the processor itself doesn't do branch prediction, you leave it to the compiler. Generally, it means that the CPU assumes that loops are taken and that branches are not. The compiler will generate code accordingly, unrolling loops and restructuring branches where needed.

Branch prediction on Pentium-class CPUs is mostly important to speed up legacy code that isn't compiled with that in mind. If the compiler handles it, you get nearly the same hit/miss ratio as when the CPU does the same thing.

And having only a single CPU core without branch prediction far outweighs the stalling and synchronizing of multiple threads running on multiple CPUs.


If all that were true and compilers could truly do it better than hardware, we'd all have switched to Itanium long ago, but we all know how that turned out.

On the issue of branch prediction being for legacy code, I think more than a few (hundred) Intel and AMD engineers would strongly disagree. The few times I've discussed the issue of compilers taking over for hardware with Intel engineers I've been just about laughed out of the room. Supposedly there are all sorts of dynamic optimizations that can only occur at runtime.

Of course profiling helps make up for that, and is actually quite applicable in a console or embedded environment where you know exactly what your data is going to be and when.

But overall I think I can stand by my statement that on branchy code Cell will make Prescott look like a really good design. On the flip side, when you get into raw floating-point processing, Cell will be nearly unstoppable.
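A rough sketch of the two extremes in plain C++ (not SPU code, just an illustration): the first loop is the branchy, data-dependent kind that punishes an in-order core with weak prediction; the second is the straight-line multiply-add streaming that a design like Cell is built around:

```cpp
#include <cstddef>

// Branchy, data-dependent control flow: hard on an in-order core
// with no (or only static) branch prediction.
float sum_above_threshold(const float* a, std::size_t n, float threshold) {
    float sum = 0.0f;
    for (std::size_t i = 0; i < n; ++i) {
        if (a[i] > threshold)        // unpredictable branch per element
            sum += a[i];
    }
    return sum;
}

// Straight-line multiply-add over a big batch: the kind of raw floating-point
// streaming where a wide SIMD design should be nearly unstoppable.
void scale_and_accumulate(float* dst, const float* src, float k, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i)
        dst[i] += k * src[i];        // no data-dependent branches
}
```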

Edit: the same goes for in-order processing; lots of (mostly RISC) CPUs leave that to the compiler as well.

I was pretty certain that the only remaining in-order processors were the Sun SPARC family, the various VIA processors, and then the enormous embedded market.

The big two, x86 and Power, have been out-of-order for a while now, I thought.
 
Killer-Kris said:
If all that were true and compilers could truly do it better than hardware, we'd all have switched to Itanium long ago, but we all know how that turned out.

I don't think that has much, if anything, to do with Itanium being unsuccessful.

On the issue of branch prediction being for legacy code, I think more than a few (hundred) Intel and AMD engineers would strongly disagree. The few times I've discussed the issue of compilers taking over for hardware with Intel engineers I've been just about laughed out of the room. Supposedly there are all sorts of dynamic optimizations that can only occur at runtime.

Of course profiling helps make up for that, and is actually quite applicable in a console or embedded environment where you know exactly what your data is going to be and when.

But overall I think I can stand by my statement that on branchy code Cell will make Prescott look like a really good design. On the flip side, when you get into raw floating-point processing, Cell will be nearly unstoppable.

It's a gradual thing and it depends on the construction and depth of the pipeline. It's not that big a deal if you miss even 10% more branches than a CPU that does branch prediction, especially if your pipeline is only a third or a quarter as deep. That would make the processor, what, less than 1% slower overall?

It would be different if the compiler did nothing and every branch stalled the pipeline regardless. But that isn't what happens.
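As a back-of-the-envelope check of that claim (every number below is an illustrative assumption, not a measured figure for any real CPU), the extra cost works out roughly like this:

```cpp
// Back-of-the-envelope estimate of static (compiler) vs. dynamic (hardware)
// branch prediction. All values are illustrative assumptions.
#include <cstdio>

int main() {
    const double branch_fraction = 0.20;  // assumed share of instructions that are branches
    const double extra_misses    = 0.005; // assumed extra mispredict rate (0.5 percentage points)
    const double penalty_cycles  = 7.0;   // assumed flush cost of a short pipeline
    const double baseline_cpi    = 1.0;   // assumed cycles per instruction otherwise

    const double extra_cpi = branch_fraction * extra_misses * penalty_cycles;
    std::printf("estimated slowdown: %.2f%%\n",
                100.0 * extra_cpi / (baseline_cpi + extra_cpi));  // prints roughly 0.70%
    return 0;
}
```

With a deeper pipeline or a bigger gap in miss rate the figure grows quickly, so the assumption about pipeline depth is doing most of the work here.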

Edit: the same goes for in-order processing; lots of (mostly RISC) CPUs leave that to the compiler as well.

I was pretty certain that the only remaining in-order processors were the Sun SPARC family, the various VIA processors, and then the enormous embedded market.

The big two, x86 and Power, have been out-of-order for a while now, I thought.

Yes, the big two, which have to be faster at running that huge amount of legacy code, have done all those optimizations themselves. Every new generation had to run those same programs faster, and they still use the same instruction set; there is no way that can be changed. Look at Itanium, for example. For most other platforms, the cost in transistors doesn't offset the marginal gain over a good compiler.
 
DiGuru said:
Killer-Kris said:
If all that were true and compilers could truly do it better than hardware, we'd all have switched to Itanium long ago, but we all know how that turned out.

I don't think that has much, if anything, to do with Itanium being unsuccessful.

I wasn't trying to point out its failure to sell widely, but that it has difficulty performing well in most apps, especially the lackluster non-FP performance.

This is pretty endemic to CPUs that depend on the compiler to do a fair amount of the work. The two big examples are Itanium and Crusoe; both can have good performance, but both usually perform far worse than their out-of-order, branch-predicting counterparts.

And if modern compilers truly do as much work as you seem to think, then performance should be pretty even between similarly clocked VIA CPUs and Athlons/Pentiums. Instead, clock for clock, VIA performs miserably, mainly due to the lack of decent branch prediction.

Edit: the same goes for in-order processing; lots of (mostly RISC) CPUs leave that to the compiler as well.

I was pretty certain that the only remaining in-order processors were the Sun SPARC family, the various VIA processors, and then the enormous embedded market.

The big two, x86 and Power, have been out-of-order for a while now, I thought.

Yes, the big two, which have to be faster at running that huge amount of legacy code, have done all those optimizations themselves. Every new generation had to run those same programs faster, and they still use the same instruction set; there is no way that can be changed. Look at Itanium, for example. For most other platforms, the cost in transistors doesn't offset the marginal gain over a good compiler.

Well, for the embedded market there are other factors that play into which features make it into the CPUs. They can reach adequate performance without branch prediction and out-of-order execution, and by cutting those features they also improve power consumption.

VIA's CPUs just have flat-out horrible performance per clock. And they get to use the exact same compilers that Intel and AMD use. So the differences here tend to come down to better branch prediction and out-of-order execution (ignoring FP, of course).

And Sun, in my opinion, is taking a very good approach with their new Niagara design. The SPARC family, on the other hand, has had downright miserable performance compared to the competition at the same clock speeds. So once again a "good" compiler was only serving to keep them barely competitive (179.art).
 
Killer-Kris said:
VIA's CPUs just have flat-out horrible performance per clock. And they get to use the exact same compilers that Intel and AMD use. So the differences here tend to come down to better branch prediction and out-of-order execution (ignoring FP, of course).

Why? Athlons and Pentium M's perform about twice as well as a Pentium 4, clock for clock. And if those VIA chips use the same compilers, that means they use unoptimized ones, because for Athlons and Pentiums you don't have to do that kind of optimization, or only to a much lesser extent.

The VIAs don't do CMOV, and that is the main difference in performance, although other things like the smaller cache and weaker branch prediction also contribute to their poorer performance.
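For what it's worth, the kind of branch CMOV is meant to remove is the simple select. A small sketch: the ternary form is the sort of thing a compiler targeting a CMOV-capable CPU can turn into a conditional move, while a core without CMOV has to take a real branch either way:

```cpp
// Branchy version: on unpredictable data, every wrong guess costs a pipeline flush.
int clamp_branchy(int x, int limit) {
    if (x > limit)
        return limit;
    return x;
}

// Select version: a compiler targeting a CMOV-capable CPU can emit a
// conditional move for the ternary, leaving nothing to mispredict.
int clamp_select(int x, int limit) {
    return (x > limit) ? limit : x;
}
```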
 
DiGuru said:
Killer-Kris said:
VIA's CPUs just have flat-out horrible performance per clock. And they get to use the exact same compilers that Intel and AMD use. So the differences here tend to come down to better branch prediction and out-of-order execution (ignoring FP, of course).

Why? Athlons and Pentium M's perform about twice as well as a Pentium 4, clock for clock.

But I think the big difference between the VIA chips and the Pentium 4 is that the P4 can run at a high enough clock speed to offset its IPC disadvantage. I guess I wasn't very clear about the VIA chips: a 1GHz VIA processor performs more like a 300MHz PII. And that is simply unacceptable, mainly due to poor branch prediction, lack of OOE, and, as you said, the lack of the CMOV instruction (see, I learned something new today).

And if those VIA chips use the same compilers, that means they use unoptimized ones, because for Athlons and Pentiums you don't have to do that kind of optimization, or only to a much lesser extent.

As far as the compilers go, I was mainly pointing at the loop unrolling and predetermined branching (or branch hints, if VIA supports them).

But yeah, you are right, VIA will be hurt by the assumptions about caches, instruction latencies and whatnot that are made for Pentium and Athlon. But on the issue of branching it gets most of the same optimizations, I imagine.




Now don't get me wrong on all of this, I actually feel much the same way that you do: compilers can lift much of the burden that hardware would otherwise be carrying. Combined with a JIT compiler, they can perform many of the dynamic optimizations that are currently the realm of hardware only. It's just that in practice it doesn't seem to work that way. (Edit: And that's actually what I'm hoping to do my graduate research on in another year or so, to see if I can help make theory match reality a little more closely.)

And back on topic: in the console world, Cell will probably handle branching just fine, because you can use profiling on your games and it will work fairly accurately. But if you are talking about moving a Cell processor into a more general-purpose computer, you'll likely begin to see some faults.
 
DiGuru said:
Well, I think it is. There is not much that has to be changed on the PS3 to get your stuff up and running, except for rewriting the functions that handle the calculations, splitting them into batches and recompiling everything. With the Xbox 360, there are a whole lot of things that have to be changed, and it all has to run multi-threaded in an efficient way. While I suppose that most things will run on an Xbox 360 after some minor changes and recompiling, you would only use one core and no VMX units when you do that.

And I think that is more or less what happened at E3.

Great post until the bold. Considering (1) Sony confirmed that the game footage for the most part was prerendered "to spec" and not running on any hardware in real time for what we saw, and (2) the Xbox 360 games seem to have been run on significantly underpowered dev units compared to the SLI rigs Sony devs have, it would seem that the CPU argument you made will have to wait for another day.

Not that CELL is bad or anything, but comparing tech demos (MS had none) and rendered game footage to actual games on dev kits is silly. If you want to make a comparison, compare UT2007 to Gears of War. Same engine, different but similar games, same results.
 
Geeforcer said:
It will take as little as a couple of months for console GPUs to be surpassed by PC chips
Only on paper though, not in reality. There is still hardly any software that flexes the shaders of current graphics cards; the most shader-intensive game released so far is Doom 3, and it runs on DX7-class hardware with almost full image quality. The poor integration in PCs also hampers performance; look at the amount of code that has to run just to draw a single polygon on the screen, it's silly.

Besides, the PC gfx cards that beat next-gen consoles on a raw specs basis will all cost more than an entire console, sort of a Pyrrhic victory, wouldn't you say?

and a year or two before their CPUs are surpassed as well.
No fricken way. Cell alone is around 10x more powerful than the fastest desktop CPU we have today. It could take half a decade until a desktop chip MATCHES Cell at this rate, let alone exceeds it. Then there's the issue of efficiency too, which is typically poor on PCs. It's HARD to write code that comes close to fully utilizing the processing power of the hardware, and memory bandwidth is a big issue. Cell has roughly 4x the bandwidth of a dual-channel PC400 memory subsystem just from main RAM, plus totally mindboggling bandwidth on-chip: ~2.8TB/s from the SPU SRAMs, plus another ~360GB/s from the SPU register files, more if there are multiple ports to the file, which I sort of assume there are, but I'm not totally sure.

So NO, no fricken PC will surpass Cell in a year, don't be SILLY. :p
 
Killer-Kris said:
But I think the big difference between the VIA chips and the Pentium 4 is that the P4 can run at a high enough clock speed to offset its IPC disadvantage. I guess I wasn't very clear about the VIA chips: a 1GHz VIA processor performs more like a 300MHz PII. And that is simply unacceptable, mainly due to poor branch prediction, lack of OOE, and, as you said, the lack of the CMOV instruction (see, I learned something new today).

Yes, the VIA is an old design (Cyrix), so it is at quite a disadvantage. And in a way you are right that branching is the single most important problem for CPUs that have deep pipelines. The CMOV (conditional move) instruction was introduced specifically to remove as many branches from the code as possible. So the VIA is mostly so much slower because it runs code that contains many more branches.

Although it isn't as bad when running actual applications as the benchmarks make it out to be. Those benchmarks are too specific. And the close integration with VIA chipsets improves their bandwidth and latency quite a bit. They are quite fast in moving data around.

As far as the compilers go, I was mainly pointing at the loop unrolling and predetermined branching (or branch hints, if VIA supports them).

But yeah, you are right, VIA will be hurt by the assumptions about caches, instruction latencies and whatnot that are made for Pentium and Athlon. But on the issue of branching it gets most of the same optimizations, I imagine.

I think it depends on the compiler. While older compilers tended towards very compact code that consisted largely of branches and loops, most newer ones tend to generate longer code with fewer branches, because branching is costly in any case on CPUs with deep pipelines. And most general applications don't use CMOV, to stay compatible with older processors. So with those applications the difference in speed is far smaller.

The use of CMOV is generally what is meant by "optimized for Pentium processors".

Now don't get me wrong on all of this, I actually feel much the same way that you do: compilers can lift much of the burden that hardware would otherwise be carrying. Combined with a JIT compiler, they can perform many of the dynamic optimizations that are currently the realm of hardware only. It's just that in practice it doesn't seem to work that way. (Edit: And that's actually what I'm hoping to do my graduate research on in another year or so, to see if I can help make theory match reality a little more closely.)

Actually, for things like SPARC and ARM, it works pretty well to have the compiler do it. And they have other optimizations as well. For example, a SPARC has had lots of registers and something akin to Hyperthreading for as long as it has existed. That makes it easy to generate optimal code.

It's mainly that Pentium-class processors have to make do with a very old instruction set and only a few registers, which were never designed to run multiple processes optimally on deep pipelines. Therefore, the actual processor core nowadays runs a totally different RISC instruction set with lots of registers. It is quite comparable to a generic RISC processor, like a SPARC, running a Java (i386, actually) p-code interpreter. So you have to optimize quite a bit, and do it on chip, as the compiler has no way to generate truly "native" code.

:D

And back on topic: in the console world, Cell will probably handle branching just fine, because you can use profiling on your games and it will work fairly accurately. But if you are talking about moving a Cell processor into a more general-purpose computer, you'll likely begin to see some faults.

Let us know how your graduate research turns out. ;)
 
Acert93 said:
DiGuru said:
And I think that is more or less what happened at E3.

Great post until the bold. Considering (1) Sony confirmed that the game footage for the most part was prerendered "to spec" and not running on any hardware in real time for what we saw, and (2) the Xbox 360 games seem to have been run on significantly underpowered dev units compared to the SLI rigs Sony devs have, it would seem that the CPU argument you made will have to wait for another day.

Not that CELL is bad or anything, but comparing tech demos (MS had none) and rendered game footage to actual games on dev kits is silly. If you want to make a comparison, compare UT2007 to Gears of War. Same engine, different but similar games, same results.

Well, quite a bit of the PS3 stuff actually was real time. And don't you think Microsoft would have shown something better if they could? While they might not have had much actual hardware, I'm pretty sure they would have shown something if it had looked better than the stuff from the dev boxes. I just think running it on the actual hardware didn't make enough of a difference, or even made it look worse, so it was more convenient for them to only show the stuff that ran on inferior dev boxes and be able to say that it will look much better on an Xbox 360.
 
DiGuru said:
It's a gradual thing and it depends on the construction and depth of the pipeline. It's not that big a deal if you miss even 10% more branches than a CPU that does branch prediction, especially if your pipeline is only a third or a quarter as deep. That would make the processor, what, less than 1% slower overall?

But it's not like branch prediction triples the number of pipeline stages, is it? And it's not like CELL's SPUs are short-pipelined, is it?

I mean, a 20-cycle mispredict penalty can really cost you when your branch prediction scheme degenerates, regardless of whether it's static (compiler) or dynamic.

Cheers
Gubbi
 
Gubbi said:
DiGuru said:
It's a gradual thing and it depends on the construction and depth of the pipeline. It's not that big a deal if you miss even 10% more branches than a CPU that does branch prediction, especially if your pipeline is only a third or a quarter as deep. That would make the processor, what, less than 1% slower overall?

But it's not like branch prediction triples the number of pipeline stages, is it? And it's not like CELL's SPUs are short-pipelined, is it?

I mean, a 20-cycle mispredict penalty can really cost you when your branch prediction scheme degenerates, regardless of whether it's static (compiler) or dynamic.

Cheers
Gubbi

Absolutely. And a cache miss is pretty bad as well. Switching process contexts is also quite bad. Interrupts take hundreds of cycles before the CPU even starts responding to them. Moving large amounts of data around is really bad. Talking with other devices over a bus is extremely slow. And a page miss will stall everything for millions of cycles before the page is retrieved.

In general, the moment your code and data sets use any branching, don't completely fit inside the cache, or consist of multiple processes, you're going to waste quite a few of the available cycles: anywhere from a few percent up to more than 99% of them in the worst case.

I mean, it's not as if the CPU is actually using each clock pulse to do meaningful work in the first place. It's only with games that are free-running (not capped) that much of the capacity is actually used. But even there, and even when running everything from RAM, it is hurry-up-and-wait for the CPU. And whenever your frame rate goes above 60 fps, you might wonder whether the rest isn't being wasted as heat as well.
 
Guden Oden said:
So NO, no fricken PC will surpass Cell in a year, don't be SILLY. :p

Depends on the workload, no?

Decoding MPEG2 streams: CELL wins hands down.

Executing Perl spaghetti: P4/Athlon wins.

Cheers
Gubbi
 
DiGuru said:
Gubbi said:
DiGuru said:
It's a gradual thing and it depends on the construction and depth of the pipeline. It's not that big a deal if you miss even 10% more branches than a CPU that does branch prediction, especially if your pipeline is only a third or a quarter as deep. That would make the processor, what, less than 1% slower overall?

But it's not like branch prediction triples the number of pipeline stages, is it? And it's not like CELL's SPUs are short-pipelined, is it?

I mean, a 20-cycle mispredict penalty can really cost you when your branch prediction scheme degenerates, regardless of whether it's static (compiler) or dynamic.

Cheers
Gubbi

Absolutely. And a cache miss is pretty bad as well. Switching process contexts is also quite bad. Interrupts take hundreds of cycles before the CPU even starts responding to them. Moving large amounts of data around is really bad. Talking with other devices over a bus is extremely slow. And a page miss will stall everything for millions of cycles before the page is retrieved.

So the solution CELL provides is to ignore all these (for the SPEs)?
1. No caches
2. Massive context.
3. No interrupts.
4. No branch prediction.
5. Non-coherent memory model.

.... Make the core as small as possible and compensate with quantity, and leave all the smarts to the developer.

Cheers
Gubbi
 
Gubbi said:
So the solution CELL provides is to ignore all these (for the SPEs)?
1. No caches
2. Massive context.
3. No interrupts.
4. No branch prediction.
5. Non-coherent memory model.

.... Make the core as small as possible and compensate with quantity, and leave all the smarts to the developer.

Cheers
Gubbi

Exactly. Most of the bottlenecks on current CPUs could be prevented by clever multi-threading. Only the sequential model suffers from most of them. And in that context, something like Windows and just about all applications are serialized as far as the user is concerned. Not that it matters much for just about anything but video and 3D.

The other problems that generate immediate flushes / stalls can most easily be solved by using the transistors for more parallel units, instead of caches, out-of-order execution and branch prediction. Only the system management should run as a single stream, split into multiple processes. And for a console, that means a single controller and as many simple, parallel number crunchers as you can get. Although you do want those independent cores to have some generalized program-flow capability, so they can function on their own.

Not the most efficient as far as interruptions of program flow are concerned, but the IPC and throughput are much higher than with any sequential or multi-threaded model that tries to minimize wasted cycles, for the same number of transistors.
 
DiGuru said:
Gubbi said:
So the solution CELL provides is to ignore all these (for the SPEs)?
1. No caches
2. Massive context.
3. No interrupts.
4. No branch prediction.
5. Non-coherent memory model.

.... Make the core as small as possible and compensate with quantity, and leave all the smarts to the developer.

Cheers
Gubbi

Exactly. Most of the bottlenecks on current CPUs could be prevented by clever multi-threading. Only the sequential model suffers from most of them. And in that context, something like Windows and just about all applications are serialized as far as the user is concerned. Not that it matters much for just about anything but video and 3D.

I fail to see how CELL is a solution to this. The SPEs have massive contexts, and next to nothing in the way of virtualization (at least not in hardware; again, software must provide a solution). CELL is made to run with a fixed set of threads at any one time, and you're f*cked if you need more. This works well for a game console where you're only running one big app, but it's useless in a multiprocessing OS (like a PC running Windows/Linux/whatever). It's similar (or actually even worse) in a server environment.

DiGuru said:
The other problems that generate immediate flushes / stalls can most easily be solved by using the transistors for more parallel units, instead of caches, out-of-order execution and branch prediction.
Strongly disagree.

Now you have more units that stall. And they will stall more often, because you took out all the smarts that lowered apparent latency...

...And they will flush more often, because they are sent on a wild goose chase more often by a non-existent branch predictor.

Cheers
Gubbi
 
Gubbi said:
I fail to see how CELL is a solution to this. The SPEs have massive contexts, and next to nothing in the way of virtualization (at least not in hardware; again, software must provide a solution).

Batch processing. No cache, but a bit of full-speed local memory, like a very large register file. So, no cache stalls. And the instruction set is (almost surely) optimized for running a small (SIMD) program on a small data set and generating a new set as the result.

So, it is like a one-shot deal: you load it up, execute, and collect the results. And it has just enough general-purpose and flow-control logic to do all that by itself.

The branches and loops are (almost surely) quite specific, and they are there for the batch management and the implementation of more complex (mathematical) functions, not to implement any arbitrary operation that isn't directly related to the optimal functioning of the device itself.

Having dynamic branches is again mostly just a way to save transistors by not having to implement any possible (and redundant) function directly in silicon.

CELL is made to run with a fixed set of threads at any one time, and you're f*cked if you need more. This works well for a game console where you're only running one big app, but it's useless in a multiprocessing OS (like a PC running Windows/Linux/whatever). It's similar (or actually even worse) in a server environment.

Again, batches. Instead of having multiple arbitrary processes run concurrently, you split the computations into batches and throw them at the first unit that is available. Collect and return the results when done. This is much better than the normal multi-threaded model, in that it eliminates the two largest problems with it: deadlocks on data and synchronization stalls.

The reason each unit can have concurrent threads is mostly to hide large latencies. As long as they can store the state of enough threads to minimize that latency (taking context switching into account), no more is needed, because those threads will only run for a limited time, finish, and can be discarded afterwards.

It's a more general solution than the normal multi-threading, multi-processor model used by most popular operating systems. And there are many things that could benefit from that: not servers, but most applications that need some serious horsepower.

You do have to use another kind of OS, but I think a Linux variant could run very well and fast, if the libraries are changed accordingly.
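A minimal sketch of that batch model in plain, portable C++ (nothing Cell-specific; std::async just stands in for "the first available unit", where on Cell the batch would be DMA'd into an SPE's local store instead):

```cpp
#include <algorithm>
#include <cstddef>
#include <future>
#include <numeric>
#include <vector>

// Each batch is self-contained: it owns its data, runs to completion,
// and hands back a single result. No shared state, so no locks or deadlocks.
double process_batch(std::vector<double> batch) {
    return std::accumulate(batch.begin(), batch.end(), 0.0);
}

// Chop the work into batches, throw each one at whichever worker is free,
// then collect the results when they are done.
double run_batches(const std::vector<double>& data, std::size_t batch_size) {
    std::vector<std::future<double>> pending;
    for (std::size_t i = 0; i < data.size(); i += batch_size) {
        std::vector<double> batch(data.begin() + i,
                                  data.begin() + std::min(i + batch_size, data.size()));
        pending.push_back(std::async(std::launch::async, process_batch, std::move(batch)));
    }
    double total = 0.0;
    for (auto& f : pending)          // collect and combine the results
        total += f.get();
    return total;
}
```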

Strongly disagree.

Now you have more units that stall. And they will stall more often, because you took out all the smarts that lowered apparent latency...

...And they will flush more often, because they are sent on a wild goose chase more often by a non-existent branch predictor.

Cheers
Gubbi

Only the one that is used for system management and running the program logic. Everything else will be chopped up into objects (batches) and dispatched to the other units. You just (more or less) throw your C++ objects in a pool and forget about them.

So, what does it matter if the single Central Processing Unit that controls everything is running inefficiently? There isn't much work for it to do in the first place, other than waiting for events, reacting to them, and keeping all the others occupied.

:D
 