Most games are 80% general purpose code and 20% FP...

If you look at any application, you see that it consists of lots and lots of logic, error checking and correcting, to determine what actions have to be performed. It is like a tree that branches out to some relatively small functions (collecting the required parameters on the way). And then those functions actually do the work that has to be done.

In current games, it is the same. First you collect the states and inputs, fall very fast through a branch of the logic tree, start crunching away, and repeat until done.

So while most of the program consists of simple, general purpose (mostly integer) logic, the actual work is done in only a small piece of the code. For current 3D games, that work is mostly floating point operations over streams and data structures.

So when you want to optimize your program, you don't care at all about the efficiency of all that logic. Quite the opposite, actually: slower but more readable logic is much easier to understand and maintain. Instead, you spend your time trimming a few clocks off each iteration of those few core functions.
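A minimal sketch of that shape (all names here are invented for illustration): a frame's worth of branchy, readable dispatch that funnels into one tight loop where nearly all the cycles go.

[code]
#include <cstddef>
#include <vector>

struct Entity { bool alive, moving; float x, y, dx, dy; };

// The hot kernel: a tiny fraction of the code, almost all of the cycles.
void integrate(float* xs, float* ys, const float* dxs, const float* dys,
               std::size_t n, float dt) {
    for (std::size_t i = 0; i < n; ++i) {   // tight FP loop
        xs[i] += dxs[i] * dt;
        ys[i] += dys[i] * dt;
    }
}

// The "tree" of general purpose logic: branching, checking, parameter
// collection. Readability matters far more than speed here.
void update_world(std::vector<Entity>& world, float dt) {
    std::vector<float> xs, ys, dxs, dys;
    for (const Entity& e : world) {
        if (!e.alive) continue;          // error checking / filtering
        if (!e.moving) continue;         // more branching logic
        xs.push_back(e.x); ys.push_back(e.y);
        dxs.push_back(e.dx); dys.push_back(e.dy);
    }
    integrate(xs.data(), ys.data(), dxs.data(), dys.data(), xs.size(), dt);
    // ... scatter the results back, then AI, audio, networking, etc.
}
[/code]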
 
DiGuru said:
If you look at any application, you see that it consists of lots and lots of logic, error checking and correcting, to determine what actions have to be performed. It is like a tree that branches out to some relatively small functions (collecting the required parameters on the way). And then those functions actually do the work that has to be done.

In current games, it is the same. First you collect the states and inputs, fall very fast through a branch of the logic tree, start crunching away, and repeat until done.

So while most of the program consists of simple, general purpose (mostly integer) logic, the actual work is done in only a small piece of the code. For current 3D games, that work is mostly floating point operations over streams and data structures.

So when you want to optimize your program, you don't care at all about the efficiency of all that logic. Quite the opposite, actually: slower but more readable logic is much easier to understand and maintain. Instead, you spend your time trimming a few clocks off each iteration of those few core functions.


For the most part, that used to be true.

My recent experience tells me that it's getting less so. The problem is that even as little as five years ago, FP performance was expensive and the dominant cost. But as FP performance has increased, memory latencies have grown even faster. In most application code today, walking a data structure is MUCH more expensive than working on it.

Now, Cell does have an advantage here: it encourages splitting data into chunks and working on them in local memory, and you could set up the X360 with the same paradigm. However, this relies on restructuring your data not adding a lot of extra work to the process.

An example: a simple search algorithm.

I can write a red/black tree and search it in log N time, or I can trivially search every element of an array serially in N time. The latter algorithm is trivially done in stream processing, but at the cost of significantly more memory visits. However, since the memory accesses are serial, you actually get much better bandwidth to memory while searching. My guess is that on both Cell and X360 the second algorithm would actually be faster even for relatively large datasets (hundreds, possibly thousands of entries). At some point, though, the tree wins because it does less work.
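A sketch of the two approaches (using std::set as the red/black tree, since most standard library implementations use one internally):

[code]
#include <algorithm>
#include <set>
#include <vector>

// O(log N): each step is a dependent pointer chase, so every node
// visited is a potential cache miss that the CPU just waits on.
bool tree_search(const std::set<int>& s, int key) {
    return s.find(key) != s.end();
}

// O(N): more work and more memory touched, but the accesses are
// serial over contiguous memory, so the hardware can stream the
// array in at something close to peak bandwidth.
bool linear_search(const std::vector<int>& v, int key) {
    return std::find(v.begin(), v.end(), key) != v.end();
}
[/code]

The crossover point depends entirely on the memory system, which is exactly why profiling both is the only way to know.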

The problem is that for a lot of programmers the above is counterintuitive, and they will probably never profile both approaches. Running over complex data structures with the X360's processors is probably going to be faster.

Less and less of a game is really "optimised"; it simply can't be. I've worked on a couple of projects recently with >500K lines of code, and no one on the project will ever see all of it.

It's trivial to split any game engine into 2 threads (I've done it over a weekend), and it's relatively trivial to make graphics a parallel problem. Going beyond that is a hard problem.
 
True. But the problem and the solution are the same in both cases: reduce random memory access as much as possible and use your data stream to steer the logic, instead of the other way around. Which, I agree, is counterintuitive to most programmers.

But it has to be done anyway for multi-core processors and multi-threaded / stream centric applications. Random accesses to all data structures at random times are exactly why multi-threaded applications are so hard to get right. You cannot do that, as it will trash your game logic for all threads that depend on it.
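One way to picture "the data stream steering the logic" (a hedged sketch; the types and names are invented for illustration): each pass reads one contiguous stream of commands and branches on what it finds, rather than threads poking shared structures at random times.

[code]
#include <vector>

enum class Op { Move, Damage };
struct Cmd { Op op; int target; float amount; };
struct State { std::vector<float> x, health; };

// One serial pass over a contiguous command stream. The stream drives
// the branching; no other thread touches `s` while this pass runs.
void apply_stream(State& s, const std::vector<Cmd>& in) {
    for (const Cmd& c : in) {
        switch (c.op) {
        case Op::Move:   s.x[c.target]      += c.amount; break;
        case Op::Damage: s.health[c.target] -= c.amount; break;
        }
    }
}
[/code]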
 
Shifty Geezer said:
Perhaps nVidia were a good choice for integrated chipset? Plus the namesake?? :?

Probably part of the reason they went with Intel as well: they wanted someone who could supply enough chips and was willing to price them aggressively.
Nvidia also had more high end chips under its belt than PowerVR, and was actively working on a six month product cycle; PowerVR's next part may have been a while off.

I don't even remember hearing PowerVR being considered, though GigaPixel was...
 
DiGuru said:
True. But the problem and the solution are the same in both cases: reduce random memory access as much as possible and use your data stream to steer the logic, instead of the other way around. Which, I agree, is counterintuitive to most programmers.

But it has to be done anyway for multi-core processors and multi-threaded / stream centric applications. Random accesses to all data structures at random times are exactly why multi-threaded applications are so hard to get right. You cannot do that, as it will trash your game logic for all threads that depend on it.

Or, from a hardware standpoint: reduce cache latencies, and put in some mechanism to hide them.

Both MS and Sony have chosen to remove the mechanisms that hide these latencies in favor of better peak FP performance at a reasonable cost.

This may or may not be the right call.
 
ERP said:
Or, from a hardware standpoint: reduce cache latencies, and put in some mechanism to hide them.

Both MS and Sony have chosen to remove the mechanisms that hide these latencies in favor of better peak FP performance at a reasonable cost.

This may or may not be the right call.

But hiding the latencies does nothing for read/write locks, synchronization and serialization of your threads and streams. So I think what they did is the best way to go, as hiding the latencies doesn't help you use the other processor cores.
 
DiGuru said:
ERP said:
Or, from a hardware standpoint: reduce cache latencies, and put in some mechanism to hide them.

Both MS and Sony have chosen to remove the mechanisms that hide these latencies in favor of better peak FP performance at a reasonable cost.

This may or may not be the right call.

But hiding the latencies does nothing for read/write locks, synchronization and serialization of your threads and streams. So I think what they did is the best way to go, as hiding the latencies doesn't help you use the other processor cores.

Your opinion. Mine is that Sony is just continuing down the design path they started with the EE, and MS feels it has to follow to be competitive.

I don't think either party is basing its designs on what's best for the software.
 
ERP said:
Your opinion. Mine is that Sony is just continuing down the design path they started with the EE, and MS feels it has to follow to be competitive.

I don't think either party is basing its designs on what's best for the software.

So, what are the specs of the processor that you feel would be best?
 
DiGuru said:
But it has to be done anyway for multi-core processors and multi-threaded / stream centric applications. Random accesses to all data structures at random times are exactly why multi-threaded applications are so hard to get right. You cannot do that, as it will trash your game logic for all threads that depend on it.

Isn't the idea behind Cell that you won't have to worry about data locking?
Only the PPE would be used for memory management and running general purpose code.

Then you fire off computations to be run in parallel on the SPE array. The local memory for each SPE is only used as a sort of scratch pad.
All of that FP power would allow you to do some pretty advanced physics calculations.
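A conceptual sketch of that model (this is not the real Cell SDK API; dma_get/dma_put here are memcpy stand-ins for the MFC DMA transfers an SPE would actually issue):

[code]
#include <cstddef>
#include <cstring>

// Stand-ins for the DMA engine; on real hardware these would move data
// between main memory and the SPE's 256 KB local store.
static void dma_get(void* local, const void* main_mem, std::size_t bytes) {
    std::memcpy(local, main_mem, bytes);
}
static void dma_put(void* main_mem, const void* local, std::size_t bytes) {
    std::memcpy(main_mem, local, bytes);
}

// What one SPE job conceptually does: pull a chunk into the scratch
// pad, crunch it there, push the results back. No locks are needed,
// because each job owns its chunk exclusively while it runs.
void spe_job(const float* in, float* out, std::size_t n) {  // n <= 1024
    float local[1024];                       // "scratch pad" local store
    dma_get(local, in, n * sizeof(float));
    for (std::size_t i = 0; i < n; ++i)
        local[i] = local[i] * local[i];      // stand-in physics kernel
    dma_put(out, local, n * sizeof(float));
}
[/code]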
 
seismologist said:
Isn't the idea behind Cell that you won't have to worry about data locking?
Only the PPE would be used for memory management and running general purpose code.

Then you fire off computations to be run in parallel on the SPE array. The local memory for each SPE is only used as a sort of scratch pad.
All of that FP power would allow you to do some pretty advanced physics calculations.

Exactly.
 
DiGuru said:
seismologist said:
Isn't the idea behind Cell that you won't have to worry about data locking?
Only the PPE would be used for memory management and running general purpose code.

Then you fire off computations to be run in parallel on the SPE array. The local memory for each SPE is only used as a sort of scratch pad.
All of that FP power would allow you to do some pretty advanced physics calculations.

Exactly.

That makes sense then. So in your game code example you would traverse the list, grabbing all of the data, then burst it out to the SPE array? Sounds like it should work well.

Xenon is a different story though. I'm still having a hard time seeing the benefit there.
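A rough sketch of that gather-then-burst step (hypothetical types; the "burst" would be the DMA dispatch from the sketch above):

[code]
#include <vector>

struct Node { float value; Node* next; };

// One serial walk on the PPE turns the pointer-chased list into a flat,
// contiguous buffer that can then be burst out to the SPEs in chunks.
std::vector<float> gather(const Node* head) {
    std::vector<float> flat;
    for (const Node* n = head; n != nullptr; n = n->next)
        flat.push_back(n->value);
    return flat;
}
[/code]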
 
Apple was pissed with IBM because they wanted faster clock speeds and IBM said they couldn't... then IBM turned around and released 3.2 GHz PPCs to Sony and M$. IBM simply didn't make enough money from Apple to be concerned with them. It is also true that IBM could not fab enough.

As for the general purpose code thing from M$... heh, it's true that general purpose code is important for games; the problem is that M$ isn't any better at it than Sony...
 
seismologist said:
Isn't the idea behind Cell that you won't have to worry about data locking?
Only the PPE would be used for memory management and running general purpose code.

Then you fire off computations to be run in parallel on the SPE array. The local memory for each SPE is only used as a sort of scratch pad.
All of that FP power would allow you to do some pretty advanced physics calculations.

This is one (misguided, IMO) view of the SPEs. Everybody looks at the FLOPS too much. Just for a second, ignore the float units and look at Cell again. Cell has 8 processors, each one about 10 times faster than the PS2's main core. That's a lot of power!
Of course it's nice that there are also 9 float SIMD units at 3.2 GHz for when you want to burn some FLOPs, but you're not forced to use them.

Actually getting good performance out of the architecture is a separate issue.
 
It'll be nice when the first real snippets of code and algorithms for Cell appear, so people can get a handle on what it can and can't do. This idea that they're glorified FPUs is pervasive, but what they can and can't do is relatively unknown, at least to the masses. What sort of apps can an SPE run effectively? A typical Java or Macromedia Flash based web game, perhaps? A Spectrum emulator? Word or Excel? Or only FFTs and vertex transforms, data crunching processes to feed the PPE?
 
DeanoC said:
This is one (misguided, IMO) view of the SPEs. Everybody looks at the FLOPS too much. Just for a second, ignore the float units and look at Cell again. Cell has 8 processors, each one about 10 times faster than the PS2's main core. That's a lot of power!
Of course it's nice that there are also 9 float SIMD units at 3.2 GHz for when you want to burn some FLOPs, but you're not forced to use them.

Actually getting good performance out of the architecture is a separate issue.

Good point. But that is about what you use them for, physics and the like, and what kind of tasks you want them to do in general, if I get what you mean. It doesn't change what is probably the best strategy for getting the most out of them, whatever you want them to do. And I agree, it is a lot of power.

But I think that the current single-threaded paradigm calls for an extension, not a radically different way to program. And I think it might be easier to use the stream / micro-thread view than to try to break a current game loop into multiple independent threads.

While the latter might seem much easier at first, it doesn't change or solve anything; it mostly complicates matters. Using the PPU for management and game logic, and spawning streams to whatever other units are available, would probably be easier and "feel" much more like the way things work now, with the added benefit of a pool of speedy processors at your disposal, as custom as you want them to be.
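A rough sketch of that shape in present-day portable C++ (std::async standing in for SPE job dispatch): the management thread keeps the game logic serial and readable, and hands off independent stream jobs that each own their chunk of data.

[code]
#include <functional>
#include <future>
#include <vector>

int main() {
    std::vector<float> chunkA(1024, 1.0f), chunkB(1024, 2.0f);

    // A stream job: pure function of the chunk it owns, no shared state.
    auto kernel = [](std::vector<float>& v) {
        for (float& x : v) x = x * 0.5f + 1.0f;
    };

    // Spawn jobs on whatever units are free.
    auto a = std::async(std::launch::async, kernel, std::ref(chunkA));
    auto b = std::async(std::launch::async, kernel, std::ref(chunkB));

    // ... management / game logic continues serially here ...

    a.get();   // join only when the results are actually needed
    b.get();
}
[/code]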
 
FLOPS are virtually free on both CELL and the XeCPU, to the point where performance is going to be determined by other factors: control logic complexity, memory latency, etc.

It'll be interesting to see which is better: the traditional load/store architecture, caches and all, from M$, or the radically different Sony system that HAS to be programmed with coarser memory transactions.

Cheers
Gubbi
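A back-of-envelope illustration of why (the numbers are round assumptions for a ~3.2 GHz in-order core with a 4-wide FMA unit and main memory a few hundred cycles away, not vendor figures):

[code]
#include <cstdio>

int main() {
    const double flops_per_cycle = 4 * 2;   // 4 SIMD lanes, fused mul+add
    const double miss_cycles     = 500.0;   // one trip to main memory
    // Every cache miss forfeits on the order of 4000 single-precision
    // FLOPs, which is why the FLOPs themselves look "free".
    std::printf("FLOPs forfeited per miss: %.0f\n",
                flops_per_cycle * miss_cycles);
    return 0;
}
[/code]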
 
ERP/DeanoC - just curious - have you guys experimented with van Emde Boas layout, or min-max cost layout tailored to cache-line/page size?
 
Gubbi said:
FLOPS are virtually free on both CELL and the XeCPU, to the point where performance is going to be determined by other factors: control logic complexity, memory latency, etc.

It'll be interesting to see which is better: the traditional load/store architecture, caches and all, from M$, or the radically different Sony system that HAS to be programmed with coarser memory transactions.

Cheers
Gubbi

Think about this: there are a great many successful stream architectures in use at the moment (most at the API level), while symmetric multithreading architectures still suffer from a lot of serious problems.

But I'm really curious what paradigm will take off as well. This is an interesting time for that. :D
 
Gubbi said:
FLOPS are virtually free on both CELL and the XeCPU, to the point where performance is going to be determined by other factors: control logic complexity, memory latency, etc.

It'll be interesting to see which is better: the traditional load/store architecture, caches and all, from M$, or the radically different Sony system that HAS to be programmed with coarser memory transactions.

Cheers
Gubbi

Is this really true? There must be a point where computation becomes the bottleneck. Sure, that holds for modern day games where 80% of the interactions are scripted (i.e. general purpose code).
But how about when you're running a real time physics simulation of, say, a plane crashing through a building? Things might start to get bogged down a bit.
 
psurge said:
ERP/DeanoC - just curious - have you guys experimented with van Emde Boas layout, or min-max cost layout tailored to cache-line/page size?

I have... And I can think of a whole bunch of G4 programmers who have too :p The only difference is what drove them there... ;)
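For anyone following along, a small sketch of the recursive van Emde Boas numbering for a complete binary tree stored by BFS/heap index (root = 1, children of i at 2i and 2i+1); split conventions vary, this one halves the height:

[code]
#include <cstddef>
#include <vector>

// Split a tree of `height` levels into a top tree of height/2 levels
// and the bottom trees hanging off its leaves; lay out the top tree
// first, then each bottom tree, recursively. Nearby levels end up in
// nearby memory, so a root-to-leaf walk touches few cache lines.
void veb_layout(std::size_t root, int height,
                std::vector<std::size_t>& pos, std::size_t& next) {
    if (height == 1) { pos[root] = next++; return; }
    const int top_h = height / 2;            // upper half of the levels
    const int bot_h = height - top_h;        // lower half
    veb_layout(root, top_h, pos, next);      // top tree first
    const std::size_t first = root << top_h; // BFS roots of bottom trees
    for (std::size_t i = 0; i < (std::size_t{1} << top_h); ++i)
        veb_layout(first + i, bot_h, pos, next);
}

int main() {
    const int levels = 4;                    // a 15-node tree
    std::vector<std::size_t> pos(std::size_t{1} << levels);
    std::size_t next = 0;
    veb_layout(1, levels, pos, next);
    // pos[i] is now the slot of BFS node i in the vEB-ordered array.
}
[/code]

pos[i] gives each node's slot in the flat array, and the usual binary-search descent then touches only O(log_B N) cache lines for line size B, without the layout ever knowing what B is.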
 