Recently there has been a lot of talk about the XeCPU and CELL; prominent developers have expressed their thoughts, praise and criticism.
I don't want to trigger a new debate about in-order multi-core processors versus the rest of the CPU world, because I think most of us agree that these CPUs pose a whole new set of problems to solve.
What I would like to say is simply that problems and limits are often a means to spur creativity, and IMHO game developers should take this shift as an unexpected opportunity.
Developers face 3 problems:
1) What do we have to do to extract good performance from these new processors?
2) What are we going to do with all this computational power?
3) How can we solve all these problems and make good games in a tight timeframe?
In the past generation, problem 1 was often solved by giving developers some guidelines and tools, and most of the time a developer could simply forget what was going on under the hood (the PS2 is the exception that proves the rule...).
Once you solve problem 1, problems 2 and 3 are easy: just make the same stuff you were making last generation, but remember to multiply everything by a number greater than one (better if the multiplier is greater than 2 or 3).
Is there something wrong with this? No, IMO there isn't!
It's a common process, and I'm pretty confident more than 90% of first-generation XBOX360/PS3 titles have already taken this route.
But is this satisfying? No, it's not... there's a lot more work to do!
There should be a lot more innovation injected from both technical and gameplay standpoints... but we know producing efficient, fast code will now be a much harder task than before, and I'm sceptical that hordes of programmers who barely grasp CPU architecture will suddenly master the inner secrets of cache memories.
I have often argued in the past that I'm not that worried by these new CPUs, because I'm confident a lot of algorithms can be reimplemented to better fit the new architectures; nonetheless this can be a painful task, and it can take a lot of time too.
So how can we make this easier?
I don't think there is one clearly superior solution, but I believe we should try to work around these problems by abstracting the hardware the same way GPU manufacturers did with their programmable hardware: giving us programming languages that inherently hide everything the hardware can't do well or can't do at all.
Now, the XeCPU and CELL are much more flexible than current GPU architectures, but we can give up some of that flexibility and pretend this new hardware is far more limited; then we can design (or borrow...) a new language that fits this 'virtual' architecture (we wouldn't need to write a new compiler/linker toolchain; a preprocessor that translates our code to C/C++, or even ASM, would be enough).
This virtual architecture/programming model would let us assume a lot of things we usually can't assume, and would constrain even 'unaware' coders to design and shape their code to fit the new model.
They would have to worry much less about cache sizes, memory latencies, loop unrolling and the like; moreover, these self-imposed limits could make thread/data synchronization and management much simpler.
Obviously I'm thinking about streaming processors, and about how programmers can be pushed to adhere to a streaming programming style even if they have a C++ compiler in their hands and could potentially write any kind of code with it.
I'm aware this approach would be far from a panacea, but it could be a way to improve (any) programmer's productivity and code efficiency.
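Just to make this a bit more concrete before the example, here's a rough C++ sketch of what such a restricted streaming interface could look like once the preprocessor has done its job. Everything here is hypothetical (StreamKernel, run_stream and so on are made-up names, not any real API); the point is only that the coder writes a pure per-element kernel, and the runtime is free to decide batch sizes, prefetching and how batches are spread across cores.
///////////////////////////////////////////////////////////////////////////////////////////
// Hypothetical sketch of a restricted streaming programming model.
// The user only writes a stateless per-element kernel; the runtime decides
// how elements are batched, prefetched and distributed across cores.
#include <algorithm>
#include <cstddef>
#include <vector>

template <typename In, typename Out, typename Uniforms>
struct StreamKernel {
    // Pure function of one input element plus read-only uniforms:
    // no global writes, no pointer chasing, no cross-element dependencies.
    virtual Out run(const In& element, const Uniforms& uniforms) const = 0;
    virtual ~StreamKernel() = default;
};

// The runtime owns batching: it can split the input into cache/local-store
// sized chunks, stream in the next chunk while the current one is processed,
// and farm chunks out to however many cores are available.
template <typename In, typename Out, typename Uniforms>
std::vector<Out> run_stream(const StreamKernel<In, Out, Uniforms>& kernel,
                            const std::vector<In>& input,
                            const Uniforms& uniforms,
                            std::size_t batch_size = 1024)
{
    std::vector<Out> output;
    output.reserve(input.size());
    for (std::size_t start = 0; start < input.size(); start += batch_size) {
        const std::size_t end = std::min(start + batch_size, input.size());
        // In a real implementation this inner loop would run on a worker
        // core (e.g. an SPE) while the next batch is being streamed in.
        for (std::size_t i = start; i < end; ++i)
            output.push_back(kernel.run(input[i], uniforms));
    }
    return output;
}
///////////////////////////////////////////////////////////////////////////////////////////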
I know it's a long post, but I want to give at least one very simple example.
Since we are going to work with next-gen machines, we would like to impress our customers or publishers with something cool.
Let's say we have one million particles floating around some complicated geometry, like a room full of objects of different shapes, scales and positions.
We would like to check whether each particle is going to collide with some object next frame, and take some action accordingly (the particle could die, bounce, split, and so on...).
Many programmers would code something like this:
///////////////////////////////////////////////////////////////////////////////////////////
for each particle_in_the_room
{
    // this function checks if a particle will hit some object in our room
    collision_check = query_room_database(current_particle)
    if (collision_check.outcoming == true) then current_particle.explosion()
    else current_particle.update_position()
}
///////////////////////////////////////////////////////////////////////////////////////////
We have a loop with a lot of memory accesses (we have to load the current particle's data and search our room database) whose latency is not hidden at all (and we're not prefetching either...). Moreover, we have a conditional branch in our loop that probably behaves completely randomly; this is bad for branch prediction and even for branch hinting, since we have no code to execute between the hint and the branch, so there's no way to hide the branch misprediction penalty.
This piece of code is guaranteed to run very slowly on next-gen CPUs.
(and to be fair, I don't think it would run blazingly fast on out-of-order CPUs either...)
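For comparison, this is roughly the kind of hand-tuning the naive loop needs on a conventional toolchain. I'm sketching it with the GCC builtins (__builtin_prefetch and __builtin_expect); console compilers expose similar but different intrinsics, and all the types here are just stand-ins for the pseudocode above. It's exactly the sort of per-loop fiddling most programmers won't do, and it doesn't help much anyway when the branch is truly random.
///////////////////////////////////////////////////////////////////////////////////////////
#include <cstddef>

// Minimal stand-ins for the pseudocode's types (hypothetical, illustration only).
struct CollisionResult { bool outcoming; };
struct RoomDatabase;   // opaque spatial database of the room's geometry
struct Particle {
    void explosion();
    void update_position();
};
// Placeholder: the engine's room query from the example above.
CollisionResult query_room_database(const Particle& p, const RoomDatabase& room);

void update_particles(Particle* particles, std::size_t count, const RoomDatabase& room)
{
    for (std::size_t i = 0; i < count; ++i) {
        // Try to hide a little memory latency by prefetching the next particle
        // while we work on the current one.
        if (i + 1 < count)
            __builtin_prefetch(&particles[i + 1]);

        const CollisionResult hit = query_room_database(particles[i], room);

        // Hint that collisions are rare; this only helps if the branch really
        // is biased, which for randomly moving particles it usually is not.
        if (__builtin_expect(hit.outcoming, 0))
            particles[i].explosion();
        else
            particles[i].update_position();
    }
}
///////////////////////////////////////////////////////////////////////////////////////////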
What if we abstracted our multi-core CPU as if it were something very similar to a GPU/streaming machine, and designed a language to handle collision 'shaders'?
(note: this would not be much different from a GPU with some special support for ray-tracing queries...)
We might have a collision shader like this:
///////////////////////////////////////////////////////////////////////////////////////////
common_particle_update( IN vec3 particlePosition, IN vec3 particleSpeed, UNIFORM float deltaTime, UNIFORM polysoup roomDatabase, ...)
{
    // compute next frame particle position
    vec3 nextPosition = particlePosition + particleSpeed*deltaTime
    // check if this particle is going to hit something..
    intersection check = trace(particlePosition, nextPosition, roomDatabase)
    // create a new explosion particle if there is a hit, otherwise update the particle position
    if (check.outcoming == true) then create(check.intersection_data, particle_explosion)
    else particlePosition = nextPosition
}
///////////////////////////////////////////////////////////////////////////////////////////
Well... this pseudocode is quite similar to the first example, but it's handled in a completely different way.
The shader is evaluated on all particles (but even if we have a one-million-particle system, this doesn't mean we have to work on batches that big!). It starts with a simple computation, then it invokes a ray-tracing test, and here we get to the interesting part.
Instead of immediately computing whether our particle is going to hit some object, we just add a ray-trace request to a ray-trace manager and save the current thread state in memory; then we do the same work, in lockstep, for all the particles in our batch.
We're not interpreting anything... our shader is compiled to a piece of native code (it could run on an SPE...) that runs at full throttle; it just adds ray-tracing queries and does a few calculations, nothing more than that.
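A very rough C++ sketch of what this first phase could compile down to (all names are hypothetical, nothing from a real SDK): the kernel does its few calculations, records a ray query plus whatever the second phase will need, and immediately moves on to the next particle in the batch.
///////////////////////////////////////////////////////////////////////////////////////////
// Hypothetical sketch of phase 1 of the collision shader: instead of tracing
// immediately, it just records a ray query and the data phase 2 will need.
#include <cstddef>
#include <vector>

struct Vec3 { float x, y, z; };
inline Vec3 operator+(Vec3 a, Vec3 b) { return {a.x + b.x, a.y + b.y, a.z + b.z}; }
inline Vec3 operator*(Vec3 a, float s) { return {a.x * s, a.y * s, a.z * s}; }

struct RayQuery {
    Vec3 origin;
    Vec3 target;
    std::size_t particle_index;   // lets phase 2 find its saved state again
};

struct ParticleState {
    Vec3 position;
    Vec3 speed;
    Vec3 next_position;           // the "saved thread state" for the second phase
};

// Phase 1: runs in lockstep over a whole batch, emits queries, does no tracing.
void collision_shader_phase1(std::vector<ParticleState>& batch,
                             float delta_time,
                             std::vector<RayQuery>& out_queries)
{
    out_queries.clear();
    out_queries.reserve(batch.size());
    for (std::size_t i = 0; i < batch.size(); ++i) {
        ParticleState& p = batch[i];
        p.next_position = p.position + p.speed * delta_time;      // simple ALU work
        out_queries.push_back({p.position, p.next_position, i});  // defer the trace
    }
}
///////////////////////////////////////////////////////////////////////////////////////////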
When all the queries for a given particle batch have been collected, a ray-tracing manager (which could run on the PPE...) takes care of them, spatially sorting them into buckets and assigning different buckets to different processor cores. Each core runs intersection tests for a group of rays against a triangle soup: every triangle is loaded from memory and checked against all the grouped rays, and the results are written to memory.
Since we know batch sizes in advance, we can reserve enough memory to schedule the next batch's data load ahead of time and fill memory while we are evaluating the first batch.
While some cores perform ray intersections, another core starts executing another particle batch in lockstep, thus generating further ray-intersection queries.
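Here's a hypothetical sketch of what the ray-trace manager's job could look like (again, made-up names, and the actual intersection test and spatial hashing are left as placeholders): queries are sorted into buckets, each bucket can go to a different core, and within a bucket every triangle is streamed past all the grouped rays so memory access stays sequential.
///////////////////////////////////////////////////////////////////////////////////////////
// Hypothetical sketch of the ray-trace manager.
#include <cstddef>
#include <cstdint>
#include <vector>

struct Vec3 { float x, y, z; };
struct Triangle { Vec3 a, b, c; };
struct RayQuery { Vec3 origin, target; std::size_t particle_index; };
struct RayResult { std::size_t particle_index; bool hit; Vec3 hit_point; };

// Placeholder: map a query to a coarse spatial cell (e.g. a grid over the room).
std::uint32_t bucket_of(const RayQuery& q);
// Placeholder: exact segment/triangle intersection test.
bool intersect(const RayQuery& q, const Triangle& t, Vec3& hit_point);

// Sort queries into buckets; different buckets can then go to different cores.
std::vector<std::vector<RayQuery>> sort_into_buckets(const std::vector<RayQuery>& queries,
                                                     std::uint32_t bucket_count)
{
    std::vector<std::vector<RayQuery>> buckets(bucket_count);
    for (const RayQuery& q : queries)
        buckets[bucket_of(q) % bucket_count].push_back(q);
    return buckets;
}

// Test one bucket's rays against the triangle soup by streaming each triangle
// once past all the grouped rays: sequential memory access, no pointer chasing.
void trace_bucket(const std::vector<RayQuery>& rays,
                  const std::vector<Triangle>& triangles,
                  std::vector<RayResult>& results)
{
    results.clear();
    for (const RayQuery& q : rays)
        results.push_back({q.particle_index, false, {}});

    for (const Triangle& tri : triangles) {
        for (std::size_t i = 0; i < rays.size(); ++i) {
            Vec3 p;
            if (!results[i].hit && intersect(rays[i], tri, p)) {
                results[i].hit = true;
                results[i].hit_point = p;
            }
        }
    }
}
///////////////////////////////////////////////////////////////////////////////////////////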
When all the first batch's queries are completed, we can evaluate the second part of the shader (IMHO it's very similar to what ATI did with their pixel shader phases...).
The ray-trace manager has collected the intersection test outcomes for each bucket into two different lists: one is filled with non-intersecting rays, the other with intersecting rays.
This way we can run two different pieces of code, one for each list, without needing any kind of branch prediction or hinting; we have multipassed everything.
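To close the loop, a hypothetical sketch of phase 2 (spawn_explosion and the structs are placeholders, not real APIs): the manager hands us the hit list and the miss list already separated, so each list is processed by a straight-line loop with no data-dependent branch inside it.
///////////////////////////////////////////////////////////////////////////////////////////
// Hypothetical sketch of phase 2: two passes, one per result list, no branching.
#include <cstddef>
#include <vector>

struct Vec3 { float x, y, z; };
struct ParticleState { Vec3 position; Vec3 speed; Vec3 next_position; };
struct RayResult { std::size_t particle_index; bool hit; Vec3 hit_point; };

// Placeholder for whatever "create(check.intersection_data, particle_explosion)"
// does in the pseudocode: spawn an explosion effect at the hit point.
void spawn_explosion(const Vec3& where);

void collision_shader_phase2(std::vector<ParticleState>& batch,
                             const std::vector<RayResult>& hits,
                             const std::vector<RayResult>& misses)
{
    // Pass 1: every particle in this list hit something -> spawn explosions.
    for (const RayResult& r : hits)
        spawn_explosion(r.hit_point);

    // Pass 2: every particle in this list hit nothing -> commit the new position.
    for (const RayResult& r : misses)
        batch[r.particle_index].position = batch[r.particle_index].next_position;
}
///////////////////////////////////////////////////////////////////////////////////////////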
In the end we do the same work as in the first example but with greater efficiency: avoiding branch mispredictions, reducing random memory accesses, and automatically prefetching everything without the need to explicitly add prefetching code.
Moreover, the shader syntax explicitly avoids interdependencies, so our code parallelizes quite well and can scale nicely if more computing cores are available.
I want to make clear that this is just a simple example, and that the same infrastructure could be reused for a lot of game code (many AI scripts and collision systems are based on the same "check a database, take an action" model, and entity interdependencies could be handled via shader multipassing).
What thrills me most is that even a programmer without deep knowledge of CPU architectural minutiae could produce very fast code in a short timeframe, just like an inexperienced 3D engine programmer can write a dumb shader that transforms a gazillion vertices per second!
Well... even if you haven't understood a word, now you know what happens when you're supposed to be at the beach but it rains all the time.