Recently there has been a lot of talk about the XeCPU and CELL; prominent developers have expressed their thoughts, praise and criticism.
I don't want to trigger a new debate about in-order multi-core processors versus the rest of the CPU world, because I think most of us agree that these CPUs pose a whole new set of problems to solve.
What I would like to say is simply that problems and limits are often a means to spur creativity, and IMHO game developers should take this shift as an unexpected opportunity.
Developers face 3 problems:
1) What do we have to do to extract good performance from these new processors?
2) What are we going to do with all this computational power?
3) How can we solve all these problems and make good games in a tight timeframe?
In the past generation, problem 1 was often solved by giving developers some guidelines and tools, and most of the time a developer could simply forget what was going on under the hood (the PS2 is the exception that proves the rule...).
Once you solve problem 1, problems 2 and 3 are easy: just make the same stuff you were making last generation, but remember to multiply everything by a number greater than one (better if the multiplier is greater than 2 or 3).
Is there something wrong with this? No, IMO there isn't!
It's a common process, and I'm pretty confident more than 90% of first-generation XBOX360/PS3 titles have already taken this route.
But is this satisfying? No, it's not... there's a lot more work to do!
There should be a lot more innovation injected from both technical and gameplay standpoints... but we know producing efficient, fast code will now be a much harder task than before, and I'm sceptical that hordes of programmers who barely grasp CPU architecture will suddenly master the inner secrets of cache memories.
I have often argued in the past that I'm not that worried by these new CPUs, because I'm confident a lot of algorithms can be reimplemented to better fit the new architectures; nonetheless this can be a painful task, and it can take a lot of time too.
So how can we make this easier?
I don't think there is one clearly superior solution, but I believe we should try to work around these problems by abstracting the hardware the same way GPU manufacturers did with their programmable hardware: giving us programming languages that inherently hide everything the hardware can't do well or can't do at all.
Now, the XeCPU and CELL are much more flexible than current GPU architectures, but we can give up some of that flexibility and pretend this new hardware is far more limited; then we can design (or borrow...) a new language that fits this 'virtual' architecture (we wouldn't need to write a new compiler/linker toolchain; a preprocessor that translates our code to C/C++, or even ASM, would be enough).
This virtual architecture/programming model would let us assume a lot of things we usually can't assume, and would constrain even 'unaware' coders to design and shape their code to fit the new model.
They would have to worry much less about cache sizes, memory latencies, loop unrolling and the like; moreover, these self-imposed limits could make thread/data synchronization and management much simpler.
Obviously I'm thinking about streaming processors, and about how programmers can be pushed to adhere to a streaming programming style even if they have a C++ compiler in their hands and could potentially write any kind of code with it.
I'm aware this approach would be far from a panacea, but it could be a way to improve (any) programmer's productivity and code efficiency.
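Just to make this a bit more concrete before the example, here's a rough C++ sketch of what such a restricted streaming interface could look like once the preprocessor has done its job. Everything here is hypothetical (StreamKernel, run_stream and so on are made-up names, not any real API); the point is only that the coder writes a pure per-element kernel, and the runtime is free to decide batch sizes, prefetching and how batches are spread across cores.
///////////////////////////////////////////////////////////////////////////////////////////
// Hypothetical sketch of a restricted streaming programming model.
// The user only writes a stateless per-element kernel; the runtime decides
// how elements are batched, prefetched and distributed across cores.
#include <algorithm>
#include <cstddef>
#include <vector>

template <typename In, typename Out, typename Uniforms>
struct StreamKernel {
    // Pure function of one input element plus read-only uniforms:
    // no global writes, no pointer chasing, no cross-element dependencies.
    virtual Out run(const In& element, const Uniforms& uniforms) const = 0;
    virtual ~StreamKernel() = default;
};

// The runtime owns batching: it can split the input into cache/local-store
// sized chunks, stream in the next chunk while the current one is processed,
// and farm chunks out to however many cores are available.
template <typename In, typename Out, typename Uniforms>
std::vector<Out> run_stream(const StreamKernel<In, Out, Uniforms>& kernel,
                            const std::vector<In>& input,
                            const Uniforms& uniforms,
                            std::size_t batch_size = 1024)
{
    std::vector<Out> output;
    output.reserve(input.size());
    for (std::size_t start = 0; start < input.size(); start += batch_size) {
        const std::size_t end = std::min(start + batch_size, input.size());
        // In a real implementation this inner loop would run on a worker
        // core (e.g. an SPE) while the next batch is being streamed in.
        for (std::size_t i = start; i < end; ++i)
            output.push_back(kernel.run(input[i], uniforms));
    }
    return output;
}
///////////////////////////////////////////////////////////////////////////////////////////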
I know it's a long post, but I want to give at least one very simple example.
Since we are going to work with next-gen machines, we would like to impress our customers or publishers with something cool.
Let's say we have one million particles floating around some complicated geometry, like a room full of objects of different shapes, scales and positions.
We would like to check whether each particle is going to collide with some object next frame, and take some action accordingly (the particle could die, bounce, split, and so on...).
Many programmers would code something like this:
///////////////////////////////////////////////////////////////////////////////////////////
for each particle_in_the_room
{
    // this function checks if a particle will hit some object in our room
    collision_check = query_room_database(current_particle)
    if (collision_check.outcoming == true) then current_particle.explosion()
    else current_particle.update_position()
}
///////////////////////////////////////////////////////////////////////////////////////////
We have a loop with a lot of memory accesses (we have to load the current particle's data and search our room database) whose latency is not hidden at all (and we're not prefetching either...). Moreover, we have a conditional branch in our loop that probably behaves completely randomly; this is bad for branch prediction and even for branch hinting, since we have no code to execute between the hint and the branch, so there's no way to hide the branch misprediction penalty.
This piece of code is guaranteed to run very slowly on next-gen CPUs.
(and to be fair, I don't think it would run blazingly fast on out-of-order CPUs either...)
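For comparison, this is roughly the kind of hand-tuning the naive loop needs on a conventional toolchain. I'm sketching it with the GCC builtins (__builtin_prefetch and __builtin_expect); console compilers expose similar but different intrinsics, and all the types here are just stand-ins for the pseudocode above. It's exactly the sort of per-loop fiddling most programmers won't do, and it doesn't help much anyway when the branch is truly random.
///////////////////////////////////////////////////////////////////////////////////////////
#include <cstddef>

// Minimal stand-ins for the pseudocode's types (hypothetical, illustration only).
struct CollisionResult { bool outcoming; };
struct RoomDatabase;   // opaque spatial database of the room's geometry
struct Particle {
    void explosion();
    void update_position();
};
// Placeholder: the engine's room query from the example above.
CollisionResult query_room_database(const Particle& p, const RoomDatabase& room);

void update_particles(Particle* particles, std::size_t count, const RoomDatabase& room)
{
    for (std::size_t i = 0; i < count; ++i) {
        // Try to hide a little memory latency by prefetching the next particle
        // while we work on the current one.
        if (i + 1 < count)
            __builtin_prefetch(&particles[i + 1]);

        const CollisionResult hit = query_room_database(particles[i], room);

        // Hint that collisions are rare; this only helps if the branch really
        // is biased, which for randomly moving particles it usually is not.
        if (__builtin_expect(hit.outcoming, 0))
            particles[i].explosion();
        else
            particles[i].update_position();
    }
}
///////////////////////////////////////////////////////////////////////////////////////////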
What if we abstracted our multi-core CPU as if it were something very similar to a GPU/streaming machine, and designed a language to handle collision 'shaders'?
(note: this would not be much different from a GPU with some special support for ray-tracing queries...)
We might have a collision shader like this:
///////////////////////////////////////////////////////////////////////////////////////////
common_particle_update( IN vec3 particlePosition, IN vec3 particleSpeed, UNIFORM float deltaTime, UNIFORM polysoup roomDatabase, ...)
{
    // compute next frame particle position
    vec3 nextPosition = particlePosition + particleSpeed*deltaTime
    // check if this particle is going to hit something..
    intersection check = trace(particlePosition, nextPosition, roomDatabase)
    // create a new explosion particle if there is a hit, otherwise update the particle position
    if (check.outcoming == true) then create(check.intersection_data, particle_explosion)
    else particlePosition = nextPosition
}
///////////////////////////////////////////////////////////////////////////////////////////
Well... this pseudocode is quite similar to the first example, but it's handled in a completely different way.
The shader is evaluated on all particles (but even if we have a one-million-particle system, this doesn't mean we have to work on batches that big!). It starts with a simple computation, then it invokes a ray-tracing test, and here we get to the interesting part.
Instead of immediately computing whether our particle is going to hit some object, we just add a ray-trace request to a ray-trace manager and save the current thread state in memory; then we do the same work, in lockstep, for all the particles in our batch.
We're not interpreting anything... our shader is compiled to a piece of native code (it could run on an SPE...) that runs at full throttle; it just adds ray-tracing queries and does a few calculations, nothing more than that.
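A very rough C++ sketch of what this first phase could compile down to (all names are hypothetical, nothing from a real SDK): the kernel does its few calculations, records a ray query plus whatever the second phase will need, and immediately moves on to the next particle in the batch.
///////////////////////////////////////////////////////////////////////////////////////////
// Hypothetical sketch of phase 1 of the collision shader: instead of tracing
// immediately, it just records a ray query and the data phase 2 will need.
#include <cstddef>
#include <vector>

struct Vec3 { float x, y, z; };
inline Vec3 operator+(Vec3 a, Vec3 b) { return {a.x + b.x, a.y + b.y, a.z + b.z}; }
inline Vec3 operator*(Vec3 a, float s) { return {a.x * s, a.y * s, a.z * s}; }

struct RayQuery {
    Vec3 origin;
    Vec3 target;
    std::size_t particle_index;   // lets phase 2 find its saved state again
};

struct ParticleState {
    Vec3 position;
    Vec3 speed;
    Vec3 next_position;           // the "saved thread state" for the second phase
};

// Phase 1: runs in lockstep over a whole batch, emits queries, does no tracing.
void collision_shader_phase1(std::vector<ParticleState>& batch,
                             float delta_time,
                             std::vector<RayQuery>& out_queries)
{
    out_queries.clear();
    out_queries.reserve(batch.size());
    for (std::size_t i = 0; i < batch.size(); ++i) {
        ParticleState& p = batch[i];
        p.next_position = p.position + p.speed * delta_time;      // simple ALU work
        out_queries.push_back({p.position, p.next_position, i});  // defer the trace
    }
}
///////////////////////////////////////////////////////////////////////////////////////////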
When all the queries for a given particle batch have been collected, a ray-tracing manager (which could run on the PPE...) takes care of them, spatially sorting them into buckets and assigning different buckets to different processor cores. Each core runs intersection tests for a group of rays against a triangle soup: every triangle is loaded from memory and checked against all the grouped rays, and the results are written to memory.
Since we know batch sizes in advance, we can reserve enough memory to schedule the next batch's data load ahead of time and fill memory while we are evaluating the first batch.
While some cores perform ray intersections, another core starts executing another particle batch in lockstep, thus generating further ray-intersection queries.
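Here's a hypothetical sketch of what the ray-trace manager's job could look like (again, made-up names, and the actual intersection test and spatial hashing are left as placeholders): queries are sorted into buckets, each bucket can go to a different core, and within a bucket every triangle is streamed past all the grouped rays so memory access stays sequential.
///////////////////////////////////////////////////////////////////////////////////////////
// Hypothetical sketch of the ray-trace manager.
#include <cstddef>
#include <cstdint>
#include <vector>

struct Vec3 { float x, y, z; };
struct Triangle { Vec3 a, b, c; };
struct RayQuery { Vec3 origin, target; std::size_t particle_index; };
struct RayResult { std::size_t particle_index; bool hit; Vec3 hit_point; };

// Placeholder: map a query to a coarse spatial cell (e.g. a grid over the room).
std::uint32_t bucket_of(const RayQuery& q);
// Placeholder: exact segment/triangle intersection test.
bool intersect(const RayQuery& q, const Triangle& t, Vec3& hit_point);

// Sort queries into buckets; different buckets can then go to different cores.
std::vector<std::vector<RayQuery>> sort_into_buckets(const std::vector<RayQuery>& queries,
                                                     std::uint32_t bucket_count)
{
    std::vector<std::vector<RayQuery>> buckets(bucket_count);
    for (const RayQuery& q : queries)
        buckets[bucket_of(q) % bucket_count].push_back(q);
    return buckets;
}

// Test one bucket's rays against the triangle soup by streaming each triangle
// once past all the grouped rays: sequential memory access, no pointer chasing.
void trace_bucket(const std::vector<RayQuery>& rays,
                  const std::vector<Triangle>& triangles,
                  std::vector<RayResult>& results)
{
    results.clear();
    for (const RayQuery& q : rays)
        results.push_back({q.particle_index, false, {}});

    for (const Triangle& tri : triangles) {
        for (std::size_t i = 0; i < rays.size(); ++i) {
            Vec3 p;
            if (!results[i].hit && intersect(rays[i], tri, p)) {
                results[i].hit = true;
                results[i].hit_point = p;
            }
        }
    }
}
///////////////////////////////////////////////////////////////////////////////////////////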
When all the first batch's queries are completed, we can evaluate the second part of the shader (IMHO it's very similar to what ATI did with their pixel shader phases...).
The ray-trace manager has collected the intersection test outcomes for each bucket into two different lists: one is filled with non-intersecting rays, the other with intersecting rays.
This way we can run two different pieces of code, one for each list, without needing any kind of branch prediction or hinting; we have multipassed everything.
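To close the loop, a hypothetical sketch of phase 2 (spawn_explosion and the structs are placeholders, not real APIs): the manager hands us the hit list and the miss list already separated, so each list is processed by a straight-line loop with no data-dependent branch inside it.
///////////////////////////////////////////////////////////////////////////////////////////
// Hypothetical sketch of phase 2: two passes, one per result list, no branching.
#include <cstddef>
#include <vector>

struct Vec3 { float x, y, z; };
struct ParticleState { Vec3 position; Vec3 speed; Vec3 next_position; };
struct RayResult { std::size_t particle_index; bool hit; Vec3 hit_point; };

// Placeholder for whatever "create(check.intersection_data, particle_explosion)"
// does in the pseudocode: spawn an explosion effect at the hit point.
void spawn_explosion(const Vec3& where);

void collision_shader_phase2(std::vector<ParticleState>& batch,
                             const std::vector<RayResult>& hits,
                             const std::vector<RayResult>& misses)
{
    // Pass 1: every particle in this list hit something -> spawn explosions.
    for (const RayResult& r : hits)
        spawn_explosion(r.hit_point);

    // Pass 2: every particle in this list hit nothing -> commit the new position.
    for (const RayResult& r : misses)
        batch[r.particle_index].position = batch[r.particle_index].next_position;
}
///////////////////////////////////////////////////////////////////////////////////////////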
In the end we do the same work as in the first example but with greater efficiency: avoiding branch mispredictions, reducing random memory accesses, and automatically prefetching everything without the need to explicitly add prefetching code.
Moreover, the shader syntax explicitly avoids interdependencies, so our code parallelizes quite well and can scale nicely if more computing cores are available.
I want to make clear that this is just a simple example, and that the same infrastructure could be reused for a lot of game code (many AI scripts and collision systems are based on the same "check a database, take an action" model, and entity interdependencies could be handled via shader multipassing).
What thrills me most is that even a programmer without deep knowledge of CPU architectural minutiae could produce very fast code in a short timeframe, just like an inexperienced 3D engine programmer can write a dumb shader that transforms a gazillion vertices per second!
Well... even if you haven't understood a word, now you know what happens when you're supposed to be at the beach but it rains all the time.