If this is your first visit, be sure to check out the FAQ by clicking the link above. You may have to register before you can post: click the register link above to proceed. To start viewing messages, select the forum that you want to visit from the selection below.
![]() |
|
|
#1 |
|
Nutella Nutellae
Join Date: Feb 2002
Location: San Francisco
Posts: 4,308
|
Recently there was a lot of talking about XeCPU and CELL, prominent developers have expressed their thoughts, praises and critics.
I don't want to trigger a new debate about in order multi core processors vs rest of the CPUs world, because I think most of us agree that these CPUs pose a whole new set of problems to solve. What I would like to say is simply that often problems and limits are a mean to augment creativity, and imho game developers should take this shift as an unexpected opportunity. Developers face 3 problems: 1) What we have to do in order to extract good performance from these new processors? 2) What are we going to do with all this computational power? 3) How can we solve all these problems and make good games in a tight timeframe? In the past generation problem 1 was often solved giving developers some guidelines and tools and most of the time a developer could just forgot what was going on under the hoods (PS2 is the exception that proves the rule...) Once you solve problem 1, problems 2&3 are easy to solve: just make the same stuff you was doing the past generation but remember to multiply everything by a greater than one number (better if the mulitplier is greater than 2 or 3) Is there sometihng wrong with this? No, IMO there isn't! It's a common process and I'm pretty confident >90% XBOX360/PS3 first gen titles have already taken this route. But is is this satisfying? No, it's not..there's a lot more work to do! There should be a lot more innovation to inject from either technical and gameplay standpoints..but we know producing efficient and fast code will be now a much harder task than before and I'm sceptical about programmer hordes whom barely grasp CPUs architectures suddenly master cache memories inner secrets. I often argued in the past that I'm not that worried by these new CPUs cause I'm confident a lot of algorithms can be reimplemented to better fit these new architectures, but nonetheless this can be a painfull task to do and it can take a lot of time too. So how can we help it? I think there isn't a clear superior solution but I believe we should try to work around these problems abstracting the hw the same way GPU manufacters did with their programmable hardware giving us programming languages that inherently hide everything the hw can't do well or can't do at all. Now XeCPU and CELL are much more flexible than current GPU architectures, but what we can do is to renounce to some flexibility and pretend this new hw is way more limited, then we can design (or borrow..) a new language that embrace/fits this new 'virtual' architecture (we wouldn't need to write a new compiler/linker toolchain, a preprocessor that translate our code to C/C++, or even ASM, would be enough). This virtual architecture/programming model would let us assume a lot of stuff that usually we can't assume and would constrain even 'unaware' coders to accordingly design/shape their code to fit the new model. They should worry much less about cache sizes, memory latencies, loop unrolling and stuff like that, moreover these selfimposed new limits may let us make threads/data syncronization and management much more simple. Obviously I'm thinking about streaming processors and how programmers can be forced to adhere to a streaming programming style even if they have a C++ compiler in their hands and they can potentially write any kind of code with it. I'm aware this approach would be far from being a panacea, but it could be a way to improve (any) programmer productivity and code efficiency. I know it's a long post but I want to make at least a very simple example. Since we are going to work with next gen machines we would like to impress our customers or publishers with something cool. Let say we have one milion particles floating around some complicated geometry like a room full of objects of different shape, scale and position. We would like to check if every particle is going to collide next frame with some object and accordingly take some action (the particle could die, bounce, split, and so on..) Many programmers would code something like this: /////////////////////////////////////////////////////////////////////////////////////////// for each particle_in_the_room { // this function checks if a particle will hit some object in our room} /////////////////////////////////////////////////////////////////////////////////////////// We have a loop with a lot of memory accesses (we have to load current particle data and search into our room database) that are not hidden at all (and we're not prefetching too..), moreover we have a conditional branch in our loop that probably has a completely random behaviour; this is bad for branch prediction and even for branch hinting since we have no code to execute between the hint and the branch so there's no way to hidden branch misprediction penalty). This piece of code is guaranteed to run very slow on next gen CPUs. (and to be fair I don't think it could run blazing fast on out of order CPUs too...) What if we'd abstract our multicores CPU as it was something very similar to a GPU/streaming machine and we design a language to handle collision 'shaders'. (note, this would not be different from a GPU with some special support for ray tracing queries..) We might have a collision shader like this: /////////////////////////////////////////////////////////////////////////////////////////// common_particle_update( IN vec3 particlePosition, IN vec3 particleSpeed, UNIFORM float deltaTime, UNIFORM polysoup roomDatabase....) { // compute next frame particle position // check if this particle is going to hit something..} /////////////////////////////////////////////////////////////////////////////////////////// Well..this pseudocode is quite similar to the first example code but it's handled in a completely different way. The shader is evaluated on all particles (but even if we have a one milion particles system this doesn' mean we have to work on so big batches!!), it starts with a simple computation, than it invokes a ray tracing test and here we have the interesting stuff. Instead of trying to compute if our particle is going to hit some object we just add our raytrace request to a raytrace manager and we save current thread state in memory, then we do in lockstep for all the particles in out batch the same work. We're not interpretating anything..our shader was compiled to a piece of native code (it could run on a SPE..) thar run full throttle and it just add raytracing queries and does a few calculations, nothing more than that. When all the queries for a given particles batch are collected a ray tracing manager (that could run on a PPE..) takes care of all these queries spatially sorting them in buckets and assigning different buckets to different processor cores, each core runs a intersection test on a group of rays against a triangles soup, every triangle is loaded from memory and checked against all the grouped rays, results are wrote into memory. Since we can know in advance batches size we can reserve enough memory to schedule ahead next batch data load and to fill memory whilst we are evaluting the first batch. While some cores performs rays intersections another core start to execute in lockstep another particles batch, thus filling other ray intesection queries. When all first batch queries are completed we can evaluate the second part of the shader (imho it's very similar to what ATI did with their pixel shaders phases..). Raytrace manager has collected intersection tests outcoming for each bucket in 2 different lists, one is filled up with non intersecting rays, the other one with intersecting rays. This way we can run two different pieces of code, one for each list, without the need to perform any kinf of branch prediction or hinting, we have multipassed everything. In the end we can do the same work as it was shown in the first example but with greater efficiency, avoiding branches misprediction, reducing random accesses to memory, automatically prefetching everything without the need to explicitely add prefetching code. Moreover the shader syntax explictely avoid interdependecy so our code parallelize quite well and it can scale nicely if more computing cores are available. I want to make clear that this is just a simple example and that the same infrastructure could be reused for a lot of game code (many AI scripts and collision systems are based on the same "check a database - take an action" model, and entities interdependency could be handled via shaders multipass) What thrills me most is that even a programmer who has not a deep knowledge of CPUs architectural minutia could produce very fast code in a small timeframe, like a unxperienced 3D engine programmer can write a dumb shader that transforms a gazilion of vertices per second! Well...even if you haven't understood a word now you know what happens when you're supposed to be on the beach but it rains all the time
__________________
[twitter] More samples, we need more samples! [Dean Calver] First they ignore you, then they laugh at you, then they fight you, then you win. [Mahatma Gandhi] The opinions expressed herein are my own personal opinions and do not represent my employer's view in any way |
|
|
|
|
|
#2 |
|
Friends call me xbd
Join Date: Feb 2005
Posts: 6,309
|
nAo - I'm glad it's raining. Great post!
I do hope the weather improves for you soon though.
__________________
Somebody set up us the bomb. |
|
|
|
|
|
#3 |
|
Member
|
I am an amateur programmer (my only knowledge is C and a little of C++) and I have to say that when you out of the programmer mind you prefer the easy to program method but when you are programmer you prefer the flexibility because is less restrictive.
|
|
|
|
|
|
#4 |
|
Moderator
Join Date: Feb 2002
Location: Redmond, WA
Posts: 3,332
|
I've given this a fair amouint of thought as well.
I'v even looked at languages that are more parallel friendly, the conclusion I came to was that maintaining any reasonable frontend or compiler was impractical as a developer. I'd imagined something along the lines of CSP with C syntax for the actual work code. What this doesn't really solve is the code size issue on PS3, there are some issues in the current libraries that mean you want to do as much work in a single code upload as possible. The more I think about it the more I think the solution is to have a very small core of engineers build the core systems, and build job friendly models for simulation and physics. Then let the bulk of the programmers write code that fits in the model. In someways this favors smaller developers since they tend not to try and solve problems with engineering armies. The problem I see with this is it doesn't stop an engineer adding data dependencies and the engineers still have to consider an asynchronous processing model. I don't believe it's possible to completly hide this and still maintain any sort of real efficiency. But I think with education and good API design you can stop people doing most of the stupid stuff. example Handle h = RequestCollisionTest(whatever arguments); Do stufff WaitResult(h, result) Callbacks are the other obvious way to process the result, although in practice I find them less usefull. Load balancing multiple requests is trivial, but you still have the danger of stalling on the result if you don't make the request early enough. IMO the naming here is what's important, staying away from things like IsCollided, is a reminder to the programmer that the result in not instantaneous. I think your going to want to stay away from fine grained parallelism on Cell, the deffered functions are going to have to do significant work to make offloading them to the SPU's a win. |
|
|
|
|
|
#5 |
|
Senior Member
|
ERP:
Is there any particular reason you dislike callbacks? At my job we use them fairly often, though we are probably trying to solve different problems than you are. What other methods do you see becoming popular to deal with asynchronous processor models? nAo: I'm trying to understand the advantage of the second piece of code over the first. Is it an algorithmic advantage (in that the number of operations have decreased), or that the second procedure is simply better at hiding hardware latency? Just so that I have a better idea what is going on (bear with me here!) in the first example you simply iterate over the list of particles, and in the query_room_database call you would iterate over all other particles and compute the future positions of both to see if they intersect. As soon as one of them intersects, you would mark the original particle as intersecting. In the second example it seems that you are storing rays for the particles, doing intersection checks on them, and storing the results for all rays? I guess this gets back to my original question of whether or not this is algorithmically faster, or is just a way to hide latencies? Thanks, Nite_Hawk |
|
|
|
|
|
#6 | |||
|
Nutella Nutellae
Join Date: Feb 2002
Location: San Francisco
Posts: 4,308
|
Quote:
Quote:
Quote:
__________________
[twitter] More samples, we need more samples! [Dean Calver] First they ignore you, then they laugh at you, then they fight you, then you win. [Mahatma Gandhi] The opinions expressed herein are my own personal opinions and do not represent my employer's view in any way |
|||
|
|
|
|
|
#7 |
|
Member
Join Date: Aug 2005
Posts: 570
|
Very interesting post nAo. simply maker of a way to follow perhaps, therefore we are in the side of a new universe to know themselves and to carry through.
Last edited by Heinrich4; 16-Aug-2005 at 21:58. |
|
|
|
|
|
#8 | |
|
Senior Member
|
Quote:
I'm just trying to understand the subtleties between the two approaches. Thanks! Nite_Hawk |
|
|
|
|
|
|
#9 | |||
|
Nutella Nutellae
Join Date: Feb 2002
Location: San Francisco
Posts: 4,308
|
Quote:
Quote:
Quote:
It's not clear if there's a winning approach in the general case, even if there are some fast triangle-ray intersections algorithms that need to compute some per triangle quantities that can be reused for many rays. A ray data structure is also likely consume less memory space than a triangle data structure (less data to load...), moreover a SIMD architecture would be good at performing 4 ray tests at the same time, so I believe that if the number of rays (per bucket) is not much bigger than the number of triangles it would be better to iterate rays over triangles this way: //////////////////////////////////////////////////////////////////////// for each triangle { for each ray} //////////////////////////////////////////////////////////////////////// An improved implementation could check if rays-triangles ratio is bigger than a threshold value (that could be found with some profiling
__________________
[twitter] More samples, we need more samples! [Dean Calver] First they ignore you, then they laugh at you, then they fight you, then you win. [Mahatma Gandhi] The opinions expressed herein are my own personal opinions and do not represent my employer's view in any way |
|||
|
|
|
|
|
#10 | |
|
Moderator
Join Date: Feb 2002
Location: Redmond, WA
Posts: 3,332
|
Quote:
My preferred synchronisation primitive is the message, it's simple and obvious. Ideally you try to construct the code so that the "waitmessage" portion is always just getting the result and never actually waiting. If it has to wait, it sleeps the thread and you fill up the cycles with some coarse level parallel task. But my preference is probably more based on what I've used historically yo solve the problem than the "best" solution. |
|
|
|
|
|
|
#11 | |
|
Senior Member
|
Quote:
In the second approach, as far as I was able to read at this time at night lol, the ray-tracing test is made against simple boundary boxes, perhaps just a against a bounding volume that contains the part of the scene you care (it depends on you, you are the programmer Yes, dividing the big interesction test in passes can be done with the other approach too, it is normal in the theory of optimized ray-tracing, but I just wanted to point out that this first intersection was very likely NOT the big block you might expect from just seeing the pseudo-code. The branch can still be mispredicted and can cause a pipeline flush, but after that instead of trying to do tons of work, we just add a small job to our Ray-tracing Manager as nAo put it. Edit: He is intending to avoid the branch all-together of course me iDot... if-conversion as they call it... changing the problem from a control dependency to a data dependency (so yes, he does not do branching ther either). Branching or not branching here is not even the big part from where he is gaining performance from. The branch is there for pseudo-code sake, but in the real code he would just have a small stall on the value of the data the first function calculates and then he would resume working as he would use the data as index of a structure like an array to pick what to do next. In the first approach, for each particle you had to do the whole process.... at the end the particle had fully exploded or it had transitioned to the next position in space. In the second approach we try to reduce the per-particle custom work to a minimum, just so we can sort out the particles that we are interested in for collision detection possibilities from others which will surely not intersect the scene in their path. We batch all the little jobs together: we say... here is a list of jobs to perform between these particles and the scene geometry.The particles will all SHARE a common state: they all need a careful database check between themselves and the scene geometry. We are grouping all the checks we need to do to the database together. We can then sub-divide this groups in smaller ones that better fit SPE's LS constraints and that execute faster (for example in a specific group/batch they all check a certain small region of the 3D space so we avoid loading vertices randomly from the vast scene geometry). Even though each of these "jobs" for the Ray-tracing Manager might be slow to execute, we are executing it for many particles at the same time not serially one particle at a time. If you were following the first approach your path to multi-core work subdivision would be: do the job for each single particle on a different core... or "one thread per particle". The problem is that it would run VERY slow on a CPU like CELL as you would be intersecting a HUGE geometry database with a single particle's motion through the scene on each core and you would be swapping in and out overlays in the SPE's LS like crazy to parse through the entire database to have your check done.
__________________
"Any idea worth a damn is already patented... twice" -Mfa Last edited by Panajev2001a; 16-Aug-2005 at 23:48. |
|
|
|
|
|
|
#12 |
|
Senior Member
|
...
__________________
"Any idea worth a damn is already patented... twice" -Mfa |
|
|
|
|
|
#13 | |
|
Senior Member
Join Date: Mar 2005
Posts: 1,160
|
Quote:
In my work with threading-oriented languages like Cilk, I've just found them much more suitable for asynchronous tasks that do the heavy lifting and you don't really have to worry about race conditions. I see that sort of model making a lot more sense for CELL, but not so much for peer threading architectures like 360. While I like the idea of the whole "worker" threads in that it will be a much quicker and easier way to tap into the raw power, the impact is probably not going to be that big and you're bound to hit a wall sooner. I mean, you can thread off hundreds of collision tests, but it probably won't have as big an impact on your physics performance as parallelizing your solver. When it comes to actually having work to fill up the resources, my preferred method has historically been the job queue. Mainly because it at least gives me some feedback on how much open tasks I have in that, if the queue is ill-filled, that tells me I've got more resources than I'm currently in need of. And likewise, my preferred sync approach relies on message-passing. But like ERP mentioned on his own part, my preferences are probably shaped by what I've worked on before. All my parallel programming work has been academic work involving borrowed time on large clusters. So applying that to the gaming universe is not necessarily the best choice.... though I can see it vaguely working if you had a sort of separate renderer thread that constantly gets packets(jobs) from several various points throughout the codebase. I can't really think of other viable cases off the top of my head just because there are so many cases where you'd just want results so quickly because of dependency on the returned data for further computations.
__________________
Life is veritably the exact opposite of a vacuum cleaner. Vacuums tend to suck less and less as time goes on. |
|
|
|
|
|
|
#14 |
|
Senior Member
|
Hi guys,
I just wanted to say thank you for the great explanations. nAo: Do you know if on current processors it also make sense to use the second approach to take advantage of SIMD being able to processor multiple rays per triangle at once? Also, in the given approaches, do these algorithms allow for an n-body particle system with particle interactions or just particle-triangle interaction? How is data dependency between the different batches solved? Thanks again, Nite_Hawk |
|
|
|
|
|
#15 | |||
|
Nutella Nutellae
Join Date: Feb 2002
Location: San Francisco
Posts: 4,308
|
Quote:
Quote:
Quote:
It would be possible to handle particle-particle interaction but it would require more 'passes' (an a priori undefined number of passes..)
__________________
[twitter] More samples, we need more samples! [Dean Calver] First they ignore you, then they laugh at you, then they fight you, then you win. [Mahatma Gandhi] The opinions expressed herein are my own personal opinions and do not represent my employer's view in any way |
|||
|
|
|
|
|
#16 |
|
Senior Member
Join Date: Mar 2005
Posts: 6,744
|
Good article from an IBM lead engineer, DR. Peter Hofstee. It's a good quick read so enjoy.
IBM explains why the PS3's CELL processor is leading the charge into new territory by OPM Staff 08.17.2005 Orginially publish in OPM #96 September 2005 You know the Cell processor is a big deal. Companies (especially ones as big as IBM, Sony, and Toshiba) don't often get together and come up with proprietary technology with the intent of it being a small-time project. But the processor is a big deal for a few reasons. Not only is it powering the PlayStation 3 hardware, but it's also marking a fundamental shift in hardware design—a shift that could potentially leave companies like Intel scrambling to keep up. We spoke with Dr. Peter Hofstee, a lead engineer from the Cell design team, to get some more information on the Cell processor and how, with the help of the PS3, it will become the next big thing. OPM: How did the Cell project get started? DR. PETER HOFSTEE: It got started with a CEO-level conversation between Sony and IBM. In that conversation, they identified the synergy between IBM and its strength as a technology company, and Sony in the media and content space. They saw a basis for working together, so in the summer of 2000—with Toshiba as a technology partner--representatives from the three companies got together and started hammering out a concept for what became the Cell processor. OPM: Was it always a goal to design a new type of architecture? PH: We looked at a number of alternatives—a rather broad number, some of them being sort of more in line with what you would expect from general trends in computing. I think it was Mr. Kutaragi who said that we should go beyond that and do something that was more specifically suited for media and gaming in this new space that is supposed to support a new vision of interactivity. OPM: How is the architecture of the Cell processor specifically beneficial to gaming? PH: [We had] three main targets. The first was to provide a level of performance in this chip that would be far beyond what you would see in a PC processor today. [For] the second goal, even though the structure of this processor was derived from a server-type expertise that IBM has, we brought in a lot of real-time controls that you wouldn't find in server processors. This is important for gaming. Also, because we were very much thinking of this processor as something that sits between a user and a network, we paid a lot of attention to the security architecture. There's a good bit of innovation in this chip that provides a hardware basis for security and privacy. The last goal: [While] it's important to note that the next-generation game system was the driving force to do this, the vision from the very start was that this would be an architecture more broadly applied. We really defined it to be a standard architecture, something that will hopefully be used for a long time and have many iterations. OPM: How has the education process for developers progressed? Have they been receptive to the idea of what the Cell processor is? PH: We had the benefit of having a software team inside the design center. We had representatives from all three companies working very closely with us. We actually went through a number of iterations on the architecture itself based on the feedback from the software team. We wanted to make sure not only that what we created had impressive peak performance numbers, but also that a reasonable programming effort could get to those numbers. Only in the last few months have we talked about programming models, but in the next few months we will publish more information on the models and the architecture of the chip. Right now, it's sort of a mix of reactions. There are people that have come from embedded backgrounds and are very excited about the architecture. For people that come from a more general-purpose programming background, we're heading toward an adjustment in [hardware]. But it takes programmers about three months to become familiar with it and become very enthusiastic. OPM: How is the multicore design different from just having multiple processors? PH: We felt we had to do something about the phenomenon known as the memory wall. In the last 20 years, microprocessors have gotten faster by a factor of 1,000, but the memory latency, or the amount of time it takes for memory to keep up with the processor, hasn't changed. It's like a bucket brigade with 100 people between you and the water, but because you can only get five buckets going, that immediately tells you that it's going to be inefficient. For Cell, what we did was use a technique [to make the process much more efficient]. OPM: Is there a point of comparison for the power of the Cell processor? PH: If you just look at the eight synergistic processors on the chip and the power core as well, on media applications, each of these processors can outperform a PC processor. We usually see a factor-of-10 performance advantage against PC processors, but if you're doing something with a more traditional workload, then you probably won't see that type of performance improvement. Side bar: CELLPOWER As Dr. Hofstee explains it, "The Cell processor itself has in excess of 200 gigaflops, which is 200 billion operations per second. So, imagine if you had 5 billion people in the world doing 40 calculations a second—that would be pretty amazing |
|
|
|
|
|
#17 |
|
Friends call me xbd
Join Date: Feb 2005
Posts: 6,309
|
Alright, link time.
__________________
Somebody set up us the bomb. |
|
|
|
|
|
#18 |
|
Senior Member
Join Date: Mar 2005
Posts: 6,744
|
Oh my bad. http://1up.com/do/feature?cId=3142805&did=1
|
|
|
|
|
|
#19 |
|
Senior Member
Join Date: Feb 2002
Posts: 2,636
|
not wanting to ditch anybody's efforts and ethusiasm but that interview was 95% maketing talk. 'we needed a much more powerful processor so we created one according to our whim'.. yeah, right, i'm gonna put that on a t-shirt.
|
|
|
|
|
|
#20 |
|
Friends call me xbd
Join Date: Feb 2005
Posts: 6,309
|
True. Actually I think this thread is pretty cool overall, and it probably shouldn't be sullied outside of the present programming models question/answer context.
__________________
Somebody set up us the bomb. |
|
|
|
|
|
#21 |
|
Artist formerly known as Acert93
Join Date: Dec 2004
Location: Seattle
Posts: 7,714
|
Good interview.
IBM is one of the top tech companies in the world. They must be given credit for trying looking at a problem and tackling it heasd on. Obviously IBM has smart people who aim to do the best they can and they believe in their product. It will be interesting to see if Intel and AMD take similar approaches down the road. As the interviewer notes CELL is best at media applications; the question will be when will Intel and AMD get serious about high end accelleration on a level similar to CELL? It may be another 5 years before they can fit a P4/AMD64 chip with SPE like units, but it will be interesting. Obviously IBM has a headstart.
__________________
"In games I don't like, there is no such thing as "tradeoffs," only "downgrades" or "lazy devs" or "bugs" or "design failures." Neither do tradeoffs exist in games I'm a rabid fan of, and just shut up if you're going to point them out." -- fearsomepirate |
|
|
|
|
|
#22 |
|
uber-Troll!
Join Date: Dec 2004
Location: Under my bridge
Posts: 26,466
|
I think the whole 3 companies should be given credit. As one's translated articles exaplined, there were different ideas and compromises. IBM wanted a multi-core PPC, Toshiba a SPE-only device, and Sony negotiated a compromise that seems to bring the best of both worlds. From most talk it sound like IBM practically did everything and Sony+Toshiba just fronted money or something. Though IBM's engineers, outnumbering the other companies, obviously had the main input in circuit design, I think the whole architecture is a team effort that no one company would have developed if they went it alone. I think this is important to appreciate - people can manage better working together and sharing resources and ideas rather than competing.
__________________
Shifty Geezer ... Tolerance for internet moronism is exhausted. Anyone talking about people's attitudes in the Console fora, rather than games and technology, will feel my wrath. Read the FAQ to remind yourself how to behave and avoid unsightly incidents. |
|
|
|
|
|
#23 | |
|
Senior Member
Join Date: Mar 2005
Posts: 6,744
|
Quote:
|
|
|
|
|
|
|
#24 | ||||||
|
Senior Member
Join Date: Feb 2002
Posts: 2,636
|
Quote:
Quote:
Quote:
Quote:
Quote:
Quote:
ps: could a kind moderator, please, delete all the posts on this page starting from after nAo's last post on this page up to this post of mine includingly. as an originally good thread is about to be turned into mindless gibberish. thank you. |
||||||
|
|
|
|
|
#25 | ||
|
Me me me
Join Date: Apr 2002
Posts: 15,367
|
Quote:
Quote:
Major clean up needed, including this post. |
||
|
|
|
![]() |
| Thread Tools | |
| Display Modes | |
|
|