Devs and Programmers: Approaches to Efficiently Maximizing Multi-core Performance?

Acert93

Background:
I have had this question sitting in WordPad for a couple of weeks, waiting until I had time to ask it properly. With the recent remarks from John Carmack I decided to hold off in hopes this would not become a system flame war, so please, let's keep this on topic :D This is a serious question for the programmers and developers on the forum.

Some noted next-gen console developers, some even on the B3D forums, have noted the difficulties of programming for--and maximizing performance of--next-generation console CPUs. The nuts and bolts: getting good performance, let alone anything near peak performance, is difficult in a multi-core environment. Yet the Xbox 360, PS3, and even the PC (Intel Pentium D, AMD X2) are going this route due to process limitations. At this point there really is no other choice. Hate it or like it (some devs have shown some excitement), it is the present reality.

Xenon, Pentium D, and X2 have all gone the route of 2 or 3 symmetric cores (with 4-core CPUs in the near future for PC parts). The PS3, on the other hand, has gone with an asymmetric design--a PPE plus eight SPEs--that can be classified as a "stream processor" and uses techniques similar to GPUs. Because the CELL microprocessor has the most cores I have selected it as the "poster boy" of this thread. What stands true of most of the answers for CELL should apply roughly to processors with fewer cores.

This topic is not aimed at "hyper-threading" or similar techniques where one core runs more than one hardware thread. ERP noted in another thread that his theory is that with 1 core running 2 hardware threads you would want them doing similar tasks (i.e. not lumping AI on "core 0 thread 0" and physics on "core 0 thread 1" on a single processing core). This theory was put forward as a way to avoid dependency issues. Further, it is important to note that a thread != a core, so you won't get double performance. Hyper-threading on Intel chips seems to get a 15% improvement in ideal situations at best (from what I have seen) and frequently no gain, because the goal of hyper-threading is to "back fill" idle time of the primary thread. In-order chips like the PPC cores in the XeCPU and CELL may have more significant gains, but we are limited on that info, so for this discussion let's keep it simple and just discuss cores (unless a developer working on the machines thinks it is relevant, in which case go ahead!)

Focus:
CELL. Why? It has the most processing cores. More cores theoretically means (a) more work getting your application to synchronize and avoid stalls and (b) more ways in which you are required to divide your code up. If a core goes unused and/or underutilized, the peak performance of the processor is significantly reduced: 8 processors running at 50% is no better than 4 processors running at 100%. CELL also has two distinct processing elements, unlike the other processors, so focusing on CELL answers questions applicable to both camps.

In the grand scheme of things we all know that the next-gen console CPUs won't hit peak performance in any game, ever. That is a given. Just as relevant, though, is that for the XeCPU and CELL to reach their maximum potential they must have all their cores actively running at peak capacity--that means 3 and 8 cores, respectively, running full bore all the time with no hiccups. That means:

No idle cores with no assigned tasks.
No cores waiting for other cores to produce data.
No cores running a single non-intensive thread that leaves unused potential.

Since CELL has the most processors it would appear to be the most prone to performance issues stemming from cores waiting on one another (it also just happens to be the chip with the most potential!!), so I would like to focus on the CELL microprocessor in my questions for developers:

Question: What is the developers' plan of attack for dealing with CELL? Some of my specific questions are:

1. How are developers attacking the problem of data dependency? If "core 0" is waiting for "core 1" to return data, "core 0" is being inefficient and the theoretical peak performance of the processor takes a SIGNIFICANT hit. The more processors, the more potential for a mess. So how is this issue being addressed?

2. How do developers plan to divide their code onto 8 processors? The issues with breaking code apart have been noted here--it is not as simple as putting physics on one CPU, AI on another, the game code on a third, etc... So what are some current thoughts on how to tackle this issue? Unused processors mean unused potential!!!

3. Efficiency. Will it be difficult to find tasks for all 8 processors to actively work on at the same time? E.g. if a processor is dedicated to particle effects and none are present at the time, that core is sitting idle. This could apply to physics, AI, whatever. If you dedicate processors to tasks that are not being used in a scene, one core may sit idle while another task totally saturates its processor. They may even ping-pong back and forth. Can this be avoided?

4. How many threads is "enough"? It occurs to me that 8 threads may not be enough, for two reasons. One is threads that are "on / off" like the particle effects mentioned above. Another is low-intensity tasks like some sound engines. So is there a need to have more threads?

5. SPEs. It seems to me one solution is to have more than one thread on a core (e.g. two low-intensity threads that better use the power of a core, OR two threads that tend to alternate between idling and running hard and that play nice with each other). Do the SPEs realistically have the memory size to deal with two or three threads? Another option seems to be swapping applications in from main memory. Is this fast enough to avoid dependency issues? Is the latency low enough for this type of behavior and to avoid stalls (see #1)? (Lots of general questions here, but they boil down to inefficiency: how do you plan to deal with it?)

6. Does the asymmetric nature of the processors raise any significant hurdles? The SPEs are full-featured cores, but as some developers here have noted, they are not great performers in certain areas. Will you be forced to run code on the SPEs that doesn't let them reach their full potential--code you would run on another PPE if you had one? Does dealing with two distinct processors (and their different cache and local store sizes) cause serious issues with code portability, or is this a minor issue because both accept compiled code and performance appears to be acceptable on either core?

7. Any other areas or issues of concern you are facing OR have overcome.


Obviously multi-core environments force programmers to look at new ways of dealing with problems. Scheduling, dependency, and efficiency are important pieces of the puzzle. I am interested to hear how this problem in general is being attacked and any feedback on what we can realistically expect in the next couple of years.


PS - For this discussion I want to avoid the following topics:

1. How the XeCPU, Cell, and PC multi-core solutions vary (unless directly related to how developers plan to overcome the issues of a multi-core environment... e.g. an approach that works on some platforms but not others). No "Microprocessor X sucks! My favorite CPU kicks its butt!!!111".
2. HDD
3. Killzone
4. Allard
Thanks ;)
 
Acert93 said:
1. How are developers attacking the problem of data dependency? If "core 0" is waiting for "core 1" to return data, "core 0" is being inefficient and the theoretical peak performance of the processor takes a SIGNIFICANT hit. The more processors, the more potential for a mess. So how is this issue being addressed?
Pipelining. Divide your workload into multiple packets (A, B, C, D, ...) and divide the processing into distinct stages (1, 2, 3, 4, ...):
Snapshot 0:
Thread 0: A1
Thread 1: (idle)
Thread 2: (idle)

Snapshot 1:
Thread 0: B1
Thread 1: A2
Thread 2: (idle)

Snapshot 2:
Thread 0: C1
Thread 1: B2
Thread 2: A3

For a more thorough look at the issues, Avery Lee (of VirtualDub fame) has recently written a good piece on how (not) to approach multi-threading here.
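
To make the pipeline concrete, here is a minimal C++ sketch of the same idea (not from maven's post; the Packet layout and the stage1..stage3 bodies are hypothetical stand-ins for e.g. animation, physics, culling). Each thread owns one stage, packets flow through blocking queues, and once the pipeline fills, all three threads stay busy exactly as in Snapshot 2:

#include <condition_variable>
#include <cstdio>
#include <mutex>
#include <queue>
#include <thread>

struct Packet { int id; float data[64]; };

// A minimal blocking queue to hand packets between stages.
template <typename T>
class BlockingQueue {
    std::queue<T> q;
    std::mutex m;
    std::condition_variable cv;
public:
    void push(T v) {
        { std::lock_guard<std::mutex> l(m); q.push(std::move(v)); }
        cv.notify_one();
    }
    T pop() {
        std::unique_lock<std::mutex> l(m);
        cv.wait(l, [this] { return !q.empty(); });
        T v = std::move(q.front()); q.pop();
        return v;
    }
};

// Hypothetical stage bodies.
void stage1(Packet& p) { p.data[0] += 1.0f; }
void stage2(Packet& p) { p.data[0] *= 2.0f; }
void stage3(Packet& p) { std::printf("packet %d: %f\n", p.id, p.data[0]); }

int main() {
    const int N = 8;                  // packets A, B, C, ... from the snapshots above
    BlockingQueue<Packet> q12, q23;   // queues between stages 1->2 and 2->3
    std::thread t1([&] { for (int i = 0; i < N; ++i) { Packet p{i, {}}; stage1(p); q12.push(p); } });
    std::thread t2([&] { for (int i = 0; i < N; ++i) { Packet p = q12.pop(); stage2(p); q23.push(p); } });
    std::thread t3([&] { for (int i = 0; i < N; ++i) { Packet p = q23.pop(); stage3(p); } });
    t1.join(); t2.join(); t3.join();
}

The key property is the one the snapshots show: the dependency between stages is paid once at startup (the pipeline "fill"), not on every packet.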

2. How do developers plan to divide their code onto 8 processors? The issues with breaking code apart have been noted here--it is not as simple as putting physics on one CPU, AI on another, the game code on a third, etc... So what are some current thoughts on how to tackle this issue? Unused processors mean unused potential!!!


This is the interesting bit and will to a large degree determine the quality of the individual games IMO.

3. Efficiency. Will it be difficult to find tasks for all 8 processors to actively work on at the same time? E.g. if a processor is dedicated to particle effects and none are present at the time, that core is sitting idle. This could apply to physics, AI, whatever. If you dedicate processors to tasks that are not being used in a scene, one core may sit idle while another task totally saturates its processor. They may even ping-pong back and forth. Can this be avoided?
I think the Xenon CPU has an advantage here because of the unified cache; it should be able to schedule more dynamically than Cell's SPUs (where I imagine the code to be executed has to be uploaded for each different code fragment).

4. How many threads is "enough"? It occurs to me that 8 threads may not be enough, for two reasons. One is threads that are "on / off" like the particle effects mentioned above. Another is low-intensity tasks like some sound engines. So is there a need to have more threads?
I am confident there will be more threads (albeit obviously handled by the OS in software via classic time-slicing), but I expect that developers will have plenty of options to influence the scheduler manually.

5. SPEs. It seems to me one solution is to have more than one thread on a core (e.g. two low-intensity threads that better use the power of a core, OR two threads that tend to alternate between idling and running hard and that play nice with each other). Do the SPEs realistically have the memory size to deal with two or three threads? Another option seems to be swapping applications in from main memory. Is this fast enough to avoid dependency issues? Is the latency low enough for this type of behavior and to avoid stalls (see #1)? (Lots of general questions here, but they boil down to inefficiency: how do you plan to deal with it?)
I'd expect each SPU to chew through its 256 KB workloads with a single program (i.e. one "thread" per SPU). Both CPUs will have trouble keeping the cores busy / caches filled when multiple threads exhibit semi-erratic memory access patterns, but I expect Cell programmers to have more / better experience (from the PS2) with circumventing this via DMA transfers; Xbox / PC coders will struggle with this.

6. Does the asymmetric nature of the processors raise any significant hurdles? The SPEs are full-featured cores, but as some developers here have noted, they are not great performers in certain areas. Will you be forced to run code on the SPEs that doesn't let them reach their full potential--code you would run on another PPE if you had one? Does dealing with two distinct processors (and their different cache and local store sizes) cause serious issues with code portability, or is this a minor issue because both accept compiled code and performance appears to be acceptable on either core?
A lot of this comes down to the development tools / SDKs. Cell has more of a master + 7/8 slaves relationship (which I mentally picture as a CPU + 7 very, very flexible vertex or pixel shaders, or even PPUs), whereas the Xenon CPU offers identical instruction sets and performance characteristics across cores (which in turn makes more flexible scheduling possible). I am a bit concerned that the Cell PPE might be a bit overworked, but due to the clear relationship between PPE and SPUs, at least it will not have as much trouble synchronising threads (which may be a problem for the Xenon CPU - it could spend a lot of cycles on synchronisation issues).
 
Acert93 said:
1. How are developers attacking the problem of data dependency? If "core 0" is waiting for "core 1" to return data, "core 0" is being inefficient and the theoretical peak performance of the processor takes a SIGNIFICANT hit. The more processors, the more potential for a mess. So how is this issue being addressed?

Probably the first order of business would be to figure out why you have the data dependency in the first place and whether it is really necessary. If we assume it is, at what level? Can you pair the computations on the first set of data and the second set of data, and parallelize that?

2. How do developers plan to divide their code onto 8 processors? The issues with breaking code apart have been noted here--it is not as simple as putting physics on one CPU, AI on another, the game code on a third, etc... So what are some current thoughts on how to tackle this issue? Unused processors mean unused potential!!!


This is really related to question 1 and data dependency. If you have little data dependency you can create batches (think of tiles in 3D rendering) which you send out to each SPE or processor to process and return. Breaking up the workload is going to require figuring out where the boundaries of the data dependency are. What do you need to perform the computation, how big is it (will it fit in cache?), and what things are dependent on it? Another question will be how much it costs to move something to an idle processor. It might be better to just let the processor remain idle for a bit if moving the data there is more expensive than finishing the computation on a slower node.
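
As a hedged illustration of that batching idea (the BATCH and CUTOFF values and the sqrt workload are stand-ins I made up, not anything from Nite_Hawk's post), here is roughly how the "cost of moving work" tradeoff might look in C++: independent batches get shipped to workers, but batches below a cutoff are cheaper to just finish locally:

#include <algorithm>
#include <cmath>
#include <functional>
#include <thread>
#include <vector>

// Stand-in workload: operate on one independent batch of the data.
static void process(std::vector<float>& v, size_t begin, size_t end) {
    for (size_t i = begin; i < end; ++i) v[i] = std::sqrt(v[i]);
}

int main() {
    std::vector<float> data(10000, 2.0f);
    const size_t BATCH  = 2048; // sized so one batch fits in cache / local store
    const size_t CUTOFF = 512;  // below this, shipping the batch costs more than doing it here

    std::vector<std::thread> workers;
    for (size_t i = 0; i < data.size(); i += BATCH) {
        const size_t end = std::min(i + BATCH, data.size());
        if (end - i < CUTOFF)
            process(data, i, end); // tail batch too small: keep it local
        else
            workers.emplace_back(process, std::ref(data), i, end);
    }
    for (auto& t : workers) t.join();
}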

3. Efficiency. Will it be difficult to find tasks for all 8 processors to actively work on at the same time? E.g. if a processor is dedicated to particle effects and none are present at the time, that core is sitting idle. This could apply to physics, AI, whatever. If you dedicate processors to tasks that are not being used in a scene, one core may sit idle while another task totally saturates its processor. They may even ping-pong back and forth. Can this be avoided?

You could avoid this if, instead of dividing up tasks like "physics", "AI", etc. into their own serial processes, you divide the work into chunks that can be performed on any processor. You would need to find out where you have data dependency, though, and that may be really tough at times. Probably you will have some combination of relatively small batches of work and larger, more serial tasks. Figuring out how and when to execute the smaller tasks while an SPE is idle between larger tasks is the million-dollar question.

4. How many threads is "enough"? It occurs to me that 8 threads may not be enough, for two reasons. One is threads that are "on / off" like the particle effects mentioned above. Another is low-intensity tasks like some sound engines. So is there a need to have more threads?

Maybe, maybe not. If you have too many threads you introduce more overhead. Too few threads and you can't keep the processors fed with work unless you have low data dependency.

5. SPEs. It seems to me one solution is to have more than one thread on a core (e.g. two low-intensity threads that better use the power of a core, OR two threads that tend to alternate between idling and running hard and that play nice with each other). Do the SPEs realistically have the memory size to deal with two or three threads? Another option seems to be swapping applications in from main memory. Is this fast enough to avoid dependency issues? Is the latency low enough for this type of behavior and to avoid stalls (see #1)? (Lots of general questions here, but they boil down to inefficiency: how do you plan to deal with it?)

It will be tough, and probably really dependent on the kind of work being done. If you have 8 serial processes to do that are all independent of each other, you are golden. There would be no point in making additional threads or doing anything fancy. If you have some combination of large serial tasks and small tasks, it will be tougher. You don't want to be constantly swapping data around, but at the same time you don't want your processors sitting idle. It's all a tradeoff.

6. Does the asymmetric nature of the processors raise any significant hurdles? The SPEs are full-featured cores, but as some developers here have noted, they are not great performers in certain areas. Will you be forced to run code on the SPEs that doesn't let them reach their full potential--code you would run on another PPE if you had one? Does dealing with two distinct processors (and their different cache and local store sizes) cause serious issues with code portability, or is this a minor issue because both accept compiled code and performance appears to be acceptable on either core?

Like [maven] said, Cell really seems to have a "master/slave" relationship. The SPEs feel a lot more like "helper" units for the PPE than equal processors. I think at least at first, the SPEs are probably going to be used primarily to catch the low-hanging fruit among things that would otherwise just run on the PPE. Something along the lines of: "Right now everything is running on the PPE, but 80% of it is this random physics code we can semi-easily parallelize; let's try moving it off to the SPEs."

Nite_Hawk
 

1. How are developers attacking the problem of data dependency? If "core 0" is waiting for "core 1" to return data, "core 0" is being inefficient and the theoretical peak performance of the processor takes a SIGNIFICANT hit. The more processors, the more potential for a mess. So how is this issue being addressed?


This is just a fact of life. All you can really do with dependencies is identify them and minimise their number. In any large-scale parallel architecture you will see idle execution units a large percentage of the time. This is largely why 2 processors do not equal 2x the power.


2. How do developers plan to divide their code onto 8 processors? The issues with breaking code apart have been noted here--it is not as simple as putting physics on one CPU, AI on another, the game code on a third, etc... So what are some current thoughts on how to tackle this issue? Unused processors mean unused potential!!!


I think Cell is a somewhat unique challenge, partly because of its asymmetric architecture and partly because of the limitations of the SPEs.
Code and data pretty much have to fit in the local memory, and any change of function means DMA'ing in new code and data. The latter makes a context switch potentially VERY expensive. Double buffering the memory will minimise this, but at the cost of half the local memory.
I think on Cell you'll be parcelling off chunks of work to the other processors and running a largely sequential main thread; the question is how big the chunks need to be to make it worthwhile parcelling them off rather than just computing them locally.
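
For the curious, here is roughly what that double buffering looks like on an SPE, sketched with the MFC intrinsics from the Cell SDK's spu_mfcio.h (builds with the SPU toolchain, not a host compiler; the chunk size and crunch() workload are made up, so treat this as an illustration rather than production code). While the SPU works on one buffer, the DMA engine fills the other, hiding transfer latency at the cost ERP mentions: half the local store.

#include <spu_mfcio.h>
#include <stdint.h>

#define CHUNK 4096  // 4096 floats = 16 KB, the maximum size of a single DMA transfer

static float buf[2][CHUNK] __attribute__((aligned(128)));  // the two halves of local store

static void crunch(float* p, int n) {  // made-up workload
    for (int i = 0; i < n; ++i) p[i] *= 2.0f;
}

// ea = effective address of the input array in main memory.
void run(uint64_t ea, int nchunks) {
    int cur = 0;
    mfc_get(buf[cur], ea, sizeof(buf[0]), cur, 0, 0);  // kick off the first transfer (tag = buffer index)
    for (int c = 0; c < nchunks; ++c) {
        const int next = cur ^ 1;
        if (c + 1 < nchunks)  // start fetching the next chunk into the other buffer
            mfc_get(buf[next], ea + (uint64_t)(c + 1) * sizeof(buf[0]), sizeof(buf[0]), next, 0, 0);
        mfc_write_tag_mask(1 << cur);   // wait only on the buffer we are about to use
        mfc_read_tag_status_all();
        crunch(buf[cur], CHUNK);        // compute overlaps with the in-flight DMA
        cur = next;
    }
}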



3. Efficiency. Will it be difficult to find tasks for all 8 processors to actively work on at the same time? E.g. if a processor is dedicated to particle effects and none are present at the time, that core is sitting idle. This could apply to physics, AI, whatever. If you dedicate processors to tasks that are not being used in a scene, one core may sit idle while another task totally saturates its processor. They may even ping-pong back and forth. Can this be avoided?


I think you're thinking at too coarse a scale.
SPEs will do multiple different things in a frame; the goal will be to get the granularity to a level where you can practically load balance.


4. How many threads is "enough"? It occurs to me that 8 threads may not be enough, for two reasons. One is threads that are "on / off" like the particle effects mentioned above. Another is low-intensity tasks like some sound engines. So is there a need to have more threads?


Threads are the wrong concept IMO.
I'd start thinking more in terms of jobs: a job is a code+data packet sent to an SPE that does a significant amount of work.
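
As a rough illustration of the packet idea (the field names are mine, not ERP's), a job might look something like this:

#include <stdint.h>

// A self-contained unit of work: everything a worker core needs, no shared state.
struct Job {
    uint32_t task_id;      // selects which piece of SPE code to run
    uint64_t input_ea;     // main-memory address of the input data
    uint64_t output_ea;    // where to write (DMA) the results back
    uint32_t input_size;   // sized so code + data fit in the 256 KB local store
    uint32_t output_size;
};

The main thread fills a queue with these; each SPE pulls one, DMAs the code and data in, runs, and DMAs the result out, with no synchronisation against its siblings beyond the queue itself.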


5. SPEs. It seems to me one solution is to have more than one thread on a core (e.g. two low-intensity threads that better use the power of a core, OR two threads that tend to alternate between idling and running hard and that play nice with each other). Do the SPEs realistically have the memory size to deal with two or three threads? Another option seems to be swapping applications in from main memory. Is this fast enough to avoid dependency issues? Is the latency low enough for this type of behavior and to avoid stalls (see #1)? (Lots of general questions here, but they boil down to inefficiency: how do you plan to deal with it?)


Threads are just the wrong concept for SPEs. They're much more like the VUs from the PS2 than they are like the PPE in Cell. Yes, they can run general code, but their limitations lend them to running self-contained jobs.


6. Does the asymmetric nature of the processors raise any significant hurdles? The SPEs are full-featured cores, but as some developers here have noted, they are not great performers in certain areas. Will you be forced to run code on the SPEs that doesn't let them reach their full potential--code you would run on another PPE if you had one? Does dealing with two distinct processors (and their different cache and local store sizes) cause serious issues with code portability, or is this a minor issue because both accept compiled code and performance appears to be acceptable on either core?


See above, plus the fact that you have to target your code to either the PPE or SPE explicitly, and retargeting is potentially a pain.


7. Any other areas or issues of concern you are facing OR have overcome.
Obviously multi-core environments force programmers to look at new ways of dealing with problems. Scheduling, dependency, and efficiency are important pieces of the puzzle. I am interested to hear how this problem in general is being attacked and any feedback on what we can realistically expect in the next couple of years.


You could use the above approach to parallelism on the X360, but I'm not sure it's the best way to do it there. The fact that the two architectures are more different than they are alike makes for some painful decisions when targeting cross-platform.
 
Something to keep in mind is that maximizing performance is not always needed. If you can get it to work "well enough" and the solution makes sense to the programmers, that may be preferable to a system that's faster but that only one or two people understand.

I guess I'd approach the problem from the point of view of speeding up the game relative to running it all on the PPE, as opposed to maximizing every core. The former doesn't impose such stringent performance demands.

Anyway, I'm going to concur with the master/slave consensus developing here. Maybe something like this would work: say you write all your task-specific SPE code and designate each piece a unique ID. Then make a "core" program that receives a job, pulls the task ID from it, loads the SPE code for that task, and runs it on the job. The PPE pulls off chunks of work, assigns IDs to each, and puts them in the L2 cache for the SPEs to grab and process. As long as the SPEs can retire jobs from the cache fairly quickly, this should not starve any one core. Also, smarter queuing could ensure SPEs switch tasks as little as possible. A rough sketch of the idea follows.
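
(All names here are hypothetical, and on real hardware the queue would live in the L2/local store with platform synchronisation primitives rather than a std::mutex; this just shows the shape of the worker loop, including reloading code only on a task switch:)

#include <cstdint>
#include <deque>
#include <map>
#include <mutex>

struct Job { uint32_t task_id; void* data; };
using TaskFn = void (*)(void*);

std::map<uint32_t, TaskFn> task_table;  // task ID -> code (on Cell: an SPE program image)
std::deque<Job> job_queue;              // the PPE fills this; workers drain it
std::mutex queue_mutex;

void worker_loop() {  // the "core" program each SPE would run
    uint32_t loaded_task = UINT32_MAX;  // which task's code is currently loaded
    TaskFn fn = nullptr;
    for (;;) {
        Job job;
        {
            std::lock_guard<std::mutex> lock(queue_mutex);
            if (job_queue.empty()) return;
            job = job_queue.front();
            job_queue.pop_front();
        }
        if (job.task_id != loaded_task) {  // task switch: on an SPE this means DMA'ing new code in
            fn = task_table[job.task_id];
            loaded_task = job.task_id;
        }
        fn(job.data);
    }
}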

The problem, of course, is that not everything fits this model.
 
Good question!

You mentioned the move to multi-core in the PC market. You may find this surprising, but in a sly sort of way PCs may (in the short term at least) be better suited to using parallel CPU processors than consoles. That's because PCs run multi-tasking operating systems. Right now I've got 20 processes running on my stupid Windows XP machine. If I had 20 processors, I could just give each process its own processor and see an enormous performance benefit. Not 20x, but close enough to make you say wow.

That's a simplified example but you see where I'm going. PCs (as they are most commonly used) basically always are running more than one program. Consoles basically always run one program at a time: a game. So that game has to be broken up into multiple "programs" (threads) to use multiple processors. That's a fairly large paradigm shift in software development and it will take time for developers (whether they write software for PCs or consoles or cell phones or whatever) to figure out how to do it effectively.

For now PCs will "go around" the problem by scheduling separate single-threaded programs onto separate cores. Unfortunately this won't benefit PC games much (if at all).
 
1. How are developers attacking the problem of data dependency? If "core 0" is waiting for "core 1" to return data, "core 0" is being inefficient and the theoretical peak performance of the processor takes a SIGNIFICANT hit. The more processors, the more potential for a mess. So how is this issue being addressed?
More or less... it's not. There's really not much you can do. There are certainly a number of repetitive tasks where you can group data dependencies into chunks and thread off the code that operates on those chunks, but these are usually the more "trivially" parallelizable tasks. In most cases, while you can identify the problem cases, there's little you can do short of revamping the whole process, and even then it may not buy you much of anything.

2. How do developers plan to divide their code onto 8 processors? The issues with breaking code apart have been noted here--it is not as simple as putting physics on one CPU, AI on another, the game code on a third, etc... So what are some current thoughts on how to tackle this issue? Unused processors mean unused potential!!!

In the case of CELL, I really don't think peer threads will work out too well. I pretty much see things as ERP mentioned, in terms of jobs and queuing up jobs. It also fits with the whole "apulets" idea that gets mentioned throughout the CELL patents. If you saw the Carmack keynote, he also mentioned that this would be the easiest approach for people worried about development on multiple platforms including PS3. While that may work reasonably well on CELL, I don't see it being too pretty on X360 (in terms of efficiency). In either case, though, for a lot of the initial round of titles, I'd expect to see a structure where the outer loop of the game is basically the same sequential thing we're all used to, and each component within the processing loop that can be effectively parallelized takes up every core/hardware thread to do its job.
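
A sketch of that structure (the phase names and slicing are invented for illustration, and a real engine would keep persistent worker threads via the platform's thread API rather than spawning std::threads every phase): the outer loop stays sequential, and each parallelizable component fans out across all cores, then joins before the next one runs.

#include <algorithm>
#include <thread>
#include <vector>

const unsigned NUM_CORES = std::max(1u, std::thread::hardware_concurrency());

// Hypothetical per-slice workers for one component of the frame.
void physics_slice(int slice) { /* integrate the bodies belonging to this slice */ }
void ai_slice(int slice)      { /* update the agents belonging to this slice */ }

// Fork one worker per core for a single phase, then join before the next phase.
template <typename Fn>
void parallel_phase(Fn fn) {
    std::vector<std::thread> pool;
    for (unsigned i = 0; i < NUM_CORES; ++i) pool.emplace_back(fn, (int)i);
    for (auto& t : pool) t.join();
}

void game_loop_iteration() {        // the familiar sequential outer loop
    parallel_phase(physics_slice);  // every core works on physics...
    parallel_phase(ai_slice);       // ...then every core works on AI
    // ...then render submission and so on, still in order
}

int main() { game_loop_iteration(); }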

3. Efficiency. Will it be difficult to find tasks for all 8 processors to actively work on at the same time? E.g. if a processor is dedicated to particle effects and none are present at the time, that core is sitting idle. This could apply to physics, AI, whatever. If you dedicate processors to tasks that are not being used in a scene, one core may sit idle while another task totally saturates its processor. They may even ping-pong back and forth. Can this be avoided?
See above. I really don't expect developers right away to be audacious enough to start going down paths that introduce dozens and dozens of race conditions (i.e. the KISS principle) -- maybe those who aren't working on launch titles and so have a decent time buffer. This is probably something that will come in the 2nd generation of titles, or with developers who basically start anew for next-gen. In the long run, I think efficiency that could even be called "medium-high" is an unreachable goal. If I were working on a project full of "embarrassingly parallel" problems, then maybe I'd see good efficiency and good speedup from the 360 and PS3 designs, but can you really imagine how to design a game around... say... particles? I sure can't, though I'm sure some Japanese studio will find a way that earns a Katamari-like following.

There's no universal solution, and there really never will be. In a lot of ways that raises some major concerns, because it probably means that while an engine can remain semi-static, you're basically going to have to rework a lot of junk at the game-code level from scratch for each and every project, since the individual concerns can have radically different solutions. That could in turn lead to more formulaic product, as it means more work, more time, and more money every time a company tries to make a very different type of game.

4. How many threads is "enough"? It occurs to me that 8 threads may not be enough, for two reasons. One is threads that are "on / off" like the particle effects mentioned above. Another is low-intensity tasks like some sound engines. So is there a need to have more threads?

5. SPEs. It seems to me one solution is to have more than one thread on a core (e.g. two low-intensity threads that better use the power of a core, OR two threads that tend to alternate between idling and running hard and that play nice with each other). Do the SPEs realistically have the memory size to deal with two or three threads? Another option seems to be swapping applications in from main memory. Is this fast enough to avoid dependency issues? Is the latency low enough for this type of behavior and to avoid stalls (see #1)? (Lots of general questions here, but they boil down to inefficiency: how do you plan to deal with it?)
Well, in the vein of "jobs", I think as long as you always have jobs in a queue waiting for an SPE to go to, that pretty much tells you you're not underutilizing the chip (assuming your apulet code is reasonably efficient), though it could be a sign of oversaturation if that queue starts filling up quickly. In general, the idea of jobs and worker threads is very much of the nature that the order in which you perform the jobs isn't important, and just filling up resources with more jobs is enough to guarantee effective resource utilization. As long as all the jobs get done quickly, that's enough -- no need to worry about scheduling.

6. Does the asymmetric nature of the processors raise any significant hurdles? The SPEs are full-featured cores, but as some developers here have noted, they are not great performers in certain areas. Will you be forced to run code on the SPEs that doesn't let them reach their full potential--code you would run on another PPE if you had one? Does dealing with two distinct processors (and their different cache and local store sizes) cause serious issues with code portability, or is this a minor issue because both accept compiled code and performance appears to be acceptable on either core?
Well, I certainly expect that each SPE job would be specifically constructed to fit within the 256 KB of LS, and ideally you'd probably want jobs that approach that limit, as it means more work done per job (assuming you're not really working with single data structures that each eat up 100 KB). Ultimately, though, it really just boils down to specializing the code for each processor. Everything that is meant to go off to the SPEs will only go off to the SPEs. It may have a PPE counterpart that is used less often for "on-demand" purposes, but that will be ill-used.

7. Any other areas or issues of concern you are facing OR have overcome. Obviously multi-core environments force programmers to look at new ways of dealing with problems. Scheduling, dependency, and efficiency are important pieces of the puzzle. I am interested to hear how this problem in general is being attacked and any feedback on what we can realistically expect in the next couple of years.
The biggest one that bugs me is really just multi-platform development. The challenges and approaches are likely to be radically different for each platform, and it's just going to lead to a mess. I have no clue about Revolution yet, but it's probably going to be a whole other ball of wax in itself.
 