What's the current status of "real-time Pixar graphics"?

Dio said:
And I know that traditionally, multiprocessing architectures are harder to program efficiently than single-processor architectures.
Having worked on a multi-cpu system (~32 cpus) in my previous job and having studied the damned things in my Honours year, I'd just like to say that Dio's comment is an understatement.
 
Heh. I talked to a friend a while back after he got back from a GDC2003 roundtable discussion with a bunch of game programmers. Some Sony reps were apparently asking if anyone there had any experience with parallel processing, and receiving 0 raised hands in return.

It's definitely going to be a bumpy ride.
 
Paul said:
Said APU's on the GPU could crunch Shader programs rather well I think.
No argument there. But the ALU is the easy part of a shader architecture. Getting the data in and out in a timely manner, with sufficient latency compensation, is the difficult bit.
 
Dio, give up mate...
"The Milliard Gargantubrain?" said PS3 with
unconcealed contempt. "A mere abacus. Mention it not."

Apologies to the late great DNA
 
I think that's a terribly good idea. This is the kind of discussion that should only be had over a pint or three.
 
Ilfirin said:
Heh. I talked to a friend a while back after he got back from a GDC2003 roundtable discussion with a bunch of game programmers. Some Sony reps were apparently asking if anyone there had any experience with parallel processing, and receiving 0 raised hands in return.
Hmm. In the embedded (or really, "limited device") programming world, parallel processing is more like standard practice. I've done all sorts of robotic/microcontroller applications, and I'm just now tinkering around with aJile systems for fun. Parallel processing (multiple chips, multiple processors) actually makes things a lot easier for such apps.
Yeah, it's not parallel processing in the sense of performing complex computations on huge data arrays, but it's parallel nonetheless, and it does data processing.
 
Most CS/CE/EE students should have done some parallel programming; are there really so few developers with an academic background?
 
I doubt the question was as simple as "have you ever had to work with multi-processing systems?" But I wasn't there (2nd hand info), so I couldn't tell you for sure.

This guy was, though:
A developer from Sony asked if anyone had any experience updating objects on an SMP system. No one did, but if the PlayStation 3 really does have 72 processors, I suppose we're all going to learn.
 
Simon and Dio, please do not abandon the thread... discussions can be informative.

Simon F said:
Dio said:
And I know that traditionally, multiprocessing architectures are harder to program efficiently than single-processor architectures.
Having worked on a multi-cpu system (~32 cpus) in my previous job and having studied the damned things in my Honours year, I'd just like to say that Dio's comment is an understatement.

Well at least you do not have to worry about cache coherency at the APU level: there is no cache :)

The 128 KB of LS is, in effect, the main RAM of each APU: the e-DRAM and the external RAM are both lower steps in the memory hierarchy.

All the code that an APU executes has to be in the LS.

A traditional SMP approach (think about your quad-processor Xeon, for example) has 4 CPU cores that are essentially all feeding data and instructions from the same main RAM. Each processor in this SMP system has a cache, and the problem (no news here) is coherency: we do not want one CPU to work on the data in its cache while another CPU loads that memory location from main RAM before the modified cache blocks are written back; or the first CPU might update its cache, and let's say main memory as well, while another CPU core still has the old copy of that block in its own cache and keeps working on the stale block instead of fetching the updated data from memory.

From both a system designer's point of view and a programmer's point of view this can be annoying.
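To make that coherency headache concrete, here is a minimal sketch in standard C++ (my own toy example, nothing Cell-specific): two threads bump the same counter, and without an atomic the read-modify-write races and updates get lost.

```cpp
// Minimal illustration of why shared, cached memory needs coordination:
// two threads increment the same counter. The plain int races; the atomic
// forces the hardware to serialise the updates.
#include <atomic>
#include <cstdio>
#include <thread>

int plain_counter = 0;               // racy: increments can be lost
std::atomic<int> atomic_counter{0};  // coherent: updates are serialised

void worker() {
    for (int i = 0; i < 1'000'000; ++i) {
        ++plain_counter;                                          // data race
        atomic_counter.fetch_add(1, std::memory_order_relaxed);   // safe
    }
}

int main() {
    std::thread a(worker), b(worker);
    a.join();
    b.join();
    std::printf("plain:  %d (usually less than 2000000)\n", plain_counter);
    std::printf("atomic: %d (always 2000000)\n", atomic_counter.load());
}
```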

In most cases I think we can compare a CELL CPU with a distributed computing environment over a relatively large network: each PE is a separate and independent super-node in this system.

Each node, in most cases, could be thought of as a LAN with a host APU attached to it (the APU has its own RAM).

Yes, we have a common resource (e-DRAM) that all APUs do access, but if you look at the patent you will see that, for all intents and purposes (unless we start being funny with the APU masks using the PUs), each APU has its own protected portion of that resource.

Implemented in hardware there is a memory sandbox structure: each APU has its own portion of e-DRAM, which is marked with that APU's ID, and access to that region is granted only to the particular APU whose ID matches the one the memory space was tagged with (unless you play with the APU masks as I said: trusted code run by the PUs can change the APU masks so that one APU can access the memory space reserved for another APU, but it is up to the programmer whether they want to deal with that or not).
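As a rough software model of that sandbox idea (the struct, the field names and the mask layout below are my own invention, not taken from the patent), an access check could look something like this:

```cpp
// Toy model of an e-DRAM sandbox: each region is tagged with an owner APU ID
// plus an access mask. A request is granted only if the requesting APU owns
// the region or trusted PU code has set its bit in the mask.
#include <cstdint>
#include <cstdio>

struct Sandbox {
    uint32_t base;      // start of the region in e-DRAM
    uint32_t size;      // region size in bytes
    uint8_t  owner_id;  // APU the region was tagged with
    uint32_t apu_mask;  // extra APUs granted access by the PU (one bit per APU)
};

bool access_allowed(const Sandbox& s, uint8_t apu_id, uint32_t addr) {
    const bool in_range = addr >= s.base && addr < s.base + s.size;
    const bool id_ok    = (apu_id == s.owner_id) || (s.apu_mask & (1u << apu_id));
    return in_range && id_ok;
}

int main() {
    Sandbox s{0x1000, 0x800, /*owner*/ 3, /*mask*/ 0};
    std::printf("APU 3 -> %d\n", access_allowed(s, 3, 0x1200)); // 1: owner
    std::printf("APU 5 -> %d\n", access_allowed(s, 5, 0x1200)); // 0: not granted
}
```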

APUs do not necessarily share data with each other, as that would be more complicated to deal with (it is not particularly relaxing to think about 32 APUs all accessing a common, non-subdivided memory space).

Treating each APU as its own thread/process and implementing message passing between APUs should be the preferred approach (this implementation is part of the trusted code: OS + game code that orchestrate the APUs [a sort of "intelligent" thread scheduler and resource organizer that either the programmer or middleware would provide]). Of course it will still not be easy, that I understand, but the more structured approach might help.
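A minimal sketch of that message-passing style, with each "APU" modelled as an ordinary thread draining its own mailbox (purely illustrative standard C++, not Cell code):

```cpp
// Each "APU" is a thread with a private mailbox; the orchestrating code
// communicates only by posting messages, never by sharing data structures.
#include <condition_variable>
#include <cstdio>
#include <functional>
#include <mutex>
#include <queue>
#include <thread>
#include <vector>

struct Message { int task_id; bool stop; };

class Mailbox {
    std::queue<Message> q_;
    std::mutex m_;
    std::condition_variable cv_;
public:
    void post(Message msg) {
        { std::lock_guard<std::mutex> lk(m_); q_.push(msg); }
        cv_.notify_one();
    }
    Message wait() {
        std::unique_lock<std::mutex> lk(m_);
        cv_.wait(lk, [&] { return !q_.empty(); });
        Message msg = q_.front(); q_.pop();
        return msg;
    }
};

void apu_worker(int id, Mailbox& inbox) {
    for (;;) {
        Message msg = inbox.wait();
        if (msg.stop) return;
        std::printf("APU %d processed task %d\n", id, msg.task_id);
    }
}

int main() {
    const int kApus = 4;
    std::vector<Mailbox> boxes(kApus);
    std::vector<std::thread> apus;
    for (int i = 0; i < kApus; ++i)
        apus.emplace_back(apu_worker, i, std::ref(boxes[i]));
    for (int t = 0; t < 8; ++t)               // the "PU" hands out tasks round-robin
        boxes[t % kApus].post({t, false});
    for (auto& b : boxes) b.post({0, true});  // shut everyone down
    for (auto& t : apus) t.join();
}
```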

Directing an orchestra is not a trivial job, but that is what will be required in such a scenario.

In each PE the PU runs the OS code, performs I/O of data and runs part of the game code.

The PU will then assign tasks to one or more APUs and will instruct them how to communicate and inter-operate (what to do with partial results, etc.). If you imagine sharing T&L across several APUs, or even different PEs, you will see that for things such as collision detection we do need to put some effort into the PU portion of our code, as it is the PU's responsibility to make sure everything is performed according to the original plan and in the original sequence.
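Just to illustrate the sequencing side (again a plain C++ sketch, not the actual PU code): the orchestrating code tags each chunk of work with an index and commits the partial results in the original order, no matter which worker finishes first.

```cpp
// The "PU" splits a job into indexed chunks, the workers finish in whatever
// order they like, and the results are committed strictly in submission order.
#include <cstdio>
#include <future>
#include <vector>

int process_chunk(int index) {
    // Stand-in for the T&L / collision work that one APU would do.
    return index * index;
}

int main() {
    const int kChunks = 8;
    std::vector<std::future<int>> partial;
    for (int i = 0; i < kChunks; ++i)
        partial.push_back(std::async(std::launch::async, process_chunk, i));

    // Commit results in the original sequence, even if chunk 7 finished first.
    for (int i = 0; i < kChunks; ++i)
        std::printf("chunk %d -> %d\n", i, partial[i].get());
}
```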

About launching PlayStation 2 one year later, with one more year of R&D and using one or two generations of manufacturing processes after the 250 nm node: I do not see how that would have worsened the whole picture economically (aside from the greater competition).

If you launch in 2001 your next generation is not going to launch, generally speaking, before 2007, and if you launch in 2000 your next generation is not going to launch, generally speaking, before 2006.

In the case of SCE and Toshiba, instead of launching at 250 nm and going down to 90 nm, you could have launched at 180 nm and gone down to 65 nm.
 
Panajev2001a said:
Simon and Dio, please do not abandon the thread... discussions can be informative.

Simon F said:
Dio said:
And I know that traditionally, multiprocessing architectures are harder to program efficiently than single-processor architectures.
Having worked on a multi-cpu system (~32 cpus) in my previous job and having studied the damned things in my Honours year, I'd just like to say that Dio's comment is an understatement.

Well at least you do not have to worry about cache coherency at the APU level: there is no cache :)

The 128 KB of LS is, in effect, the main RAM of each APU: the e-DRAM and the external RAM are both lower steps in the memory hierarchy.

All the code that an APU executes has to be in the LS.

A traditional SMP approach...
The system I worked on was transputer-based, which generally meant you don't share memory (although, in fact, some of the central CPUs did; we used the communication links to pass messages around etc., and used our own software protocols/OS to make sure things were 'safe'). Most of the CPUs thus worked as a distributed system.

The applications it was targeting were raytracing, advanced 2D rendering, and video processing, which aren't too difficult to map to such a computer (although dynamically distributing sections of the database was a PITA; there was no virtual memory support on the T800). Not all applications map quite so readily.
 
Yeah, I was thinking Transputers as well, as the closest analogy. Occam will have a future, perhaps? :)

The problem with message-passing segregated memory architectures is that it's hard work to get full utilisation of all the units all the time (or even a substantial portion of the time), which can leave an awful lot of your transistors sitting idle. It's one of the many banes of all Transputer programmers.

Here's an analogy based on a real-world example I know of. Let's say you have 256 processors, to each of which you give 1/256th of the screen to raytrace. Some of the segments finish almost immediately; others take 10 seconds, most finish somewhere around 6 seconds. You work it out, and you find that you've only actually utilised 40% of your available CPU power. So you try to halve the block size, and give each unit 2 different blocks scheduled dynamically, but this makes each rendering slightly less efficient because you have finer granularity - so you find that your CPU utilisation goes up to 75%, but rendering time only drops to 8 seconds because of granularity losses (when you expected it to nearly halve).
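The arithmetic behind those utilisation figures is easy to reproduce. With a made-up distribution of per-block times (the exact numbers below are invented, chosen only to land near the 40% in the anecdote), utilisation is just total work divided by processors times wall-clock time:

```cpp
// Utilisation = sum of per-block CPU time / (processors * wall time),
// where the wall time is set by the slowest block.
#include <algorithm>
#include <cstdio>
#include <numeric>
#include <vector>

double utilisation(const std::vector<double>& block_seconds, int processors) {
    double total = std::accumulate(block_seconds.begin(), block_seconds.end(), 0.0);
    double wall  = *std::max_element(block_seconds.begin(), block_seconds.end());
    return total / (processors * wall);
}

int main() {
    // Invented distribution: many trivial blocks, most around 6 s, a few 10 s stragglers.
    std::vector<double> blocks(256, 6.0);
    std::fill(blocks.begin(), blocks.begin() + 100, 0.5);
    std::fill(blocks.end() - 16, blocks.end(), 10.0);
    std::printf("utilisation: %.0f%%\n", 100.0 * utilisation(blocks, 256)); // ~41%
}
```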

Bad utilisation is Bad News for a 'cheap' design, because the implication is that you could have got by with something with fewer transistors and therefore that much cheaper. The flip side is that the multiprocessor architecture is probably cheaper than the equivalent horsepower as a uniprocessor architecture. You pays your money and takes your pick, as they say...

That GameArchitect.net article was a great read, by the way.
 
Dio said:
Yeah, I was thinking Transputers as well as the closest analogy. Occam will have a future, perhaps? :)
If it had some data structures and recursion and ... then maybe. Working around those limitations was painful. I even started work on a "Usable Occam to Occam" compiler at my last job, but the project was killed off before I'd completed it.

The problem with message-passing segregated memory architectures is that it's hard work to get full utilisation of all the units all the time (or even a substantial portion of the time), which can leave an awful lot of your transistors sitting idle. It's one of the many banes of all Transputer programmers.
I suppose it's a matter of having load balancing and giving each CPU several jobs to work on at the same time so that if one finishes the processor has something else to do. If there was one great thing about the transputer/Occam it was the multi-thread support.

Here's an analogy based on a real-world example I know of. Let's say you have 256 processors, to each of which you give 1/256th of the screen to raytrace. Some of the segments finish almost immediately; others take 10 seconds, most finish somewhere around 6 seconds. You work it out, and you find that you've only actually utilised 40% of your available CPU power. So you try to halve the block size, and give each unit 2 different blocks scheduled dynamically, but this makes each rendering slightly less efficient because you have finer granularity - so you find that your CPU utilisation goes up to 75%, but rendering time only drops to 8 seconds because of granularity losses (when you expected it to nearly halve).
On our system we would allow frames to overlap. Latency stays the same but the throughput goes up.
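A rough back-of-the-envelope of why overlapping helps (numbers reused from the invented distribution above, so purely illustrative): with one frame in flight, every processor waits for the slowest block before the next frame starts; with frames overlapped, idle processors pick up work from the next frame and throughput moves towards total capacity divided by work per frame, while latency stays around the slowest block's time.

```cpp
// Idealised comparison of single-frame-in-flight versus overlapped frames.
#include <cstdio>

int main() {
    const double wall_per_frame = 10.0;   // slowest block, seconds
    const double work_per_frame = 1050.0; // total CPU-seconds per frame (invented, as above)
    const int    processors     = 256;

    double fps_no_overlap   = 1.0 / wall_per_frame;        // 0.10 frames/s
    double fps_with_overlap = processors / work_per_frame; // ~0.24 frames/s with ideal pipelining
    std::printf("no overlap: %.2f fps, overlapped: %.2f fps\n",
                fps_no_overlap, fps_with_overlap);
}
```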
 
Simon F said:
The problem with message-passing segregated memory architectures is that it's hard work to get full utilisation of all the units all the time (or even a substantial portion of the time), which can leave an awful lot of your transistors sitting idle. It's one of the many banes of all Transputer programmers.
I suppose it's a matter of having load balancing and giving each CPU several jobs to work on at the same time so that if one finishes the processor has something else to do. If there was one great thing about the transputer/Occam it was the multi-thread support.
I'd agree, but I'd worry that if the memory is very restrictively segmented, then getting the data for a new job to the processor can easily become the bottleneck, rather than the processing...

Then again, I'm a pessimist :)
 
I think Kroc (the only Occam compiler being somewhat actively developed) has structs, recursion and mobile data types (a limited form of references, useful for things such as linked lists).
 
I found a great Occam document that illustrates the problem I describe above clearly.

http://www.eg.bucknell.edu/~cs366/occam.html

The above pipe example is actually a silly way to compute the square root of a number. On one processor with no parallelism, the program is slow compared to the sequential algorithm because of time spent in process scheduling and in communication. A speaker at an Occam conference presented the following timing analysis of the pipeline example in the Occam Programming Manual [Inmos, 1988]:

- Time for pipe Newton-Raphson on one Transputer was 170 microseconds.
- Time for sequential Newton-Raphson on one Transputer was 60 microseconds.
- Time for pipe Newton-Raphson on 12 Transputers was 30 microseconds.

He observed, "We need to think about our designs carefully."
i.e. adding 11 additional processors gives only a factor of 2 performance increase.
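For what it's worth, the speedup arithmetic from those quoted figures:

```cpp
// Speedup and efficiency from the quoted Newton-Raphson timings.
#include <cstdio>

int main() {
    const double sequential_us = 60.0;  // sequential version, 1 transputer
    const double pipe_1_us     = 170.0; // pipelined version, 1 transputer
    const double pipe_12_us    = 30.0;  // pipelined version, 12 transputers

    double speedup    = sequential_us / pipe_12_us; // 2.0x over the sequential code
    double efficiency = speedup / 12.0;             // ~17% of the hardware usefully employed
    double overhead   = pipe_1_us / sequential_us;  // ~2.8x slower when run without parallelism

    std::printf("speedup %.1fx, efficiency %.0f%%, single-CPU overhead %.1fx\n",
                speedup, 100.0 * efficiency, overhead);
}
```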
 