A general question: what game code will SPEs be useful for?

Titanio

On the hardware side, I have a pretty good understanding of what Cell is all about and how it works, but I feel I'm lacking some understanding of how the puzzle will come together on the software side, and I haven't seen much discussion of this, so...what, specifically, will harness the SPEs in Cell? In terms of games?

We're told it's a wonderful chip for games consoles, if nothing else. We're told the SPEs are "general" processors. So what kind of game code will be "accelerated" (for want of a better word) by finding a home on an SPE?

I'm thinking anything that relies heavily on floating point performance, but in a game, what would that be?

Graphics aside, I'm guessing Physics, but what else?

I guess I'm asking about Cell programming models: what is needed to maximise performance, and what kind of game code lends itself to such models? For example, do you need to use a streaming/pipeline model to maximise performance, or can the SPEs be kept just as busy with more general multitasking (that utilises all the SPEs for different tasks)?
 
Small coherent datasets that you can stream would run well. Anything that runs well on a DSP. Vertex shading (but not vertex texturing) would run well...basically vector maths with small datasets and minimum random access to memory.
 
Jaws said:
Small coherent datasets that you can stream would run well. Anything that runs well on a DSP. Vertex shading (but not vertex texturing) would run well...basically vector maths with small datasets and minimum random access to memory.

Thanks, I've read a lot of this before, but I haven't seen it translated into what that actually means for different types of game code - i.e. what aspects of a game will and won't benefit from SPE usage. Maybe it's a very silly question, I don't know, but I guess I'm looking for a more concrete relationship between these general ideas and specific types of game code. So we've got vertex processing, which may or may not be relevant depending on the GPU, and physics (I guess?), but what else?

I guess I'm also asking - is the distribution of logic across dedicated floating point monsters like the SPEs versus more general, "regular" cores like the PPE justified in a games context?
 
Titanio said:
Jaws said:
Small coherent datasets that you can stream would run well. Anything that runs well on a DSP. Vertex shading (but not vertex texturing) would run well...basically vector maths with small datasets and minimum random access to memory.

Thanks, I've read a lot of this before, but I haven't seen it translated into what that actually means for different types of game code - i.e. what aspects of a game will and won't benefit from SPE usage. Maybe it's a very silly question, I don't know, but I guess I'm looking for a more concrete relationship between these general ideas and specific types of game code. So we've got vertex processing, which may or may not be relevant depending on the GPU, and physics (I guess?), but what else?

I guess I'm also asking - is the distribution of logic across dedicated floating point monsters like the SPEs versus more general, "regular" cores like the PPE justified in a games context?

I don't think you're going to get any sort of useful answer.
You will be able to target almost any code at an SPE, albeit with varying degrees of efficiency (or lack thereof) and programming effort.

In terms of which is the better approach, we probably won't know until the 2nd or 3rd generation of games roll around simply because devs have little experience with either.
 
When we are talking about an SPE, it is essentially a simple, general-purpose PPC with a SIMD unit. So it's a bit of a stretch, but you could say that if it ran well on a GC (the CPU, more specifically), it will run just fine on an SPE.
 
randycat99 said:
When we are talking about an SPE, it is essentially a simple, general-purpose PPC with a SIMD unit. So it's a bit of a stretch, but you could say that if it ran well on a GC (the CPU, more specifically), it will run just fine on an SPE.

Haven't seen the instruction set for the SPE discussed anywhere, but it's probably simpler than a PPC's, and it has limited local memory and needs to use DMA to access main memory.
 
ERP said:
randycat99 said:
When we are talking about an SPE, it is essentially a simple, general-purpose PPC with a SIMD unit. So it's a bit of a stretch, but you could say that if it ran well on a GC (the CPU, more specifically), it will run just fine on an SPE.

Haven't seen the instruction set for the SPE discussed anywhere, but it's probably simpler than a PPC's, and it has limited local memory and needs to use DMA to access main memory.

I get how you could program it and have the DMA be handled basically automatically (coming at a performance disadvantage compared to very well optimized hand-written code, of course): in a way it is how you handle EWRAM and IWRAM on the GBA. There is a keyword called "far" you can declare variables with, and something tells me that if you have a "far" variable in SPE code, it would mean the variable lives outside Local Storage (in main RAM... in the Virtual Memory address space outside the SPE's Local Storage).
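Purely to illustrate the idea, here's a sketch with the "far" qualifier stubbed out as a macro; nothing like this is confirmed for any Cell toolchain, it's just what I imagine it could look like:

```c
/* Hypothetical sketch only: a GBA-style "far" qualifier marking data
   that lives outside the SPE's Local Store, with the compiler emitting
   DMA behind the accesses. No such keyword is confirmed for Cell, so
   it is stubbed out as an empty macro just to make the idea concrete. */
#define far /* imaginary storage qualifier */

far float bone_palette[4096];   /* would reside in main memory           */
float     work_buffer[1024];    /* resides in the SPE's 256K Local Store */

void touch_one(void)
{
    /* The compiler would hide a DMA fetch behind this access; hand-
       scheduled transfers would still beat it on performance. */
    work_buffer[0] = bone_palette[0];
}
```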

There are small things that tell me that CELL was not designed by idiots ;), I am hopeful :).
 
In order for the SPEs to work well, you need to break your tasks into small, self-contained chunks (around 64k each) of independent data that each require a lot of math. What can you do with 64k?

• Transform 1000 vertices (1 vertex with position, normal, tangent and 2 UVs counts as 4 channels @ 16 bytes per channel decompressed to local storage)
• Animate 1000 particles (assuming parametric motion with data requirements equivalent to the above vertices)
• Decode 64k of JPEG, MPEG or MP3
• Encode 1/10th of a second of 48kHz 5.1 audio into Dolby Digital or DTS
• Interpolate 500 pairs of position-rotation-scale animation keys into matrices (animated bones)
• Do collision detection on 500 pairs of oriented bounding boxes or 2000 pairs of spheres
• Do simple cloth or water simulation on a 32x32 grid
• Apply forces and constraints to a whole lotta vehicles & ragdolls


The key point to keep in mind is that the SPEs read and write in 16K chunks when accessing main memory. Each chunk takes a lot of time to move in and out of local storage. It will probably take as much time or more to move the data as it does to do the floating point work! That means you can't take general purpose code (which almost always reads 4 bytes at a time from scattered locations) and expect to run it on the SPEs at non-laughable speed.

The 64K chunk size I propose above is based on the probable arrangement of dividing the 256k local storage into 4 sections: one for the code, permanent data and scratch space; one for the incoming DMA; one for the outgoing DMA; and one for the data that is currently being processed while waiting for the incoming and outgoing DMAs to complete.
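To make that four-way split concrete, here's a rough sketch of a double-buffered streaming loop. I'm assuming the mfc_get/tag-wait names from IBM's SPU C extensions (<spu_mfcio.h>); the structure is the point, not the exact API:

```c
/* Double-buffered streaming on an SPE: while one 16K chunk is being
   crunched, the next one is already in flight. The mfc_* intrinsic
   names are assumed from IBM's SPU C extensions. */
#include <spu_mfcio.h>

#define CHUNK (16 * 1024)
static char buf[2][CHUNK] __attribute__((aligned(128)));

extern void crunch(char *data, int size);     /* the actual math kernel */

void stream(unsigned long long ea, int nchunks)
{
    int cur = 0;
    mfc_get(buf[cur], ea, CHUNK, cur, 0, 0);           /* prime buffer 0 */

    for (int i = 0; i < nchunks; i++) {
        int next = cur ^ 1;
        if (i + 1 < nchunks)                           /* prefetch ahead */
            mfc_get(buf[next], ea + (unsigned long long)(i + 1) * CHUNK,
                    CHUNK, next, 0, 0);

        mfc_write_tag_mask(1 << cur);         /* wait only for the chunk */
        mfc_read_tag_status_all();            /* we're about to process  */

        crunch(buf[cur], CHUNK);       /* compute while next chunk flies */
        cur = next;
    }
}
```

The wait only blocks on the buffer we're about to use, so the transfer of the next chunk overlaps the math on the current one.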
 
corysama said:
What can you do with 64k?
The question is... why 64k? Is it just an example, or do you believe 64k is a sweet spot for some reason?
corysama said:
The key point to keep in mind is that the SPEs read and write in 16K chunks when accessing main memory.
I thought 16k is the max amount of data you can transfer in/out of local store from the SPE. That doesn't mean SPEs move data 16k at a time; that would be complete nonsense. I thought the minimum amount of data is something from 16 to 128 bytes.
corysama said:
Each chunk takes a lot of time to move in and out of local storage.
Why? Once the transaction has started, if there is enough bandwidth to spare with the other SPEs, it should be quite fast!
corysama said:
It will probably take as much time or more to move the data as it does to do the floating point work!
Umh?! This could be true for some kinds of processing, but it's not a general rule.

ciao,
Marco
 
[Image: kaigai082.jpg, a slide suggesting Cell won't require assembly programming]


Are you masochistic devs still intent on assembly programming on CELL, especially when this slide says otherwise? Or is this slide for the 'Lorne Lannings' of this world? :devilish:
 
nAo said:
corysama said:
What can you do with 64k?
The question is... why 64k? Is it just an example, or do you believe 64k is a sweet spot for some reason?
I believe 64k is a sweet spot for the reason explained in the last paragraph of my first post. The point of the specific examples was to show that you can do significant tasks in self-contained chunks of data without requiring scattered reads from main memory.

nAo said:
corysama said:
The key point to keep in mind is that the SPEs read and write in 16K chunks when accessing main memory.
I thought 16k is the max amount of data you can transfer in/out of local store from the SPE. That doesn't mean SPEs move data 16k at a time; that would be complete nonsense. I thought the minimum amount of data is something from 16 to 128 bytes.
I think you are right and I was mistaken, but you will probably want to move data in as large of chunks as you can for the reasons I detail later in this post.

nAo said:
corysama said:
Each chunk takes a lot of time to move in and out of local storage.
Why? Once the transaction has started, if there is enough bandwidth to spare with the other SPEs, it should be quite fast!
The problem is not the bandwidth. Bandwidth is cheap and easy. That's why the new machines are bursting with bandwidth. The problem is latency.
The time between when you make a request and when the first byte arrives is going to be large. After the first byte, the rest flood in like hell wouldn't have it. But if your request is for only 128 bytes, you are going to spend a long time waiting and a short time actually receiving data. If all of your requests are small, you are going to spend almost all of your time waiting.
Using the bandwidth of modern machines is like using a freight train. Just because a train can move 100 tons 1000 miles in a day doesn't mean the same train can deliver 100 one ton packages to 100 locations each 10 miles apart in a single day. It would spend nearly all of its time starting and stopping. No one thinks about that when they are drooling over bandwidth specs.
Reducing latency is difficult and expensive. It is easier to hide than it is to solve. The primary goal of all this parallelism is to hide latency behind deeply pipelined processes. The way you do that in a system like the SPEs is you issue a large read request from main memory before you start processing the large amount of data you already have. You need to have enough to do to keep yourself busy while you wait for the new data to start to arrive. If you get done early, there's nothing to do but sit and spin.
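Some back-of-envelope numbers make the point. The 1000-cycle latency at 3.2 GHz and 25.6 GB/s peak below are illustrative assumptions, not confirmed Cell figures:

```c
/* Effective bandwidth vs. request size under a fixed startup latency.
   The latency and peak numbers are illustrative assumptions only. */
#include <stdio.h>

int main(void)
{
    const double clock_hz  = 3.2e9;                 /* assumed core clock  */
    const double peak_bps  = 25.6e9;                /* assumed peak rate   */
    const double latency_s = 1000.0 / clock_hz;     /* ~312 ns per request */

    for (int size = 128; size <= 16384; size *= 2) {
        double transfer_s = size / peak_bps;
        double effective  = size / (latency_s + transfer_s);
        printf("%6d-byte requests: %5.2f GB/s effective\n",
               size, effective / 1e9);
    }
    return 0;
}
```

With those numbers, 128-byte requests deliver around 0.4 GB/s while 16K requests recover about two-thirds of peak. The freight train only pays off when you fill the cars.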

nAo said:
corysama said:
It will probably take as much time or more to move the data as it does to do the floating point work!
Umh?! This could be true for some kinds of processing, but it's not a general rule.
It depends on the computational density of your data and how well you manage the latency.

I'll admit up front that I don't have numbers on how bad the latency will be for SPE read requests from main memory. If anyone does know, please share! I do know that the latency for current PC CPUs to go all the way to RAM (full cache miss) is getting into the hundreds of cycles. A 400 MHz memory bus with that kind of latency behaves more like a 25 MHz one. That means that if you don't stay in cache then you are working at less than 1% capacity. The latency of the SPE's DMA system is certainly going to be larger than that of CPU access. If the SPEs are going to run at 4+GHz, then the latency for them to hit main RAM is likely to be many hundreds of cycles.
 
Jaws said:
Are you masochistic devs still intent on assembly programming on CELL, especially when this slide says otherwise? Or is this slide for the 'Lorne Lannings' of this world? :devilish:

It should be possible to use the intrinsics effectively from C++. That way you can let the compiler sort out the register lifetimes and instruction latencies for you. The code will still need to be very carefully structured around the hardware's strengths and weaknesses though. Don't try to use many calls to new/delete! ;)
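For flavour, here's roughly what intrinsic-level C looks like. I'm assuming the spu_splats/spu_madd names from IBM's SPU intrinsics (<spu_intrinsics.h>):

```c
/* Intrinsic-level SPE code sketch: y = a*x + y, four floats per fused
   multiply-add. Intrinsic names assumed from IBM's <spu_intrinsics.h>. */
#include <spu_intrinsics.h>

void saxpy_quads(vector float *y, const vector float *x,
                 float a, int nquads)
{
    vector float va = spu_splats(a);        /* broadcast a to all lanes */
    for (int i = 0; i < nquads; i++)
        y[i] = spu_madd(va, x[i], y[i]);    /* fused multiply-add       */
}
```

The part the compiler can't do for you is keeping x and y resident in local storage and laid out vector-friendly; that's where the restructuring effort goes.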
 
corysama said:
I think you are right and I was mistaken, but you will probably want to move data in as large of chunks as you can for the reasons I detail later in this post.
Obviously it's better to move data in 'large' chunks, as that holds true for almost everything out there.

corysama said:
The time between when you make a request and when the first byte arrives is going to be large. After the first byte, the rest flood in like hell wouldn't have it. But if your request is for only 128 bytes, you are going to spend a long time waiting and a short time actually receiving data. If all of your requests are small, you are going to spend almost all of your time waiting.
I know that, but that's why most of the time we use double buffers :)
For streaming processing it shouldn't be a problem at all...
Things are different with general purpose code.

corysama said:
It would spend nearly all of its time starting and stopping. No one thinks about that when they are drooling over bandwidth specs.
That's why SPEs are called streaming processors ;)

corysama said:
The way you do that in a system like the SPEs is you issue a large read request from main memory before you start processing the large amount of data you already have. You need to have enough to do to keep yourself busy while you wait for the new data to start to arrive. If you get done early, there's nothing to do but sit and spin.
It depends on the computational density of your data and how well you manage the latency.
If the SPEs are going to run at 4+GHz, then the latency for them to hit main RAM is likely to be many hundreds of cycles.
Latency from external RAM (but we should remember that SPEs can consume data from the L2) is about 1000 cycles.
A very simple vertex shader would take at least 10 clock cycles per vertex. Even with 64-byte vertices you'd need just a 6 KB buffer to cover the initial transfer latency (1000 cycles / 10 cycles per vertex = 100 vertices in flight, and 100 x 64 bytes is about 6 KB).
Two 16 KB buffers would be enough for most stuff in real-world (streaming) applications.
Things are way more complex with code that needs a lot of random reads or writes, but that is where the L2, prefetch and 16 outstanding DMA requests per SPE may help :)
General-purpose code compiled as-is will run like a pig; once fine-tuned it could run decently.

ciao,
Marco
 
Is there enough here for us to concur that this "issue" has come about from someone trying to apply a "little pipe, big tank" computing practice to an architecture clearly intended for "big pipe, little tank" techniques?
 
nAo said:
Latency from external RAM (but we should remember that SPEs can consume data from the L2) is about 1000 cycles.
Thanks for the tip!
How are the L2 reads set up? I can guess at a few different schemes, but I don't recall reading any descriptions. It could be:

• A: Fully automatic. If the SPE requests a read from an address that happens to be in cache, then it will read from the cache. That would seem a little loose and difficult to control.
• B: Semi-manual from the SPE. The SPE could request prefetches into cache before requesting a DMA. That would create an effectively two-stage fetch. It would also really screw the CPU out of its much-needed L2...
• C: Manual from the CPU. The CPU could explicitly lock some L2 to turn it into scratchpad memory, then copy into the locked region and kick off an apulet that knows to DMA from the scratchpad. Still costs L2, but at least it's explicit.
But I'm just making stuff up. Do you have any info?
nAo said:
A very simple vertex shader would take at least 10 clock cycles per vertex. Even with 64-byte vertices you'd need just a 6 KB buffer to cover the initial transfer latency.

Do you know how packed data (< 16 byte structures) is handled? Is there something like the VIF to spread the data out into a quadword of memory, or can the SPE read a < 16 byte structure and unpack it into a vector register? Without some form of unpacker, we would have to either leave everything as uncompressed floats (sucking main memory and DMA bandwidth) or unpack everything manually one component at a time (sucking SPE cycles).
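To show what the manual path costs, here's a sketch of element-at-a-time unpacking. The spu_extract/spu_insert/spu_convtf names are assumed from IBM's SPU intrinsics, and the 8.8 fixed-point vertex format is invented for the example:

```c
/* Manual unpack of packed 16-bit vertex components into a float vector,
   one element at a time -- the "sucking SPE cycles" option. Intrinsic
   names assumed; the 8.8 fixed-point format is made up for illustration. */
#include <spu_intrinsics.h>

vector float unpack_position(vector signed short packed)
{
    vector signed int vi = spu_splats(0);

    /* Copy the first four 16-bit components into one 32-bit lane each. */
    for (int i = 0; i < 4; i++)
        vi = spu_insert((int)spu_extract(packed, i), vi, i);

    /* Convert to float, dividing by 2^8 for the assumed 8.8 format. */
    return spu_convtf(vi, 8);
}
```

Four extract/insert round trips per vector is exactly the kind of cycle tax a VIF-style unpacker would avoid.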

nAo said:
Things are way more complex with code that needs a lot of random reads or writes, but that is where the L2, prefetch and 16 outstanding DMA requests per SPE may help :)
General-purpose code compiled as-is will run like a pig; once fine-tuned it could run decently.

Yeah, I'm mainly trying to prevent people from thinking "I can program SPEs in C++" means "I can recompile my old code and it will run at 1 teraflop!" It sounds like you've already got a handle on that. :) Though I think the process of "fine tuning" is going to be more along the lines of "rewrite completely to work streaming-style instead of CPU-style."
 