New ITAGAKI interview touches on 360 and PS3 comparison

mckmas8808 said:
BS. No way that's true (smiling, hoping that it really is). Are you telling me that you don't have to dedicate a particular SPE to do physics? So why do people always say, "If EA is using 3 SPEs for graphics and the other 4 for physics, sound and AI, then the CPU is being wasted"? I don't get it. So you're saying, using your perfect example, that any SPE could be used for physics at any time? Is it smart to do it that way? So the 3 SPEs that EA was talking about might not always be the exact same SPEs, yet will just take 3 SPEs' worth of information at any given time?

mckmas8808, it's called a job queue, and it certainly can be done that way. People often talk about "reserving" an SPU for a specific task, but you certainly don't have to do that. If a task is going to occupy an SPU for the duration of a frame's processing, however, you could effectively say that it is reserved for that task. But some SPUs may well touch multiple tasks in the duration of one frame's processing.

SCEA's presentation from GDC has more on the job queue model, and others.

edit - SPE = SPU..SPU is the official name now, isn't it?
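The job-queue model described here can be sketched in a few lines. This is illustrative Python, not actual SPU code: jobs of any kind are pushed onto one shared queue, and whichever worker is free next pulls the next job, so no worker is reserved for any particular task. The job names and worker count are made up for the example.

```python
import queue
import threading

def run_jobs(jobs, num_workers=6):
    """Run a list of callables on a pool of workers via a shared queue."""
    q = queue.Queue()
    for job in jobs:
        q.put(job)
    results = []
    lock = threading.Lock()

    def worker():
        while True:
            try:
                job = q.get_nowait()
            except queue.Empty:
                return  # queue drained: this worker is done
            r = job()
            with lock:
                results.append(r)

    threads = [threading.Thread(target=worker) for _ in range(num_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results

# Mixed physics/AI/audio jobs all share the same pool of workers;
# which worker runs which job is decided only at runtime.
jobs = [lambda i=i: ("physics", i) for i in range(4)] + \
       [lambda i=i: ("ai", i) for i in range(3)] + \
       [lambda i=i: ("audio", i) for i in range(2)]
out = run_jobs(jobs)
```

So "3 SPEs' worth" of graphics work just means 3 workers' worth of jobs drained from the queue per frame, not three fixed units.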
 
I agree.

The significance of this is akin to no longer being restricted to some finite number of process threads that need to be mapped to a specific core or hardware thread; now each process thread can be further divided into segments. Now you can have any number of segments (1000's of jobs vs. just 1, 2, 6, 9 threads) that can be executed, as appropriate, on whatever resources you have available. It essentially makes all the worrying about divvying up n discrete processes (AI, physics, game code, etc.) amongst x processors in some "logical" manner moot (maybe "almost" moot, if you want to be picky ;) ).

Another way to imagine it is if you literally have 3 spools of thread and 8 sewing machines. If you impose the requirement that these threads are indivisible, then most certainly you will only be able to leverage 3 sewing machines to consume the threads. If you open it up to snipping many, many segments from these spools and then feeding them to your pool of machines, then that opens up all sorts of possibilities of using all 8 sewing machines, or 16, or 100, or n sewing machines... :oops:
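The sewing-machine analogy can be sketched concretely: one big task (here just summing an array, a stand-in for any divisible workload) is snipped into many small segments, and any number of workers can consume them. The segment size and worker counts are arbitrary choices for illustration.

```python
from concurrent.futures import ThreadPoolExecutor

def chunked_sum(data, segment_size=100, num_workers=8):
    # Snip the "spool" (one big task) into many small segments...
    segments = [data[i:i + segment_size]
                for i in range(0, len(data), segment_size)]
    # ...and let any number of "sewing machines" consume them.
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        partials = list(pool.map(sum, segments))
    return sum(partials)

data = list(range(1000))
total = chunked_sum(data, num_workers=8)  # same answer with 3, 8 or 100 workers
```

The point is that the worker count is no longer dictated by how many "threads" the problem naturally has; it scales with whatever hardware is available.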
 
Black Dragon37 said:
Does it matter if he's a game developer or not? :???:

Of course not, it's just that he seems to explain himself in such a way that it sounds like he may be. And if so, I would be curious whether he is being employed to make a current or next-generation game. No more, no less. And thanks, Titanio. I will use this information for future reference. ;)
 
SPE

mckmas8808 said:
So the 3 SPEs that EA was talking about might not always be the exact same SPEs, yet just will take 3 SPEs worth of information at any given time?

I don't know the details of EA's work, but I too heard about this distribution of tasks you talked about. Perhaps they are talking in terms of cycles, since the number of cycles per unit times the number of units is the total cycles available (roughly, in the real world), and they're using 3 units' worth of cycles for graphics tasks and will fill up the rest of the cycles with other tasks. Remember that graphics information is only needed when the game control says it is needed, so no more data is processed than necessary. SPE use is easy to manage with the PPE. IBM has much information on this on the web.
 
randycat99 said:
I agree.

The significance of this is akin to no longer being restricted to some finite number of process threads; now each thread can be further divided into segments. Now you can have any number of segments (1000's of jobs vs. just 1, 2, 6, 9 threads) that can be executed, as appropriate, on whatever resources you have available. It essentially makes all the worrying about divvying up n discrete processes (AI, physics, game code, etc.) amongst x processors in some "logical" manner moot (maybe "almost" moot, if you want to be picky ;) ).

OK, one question for you, randy. Why don't more people point that out, then? Even here it seems like that info just fell through the cracks. Usually somebody like Acert would have gone on for four paragraphs about how great and different that is from past systems.
 
This revelation came to me when somebody else posted in another topic not long ago to break my mind from the concept of "threads" (with the implication to think more in terms of "packets")... plus I had to get a biker tattoo, but anyways.

At this point, I agree it is a fine distinction that a lot of people aren't picking up. It all stems from the well-established, classic understanding that multiple processors automatically involve some number of discrete threads. There's nothing wrong with that, as that is essentially how the problem has been approached for a very long time. There is no imperative that there can only be one approach, however. It's quite possible that designing an architecture from the ground up to embrace multiprocessing inherently makes new approaches possible, whereas adapting an existing architecture to simply support multiprocessing may leave you with the classical approach as the only viable one.
 
ihamoitc2005 said:

In this instance, I was referring to direct memory access as just that: direct access to the memory from within Cell, the ability to load and store directly from the Cell i-stream to memory. Cell only supports access to main memory via a copy engine that realistically must move large chunks of memory at a time to be efficient.

Aaron Spink
 
hugo said:
Xenon isn't x86. I've never underestimated MS's software development skills, but with Sony's recent move of going with software houses such as Epic and using Nvidia's Cg tools, it shows that they are indeed making efforts to make their console easier to develop for this coming gen.

Nor did I say it was x86. The actual instruction set being used is a secondary issue to the overarching programming model. The whole of the mainstream of the computing industry is moving towards a model that is roughly the same as the X360's, which will reap significant rewards. The network processor industry is the closest thing to Cell, and it has been riddled with frustrations, bugs, and performance issues due to the complex programming models.



Aaron Spink
speaking for myself inc.
 
Nemo80 said:
On Cell, on the other hand, you have the same problems when looking only at the PPE (although cache is a little less of a problem, since there is more per thread than Xenon has). The big difference, however, is that the SPE model is not an SMP/T one at all. It can be thought of as something like a master-slave relationship, where the master (PPE) delivers tasks to the individual SPEs, much like simply calling a subroutine. Only that the subroutine is running in an ultra-fast SPE instead of a GP core. This way there is much less synchronization work needed, and since each SPE is independent of the others, it's also highly unlikely that "cache" stalls can occur (also since SPEs don't have any cache)...

Gee, you just rediscovered the producer-consumer model which has been employed on SMP systems in the past.

Suffice to say that this model is certainly going to be a common one in game engines for X360. You kick off a bunch of threads at the start of the game loop to compute geometry, physics, etc for the next frame, and then gather the results together when they are done.

I would suggest that you read up on the various programming models in use today before making any more comments.

Aaron Spink
speaking for myself inc.
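The fork/join game loop described above can be sketched as follows. This is illustrative Python, not engine code, and the task functions (`compute_geometry`, etc.) are made-up placeholders: each frame kicks off its tasks in parallel, then gathers the results when all are done.

```python
from concurrent.futures import ThreadPoolExecutor

# Placeholder tasks standing in for real per-frame work.
def compute_geometry(frame): return ("geometry", frame)
def compute_physics(frame):  return ("physics", frame)
def compute_ai(frame):       return ("ai", frame)

def run_frame(pool, frame):
    # Fork: kick off every task for this frame at once.
    futures = [pool.submit(task, frame)
               for task in (compute_geometry, compute_physics, compute_ai)]
    # Join: block until every task for this frame has finished.
    return [f.result() for f in futures]

with ThreadPoolExecutor(max_workers=3) as pool:
    results = [run_frame(pool, frame) for frame in range(3)]
```

This is the classic producer-consumer / fork-join shape, and it works the same whether the workers are homogeneous cores or SPEs fed by a PPE.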
 
I posted an explanation but can't think where. There's different programming models that you can use on Cell (and different memory access models too).

One is the traditional 1 thread = 1 job model, which people gravitate towards. They talk of 2 threads for this, 2 for that, 1 for so and so. The problem with this model is inefficiency. When an SPE has finished its job, it'll just be waiting around doing nothing. Another problem is data dependency, which I'll explain in a bit.

Another model is distributed computing, processing one task across multiple elements.

So say we have to process rigid body physics, AI, sound, fluid dynamics and texture synthesis on the SPEs. The traditional model might be...

SPEs 1&2 = Rigid body physics
SPE 3 = AI
SPE 4 = sound
SPE 5&6 = fluid dynamics
SPE 7 = texture synthesis

Now consider SPE usage over 15 ms (about 1/60th of a second). If sound for the frame is complete in 4 ms, you've got SPE 4 sitting idle for 11 ms. And if your rigid body physics is getting complicated, maybe 2 SPEs won't manage to fulfil it in time. We also have a problem with dependency. AI needs to react to things that are happening, which might be dependent on physics. Audio too needs to know where objects are and where they're colliding. You can't really calculate the audio until after the physics. The distributed computing model might work like this (BTW: my use of terminology is pretty manky. The official term isn't distributed computing, but I can't remember what it is. However, in this context I just mean distributing the process over several computation devices.)

Time Process
0 ms SPEs 1-7 process fluid dynamics including objects on surface of water
2 ms SPEs 1-7 process rigid body physics
8 ms SPEs 1-7 calculate AI
11 ms SPEs 1-6 generate textures, SPE 7 generates audio
13 ms SPEs 1-4 work on post processing effects, SPE 5 processes audio encoding

This way there's less wastage and more flexibility. It's also scalable, which ties in with one of the original concepts for Cell. If your tasks can be compartmentalised, you can spread the workload over as many SPEs as the system has. And if you attach more SPEs by networking up with another Cell device, it can share in the workload.

Context switching has negligible overhead when switching tasks, as long as you aren't switching tasks frequently. Where a PC CPU has to switch between potentially dozens of processes, an SPE doesn't. It can be left to finish the job. If you have an SPE working on several different tasks and switching between them frequently, you're not making the most of the SPE.
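The time-sliced schedule above can be sketched like this. It's an illustrative Python sketch, not SPE code: instead of dedicating workers to tasks, all seven workers move together through the stages, each stage split into per-worker slices, with an implicit barrier between stages. The stage names follow the post; the per-slice work is a placeholder.

```python
from concurrent.futures import ThreadPoolExecutor

STAGES = ["fluid dynamics", "rigid body physics", "AI",
          "texture synthesis / audio", "post processing / audio encoding"]

def process_slice(stage, worker_id):
    # Placeholder for the real per-SPE slice of work in this stage.
    return (stage, worker_id)

log = []
with ThreadPoolExecutor(max_workers=7) as pool:
    for stage in STAGES:
        # All 7 workers are thrown at the current stage; pool.map
        # returning acts as the barrier before the next stage begins,
        # which is also where the dependency ordering (physics before
        # audio, etc.) is enforced.
        log.extend(pool.map(lambda w, s=stage: process_slice(s, w), range(7)))
```

Each stage finishes as fast as seven workers can manage, rather than one task idling a dedicated worker while another task overruns.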
 
inefficient said:
I believe you are wrong. You're thinking in classic SMP/T terms. And like Nemo80 hinted, the correct way to look at SPE programming is not like this. The key advantages Cell has here are the DMA memory access model and the fact that each SPE has a local store. In the Cell programming model you would set up a DMA on the SPE and then let it execute/read/write in its own private area.

And then it has to utilize the copy engine to move that data back to memory accessible by the actual CPU, and then synchronize with the CPU. It's the same damn model, just with extra steps and complications. It's no different from having a multi-DSP card in a PC which you copy the data to, tell it to perform the calculations, copy the data back from, and at some point synchronize with the main processor.

In the end, it is a producer-consumer model with added complexity.

Aaron Spink
speaking for myself inc.
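The pattern both sides are describing (DMA in, compute in the local store, DMA out, then synchronize) can be sketched as follows. This is an illustrative Python simulation, not Cell SDK code; the lists, sizes, and function names are all stand-ins: the worker never touches "main memory" directly, only its small private buffer.

```python
MAIN_MEMORY = list(range(32))   # stands in for shared main memory
LOCAL_STORE_SIZE = 8            # the SPE's small private memory

def dma_get(offset, size):
    # "Copy engine" pulls a chunk of main memory into the local store.
    return MAIN_MEMORY[offset:offset + size]

def dma_put(offset, data):
    # "Copy engine" pushes the result back out to main memory.
    MAIN_MEMORY[offset:offset + len(data)] = data

def spe_kernel(offset):
    local = dma_get(offset, LOCAL_STORE_SIZE)  # DMA into local store
    local = [x * 2 for x in local]             # compute entirely locally
    dma_put(offset, local)                     # DMA result back out

for offset in range(0, len(MAIN_MEMORY), LOCAL_STORE_SIZE):
    spe_kernel(offset)  # the "synchronize with the CPU" step is implicit here
```

Whether one calls this "producer-consumer with extra steps" or "the Cell programming model," the mechanics are the same: explicit bulk transfers bracketing purely local computation.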
 
aaronspink said:
Nor did I say it was x86. The actual instruction set being used is a secondary issue to the overarching programming model. The whole of the mainstream of the computing industry is moving towards a model that is roughly the same as the x360 which will reap significant rewards.
Though aren't processor roadmaps heading towards a Cell-like design? We've got SMP cores for now, but some years down the line Intel will be introducing a Cell-like structure of core(s) + synergistic processing units. This change looks set to come sooner or later whether programmers like it or not, no?
 
Panajev2001a said:
Says you and others... but not everyone dislikes it :).

Given the choice between an architecture with DMA movement engines, or DMA movement engines along with direct access, 9 out of 10 good programmers will prefer the latter. The 10th was just in a car accident and suffered massive brain damage.

Aaron Spink
speaking for myself inc.
 
Shifty Geezer said:
Context switching has negligible overhead when switching tasks, as long as you aren't switching tasks frequently. Where a PC CPU has to switch between potentially dozens of processes, an SPE doesn't. It can be left to finish the job. If you have an SPE working on several different tasks and switching between them frequently, you're not making the most of the SPE.

Yeah, it should be stressed that the switch would only happen when the task is finished, not on regular blocking conditions. The idea with an SPU is to avoid blocking conditions ;)

aaronspink said:
The 10th was just in a car accident and suffered massive brain damage.

Necessary? No. Rude? Yes. Well done.
 
aaronspink said:
In this instance, I was referring to direct memory access as just that: direct access to the memory from within Cell, the ability to load and store directly from the Cell i-stream to memory. Cell only supports access to main memory via a copy engine that realistically must move large chunks of memory at a time to be efficient.

Aaron Spink

True, but this is a very efficient setup. Main memory is very slow and far away; the last thing the execution unit wants to do is waste cycles dealing with it directly for every little thing. This is why CPUs have caches. The cost of a cache miss is massive, even with low-latency XDR.
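The standard way this copy-engine setup hides main-memory latency is double buffering, which can be sketched as follows. This is an illustrative Python simulation (the DMA here is synchronous, and all names and sizes are made up): while one buffer is being computed on, the copy engine would already be fetching the next chunk into the other buffer.

```python
MAIN = list(range(24))   # stands in for main memory
CHUNK = 8                # transfer size per DMA

def dma_get(offset):
    # Simulated copy-engine fetch; on real hardware this would be
    # asynchronous and overlap with computation.
    return MAIN[offset:offset + CHUNK]

results = []
buffers = [dma_get(0), None]        # prime buffer 0 before the loop
for i, offset in enumerate(range(0, len(MAIN), CHUNK)):
    nxt = offset + CHUNK
    if nxt < len(MAIN):
        buffers[(i + 1) % 2] = dma_get(nxt)        # prefetch next chunk
    results.extend(x + 1 for x in buffers[i % 2])  # compute on current chunk
```

With the transfer overlapped like this, the execution unit keeps working on data already in the local store instead of stalling on main memory.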
 
Shifty Geezer said:
Though aren't processor roadmaps heading towards a Cell-like design? We've got SMP cores for now, but some years down the line Intel will be introducing a Cell-like structure of core(s) + synergistic processing units. This change looks set to come sooner or later whether programmers like it or not, no?

To my knowledge, neither Intel nor AMD has published any roadmaps or intentions to develop anything like Cell. From their published roadmaps, both Intel and AMD appear to be going down the path of multiple homogeneous processors on a die.

Aaron Spink
speaking for myself inc.
 
Necessary? No. Rude? Yes. Well done.

Not rude, it's called humor. I certainly could have said 10 out of 10, but what I said was certainly more humorous. Given the choice, no programmer is going to turn down the option of having direct access to main memory. While there is an advantage to copy engines and private memory, there is a significant disadvantage to giving up direct access.

Aaron Spink
speaking for myself inc.
 
aaronspink said:
Given the choice between an architecture with DMA movement engines, or DMA movement engines along with direct access, 9 out of 10 good programmers will prefer the latter. The 10th was just in a car accident and suffered massive brain damage.

Aaron Spink
speaking for myself inc.

The more processing cores you have working on small bits of data, all wanting direct access, the more the brain-damaged guy turns out to be the smartest one, no? :)
 