New ITAGAKI interview touches on 360 and PS3 comparison

SPEs to LS

It's called the Element Interconnect Bus.

http://www-306.ibm.com/chips/techlib/techlib.nsf/techdocs/D9439D04EA9B080B87256FC00075CC2D/$file/MPR-Cell-details-article-021405.pdf
I don't see anything in there that says that SPEs can directly access the Local Store of other SPEs.

The relevant quote in that document may be:
Similarly, another SPE can use the DMA controller to move data to an address range that is mapped onto a local store of another SPE or even to itself.
Which is, of course, not direct access.

I think the biggest problem is that SPEs do not mesh well with current popular programming paradigms. Ideally, you would want to load a small function/executable/whatever into a small part of the local store and then churn through a ton of data while issuing new DMAs. Perhaps ask for 96K at a time: issue a request, work on the other 96K while waiting, then swap and issue another 96K request. But then your data needs to be laid out in sequential memory blocks. You certainly don't want the data you need to work on intermixed with data that has no use in the current thread, or spread all over the memory space.
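A minimal sketch of that double-buffered pattern, as I understand it (single-threaded C, with memcpy standing in for the asynchronous mfc_get DMA; the 96K chunk is shrunk to 96 bytes, and all names here are illustrative, not real Cell APIs):

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define CHUNK 96 /* stand-in for the 96K chunk size above, shrunk to 96 bytes */

/* "Process" one chunk: here, just sum the bytes. */
static uint64_t process_chunk(const uint8_t *buf, size_t n)
{
    uint64_t sum = 0;
    for (size_t i = 0; i < n; i++)
        sum += buf[i];
    return sum;
}

/* Stream over `total` bytes of sequential "main memory" with two local
 * buffers: fetch the next chunk, work on the current one, swap. On real
 * hardware the fetch would be an asynchronous DMA overlapped with the
 * compute; memcpy() stands in for it here. */
uint64_t stream_sum(const uint8_t *main_mem, size_t total)
{
    uint8_t local[2][CHUNK];
    uint64_t sum = 0;
    int cur = 0;

    if (total == 0)
        return 0;

    size_t n = total < CHUNK ? total : CHUNK;
    memcpy(local[cur], main_mem, n); /* initial "DMA" into buffer 0 */
    size_t off = n;

    while (n > 0) {
        size_t left = total - off;
        size_t next_n = left < CHUNK ? left : CHUNK;
        if (next_n > 0) /* issue the "DMA" for the next chunk first... */
            memcpy(local[cur ^ 1], main_mem + off, next_n);
        sum += process_chunk(local[cur], n); /* ...then work on the current one */
        off += next_n;
        cur ^= 1;
        n = next_n;
    }
    return sum;
}
```

The point of the shape: the copy for chunk k+1 is issued before the work on chunk k, which only pays off if the data really is in sequential blocks.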

I like using object oriented programming techniques myself. And the hoops I would have to jump through to get my data into the proper format do not sound fun, but then I haven't done any graphics or physics work worth speaking about. Certainly you won't want to be using anything that has dynamic memory needs, but then I assume that is avoided anyway in the console space, as allocating memory tends to be expensive.

At any rate, you won't be putting normal threads on the SPEs without losing a lot of performance. It sounds like you will need a significantly different approach to get the most out of Cell; luckily, that approach should also work well on the XeCPU and PCs. But going the other way is really not an option.
 
ihamoitc2005 said:
Why should it fit in SPE local-store when it can fit in PPE L2 Cache as I described?

Well, then the SPEs are useless, no? The issue is that any given SPE can't work on a data structure that is larger than 256KB (and in the real world it's actually much smaller: due to the setup overhead and latency of the DMA engine, you'll likely be limited to a real data set of ~64KB at a time, to allow upload and download of the previous data set).



OTOH, 6 threads on 384KB cache block on just 1MB of cache is pure hypothetical and not realistic, since cache-block x threads is effectively 225% of physical cache. Even a 256KB cache block is very risky for 6 threads with only 1MB cache, since 150% of physical capacity is risky. Too much cache-block x threads over physical memory is a disaster waiting to happen. Cache miss after cache miss will be the result. Maybe if the developer finds an opportunity, 6 threads at 192KB is possible for an effective 2MB total on very rare occasions. Still pushing chances and not recommended if consistent frame-rate matters.

You apparently don't understand the issues surrounding constructive interference. It is possible, and for some data structures likely, that a large number of the threads will be referencing the same data at the same time. In these cases that 256KB or 384KB dataset, for instance, is shared among the threads within the cache, resulting in an effective cache size much greater than the actual cache size. This is real and does happen.
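For what it's worth, the effective-size arithmetic being described can be sketched in C (the formula is my reading of the argument, counting one physical copy of the shared structure that serves every thread; it is illustrative, not from any Cell/Xenon document):

```c
/* Effective cache size when `nthreads` threads all reference one
 * `shared_kb` structure held once in a `cache_kb` cache: the structure
 * occupies shared_kb physically but serves every thread, so each extra
 * thread adds shared_kb of "effective" capacity. Illustrative only. */
int effective_cache_kb(int cache_kb, int shared_kb, int nthreads)
{
    return shared_kb * nthreads + cache_kb - shared_kb;
}
```

With Xenon-like numbers (1MB L2, a 384KB shared structure, 6 threads) this gives 2944KB, roughly the ~3MB figure quoted later in the thread.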

Aaron Spink
speaking for myself inc.
 
hugo said:
Yes there is. The process is called stream processing

Which is all fine and good, but the mechanism that allows this on CELL requires using the DMA copy engine to copy a portion of one SPE's local store to the local store of another SPE. It doesn't happen auto-magically, or at least there is nothing in the CELL documentation that would allow it to happen auto-magically.

The actual process involves aliasing a portion of one SPE's local store into the global address map and then initiating a DMA copy from another SPE into the mapped global address range, which is translated into the first SPE's local store.
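A toy model of that path (the base address, sizes, and function names are invented for illustration; this is not the real Cell memory map or MFC interface): each SPE's local store is aliased at a fixed effective address, and a DMA "get" from that range resolves back to the owning store.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define NUM_SPES 8
#define LS_SIZE  256                 /* stand-in for 256KB */
#define LS_ALIAS_BASE 0x40000000ULL  /* hypothetical alias region */

uint8_t local_store[NUM_SPES][LS_SIZE];

/* Effective address of byte `off` inside SPE `spe`'s aliased local store. */
uint64_t ls_ea(int spe, size_t off)
{
    return LS_ALIAS_BASE + (uint64_t)spe * LS_SIZE + off;
}

/* DMA "get": copy `n` bytes at effective address `ea` into SPE `dst_spe`'s
 * local store at `dst_off`. The EA is translated back to the owning SPE's
 * local store; it is still an explicit copy, never a direct load/store.
 * Returns 0, or -1 if the EA range is outside the aliased region.
 * (Offsets are assumed in range for brevity.) */
int dma_get(int dst_spe, size_t dst_off, uint64_t ea, size_t n)
{
    if (ea < LS_ALIAS_BASE ||
        ea + n > LS_ALIAS_BASE + (uint64_t)NUM_SPES * LS_SIZE)
        return -1;
    uint64_t rel = ea - LS_ALIAS_BASE;
    int src_spe = (int)(rel / LS_SIZE);
    size_t src_off = (size_t)(rel % LS_SIZE);
    memcpy(&local_store[dst_spe][dst_off], &local_store[src_spe][src_off], n);
    return 0;
}
```

Note the two-step nature: the alias makes the source addressable, but the data still moves only through the explicit copy.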

Aaron Spink
speaking for myself inc.
 
ihamoitc2005 said:
It's called the Element Interconnect Bus.

http://www-306.ibm.com/chips/techlib/techlib.nsf/techdocs/D9439D04EA9B080B87256FC00075CC2D/$file/MPR-Cell-details-article-021405.pdf

The EIB is merely a transport mechanism and does not facilitate the direct movement of data from one SPE to another. All movement into and out of an SPE's local store is handled via the DMA engine (called the MFC in the actual design).

It is readily apparent that you haven't read the actual tech documents and are instead relying on hearsay from analysts at disreputable firms.

Aaron Spink
speaking for myself inc.
 
EIB

Brodda Thep said:
I don't see anything in there that says that SPEs can directly access the Local Store of other SPEs.

LS is shared memory and the EIB connects the SPEs to LS. EIB peak throughput is 96B/cycle. Each SPE has a 16B/cycle connection via the EIB, the same as the PPU to its L1 cache!

Look at diagram on page 2.

Maybe it's easier to understand if you read the previous link as well, which describes the 3-level memory architecture.
 
aaronspink said:
It doesn't happen auto-magically, or at least there is nothing in the CELL documentation that would allow it to happen auto-magically.

Yes it does, with the use of a simultaneous non-realtime specialised OS in the background that manages the stream processing. IBM has this virtualisation technology incorporated into the Cell to do this.
 
ihamoitc2005 said:
LS is shared memory and the EIB connects the SPEs to LS. EIB peak throughput is 96B/cycle. Each SPE has a 16B/cycle connection via the EIB, the same as the PPU to its L1 cache!

Look at diagram on page 2.

Maybe it's easier to understand if you read the previous link as well, which describes the 3-level memory architecture.

Maybe it would be easier to understand if you got a clue. The LS is NOT shared. That's what the friggin LOCAL part of the name LOCAL store means. All access into and out of the LS of an SPE must be done through the DMA copy engine/MFC.

As I said, read the damn architecture documents and come back with a clue.

Aaron Spink
speaking for myself inc.
 
hugo said:
Yes it does, with the use of a simultaneous non-realtime specialised OS in the background that manages the stream processing. IBM has this virtualisation technology incorporated into the Cell to do this.

Um, in a word, NO. It doesn't happen automagically; it has to be programmer-controlled via the MFC/DMA copy engine.

Aaron Spink
speaking for myself inc.
 
aaaaa00 said:
It adds fewer complications than something that requires developers to manage asynchronous DMAs, separate small address spaces, and code overlays on top of whatever you need to do to make multithreading work in the first place.

"overlapping address spaces with potential context violations, everything ever touched by more than one thread should be thread safe or you should be absolutely sure what you're doing, intra-thread caches constantly stepping on each other toes, etc.". see, everyone can nitpick just for the jist of it.

Priority inversion is only possible when you have a contended resource shared between a high priority thread and a low priority thread AND you are scheduling multiple software threads on one single threaded physical processor -- the OS scheduler must decide which of the threads it can dispatch onto the CPU and which not to.

actually, the part after the AND above is totally superfluous. priority inversion is any situation where a thread of priority N prevents another thread of priority N + X from running by means of a third thread of priority N - Y, where there exists resource contention between the latter two threads and no such contention between the first thread and either of the other two (where N, X and Y are positives). therefore, the only two threads that need to compete for the same cpu are those of priorities N - Y and N. the third one (N + X) may have a whole vacant cpu for itself - it doesn't matter, it still cannot run.
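A tiny single-CPU scheduler simulation makes the definition concrete (thread roles, priorities, and tick counts are all invented for illustration): LOW (N - Y) holds a lock HIGH (N + X) needs, and MED (N) contends with neither but outranks LOW, so it runs first and delays HIGH.

```c
#include <stdbool.h>

enum { LOW, MED, HIGH, NTHREADS };

/* Run a tick-by-tick simulation and return the tick at which HIGH
 * finishes (or -1 if nothing is runnable). LOW already holds the lock
 * HIGH needs; MED contends with neither but outranks LOW, so it runs
 * first and delays HIGH - the inversion. */
int run_inversion_demo(int low_work, int med_work, int high_work)
{
    int prio[NTHREADS] = { 0, 1, 2 };
    int work[NTHREADS] = { low_work, med_work, high_work };
    int lock_owner = LOW;   /* LOW grabbed the lock before HIGH woke up */
    int tick = 0;

    while (work[HIGH] > 0) {
        /* dispatch the highest-priority runnable thread */
        int best = -1;
        for (int t = 0; t < NTHREADS; t++) {
            bool blocked = (t == HIGH && lock_owner != HIGH);
            if (work[t] > 0 && !blocked &&
                (best == -1 || prio[t] > prio[best]))
                best = t;
        }
        if (best == -1)
            return -1;      /* nothing runnable: HIGH is stuck for good */
        work[best]--;       /* run it for one tick */
        if (best == LOW && work[LOW] == 0)
            lock_owner = HIGH; /* LOW releases; HIGH acquires */
        tick++;
    }
    return tick;
}
```

With 3 ticks of LOW work and 2 of HIGH, HIGH finishes at tick 5 when MED is absent; give MED 5 ticks of unrelated work and HIGH slips to tick 10, despite having the highest priority.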

On xbox 360, for the high performance parts of your engine you are typically going to use exactly as many threads as there are hardware threads, so all of the threads are really running at the same time on the hardware -- there's no software threading going on.

sorry, i missed that - why? you have N threads for the high performance parts (N = num hw threads) and an arbitrary number of other 'non-high performance' threads - and you get software threading, just not among the high-performance threads, supposedly.

Hence, there should be little possibility for priority inversion because none of the threads should ever get into a CPU starved state.

after the slight corrections above, i don't see why anymore.
not only can thread (N) starve (N - Y) on a scheduling basis, but (N - Y) can be running and still have its SMT 'roommate' trash its cache so badly that you get a brand new form of 'priority inversion' - one where (N + X) cannot run because a thread of arbitrarily low priority (even lower than N - Y), to which (N + X) has no contention relations whatsoever, is cache-bullying (N + X)'s lock-keeper (N - Y).

Likewise priority inversion on SPEs is most likely a non-issue.

aside from the 'likewise', yep. as you would not do multithreading on a single SPE (they have a big red flashing sign over them: 'not for multithreading'), and you really don't _need_ to - there are plenty of them.


I did not say the tradeoffs were never worth it.

so i was under the wrong impression up until now : )

ok then, who decides which tradeoffs are and which are not worth it?
 
Iced Tea

aaronspink said:
Maybe it would be easier to understand if you got a clue. The LS is NOT shared. That's what the friggin LOCAL part of the name LOCAL store means. All access into and out of the LS of an SPE must be done through the DMA copy engine/MFC.

As I said, read the damn architecture documents and come back with a clue.

Aaron Spink
speaking for myself inc.

"Say what?" Have you not read about CELL stream processing where one SPE reads from another SPE's local store? Maybe instead of getting hot under your collar you should drink some iced tea and learn about how CELL actually works. Its not as complicated as you like to think.
 
ihamoitc2005 said:
"Say what?" Have you not read about CELL stream processing where one SPE reads from another SPE's local store? Maybe instead of getting hot under your collar you should drink some iced tea and learn about how CELL actually works. Its not as complicated as you like to think.

Aaron knows what he's talking about.

from http://www-306.ibm.com/chips/techli...C2D/$file/MPR-Cell-details-article-021405.pdf

Each of the eight SPEs has its own private local store, and the local-store memory is aliased to main memory but does not participate in a cache-coherency protocol. Software must manage the movement of data and instructions in and out of the LS and is controlled by the MFC. The LS has data-synchronization facilities but does not participate in hardware cache coherency. The eight local stores do have an alias in the memory map of the processor, and a PPE can load or store from a memory location that is mapped to the local store (but it’s not a high-performance option). Similarly, another SPE can use the DMA controller to move data to an address range that is mapped onto a local store of another SPE or even to itself.

SPE == SPU + LS + MFC

SPEs and PPE connect to EIB

Eh, just read up on MFC and you'll understand.

edit: fix link
 
Acert93 said:
Each thread won't necessarily be the same size. You will have small threads and large threads. e.g. with an SPE you get 256K. That is it. So if you have two "applets" -- one 128K and the other 384K, you are in trouble. So the 384K applet does not work and the 128K applet leaves unused memory.
I imagine that's an unlikely scenario, like wanting to run a 3MB block of executable on a processor with a 1MB cache. The crux of this issue is: how large is an applet? At the end of the day an applet will be written to fit into the LS along with its data, so you won't ever have a developer trying to squeeze a pint of code into a half-pint store. If necessary they'll have to divide the process into two smaller applets and maybe run them across two SPEs sharing data.
 
aaronspink said:
The only reason there is so much is because it is statically partitioned and allocated which means that all the local store must be sized to cover the maximum amount of memory that will be needed by any given program. In a scenario where the LS was shared this would not be required and the LS would be smaller.
And in a scenario where the LS was shared, you'd be dividing its access rate across 7 processors, would you not? Where an L2 cache is a store, the LS is a working space - I and D cache combined. Having six processors waiting for a seventh to finish working on the LS before they can access it doesn't sound very efficient!

Also, I think your belief in the low latency of XDR is fairly misplaced. The latencies for XDR are roughly equivalent to the latencies for SDR, DDR, DDR2, and DDR3.
Yes, I keep hearing about XDR's low latencies, but no-one offers actual numbers. However, I glean that the reason it's classed as lower latency is because it's clocked so much higher. But I don't really know. It's one of the console hardware myths.
 
Sis said:
(I'm only debating this point based on your original statement about fine-tuned control over memory access and use of assembler.)

Of course large programs are broken into smaller chunks--this has been true for how many decades now? The problem is that given a large code base, no one wants to be writing assembler level code for all the little bits. Instead, you code first, then you profile, then you optimize, and it's during the optimization stage that you may prefer to drop down into assembler/C/machine hacks and want fine-grained control over memory.

When you code, you code for correctness first. Things that make that difficult end up jeopardizing both your profiling and optimization stages.

.Sis
Which is why I guess no-one writes in assembler anymore, and there's no need to anyway except on tiny little code snippets!
 
Aaron is completely spot on. He knows EXACTLY what he's talking about.

SPEs have 256K of addressable memory (via a 32-bit pointer). They have an MFC which can DMA memory from a larger external pool (via a 64-bit pointer) into the local memory. It just so happens that the virtual address space includes each SPE's local memory, so that they can DMA to/from each other's LS.
The MFCs are clever enough to take the shortest path when you do this, so it's fast, BUT apart from speed it's exactly the same as accessing main memory. It's a DMA, and most importantly you have to copy data into local memory before use (so you can never access more than 256K (including code) at any one time).
 
ihamoitc2005 said:
Using your method of comparison, if one thread consumed 384K, you have just 5 threads with 128K/thread remaining on XeCPU.

On the other hand, if one thread consumed 384k on CELL, there is still 1 thread with 128k and 7 threads with 256k.
Aaron made my point pretty well. My method was only to show your comparison is not an equivalent one. You are trying to frame Xenon within the framework of CELL. They both require a different model of approach.

As for your exception, I was pretty clearly talking about the SPEs (as Aaron noted), so your what-if does not fit my example very well. I could just counter that your SPE and PPE code are not necessarily interchangeable (not to mention the PPE is going to be doing a lot of stuff related to the OS and delegating tasks to the SPEs, so consuming 384K on one intensive task could be counterproductive).

The point is you are going to have to make your code for the SPEs fit within the 256K block, and if your code is only 200K the leftovers cannot realistically be counted as "extra cache in the system".

They are different models of use.

Really, CELL and Xenon are different designs requiring a different approach. At this point neither has shown itself to be better or more efficient. Xenon's model is the PC route: it has more research behind it and has been shown to have some very large hurdles; CELL is a new approach to the problem. Obviously a lot of the batching and queue-related ideas could work on Xenon as well. The difference is it has fewer cores, but they are more "flexible" cores, so you don't have as many cores to use with such a model. The initial leak and patents indicate MS is aimed more at a model where a single GPU is dedicated to graphics (renderer, procedural work) and taking the rest from there. Conversely, the Xenon method is not favorable to the SPEs. You would not want to take the PPE code and run it on an SPE.

So one method wont necessarily work for the other. They are different designs with different needs.

So I don't understand this fixation on denoting how one won't work in a given context. Games that are aimed at exposing the CPUs and maximizing performance are not going to be easily ported, because the CELL model and the Xenon model of approach to reach those goals are in conflict.

CELL is a bigger chip (50% bigger) and has a higher peak in floating point, so it should really excel there. You would expect a significantly larger chip to perform better on average.

Shifty said:
I imagine that's an unlikely scenario, like wanting to run a 3MB block of executable on a processor with a 1MB cache. The crux of this issue is: how large is an applet? At the end of the day an applet will be written to fit into the LS along with its data, so you won't ever have a developer trying to squeeze a pint of code into a half-pint store. If necessary they'll have to divide the process into two smaller applets and maybe run them across two SPEs sharing data.
Yep. Different needs, different architecture, different approaches. There are solutions for these problems on BOTH models. They are just different. Because one solution does not work on one platform does not mean it is "the suck".

We all have our thinking caps on for how to make CELL work, which is good. But this same type of thinking goes into any project regardless of the CPU being used. CELL is just more exciting because it is new, powerful, and with so many cores you can really try some new things. That is a good thing. But that does not mean the model Intel/AMD are using is useless either (not that you are arguing that). IBM thought it was a good enough approach to come to Sony with it first and eventually used this approach for MS.

And I am sure we would be looking at it differently if a Pentium D or X2's 2nd core were aimed more at floating point performance. Obviously they were not, and games can really use the extra FP performance, so CELL's performance in this area is very exciting!
 
Bobbler said:
Can't we all just agree that the Cell and XeCPU offer two fairly different answers to the same problem? Both should work well, especially in a closed system.
That is what I am saying!

Bobbler said:
You program to the strengths, not weaknesses. Neither CPU has a perfect model for cache (or cache equivalents) either -- It wouldn't be hard to find ten problems with each CPU, but is that really important?
Please, by all means, make yourself at home in the console forum. ;)

Aaron said:
We haven't even gotten into cases where multiple threads share, say, a 256 or 384 KB data structure, making the effective size of the cache: 384 * 6 + 1024 - 384 = ~3MB.
Good example. But this analogy does not work if you are counting cache and local store with the goal of getting a "total amount" end product. That just does not jibe with the Xenon model (a large shared resource among a couple of cores vs. statically partitioned resources across many cores).

It's kind of like counting peak performance. Yeah, it looks good on paper... but how does that play out in real life? The only fair measurement is to do so in the context of the programming model it will employ. In this case CELL and Xenon are different, so trying to cram one approach down the throat of the other and pointing out deficiencies does not really respect their differences.

What you do on Xenon is not something you will do on CELL, and vice versa.
 
Well, the topic of conversation, from what I observed in this forum, went from which was more powerful:
RSX vs Xenos, then Cell vs Xenon.
Now it's which is easier to program for: Cell vs Xenon.

The Cell's SPEs have their fixed 256KB local store limitation, but there are goods and bads to it. The bad thing is programmers would have to write code that makes exact use of the fixed amount of memory when doing subroutines. OTOH, if you're going to write specialised code for each individual SPE, it's easier to troubleshoot problems and see your code in a neat manner.

The Cell and Xenon are not miles apart from each other when it comes to programming. For every problem that exists on each CPU there are solutions to overcome it. But the performance crown would still have to go to the Cell. No denying that.
 
hugo said:
Well, the topic of conversation, from what I observed in this forum, went from which was more powerful:
RSX vs Xenos, then Cell vs Xenon.
Now it's which is easier to program for: Cell vs Xenon.

The Cell's SPEs have their fixed 256KB local store limitation, but there are goods and bads to it. The bad thing is programmers would have to write code that makes exact use of the fixed amount of memory when doing subroutines. OTOH, if you're going to write specialised code for each individual SPE, it's easier to troubleshoot problems and see your code in a neat manner.

The Cell and Xenon are not miles apart from each other when it comes to programming. For every problem that exists on each CPU there are solutions to overcome it. But the performance crown would still have to go to the Cell. No denying that.
well, the topic of conversation for this thread is Itagaki, so why on earth are they talking about Cell and Xenon?
 