New ITAGAKI interview touches on 360 and PS3 comparison

ihamoitc2005 said:
Except that all the SPEs can work in parallel or in sequence and can read from each other's local stores, so redundant data isn't only unnecessary, it's silly.

I don't believe that the SPEs can read each other's local stores directly. I believe that all external access to/from a given local store must be done via the DMA copy engine. As such, the CELL has support for the copy engine to copy data from one SPE's local store to another SPE's local store.
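For illustration, that copy looks roughly like this with the SDK's SPU intrinsics. A minimal SPU-side sketch, assuming the PPE has passed in the effective-address alias of the peer SPE's local store; peer_ls_ea, buf, and the tag choice are illustrative names, not from any shipped code:

#include <stdint.h>
#include <spu_mfcio.h>

#define TAG 3  /* any DMA tag group, 0-31 */

/* Copy `size` bytes (max 16 KB per single DMA) from a peer SPE's local
   store, addressed via its effective-address alias, into our own LS. */
void fetch_from_peer(volatile void *buf, uint64_t peer_ls_ea, uint32_t size)
{
    mfc_get(buf, peer_ls_ea, size, TAG, 0, 0);

    /* Block until the transfer tagged TAG completes. */
    mfc_write_tag_mask(1 << TAG);
    mfc_read_tag_status_all();
}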

Also, any set of hardware contexts can operate in either parallel or sequential mode. This is pretty basic, and neither a new invention nor a feature unique to Cell.

Aaron Spink
speaking for myself inc.
 
ihamoitc2005 said:
Think of missing HD, originally standard, now $100 option.

Neither accurate nor relevant. The HDD is not an engineering/design decision... building a chip is waaaaaaaay different. Also, the HDD was NEVER standard except on Xbox.
 
ihamoitc2005 said:
No the primary issue is limiting the negative effect of memory latency on real world performance. As the number of processing units increase, the greater the marginal cost of direct access.

The latency is actually going to be higher with the DMA engine.

ihamoitc2005 said:
It's no surprise that there's 1.8MB of local store for the SPEs and another 512KB for the dual-threaded PPE. That works out to 256KB per "thread". Plus it's hooked up to very low-latency XDR.

The only reason there is so much is because it is statically partitioned and allocated, which means each local store must be sized to cover the maximum amount of memory that any given program will need. If the LS were shared, this would not be required and the LS could be smaller.

Also, I think your belief in the low latency of XDR is fairly misplaced. The latencies for XDR are roughly equivalent to the latencies for SDR, DDR, DDR2, and DDR3.

Aaron Spink
speaking for myself inc.
 
ihamoitc2005 said:
Using your method of comparison, if one thread consumed 384K, you have just 5 threads with 128K/thread remaining on XeCPU.

On the other hand, if one thread consumed 384K on CELL, there is still 1 thread with 128K and 7 threads with 256K.

To misquote the misquoter: fool me once, shame on you; fool me twice, 384K won't fit into a 256KB local store.

We haven't even gotten into cases where multiple threads share, say, a 256 or 384 KB data structure, making the effective size of the cache 384 * 6 + 1024 - 384 = ~3 MB.
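(Spelling out where the ~3 MB comes from: six private copies of that structure would cost 6 x 384 KB = 2304 KB, while in the shared cache only one copy is resident, leaving 1024 - 384 = 640 KB for everything else, so the shared 1 MB behaves like 2304 + 640 = 2944 KB, roughly 3 MB, of statically partitioned storage.)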

Aaron Spink
speaking for myself inc.
 
Gubbi said:
They can DMA to and from other SPEs' local stores, into their own. Hence a copy and hence redundant.

Cheers
Gubbi

It cannot DMA to and from other SPEs' local stores; it has its own internal bus that unites all the local stores at very high speed.

Each SPE only uses DMA for its local store to access main memory, not to access shared local store. Each SPE has complete access to the full local store (1.8MB of coherent shared memory) and writes from any local store to its register file via an internal bus at very high speed.

Furthermore, if the data is split into blocks and loaded in batches, almost 1.8MB of local store is available via the internal bus to all SPEs.

From the link posted below:

This three-level organization of storage (register file, local store, main storage) -- with asynchronous DMA transfers between local store and main storage ...

Each SPE has full access to coherent shared memory, including the memory-mapped I/O space.

The most productive SPE memory-access model appears to be the one in which a list (such as a scatter-gather list) of DMA transfers is constructed in an SPE's local store so that the SPE's DMA controller can process the list asynchronously while the SPE operates on previously transferred data. In several cases, this new approach to accessing memory has led to application performance exceeding that of conventional processors by almost two orders of magnitude, significantly more than anyone would expect from the peak performance ratio (about 10x) between the Cell Broadband Engine and conventional PC processors.
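In the spirit of that passage, here is a minimal double-buffered read loop using the SDK's SPU intrinsics; process(), the 16 KB chunking, and the tag-per-buffer scheme are illustrative assumptions, not code from the article:

#include <stdint.h>
#include <spu_mfcio.h>

#define CHUNK 16384  /* 16 KB: the largest single DMA transfer */

static volatile char buf[2][CHUNK] __attribute__((aligned(128)));

extern void process(volatile char *data, unsigned n);  /* assumed work function */

void stream(uint64_t ea, unsigned nchunks)
{
    unsigned i, cur = 0;

    mfc_get(buf[0], ea, CHUNK, 0, 0, 0);  /* kick off the first fetch */

    for (i = 0; i < nchunks; i++) {
        unsigned next = cur ^ 1;

        /* Start fetching chunk i+1 while chunk i is still in flight or in use */
        if (i + 1 < nchunks)
            mfc_get(buf[next], ea + (uint64_t)(i + 1) * CHUNK, CHUNK, next, 0, 0);

        /* Wait only for the buffer we are about to consume (tag == buffer index) */
        mfc_write_tag_mask(1 << cur);
        mfc_read_tag_status_all();

        process(buf[cur], CHUNK);
        cur = next;
    }
}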
 
aaaaa00 said:
I do not equate "correct" with "easy", only that there tends to be a relationship.

If it's harder to write correct code, then it tends to be harder to write fast correct code, since all fast code must be correct code to be useful.

If it's easier to write correct code, it also tends to be easier to write fast correct code.

ok. here lies the crux of our disagreement - you're saying that smp multithreading allows for 'easier correct' code, and i say that it allows for 'easier' code per se, nothing more nothing less. as in "it won't automatically rendezvous your threads, it won't magically un-race them, it won't do squat more 'bout your code's correctness than the rudimentary mem coherency".

aaaaa00 said:
The point is, the less time you need to ensure your code is correct, the more time you can spend on optimization.

sure. it's not clear though why you'd spend less time ensuring correctness of the smp code in comparison to 'cellular' code. a lame question for you: which would be potentially more problematic - priority inversion under xecpu or priority inversion under cell? when thinking about it take into account the number of processing elements, potential use of SMT, etc.

aaaaa00 said:
I never said it was the most efficient, merely the easiest and most natural. This is simply because it is the most general form of multithreading -- you have N threads, they execute the same kind of opcodes, they share your entire address space, memory ordering is strict, and memory coherency between the threads is enforced.

<sidenote>
if threads share the same address space then mem coherency _damn_better_ be enforced or you're in deep shit.
</sidenote>

aaaaa00 said:
Every multithreaded design pattern can be constructed from this basic foundation, which is why it's taught in schools and why it's the dominant form on general purpose CPUs.

i agree with your statement above up to "that's why it's taught in schools".

"as Charles Babbage's analytical engine was a turing-complete machine (at least as much as the desktop i'm writing this from) and turing machines are fundamental and taught at school, Babbage's analytical engine has been the dominant computer ever since."

see why?

aaaaa00 said:
Each special purpose optimization you add to this (weakly ordered memory model, NUMA, dropping coherency, asymmetry of the threads, private thread address spaces, etc.) can improve peak performance, but at the cost of adding things to worry about.

and here we totally and utterly disagree. also you may want to give a wakeup call to all those high-performance NUMA architecture vendors about how wrong they are and what they're doing to developers.
 
expletive said:
Ok, couple of things here, and i do appreciate your thoughtful response.

1. My original post was phrased in the form of a question:

"That said, paralleism with 3 identical cores and 6 identical threads should be a bit easier than a PPE and SPE design where each has different needs and potentially different roles shouldnt it?"

So i'm not trying to argue any point, just trying to understand the differences and benefits of each approach.

2. I guess i interpret what is being said by JC slightly differently. My interpretation is he's saying:
a. multithreaded programming is a pain in the ass
b. from an 'ease of use' standpoint, the design in the 360 is the best possible case for a developer to coax performance benefits out of multithreading
c. even in the best possible case, it's very difficult to realize real-world benefits

In my mind, that still doesn't change the fact that regardless of how much absolute performance gain you can wring out of the 360 CPU, at the end of the day it's still easier to get that meager efficiency from the XeCPU than from the Cell.

What does this mean? I don't really know. More relative efficiency on the 360 CPU? Shorter development times? Better games sooner in each console's lifecycle? No idea, and only time will tell.

So in summary, even if the dev-friendly design of the XeCPU gets you nothing, it's easier to get nothing on the 360 than on the PS3. :D

J

re point b.

'ease of use' is a tarball in itself (in case you haven't noticed the argument we're having with aaaaa00). what can be said for sure, and would not contradict what Carmack says either, is that the 360's model is very close to the typical smp pc. that would imply ease of porting/translation of programming experience from that same pc domain.

of course, at the end of the day you may be absolutely right - the bulk of console devs may turn out to suck at getting anything decent out of the cell. what you're undoubtedly right about even at this very moment, though, is that time will tell : )
 
Misguided misquote of misquoter

aaronspink said:
To misquote the misquoter: fool me once, shame on you; fool me twice, 384K won't fit into a 256KB local store.

I like the misquoting of the misquoter and the sentiment behind it, but in this case it's misguided.

384K fits into 512KB, which is the size of the L2 cache of the PPE. As I said earlier, that still leaves 128KB for the other PPE thread and 1.8MB of local store to be shared by 7 SPEs.

aaronspink said:
We haven't even gotten into cases where multiple threads share, say, a 256 or 384 KB data structure, making the effective size of the cache 384 * 6 + 1024 - 384 = ~3 MB.

Aaron Spink
speaking for myself inc.

You are talking about cache blocking, no? You are misunderestimating the unlikelihood of 6 threads sharing a block, as well as the cost of a cache miss.
 
Can't we all just agree that the Cell and XeCPU offer two fairly different answers to the same problem? Both should work well, especially in a closed system.

The problem with cache on the XeCPU shouldn't really pose itself to be a huge problem because of the nature of a closed system -- you know the limitations ahead of time and so you can plan around them. Same with the Cell. You program to the strengths, not the weaknesses. Neither CPU has a perfect model for cache (or cache equivalents) either -- it wouldn't be hard to find ten problems with each CPU, but is that really important?

The closer we get to the release of these consoles, the less I seem to care about the specs. (I realize some people still like to debate things and dig into specs with fine-tooth combs, and that's fine -- but I think getting into the fine semantics of things and trying to say one is a patently better solution is a bit overzealous.)
 
ihamoitc2005 said:
It cannot DMA to and from other SPEs' local stores; it has its own internal bus that unites all the local stores at very high speed.

Each SPE only uses DMA for its local store to access main memory, not to access shared local store. Each SPE has complete access to the full local store (1.8MB of coherent shared memory) and writes from any local store to its register file via an internal bus at very high speed.

Furthermore, if the data is split into blocks and loaded in batches, almost 1.8MB of local store is available via the internal bus to all SPEs.

You are going to have to document your claim here. There is no documentation that I have seen that allows the SPEs to directly access the local store of other SPEs.
 
Bobbler said:
Can't we all just agree that the Cell and XeCPU offer two fairly different answers to the same problem? Both should work well, especially in a closed system.


Quit being all moderate and even-handed, damn you! :p
 
ihamoitc2005 said:
I like the misquoting of the misquoter and the sentiment behind it, but in this case it's misguided.

384K fits into 512KB, which is the size of the L2 cache of the PPE. As I said earlier, that still leaves 128KB for the other PPE thread and 1.8MB of local store to be shared by 7 SPEs.

But 384K won't fit within an SPE's local store.



ihamoitc2005 said:
You are talking about cache blocking, no? You are misunderestimating the unlikelihood of 6 threads sharing a block, as well as the cost of a cache miss.

I believe my assumptions are just as valid as your assumptions, if not more so.

Aaron Spink
speaking for myself inc.
 
Processes can be assigned dynamically to any free SPE at any given time. The SPEs were meant to process smaller chunks of data, ordered by the PPU, which has 2 threads.
Given data as large as 384K, it will most likely be assigned to the PPE. If there's ever a case where more than 256K needs to be stored and processed by the SPEs, it's possible for the SPEs to interact with each other via the internal high-speed Element Interconnect Bus (EIB).
 
aaronspink said:
But 384K won't fit within an SPE's local store.

Why should it fit in SPE local store when it can fit in the PPE L2 cache, as I described?

aaronspink said:
I believe my assumptions are just as valid as your assumptions, if not more so.

Everything I say is hardware capability with no complicated programming required, just SPE-style "small" programming. No assumptions or hypothetical situations, all real hardware capability.

OTOH, 6 threads on a 384KB cache block with just 1MB of cache is pure hypothetical and not realistic, since cache-block x threads is effectively 230% of physical cache. Even a 256KB cache block is very risky for 6 threads with only 1MB cache, since 150% of physical cache is risky. Too much cache-block x threads over physical cache is disaster waiting to happen. Cache miss after cache miss will be result. Maybe if developer finds opportunity, 6 threads at 192KB possible for effective 2MB total on very rare occasions. Still pushing chances and not recommended if consistent frame-rate matters.
 
aaronspink said:
You are going to have to document your claim here. There is no documentation that I have seen that allows the SPEs to directly access the local store of other SPEs.

Yes there is. The process is called stream processing.
 
ihamoitc2005 said:
Why should it fit in SPE local store when it can fit in the PPE L2 cache, as I described?



Everything I say is hardware capability with no complicated programming required, just SPE-style "small" programming. No assumptions or hypothetical situations, all real hardware capability.

OTOH, 6 threads on a 384KB cache block with just 1MB of cache is pure hypothetical and not realistic, since cache-block x threads is effectively 230% of physical cache. Even a 256KB cache block is very risky for 6 threads with only 1MB cache, since 150% of physical cache is risky. Too much cache-block x threads over physical cache is disaster waiting to happen. Cache miss after cache miss will be result. Maybe if developer finds opportunity, 6 threads at 192KB possible for effective 2MB total on very rare occasions. Still pushing chances and not recommended if consistent frame-rate matters.

I am having some slight difficulty understanding your Engrish. You speak Yoda English. Me know you from where.
 
CELL information

aaronspink said:
You are going to have to document your claim here. There is no documentation that I have seen that allows the SPEs to directly access the local store of other SPEs.

It's called the Element Interconnect Bus.

http://www-306.ibm.com/chips/techlib/techlib.nsf/techdocs/D9439D04EA9B080B87256FC00075CC2D/$file/MPR-Cell-details-article-021405.pdf
 
darkblu said:
ok. here lies the crux of our disagreement - you're saying that smp multithreading allows for 'easier correct' code, and i say that it allows for 'easier' code per se, nothing more nothing less. as in "it won't automatically rendezvous your threads, it won't magically un-race them, it won't do squat more 'bout your code's correctness than the rudimentary mem coherency".

It adds fewer complications than something that requires developers to manage asynchronous DMAs, separate small address spaces, and code overlays on top of whatever you need to do to make multithreading work in the first place.

darkblu said:
a lame question for you: which would be potentially more problematic - priority inversion under xecpu or priority inversion under cell? when thinking about it take into account the number of processing elements, potential use of SMT, etc.

Priority inversion occurs when a high priority thread is waiting on a resource locked by a low priority thread, but a third thread is monopolizing CPU resources, starving the low priority thread and preventing it from finishing what it's doing and releasing the lock. This causes the high priority thread to behave as if it was running at low priority, hence "priority inversion".

Priority inversion is only possible when you have a contended resource shared between a high-priority thread and a low-priority thread AND you are scheduling multiple software threads on one single-threaded physical processor -- the OS scheduler must decide which of the threads to dispatch onto the CPU and which not to.
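(For reference, the textbook mitigation is a priority-inheritance mutex. A minimal POSIX sketch of the general idea -- purely illustrative, neither console exposes this API:)

#include <pthread.h>

pthread_mutex_t lock;

int init_pi_mutex(void)
{
    pthread_mutexattr_t attr;
    int rc;

    pthread_mutexattr_init(&attr);
    /* Priority inheritance: while a low-priority thread holds the lock,
       it runs at the priority of the highest-priority waiter, so a
       medium-priority thread can no longer starve it. */
    pthread_mutexattr_setprotocol(&attr, PTHREAD_PRIO_INHERIT);
    rc = pthread_mutex_init(&lock, &attr);
    pthread_mutexattr_destroy(&attr);
    return rc;
}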

On xbox 360, for the high performance parts of your engine you are typically going to use exactly as many threads as there are hardware threads, so all of the threads are really running at the same time on the hardware -- there's no software threading going on.

Hence, there should be little possibility for priority inversion because none of the threads should ever get into a CPU-starved state. Likewise, priority inversion on SPEs is most likely a non-issue.

However, if you start scheduling software threads on one core of the XeCPU, then yes, priority inversion will be possible, but that's no different than the same thing occurring on the CELL PPE. You'd only do software-scheduled threads on the xbox 360 for low-performance areas of your engine anyway.

darkblu said:
and here we totally and utterly disagree. also you may want to give a wakeup call to all those high-performance NUMA architecture vendors about how wrong they are and what they're doing to developers.

I did not say the tradeoffs were never worth it. In fact, newer SMP architectures like Opteron and Itanium make various optimizations to improve performance and step away from the classical SMP architecture.

For example, on a multiprocessor Opteron you have NUMA, which forces you to think about which node your allocations are coming from because accessing a data structure on the opposite node will knock your performance down.
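(For illustration, on Linux that placement decision looks like the libnuma sketch below; node 0 and the 1 MB size are arbitrary choices of mine, just the clearest way to show the idea:)

#include <numa.h>
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    size_t sz = 1 << 20;  /* 1 MB */
    void *p;

    if (numa_available() < 0) {
        fprintf(stderr, "no NUMA support on this system\n");
        return EXIT_FAILURE;
    }

    /* Place the allocation on the node whose CPUs will touch it most,
       avoiding the remote-node penalty described above. Link with -lnuma. */
    p = numa_alloc_onnode(sz, 0);
    if (!p)
        return EXIT_FAILURE;

    /* ... touch and use the memory from threads running on node 0 ... */

    numa_free(p, sz);
    return EXIT_SUCCESS;
}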

Another example is that on a multiprocessor Itanium the memory model is weakly ordered, which means the CPU allows writes to memory to complete out of order with respect to other writes and reads. This means you have to insert memory fence instructions in the right places to ensure coherence before you do anything that involves more than one thread.
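(In portable terms the pattern looks like this sketch; C11 fences stand in here for Itanium's explicit mf instruction, and the flag/payload names are mine:)

#include <stdatomic.h>

int payload;                 /* ordinary data shared between threads */
atomic_int ready = 0;

void producer(void)
{
    payload = 42;
    /* Release fence: the payload write cannot be reordered past the flag
       store below, so a consumer that sees ready==1 also sees payload. */
    atomic_thread_fence(memory_order_release);
    atomic_store_explicit(&ready, 1, memory_order_relaxed);
}

int consumer(void)
{
    while (!atomic_load_explicit(&ready, memory_order_relaxed))
        ;  /* spin until the flag is published */
    /* Acquire fence: pairs with the release fence above. */
    atomic_thread_fence(memory_order_acquire);
    return payload;
}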

The point is, it is a fact that these optimizations do improve performance at the cost of making it more complicated for developers to write correct code.

From the point of view of a software developer, it is still my opinion that the best case scenario is a classical SMP and that classical SMP is the most general and straightforward form of multithreading.

The more you step away from the classical SMP architecture, the more things you have to think about when implementing your design, and at that point, it's all about the tradeoffs.
 
Engrish no good?

hugo said:
I am having some slight difficulty understanding your Engrish.You speak Yoda English.Me know you from where.

I apologize if my Engrish ain't perfect, but I try. Sometimes better than other times, no? Regarding cache-blocking... it is not good, for sake of cache misses, to have cache-block size x number of threads utilizing the cache exceed total physical cache. If physical cache exceeded, then real chance of too many threads cache missing and fetching from memory. Too many cache misses is very bad, and then we must feel ashamed of our failure and hang our heads in shame. I think Aaron "misunderestimates" danger of cache misses and cycles lost. See, my Engrish is as good as the president's!
 