New ITAGAKI interview touches on 360 and PS3 comparison

Discussion in 'Console Industry' started by ps2xboxcube, Sep 24, 2005.

  1. Brodda Thep

    Newcomer

    Joined:
    Jul 29, 2005
    Messages:
    44
    Likes Received:
    48
    SPEs to LS

    I don't see anything in there that says that SPEs can directly access the Local Store of other SPEs.

    The relevant quote in that document may be:
    Which is, of course, not direct access.

    I think the biggest problem is that SPEs do not mesh well with current popular programming paradigms. Ideally, you would want to load a small function/executable/whatever into a small part of the local store and then churn through a ton of data while issuing new DMAs. Perhaps asking for 96k at a time: issue a request, work on the other 96k while waiting, then swap and issue another 96k request. But then your data needs to be laid out in sequential memory blocks. You certainly don't want the data you need to work on intermixed with data that has no use in the current thread, or spread all over the memory space.
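    That double-buffered streaming pattern can be sketched in plain C. This is a toy model with invented helper names: the synchronous memcpy in dma_get stands in for what would be an asynchronous mfc_get/mfc_put DMA on real hardware, and CHUNK is kept tiny for illustration.

```c
#include <string.h>

#define CHUNK 96   /* stand-in for ~96K of data; kept tiny for illustration */

/* Simulated DMA fetch: on Cell this would be an asynchronous mfc_get();
   here a plain memcpy stands in for the transfer. */
static void dma_get(float *dst, const float *src, int n) {
    memcpy(dst, src, (size_t)n * sizeof(float));
}

/* Work on one chunk while it sits in the "local store" (here: double it). */
static void process(float *buf, int n) {
    for (int i = 0; i < n; ++i)
        buf[i] *= 2.0f;
}

/* Stream 'total' floats through two local buffers: while one chunk is
   being processed, the next is (conceptually) already in flight. */
static void stream_double_buffered(float *data, int total) {
    float bufA[CHUNK], bufB[CHUNK];
    float *cur = bufA, *next = bufB;
    int off = 0;
    int n = (total < CHUNK) ? total : CHUNK;

    dma_get(cur, data, n);                         /* prime the first buffer */
    while (off < total) {
        int noff = off + n;                        /* next chunk's offset   */
        int nn = (total - noff < CHUNK) ? total - noff : CHUNK;
        if (noff < total)
            dma_get(next, data + noff, nn);        /* overlaps with compute */
        process(cur, n);
        memcpy(data + off, cur, (size_t)n * sizeof(float)); /* write back   */
        float *t = cur; cur = next; next = t;      /* swap the two buffers  */
        off = noff;
        n = nn;
    }
}
```

    On real hardware the win comes from the fetch of the next chunk genuinely overlapping the compute on the current one; here the swap only shows the bookkeeping.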

    I like using object oriented programming techniques myself. And the hoops I would have to go through to get my data in the proper format does not sound fun, but then I haven't done any graphics or physics work worth speaking about. Certainly you won't want to be using anything that has dynamic memory needs, but then I assume that is avoided anyways in the console space as allocating memory tends to be expensive.

    At any rate, you won't be putting normal threads on the SPEs without losing a lot of performance. It sounds like you will need a significantly different approach to get the most out of Cell; luckily, that approach should also work well on XeCPU and PCs. But going the other way is really not an option.
     
  2. aaronspink

    Veteran

    Joined:
    Jun 20, 2003
    Messages:
    2,641
    Likes Received:
    64
    Well, then the SPEs are useless, no? The issue is that any given SPE can't work on a data structure that is larger than 256KB (and in the real world it's actually much smaller: due to the setup overhead and latency of the DMA engine, you'll likely be limited to a real data set of ~64KB at a time, to allow upload and download of the previous data set).



    You apparently don't understand the issues surrounding constructive interference. It is possible, and for some data structures likely, that a large number of the threads will be referencing the same structure at the same time. In those cases a 256KB or 384KB dataset, for instance, is shared among the threads within the cache, resulting in an effective cache size much greater than the actual cache size. This is real and does happen.
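    The footprint argument can be put as a back-of-envelope in C (toy numbers from the post: a 384KB structure touched by six threads; the function names are made up for illustration). With a shared cache the structure is resident once no matter how many threads hit it; with statically partitioned per-core stores, each core working on it needs its own copy.

```c
/* Footprint, in KB, of one data structure touched by several threads. */

/* Shared cache: constructive interference - all threads hit the same
   cached copy, so residency is independent of the thread count. */
static int shared_footprint_kb(int threads, int struct_kb) {
    (void)threads;
    return struct_kb;
}

/* Statically partitioned local stores: every core working on the
   structure must hold its own copy. */
static int partitioned_footprint_kb(int threads, int struct_kb) {
    return threads * struct_kb;
}
```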

    Aaron Spink
    speaking for myself inc.
     
  3. aaronspink

    Veteran

    Joined:
    Jun 20, 2003
    Messages:
    2,641
    Likes Received:
    64
    Which is all fine and good but the mechanisms which allow this on CELL require using the DMA copy engine to copy a portion of one SPE's local storage to the local storage of another SPE. It doesn't happen auto-magically, or at least there is nothing in the CELL documentation that would allow it to happen auto-magically.

    The actual process involves aliasing a portion of one SPE's local store into the global address map and then initiating a DMA copy from another SPE into the mapped global address range, which is translated into the first SPE's local store.
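    That aliasing can be modelled in a few lines of C. Everything here is invented for illustration: the base addresses, the tiny store size, and the function names; on real hardware the mapping is set up by the OS through the MMU and the copy is an asynchronous MFC DMA, not a memcpy.

```c
#include <stdint.h>
#include <string.h>

#define NUM_SPE 8
#define LS_LEN  256   /* stand-in for the 256KB local store */

/* Each SPE's local store, plus the (invented) base address at which it
   is aliased into the global address map. */
static uint8_t ls[NUM_SPE][LS_LEN];
static uintptr_t ls_base(int spe) { return 0x10000u + (uintptr_t)spe * LS_LEN; }

/* Translate a global address to (spe, offset) if it falls inside an
   aliased local-store window; -1 if it does not. */
static int translate(uintptr_t ga, int *offset) {
    for (int s = 0; s < NUM_SPE; ++s)
        if (ga >= ls_base(s) && ga < ls_base(s) + LS_LEN) {
            *offset = (int)(ga - ls_base(s));
            return s;
        }
    return -1;
}

/* "DMA get" into dst_spe's local store: when the source global address
   aliases another SPE's LS, the copy ends up being LS-to-LS. */
static int mfc_get_sim(int dst_spe, int dst_off, uintptr_t src_ga, int n) {
    int src_off, src_spe = translate(src_ga, &src_off);
    if (src_spe < 0 || src_off + n > LS_LEN || dst_off + n > LS_LEN)
        return -1;
    memcpy(&ls[dst_spe][dst_off], &ls[src_spe][src_off], (size_t)n);
    return 0;
}
```

    The point of the sketch is that nothing happens auto-magically: the source SPE's store must already be mapped, and the destination must explicitly issue the copy.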

    Aaron Spink
    speaking for myself inc.
     
  4. aaronspink

    Veteran

    Joined:
    Jun 20, 2003
    Messages:
    2,641
    Likes Received:
    64
    The EIB is merely a transport mechanism and does not facilitate the direct movement of data from one SPE to another. All movement into and out of an SPE's local store is handled via the DMA engine (called the MFC in the actual design).

    It is readily apparent that you haven't read the actual tech documents and are instead relying on hearsay from analysts at disreputable firms.

    Aaron Spink
    speaking for myself inc.
     
  5. ihamoitc2005

    Veteran

    Joined:
    Sep 21, 2005
    Messages:
    1,181
    Likes Received:
    15
    Eib

    LS is shared memory and EIB connects SPEs to LS. EIB peak throughput is 96B/cycle. Each SPE has a 16B/cycle connection via the EIB, the same as the PPU to its L1 cache!

    Look at diagram on page 2.

    Maybe easier to understand if you read previous link as well which describes 3 level memory architecture.
     
  6. hugo

    Newcomer

    Joined:
    Dec 8, 2004
    Messages:
    172
    Likes Received:
    0
    Yes it does, with the use of a simultaneous non-realtime specialised OS in the background that manages the stream processing. IBM has this virtualisation technology incorporated into the Cell to do this.
     
  7. aaronspink

    Veteran

    Joined:
    Jun 20, 2003
    Messages:
    2,641
    Likes Received:
    64
    Maybe it would be easier to understand if you got a clue. The LS is NOT shared. That's what the friggin' LOCAL part of the name LOCAL store means. All access into and out of the LS of an SPE must be done through the DMA copy engine/MFC.

    As I said, read the damn architecture documents and come back with a clue.

    Aaron Spink
    speaking for myself inc.
     
  8. aaronspink

    Veteran

    Joined:
    Jun 20, 2003
    Messages:
    2,641
    Likes Received:
    64
    Um, in a word, NO. It doesn't happen automagically, it has to be programmer controlled via the MFC/DMA copy engine.

    Aaron Spink
    speaking for myself inc.
     
  9. darkblu

    Veteran

    Joined:
    Feb 7, 2002
    Messages:
    2,642
    Likes Received:
    22
    "overlapping address spaces with potential context violations, everything ever touched by more than one thread should be thread safe or you should be absolutely sure what you're doing, intra-thread caches constantly stepping on each other's toes, etc." see, everyone can nitpick just for the gist of it.

    actually, the part after the AND above is totally superfluous. priority inversion is any situation where a thread of priority N prevents another thread of priority N + X from running by means of a third thread of priority N - Y, where there is resource contention between the latter two threads and none between the first thread and either of the other two (N, X and Y positive). therefore, the only two threads that need to compete for the same cpu are those of priorities N - Y and N. the third one (N + X) may have a whole vacant cpu to itself - it doesn't matter, it still cannot run.
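    the three-thread situation above can be sketched as a toy single-cpu scheduler in C (all names and the lock model are invented for illustration; LOW stands for N - Y, MED for N, HIGH for N + X):

```c
/* Toy single-cpu scheduler: LOW = N - Y (holds the lock), MED = N,
   HIGH = N + X (needs the lock). */
enum { LOW, MED, HIGH };

typedef struct {
    int prio;        /* scheduling priority              */
    int wants_lock;  /* contends for the shared resource */
    int done;
} task_t;

/* One scheduling decision: the highest-priority runnable task wins.
   A task that needs the lock is not runnable while another holds it. */
static int pick_next(const task_t *t, int n, int lock_holder) {
    int best = -1;
    for (int i = 0; i < n; ++i) {
        if (t[i].done)
            continue;
        if (t[i].wants_lock && lock_holder >= 0 && lock_holder != i)
            continue;                     /* blocked on the lock */
        if (best < 0 || t[i].prio > t[best].prio)
            best = i;
    }
    return best;
}
```

    with LOW holding the lock, the scheduler keeps picking MED: HIGH is blocked on the lock, and LOW - the only thread that could release it - is starved by MED. that is the inversion.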

    sorry, i missed that - why? you have N threads for the high performance parts (N = num hw threads) and an arbitrary number of other 'non-high performance' threads - and you get software threading, just not among the high-performance threads, supposedly.

    after the slight corrections above i don't see why anymore.
    not only can thread (N) starve (N - Y) on a scheduling basis, but also (N - Y) can be running and still have its SMT 'roommate' thrash the former's cache so badly that you get a brand new form of 'priority inversion' - one where (N + X) cannot run because a thread of arbitrarily low priority (even lower than N - Y), to which the former has no contention relations whatsoever, is cache-bullying (N + X)'s lock-keeper (N - Y).

    aside from the 'likewise', yep. as you would not do multithreading on a single SPE (they have a big red flashing sign over them saying 'not for multithreading'), and you really don't _need_ to - there are plenty of them.


    so i was under the wrong impression up until now : )

    ok then, who decides which tradeoffs are and which are not worth it?
     
    #169 darkblu, Sep 26, 2005
    Last edited by a moderator: Sep 26, 2005
  10. ihamoitc2005

    Veteran

    Joined:
    Sep 21, 2005
    Messages:
    1,181
    Likes Received:
    15
    Iced Tea

    "Say what?" Have you not read about CELL stream processing, where one SPE reads from another SPE's local store? Maybe instead of getting hot under your collar you should drink some iced tea and learn about how CELL actually works. It's not as complicated as you like to think.
     
  11. TrungGap

    Regular

    Joined:
    Jun 17, 2005
    Messages:
    578
    Likes Received:
    2
    Aaron knows what he's talking about.

    from http://www-306.ibm.com/chips/techli...C2D/$file/MPR-Cell-details-article-021405.pdf

    SPE == SPU + LS + MFC

    SPEs and PPE connect to EIB

    Eh, just read up on MFC and you'll understand.

    edit: fix link
     
    #171 TrungGap, Sep 26, 2005
    Last edited by a moderator: Sep 26, 2005
  12. Shifty Geezer

    Shifty Geezer uber-Troll!
    Moderator Legend

    Joined:
    Dec 7, 2004
    Messages:
    44,104
    Likes Received:
    16,896
    Location:
    Under my bridge
    I imagine that's an unlikely scenario, like wanting to run a 3 MB block of executable code on a processor with a 1 MB cache. The crux of this issue is how large is an appulet? At the end of the day an appulet will be written to fit into the LS along with its data, so you won't ever have a developer trying to squeeze a pint of code into a half-pint store. If necessary they'll have to divide the process into two smaller appulets and maybe run them across two SPEs sharing data.
     
  13. Shifty Geezer

    Shifty Geezer uber-Troll!
    Moderator Legend

    Joined:
    Dec 7, 2004
    Messages:
    44,104
    Likes Received:
    16,896
    Location:
    Under my bridge
    And in a scenario where the LS was shared, you'd be dividing its access rate across 7 processors, would you not? Where an L2 cache is a store, the LS is a working space, I and D cache combined. Having six processors waiting for a seventh to finish working on the LS before they can access it doesn't sound very efficient!

    Yes, I keep hearing about XDR's low latencies but no-one offers actual numbers. However, I glean the reason it's classed as lower latency is because it's clocked so much higher. But I don't really know. It's one of the console hardware myths.
     
  14. Shifty Geezer

    Shifty Geezer uber-Troll!
    Moderator Legend

    Joined:
    Dec 7, 2004
    Messages:
    44,104
    Likes Received:
    16,896
    Location:
    Under my bridge
    Which is why I guess no-one writes in assembler anymore, and there's no need to anyway except on tiny little code snippets!
     
  15. DeanoC

    DeanoC Trust me, I'm a renderer person!
    Veteran Subscriber

    Joined:
    Feb 6, 2003
    Messages:
    1,469
    Likes Received:
    185
    Location:
    Viking lands
    Aaron is completely spot on. He knows EXACTLY what he's talking about.

    SPEs have 256K of addressable memory (via a 32-bit pointer). They have an MFC which can DMA memory from a larger external pool (via a 64-bit pointer) into the local memory. It just so happens that the virtual address space includes each SPE's local memory, so they can DMA to/from each other's LS.
    The MFCs are clever enough to take the shortest path when you do this, so it's fast, BUT apart from speed it's exactly the same as accessing main memory. It's a DMA, and most importantly you have to copy data into local memory before use (so you can never access more than 256K (including code) at any one time).
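    That constraint - the MFC is the only way in, and the working set can never exceed the 256K window - can be modelled host-side in a few lines of C. This is a synchronous toy with invented names: the real mfc_get takes a tag and completes asynchronously, and the effective address is a 64-bit value rather than a host pointer.

```c
#include <stdint.h>
#include <string.h>

#define LS_SIZE (256 * 1024)   /* the SPE's entire addressable window */

/* Host-side toy model of an SPE: a 256KB local store plus a "get" shaped
   roughly like mfc_get (LS offset, effective address, size) - except the
   EA here is just a host pointer and the copy is synchronous. */
typedef struct {
    uint8_t ls[LS_SIZE];
} spe_t;

static int mfc_get_model(spe_t *spe, uint32_t ls_off,
                         const void *ea, uint32_t size) {
    if (ls_off + size > LS_SIZE)
        return -1;             /* nothing can exceed the 256K window */
    memcpy(spe->ls + ls_off, ea, size);
    return 0;
}
```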
     
  16. Acert93

    Acert93 Artist formerly known as Acert93
    Legend

    Joined:
    Dec 9, 2004
    Messages:
    7,782
    Likes Received:
    162
    Location:
    Seattle
    Aaron made my point pretty well. My method was only to show that your comparison is not an equivalent one. You are trying to frame Xenon within the framework of CELL. They each require a different model of approach.

    As for your exception, I was pretty clearly talking about the SPEs (as Aaron noted), so your what-if does not fit my example very well. I could just counter that your SPE and PPE code are not necessarily interchangeable (not to mention the PPE is going to be doing a lot of stuff related to the OS and delegating tasks to the SPEs, so consuming 384k on one intensive task could be counterproductive).

    The point is you are going to have to make your code for the SPEs fit within the 256K block, and if your code is only 200K the leftovers cannot realistically be counted as "extra cache in the system".

    They are different models of use.

    Really, CELL and Xenon are different designs requiring different approaches. At this point neither has shown itself to be better or more efficient. Xenon's model is the PC route: it has more research behind it and has been shown to have some very large hurdles; CELL is a new approach to the problem. Obviously a lot of the batching and queue-related ideas could work on Xenon as well. The difference is it has fewer cores, but they are more "flexible" cores, so you don't have as many cores to use with such a model. The initial leak and patents indicate MS is aiming more at a model where a single GPU is dedicated to graphics (renderer, procedural work) and taking the rest from there. Conversely, the method for Xenon is not favorable to the SPEs. You would not want to take the PPE code and run it on an SPE.

    So one method wont necessarily work for the other. They are different designs with different needs.

    So I don't understand this fixation on denoting how one won't work in a given context. Games that are aimed at exposing the CPUs and maximizing performance are not going to be easily ported, because the CELL model and the Xenon model of approaching those goals are in conflict.

    CELL is a bigger chip (50% bigger) and has a higher peak in floating point, so it should really excel there. You would expect a significantly larger chip to perform better on average.

    Yep. Different needs, different architectures, different approaches. There are solutions for these problems on BOTH models. They are just different. That one solution does not work on one platform does not mean it is "the suck".

    We all have our thinking caps on about how to make CELL work, which is good. But this same type of thinking goes into any project regardless of the CPU being used. CELL is just more exciting because it is new, powerful, and with so many cores you can really try some new things. That is a good thing. But that does not mean the model Intel/AMD are using is useless either (not that you are arguing that). IBM thought it was a good enough approach to come to Sony with it first and eventually used this approach for MS.

    And I am sure we would be looking at it differently if a Pentium D or X2's 2nd core were aimed more at floating point performance. Obviously they were not, and games can really use the extra FP performance, so CELL's performance in this area is very exciting!
     
    #176 Acert93, Sep 26, 2005
    Last edited by a moderator: Sep 26, 2005
  17. Acert93

    Acert93 Artist formerly known as Acert93
    Legend

    Joined:
    Dec 9, 2004
    Messages:
    7,782
    Likes Received:
    162
    Location:
    Seattle
    That is what I am saying!

    Please, by all means, make yourself at home in the console forum. ;)

    Good example. But this analogy does not work if you are counting cache and local store with the goal of getting a "total amount" end product. That just does not jibe with the Xenon model (a large shared resource among a couple of cores vs. statically partitioned resources across many cores).

    It's kind of like counting peak performance. Yeah, it looks good on paper... but how does that play out in real life? The only fair measurement is to do so in the context of the programming model each will employ. In this case CELL and Xenon are different, so cramming one approach down the throat of the other and pointing out deficiencies does not really respect their differences.

    What you do on Xenon is not something you will do on CELL, and vice versa.
     
  18. hugo

    Newcomer

    Joined:
    Dec 8, 2004
    Messages:
    172
    Likes Received:
    0
    Well, the topic of conversation, from what I observed in this forum, went from which was more powerful (RSX vs Xenos, then Cell vs Xenon) to which is easier to program for (Cell vs Xenon).

    The Cell's SPEs have their fixed 256KB local store limitation, but there are goods and bads to it. The bad thing is programmers would have to write code that makes exact use of the fixed amount of store when doing subroutines. OTOH, if you're going to write specialised code for each individual SPE, it's easier to troubleshoot problems and see your code laid out neatly.

    The Cell and Xenon are not worlds apart when it comes to programming. For every problem that exists on each CPU there are solutions to overcome it. But the performance crown would still have to go to the Cell. No denying that.
     
  19. dantruon

    Regular

    Joined:
    Apr 5, 2004
    Messages:
    487
    Likes Received:
    2
    Well, the topic of conversation for this thread is Itagaki, so why on earth are they talking about Cell and Xenon?
     
  20. Shifty Geezer

    Shifty Geezer uber-Troll!
    Moderator Legend

    Joined:
    Dec 7, 2004
    Messages:
    44,104
    Likes Received:
    16,896
    Location:
    Under my bridge
    Because Itagaki said PS3 was too complicated, so we debate the validity of that statement ;)
     