Physics on Cell

Discussion in 'Console Technology' started by London-boy, Mar 23, 2006.

  1. 3dcgi

    Veteran Subscriber

    Joined:
    Feb 7, 2002
    Messages:
    2,439
    Likes Received:
    280
    I think this is basically what it boils down to. Since a lot of cores are worthless if you can't keep them fed STI decided a change was necessary. Cell's success will depend on the number of algorithms ammenable to the local store memory architecture.
     
    #41 3dcgi, May 16, 2006
    Last edited by a moderator: May 16, 2006
  2. DarkRage

    Newcomer

    Joined:
    Jul 25, 2005
    Messages:
    70
    Likes Received:
    1
    Location:
    Spain
    Can we extrapolate any results from the disappointing results on the AGEIA PhysX hardware?

    http://www.anandtech.com/video/showdoc.aspx?i=2751

    Maybe an array of FPUs is not optimal for physics calculation without the other whistles of a CPU? There were lots of pointers usage, conditional jumps and other general purpose programming the last time I looked at how physics for games where calculated. But maybe it was just a workaround because real FPU power wasn't available...
     
  3. rounin

    Veteran

    Joined:
    Sep 21, 2005
    Messages:
    1,251
    Likes Received:
    20
    You tried making a new thread on this before but it was locked.

    The reason it was locked was probably because the issue was explained beforehand why the GRAW results do not show anything useful about how useful the processors are : http://www.beyond3d.com/forum/showthread.php?p=750491#post750491
     
  4. one

    one Unruly Member
    Veteran

    Joined:
    Jul 26, 2004
    Messages:
    4,834
    Likes Received:
    156
    Location:
    Minato-ku, Tokyo
    No we can't, that's old and already debunked.
    http://www.beyond3d.com/forum/showpost.php?p=750491&postcount=181
    http://www.beyond3d.com/forum/showpost.php?p=755575&postcount=52
     
  5. DarkRage

    Newcomer

    Joined:
    Jul 25, 2005
    Messages:
    70
    Likes Received:
    1
    Location:
    Spain
  6. one

    one Unruly Member
    Veteran

    Joined:
    Jul 26, 2004
    Messages:
    4,834
    Likes Received:
    156
    Location:
    Minato-ku, Tokyo
    What do you expect from this very limited test?
    http://www.anandtech.com/video/showdoc.aspx?i=2751&p=4
     
  7. Gubbi

    Veteran

    Joined:
    Feb 8, 2002
    Messages:
    3,581
    Likes Received:
    981
    A SPE (SPU+LS) takes up 15mm^2 in 0.09um.

    this shows a K8 core taking up 33% of 193mm^2 in 0.13um. The die size of a X2 in 0.09um is 199mm^2, die photos indicate the exact same ratio of SRAM cell size to logic transistor size , so around ~30-35 mm^2 for a K8 core. So core size is approx. a factor of 2 compared to a SPE.


    Who said anything about a (pure) SMP? Most SMP systems today are NUMA, - even small ones (2-4 CPUS), Opterons are by design and Xeon by way of chipset design.

    The fact that you can scale to 4 sockets with Opterons with external links with simple cache coherence broadcast queries should tell you that it is indeed very feasable to do with multiple cores on a die.

    And you only generate cache coherency traffic when you miss caches. PC processors has hit rates in L1 above 95% for a whole bunch of workloads, heck running demos from Aegeia's physics SDK nets you L1 hit rates above 98% on a K8.

    So, there is little gained (diminishing returns) in designing a single shared cache level that could sustain 8 L1 miss requests every cycle, that would indeed be a daunting task. Make it pseudo dual ported with 32 interleaved banks.

    The important thing
    .. is that memory coherence is automatic. When you have a whole bunch of threads scheduled on these cores, passing data from producers to consumers is simple: just store and load from the same chunk of memory (in a synchronized fashion of course), and the cache coherency protocol will automatically get the right data to the right spot.

    On CELL you have two choices:
    1. Manually orchestrate threads on SPEs
    2. Have a system automatically schedule threads on SPEs

    If you go by 1.) the solution is simple, your producing SPE will have á priori knowledge where to send data to (with DMA)

    If you go by 2.) you have to resolve
    a.) If both threads are scheduled (running), then where ? And then set up a DMA request.
    b.) If only one thread is scheduled put the message out in main memory

    Resolving source and destination here is done in software (memory coherence!).

    a) is bad enough.
    b) is disastrous, - having the message travel on the main memory bus twice.

    The fact that you have to resolve this in software means that there is *huge* overhead (latency) for each such transaction. To alleviate this you'd want to make your transactions bigger to trade bandwidth for latency, I used the word internal fragmentation, that was a poor choice. What I meant is the amount of data from such a transaction that ends up not being used; Like when you pack K-D trees in cache lines, you have N nodes, but only visit log_2(N), "wasting" N - log_2(N) nodes. This waste puts further pressure on your main memory system.

    So the alternative is to manually orchestrate threads on the SPEs, knowing exactly what is going on on each SPE at any given time. That may work now where you have 6 contexts on XCPU and 8-9 on CELL, but with 20 to 30 cores (and more) this really becomes a hard problem.

    XCPU has 6 hardware contexts but having more active contexts is expensive because context switching is costly. That is by implementation. The amount of hardware contexts could be increased, - or context switching could be accelerated so switching in software would be faster (ie. loading and saving a whole bunch of registers in a few cycles).

    CELL's SPEs has massively expensive context switching, but unlike XCPU it is not by implementation but by architectural design.

    So CELL is hostile from a multiprocessing programming paradigm POV.

    That's what makes me say it has a narrow window of opportunity and that window is now, in PS3.

    Cheers

    EDIT: typos
     
    #47 Gubbi, May 16, 2006
    Last edited by a moderator: May 16, 2006
  8. vliw

    Newcomer

    Joined:
    Mar 28, 2003
    Messages:
    43
    Likes Received:
    0
    Location:
    Somewhere in Space
    My impression is that to see next generation physics we will wait a lot of time because there are not many programmers/engineers capable to implement what we could call really "Next Generation".
    The difference between Ps2 and Ps3 yet is that you can move not 20 but 200 rag dolls and for me this is not next generation physics.

    About the assumption that Physics doesn't require floating point power, I don't think so, Ageia PPU is very similar to CELL architecture and there is a reason for that.
    Implementing stable solvers is not only difficult but requires fast floating poit processors.
     
  9. Gubbi

    Veteran

    Joined:
    Feb 8, 2002
    Messages:
    3,581
    Likes Received:
    981
    Does it matter how much? It's the inherently sequential part of the algorithm and as such sets the lower bound for execution time.

    However, for each hull you'll need to chase down a space decomposition structure to find what other hulls the specific hull might collide with. The amount of loads found here caused me to think that quite a lot of time is spent that way.

    And pointer chasing is pointer chasing, but I *was* thinking in terms of on die memory since this particular workload have virtually zero misses, and as such fit SPEs and regular CPUs equally well.


    You're right, that was a hasty comment. But a regular CPU would (or should) be as good as a streaming processor on streams if the proper non-temporal load and store controls are used.

    Of course you'd be able to fit more streaming processors into the same die area as a regular CPU so you point is certainly valid.

    Cheers
     
  10. Fox5

    Veteran

    Joined:
    Mar 22, 2002
    Messages:
    3,674
    Likes Received:
    5
    For that hypothetical many cored K8, how about cutting down cache size? Benchmarks show a K8 retains nearly full performance even when cut down to 256KB or 128KB of cache, so why not give each core a dedicated 128KB of cache, thus each core has full speed access to the cache, and you can fit a lot more cores on a single die.
     
  11. DarkRage

    Newcomer

    Joined:
    Jul 25, 2005
    Messages:
    70
    Likes Received:
    1
    Location:
    Spain
  12. nonamer

    Banned

    Joined:
    May 25, 2002
    Messages:
    564
    Likes Received:
    7
    Wait a minute, if you have 8 cores on a die, doesn't that necessate that they are SMP?

    From what I've read here: http://www-128.ibm.com/developerworks/library/pa-expert8/ the context switch is 20 microseconds because you have to swap out all local RAM. That's expensive, but not murderously slow enough as to being hostile to concurrent programming. Also, from what I understand you're not suppose to program the Cell in a manner similar to the Opteron, but rather you use simple, corpuscular work threads that won't access main memory much. If you want to do more than what one SPU can handle, you have to do something clever like pipelining the work into small pieces.
     
  13. Frank

    Frank Certified not a majority
    Veteran

    Joined:
    Sep 21, 2003
    Messages:
    3,187
    Likes Received:
    59
    Location:
    Sittard, the Netherlands
    Grubbi, you're still trying to prove that Cell is bad because it wouldn't run a program written for an OOO/unified memory processor at the same speed.

    Well, it won't. But that OOO/NUMA processor won't be able to run a program written with Cell in mind as fast either.

    Both have things they excel in, and a specialized design like Cell can perform much better running a program written to uses it's strengths than other current processors are able to, even if you write the program to use the strengths of those. Which is less common to happen, because we (you?) assume the hardware to sort it all out. Or want to run that program written 10 years ago, faster.

    But you need to change the way you think about it. Use the paradigm that fits the hardware. NOT demanding the hardware runs whatever (old or badly written) program faster, no matter what. Only mainstream processors like x86 CPUs need to be able to do that.

    And that's all there is to it.
     
  14. DemoCoder

    Veteran

    Joined:
    Feb 9, 2002
    Messages:
    4,733
    Likes Received:
    81
    Location:
    California
    Very high performance SPU programming is going to resemble GPGPU programming somewhat, except that the SPU is capable of scatter/gather operations within a 256k kernel. It sits kinda inbetween stream processing and parallel processing, depending on which model you want to do.

    Gubbi's memory argument works if you have to touch something more than once. For example, if you are performing a binary search on a large binary tree of N nodes, you will only touch log_2(N) nodes. You are probably still going to load and prefetch a lot ff data into the L2 cache. However, on a traditional processor with L2 cache, if you run the same search again, or touch nearby nodes, they will already be in the cache and you won't end up reloading everything from main memory again. (and the hot data you use will be pre-packed into the L2, leaving the rest free for other uses) The SPU running 2 binary searches on a say, a 10 megabyte tree, will end up streaming significantly more data from main XDR memory through the FlexI/O and into the SPU local memory. (unless the SPU code is written to do caching, but it will be at a disadvantage)

    But I happen to think programmers are smarter than that. For example, in tree navigation, it is most important to cache the upper hierarchy of the tree. That's why B-trees work so effectively for databases, since you can figure out how to navigate a billion records just by caching 1,000 nodes, and "pointer chasing" on the expensive disk falls down to just 2 disk accesses.

    It is likely that SPU programmers who need to navigate large trees, will keep a cache of the upper tree hierarchy so that they can calculate quickly which blocks of external memory they need, and load these subtrees in big blocks. How much they cache may be a factor of how much they need to minimize DMA requests and fit the resulting request in local memory. The more you cache, the less you can load, and therefore, you may need to load a given subtree in multiple blocks.

    SPU coding for sure will require a lot of thought to maximize performance. But I'm willing to bet we will be surprised.

    And yes, if AMD can ship 8 cores on a single chip at current clocks, and full SSE, then it will be an CELL killer from an ease of coding standpoint, but they'll have to do it within 4-5 years to affect CELL's market .

    I personally think that a suped up version of Niagara + MAJC would be better. Give me a general purpose chip which can run hundreds of threads plus a huge block of functional units rather than a chip with 8 oooe ILP cores.

    (I also like Azul's 48-cores on 90nm at 800mm+ :) BTW, it was founded by ex-3dfx ex-Nvidia Scott Sellers)
     
  15. Frank

    Frank Certified not a majority
    Veteran

    Joined:
    Sep 21, 2003
    Messages:
    3,187
    Likes Received:
    59
    Location:
    Sittard, the Netherlands
    Yes. But you're still assuming a global data model with a commonly used data structure, and so your answer is about how an architecture like Cell can do well in one of the worst scenarios we can think of. It can do much better than that, if you use a different data model. One designed with such an architecture in mind.

    Contrary to common wisdom, it might be a boon to duplicate your data for local access.

    Agreed. But think about this: after quad cores, both Intel and AMD want to split the functionality offered by the cores. Some might have OOO, some might have vector units, etc. Essentially, they want to make those cores asymmetric as well, they only want to retain the basic instruction set for all of them.

    And I'll wager, that many people won't be happy with having to use the x86 fp or SSE interfaces (stacks and mapping!) to access those functions, when they have seen how much easier it is to do it directly.

    Or... they might not. Because, who still uses assembly nowadays? How much overhead is acceptable if it makes it easier to code?

    But then again, you need different models in either case, or at least take into account that the same code executed on core X is much slower or faster than when executed on core Y...
     
    #55 Frank, May 17, 2006
    Last edited by a moderator: May 17, 2006
  16. DemoCoder

    Veteran

    Joined:
    Feb 9, 2002
    Messages:
    4,733
    Likes Received:
    81
    Location:
    California
    Um, no, nothing in my previous post makes that assumption. The only assumption that is made is that a tree data structure is used. Global or non-global is orthogonal, since I said nothing about whether there are multiple copies in XDR or that the SPUs are performing updates.

    The issue is, how much of the tree can fit into local memory. If you need to do a tree search, and the entire tree cannot fit into local memory, than it has to page in external memory. This holds tree whether you consider multiple SPUs or not. Even if you took a tree, and divided it into 7 subtrees as a forest, and gave each one to an SPU, you still might have to stream in data.

    An example of such a large tree would be a BSP for the static geometry of the level, which allows very efficient world-object collision detection. This is going to exceed the combined memory of all local stores easily.

    For object-object collisions, you might want to give each object its own tree (bounding spheres, axis aliented bounding boxes, or oriented bounding boxes). Now the problem is, for each dynamic object, you've got a tree, and ideally, you want to group them into a forest based on locality, and then pack those into an SPU local storage for detection and physics calculations.

    So you need a datastructure which efficiently groups objects (and attached trees and other physics data ) by spatial locality *AND* is memory local (so can be easilly fetched into an SPU by a stream request instead of a gather operation) *AND* must be efficiently updatable so that that as objects move, the amortized cost to keep this data structure in sync is not too expensive.

    <BrainStorm mode on> I think I'd want the PPE to decompose the physics problem as so:

    1) find all spatial locales that will be updated (e.g. because some event is going to trigger something to move or apply a force) (could potentially be parallelized). Locales with no updates may or may not neccessarily require simulation based on whether or not the developer wants to simulate that which is too far away to see. In other words, if a tree falls in a forest and no one was there, did it happen? :) Some people may not like this and may want the world to keep going when they aren't watching.

    2) for each spatial locale, do a quick trivial check to see if any thing needs to be done (no need to upload data to the SPU and have it do work, if the check can be done faster than the transfer), else no collisions so proceed directly to time-step/constraint solver (SPU parallelizable)

    3) if collision detection needs to be done, spatial local structures must be streamed to one or more SPUs. SPUs must compute collisions, and use that to feed into constraint solver (depending on how you resolve collisions and interpenetration) Here you may run into a problem if the local spans multiple SPUs since they need to communication this information.

    4) spatial datastructures must be updated. Object bounding trees need to be recomputed if an object changed (bounding sphere/hull could be bigger or smaller). And the spatial datastructure which clusters together these objects in space must be updated if an object has moved.

    I don't see a way to avoid not having a global data structure. Even if all SPUs kept a copy that was local for them (wither in SRAM or XDR), an SPU would still need to be able to determine which SPU "owned" a given spatial locale, so that if an object crossed a spatial boundary, that SPU would have to update its local tree. The cost of doing this may exceed just having a single tree for everything that has node-level locks on certain subtrees.

    Let's say you were storing hierarchical bounding spheres per object. A sphere is determined by position and radius, let's say XYZW (32-bits) per sphere. The sphere must also have an internal list of subspheres. For sake of simplicity, let's say we compute subspheres down to a fixed 4 levels deep and store in a fixed array addressing scheme. That's 31 * 4 ~ 128 bytes per object of heirarchical bounding volumes in the absolute simplist case. Now assume we have 1,000 objects on the screen at once. That would consume 50% of a single SPU's local storage, without even considering geometry or other physics information (velocity, inertial tensor, etc)

    If we consider 1,000 army men each with a character model that has 10,000 vertices, you're talking 10M vertices. If instead, we just geometry instancing, you still need to store position, bone position, and physics for those 1,000 actors.

    Then, if those 1,000 guys were all running at you, you'd need to update 1,000 hierarchical bounding sphere trees and 1,000 set of bones, and 1,000 sets of physical properties (more, considering all the limbs each have constraint forces ).

    On top of that, when all is said and done, it has to be written back to XDR (so that the SPU can work on other data), and it must be written in a format which is stream friendly to write, and stream friendly to read, otherwise you'll be performing scatter-writes with DMA.

    One possibility is if the PPE is good at scatter writes (I profess ignorance), the SPUs can stream a work list of memory locations to update plus the data. The PPE can then batch them into memory write friendly bunches and execute the scatter updates.

    The problem with the SPUs isn't that physics can't be reasonably localized. The problem is, the paucity of their local memory dictates that they must write and read from external memory, and they are very bad at doing and absorbing latency.
     
  17. Frank

    Frank Certified not a majority
    Veteran

    Joined:
    Sep 21, 2003
    Messages:
    3,187
    Likes Received:
    59
    Location:
    Sittard, the Netherlands
    Yes, but you're still assuming a generic, normalized data structure, that only stores any bit of data once. That's what normalizing your data is all about.

    But consider this: like in a distributed data model, you might need local copies, that have to be synchronized. Like the difference between cache memory and local storage, but for your data structure instead of the heap.

    So, instead of assuming that the global data model is the only current and correct one, you could use that as a kind of cache: you can assume the data model is correct, but not the actual values. Because they can reside somewhere in local storage. Or in some object that has changed recently or has dependencies.

    When you would want to access data, you search for the descriptor in the tree, but then you look and see where the current copy of the value resides. Which might be in some object, which is currently being processed somewhere.

    Even worse: it can be part of two local data spheres, both of which are updating it! So, you need a mechanism to handle conflicts as well. Which can be done by reprocessing the interaction, as all it means is, that there are multiple interactions occuring on it. Multiple events, same data.

    So, instead of tracking down and updating that global data all the time, we assume it has multiple live copies all the time. And in the case we made a decision based on it's exact, current state (read or write), we flag and reprocess it.

    That way, you still have a global data structure, but it doesn't have to be current. It offers a snapshot. And, of course, we write all the values back when processing ends.



    To do all that, you need a mechanism to synchronize all your data, and you can handle conflicts by reprocessing it. Not hard to do, I do those things all the time for servers. So, no need for negociating, and no need to access the global store all the time for indirect data. Only to look up the descriptor and usage statistics.

    And only the management functions have to do all that. Because your data is time-stepped by a fixed interval, all processing in between can be done by collecting the available global data once, and flagging the changes. You only have to look for conflicts on write back, and reprocess those before updating.

    And, of course, you need a mechanism to handle deadlocks and loops. But all of those can be done. Using a transaction model would be the simplest solution, only writing changes as updates would handle locking conflicts. And for games, you can just disregard updates when needed, you're not handling exact data after all.



    Edit: creation of new objects should be written to global memory immediately, and send a message to all processes. That might be costly.
     
    #57 Frank, May 17, 2006
    Last edited by a moderator: May 17, 2006
  18. DemoCoder

    Veteran

    Joined:
    Feb 9, 2002
    Messages:
    4,733
    Likes Received:
    81
    Location:
    California
    Denormalizing data does nothing to help you. You are too fixated on relational database approaches, and relational database approaches are not good for solving every problem.

    I think you are completely misunderstanding the problem discussed. It has nothing to do with multiple readers/writers accessing contended memory or serializing access to a shared resource. It has to do with avoiding touching external memory, and in doing so, doing it in a stream oriented (nice DMA friendly blocks) instead of scatter/gather way. That problem has to be solved first.

    The SPUs do not have enough memory to avoid touching external memory for the games with the detail we are talking about. I like RDBMS, but it simply doesn't fit here. RDBMS utilize significant amounts of RAM to avoid touching the disk unless forced to (full table scan). An SPU is like a database server with 256k of RAM that must contend with possibly a 128megabyte database, and where many of your algorithms require a full table scan.

    All of the optimistical concurrency stuff you keep bringing up doesn't solve the performance problem. Forget about the other 6 SPUs for a moment, and forget about the PPE. Solve the performance problem for a single SPU first before you start worrying about how broadcast, multicast, or unicast updates to global data are handled.
     
  19. Frank

    Frank Certified not a majority
    Veteran

    Joined:
    Sep 21, 2003
    Messages:
    3,187
    Likes Received:
    59
    Location:
    Sittard, the Netherlands
    I had the PPE in mind as manager. SPEs can communicate with it directly without using DMA, right? And the PPE can access global memory randomly without large penalties.

    I was constructing an approach that limits random memory access to the managing process, while allowing the use of global data structures. Which is what my suggestion does. If you want to access other data but what is in your local store, you ask the manager. Read or write.

    You do a second (or more) pass(es) to handle the conflicts/transaction dependencies.

    And I would suggest, that you build your data structure with that in mind, not simply normalizing it.
     
  20. patsu

    Legend

    Joined:
    Jun 25, 2005
    Messages:
    27,709
    Likes Received:
    145
    Disclaimer: I'm not familiar with the game engine pipeline (high level but not deep), so take it for what it's worth :)

    I would avoid touching the PPE as much as possible since it's running the main game loop. Any stall in the PPE may have a direct visible impact to the system.

    I would strip down any global structures as much as possible so that it's very compact and mostly act as a "directory" to the real meat. The real meat will contain different and sometimes multiple representation of the actual data (At times you optimize for space constraints, locality, or other things depending on needs and effort to [re]generate them).

    Because if the global structure is small and high-level enough, it is presumably more stable and may even fit into all the SPEs (depends on actual situation). That minimizes the communication between the processors. Global accounting info about "who handles what" can also be tagged onto the global structure and disseminated and sync'ed relatively efficiently (This is relatively large grain parallelism). I'm not certain if this is possible in the gaming world.

    Then for the "meat", I would allocate specialize SPUs to handle different aspects:

    + For tasks that throughput matters more, I may continue to use just 1 full SPU for it. Optimization strategy would be based on NUMA, overlapping DMA requests, and if possible the SIMD engine. Here we don't have to worry about concurrency control, just worry about keeping the dedicated SPU well-fed.

    + For tasks that require short latency, short update cycles, high interactivity *and* "always there", I may employ more fine grain parallelism (e.g., allocate > 1 SPU for it). I'm still not certain if this is a good idea, some paper calculation on the cost-benefits *must* be done first before diving in.

    In general, I dislike anything that's fixed to spartial partitioning (for dividing SPU responsibility) because you may lose control if everybody crowd in 1 area. I'm more in favor of the job queue model but the average cost is higher (but still much better than current/last gen console). In this model, the data can be packed based on spartial locality (like what DemoCoder mentioned), but the actual SPUs handling it may be dynamically assigned where it makes sense.

    We have seen a few distributed/replicated data structures and lazy evaluation approach discussed so far, all of them seem valid in the current async communication model (Of course the devil is in the details). The secondary objective here is to avoid synchronization/locking as much as possible.

    If we run into problems requiring constant communication between SPUs, it may be better to take a step back and reorder/reorganize the pipline, recalculate the values, serialize the program, or use estimated values (to avoid communications, not sure whether possible).

    If regular communnication is unavoidable and we absolutely must use > 1 SPU for the same task (I wonder why), then I would build a custom but general communication mechanism to batch data transfer where appropriate. Sony probably already provides a library for inter-SPU communication.

    It would be nice if we know what RSX is though.

    EDIT: I got a "Develop Magazine" from E3 2006. It covers an interview with Phil and also Devstation 2006 for Europe. It recommends "Patterns for Parallel Programming" by Mattson and Massingill for reading if you are interested. I may get hold of a copy myself :)
     
    #60 patsu, May 17, 2006
    Last edited by a moderator: May 17, 2006
Loading...

Share This Page

  • About Us

    Beyond3D has been around for over a decade and prides itself on being the best place on the web for in-depth, technically-driven discussion and analysis of 3D graphics hardware. If you love pixels and transistors, you've come to the right place!

    Beyond3D is proudly published by GPU Tools Ltd.
Loading...