So, 1 PE after all, or is this just for GDC 2005?

Discussion in 'Console Technology' started by Panajev2001a, Jan 14, 2005.


So, 1 PE after all, or is this just for GDC 2005?

  1. Yes, the PlayStation 3's CPU will be 1 PU + x SPUs/APUs.

    100.0%
  2. No, this is only the CPU they are describing at GDC: the final CPU of PlayStation 3 will have more PEs.

    0 vote(s)
    0.0%
  3. "Ah, excuse me... but I am me, and you... you are worth nothing" --Il Marchese del Grillo.

    0 vote(s)
    0.0%
  4. This, like the last option, is a joke option... do not choose it.

    0 vote(s)
    0.0%
  1. Panajev2001a

    Veteran

    Joined:
    Mar 31, 2002
    Messages:
    3,187
    Likes Received:
    8
The PowerPC 3xx family does not officially exist yet, so it is still a clean-sheet design :p.
     
  2. Panajev2001a

    Veteran

    Joined:
    Mar 31, 2002
    Messages:
    3,187
    Likes Received:
    8
I do not think program+data will be limited to the 128 KB of Local Storage each APU has. We have seen several patents: one talks about using overlays to divide code and data into modules with shared data, with the main "loader/manager" module always loaded in the Local Storage; another describes a shared L1 cache for the APUs in a PE that is hit before the APU goes to main RAM, and the same patent suggested the CPU's L2 might be readable by the APUs, IIRC.

Do you really think programmers will have to manually segment and upload code and data to the APUs, without any help, just like you do on the Vector Units of the Emotion Engine?
     
  3. version

    Regular

    Joined:
    Jul 27, 2004
    Messages:
    452
    Likes Received:
    5


Local storage is a stupid design; a small cache would be fine.
But what size is the address register in the APU? 16-bit? 32-bit?
A ray-tracing engine must have access to a big memory space :)
     
  4. Fafalada

    Veteran

    Joined:
    Feb 8, 2002
    Messages:
    2,773
    Likes Received:
    49
External memory references are done through DMA - and according to the patents, APU DMA controllers have their own TLB, so technically you could "see" the entire virtual address space of the machine from the APU.
    That big enough? :p
     
  5. version

    Regular

    Joined:
    Jul 27, 2004
    Messages:
    452
    Likes Received:
    5
Yes, but the virtual address is a base; the offset is in the APU's register.
The Cell patent is not perfect: if IBM used a cache for the APU instead of LS, what would the DMA be doing?
     
  6. Fafalada

    Veteran

    Joined:
    Feb 8, 2002
    Messages:
    2,773
    Likes Received:
    49
    Oh, you were thinking of APU with no local storage at all? Personally I don't think that's likely to happen.
     
  7. version

    Regular

    Joined:
    Jul 27, 2004
    Messages:
    452
    Likes Received:
    5
Local storage is not ideal for programming, costs too many transistors, and is a problem for Cell's evolution.

A 16 KB cache would be the perfect design for the APU.
     
  8. ERP

    ERP
    Veteran

    Joined:
    Feb 11, 2002
    Messages:
    3,669
    Likes Received:
    49
    Location:
    Redmond, WA
    I can't really comment on this.
But think about worst cases: I can write a single piece of code that would require multiple overlays to complete a single iteration, or I could just split the code up across the address space to do basically the same thing.
I think that once you get to explicit DMA as your transfer mechanism, you're going to be chunking your code and data manually.
     
  9. nAo

    nAo Nutella Nutellae
    Veteran

    Joined:
    Feb 6, 2002
    Messages:
    4,400
    Likes Received:
    440
    Location:
    San Francisco
I don't care if the Xenon CPU or PS3 PU(s) are a clean-sheet design or not (whatever that means).
I think game developers care about CPUs that are fast on general-purpose code, that are easy to program for, and that expose 'custom' functionality that can boost special computations.
Even if I don't know a single thing about the Xenon CPU, I'm sure IBM did a great job. They have the know-how, very smart people... and Microsoft funding ;)

We have already debated ten times the memory-latency problem and how SPUs don't seem to be designed to hide very big latencies (such as the latencies that can appear while doing texture sampling).
Obviously the STI guys are well aware of this kind of problem; nevertheless, they decided NOT to go down the fine-grained multithreading route.
I believe they preferred to have an array of very powerful and flexible stream processors under the control of a PU, rather than a set of (bigger) multithreaded processors. For a given process, the STI/CELL choice should give us more (theoretical) flops per die area.
This is from Mr. Billy Dally's work: "Stream Processors vs. GPUs"


DeanoC wrote a good summary and moreover added a good number of educated guesses, mostly driven by good common sense. So even if we end up doing texture sampling on the SPUs, in the common case texture work will be addressed by the GPU in its pixel-shader engines. SPUs will provide a very powerful base for vertex processing and just about everything else we can think of (yeah... MRM too ;) ), as we know SPUs can address external memory and have several mechanisms to alleviate latency problems.
I'm quite excited to learn the final Xenon and PS3 specs (that's a hint for someone!! :D ), but I'm even more excited to think about how to exploit all that power in novel ways! :)



What's the big deal? We can do the same on every PC... or on the XBOX; just switch the vertex shader every couple of drawindexedprimitive() calls ;)
Having code and data overlay support doesn't mean you (as a good programmer) are going to act as if local mem were infinite. In the end you would sort per overlay switch, but it would still be much simpler for the programmer than doing manual code/data chunking (I hate doing that on VU0/VU1), imho.

    ciao,
    Marco
     
  10. ERP

    ERP
    Veteran

    Joined:
    Feb 11, 2002
    Messages:
    3,669
    Likes Received:
    49
    Location:
    Redmond, WA
I'm talking about the case where a single piece of code exceeds the local memory, as is the case with the interpreter for a scripting language once coupled with its global data.

Sure, you can organise a lot of code to run using overlays, but if you don't have good code locality per piece of data, and the data must be processed sequentially, then you have a problem.

    Running scripts was just an example that happens to fit that model, and one I happen to have looked at recently.

You can construct other examples: almost any large data structure that doesn't have an obvious spatial locality but exhibits good locality over time.

I'm not saying you can't solve these problems on a Cell-like architecture, just that solutions tailored for a more general shared-memory architecture aren't going to port easily.
     
  11. DeanoC

    DeanoC Trust me, I'm a renderer person!
    Veteran Subscriber

    Joined:
    Feb 6, 2003
    Messages:
    1,469
    Likes Received:
    185
    Location:
    Viking lands
Sorry, this is a bit OT but probably relevant enough...
It's true that a strict stream processor may have higher theoretical flops, but in practice the other architectures may well have higher real performance. Also, the author of that paper is slightly out of date with regard to GPU architectures; they are much more complex than the simple version he showed.
It's similar to the RISC/CISC arguments of yore: in practice, hybrids that borrow bits of both will likely win in the long term. Even looking at a current GPU (the NV40) we see a far more complex situation: it has a MIMD vertex shader extremely similar in design to a stream processor or Cell, combined with an array processor for the pixel shaders. With a few simple modifications (i.e. letting the vertex shader output to memory directly) you could choose whichever type of processor best fits the job. It will be interesting to see if PS3 keeps a similar design: Cell APUs for streaming work and NVIDIA pixel shaders for more random-access vector processing...

I think it's not the specialist processing (graphics etc.) but the normal stuff that ERP is trying to fit on the APUs. Generally our game code is very 'pretty', with lots of virtuals and indirections, designed for maintainability and quick changes rather than fast processing. By 'pretty' I mean high-level software-engineering principles: interfaces, encapsulation and such.
I suspect we will end up just running all this stuff on the main PU...
     
  12. nAo

    nAo Nutella Nutellae
    Veteran

    Joined:
    Feb 6, 2002
    Messages:
    4,400
    Likes Received:
    440
    Location:
    San Francisco
That's why I wrote about theoretical performance :)
Common sense says fine-grained multithreaded architectures should achieve higher numbers more easily than a stream architecture, but in the long run (I'm thinking about the life cycle of a console...) developers should be able to extract more flops per die area from a stream architecture, at least in some specific applications :)

Then you should read one of NVIDIA's latest patent applications... ;)

    Let's see how they want to address the syncing and feedback problems..

That's what most first- or second-(next-)generation games will do.
Saying APUs will not efficiently run scripting code is true... but well... who cares? I don't ;)
It's really too easy to say that an application completely unsuited to a specific architecture will run badly on that architecture.
Vertex texturing will probably run very slowly on the Xenon CPU's custom vector units... you know... :)

    ciao,
    Marco
     
  13. Dave Baumann

    Dave Baumann Gamerscore Wh...
    Moderator Legend

    Joined:
    Jan 29, 2002
    Messages:
    14,090
    Likes Received:
    694
    Location:
    O Canada!
Why would you even want to run it on the CPU?
     
  14. nAo

    nAo Nutella Nutellae
    Veteran

    Joined:
    Feb 6, 2002
    Messages:
    4,400
    Likes Received:
    440
    Location:
    San Francisco
That's the point, Dave... I don't want to run that stuff on the CPU. A CPU is not designed to run texture-sampling code efficiently, just like we don't want to run script code on the SPUs. Just because one can do it doesn't mean it's a good idea to actually do it.
Imho, it's nonsense to complain about this kind of 'problem'.

    ciao,
    Marco
     
  15. psurge

    Regular

    Joined:
    Feb 6, 2002
    Messages:
    955
    Likes Received:
    52
    Location:
    LA, California
It occurred to me that one reason you might go with LS rather than a cache for a Cell APU is predictable latency.

    If the LS were replaced with cache, then all memory access would have extremely unpredictable latency, so hardware support for SoEMT would be required.

However, if you split memory accesses into LS and DMA, then AFAICS the only really high-latency event that can happen is a DMA access. Instruction ordering would hide most of the execution/LS latency, and the programmer could perform SoEMT manually by separating code into chunks ending with a DMA access, then looping across independent chunks before using the results of a transfer.

I believe it was mentioned that an APU has access to a very large number of registers (128?) and can have up to 32 outstanding DMA transfers. If this is true, then I think it is incorrect to say the APU has poor latency-hiding mechanisms; it's just that the mapping of registers/LS to threads is left up to the programmer (or potentially a specialized compiler).

    Serge
     
  16. Xenus

    Veteran

    Joined:
    Nov 2, 2004
    Messages:
    1,316
    Likes Received:
    6
    Location:
    Ohio
  17. fxtech

    Newcomer

    Joined:
    Apr 23, 2003
    Messages:
    77
    Likes Received:
    5
Mmm, let's see...

The headline specs for the chip produced by Toshiba (the memory chip in PS3?):

4000 Mbit/sec per pin

and 512 Mbit per chip, with a 16-bit bus.

Let's assume a 4-chip, 4-channel, 16-bit-bus configuration: we get a pool of 256 MB at 32 GB/sec.

OK, so are these the final specs of the external RAM pool on PS3?
     
  18. version

    Regular

    Joined:
    Jul 27, 2004
    Messages:
    452
    Likes Received:
    5

If the DMA is reading or writing LS data, the APU stalls until it is done!
To maximize performance you must divide the LS into 2, 3 or 4 segments so the system can work in parallel.
A segment of about 30-50 KB... this is a NIGHTMARE!
     
  19. psurge

    Regular

    Joined:
    Feb 6, 2002
    Messages:
    955
    Likes Received:
    52
    Location:
    LA, California
    version - I'm not sure I understand what you mean - how exactly would this stall the APU? Do you mean by occupying the read/write port to the LS, or...?

    Do you just mean that segments of the LS must be assigned to different "threads"? In any case, I'm not proposing to use a system like this for large general purpose codes (like a Java VM). On the other hand, it should be able to handle latency incurred by point sampling a vertex texture, for example.

    Even if the code is fairly nightmarish, I would expect this to be worth it for some of the core algorithms (e.g. adaptive subdivision of a subdivision surface with displacements).

    Serge
     
  20. version

    Regular

    Joined:
    Jul 27, 2004
    Messages:
    452
    Likes Received:
    5
You know PS2's VU1? It's similar to that.
     