ppe memcpy

Discussion in 'CellPerformance@B3D' started by c94wjpn, Dec 7, 2007.

  1. c94wjpn

    Newcomer

    Joined:
    Nov 7, 2007
    Messages:
    8
    Likes Received:
    0
    What is the fastest way to get the PPE to copy a big (1 MB) block from one memory location to another? This does not involve SPU memories. I'm using memcpy at the moment and it appears to manage 700 MB/s, which I think is phenomenally slow.

    I mean, a rough estimate is that it should manage to read or write a register every 2 clock cycles. If a register is 4 bytes, then it should manage about 6 GB/s, so I've lost a factor of 10 somewhere.
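
    As a quick check of that estimate, here is a minimal sketch in C. The 3.2 GHz clock is an assumption (the PS3's PPE clock; the post does not state it), and `estimate_bytes_per_sec` is a made-up helper name. It ignores memory latency and cache behaviour entirely, so it is only an upper bound on paper.

```c
#include <assert.h>

/* Rough peak transfer rate in bytes per second, assuming one
 * register-sized transfer completes every `cycles_per_transfer`
 * clock cycles. All names here are illustrative. */
static double estimate_bytes_per_sec(double clock_hz,
                                     double bytes_per_transfer,
                                     double cycles_per_transfer)
{
    assert(cycles_per_transfer > 0.0);
    return clock_hz / cycles_per_transfer * bytes_per_transfer;
}
```

    With an assumed 3.2e9 Hz clock, 4 bytes per transfer and 2 cycles per transfer, this gives 6.4e9 bytes/s, roughly nine times the measured 700 MB/s.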

    Can I use DMA hardware to move memory around within the PPE's memory?

    Any ideas? If you don't have anything helpful to say, please don't reply.
     
  2. patsu

    Legend

    Joined:
    Jun 25, 2005
    Messages:
    27,614
    Likes Received:
    60
    Have you checked these out?

    "PPE RAM Latency" discussion on IBM Cell forum
    http://www-128.ibm.com/developerworks/forums/thread.jspa?threadID=163046
    (Can't you push [all] the work to the SPUs?)

    libfreevec (http://www.freevec.org)
    "libfreevec is a free (LGPL) library with hand-optimized replacement routines for GLIBC, such as memcpy(), strlen(), etc. These routines have been written specifically to take advantage of the AltiVec unit (a.k.a Velocity Engine or VMX), and will only work on processors that include this unit. This means they will not work on older processors, such as 603, 604, 750 (G3) or the POWER family of CPUs."
     
  3. c94wjpn

    Newcomer

    Joined:
    Nov 7, 2007
    Messages:
    8
    Likes Received:
    0
    This has confused me completely. According to my IBM docs, the latency of stqd (store) is 6 cycles, and the same for lqd (load). The discussion seems to be talking about RAM latencies of 10x that, and suggests that the RAM is on a slower clock than the Cell. What memory caches does the Cell have and how do I control them?

    Could someone clarify the situation with a few meaningful sentences in English?
     
  4. ADEX

    Newcomer

    Joined:
    Sep 11, 2005
    Messages:
    231
    Likes Received:
    10
    Location:
    Here
    It depends on how you're trying to move the memory.
    If you're reading a long or a long long per loop iteration, it won't be very fast.

    It's best to use vectors and use them all in one go.
    Declare 32 vector variables then setup a loop.

    per iteration:
    Use a pointer to fill them all.
    Use another pointer to empty them all.
    You can use a second thread to do the same thing with another part of the memory block.

    That should be a lot faster but be careful of unaligned memory at the beginning and end.
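
    A minimal sketch of that loop shape in portable C, under some assumptions: `vec16` is a hypothetical stand-in for a 16-byte vector register (on the PPE you would use AltiVec's `vector unsigned char` with `vec_ld`/`vec_st` instead), `copy_unrolled` is a made-up name, and the unroll factor is 8 rather than 32 to keep it short. Only the destination is aligned here; a real AltiVec version would also have to deal with source misalignment.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Portable stand-in for one 16-byte vector register (hypothetical). */
typedef struct { unsigned char b[16]; } vec16;

enum { UNROLL = 8 };  /* the post suggests 32; 8 keeps the sketch short */

static void *copy_unrolled(void *dst, const void *src, size_t n)
{
    unsigned char *d = dst;
    const unsigned char *s = src;
    assert(n == 0 || (d != NULL && s != NULL));

    /* Head: copy single bytes until the destination is 16-byte aligned. */
    while (n > 0 && ((uintptr_t)d & 15u) != 0) {
        *d++ = *s++;
        n--;
    }

    /* Body: fill all the temporaries, then empty them all. */
    while (n >= sizeof(vec16) * UNROLL) {
        vec16 v[UNROLL];
        for (int i = 0; i < UNROLL; i++)
            memcpy(&v[i], s + 16 * i, 16);   /* the "fill" pass */
        for (int i = 0; i < UNROLL; i++)
            memcpy(d + 16 * i, &v[i], 16);   /* the "empty" pass */
        s += 16 * UNROLL;
        d += 16 * UNROLL;
        n -= 16 * UNROLL;
    }

    /* Tail: whatever is left over after the last full block. */
    while (n > 0) {
        *d++ = *s++;
        n--;
    }
    return dst;
}
```

    The small fixed-size memcpy calls are there to keep the sketch free of alignment and aliasing problems; compilers normally turn them into plain loads and stores.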

    It'll be easier and probably faster on an SPE though.

    Also, look at the LibFreeVec lib.


    BTW, check how your timing works; I find compilers tend to mess with this no end.
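
    On the timing point, a rough sketch of a measurement loop, assuming a POSIX system with `clock_gettime`. The names `now_sec` and `measure_copy_rate` are made up for illustration. The volatile read after each copy is meant to discourage the compiler from dropping the copy; it is not a guarantee.

```c
#define _POSIX_C_SOURCE 199309L
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <time.h>

/* Monotonic wall-clock time in seconds. */
static double now_sec(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (double)ts.tv_sec + (double)ts.tv_nsec * 1e-9;
}

/* Copies n bytes `reps` times and returns the measured bytes per second
 * (0.0 if the clock did not advance). One byte of the destination is
 * read back through a volatile pointer after each copy so the result
 * of the copy is actually used. */
static double measure_copy_rate(unsigned char *dst, const unsigned char *src,
                                size_t n, int reps, unsigned *checksum_out)
{
    assert(n > 0 && reps > 0);
    unsigned checksum = 0;
    double t0 = now_sec();
    for (int r = 0; r < reps; r++) {
        memcpy(dst, src, n);
        checksum += *(volatile unsigned char *)(dst + (size_t)r % n);
    }
    double t1 = now_sec();
    *checksum_out = checksum;
    double elapsed = t1 - t0;
    return elapsed > 0.0 ? (double)n * (double)reps / elapsed : 0.0;
}
```

    Printing or otherwise using the checksum afterwards, and running enough repetitions that the elapsed time is well above the clock resolution, both help keep the number honest.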
     
  5. seebs

    Newcomer

    Joined:
    Nov 29, 2007
    Messages:
    44
    Likes Received:
    0
    Location:
    Minnesota
    I'm a bit unclear on how the SPE would be doing it; just DMA to and from local buffers?

    You'd think that it'd be possible to trick the DMA engine into copying from memory to memory. :)
     
  6. Laurent06

    Regular

    Joined:
    Dec 14, 2007
    Messages:
    715
    Likes Received:
    33
    I managed to get about 4 GB/s memcpy using only the PPE.

    The trick is to manually unroll 64-bit (or was it 128-bit? I need to turn on my PS3 :)) ld/st instructions and insert cache preload instructions.
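
    This is not the original code, just a sketch of the general idea in portable C, with assumptions labelled: `copy64_prefetch` is a made-up name, GCC's `__builtin_prefetch` stands in for the PPE's cache-touch instructions, the 128-byte block matches the PPE cache line size, and the prefetch distance of four lines is a guess that would need tuning.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

enum {
    LINE = 128,               /* PPE cache line size in bytes */
    PREFETCH_AHEAD = 4 * LINE /* assumed distance; tune by measurement */
};

static void copy64_prefetch(void *dst, const void *src, size_t n)
{
    unsigned char *d = dst;
    const unsigned char *s = src;
    assert(n == 0 || (d != NULL && s != NULL));

    while (n >= LINE) {
#if defined(__GNUC__)
        /* Hint that a later source line will be read soon. Guarded so
         * the address stays inside the source buffer. */
        if (n > PREFETCH_AHEAD)
            __builtin_prefetch(s + PREFETCH_AHEAD, 0, 0);
#endif
        /* One cache line per iteration: 16 64-bit loads, then 16 stores. */
        uint64_t w[LINE / 8];
        for (int i = 0; i < LINE / 8; i++)
            memcpy(&w[i], s + 8 * i, 8);
        for (int i = 0; i < LINE / 8; i++)
            memcpy(d + 8 * i, &w[i], 8);
        s += LINE;
        d += LINE;
        n -= LINE;
    }

    /* Tail: remaining bytes, one at a time. */
    while (n > 0) {
        *d++ = *s++;
        n--;
    }
}
```

    Whether this gets anywhere near 4 GB/s on a real PPE depends on the compiler's output and the prefetch distance, so it needs to be measured rather than assumed.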
     
  7. popper

    Newcomer

    Joined:
    Jul 22, 2006
    Messages:
    69
    Likes Received:
    3
    (That's not very nice.)

    Go and ask and comment on this thread; Gunnar, Markos and others can help readers, as can you and perhaps many readers here.

    http://www.powerdeveloper.org/forums/viewtopic.php?t=1426&postdays=0&postorder=asc&start=0
    Gunnar PPC linux patch

    And don't forget to contribute your code, comments, test results, etc. on this thread too:
    http://www.powerdeveloper.org/forums/viewtopic.php?t=1453

     
    #7 popper, Dec 15, 2007
    Last edited by a moderator: Dec 15, 2007
  8. popper

    Newcomer

    Joined:
    Jul 22, 2006
    Messages:
    69
    Likes Received:
    3
    Perhaps you could go and post a bit of working code and describe what you did, which aspect of copying you tested, test results, graphs, etc. on the http://www.powerdeveloper.org/forums/viewtopic.php?t=1426&postdays=0&postorder=asc&start=15 thread.

    It all helps increase the speed of take-up if people contribute their findings.
     
  9. c94wjpn

    Newcomer

    Joined:
    Nov 7, 2007
    Messages:
    8
    Likes Received:
    0
    OK, thanks for the advice, folks. The impression I get is that the memcpy I am using is not designed for this hardware and I need to write my own function or use libfreevec.
     
  10. popper

    Newcomer

    Joined:
    Jul 22, 2006
    Messages:
    69
    Likes Received:
    3
    It's designed, or more like ported, and so functions as required on PPC, but it needs people to put the time in and improve its speed for everyone's benefit. As per the links, you can follow along, help out, and receive help if you're so inclined; it's your choice.

    Sure, libfreevec will help you, as will talking with Gunnar and Markos if you're after understanding the best code to use in your own functions. In the process you might have some fun and help improve general PPC Linux speed for all users in time.

    Just read the thread to get the idea of how they do it, and ask questions. Markos will probably have you test the pre-1.0 release, or you might prod him to get stuck in and add the latest ideas coming to light (if you have an idea to improve it, share it), increasing speed even more.

    People's code snippets and benchmarks for better PPC/PS3 performance are always welcome there, and here too, when people wake up and post interesting ideas, thoughts, etc.

    However, it's been very quiet around here and there of late. Is everyone hibernating for the winter? ;)

    It's a shame, but other than the Amiga PPC lads, no one seems interested in tuning or patching the current PPC/PS3 Linux and apps for potentially far better speed. It's going to take a long time at this rate if we can't find new people who enjoy the challenge and have some fun with it... Perhaps this Christmas will be different, and new people can find some time to help out, or just hang out, teach, and drop some advanced working code snippets...
     
    #10 popper, Dec 17, 2007
    Last edited by a moderator: Dec 17, 2007
