Larrabee vs Cell vs GPU's? *read the first post*

Discussion in 'GPGPU Technology & Programming' started by rpg.314, Apr 17, 2009.

  1. rpg.314

    Veteran

    Joined:
    Jul 21, 2008
    Messages:
    4,298
    Likes Received:
    0
    Location:
    /
    Ok, I didn't get it.
     
  2. rpg.314

    Veteran

    Joined:
    Jul 21, 2008
    Messages:
    4,298
    Likes Received:
    0
    Location:
    /
    And dedicated DMA engines: how are they likely to help the x86 cores? They have no software-lockable cache. Even if you prefetch a large amount of data, it will likely be evicted quickly.
     
  3. MfA

    MfA
    Legend

    Joined:
    Feb 6, 2002
    Messages:
    6,805
    Likes Received:
    473
    Well, one of the suggestions has been to dedicate one of the 4 threads to gather operations; if it gets an equal share of the running time, that's a lot of cycles during which nothing gets issued to the vector ALUs.
     
  4. MfA

    MfA
    Legend

    Joined:
    Feb 6, 2002
    Messages:
    6,805
    Likes Received:
    473
    So? I never said it would work bolted onto Larrabee as is; I don't particularly like Larrabee as is ...
     
  5. rpg.314

    Veteran

    Joined:
    Jul 21, 2008
    Messages:
    4,298
    Likes Received:
    0
    Location:
    /
    So you are suggesting that there should be a DMA engine on GPUs to prefetch into shared memory?
     
  6. MfA

    MfA
    Legend

    Joined:
    Feb 6, 2002
    Messages:
    6,805
    Likes Received:
    473
    Would be nice ... and almost free as far as transistors are concerned.
     
  7. Rayne

    Newcomer

    Joined:
    Jun 23, 2007
    Messages:
    91
    Likes Received:
    0
    I barely remember anything, because it was back in 2006/7, but I remember that with a single buffer for the DMA transfer, the SPE was idle most of the time. With multiple buffers, though, overlapping the computation on one buffer with the data transfer into another, the results were better. This is a must for any Cell app imho. I also remember problems trying to multi-thread my code because there was shared data across the threads.

    Yeah, I did something like that, but with the shuffle instructions & several registers.

    I remember using the _MM_TRANSPOSE4_PS macro, but it needed 8 XMM registers, and sometimes you are already using some registers to store 'previously shuffled' data :)

    Don't shoot the messenger :)
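
    Roughly the same overlap trick, expressed in CUDA terms with software double buffering in shared memory instead of SPE DMA (just a sketch; the 256-element tile, the assumption of 256 threads per block, and the trivial per-tile "work" are made up for illustration):

    ```cuda
    // Double-buffered streaming through shared memory: while the block works
    // on the tile already resident in shared memory, each thread prefetches
    // one element of the next tile into a register, then commits it to the
    // idle buffer. Assumes blockDim.x == TILE.
    #define TILE 256

    __global__ void stream_tiles(const float* in, float* out, int ntiles)
    {
        __shared__ float buf[2][TILE];
        const float* base = in + (size_t)blockIdx.x * ntiles * TILE;
        int cur = 0;

        // Prime buffer 0 with the first tile.
        buf[cur][threadIdx.x] = base[threadIdx.x];
        __syncthreads();

        float acc = 0.0f;
        for (int t = 0; t < ntiles; ++t) {
            // Start the load of the next tile; it can be in flight while we
            // compute on the current one.
            float next = (t + 1 < ntiles) ? base[(t + 1) * TILE + threadIdx.x] : 0.0f;

            // Placeholder "computation" on the resident tile.
            acc += buf[cur][threadIdx.x];

            __syncthreads();                    // everyone done reading buf[cur]
            buf[cur ^ 1][threadIdx.x] = next;   // commit the prefetched element
            __syncthreads();                    // next tile now fully visible
            cur ^= 1;
        }
        out[blockIdx.x * TILE + threadIdx.x] = acc;
    }
    ```

    The single-buffer version would just load, __syncthreads(), compute and repeat, stalling on every load, which is the GPU equivalent of the idle SPE described above.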
     
  8. nAo

    nAo Nutella Nutellae
    Veteran

    Joined:
    Feb 6, 2002
    Messages:
    4,325
    Likes Received:
    93
    Location:
    San Francisco
    Cache memories are banked as well, so I am not sure the premise is entirely correct, though I still get what you are pointing at :)
    Also, gather done via the texture unit doesn't sound like a good idea if you are going to re-use your data multiple times.

    This is a scalable approach, which can get faster in the future. Doesn't NVIDIA operate in a very similar way for (uncached) global memory accesses?
    Unfortunately, scratchpad-memory-based programming models don't scale so nicely.

    It's certainly fast, but I wouldn't call gather/scatter from a minuscule memory that you have to manage on your own 'flexible' ;-)
     
  9. MrGaribaldi

    Regular

    Joined:
    Nov 23, 2002
    Messages:
    611
    Likes Received:
    0
    Location:
    In transit
    But why is that? They don't give any reasons in the slides, although it looks like the ME is taking "too long".

    Nvidia has a nice CUDA sample for doing DCT on the GPU, with 2 different kernels that are quite fast by themselves and can be further tweaked, so I don't see why you couldn't offload it to the GPU as well.

    Granted, I've only played around with doing MJPEG compression on the Cell and GPU, so it could very well be that I'm missing part of the picture for x264 encoding...
     
  10. rpg.314

    Veteran

    Joined:
    Jul 21, 2008
    Messages:
    4,298
    Likes Received:
    0
    Location:
    /
    Blockwise DCT should be very nicely parallelizable. Also, CUDA helps as it exposes the dedicated video decode hardware, at least on Windows, so only the final lossless compression needs to be done on the CPU. ME and DCT can both be on the GPU.
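
    For a sense of how directly blockwise DCT maps onto the hardware: one CUDA thread block per 8x8 block, one thread per output coefficient, is already embarrassingly parallel. A deliberately naive sketch (direct DCT-II, none of the separable-pass or lookup-table tricks a real encoder or the SDK sample would use):

    ```cuda
    // Naive 8x8 forward DCT-II: one thread block per image block, one thread
    // per output coefficient. Launch with gridDim = (width/8, height/8) and
    // blockDim = (8, 8); 'width' is the image pitch in floats.
    __global__ void dct8x8(const float* src, float* dst, int width)
    {
        __shared__ float tile[8][8];

        int bx = blockIdx.x * 8, by = blockIdx.y * 8;
        int u = threadIdx.x, v = threadIdx.y;

        // Stage the 8x8 pixel block in shared memory, one pixel per thread.
        tile[v][u] = src[(by + v) * width + (bx + u)];
        __syncthreads();

        const float PI = 3.14159265358979f;
        float cu = (u == 0) ? rsqrtf(2.0f) : 1.0f;
        float cv = (v == 0) ? rsqrtf(2.0f) : 1.0f;

        // Each thread sums over the 64 pixels for its coefficient F(u, v).
        float sum = 0.0f;
        for (int y = 0; y < 8; ++y)
            for (int x = 0; x < 8; ++x)
                sum += tile[y][x] *
                       cosf((2 * x + 1) * u * PI / 16.0f) *
                       cosf((2 * y + 1) * v * PI / 16.0f);

        dst[(by + v) * width + (bx + u)] = 0.25f * cu * cv * sum;
    }
    ```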
     
  11. rpg.314

    Veteran

    Joined:
    Jul 21, 2008
    Messages:
    4,298
    Likes Received:
    0
    Location:
    /
    I'd like a detailed explanation of this, please. It seems to me that you are implying that hardware-managed caches (with s/w hints like on LRB, if need be) scale better than purely software-managed caches like on GPUs.
     
  12. Jawed

    Legend

    Joined:
    Oct 2, 2004
    Messages:
    10,873
    Likes Received:
    767
    Location:
    London
    Yeah, shame there's nothing more detailed.

    Is there an NVidia presentation on the details of H.264 encoding? Has anyone else done one for CUDA encoding?

    There's a CAL sample for DCT too.

    I don't understand the H.264 encoding pipeline at all well, so I don't know how realistic doing both ME and DCT on the GPU is.

    Jawed
     
  13. TimothyFarrar

    Regular

    Joined:
    Nov 7, 2007
    Messages:
    427
    Likes Received:
    0
    Location:
    Santa Clara, CA
    Not to my knowledge; global accesses (i.e. uncached) are also bank addressed (just like shared memory) with CUDA. I'd argue that in bandwidth-limited cases, having the extra 16x addressing capacity (16 bank addresses vs 1 cache line) per global access can be quite a performance benefit (well, assuming the programmer can make use of it).
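
    The banking advantage is easiest to see with shared memory itself: a half-warp of 16 threads can read 16 completely unrelated addresses in a single step as long as they hit 16 different banks, whereas a cache line only helps when the addresses are contiguous. A tiny sketch along those lines (the 256-entry table and the index pattern are just for illustration):

    ```cuda
    // Gather through 16-way banked shared memory (G8x/GT200 banking: 16 banks,
    // 32 bits wide, bank = word index % 16). If the 16 indices used by a
    // half-warp land in 16 distinct banks, the gather is conflict-free no
    // matter how scattered the addresses are.
    __global__ void banked_gather(const float* table, const int* idx,
                                  float* dst, int n)
    {
        __shared__ float lut[256];

        // Stage part of the table in shared memory with a coalesced load.
        for (int i = threadIdx.x; i < 256; i += blockDim.x)
            lut[i] = table[i];
        __syncthreads();

        int tid = blockIdx.x * blockDim.x + threadIdx.x;
        if (tid < n)
            dst[tid] = lut[idx[tid] & 255];   // arbitrary per-thread index
    }
    ```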
     
  14. trinibwoy

    trinibwoy Meh
    Legend

    Joined:
    Mar 17, 2004
    Messages:
    10,430
    Likes Received:
    432
    Location:
    New York
    Global memory accesses don't seem to be banked. Coalescing is contingent on all required addresses being present in the same contiguous memory segment. However, the segment size can be 32, 64 or 128 bytes. So global memory access works similarly to the LRB cache, where there is one read per segment (cache line).
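
    A concrete way to see the segment rule (this is the compute capability 1.2/1.3 behaviour described in the CUDA programming guide): consecutive threads reading consecutive words collapse into one transaction per segment, while a strided pattern ends up touching one segment per thread. Sketch:

    ```cuda
    // Consecutive threads read consecutive 4-byte words: the 64 bytes touched
    // by a half-warp fall inside one 64-byte segment, so the hardware issues
    // a single memory transaction for them.
    __global__ void copy_coalesced(const float* in, float* out, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            out[i] = in[i];
    }

    // With a large stride each thread lands in its own segment, so the same
    // half-warp can cost up to 16 separate transactions for the same amount
    // of useful data.
    __global__ void copy_strided(const float* in, float* out, int n, int stride)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i * stride < n)
            out[i] = in[i * stride];
    }
    ```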
     
  15. nAo

    nAo Nutella Nutellae
    Veteran

    Joined:
    Feb 6, 2002
    Messages:
    4,325
    Likes Received:
    93
    Location:
    San Francisco
    Just checked the CUDA docs, and global memory accesses are "simply" coalesced, which sounds similar to what LRB does (caching aside...).
    Also, this cache vs banked memory argument doesn't make any sense; cache memories *are* banked.
     
  16. pcchen

    pcchen Moderator
    Moderator Veteran Subscriber

    Joined:
    Feb 6, 2002
    Messages:
    2,749
    Likes Received:
    127
    Location:
    Taiwan
    Yes, global memory access basically just follows the memory controller's access pattern. However, in the case of GT200, it seems that there's some sort of reorder buffer between the memory controller and the ALUs, so the coalescing rules are much more relaxed than on G80.
     
  17. nAo

    nAo Nutella Nutellae
    Veteran

    Joined:
    Feb 6, 2002
    Messages:
    4,325
    Likes Received:
    93
    Location:
    San Francisco
    Yep, that's pretty cool.
     
  18. Jawed

    Legend

    Joined:
    Oct 2, 2004
    Messages:
    10,873
    Likes Received:
    767
    Location:
    London
  19. rpg.314

    Veteran

    Joined:
    Jul 21, 2008
    Messages:
    4,298
    Likes Received:
    0
    Location:
    /
    If 64 threads issue contiguous reads, each thread reading 16 bytes, then you have fetched 1 KB of data from memory. Probably Timothy is referring to the fact that that 1 KB will be fetched by different memory controllers and then merged together in the register file, hence the banking. Elements of this technique are already there in CPUs with multi-channel memory controllers, so I suppose it is an issue of semantics.
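
    To put that arithmetic in code: with 64 threads per block, each loading one float4, a block pulls in 64 x 16 B = 1 KB per step, which the coalescing rules break into 128-byte segment transactions that get spread across the memory partitions. Something like:

    ```cuda
    // 64 threads per block, each reading one float4 (16 bytes): 1 KB per
    // block per step, coalesced into 128-byte segments.
    __global__ void read_1k_per_block(const float4* in, float4* out, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;   // blockDim.x == 64 assumed
        if (i < n) {
            float4 v = in[i];
            v.x += v.y + v.z + v.w;   // touch the data so the load isn't dead
            out[i] = v;
        }
    }
    ```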
     
  20. Simon F

    Simon F Tea maker
    Moderator Veteran

    Joined:
    Feb 8, 2002
    Messages:
    4,560
    Likes Received:
    157
    Location:
    In the Island of Sodor, where the steam trains lie
    It's damned difficult to do fast on anything :lol:

    FWIW, encoding is probably easier than decoding, which is ironic as the latter is surely going to be more common.
     