Larrabee vs Cell vs GPU's? *read the first post*

Discussion in 'GPGPU Technology & Programming' started by rpg.314, Apr 17, 2009.

  1. rpg.314

    Veteran

    Joined:
    Jul 21, 2008
    Messages:
    4,298
    Likes Received:
    0
    Location:
    /
    For decoding we already have dedicated hardware in GPUs today, so transcoders will most likely take advantage of it.
     
  2. T.B.

    Newcomer

    Joined:
    Mar 11, 2008
    Messages:
    156
    Likes Received:
    0
    It really simplifies the code in many cases, which can give you a nice speed-up. For my biggest algorithm, I got some 30% just by replacing the ungodly complex GF8 coalescing code with a very straightforward, partially coalesced version.
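    Roughly the kind of change I mean (a simplified sketch with made-up kernels and 256-thread blocks assumed, not my actual code):

    Code:
    // GF8-style: stage a tile through shared memory so the actual global reads
    // are perfectly aligned and coalesced, then consume the data at an offset.
    // Edge handling is omitted to keep the sketch short.
    __global__ void gather_staged(const float* in, float* out, int offset, int n)
    {
        __shared__ float tile[256];
        int base = blockIdx.x * blockDim.x;        // aligned block start
        int i = base + threadIdx.x;
        if (i < n)
            tile[threadIdx.x] = in[i];             // fully coalesced load
        __syncthreads();
        int j = threadIdx.x + offset;
        if (i < n && j < blockDim.x)
            out[i] = tile[j];                      // the access we really wanted
    }

    // GT200-style: just read where you need to and let the hardware split the
    // request into however many 32/64/128-byte segments the addresses touch.
    __global__ void gather_direct(const float* in, float* out, int offset, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i + offset < n)
            out[i] = in[i + offset];               // partially coalesced is fine
    }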

    Staying on topic, I'd say having a little bit of smarts in the way you process memory accesses is often quite useful, especially if you are not on a fixed platform. Cell suffers from not having this, but makes up for it with its 6-cycle latency. It would be interesting to know how much the GT200's reorder buffer costs in terms of latency.
     
  3. rpg.314

    Veteran

    Joined:
    Jul 21, 2008
    Messages:
    4,298
    Likes Received:
    0
    Location:
    /
    I suspect it is about 50-100 cycles. The initial CUDA docs said that a 400-600 clock cycle latency should be expected, but when Volkov published his benchmarks he found more like 500-700 clocks on GT200. This is a rather crude estimate, I admit.
     
  4. T.B.

    Newcomer

    Joined:
    Mar 11, 2008
    Messages:
    156
    Likes Received:
    0
    That sounds waaaaay too high to me. Like an order of magnitude too high.
    OK, the reorder runs at base clock, not the shader clock (which I assume you meant), so that brings it down a good bit.
    Still, I may be totally off here, but even 20 cycles is a long time. :)
     
  5. rpg.314

    Veteran

    Joined:
    Jul 21, 2008
    Messages:
    4,298
    Likes Received:
    0
    Location:
    /
    The numbers there are definitely in terms of shader clocks. Considering the logic involved, though, 100 cycles is indeed high. It may well be that NVidia wanted to minimize the area spent on it (as it is CUDA-specific) and so reused a smaller coalescer over several passes rather than add to an already bloated die. Like I said, it is a crude estimate.
     
  6. CouldntResist

    Regular

    Joined:
    Aug 16, 2004
    Messages:
    264
    Likes Received:
    6
    Only ARM and Core will prevail. The other architectures (NVidia, AMD, Larrabee, Itanium, Cell) will be assimilated or made obsolete.

    The resulting polarity on the industrial scene will be the seed of an epic conflict within human civilisation, destined to last for eons. Eventually spreading over the entire galaxy, the conflict will outlast the biology-based life forms of humanity. Our consciousnesses, now encased in machine shells, will be occupied with the ultimate goal: complete annihilation of the opponent before the Heat Death of the Universe...
     
  7. T.B.

    Newcomer

    Joined:
    Mar 11, 2008
    Messages:
    156
    Likes Received:
    0
    Galactic Arm vs. Galactic Core? Hmm.... ;)
     
  8. TimothyFarrar

    Regular

    Joined:
    Nov 7, 2007
    Messages:
    427
    Likes Received:
    0
    Location:
    Santa Clara, CA
    Oops, that's what I get for a quick post in the local Apple Store this weekend. I was referring to the ability of the GT2xx series to reduce access size, specifically the ability of the hardware to reduce global access requests from 128-byte segments to 64-byte or 32-byte segments. My mistaken 16x addressing factor is really a 4x addressing factor, from the reduction of a 128-byte segment to four 32-byte segments.

    The 32-byte segment turns out to be only half the size of an LRB cache line. Marco's point is indeed a good one for global memory accesses.
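    A made-up example of where the reduction actually kicks in:

    Code:
    // Each thread copies one float, shifted by one element. Half-warps whose
    // 64 bytes of data straddle a 128-byte segment boundary are served by one
    // segment reduced to 64 bytes plus one reduced to 32 bytes, instead of a
    // full 128-byte transaction (or, on G8x/G9x, 16 separate transactions).
    __global__ void shifted_copy(const float* in, float* out, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i + 1 < n)
            out[i] = in[i + 1];    // misaligned by one element
    }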
     
    #48 TimothyFarrar, Apr 20, 2009
    Last edited by a moderator: Apr 20, 2009
  9. trinibwoy

    trinibwoy Meh
    Legend

    Joined:
    Mar 17, 2004
    Messages:
    10,430
    Likes Received:
    433
    Location:
    New York
    Sure, but with the shared memory stuff a single read can pull data from all banks simultaneously at arbitrary offsets within each bank. Aren't single-ported cache reads rigidly limited to pulling a single contiguous cache line at once?
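    Something like this, for example (illustrative only; row[] holds arbitrary per-lane row indices below 64):

    Code:
    // 16 banks of 4-byte words, GT200-style. Each lane of a half-warp reads
    // from its "own" bank but at an arbitrary row within that bank, so all
    // 16 reads complete in a single conflict-free pass.
    __global__ void per_bank_gather(const float* in, const int* row, float* out)
    {
        __shared__ float smem[64 * 16];            // 64 rows x 16 banks
        for (int k = threadIdx.x; k < 64 * 16; k += blockDim.x)
            smem[k] = in[k];                       // stage a tile
        __syncthreads();

        int lane = threadIdx.x & 15;               // this lane's bank
        out[threadIdx.x] = smem[row[threadIdx.x] * 16 + lane];
    }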
     
  10. Jawed

    Legend

    Joined:
    Oct 2, 2004
    Messages:
    10,873
    Likes Received:
    767
    Location:
    London
    What algorithms need to do this? Why? When are these offsets arbitrary, but not random?

    Jawed
     
  11. RoOoBo

    Regular

    Joined:
    Jun 12, 2002
    Messages:
    308
    Likes Received:
    31
    'Banked' can actually mean a lot of things: 'banked' to allow one or more reads per cycle (if there are no conflicts), or 'banked' because a cache memory of a given size isn't synthesizable at a given frequency, so you need multiple blocks and select between them on access.

    In the case of NVidia I haven't read the documentation, so I'm not sure, but I would guess that their local storage memory has as many banks as vector ALUs (requests per cycle) and that each bank is independently addressable, because that makes sense given how a shader processor works. In fact, that way it would work more like a single-ported register file than like a normal cache (it may even be synthesized as one).
     
  12. nAo

    nAo Nutella Nutellae
    Veteran

    Joined:
    Feb 6, 2002
    Messages:
    4,325
    Likes Received:
    93
    Location:
    San Francisco
    In fact it's not a cache, it's a multi-banked local memory.
     
  13. TimothyFarrar

    Regular

    Joined:
    Nov 7, 2007
    Messages:
    427
    Likes Received:
    0
    Location:
    Santa Clara, CA
    Awesome link Jawed. Thanks.

    So for the convolution on G92-level hardware it looks as if fully optimized CUDA shared memory is only a little over 2.5x faster than GPGPU-style texture access alone. With GT200 it is still around 2.5x faster using shared memory vs texture access; the coalescing alone brings the hideous CUDA code up to the performance of the texture-access-only case.
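    For reference, the shared-memory variant is roughly this shape (a generic 1D sketch with made-up radius and block size, not the actual kernel behind those numbers):

    Code:
    #define RADIUS 8       // made-up filter radius
    #define BLOCK  256     // assumed threads per block

    // Stage a tile plus halo once, then every filter tap comes out of on-chip
    // shared memory instead of a separate texture/global fetch.
    __global__ void conv1d_smem(const float* in, const float* w, float* out, int n)
    {
        __shared__ float tile[BLOCK + 2 * RADIUS];
        int gid = blockIdx.x * BLOCK + threadIdx.x;

        // centre element
        tile[RADIUS + threadIdx.x] = (gid < n) ? in[gid] : 0.0f;
        // halo: the first RADIUS threads fetch the left and right skirts
        if (threadIdx.x < RADIUS) {
            int l = gid - RADIUS;
            int r = gid + BLOCK;
            tile[threadIdx.x]                  = (l >= 0) ? in[l] : 0.0f;
            tile[RADIUS + BLOCK + threadIdx.x] = (r < n)  ? in[r] : 0.0f;
        }
        __syncthreads();

        if (gid < n) {
            float acc = 0.0f;
            for (int k = -RADIUS; k <= RADIUS; ++k)
                acc += w[k + RADIUS] * tile[RADIUS + threadIdx.x + k];
            out[gid] = acc;
        }
    }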

    Wasn't a similar ~2x factor also mentioned in some DX10 vs DX11 CS slides? The best GPGPU vs CUDA speed-up factor I can remember was something like ~7x for a parallel scan (comparing CUDA to OpenGL).
     
  14. trinibwoy

    trinibwoy Meh
    Legend

    Joined:
    Mar 17, 2004
    Messages:
    10,430
    Likes Received:
    433
    Location:
    New York
    Don't really understand the question. Why is randomness relevant? The hardware doesn't care whether the offsets are fixed, arbitrary or random. Lane-aware dynamic warp formation can be considered a case where the offsets are both arbitrary and random, yet it would certainly take advantage of the fact that each lane has a dedicated bank to facilitate simultaneous reads across all lanes.

    Yep, that's pretty much how it's set up. But since their ALUs are double-pumped there are actually 16 banks for each 8-wide ALU (reads are 16 wide).
     
  15. Jawed

    Legend

    Joined:
    Oct 2, 2004
    Messages:
    10,873
    Likes Received:
    767
    Location:
    London
    It's notable that the later tweaks (128-bit memory operations) weren't applied to the texturing algorithm. Also, does shared memory actually benefit this kernel? Why not just let the texture cache do all the heavy lifting? OK, so this isn't trilinear filtering where the texture cache is really at home, but still.

    I think the D3D11-CS speed-up comes in FFT.

    Slide 54 for the Scan speed-up:

    http://gpgpu.org/static/asplos2008/ASPLOS08-5-advanced-data-parallel-programming.pdf

    which relies upon shared memory so that scatter doesn't get bogged down by the slowness of memory. Luckily it's a highly localised scatter.
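    The block-level pattern is roughly this (a minimal Hillis-Steele sketch, not the optimised, work-efficient kernels in the deck):

    Code:
    // Inclusive scan of one block's worth of data, done entirely in shared
    // memory; the scattered writes never leave the on-chip store.
    __global__ void block_scan(const float* in, float* out, int n)
    {
        __shared__ float buf[256];                 // assumes 256-thread blocks
        int tid = threadIdx.x;
        int gid = blockIdx.x * blockDim.x + tid;

        buf[tid] = (gid < n) ? in[gid] : 0.0f;
        __syncthreads();

        for (int offset = 1; offset < blockDim.x; offset <<= 1) {
            float v = (tid >= offset) ? buf[tid - offset] : 0.0f;
            __syncthreads();
            buf[tid] += v;
            __syncthreads();
        }
        if (gid < n)
            out[gid] = buf[tid];
    }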

    Jawed
     
  16. Jawed

    Legend

    Joined:
    Oct 2, 2004
    Messages:
    10,873
    Likes Received:
    767
    Location:
    London
    Randomness is relevant because you need an algorithm with precisely controlled offsetting to gain the benefit you're ascribing to NVidia's architecture.

    Which raises the fundamental question: are you doing the offsetting because the memory forces you to, or is it intrinsic to the algorithm (and, with a wonderful stroke of luck, the algorithm just happens to match the dimensions of the memory :lol: )?

    I've seen people use 17 and 63 as offsets in their shared memory programming purely to get the performance benefit of banked memory on NVidia. In neither case was 17 or 63 a convenient or useful tweak determined specifically by the algorithm; it was a pure hardware-specific adjustment.
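    For the record, the "17" looks like this (a generic transpose sketch with 16x16 thread blocks; the +1 has nothing to do with the algorithm itself):

    Code:
    #define TILE 16

    // Pad a 16x16 shared-memory tile to 16x17 so that reading a column walks
    // 16 different banks instead of hitting the same bank 16 times.
    __global__ void transpose16(const float* in, float* out, int width, int height)
    {
        __shared__ float tile[TILE][TILE + 1];     // +1 is the bank-conflict pad

        int x = blockIdx.x * TILE + threadIdx.x;
        int y = blockIdx.y * TILE + threadIdx.y;
        if (x < width && y < height)
            tile[threadIdx.y][threadIdx.x] = in[y * width + x];
        __syncthreads();

        int tx = blockIdx.y * TILE + threadIdx.x;  // transposed coordinates
        int ty = blockIdx.x * TILE + threadIdx.y;
        if (tx < height && ty < width)
            out[ty * height + tx] = tile[threadIdx.x][threadIdx.y];  // conflict-free column read
    }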

    So basing an argument for the superiority of NVidia's banked shared memory over L1 fetches solely on offset-banking is not very convincing.

    e.g. if an algorithm eventually consumes all the data in the tile sitting in shared memory, then what we're actually talking about here is the ordering of fetches. The ordering that works on NVidia may not suit Larrabee, but the converse prolly exists too.

    Some Larrabee factors that might affect how you implement the algorithm:
    • the large per-work-item state that Larrabee affords
    • the mutability of "registers"<->"shared-memory" in on-die memory (no hard limits)
    • the shuffle instructions that allow work-items to get at data belonging to other work-items without having to go via "shared memory"
    Jawed
     
  17. trinibwoy

    trinibwoy Meh
    Legend

    Joined:
    Mar 17, 2004
    Messages:
    10,430
    Likes Received:
    433
    Location:
    New York
    It's highly convincing, since all of the "problems" you see with shared memory conflicts also apply to a traditional cache, whereas a multi-banked local memory will, in a large number of random cases, coalesce better than cache reads. So the cache has all of the same downsides and none of the potential upside. But like you said, at least caches are bigger.
     
  18. 3dilettante

    Legend Alpha

    Joined:
    Sep 15, 2003
    Messages:
    8,122
    Likes Received:
    2,873
    Location:
    Well within 3d
    This one may be somewhat more complex, depending on the situation.
    Shuffles and broadcasts allow for getting data belonging to other work items, so long as we have an SOA scheme and share data by shuffling between the 16 items in the same SIMD register.

    If sharing extends beyond 16 units, but the work group is small enough that we can hold enough of the relevant data within the register file, then some combination of instructions can find the needed register and then access the needed elements.

    Once we start spilling into main memory, arbitrary access to any work unit's data would involve the program tracking the needed index values and then calculating the needed address to load from. It may not share the name with the abstraction of shared memory, but the actions involved are similar.
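    As a rough analogue of the first case (using CUDA's later warp shuffle rather than Larrabee's actual instructions):

    Code:
    // Each lane grabs the value held by the lane to its left entirely within
    // the warp's registers; no shared memory is involved. Lane 0 simply gets
    // its own value back. Anything wider than one warp/register would need
    // shared memory (or, on Larrabee, cache loads), as noted above.
    __global__ void neighbour_diff(const float* in, float* out, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        float v = (i < n) ? in[i] : 0.0f;
        float left = __shfl_up_sync(0xffffffffu, v, 1);
        if (i < n)
            out[i] = v - left;
    }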
     
  19. Jawed

    Legend

    Joined:
    Oct 2, 2004
    Messages:
    10,873
    Likes Received:
    767
    Location:
    London
    The Larrabee implementation may not stumble blindly into thrashing the cache at 1/16th throughput. Like I asked earlier, why is the algorithm behaving "so badly"?

    A large-radius stochastic kernel might be an example of such an access pattern. If you treat each work-item individually then the fetches appear random. All the work-items in a fibre are less random. If you have multiple fibres forming a tile, and the tile strides, then you're left with very little randomness.

    Jawed
     
  20. trinibwoy

    trinibwoy Meh
    Legend

    Joined:
    Mar 17, 2004
    Messages:
    10,430
    Likes Received:
    433
    Location:
    New York
    I have to admit you've lost me here. You criticize shared memory based on the fact that the software has to explicitly take advantage of its structure. But then you say it's not a problem for Larrabee because the software explicitly takes advantage of its memory structure? Isn't that the same thing?

    If so, I would argue that it takes a lot less fudging about with shared memory (offsets) than with cache (shuffling data around) to get what you want.
     