Larrabee vs Cell vs GPU's? *read the first post*

Discussion in 'GPGPU Technology & Programming' started by rpg.314, Apr 17, 2009.

  1. TimothyFarrar

    Regular

    Joined:
    Nov 7, 2007
    Messages:
    427
    Likes Received:
    0
    Location:
    Santa Clara, CA
    I'm not sure about that. If you were to scatter randomly (or semi-randomly) into your CUDA shared memory, wouldn't you have, on average, rather few bank conflicts ... if so, your scatter would actually be a bit faster than if you could only hit a single cache line per access.
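    A minimal CUDA sketch of the kind of scatter being discussed (hypothetical kernel, not anyone's actual code): on G8x/GT2xx shared memory has 16 banks of 32-bit words, and a half-warp only serialises when two of its threads hit the same bank, so a (semi-)random choice of slots averages far fewer conflicts than the 16-way worst case.

    ```cuda
    // Hypothetical illustration: each thread scatters one value to a
    // pseudo-random slot in a 256-entry shared-memory buffer. A bank conflict
    // occurs only among half-warp threads whose slot indices are equal mod 16,
    // so a random permutation gives few conflicts on average.
    __global__ void randomScatter(const float* in, float* out, const int* slot, int n)
    {
        __shared__ float buf[256];              // assumes blockDim.x == 256

        int tid = threadIdx.x;
        int gid = blockIdx.x * blockDim.x + tid;

        buf[tid] = 0.0f;
        __syncthreads();

        if (gid < n)
            buf[slot[gid] & 255] = in[gid];     // the scatter; if two slots collide, one writer wins
        __syncthreads();

        if (gid < n)
            out[gid] = buf[tid];                // ordered read-back, conflict-free
    }
    ```

    If slot[] were constructed so that indices within each half-warp were distinct mod 16, the scatter would be conflict-free, which is the "random but guaranteed conflict-free" pattern mentioned a few posts later.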
     
  2. Rayne

    Newcomer

    Joined:
    Jun 23, 2007
    Messages:
    91
    Likes Received:
    0
    The gradient vectors in the Perlin noise are 100% random, if you use the original Ken Perlin algorithm.

    In my CUDA implementation, I have 24 random accesses to float4 vectors, plus 6 accesses to byte indices, and the final float result write. :oops:
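    For reference, a rough sketch of where those random accesses come from in classic Perlin-style noise (hypothetical table and function names, not Rayne's actual code): chained permutation-table lookups (the byte-index accesses) hash a lattice corner, and each corner then needs a gradient fetch (the float4 accesses).

    ```cuda
    // Hypothetical sketch of one lattice-corner contribution. The perm[] lookups
    // are the data-dependent byte-index accesses and grad[] is the random float4
    // gradient fetch; several corners (and octaves) per sample add up to the
    // counts quoted above. The tables could be staged in shared memory, which is
    // where the bank-conflict question from the previous posts comes in.
    __device__ float cornerContribution(const unsigned char* perm,  // 256-entry permutation table
                                        const float4* grad,         // 256-entry gradient table
                                        int ix, int iy, int iz,
                                        float fx, float fy, float fz)
    {
        // Chained byte-index lookups hash the lattice point to a table index.
        int h = perm[(perm[(perm[ix & 255] + iy) & 255] + iz) & 255];

        // One effectively random float4 fetch per corner.
        float4 g = grad[h & 255];

        // Dot product of the gradient with the offset from the corner.
        return g.x * fx + g.y * fy + g.z * fz;
    }
    ```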
     
  3. TimothyFarrar

    Regular

    Joined:
    Nov 7, 2007
    Messages:
    427
    Likes Received:
    0
    Location:
    Santa Clara, CA
    A large-radius stochastic kernel is a good example. With LRB, if you were gathering 16 random pixels per vector gather, how many would hit the same 4x4 tile (bit-interleaved addressing, so a 4x4 tile = one cache line)? Compare that to how many bank conflicts you'd expect with CUDA. Not to mention that with a large-radius kernel you could create a "random" pattern which was ensured not to have bank conflicts...
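    For what it's worth, here's a sketch of what that bit-interleaved (Morton) layout implies, assuming 4-byte pixels and 64-byte lines (an illustration, not LRB's actual addressing scheme): the low two bits of x and y interleave, so a 4x4 block of pixels packs into one cache line, and two gathered pixels share a line exactly when they fall in the same 4x4 tile.

    ```cuda
    // Illustrative Morton (bit-interleaved) addressing. With 4-byte pixels a
    // 4x4 tile is 64 bytes = one cache line, so gathered pixels hit the same
    // line iff their Morton indices agree above the low 4 bits.
    __host__ __device__ unsigned int mortonIndex(unsigned int x, unsigned int y)
    {
        unsigned int idx = 0;
        for (int b = 0; b < 16; ++b) {
            idx |= ((x >> b) & 1u) << (2 * b);      // x bits go to even positions
            idx |= ((y >> b) & 1u) << (2 * b + 1);  // y bits go to odd positions
        }
        return idx;
    }

    __host__ __device__ bool sameCacheLine(unsigned int x0, unsigned int y0,
                                           unsigned int x1, unsigned int y1)
    {
        // 16 pixels * 4 bytes = 64 bytes, so a line covers 16 consecutive
        // Morton indices, i.e. one 4x4 tile.
        return (mortonIndex(x0, y0) >> 4) == (mortonIndex(x1, y1) >> 4);
    }
    ```

    Counting how many of the 16 gathered pixels land in the same line is then just counting distinct values of mortonIndex >> 4.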
     
  4. Jawed

    Legend

    Joined:
    Oct 2, 2004
    Messages:
    10,873
    Likes Received:
    767
    Location:
    London
    Just because I'm describing what people might do on Larrabee doesn't mean I think all this stuff is wonderful.

    Shared memory isn't even an automatic win. Still, it can be a huge bonus, so until the abstractions are improved (if that's even possible) people are stuck with programming to the metal.

    Maybe NVidia will start opening up on their architecture, in light of increasing competition, instead of making people faff for 18 months in order to get a decent matrix multiply programmed.

    I'm sure this will be qualified by developers over time once they get their hands on Larrabee.

    Maybe we need a thread: "Ray Tracer: CUDA or Larrabee?"

    Jawed
     
  5. nAo

    nAo Nutella Nutellae
    Veteran

    Joined:
    Feb 6, 2002
    Messages:
    4,325
    Likes Received:
    93
    Location:
    San Francisco
    How do you make your data fit in a small local memory for a "large radius stochastic kernel"? :)
     
  6. Jawed

    Legend

    Joined:
    Oct 2, 2004
    Messages:
    10,873
    Likes Received:
    767
    Location:
    London
    The fundamental question is: if you're allowed to optimise for NVidia, aren't you allowed to optimise for Larrabee? Why blindly take a piecemeal approach on Larrabee just because piecemeal doesn't hurt too much on NVidia?

    This paper:

    http://impact.crhc.illinois.edu/ftp/conference/ppopp-08-ryoo.pdf

    is just over a year old. I wonder how much of that is still useful? They didn't even get to 100 GFLOPs in matrix multiplication on G80.

    The learning curve is pretty steep with these "desktop supercomputers" :lol:

    Jawed
     
  7. Jawed

    Legend

    Joined:
    Oct 2, 2004
    Messages:
    10,873
    Likes Received:
    767
    Location:
    London
    In Larrabee you might hold the whole tile in L2 and slide a tile from the dataset through L2 using the scalar pipeline's cache pre-fetching instructions, in parallel with the work being done by the VPU.

    The worst-case latency is 16 × 10 L2 cycles = 160 cycles. How does that compare with the 500 cycles of latency that NVidia's trying to hide each time it goes to fetch a tile from video memory into shared memory? How much work are the multiprocessors doing (erm, not doing on the kernel) while they compute tile addresses and move data into shared memory?
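    For contrast, this is roughly the staging work on the CUDA side being referred to (a hypothetical kernel skeleton, not a benchmark): the multiprocessor spends instructions on address computation, the copy into shared memory, and a barrier before any kernel math happens on the tile.

    ```cuda
    // Hypothetical tile-staging skeleton (assumes a 16x16 thread block):
    // address computation, the copy that exposes the ~500-cycle video-memory
    // latency, and a barrier, all executed instead of kernel work.
    #define TILE 16

    __global__ void processTiles(const float* src, float* dst, int width, int height)
    {
        __shared__ float tile[TILE][TILE];

        // Address computation (clamped so every thread reaches the barrier).
        int gx = min((int)(blockIdx.x * TILE + threadIdx.x), width  - 1);
        int gy = min((int)(blockIdx.y * TILE + threadIdx.y), height - 1);

        // Copy from video memory into shared memory, then wait for the whole tile.
        tile[threadIdx.y][threadIdx.x] = src[gy * width + gx];
        __syncthreads();

        // Only now does the actual per-pixel work on the staged tile begin.
        dst[gy * width + gx] = tile[threadIdx.y][threadIdx.x] * 2.0f;  // placeholder work
    }
    ```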

    Jawed
     
  8. TimothyFarrar

    Regular

    Joined:
    Nov 7, 2007
    Messages:
    427
    Likes Received:
    0
    Location:
    Santa Clara, CA
    :sad: Yeah, a 64x64 or so sized tile for local memory isn't all that large...
     
  9. nAo

    nAo Nutella Nutellae
    Veteran

    Joined:
    Feb 6, 2002
    Messages:
    4,325
    Likes Received:
    93
    Location:
    San Francisco
    This is what I mean when I say that a data cache based approach scales better (which obviously doesn't say much about absolute performance).
     
  10. 3dilettante

    Legend Alpha

    Joined:
    Sep 15, 2003
    Messages:
    8,137
    Likes Received:
    2,939
    Location:
    Well within 3d
    Well, it is cache, so in the spirit of nitpicking, the worst-case is "Surprise!! One of the other three threads just thrashed the L2 at those lines you need" and it's 16*200 cycles or more.

    The best way to prevent that would be to properly pin the needed threads to the core and make sure that they don't fetch to the L2 or write out any data that might force an eviction before all filter work is done for all threads.
     
  11. trinibwoy

    trinibwoy Meh
    Legend

    Joined:
    Mar 17, 2004
    Messages:
    10,436
    Likes Received:
    443
    Location:
    New York
    I've been wondering lately whether Nvidia will increase the amount of shared memory beyond the DX11 CS requirements and try to take advantage of it for graphics rendering tasks as well. If it's there, might as well use it, right? Is there anything that could potentially make use of it? Tiling, programmable AA resolve? Or is that a non-starter?
     
  12. Jawed

    Legend

    Joined:
    Oct 2, 2004
    Messages:
    10,873
    Likes Received:
    767
    Location:
    London
    In Seiler et al. there's a thread that controls this stuff as well as distributing work to the other threads. I've got no idea if a co-operative scheme amongst 4 threads is reasonable.

    There's only L1 line locking, there's no way to lock L2 lines.

    Jawed
     
  13. 3dilettante

    Legend Alpha

    Joined:
    Sep 15, 2003
    Messages:
    8,137
    Likes Received:
    2,939
    Location:
    Well within 3d
    Threads working on the tile can synchronize so that they work on the tile in the L2 at the same time.
    Load and store traffic from these threads during this process could be set to be non-temporal to minimize L2 cache disruption.
    There may be some awkwardness if results are written back non-temporally, since they'd have to be reloaded if required soon after.

    If complexity is the order of the day, the data could be structured and packed so that 1 or more ways in the L2 are purposefully kept free of data, so that writeback or scratchpad work doesn't evict data.

    Neither scheme works fully if another core is writing to the exact same addresses and thereby invalidating this core's cached copies, but in that case there would be worse problems than cache thrashing.
     
  14. Jawed

    Legend

    Joined:
    Oct 2, 2004
    Messages:
    10,873
    Likes Received:
    767
    Location:
    London
    If they deleted the ROPs?

    Currently NVidia's L2s are located close to the memory controllers. I'm not clear on whether L2s serve both for textures and render targets though. If the L2s were moved closer to the multiprocessors then there'd be a problem in moving texels from L2 in MP1 to L1 in MP37, say. Becomes similar to the ring-bus palaver of R600.

    Larrabee dodges that bullet (at least partially) by giving texturing a dedicated cache. No idea how big it is. No idea if it's multilevel. No idea if Larrabee's L2 is on the critical path for texels on their trip from memory to TUs.

    The other issue is the early-Z culling. Rasterisation and ROPs effectively interact in determining whether to shade a fragment. Deleting the ROPs would require the output merge programs on all the multi-processors to keep the rasteriser's low-resolution version of Z/stencil up-to-date. That's a potentially tricky interaction twixt fixed-function rasterisation and OM programs. e.g. if there's 64 multiprocessors each trying to keep the rasteriser up-to-date...

    There might be an argument for saying that ROPs can only die when setup/rasterisation also becomes fully software.

    Anyway, I like the way Larrabee holds a tile of the render target in L2 until it's completed. Shared memory doesn't seem particularly different, so it would be sweet if NVidia went with a significantly expanded shared memory, no-ROPs and lots of FLOPs to fill the space taken by the ROPs.

    Jawed
     
  15. TimothyFarrar

    Regular

    Joined:
    Nov 7, 2007
    Messages:
    427
    Likes Received:
    0
    Location:
    Santa Clara, CA
    Interesting idea. But what about MSAA and render target compression? Something tells me, given the number of samples described in that GPU 2013 presentation (scaled up compared to GT2xx), that MSAA is still in hardware for GT3xx at least...
     
  16. TimothyFarrar

    Regular

    Joined:
    Nov 7, 2007
    Messages:
    427
    Likes Received:
    0
    Location:
    Santa Clara, CA
    Forgot about this, but also in that NVision 08 GPU 2013 presentation,

    Rough timeline of changes,

    - unified shader (G80)
    - double precision (GT200)
    - C++
    - preemption
    - complete pointer support
    - virtual pipeline
    - adaptive workload partitioning

    And by 2013,

    - arbitrary dataflow
    - hardware managed threading and pipelining

    My guesses,

    - C++ (CUDA 3.0 gets C++ like interface?)
    - preemption (support for multiple CUDA kernels at the same time?)
    - complete pointer support (dynamic branching in a kernel?)
    - virtual pipeline (software rasterization?)
    - adaptive workload partitioning (hardware sched of kernels to SIMD units?)
     
  17. nAo

    nAo Nutella Nutellae
    Veteran

    Joined:
    Feb 6, 2002
    Messages:
    4,325
    Likes Received:
    93
    Location:
    San Francisco
    It's 32 KB per core (see the SIGGRAPH paper).
     
  18. Jawed

    Legend

    Joined:
    Oct 2, 2004
    Messages:
    10,873
    Likes Received:
    767
    Location:
    London
    Do you mean compression of both MSAA and colour data? It's certainly more work...

    This is for Z compression, which I've picked because of the sheer, stupidly high Z rate on NVidia:

    http://v3.espacenet.com/publication...=B1&FT=D&date=20080603&DB=EPODOC&locale=en_GB

    Hmm, quite a bit of work.

    In Larrabee, with no compression required (since pixels are written only once to memory, with optional MSAA resolve, depending on whether samples are required for the following pass) there's obviously less work to do. In that case it seems MSAA/stencil-shadow performance is more than adequate.

    Here:

    http://forum.beyond3d.com/showthread.php?t=53993

    the Archmark results imply that the theoretical Z-rate, 147 Gzixels/s, is achievable on GTX280 (I wonder if that's meaningful, though - how much z-culling is happening? HD4870 is faster than theoretical...). That's 2.3 shader cycles per zixel. Say Z is in 8x8 tiles: at 64 zixels per tile, that's ~147 cycles per tile. Looks unlikely.

    At the same time, NVidia's theoretical rate is nearly 3x RV770's. Is that factor ever witnessed in any game? If not ...

    Maybe some special instructions would speed things up, e.g. max + min of a tile in shared memory.
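    A sketch of what that might look like today with plain shared memory (hypothetical kernel, assuming one 64-thread block per 8x8 Z tile): a small min/max reduction produces the pair a software hierarchical-Z update would need, which is the sort of thing a dedicated instruction could shortcut.

    ```cuda
    // Hypothetical min/max reduction over an 8x8 Z tile in shared memory
    // (launch with 64 threads per block, one block per tile).
    __global__ void tileZMinMax(const float* z, float2* minMaxOut, int tilesPerRow, int pitch)
    {
        __shared__ float zmin[64];
        __shared__ float zmax[64];

        int lx = threadIdx.x & 7;                  // position inside the 8x8 tile
        int ly = threadIdx.x >> 3;
        int x  = (blockIdx.x % tilesPerRow) * 8 + lx;
        int y  = (blockIdx.x / tilesPerRow) * 8 + ly;

        float v = z[y * pitch + x];
        zmin[threadIdx.x] = v;
        zmax[threadIdx.x] = v;
        __syncthreads();

        // Standard shared-memory reduction over the 64 samples.
        for (int s = 32; s > 0; s >>= 1) {
            if (threadIdx.x < s) {
                zmin[threadIdx.x] = fminf(zmin[threadIdx.x], zmin[threadIdx.x + s]);
                zmax[threadIdx.x] = fmaxf(zmax[threadIdx.x], zmax[threadIdx.x + s]);
            }
            __syncthreads();
        }

        if (threadIdx.x == 0)
            minMaxOut[blockIdx.x] = make_float2(zmin[0], zmax[0]);
    }
    ```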

    How many FLOPs would fit in the ROPs and Frame Buffer (what is that?) sections of the final picture here?:

    http://www.techreport.com/articles.x/14934/2

    Jawed
     
  19. rpg.314

    Veteran

    Joined:
    Jul 21, 2008
    Messages:
    4,298
    Likes Received:
    0
    Location:
    /
    I'd have thought that it would mean pointers to registers too. Dynamic branching is already there, isn't it?
     
  20. 3dilettante

    Legend Alpha

    Joined:
    Sep 15, 2003
    Messages:
    8,137
    Likes Received:
    2,939
    Location:
    Well within 3d
    Looking at ROP sections alone, my eyeball says 2.5-3 clusters (without TMUs) could fit.
    The framebuffer part is something of an unknown. Part of that might be the various other widgets like memory controllers, PCI-E and other miscellaneous parts.
    I've seen other pictures labelling that region as containing those.
     
    #80 3dilettante, Apr 21, 2009
    Last edited by a moderator: Apr 21, 2009