Larrabee at Siggraph

Discussion in 'Architecture and Products' started by nAo, Jun 2, 2008.

  1. Jawed

    Legend

    Joined:
    Oct 2, 2004
    Messages:
    11,708
    Likes Received:
    2,132
    Location:
    London
At the same time it seems like Larrabee takes heed of some of the techniques for using its L2 cache from Cell. With Cell the programmer divvies up LS into a number of regions for different types of access. The L2 cache control instructions in Larrabee sound like they're doing much the same - treating L2 as "directly addressable" and allowing the programmer to specify the size and behaviour of regions.

    The paper gives an example of marking some L2 cache lines as low-priority so that a thread can stream data through L2 without trampling all over other lines.

    Jawed
     
  2. 3dilettante

    Legend Alpha

    Joined:
    Sep 15, 2003
    Messages:
    8,579
    Likes Received:
    4,799
    Location:
    Well within 3d
    Another thing is that the tiling scheme for rasterizing is designed to keep as much as possible local to a core's L2 with minimal writing to common structures.
    It's somewhat ironic that all the noise was "coherent caching!!!!111!!!!" and Intel's eventual qualifier was "now try not to use it".

    Not knowing details, I wonder about the exact implementation. Is it setting a particular cache line, or setting whatever cache line is hit by a special load/prefetch?

    The latter involves less work and less intelligence on the part of the cache controller, but is Larrabee able to specify that demand loads be done in this manner, or only prefetches?
     
  3. nAo

    nAo Nutella Nutellae
    Veteran

    Joined:
    Feb 6, 2002
    Messages:
    4,400
    Likes Received:
    440
    Location:
    San Francisco
    I don't find it ironic at all. When coherency is low-hanging fruit and it improves performance a lot while keeping your overall software-renderer architecture highly scalable and future-proof, it would be idiotic not to exploit it.
     
  4. 3dilettante

    Legend Alpha

    Joined:
    Sep 15, 2003
    Messages:
    8,579
    Likes Received:
    4,799
    Location:
    Well within 3d
    It actually does meet the definition of irony when everyone is fed the idea that coherent caching is an unqualified second coming and it is then shown that it has to be used judiciously.
     
  5. aaronspink

    Veteran

    Joined:
    Jun 20, 2003
    Messages:
    2,641
    Likes Received:
    64
    FYI, I already have a 3-slot cooler and I installed it myself. It gives me 2x+ better cooling at <23 dB of noise, something that just isn't possible with 2-slot cooling. It also enables me to get ~40% more performance.
     
  6. Enforcer

    Newcomer

    Joined:
    Apr 17, 2008
    Messages:
    32
    Likes Received:
    0
    Some thoughts:
    Incoherent reads are up to 16 times slower than on Nvidia.
    (Coherent reads can be slower on Nvidia if data isn't partitioned properly, but that's highly unlikely: why would anyone use shared memory that way? Nvidia supports one broadcast per cycle, i.e. many threads reading the same address.)

    Is it 33% of logic or (logic+L1+L2)?

    Intel's Atom has 47 million transistors with 512 KB of L2, a 4 W TDP, and speeds up to 2.2 GHz. What can we expect from a 45nm Larrabee?
    2 Tflops = 32 cores x 2 GHz isn't impossible...

    Simulated FEAR performance looks very promising (~120 average FPS 1600x1200x4AA at 1GHz x24 cores, ex. GTX280 has avg 140fps in fear internal benchmark),
    however 50% time wasted on Rasterization+DepthTest (rendering shadow volumes i guess) looks kinda scary, isnt it?
    Too bad the results aren't directly comparable as we dont know which frames were used...

    Ring bus:
    Only 128 GB/sec at 2 GHz??
    That's lower than the memory bandwidth of current cards (140 GB/s on the GTX 280).
    What about scaling beyond 24-32 cores? Can they increase the bus width in the future?
    R600 had a 512+512 ring bus too...

    Nothing was told about multi-chip communication and scaling...
     
  7. 3dilettante

    Legend Alpha

    Joined:
    Sep 15, 2003
    Messages:
    8,579
    Likes Received:
    4,799
    Location:
    Well within 3d
    After-market or some kind of special edition card I haven't seen yet?
     
  8. Jawed

    Legend

    Joined:
    Oct 2, 2004
    Messages:
    11,708
    Likes Received:
    2,132
    Location:
    London
    Larrabee can hide the latency of incoherent reads as it uses L1 to accumulate the data before the thread resumes. NVidia takes incoherent reads on the chin.

    The price Intel pays appears to be fully manual context switching, though, with VPU cycles lost - so Larrabee can't actually hide all that latency.

    So, overall, it looks like we'll just have to benchmark it :lol:

    It seems to me that a core is defined as logic+L1+L2. Each L2 is only used by its core. Cores can only access foreign L2s under the cache-coherency protocol, which is effectively a request to fetch data to make a local copy.

    If you take account of the fact that a ring bus normally supports multiple packets per direction per clock (between non-overlapping start-end segments) then you get more bandwidth. Also, the average trip length per direction is rather less than half the circumference.

    Interestingly, with the huge amount of bandwidth that Larrabee saves in render target operations, it means that texturing will take up a far larger proportion of the overall bandwidth of each frame than we see on FF GPUs.

    So the TUs, while they're likely to be equally distributed around the ring, will also incur the highest average ring-bus trip lengths. They'll be fetching texels from all MCs and providing results back to all cores. That's my impression, anyway.

    The TUs have their own cache. I dare say I'm assuming this cache is distributed per TU - though they're likely to be able to share texels amongst themselves.


    So, the TUs will be using the ring bus pretty heavily and lowering the effective bandwidth somewhat because of the relatively long trips they'll incur - and per texture result that is:
    • a request packet from the core
    • a TU cache coherency request and response, if the TUs share texels
    • a TU fetch command to multiple MCs (fetch + pre-fetch)
    • texels fetched from memory by multiple MCs
    • texture results returned to requesting core
    Jawed
     
  9. 3dilettante

    Legend Alpha

    Joined:
    Sep 15, 2003
    Messages:
    8,579
    Likes Received:
    4,799
    Location:
    Well within 3d
    Possible, but with the caveat that Intel up until now has not considered the L2 to be part of the core for its CPUs, although this is possibly changing with Nehalem.

    The Larrabee paper doesn't quite define the relationship, but some of the wording seems to indicate that in the authors' minds there is a distinction of sorts.
     
  10. aaronspink

    Veteran

    Joined:
    Jun 20, 2003
    Messages:
    2,641
    Likes Received:
    64
    The Thermalright HR-03 cooler. Aftermarket, of course; I haven't seen a graphics card shipped with anything but the most basic low-cost design in forever. As a general rule, the vendor-supplied coolers are some of the worst-designed and worst-built things in the heatsink universe: horrible quality control, internally bent/blocked fins, missing solder, etc.
     
  11. MfA

    MfA
    Legend

    Joined:
    Feb 6, 2002
    Messages:
    7,610
    Likes Received:
    825
    The headaches it causes are irrelevant. It's about cost/benefit, and being consistently slow is never an advantage. Sure, cache vs. shared memory is only an apples-to-apples comparison when your algorithm has a local data set which fits in shared memory ... but I assumed that much was obvious.
    You could argue that, but it's not a tenable argument. Let's say for a moment that Larrabee could implement its cache as banked with zero area overhead ... would you still argue they shouldn't, just because you are too lazy to make use of it?
    In the end a multi-banked architecture will always be as fast or faster; no amount of tweaking your data layout or other software tricks can change that.
     
  12. Kaotik

    Kaotik Drunk Member
    Legend

    Joined:
    Apr 16, 2003
    Messages:
    10,244
    Likes Received:
    4,462
    Location:
    Finland
    Not sure if it's been mentioned here already, but Larrabee utilizes some sort of ring bus, apparently similar to what ATI used.
    A few slides from the presentation at IDF showed it (images not preserved here).

    According to Larry Seiler, the first products will appear in late 2009 or 2010, and the first product will be a PCI Express add-in card, which will work alongside a CPU and possibly a GPU.
     
  13. Andrew Lauritzen

    Andrew Lauritzen Moderator
    Moderator Veteran

    Joined:
    May 21, 2004
    Messages:
    2,629
    Likes Received:
    1,227
    Location:
    British Columbia, Canada
    I'm just saying it's not as simple as what you're saying. Indeed in my experience getting data into local memory is usually more of an issue than operating on it there, and generally that's the step where you do your data conversions, coalescing, etc.

    In reality though everything has a cost, be it hardware or otherwise. Typically I'd prefer to have a larger cache/local memory, and thus more chance of fitting bigger blocks of the problem into it, than too much cleverness like banking, etc. And seriously, optimizing CUDA has nothing to do with laziness... it is quite literally beyond the ability of mere mortals, as several recent papers have demonstrated :)

    Sure, but nowadays it's all about the hardware cost/trade-off, and no argument is very useful (particularly when comparing architectures) without that component. Honestly it's not useful to say things like "Core 2 is faster than Atom"... clearly the actual hardware trade-offs matter.

    In this case, while it's fun to imagine an architecture with a different bank for every byte (or even bit!) of memory, I can't see that being a good use of transistors ;) Thus given the types of applications that I see, I don't think a multi-banked memory architecture would be a huge win, and it's probably best to spend the transistors elsewhere (such as on more local memory). But hey, I'm happy to be proven wrong on that one, and I certainly don't know the details of how much hardware it takes :)
     
    #513 Andrew Lauritzen, Aug 22, 2008
    Last edited by a moderator: Aug 22, 2008
  14. aaronspink

    Veteran

    Joined:
    Jun 20, 2003
    Messages:
    2,641
    Likes Received:
    64
    While we're going off into la-la land, let's make the local store infinite in size as well.

    There are LOTS of cases where the local store will be slower. And guess what, local stores never scale either. Local stores are as much of a dead end on GPUs as they were on CPUs.
     
  15. TimothyFarrar

    Regular

    Joined:
    Nov 7, 2007
    Messages:
    427
    Likes Received:
    0
    Location:
    Santa Clara, CA
    How about we take this to a practical example, like optimally sorting 16 million objects (say {sort key, object id} pairs). Does the cache (Larrabee) vs banking+local store (NVidia) argument still apply?

    Assuming the same number of registers, if the compute-to-cache or compute-to-local-store ratios are 1.5 to 2.0 in favor of Larrabee, how much of that advantage is lost to poor cache utilization, and which architecture ends up with better performance in terms of utilization of uncached bandwidth to main memory?
     
  16. crystall

    Newcomer

    Joined:
    Jul 15, 2004
    Messages:
    149
    Likes Received:
    1
    Location:
    Amsterdam
    Well, the whole description of their software renderer screams "there's no free lunch". The synchronization between the setup and work threads is done without using real thread-synchronization primitives; however, it is doable only because the 4 threads (1 setup and 3 workers) have been physically pinned to a specific core (or at least, that's how I understand their description). They probably use compare-and-exchange instructions without a LOCK prefix for this purpose.

    In other words, Larrabee is a general-purpose solution, but it cannot be used in the usual way when doing a GPU's work if you want it to perform decently.

    So-called non-temporal loads/stores have already been used, besides the usual non-temporal prefetches. The mechanism is simple: the cache line involved in the transfer has its (pseudo-)LRU counter set to a value which makes it automatically eligible for eviction on the next cache miss. I guess we'll see something along these lines on Larrabee too.
     
  17. crystall

    Newcomer

    Joined:
    Jul 15, 2004
    Messages:
    149
    Likes Received:
    1
    Location:
    Amsterdam
    I'd take those results with a *massive* grain of salt, for two reasons: first, they are simulated, and second, most of the data needed to evaluate them is missing. How many TUs were simulated? Which layout was used for the ring network above 16 cores? The details on it are sketchy (multiple short linked rings)... Was AF used? It is not mentioned, so my guess is no. What are the supposed die area and power consumption of the various setups (8, 16, 24 cores, etc.)?

    Another thing which bugs me is this quote from the paper when describing how they obtained the simulated results:

    "We wrote assembly code for the highest-cost sections"

    To me it is crystal clear that all the stages of the renderer will go through a JIT compiler, not only the shaders, as all the functionality that has been moved from hardware to software requires JIT compilation. That quote suggests that they do not yet have such a compiler in place, or that its output is not good enough - something I find hard to believe, as Intel has an excellent compiler team - but in both cases it seems to me that their software stack still lacks some *huge* parts, and we cannot easily guess at the potential performance of Larrabee. We simply do not have enough data.
     
  18. MfA

    MfA
    Legend

    Joined:
    Feb 6, 2002
    Messages:
    7,610
    Likes Received:
    825
    If you think you can give a realistic estimate of the area cost, by all means ... contribute something useful.

    Having multiple banks is not about local store vs cache ... it's orthogonal (works with both).
     
  19. Jawed

    Legend

    Joined:
    Oct 2, 2004
    Messages:
    11,708
    Likes Received:
    2,132
    Location:
    London
    A cache will have to be multi-banked just to support multiple concurrent reads and writes, won't it?

    Jawed
     
  20. MfA

    MfA
    Legend

    Joined:
    Feb 6, 2002
    Messages:
    7,610
    Likes Received:
    825
    You are confusing ports and banks I think.
     