Larrabee at Siggraph

Discussion in 'Architecture and Products' started by nAo, Jun 2, 2008.

  1. aaronspink

    Veteran

    Joined:
    Jun 20, 2003
    Messages:
    2,641
    Likes Received:
    64
Almost all caches of any reasonable size have multiple banks; the issue is how much those banks are exposed. The more exposed they are, the higher the latency, and the greater the area and power impact as well.

But fundamentally, anything you can do with a local store can be done equally well with a cache. The whole local store architecture of CELL is one of its biggest drawbacks. In general, local stores significantly complicate programming as well.
     
  2. MfA

    MfA
    Legend

    Joined:
    Feb 6, 2002
    Messages:
    7,610
    Likes Received:
    825
    I wonder how they do the texture cache ... will they simply add a couple of cycles of extra latency so they can merge read accesses or will they use multiple banks with narrow ports?
     
  3. Gubbi

    Veteran

    Joined:
    Feb 8, 2002
    Messages:
    3,661
    Likes Received:
    1,114
    It's certainly different.

If one looks at an SPE through 'normal' programming model glasses, an SPE is just a processor with a two-level register file, the local store forming the second level, and no cache. I suspect a future CELL would do well to have a big fat shared cache in front of the memory interface. I still don't think it would beat processors with normal cache hierarchies.

    Cheers
     
  4. 3dilettante

    Legend Alpha

    Joined:
    Sep 15, 2003
    Messages:
    8,579
    Likes Received:
    4,799
    Location:
    Well within 3d
    The buzz prior to the release of details screamed something a little different.
I believe the memory model is a good one, but the naive parroting of sound bites about it was nearly as silly as the "OMG realtime raytracer!!!" fluff articles floating around the net.
     
  5. nAo

    nAo Nutella Nutellae
    Veteran

    Joined:
    Feb 6, 2002
    Messages:
    4,400
    Likes Received:
    440
    Location:
    San Francisco
SPU patents describe such a cache, though no CELL variant out there has one, afaik.
On the other hand, with or without a cache the programming model would still be the same, unless the SPU ISA gets extended.
     
  6. Andrew Lauritzen

    Andrew Lauritzen Moderator
    Moderator Veteran

    Joined:
    May 21, 2004
    Messages:
    2,629
    Likes Received:
    1,227
    Location:
    British Columbia, Canada
    That's an extremely difficult question to answer in any sort of precise way you know ;) Not only does it depend entirely on the algorithm used (and indeed you might use a different algorithm for it on different architectures), but if you take it all the way to database-land where only IO matters (which is not going to be an unreasonable model in the future I think), the complexities are entirely dependent on your ability to touch non-local data as infrequently as possible. In this land, large caches/local stores win hands-down as increasing block sizes reduces the number of "passes" (in GPU parlance) over the data set.
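A toy sketch of that pass-count argument (my own illustration, not from the post): if each pass can reduce every block of in-core items to one partial result, a bigger block directly cuts the number of passes over the data set.

```python
import math

def passes(n_items, block_size):
    """Number of full passes over the data when each pass reduces
    every block of `block_size` items to a single partial result."""
    p = 0
    remaining = n_items
    while remaining > 1:
        remaining = math.ceil(remaining / block_size)
        p += 1
    return p

# 16M items: a much larger in-core block drops a whole pass
print(passes(1 << 24, 256))    # small local store/cache -> 3 passes
print(passes(1 << 24, 65536))  # larger on-chip block    -> 2 passes
```

For purely IO-bound work, each saved pass is a full read (and often write) of the out-of-core data set, which is why large caches/local stores win in that model.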

    The concept here is similar to many algorithms actually: you want to extract the "minimum" amount of parallelism out of your problem to keep all of the cores busy in general, and run the serial algorithm on the in-core data set. This is precisely how reductions, scans, segmented scans, sorts and many other primitives are implemented on parallel processors, and that's entirely what you're doing when you use CUDA's local store in most cases. It's just that the programming model has you looking out from inside some nested loops so it doesn't make it clear that you're really just writing SIMD code over a block of data in local store.
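A minimal sketch of that pattern (my own, not from the post; plain Python threads standing in for cores): extract just one block per core, run the serial algorithm on each in-core block, then combine the partials serially.

```python
from concurrent.futures import ThreadPoolExecutor

def block_sum(block):
    # the serial algorithm, run over the "in-core" data set
    s = 0
    for x in block:
        s += x
    return s

def parallel_sum(data, n_cores=4):
    # extract the "minimum" parallelism: one block per core
    size = -(-len(data) // n_cores)  # ceil division
    blocks = [data[i:i + size] for i in range(0, len(data), size)]
    with ThreadPoolExecutor(max_workers=n_cores) as pool:
        partials = list(pool.map(block_sum, blocks))
    return block_sum(partials)  # final serial combine of partials

print(parallel_sum(list(range(1000))))  # 499500
```

This is the same shape as a CUDA block reduction: the inner loop is the serial code over the local-store block, and the outer structure only exists to keep the cores busy.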

    So while multi-banked memories like CUDA's local store are useful, so are bigger caches, more ALUs and any number of other things that you could spend the transistors on :).
     
  7. TimothyFarrar

    Regular

    Joined:
    Nov 7, 2007
    Messages:
    427
    Likes Received:
    0
    Location:
    Santa Clara, CA
In the terms you have described, it seems to me that for similar ALU capacity between Larrabee and, say, an NVidia-style GPU, Larrabee might only have a 1.5 to 2.0x size advantage in the in-core data set, and be at a disadvantage in terms of utilization of out-of-core data bandwidth. So I'm still skeptical of this idea that Larrabee is going to be a huge win. For the problems I would like to use Larrabee to solve, it still seems as if out-of-core bandwidth utilization is more important.

    Perhaps I'm way off base here, but taking my other simple example, I don't really expect a huge win for Larrabee in overall time for sorting 16M elements on shipping hardware with similar ALU capacity.

However, there is one area in which it seems as if Larrabee might have quite an advantage, and that is general scatter where the scatter has some locality.

NVidia and ATI GPUs have a readable write-combining cache for each ROP/OM unit, right? It seemed as if NVidia at one time had plans to expose this surface cache in CUDA (.surf in the PTX spec), but that never materialized (and neither did programmable blending, for which it might have been used). If the Larrabee model (SIMD + scatter/gather + R/W caching) ends up being the best thing since sliced bread, it seems like other GPUs adopting an accessible surface cache might be a good evolution (to give bandwidth reduction on scatter). Of course latency could be far better on Larrabee, but the overall memory bandwidth reduction could be similar.
     
  8. Davros

    Legend

    Joined:
    Jun 7, 2004
    Messages:
    17,879
    Likes Received:
    5,330
  9. nAo

    nAo Nutella Nutellae
    Veteran

    Joined:
    Feb 6, 2002
    Messages:
    4,400
    Likes Received:
    440
    Location:
    San Francisco
It might be a case of history repeating. David Kirk gave a similar interview 3 or 4 years ago (we simulated a unified architecture, etc.) and we know how that ended. Perhaps we should expect them to release something very close to Larrabee in 2010 ;)
     
  10. Andrew Lauritzen

    Andrew Lauritzen Moderator
    Moderator Veteran

    Joined:
    May 21, 2004
    Messages:
    2,629
    Likes Received:
    1,227
    Location:
    British Columbia, Canada
    ... enough said about the intelligence of that article. ;)
     
  11. Scali

    Regular

    Joined:
    Nov 19, 2003
    Messages:
    2,127
    Likes Received:
    0
    Some things that jumped out at me in that article:

    I think this is a pretty weird statement to make. The information that Intel released was aimed at the point that they could scale better BECAUSE they reduced bandwidth requirements, among other things.

    This might go for nVidia, but Intel works within a different environment. For example, nVidia didn't decide to design the G80 10 years ago. The time wasn't right. nVidia didn't have the expertise yet, there was no major API that would require it (nor would the rest of the PC be up to the task of driving it), and it would be impossible to manufacture with the state of chip manufacturing at the time.

    Intel is ahead in chip manufacturing, and Intel has expertise in areas that nVidia doesn't have yet (and vice versa). So it might not be the right design for nVidia at this point. But it could be for Intel. It could also be the right design for nVidia some time in the future.

This is something that I've been saying as well. Cuda is the real power of G80 and beyond, and we have yet to see what AMD's answer to it will be.
     
  12. Jawed

    Legend

    Joined:
    Oct 2, 2004
    Messages:
    11,708
    Likes Received:
    2,132
    Location:
    London
  13. Jawed

    Legend

    Joined:
    Oct 2, 2004
    Messages:
    11,708
    Likes Received:
    2,132
    Location:
    London
  14. 3dilettante

    Legend Alpha

    Joined:
    Sep 15, 2003
    Messages:
    8,579
    Likes Received:
    4,799
    Location:
    Well within 3d
    The ring bus is physically situated on top of the L2s, which would save space. It might hint at some kind of rough correlation between bus width and L2 tile size. An L2 that didn't match or exceed the physical width of the bus would leave space that would need filling.

    I wonder if the ring bus sits a metal layer or two above the signal lines needed to signal the cache and link it to its local core.


    Other things to note:
    Intel's slide seems to indicate Larrabee's moved to MOESI.
This probably makes sense: with so many cores modifying data on that ring bus, it's better that the modifications not write back out over memory.
    This moves Larrabee:
    1) in an AMD direction
    2) in a direction not matching Intel's QPI

    There's a "td" box situated at the ends or at the crosslinks between ring busses. Some kind of directory?


There's a block for "fixed function" that is separate from the PCI-E, display unit, textures, and memory controller.
    I wonder what's left that they haven't mentioned yet.
     
  15. Jawed

    Legend

    Joined:
    Oct 2, 2004
    Messages:
    11,708
    Likes Received:
    2,132
    Location:
    London
    Isn't it normal for any kind of bus to run over logic - with only repeater logic consuming area in "islands"?

What would happen as process shrinks kick in? Would the L2 shrink more rapidly than the bus?

    Presumably also good since with core scaling you get multiple rings and ultimately, I guess, multiple chips linked together.

    No idea :sad:

    ---

    In rummaging around www.intel.com I found this paper on game physics:

    Game Physics Performance on the Larrabee Architecture

    Doesn't really provide any architecture insights though.

    Jawed
     
  16. 3dilettante

    Legend Alpha

    Joined:
    Sep 15, 2003
    Messages:
    8,579
    Likes Received:
    4,799
    Location:
    Well within 3d
    Cell's EIB doesn't route over the local stores. The EIB has a fair amount of dedicated logic sitting right in the center of the die.

    The coherent bus in Larrabee might have made the case for distributing it amongst the caches.

    That's what I'm curious about.
    SRAM compacts pretty well with process. Logic less so. Interconnect beyond the lowest levels scales more slowly, and the higher layers are at higher geometries.
    It might depend on just where the bus is running.

    There might be a design-specific inflection point where the work in compacting all the signal lines balances with the challenges of running it at speed versus the space savings of keeping the L2 physically small and the desire to have more L2.
     
  17. Jawed

    Legend

    Joined:
    Oct 2, 2004
    Messages:
    11,708
    Likes Received:
    2,132
    Location:
    London
    And probably "saves space" by having EIB control logic passed over by interconnects.

    All I'm saying is that in terms of the die as a whole, "space saving", by laying a bus over logic is normal.

Now it might be that laying a bus over RAM is the easiest configuration. I don't know; I suppose it's a question of the impact of repeater logic islands on L2 latency (due to increasing the radius of the L2).

    Jawed
     
  18. 3dilettante

    Legend Alpha

    Joined:
    Sep 15, 2003
    Messages:
    8,579
    Likes Received:
    4,799
    Location:
    Well within 3d
    I wouldn't expect them to route completely around the EIB logic, since the EIB logic deals with the interconnect directly. I don't have a high res shot of Cell's EIB section, but it looks like part of the path the signals go through is in the logic block that takes up die space.

    Routing the interconnect over the dedicated EIB logic is a bit different than lofting it over non-dedicated silicon.
    Perhaps Cell also routes some of its bus over other silicon, I haven't seen a diagram for that.

    But it's also not required. The fact that IBM aggregated the EIB's logic in one place instead of distributing it indicates there are other considerations.

    Given the likely size of the L2 caches and the relatively simple ring bus scheme, I'm not sure there would be enough to be prohibitive.
    My question is what happens when the SRAMs shrink, and if the ring bus will scale accordingly or if Intel will relax a bit and allow the cache capacity to go up to take pressure off of the ring bus designers.
     
  19. Jawed

    Legend

    Joined:
    Oct 2, 2004
    Messages:
    11,708
    Likes Received:
    2,132
    Location:
    London
    I've got a high res shot of Cell, but I don't know how it would help you to discern anything about the routing of the bus lines...

    A cache is also "non-dedicated silicon", so the question of whether any "area saving" applies is moot.

    I'm merely saying that it seems to be normal to route an interconnect over un-related logic.

    If the cache increases in capacity then that affects latency. Clearly, it's too early to tell how sensitive to cache latency Larrabee will be. Arguably, as the number of cores rises, any slight increase in L2 latency will be overwhelmed by ring-induced cache-coherency latency, and other scaling factors. So maybe it doesn't matter so much?

In the end it seems extremely unlikely to me that the bulk of the ring bus in Larrabee is formally restricted to the area of the die occupied by L2 - I don't think there's much value in assigning a 1:1 scaling question-mark over these two components (RAM + bus) of this subsystem. If the bus overspills the RAM in future, it will lower the density of the logic it flies over - but this is just another version of the scaling problem for physical I/O, where analogue stuff scales poorly.

    Jawed
     
  20. 3dilettante

    Legend Alpha

    Joined:
    Sep 15, 2003
    Messages:
    8,579
    Likes Received:
    4,799
    Location:
    Well within 3d
    That was my point. Cell has a contiguous area of the die devoted to the ring bus and its logic. I'm not privy to the details of the design, but I have a hard time accepting it is wholly made up of repeater blocks.

The dominant factor for that is area, and latency roughly scales with the square root of the physical area of the cache.
    If the SRAMs shrank, we'd expect better latency.
    If the cache capacity were expanded to give roughly equivalent area, we'd have the same latency with more capacity.
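A back-of-the-envelope sketch of that scaling (my own, assuming wire delay across the array dominates): latency tracks the cache's linear dimension, i.e. the square root of its area, which covers both cases above.

```python
import math

def rel_latency(rel_area):
    # wire delay across the array grows with its linear dimension,
    # i.e. roughly sqrt(area); latency relative to the baseline cache
    return math.sqrt(rel_area)

# shrink the SRAM to half the area -> ~0.71x the latency
print(round(rel_latency(0.5), 2))
# double the capacity at the same area-per-bit -> ~1.41x the latency
print(round(rel_latency(2.0), 2))
# grow capacity exactly as cells shrink (same area) -> same latency
print(round(rel_latency(1.0), 2))
```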

Perhaps I'm reading too much into the part of the slides that said that the ring bus is physically layered on top of the L2.

    It might require the redesign or rerouting of all the logic it flies over, possibly at the expense of poorer density in logic that already scales worse than SRAM.
    Depending on how large an L2 tile is compared to its directly linked compute core, the penalty may be worse if the logic expands.

The SRAMs might not require too many additional layers for their signalling; the more complex logic of the cores might have uses for the interconnect at the altitude of the ring bus, plus whatever margin of safety is needed to keep the two layers from interfering with one another.
     