Larrabee: Samples in Late 08, Products in 2H09/1H10

Discussion in 'Rendering Technology and APIs' started by B3D News, Jan 16, 2008.

  1. ArchitectureProfessor

    Newcomer

    Joined:
    Jan 17, 2008
    Messages:
    211
    Likes Received:
    0
    I agree that gumming up the caches could be a problem. One way to address that is to insert such scatter/gather lines into the cache in the "evict me next" (LRU) position within their set, greatly reducing the overall cache-thrashing impact.

    As for gumming up the interconnect: the on-chip interconnect has much more bandwidth than the off-chip DRAM, so once you've already gone off-chip, the on-chip interconnect can certainly handle it. That is, if you have a 1 TB/s on-chip interconnect and 0.1 TB/s of DRAM bandwidth, the most that DRAM traffic can gum it up is approximately 10%.

    Such a scatter/gather unit wouldn't save DRAM bandwidth, as modern GDDR DRAMs require bursts of 32B or more anyway.

    Putting all this together, I doubt that there will be any special scatter/gather units except those that sit behind the L1 caches. It just doesn't fit with Larrabee's philosophy, IMHO.
     
  2. Jawed

    Legend

    Joined:
    Oct 2, 2004
    Messages:
    11,714
    Likes Received:
    2,135
    Location:
    London
    Reminds me of:

    SIMD processor executing min/max instructions

    which is worth reading alongside:

    SIMD processor having enhanced operand storage interconnects

    SIMD processor and addressing method

    The latter is based upon a dual-ported memory [0037] (I'm assuming on-die). I guess this is the operand window which acts as a proxy for the register file and/or other sources of operands (constants, fetches from memory).

    Jawed
     
  3. Jawed

    Legend

    Joined:
    Oct 2, 2004
    Messages:
    11,714
    Likes Received:
    2,135
    Location:
    London
    Interestingly:

    http://www.qimonda-news.com/download/Qimonda_GDDR5_whitepaper.pdf

    So the implication is that the memory controller can increase bandwidth utilisation by sorting data into bank groups, making for a looser restriction than sorting by bank. It doesn't improve granularity, but it does mean that the MC will need a hierarchical-banking model to make best use of the bandwidth.

    I've no idea if this is new to GDDR5 or is a common concept - I got nowhere fast trying to determine if this is a feature of prior memory types in the PC space.

    Jawed
     
  4. Jawed

    Legend

    Joined:
    Oct 2, 2004
    Messages:
    11,714
    Likes Received:
    2,135
    Location:
    London
    For your final homework assignment tonight, from me :razz: go in search of the "post transform vertex cache" and "tile based deferred rendering"...

    Jawed
     
  5. nAo

    nAo Nutella Nutellae
    Veteran

    Joined:
    Feb 6, 2002
    Messages:
    4,400
    Likes Received:
    440
    Location:
    San Francisco
  6. Nick

    Veteran

    Joined:
    Jan 7, 2003
    Messages:
    1,881
    Likes Received:
    17
    Location:
    Montreal, Quebec
    Ct is a research project, and I'm not sure if it's very related to Larrabee (yet). They also focus on the software aspect, so saying that they will make use of scatter/gather when it's added to the ISA doesn't mean this is going to happen any time soon.

    But I hope I'm wrong. :D
     
  7. ArchitectureProfessor

    Newcomer

    Joined:
    Jan 17, 2008
    Messages:
    211
    Likes Received:
    0
    FWIW, Ct does seem to be related to Larrabee. There are little hints here and there (the paper that was linked shows a "future" 512-bit SIMD extension for example). I recall hearing that the Larrabee team was actually considering writing much of the Larrabee software in Ct, but I'm not sure what they ultimately decided.
     
  8. 3dilettante

    Legend Alpha

    Joined:
    Sep 15, 2003
    Messages:
    8,579
    Likes Received:
    4,799
    Location:
    Well within 3d
    That would reduce the cache contamination by a factor of 1/associativity.
    It's interesting to think of the implications of such a move.
    The cache controller's job would be slightly more complex, as it would have to suppress the update to the cache line's LRU status.

    The Larrabee slides put the memory bandwidth at 128 GB/sec and the ring bus at 256B/cycle.
    If we assume the top speed bin of 2.5 GHz and a full-speed interconnect, my math puts it at 640 GB/sec internally.
    That means DRAM traffic could consume 20% of the ring's bandwidth.
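Reproducing the arithmetic above (the 2.5 GHz clock is the assumed top speed bin, per the post):

```python
# Ring bus: 256 bytes/cycle at an assumed 2.5 GHz, vs. 128 GB/s of DRAM.
ring_bw = 256 * 2.5e9             # bytes/sec on the ring bus
dram_bw = 128e9                   # bytes/sec off-chip
assert ring_bw == 640e9           # 640 GB/s internal
assert dram_bw / ring_bw == 0.2   # DRAM traffic can occupy 20% of the ring
```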

    That's starting to edge towards noticeable for me.
    On top of granularity concerns and the dynamic load behavior of the ring bus, the 20% may better serve as a lower bound.

    I agree that this is likely the case, but I have a suspicion the consequences will be more noticeable.
     
  9. MfA

    MfA
    Legend

    Joined:
    Feb 6, 2002
    Messages:
    7,610
    Likes Received:
    825
    Explicit gather seems to me cheaper to implement than trying to support the huge amounts of outstanding prefetches you need to cover external memory latency for unpredictable access patterns (around 2 orders of magnitude more than what x86 processors currently support).
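A back-of-the-envelope check on that "two orders of magnitude" claim via Little's law (the latency and line size here are my assumptions, not figures from the thread): in-flight requests = bandwidth × latency ÷ request size.

```python
# Assumed numbers for illustration only.
bandwidth_Bps = 128 * 10**9   # DRAM bandwidth, the Larrabee slide figure
latency_ns    = 400           # assumed round-trip memory latency
line_bytes    = 64            # bytes per outstanding request

# Little's law: requests in flight needed to keep the pipe full.
in_flight = bandwidth_Bps * latency_ns // 10**9 // line_bytes
assert in_flight == 800   # vs. roughly 8-16 outstanding misses per x86 core
```

Hundreds of outstanding requests versus the handful of miss buffers a contemporary x86 core tracks is indeed about two orders of magnitude.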
     
  10. Bob

    Bob
    Regular

    Joined:
    Apr 22, 2004
    Messages:
    424
    Likes Received:
    47
    If all the ports are write ports, then yes. If you're only increasing read ports, it should be rather easy to grow the SRAM linearly (at least for "reasonable" numbers of ports). After all, you could just duplicate the SRAM array N times, once per read port.
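A toy functional model of Bob's point (illustrative, not a circuit): an SRAM with one write port and N read ports built from N replicated single-ported arrays. Writes are broadcast to every copy, so area grows roughly linearly in read ports while all N reads proceed in parallel.

```python
class ReplicatedSRAM:
    """N read ports via N copies of a single-ported array; writes broadcast."""
    def __init__(self, words, read_ports):
        self.copies = [[0] * words for _ in range(read_ports)]

    def write(self, addr, value):
        for copy in self.copies:     # broadcast keeps all copies coherent
            copy[addr] = value

    def read(self, port, addr):      # each read port owns its own array
        return self.copies[port][addr]

rf = ReplicatedSRAM(words=32, read_ports=3)
rf.write(5, 42)
assert [rf.read(p, 5) for p in range(3)] == [42, 42, 42]
```

Adding write ports is the expensive case, since every copy needs the extra port and the bit cells themselves grow.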
     
  11. ArchitectureProfessor

    Newcomer

    Joined:
    Jan 17, 2008
    Messages:
    211
    Likes Received:
    0
    Bob is 100% right on this. I actually mentioned something about this read-vs-write port issue earlier in this thread:

     
  12. TimothyFarrar

    Regular

    Joined:
    Nov 7, 2007
    Messages:
    427
    Likes Received:
    0
    Location:
    Santa Clara, CA
    Yep, that's exactly what I was describing: a tree reduction using vector MIN or MAX opcodes. But a 16-wide tree reduction would probably be at least 8 instructions because of the shuffling required.
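The shuffle-plus-MIN reduction can be sketched on a 16-lane vector modeled as a Python list: log2(16) = 4 levels, each one shuffle (here a rotation) followed by one lane-wise MIN, i.e. roughly 8 instructions on real SIMD hardware.

```python
def tree_min(v):
    """Butterfly reduction: after each level, lane i holds the min over a
    doubling-sized group; after 4 levels every lane holds the global min."""
    assert len(v) == 16
    stride = 8
    while stride >= 1:
        shuffled = v[stride:] + v[:stride]            # one shuffle op
        v = [min(a, b) for a, b in zip(v, shuffled)]  # one vector MIN op
        stride //= 2
    return v[0]

assert tree_min([9, 3, 7, 14, 2, 8, 11, 5,
                 6, 13, 1, 15, 4, 10, 12, 0]) == 0
```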

    I'm thinking not having a dedicated texture unit would be suicide in terms of performance. Some reasons being that you want the texture cache to work differently from a standard cache (?), and texture storage in memory is ordered for 2D locality, which is not efficient for a CPU to access.

    Interesting question. Guessing the concept is to run from fetching vertex shader results from the software vertex shader cache (or, if not cached, running the vertex shader and inserting into the cache) all the way through to the ROP in the same task. Leaving the complexity of the geometry shader out of this, I guess it would be possible, but what would the side effects be in terms of cache performance? You might end up trading locality of intermediate pipeline data for locality in texture units (in terms of L2). One plus here is that triangle data is probably already locally bunched for the vertex shader results cache.

    Getting back to the vertex through pixel in one task, you would need to be able to send out tasks to different cores for obvious stuff like rendering a full screen quad (2 triangles) which would shade all pixels on the screen. So primitive through ROP in one task quickly breaks down when triangle size varies.

    I believe pixel and vertex shaders overlap in execution; the pixel shader starts as soon as it can be ensured that it will not stall on vertex shader results. BTW, there was a really good talk (downloadable MP3) at some university website from a guy at NVidia going over shader scheduling. I don't have the link anymore; perhaps someone else knows and could post it...
     
  13. Jawed

    Legend

    Joined:
    Oct 2, 2004
    Messages:
    11,714
    Likes Received:
    2,135
    Location:
    London
    Or maybe it's actually a 3D locality that you want to implement:

    http://forum.beyond3d.com/showpost.php?p=1121304&postcount=319

    I expect it isn't difficult to map texture data into "CPU caches", particularly as this is a CPU that's being optimised for 2D data structures such as textures and render targets, with some "graphics" helper instructions.

    http://courses.ece.uiuc.edu/ece498/al1/

    I think you might be referring to Shebanow's presentation from the prior year:

    http://courses.ece.uiuc.edu/ece498/al1/Archive/Spring2007/Syllabus.html

    I don't know if he presented again this year.

    Jawed
     
  14. MfA

    MfA
    Legend

    Joined:
    Feb 6, 2002
    Messages:
    7,610
    Likes Received:
    825
    Either way you want hardware translation of linear coordinates to tile coordinates.

    Having a separate specialized texture cache makes about as much sense as having a specialized instruction cache.
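The linear-to-tile translation MfA mentions can be sketched like this, assuming 4x4-texel tiles (the tile size is illustrative): texels within a tile are stored contiguously, so a 2D neighborhood touches few cache lines.

```python
TILE = 4   # 4x4 texels per tile (assumed, for illustration)

def tiled_offset(x, y, width_in_tiles):
    """Translate linear (x, y) texel coordinates to a tiled memory offset."""
    tile_x, in_x = divmod(x, TILE)
    tile_y, in_y = divmod(y, TILE)
    tile_index = tile_y * width_in_tiles + tile_x
    return tile_index * TILE * TILE + in_y * TILE + in_x

# All 16 texels of the top-left 4x4 block land in the same 16-texel tile:
tiles = {tiled_offset(x, y, 16) // 16 for x in range(4) for y in range(4)}
assert tiles == {0}
```

In row-major linear order, the same 4x4 block would straddle four widely separated rows of memory.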
     
  15. TimothyFarrar

    Regular

    Joined:
    Nov 7, 2007
    Messages:
    427
    Likes Received:
    0
    Location:
    Santa Clara, CA
    Jawed, thanks for posting the link.

    It is somewhat ironic that the x86 ISA is "hardware rich" (providing something like 9 or more addressing modes and merged MEM/ALU instructions) compared to other ISAs like PPC/MIPS/Alpha, while Larrabee is going for the "hardware sparse" approach with respect to GPU functionality yet still keeping the x86 ISA... I realize that the x86 ISA now is really just a form of instruction-stream compression for a RISC backend, but I'm still surprised that Intel doesn't seem to be using Larrabee as an opportunity to clean up instruction decode and go with a clean 32-bit-per-opcode instruction set. Without out-of-order execution and register renaming, it seems like x86 loses a lot of its advantages...
     
  16. 3dilettante

    Legend Alpha

    Joined:
    Sep 15, 2003
    Messages:
    8,579
    Likes Received:
    4,799
    Location:
    Well within 3d
    That's a rather positive spin to things.
    One person's hardware-rich is somebody else's pointless cruft.

    The ISA isn't a first-order factor when determining performance, so it's not an immediate killer, it's just kind of ugly.
    The reasons are more market and marketing-based, anyway.

    There is no technical reason why x86 is particularly compelling for graphics, and Larrabee's success would simply mean the imposition of decades-worth of architectural baggage on a market that has little need for it.
    If x86-graphics becomes dominant, we'll be seeing shader programs 10 years in the future with coding determined in part by architectural decisions for programs that predate memory-mapped IO.
     
  17. Barbarian

    Regular

    Joined:
    Jun 27, 2005
    Messages:
    289
    Likes Received:
    15
    Location:
    California, USA
    It will be even more impressive if the address generation instructions have parameterized (or at least programmable) masks. Swizzled textures could be implemented as a simple "combing" of two registers (ie similar to what you do for FFTs). Such a capability could open the doors to a lot of tricks not limited to texture mapping.
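One way to read that "combing" idea (my interpretation, not necessarily Barbarian's): building a Morton/Z-order texture address by interleaving the bits of x and y through mask-and-shift steps, exactly the kind of operation a parameterized-mask address generator could do in hardware.

```python
def part1by1(n):
    """Spread the low 16 bits of n so a zero bit sits between each."""
    n &= 0xFFFF
    n = (n | (n << 8)) & 0x00FF00FF
    n = (n | (n << 4)) & 0x0F0F0F0F
    n = (n | (n << 2)) & 0x33333333
    n = (n | (n << 1)) & 0x55555555
    return n

def morton(x, y):
    """Interleave (comb) the bits of x and y into a Z-order address."""
    return part1by1(x) | (part1by1(y) << 1)

assert morton(1, 0) == 1
assert morton(0, 1) == 2
assert morton(3, 3) == 15   # the 4x4 block (0..3, 0..3) maps to offsets 0..15
```

Each masking constant here is exactly the kind of parameter a programmable-mask address unit would expose.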
     
  18. Barbarian

    Regular

    Joined:
    Jun 27, 2005
    Messages:
    289
    Likes Received:
    15
    Location:
    California, USA
    One could claim that small register files and a mem-op instruction set IS the way to go with heavy multi-threading and 4x SMP. Combined with a 1-cycle L1, you pretty much get a 32 KB indexable register file with low context-switch overhead. So, IMHO, the x86 legacy here is all about coherent memory and massive SMP, plus 20 years of compiler improvements.
    Larrabee's graphics capabilities will live or die based on its vector unit, which as we know is NOT x86.

    To paraphrase, to some this is pointless cruft, to others this is the most clever use of legacy technology to date.
     
  19. 3dilettante

    Legend Alpha

    Joined:
    Sep 15, 2003
    Messages:
    8,579
    Likes Received:
    4,799
    Location:
    Well within 3d
    The rumored size of the vector register file would seem to negate the small register file point.

    Combined with an L1 shared by 4 threads, you get a register file where you pray the entries are still there when the thread next comes around. That register file is hidden behind layers of variable and possibly optional indirection and can prompt any number of exceptions that must be checked or trapped by the silicon.

    It's not ideal, and it could be implemented just as you say it without x86.

    Destructive operands, string operations, an awkward ring-based software permissions architecture, IO instructions, multiple memory addressing modes, funky flag registers, variable instruction length, not-entirely-general-purpose general purpose registers.

    Then Larrabee lives or dies based on the one unit of its architecture not covered by 20 years of compiler improvements.

    The most clever use seems to be, by your statements, in leveraging this non-x86 vector unit.
    I must be jaded when I'm not impressed by x86 getting tacked onto anything with a clock generator.

    I've heard of designers stating that in a perfect world, they'd go clean slate with regards to x86.
    I'm not aware of many saying it's a dream come true that they get to implement it for another two decades.

    x86 isn't pure poison, but it certainly isn't the best there could be.
     
  20. Barbarian

    Regular

    Joined:
    Jun 27, 2005
    Messages:
    289
    Likes Received:
    15
    Location:
    California, USA
    There are a lot of designs that did go with a clean slate and none of them managed to beat x86. Even Intel's own Itanium failed miserably.
    Don't get me wrong, I'm the last one in the world to say I love x86. But I think it would be a mistake to ignore the history of so many failed clean-slate designs.
     