Larrabee: Samples in Late 08, Products in 2H09/1H10

Discussion in 'Rendering Technology and APIs' started by B3D News, Jan 16, 2008.

  1. Gubbi

    Veteran

    Joined:
    Feb 8, 2002
    Messages:
    3,661
    Likes Received:
    1,114
    You'll still end up with an instruction that can access 16 different places in memory, with really bad worst-case behaviour. Scatter stores are problematic: a partially completed instruction will have modified memory (i.e. be visible) before completion. What if one of the last partial stores segfaults?

    And with the method I suggested you would hit the cache once for each different 512-bit location (1-16 accesses). Importantly, you would only do this when you need gather/scatter, not when crunching dense matrices.

    Adding ports to your cache adds latency (or power) and logic, which will cost elsewhere.

    Cheers
     
  2. 3dilettante

    Legend Alpha

    Joined:
    Sep 15, 2003
    Messages:
    8,579
    Likes Received:
    4,799
    Location:
    Well within 3d
    The thing is that nobody makes an L1 cache capable of that many accesses (4 read + 4 write or 4 read/write) at the capacity and latency shown on the Intel slides.

    Itanium services FP loads from the L2, and that cache is capable of 4 reads, but it takes much longer latency-wise.
     
  3. Nick

    Veteran

    Joined:
    Jan 7, 2003
    Messages:
    1,881
    Likes Received:
    17
    Location:
    Montreal, Quebec
    Ok, I think that concludes my questions about the hardware cost. Thanks!

    Hopefully converting everything to SoA or serializing small memory accesses isn't going to hurt Larrabee's performance too much then.
     
  4. TimothyFarrar

    Regular

    Joined:
    Nov 7, 2007
    Messages:
    427
    Likes Received:
    0
    Location:
    Santa Clara, CA
    Agreed, but I don't think Larrabee will use the old SSE style of gather, which for SSE takes:

    4x load scalar (into low slot of vector)
    2x move low to high (merges 2 pairs)
    1x shuffle
    --
    7 instructions to gather 4 values into a vector

    But rather something more like SSE4-style INSERTPS, as Nick was saying:

    4x insertps to gather 4 values into a vector
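    The seven-instruction sequence above can be spelled with SSE intrinsics roughly as follows (an illustrative sketch only; `gather4` is a hypothetical helper, and the merge steps here use two unpacks plus a register move, one common way to realize the "move low to high" + shuffle pattern):

```c
#include <immintrin.h>

// Classic 7-instruction SSE gather: four scalar loads, two unpacks to
// merge pairs, one move to combine the halves into a0 b0 c0 d0.
__m128 gather4(const float *base, const int idx[4])
{
    __m128 a = _mm_load_ss(base + idx[0]);   /* load scalar into low slot */
    __m128 b = _mm_load_ss(base + idx[1]);
    __m128 c = _mm_load_ss(base + idx[2]);
    __m128 d = _mm_load_ss(base + idx[3]);
    __m128 ab = _mm_unpacklo_ps(a, b);       /* merge pair: a0 b0 . .   */
    __m128 cd = _mm_unpacklo_ps(c, d);       /* merge pair: c0 d0 . .   */
    return _mm_movelh_ps(ab, cd);            /* combine:    a0 b0 c0 d0 */
}
```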

    Unfortunately, if it takes 16 vector ALU cycles (+ any memory stalls) to gather, gather is effectively still useless for most algorithms (as it is with SSE), with the exception that Larrabee's texture units are bound to be able to gather from textures. While I think Larrabee will be a really interesting and fun platform to program for (and there is a good possibility I will be programming on Larrabee), I think it is an illusion that Larrabee will be "easy" to program for when going for performance, for the following reasons (some I have mentioned before):

    1.) 4 in-order threads sharing the same cache (not sure if this has been confirmed or denied yet). Will have to ensure tasks running on those threads have the same locality. If threads per core changes in future product revisions, algorithms might have to be adjusted.

    2.) 16-wide SIMD. Will have to ensure algorithms use SoA with array sizes at a multiple of 16 and fully aligned memory accesses. Which really makes things like ray tracing secondary rays (i.e. divergent ones) not a good match for Larrabee.

    3.) Possible mix of texture unit and general-purpose cache sharing the L2 (not sure if this has been confirmed or denied yet either). Will need to have knowledge of how (or whether) the processor balances the needs of the texture units against the general L1, or design the algorithm with this in mind.

    4.) Like CPUs, intimate knowledge of the cache design will be important performance-wise for data structure and algorithm design. This comes with the catch that if the cache design changes in future revisions of the product line, algorithms might have to be recoded.

    5.) Will need a good solution for distributing tasks on chips with a varying number of multiprocessors (not hard), but with very good ways to control the ordering and synchronization of tasks so that you can ensure a good framerate (can be a complex problem).
     
  5. Andrew Lauritzen

    Andrew Lauritzen Moderator
    Moderator Veteran

    Joined:
    May 21, 2004
    Messages:
    2,629
    Likes Received:
    1,227
    Location:
    British Columbia, Canada
    But Larrabee is supposed to usher in the new era of real-time raytracing so that we can finally get rid of dirty rasterization!

    :D Sorry, couldn't resist ;)
     
  6. aaronspink

    Veteran

    Joined:
    Jun 20, 2003
    Messages:
    2,641
    Likes Received:
    64
    The main thing to realize about scatter/gather in any modern system is that its primary purpose is to free execution resources, not necessarily to decrease the complexity or latency of the load/store operation.

    In fact I'm not sure that generalized scatter/gather is actually that beneficial vs. limited stride or constant repeating base+offset. See the Alpha vector extensions from the Alpha Tarantula paper.

    As far as segfaults go, I'm not sure that's too much of an issue with graphics and physics.

    Aaron Spink
    speaking for myself inc.
     
  7. aaronspink

    Veteran

    Joined:
    Jun 20, 2003
    Messages:
    2,641
    Likes Received:
    64
    Bah, I want REYES! I'm sure you are aware of the Pixar work on adding RT to PRMan. Seems like a very elegant solution to all the problems. Of course, I don't think realtime graphics will be doing 1 million cycle shaders anytime soon...

    Aaron Spink
    speaking for myself inc.
     
  8. TimothyFarrar

    Regular

    Joined:
    Nov 7, 2007
    Messages:
    427
    Likes Received:
    0
    Location:
    Santa Clara, CA
    Good point Aaron. Still very sad the Alpha had to die (it was my favorite CPU arch).

    However, if Larrabee had generalized scatter/gather which didn't take ALU ops (i.e. like the Tarantula design), then scatter/gather would be very useful (especially for non-polygon rendering). The question is: can background scatter/gather be done with 16/24 cores on one chip? And can 4 in-order threads have enough ALU work to hide the latency of scatter/gather behind ALU ops in general?

    From the Alpha Tarantula arch paper,

    "Tarantula employs a conflict-resolution unit, also described below, that sorts gather/scatter addresses into bank conflict-free buckets. Then, these sorted addresses can be sent as normal vector requests to the L2 cache."

    Which is basically what G8x/G9x does in CUDA if you fetch from non-aligned/non-coalesced locations, except fetches are either from non-cached memory or direct from the program-managed per-core local store (which has a free swizzle). The G8x/G9x also has enough threads to hide the fetch cost and keep the ALUs busy in the meantime.

    Intel's x86 on the other hand has had a history of using ALU cycles to scatter/gather. So if they break this for Larrabee I will be surprised and very happy indeed!
     
  9. Nick

    Veteran

    Joined:
    Jan 7, 2003
    Messages:
    1,881
    Likes Received:
    17
    Location:
    Montreal, Quebec
    One of the purposes of a low-latency gather instruction that I was thinking of is to be able to implement transcendental functions with table lookups. I've spent quite a bit of time trying to compute them efficiently with SSE using only arithmetic operations, but nothing beats having tables of accurate values you can linearly or quadratically interpolate from.

    They could add execution units for exp, log, sin, cos, sqrt and rsq, or they could spend the transistors on a gather instruction and implement all these and more (e.g. hyperbolic and inverse trigonometric functions) with very few instructions.

    There's also a lot of use for LUTs beyond 3D. ;)
     
  10. Gubbi

    Veteran

    Joined:
    Feb 8, 2002
    Messages:
    3,661
    Likes Received:
    1,114
    I agree, and that is why I suggested the "parallel insert" above. Have an instruction that computes whether/where to stuff values into a wide SIMD register based on the addresses in the index vector, and loop until you've gathered all the values (best case: all the values you need are gathered from the same 512-bit block/vector; worst case: only one value loaded per 512-bit block).

    For stuff with great locality, like texture access, you don't pay the price of issuing 16 load requests. For stuff without locality you do, but hey, lack of coherence costs.

    Larrabee will be optimized for spatial locality; it has to be. Imagine 16 cores (and in the future, more), all issuing 16-way gathering loads with poor coherence (i.e. missing the local D$); the lower levels of the memory subsystem (L2 cache, memory) would be overwhelmed.
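    A scalar model of that "parallel insert" loop (hypothetical names; a 512-bit block is taken to be 16 floats): each pass services every lane whose index falls in the current block, so the number of block accesses ranges from 1 (all lanes in one block) to 16 (every lane in a different block):

```c
#define LANES 16
#define BLOCK_FLOATS 16   /* 512 bits / 32-bit floats */

/* Gather out[lane] = mem[idx[lane]] for all 16 lanes, touching each
 * distinct 512-bit block only once. Returns the pass count (1..16). */
int gather_by_block(const float *mem, const int idx[LANES], float out[LANES])
{
    int done[LANES] = {0};
    int passes = 0;
    for (int lane = 0; lane < LANES; lane++) {
        if (done[lane])
            continue;
        int block = idx[lane] / BLOCK_FLOATS;   /* block this lane needs */
        /* One wide load of that block; insert every lane it satisfies. */
        for (int j = lane; j < LANES; j++) {
            if (!done[j] && idx[j] / BLOCK_FLOATS == block) {
                out[j] = mem[idx[j]];
                done[j] = 1;
            }
        }
        passes++;                               /* one cache access per distinct block */
    }
    return passes;
}
```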

    Cheers
     
  11. Gubbi

    Veteran

    Joined:
    Feb 8, 2002
    Messages:
    3,661
    Likes Received:
    1,114
    Well, the first incarnation looks to be an experimental rasterizer/vector-crunching unit (for physics or whatever), so you're right.

    I'm just wondering what Intel's motivation is for going with x86 for Larrabee; it's not exactly a natural fit for some of the tasks required for graphics. I think long term it is an attempt to consolidate all major ICs into one: general-purpose CPU, physics, graphics... everything.

    Cheers
     
  12. nAo

    nAo Nutella Nutellae
    Veteran

    Joined:
    Feb 6, 2002
    Messages:
    4,400
    Likes Received:
    440
    Location:
    San Francisco
    Or perhaps the software running on it should take care of spatial locality :)
    If it is going to be super-flexible, I don't see how the hardware per se can enforce such data locality, certainly not in the same way GPUs do it.
     
  13. Gubbi

    Veteran

    Joined:
    Feb 8, 2002
    Messages:
    3,661
    Likes Received:
    1,114
    Well, I wasn't suggesting that random accesses with low coherence would be illegal, just that it would be much more expensive.

    Cheers
     
  14. nAo

    nAo Nutella Nutellae
    Veteran

    Joined:
    Feb 6, 2002
    Messages:
    4,400
    Likes Received:
    440
    Location:
    San Francisco
    Yes, I understand that. I just wanted to underline the fact that if Larrabee is really as flexible as it seems, there's not much the CPU can do to preserve locality.
     
  15. Gubbi

    Veteran

    Joined:
    Feb 8, 2002
    Messages:
    3,661
    Likes Received:
    1,114
    Alright, sorry, I misunderstood :). You're absolutely right. Lack of locality just costs; fundamental physical laws are to blame here, and Larrabee won't be a silver bullet.

    Cheers
     
  16. Nick

    Veteran

    Joined:
    Jan 7, 2003
    Messages:
    1,881
    Likes Received:
    17
    Location:
    Montreal, Quebec
    I'm not sure this is true for streaming accesses. Say you have an array of structures, and you process it sequentially. You could convert it to a structure of arrays, at a cost, but you're still using the same bandwidth (assuming every element is used). A gather instruction would cause the same number of misses. It doesn't lower locality. And this case is very important for automatic vectorization, as I've mentioned before.

    But I feel like I'm beating a dead horse here...
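    The AoS-to-SoA conversion Nick describes can be illustrated with a small sketch (`VertexAoS` and the function name are hypothetical): a full sequential pass reads the same total bytes in either layout, so the conversion costs instructions and an extra copy, not steady-state bandwidth:

```c
/* Array-of-structures layout: x, y, z interleaved per vertex. */
struct VertexAoS { float x, y, z; };

/* Convert to structure-of-arrays: one contiguous array per field,
 * ready for aligned 16-wide SIMD loads. Touches every field once,
 * just as sequential AoS processing would. */
void aos_to_soa(const struct VertexAoS *in, int n,
                float *xs, float *ys, float *zs)
{
    for (int i = 0; i < n; i++) {
        xs[i] = in[i].x;
        ys[i] = in[i].y;
        zs[i] = in[i].z;
    }
}
```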
     
  17. 3dilettante

    Legend Alpha

    Joined:
    Sep 15, 2003
    Messages:
    8,579
    Likes Received:
    4,799
    Location:
    Well within 3d
    A purely streaming case with poor spatial and temporal locality would be an argument for a dedicated space to load and then combine gather data before injecting it into the cache hierarchy.
    The least saturating case would be to put little gather units in the dedicated logic, so that streaming gathers can be coalesced prior to thrashing the caches or gumming up the interconnect.
     
  18. ArchitectureProfessor

    Newcomer

    Joined:
    Jan 17, 2008
    Messages:
    211
    Likes Received:
    0
    I wanted to jump in and clarify something here. There is a difference between a "bank" and a "port" to a SRAM array. A true multi-port structure with N ports implies that you can access any N different locations at the same time. This requires lots of wires going to each SRAM cell, making the SRAM grow quadratically with the number of ports (once the array is wire-dominated).

    With a multi-bank SRAM array, you're just chopping up the existing array into N parts. Each SRAM cell only has a bit-line and word-line to it. This keeps the SRAM dense. The down side is that you're only able to make N accesses to the array if you don't have any bank conflicts. That is, in a four-bank structure, you could access address 0, 1, 2, and 3 in parallel (as they would map to banks 0, 1, 2, and 3). But if you tried to access addresses 0, 4, 8, 12, they would all map to bank 0, so the bank conflicts would force the hardware to do this with repeated accesses to the same bank.
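    The bank-conflict arithmetic above can be sketched as follows (bank = address mod 4, matching the 0/4/8/12 example; the function name is hypothetical). The number of serialized passes equals the deepest per-bank request queue:

```c
#define NBANKS 4

/* Requests to different banks proceed in parallel; requests to the
 * same bank must replay, one per pass. Returns the pass count. */
int passes_needed(const int *addr, int n)
{
    int per_bank[NBANKS] = {0};
    for (int i = 0; i < n; i++)
        per_bank[addr[i] % NBANKS]++;       /* bank = address mod NBANKS */
    int worst = 0;
    for (int b = 0; b < NBANKS; b++)        /* deepest queue wins */
        if (per_bank[b] > worst)
            worst = per_bank[b];
    return worst;
}
```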

    Of course, if you're multi-banking a cache, you'd also need to bank the cache tag array (not that bad) and the tag compare logic (probably ok, too).

    Finally, you can combine banking and porting to get the bandwidth you need (that is, you could have an SRAM with two "true" ports and two banks, or whatever).

    This is actually a good thing, in some cases. If you're getting 16 misses in parallel, it will be much faster to grab them all in parallel rather than fetch them one at a time.

    In fact, it wouldn't surprise me if gather/scatter was one cycle per element, but it would prefetch them all at the same time. It would give you the MLP benefits of scatter/gather without complicating the pipeline.
     
  19. ArchitectureProfessor

    Newcomer

    Joined:
    Jan 17, 2008
    Messages:
    211
    Likes Received:
    0
    So, if Intel did the same approach, they could bank the L1 cache and then have gather loads take anywhere from a single wide access (if there are no bank conflicts) to 16 accesses (in the worst case of all accesses falling into the same bank). That would be fastest, but the hardware complexity would be fairly high.

    Here is my wild guess: they will add scatter/gather to the ISA. But the first implementation will just repeatedly access the cache. A future implementation would then be free to make that even faster.

    This is somewhat analogous to how SSE instructions used to be handled 64 bits at a time (halving the throughput), whereas now they are handled as a full 128-bit operation.
     
  20. ArchitectureProfessor

    Newcomer

    Joined:
    Jan 17, 2008
    Messages:
    211
    Likes Received:
    0
    The XeCPU in the XBOX 360 has some special vector operations for decompressing D3D compression formats. From the XBox 360 System Architecture in IEEE Micro 2006:

    "The VMX128 implementation includes an added dot product instruction, common in graphics applications. Another addition we made to the VMX128 was direct 3D (D3D) compressed data formats, the same formats supported by the GPU. This allows graphics data to be generated in the CPU and then compressed before being stored in the L2 or memory. Typical use of the compressed formats allows an approximate 50 percent savings in required bandwidth and memory footprint."

    Would such instructions help Larrabee?

    Would vector "min" and "max" instructions help here? For a 16-element vector, it could use some sort of tree reduction (16->8, 8->4, 4->2, 2->1). To get this down to a 4-cycle latency, it could use a tree with a branching factor of four (16->4, 4->1) by using 6 parallel subtracts, allowing the FP subtraction in each level of the tree to take two cycles. There is probably an even smarter way of taking a min of 16 elements.
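    A scalar sketch of the branching-factor-four reduction for min (16 -> 4 -> 1, two tree levels; function names are hypothetical, and a hardware version would of course do each level's four-way mins in parallel):

```c
/* Min of four values: one node of the branching-factor-four tree. */
float min4(float a, float b, float c, float d)
{
    float m0 = a < b ? a : b;
    float m1 = c < d ? c : d;
    return m0 < m1 ? m0 : m1;
}

/* Horizontal min of a 16-element vector in two tree levels. */
float hmin16(const float v[16])
{
    float level[4];
    for (int i = 0; i < 4; i++)                     /* level 1: 16 -> 4 */
        level[i] = min4(v[4*i], v[4*i+1], v[4*i+2], v[4*i+3]);
    return min4(level[0], level[1], level[2], level[3]);  /* level 2: 4 -> 1 */
}
```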

    What if Larrabee didn't have a texture unit? What would it take to do those operations on the main cores?

    Would it be possible to schedule the pixel shaders to run on each pixel as it is generated, immediately following the vertex shader? If so, and if the pixel shader tasks were routed to the same core the vertex shader ran on, the data would likely still be in the local L1/L2 cache hierarchy.

    If Larrabee does use task-level parallelism, it should be possible for a task to generate additional tasks. These tasks could be put on the local work queue (even at the front of the local work queue), and thus could be handled immediately once the task was completed.

    Ok, perhaps this isn't anything like the way the GPU pipeline currently works, so please correct me if I'm way off base here. In general, it isn't clear to me if all the vertex shading is done first for the entire frame and then the pixel shaders run, or if they are interleaved. Either way, where is the intermediate data stored, and how? Is it in the form of "vertex" data, "triangles", or "pixels", anyway?

    One of the reasons I'm asking is that it seems the order you do these things in could have a huge impact on locality and various z-culling sorts of optimizations.
     