Larrabee's Rasterisation Focus Confirmed

Discussion in 'Rendering Technology and APIs' started by B3D News, Apr 24, 2008.

  1. Jawed

    Jawed Legend

    Sorry I meant pipelined in the sense of being a single instruction rather than being calculated by a macro.

    Blimey! Afterwards I realised that a polynomial's terms are heavily serially dependent just because of the successive powers so it prolly doesn't split across many lanes too well. Also with double-precision computation there's a halving in effective lane count and for single-precision the computation is prolly so quick (very few terms) that it's prolly not worth the effort.
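    To make the serial dependence concrete, here is a minimal Python sketch comparing Horner's rule (one long dependent chain) with Estrin's scheme (independent pairs at each level). The function names and coefficients are illustrative assumptions, not anything known about Larrabee.

```python
def horner(coeffs, x):
    """Evaluate c0 + c1*x + ... + cn*x^n (coeffs non-empty, lowest power first).

    Every multiply-add needs the previous accumulator, so the n steps
    form a single dependent chain.
    """
    acc = 0.0
    for c in reversed(coeffs):
        acc = acc * x + c
    return acc


def estrin(coeffs, x):
    """Evaluate the same polynomial with Estrin's scheme.

    Within one level every pair is independent of the others, so the
    pairs could in principle go to separate lanes; only the levels
    (about log2 of the term count) are serial.
    """
    terms = list(coeffs)
    power = x
    while len(terms) > 1:
        if len(terms) % 2:
            terms.append(0.0)  # pad so every term has a partner
        terms = [terms[i] + terms[i + 1] * power
                 for i in range(0, len(terms), 2)]
        power = power * power  # x, x^2, x^4, ...
    return terms[0]


if __name__ == "__main__":
    coeffs = [1.0, 0.5, 0.25, 0.125, 0.0625]  # arbitrary example values
    print(horner(coeffs, 0.3), estrin(coeffs, 0.3))
```

    With only a handful of terms, as in a single-precision approximation, the Estrin version saves just a step or two, which fits the suspicion that it's not worth the effort.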

    By permute I presume you mean swizzle, though I suppose what you're getting at is swizzling across the entire 16 lanes, not just within groups of 4 as GPUs do.

    There might be an opportunity to use the 16 lanes to produce 4 transcendental results in fewer clocks than if they were produced separately in parallel on the four (x,y,z,w) sets of lanes. Apart from the approximation step, the reduction and reconstruction steps provide further opportunities to enhance utilisation on a wide SIMD. In effect, this uses the 16-lane width of the SIMD to overlap computations for a set of 4 results.

    So you're saying generate 16 transcendentals in parallel? I dare say its simplicity is compelling and it could well coincide with the number of objects in a batch. The issue with running so many in parallel is the 16-way duplication of the lookup tables - though I guess they're all quite small.
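    As a rough illustration of the reduction, lookup, approximation and reconstruction steps applied to 16 values at once, here is a Python sketch of a base-2 exponential. The table size (8 entries), the polynomial degree and all names are assumptions picked for the example; nothing here reflects how Larrabee actually does it.

```python
import math

LANES = 16                      # assumed SIMD width, as discussed in the thread
TABLE_BITS = 3                  # hypothetical: an 8-entry table
TABLE_SIZE = 1 << TABLE_BITS
# TABLE[j] = 2^(j/8). In software this is one shared list; hardware doing
# 16 lookups per clock would need 16 read ports or 16 copies of it.
TABLE = [2.0 ** (j / TABLE_SIZE) for j in range(TABLE_SIZE)]
LN2 = math.log(2.0)


def exp2_lane(x):
    """Approximate 2**x for one lane."""
    # Reduction: x = n + f, with integer n and f in [0, 1).
    n = math.floor(x)
    f = x - n
    # Lookup: the top bits of f pick a table entry, r is what is left over.
    j = min(int(f * TABLE_SIZE), TABLE_SIZE - 1)
    r = f - j / TABLE_SIZE
    # Approximation: 2^r = e^(r*ln2), degree-4 Taylor polynomial (few terms).
    t = r * LN2
    p = 1.0 + t * (1.0 + t * (0.5 + t * (1.0 / 6.0 + t * (1.0 / 24.0))))
    # Reconstruction: scale by 2^n, i.e. an exponent adjustment.
    return math.ldexp(TABLE[j] * p, n)


def exp2_simd(xs):
    """Apply the same steps to all 16 lanes (modelled as a plain list)."""
    assert len(xs) == LANES
    return [exp2_lane(x) for x in xs]


if __name__ == "__main__":
    xs = [i * 0.37 - 3.0 for i in range(LANES)]
    for x, y in zip(xs, exp2_simd(xs)):
        print(f"{x:+.2f} {y:.7f} {2.0 ** x:.7f}")
```

    Every lane runs the identical instruction sequence, which is the appeal of the 16-wide approach; the only per-lane divergence is the table index.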

    I've not heard of L0 before...

    I presume you're referring to the cache line width of current SSE implementations. We don't know this for Larrabee do we?

    I dare say sequential registers would only apply for simpler programs.

    I'm thinking that it increases the porting complexity because both the SIMD and the gather/scatter units are fetching and storing concurrently - although in an alternating pattern. I suppose doubling the banking would be the easiest solution.

    So we get back to the question of port widths...

    With the SIMD being pipelined and with any one instruction only able to consume, at most, 3 operands, thread B instructions can start issuing before thread B's register set has been fully populated. Meanwhile thread A's register set can start being written out before A has finished. OK, I know, it's hairy :razz:

    :lol: Thinking that we could be waiting 2 years to find out is, ahem, maybe not so funny.

    This is bloody tantalising:

    http://ieeexplore.ieee.org/Xplore/l...9/04343860.pdf?tp=&isnumber=&arnumber=4343860

    But it's hidden and I can't find anything else on the topic :sad:

    Jawed
     
  2. 3dilettante

    3dilettante Legend Alpha

    I went back to check what I wrote, and it's more like the AltiVec permute instructions, which would allow the unit to pick out the elements from the previous iteration's source and result registers and build the needed operands for the next step.
    The combination would have to span quad lanes and also be gathered from different registers.
    SSE in other x86s isn't quite as flexible in this regard, but with extra steps the same end result can be created.
    Either that, or a specialized transcendental unit can quickly pick out the needed elements, since where a given value is produced and where it must be copied would be static and could be hardwired.

    The entries are pretty small, and a lookup table would probably be a pretty large part of the hardware in a transcendental unit.
    If a storage location the size of the L1 is 1 cycle in latency, it is possible that readying the lookup wouldn't be worse.
    It might potentially add issue latency for transcendental instructions, but how many back-to-back issues would be needed?

    It's been mentioned before, though it seems perpetually "in the future".
    I think some fanciful accounts of what would have come after AMD's K8 included mention of it.

    The leaked Larrabee slide said the line width was 64B, just like Gesher/Sandy Bridge.

    We'd have enough time to speculatively design Larrabee several times over.


    Hmm, the architecture is outlined elsewhere as a 48-ALU design divided into 8 clusters.
    Each cluster contains 3 adders and two multipliers.
    That's 16 add+mul pairs, equivalent to 16 FMAC lanes.
     
  3. MfA

    MfA Legend

    The LSU is essentially already a L0 cache in present CPUs.
     
  4. Scali

    Scali Regular

    We shouldn't have to wait *that* long... Intel plans to have engineering samples out in the second half of 2008. I hope they'll have some working software on it as well, and perhaps be willing to release more info on the architecture and how they employ it for rendering.
     
  5. 3dilettante

    3dilettante Legend Alpha

    It's not, because the entries only persist until the memory operations waiting in the buffers are retired.

    If AMD thought the same way, there would have been no point in mentioning the L0 because they've had LSUs for over a decade.
     
  6. 3dilettante

    3dilettante Legend Alpha

    Just to update one of the Larrabee threads:

    http://babelfish.yahoo.com/translat....de/ct/08/15/022/&lp=de_en&btnTrUrl=Translate

    Unless Gelsinger's just making stuff up, a 32-core 45nm Larrabee at 2 GHz could be expected to produce 2 SP TFLOPs.

    This appears to support the speculation that Larrabee's SP/DP ratio is 2:1, since Intel already proposed (although this data is much older) that Larrabee could do 1 TFLOP DP with 24 cores at 2.5 GHz.
    The DP ratio is a far sight better than current GPUs.
    The SP peak is more problematic a comparison, as there is a range of 1 to 1.5 process nodes that GPUs will transition through in the meantime.
    It seems clear that by late 2009/early 2010, peak numbers will be even less comparable, when we'll be comparing GPUs against Larrabee running some kind of partial emulation.
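
    The peak figures above can be sanity-checked with some quick arithmetic. This sketch assumes 16 single-precision lanes per core (8 at double precision, per the halving mentioned earlier in the thread) and counts a multiply-add as 2 FLOPs; both are assumptions from the discussion, not confirmed specs.

```python
def peak_tflops(cores, clock_ghz, lanes, flops_per_lane_per_clock=2):
    """Peak throughput in TFLOPS.

    flops_per_lane_per_clock=2 assumes one multiply-add per lane per clock.
    """
    return cores * clock_ghz * lanes * flops_per_lane_per_clock / 1000.0


# Gelsinger's figure: 32 cores at 2 GHz, 16 SP lanes assumed.
sp = peak_tflops(cores=32, clock_ghz=2.0, lanes=16)

# Intel's older figure: 24 cores at 2.5 GHz, 8 DP lanes assumed.
dp = peak_tflops(cores=24, clock_ghz=2.5, lanes=8)

if __name__ == "__main__":
    print(f"SP peak: {sp:.3f} TFLOPS")
    print(f"DP peak: {dp:.3f} TFLOPS")
```

    Under those assumptions the SP case comes out just over 2 TFLOPS and the DP case just under 1 TFLOPS, so both quoted numbers are consistent with a 2:1 ratio on a 16-lane unit.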

    If other speculation that each core is roughly 10mm2 is true, we could also suppose that at a bare minimum, Larrabee will be at least 320mm2.

    Wattage numbers will prove interesting, I think.
    Power improvements on TSMC's processes are not expected to be that large, while Larrabee is rumored to have a worst-case draw of 300W on Intel's 45nm process.
    GPUs may increase power draw when their FLOP counts reach that high, though just looking at the 4850's peak FLOPs/Watt numbers would put it in a better light compared to Larrabee.

    The integer pipeline will harken back to the general outlines of the original Pentium, though the FP half of things would be radically expanded.
     
    Last edited by a moderator: Jul 8, 2008
  7. nAo

    nAo Nutella Nutellae Veteran

    While a 2 GHz clock makes perfect sense to me, it might end up being a conservative figure.
    BTW, does your 10mm2 estimate take into account TMUs and L2 as well?
     
  8. 3dilettante

    3dilettante Legend Alpha

    Possibly.
    Intel's older slides had a 2.5 GHz ceiling.
     
  9. 3dilettante

    3dilettante Legend Alpha

    Going by the old Intel slides B3D showed a while back, no.
    This is each core and its corresponding L1s.

    I'm still not sure if the corresponding sector of L2 per core will count towards 10mm2 or not.

    I've excluded any special-purpose hardware, memory controllers, and all the fun bits of the uncore that are so important for multi-core designs.

    My bare minimum estimate is one that I expect to be exceeded by a good amount. If the core is only the core and L1s, I'd expect it to be exceeded by a very significant amount.
     
  10. nAo

    nAo Nutella Nutellae Veteran

    Then I guess the 10mm2 figure must be quite old and based on a different process, given that L2, TMUs, memory controller, etc. might account for 1.5-2 times the cores' area or even more.
     
  11. 3dilettante

    3dilettante Legend Alpha

    Intel never gave an estimate of anything beyond the core size. In my interpretation, it never included anything but the execution core + L1s in the 10mm2 estimate.

    Without knowing the number of other units and their relative areas, I couldn't estimate more than 32 times the estimated core area. Obviously, the non-core elements have an area >>0.
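
    For reference, the floor and nAo's guess can be put side by side with a few lines of arithmetic. The 10mm2 per core and the 1.5-2x non-core ratio are the speculative figures from this thread, and the helper name is made up for the example.

```python
def die_area_mm2(cores, core_mm2, uncore_ratio=0.0):
    """Total area = core area plus non-core area.

    uncore_ratio is the non-core area expressed as a multiple of the
    total core area (0.0 gives the bare-minimum, cores-only floor).
    """
    cores_area = cores * core_mm2
    return cores_area * (1.0 + uncore_ratio)


floor = die_area_mm2(32, 10.0)                    # cores + L1s only
low = die_area_mm2(32, 10.0, uncore_ratio=1.5)    # nAo's lower guess
high = die_area_mm2(32, 10.0, uncore_ratio=2.0)   # nAo's upper guess

if __name__ == "__main__":
    print(floor, low, high)
```

    That gives 320mm2 for the cores alone and 800-960mm2 if the non-core parts really were 1.5-2 times the core area, which is why the 10mm2 figure looks like it belongs to a different process.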
     
  12. corysama

    corysama Newcomer

  13. Simon F

    Simon F Tea maker Moderator Veteran

  14. Jawed

    Jawed Legend

    Plebs like me can't read the document: pay to view.

    Jawed
     
  15. Jawed

    Jawed Legend

    With this and the Imagine paper kindly forwarded to me, hopefully there are some clues on what Intel will be doing for transcendentals on Larrabee.

    Presumably there'll be a split between single-precision high-speed versions for graphics and something more refined for double precision.

    I'll look at these later. Thanks all.

    Jawed
     