Larrabee at Siggraph

Discussion in 'Architecture and Products' started by nAo, Jun 2, 2008.

  1. nAo

    nAo Nutella Nutellae
    Veteran

    Joined:
    Feb 6, 2002
    Messages:
    4,400
    Likes Received:
    440
    Location:
    San Francisco
    I don't want to sound like a broken record, but has anyone complaining about Intel not thinking about programming on many-core architectures bothered to read this material?

    Ct: C for Throughput Computing

    Ct: A Flexible Parallel Programming Model for Tera-scale Architectures

    While it's true that until they ship it it's just vaporware, it looks very promising to me.
    They are clearly attacking the problem, and on top of that we (as developers) will probably have support for DX11's compute shaders and OpenCL.
     
    #441 nAo, Aug 14, 2008
    Last edited: Aug 14, 2008
  2. 3dilettante

    Legend Alpha

    Joined:
    Sep 15, 2003
    Messages:
    8,579
    Likes Received:
    4,799
    Location:
    Well within 3d
    In memory, yes. Not knowing the parameters of the cache and just how extensive Larrabee's cache control instructions are, we can't assume that it's necessarily in cache. It's likely that it's usually there, so long as other accesses don't start scatter-gathering across the address space.

    The paper says a tile is sized so that it can be fully loaded into cache, but does that mean the full load of fiber data and emulation code is part of that total?

    Hard to say without implementation information. If we want intermediate values to persist for a fiber across a switch, some time must be taken to save its state, and the new state must be read in.
    Depending on the implementation, and on other factors that should become clearer as more information trickles out, it could incur a higher cost than using a register that is in a known place and is guaranteed to be there.
     
  3. TimothyFarrar

    Regular

    Joined:
    Nov 7, 2007
    Messages:
    427
    Likes Received:
    0
    Location:
    Santa Clara, CA
    I take it we are still missing any info on number of vector registers, ALU latency, and max number of outstanding memory reads. So we really don't know what types of latency Larrabee can hide yet.

    I'm also going to have to agree with Marco here: I bet that nearly all game developers (i.e. the majority of the market for the device) doing future gaming GPGPU on Larrabee are going to use DX11 compute shaders or OpenCL (assuming it ends up a finished, established standard), unless Larrabee gets into a console, and then all bets are off.

    DX11 compute looks to be CUDA rewrapped. So all of the fine-grained sync advantages of Larrabee end up being for driver writers only, the bandwidth "advantage" of the tiled renderer goes away (it's not applicable to compute shaders / GPGPU), and Larrabee is left trading a possible inability, compared to GPUs, to hide latency against the lower latency it gets from its caches.

    Also, developers are going to have to design algorithms around block sizes and shared-register thread-group sizes that perform well on all hardware in order to see performance from compute shaders on every platform (or adapt the algorithm dynamically based on the platform).

    So I think the reality of the situation is that we are relying on great drivers...
     
  4. rpg.314

    Veteran

    Joined:
    Jul 21, 2008
    Messages:
    4,298
    Likes Received:
    0
    Location:
    /
    Sorry for a n00b ques, but what is DB incoherency?

    They said they have dedicated texture units. Isn't interpolation part of that? But yes, they might not have an interpolation instruction as part of the vector ISA.

    Absolutely. It'll make a lot of HPC folks happy, but games won't need it. But I guess ATI will come out ahead of Larrabee in DP speed, because their absolute SP speed will probably be much higher even though their SP/DP ratio is lower.
     
  5. 3dilettante

    Legend Alpha

    Joined:
    Sep 15, 2003
    Messages:
    8,579
    Likes Received:
    4,799
    Location:
    Well within 3d
    Dynamic branching, where pixels within a batch/fiber take different paths.

    The Larrabee paper stated that it's done in software.

    Intel's targeting something like 1 TFLOP DP.
    You expect AMD's DP rate to quadruple?
    Even then, without changing things, AMD's DP flops would be in some ways inferior to Intel's IEEE compliant math.
     
  6. heliosphere

    Newcomer

    Joined:
    Jun 15, 2005
    Messages:
    142
    Likes Received:
    15
    It seems like Intel has some conflicting incentives here. On the one hand they're making a big deal of the fact that Larrabee is x86 and offers the chance of porting existing code; on the other they're introducing a new wide-SIMD instruction set that would benefit from new programming models like Ct (which I haven't actually read much about yet) or, ironically, CUDA. The more they abstract away the underlying architecture with a new programming model, though, the less their advantage from supporting x86.

    It seems like NVIDIA stands to do well if any of CUDA, OpenCL, DirectX 11 compute shaders or any other portable abstraction layer ultimately becomes the most adopted technology. Intel's big advantage comes from persuading people *not* to move to a new programming model but to take existing C++ codebases targeting x86 and adapt them to Larrabee.
     
  7. Jawed

    Legend

    Joined:
    Oct 2, 2004
    Messages:
    11,716
    Likes Received:
    2,137
    Location:
    London
    Agreed: in memory, and with a following wind prolly in cache. What other accesses will be mucking things up though? The four threads in a core shouldn't be stomping over each other's cache lines.

    In general I guess that would be normal. The paper says that for pixel shading tile size hardly makes any difference in performance (sounding eerily similar to the "Xenos performance is fine if you use tiled rendering" argument - we know that's only mostly true...).

    With everything statically compiled, the shader's ALU:TEX is known and therefore the worst-case TEX latency that needs hiding. That decides fibre count. If the shader also does constant buffer fetches, the compiler can work out how many of those are in-flight at any given point in a shader and add that into the cache allocation. If the compiler decides to deploy re-packing for DB then that's more cache to allocate. Whatever's left is for the tile.
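
    To put toy numbers on that reasoning (everything here is an invented illustration; the paper gives no such figures), the fibre count falls out of the ALU:TEX ratio and the latency that needs hiding:

```python
import math

# Toy model: while one fibre waits on its TEX, the other fibres' ALU
# work must cover the latency. All numbers are invented illustrations,
# not Larrabee specifics.
def fibres_needed(alu_ops_per_tex, cycles_per_alu_op, tex_latency):
    alu_work = alu_ops_per_tex * cycles_per_alu_op  # cycles of cover per fibre
    return 1 + math.ceil(tex_latency / alu_work)

# e.g. 10 ALU ops per TEX at 1 cycle each, 200 cycles of texture latency
print(fibres_needed(10, 1, 200))   # -> 21 fibres
print(fibres_needed(100, 1, 200))  # -> 3 fibres for a fatter ALU:TEX shader
```

    A shader with a fatter ALU:TEX ratio needs fewer fibres, which is exactly the lever the static compiler has to play with here.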

    Naturally there's a tipping point. All the GPUs have tipping points.
    It doesn't make sense for the register file to be so small that every fibre switch incurs a flush of the outgoing fibre's registers to memory.


    Registers should only be written to memory because:
    1. the entire population of registers wasn't initially allocated for the shader - D3D10 requires the GPU to support 4096 vec4 fp32 registers per element. So GPUs are expected to page registers over the lifetime of a shader when subjected to this kind of duress
    2. the code specifies the write, e.g. an indexed write - compute shaders in D3D11 should allow this I guess
    A flush shouldn't happen as part of normal scheduling around latency.

    L1 read bandwidth (and presumably write bandwidth) is only 1 operand (32 bit I guess) per element per clock. MAD rate would effectively drop to 1/3rd (1 cycle for the 1st MAD, then 2 cycles to write 2 operands to memory + 3 cycles to read them back = 6 cycles) if two MADs were separated by a TEX :shock:
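
    A quick sanity check on that arithmetic, using only the figures stated above (the 1 operand per element per clock is the paper's L1 figure; the rest follows from it):

```python
# Two MADs separated by a TEX, with operands spilled to memory between
# them, under 1-operand-per-element-per-clock L1 bandwidth.
first_mad = 1  # cycle for the 1st MAD
spill     = 2  # cycles to write its 2 live operands to memory
refill    = 3  # cycles to read 3 operands back for the 2nd MAD
cycles_for_two_mads = first_mad + spill + refill
print(cycles_for_two_mads)      # 6 cycles for 2 MADs
print(2 / cycles_for_two_mads)  # i.e. 1/3 of the 1-MAD/clock peak
```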

    If someone designed a GPU to be this slow on its most fundamental latency-hiding operation, they'd be shot I reckon.

    ---

    I suspect Larrabee has the smallest register file of all these architectures, though. Because attribute interpolation will be "slow" on Larrabee I expect there will need to be relatively few fibres to hide TEX latency, as these instructions will naturally increase the shader's ALU:TEX.

    Predicate evaluation latency shouldn't be huge, either, unless it causes a jump to a page of code that isn't in instruction cache. Other GPUs have to contend with this problem too.

    Jawed
     
  8. Mintmaster

    Veteran

    Joined:
    Mar 31, 2002
    Messages:
    3,897
    Likes Received:
    87
    I understand, but that housekeeping is a lot of work for the x86 part. armchair_architect is echoing my beliefs above.

    Did you miss the part where each qquad's shader is a fibre? Larrabee doesn't have single-cycle instruction latency, so it will be switching qquads every clock. We see similar things from ATI (a different 64 pixels every 4 clocks) and Nvidia.

    At the very least we will see smaller cycles of 8 qquads switching every clock, with maybe new ones swapping in (due to texture fetches) at lower frequency, but there is no mention of this in the paper. Static SW scheduling can do this quite well when there's no DB and texture latency is consistent and predictable, but otherwise scheduling starts benefiting drastically from dedicated logic.

    BTW, what exactly are you envisioning when you say "repacking"?
     
  9. Barbarian

    Regular

    Joined:
    Jun 27, 2005
    Messages:
    289
    Likes Received:
    15
    Location:
    California, USA
    Lots of presentations from today's "Beyond Programmable Shaders" at Siggraph:
    http://s08.idav.ucdavis.edu/

    There are many interesting details about all the major architectures from NVidia, AMD/ATI and Intel.
     
  10. psurge

    Regular

    Joined:
    Feb 6, 2002
    Messages:
    955
    Likes Received:
    52
    Location:
    LA, California
    I agree it looks really nice, but based on the limited details I've read about Ct and the limited stuff available to non-customers on RapidMind, Ct seems extremely similar - both have support for deterministic parallelism, nested data parallelism, and are embedded in C++. Am I missing some reason why Ct is a big advance over what is currently available?
     
  11. Jawed

    Legend

    Joined:
    Oct 2, 2004
    Messages:
    11,716
    Likes Received:
    2,137
    Location:
    London
    2-10 fibres required to hide TEX latency :grin:

    Jawed
     
  12. glw

    glw
    Newcomer

    Joined:
    Aug 29, 2003
    Messages:
    64
    Likes Received:
    0
    Intel's biggest advantage is that if some clever developers come up with a brilliant idea that makes writing parallel programs much easier, those clever developers don't have to layer their brilliant idea on top of everyone else's 'cruft'.

    CUDA or CAL, the Windows-only DirectX, and whatever OpenCL turns into will all 'get in the way'.

    Exposing x86 will assist library and toolkit programmers, even if application developers avoid it.
     
  13. Jawed

    Legend

    Joined:
    Oct 2, 2004
    Messages:
    11,716
    Likes Received:
    2,137
    Location:
    London
    What else is the scalar pipeline doing? Also I suspect there's a lot less work than first imagined.

    Fibres are an entirely software-based construct.

    Let's say we've got a shader, which only uses 1 register:

    Code:
    texld r0, v0, s0
    mad r0, r0, r0, r0
    this translates into LRB native as:

    Code:
    intrp r0, v0 [macro]
    texld r0, r0, s0
    mad r0.x, r0, r0, r0
    mad r0.y, r0, r0, r0
    mad r0.z, r0, r0, r0
    mad r0.w, r0, r0, r0
    I'm too lazy to work out what the interpolation macro comes out as. Let's say it's 10 instructions per dimension, so 20 for a 2D lookup.

    Now let's say that this is compiled for 4 fibres and we're looking at just the VPU code:

    Code:
    intrp r0 [f0r0], v0 [f0v0]
    [texld ...]
    intrp r1 [f1r0], v1 [f1v0]
    [texld ...]
    mad r0.x [f0r0], r0, r0, r0
    mad r0.y [f0r0], r0, r0, r0
    mad r0.z [f0r0], r0, r0, r0
    mad r0.w [f0r0], r0, r0, r0
    intrp r2 [f2r0], v2 [f2v0]
    [texld ...]
    mad r1.x [f1r0], r1, r1, r1
    mad r1.y [f1r0], r1, r1, r1
    mad r1.z [f1r0], r1, r1, r1
    mad r1.w [f1r0], r1, r1, r1
    intrp r3 [f3r0], v3 [f3v0]
    [texld ...]
    mad r2.x [f2r0], r2, r2, r2
    mad r2.y [f2r0], r2, r2, r2
    mad r2.z [f2r0], r2, r2, r2
    mad r2.w [f2r0], r2, r2, r2
    mad r3.x [f3r0], r3, r3, r3
    mad r3.y [f3r0], r3, r3, r3
    mad r3.z [f3r0], r3, r3, r3
    mad r3.w [f3r0], r3, r3, r3
    
    Notice that this is statically compiled, circular fibre-switching. Also notice that my example shader has a disastrous ALU:TEX ratio which will stall on all current GPUs :razz:

    Obviously, we don't know the length of the VPU pipeline, nor what the register read after write latency is. And whether Intel has implemented an in-pipe memory to function like ATI's "previous register", obviating simple consecutive-instruction RAW hazards.

    Let's say you have the following predicates for 4 fibres, each of 16 elements:

    1101010101110010
    0100100101001011
    1001011100101010
    1000101110001101

    you'd re-pack the elements to get:

    1111111111111111
    1111111111111111
    0000000000000000
    0000000000000000

    (by pure luck I got 32 of each :lol: )

    Obviously, whether it's worth doing is a different question... If each clause is 1000 cycles then yes, you'd prolly want to halve execution time and the re-packing cost would be easily amortised.
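
    For what it's worth, here's a toy sketch of that re-packing (hedge: real hardware would presumably do this with vector gathers driven by the predicate masks; plain Python just makes the bookkeeping visible). Pulling the active elements to the front, while remembering where each one came from, yields exactly the packed masks above:

```python
def repack(predicates):
    # Flatten all (fibre, element) slots, pull the active ('1') ones to
    # the front (stable sort), and keep a mapping so the shuffle can be
    # scattered back once the branchy section completes.
    width = len(predicates[0])
    flat = [(bit, (f, e)) for f, row in enumerate(predicates)
                          for e, bit in enumerate(row)]
    flat.sort(key=lambda t: t[0], reverse=True)  # '1' slots first
    bits = ''.join(bit for bit, _ in flat)
    mapping = [src for _, src in flat]  # packed slot -> original slot
    rows = [bits[i:i + width] for i in range(0, len(bits), width)]
    return rows, mapping

preds = ['1101010101110010',
         '0100100101001011',
         '1001011100101010',
         '1000101110001101']
rows, mapping = repack(preds)
print(rows)  # two fully active fibres, two fully idle ones
```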

    Jawed
     
  14. Mintmaster

    Veteran

    Joined:
    Mar 31, 2002
    Messages:
    3,897
    Likes Received:
    87
    Yes it is, but it only works nicely when there isn't DB screwing around with each qquad's instruction sequence. Otherwise, if you want 16-pixel granularity, you need to find a new qquad to perform an instruction on every clock.

    That's impossible for software. As the scheduler is written in the paper, the branching granularity is no better than G70's. Moreover, while a GPU only needs to cover the average texture latency, a circular scheduler needs to cover the maximum latency.

    Okay, that's what I figured, but you are seriously underestimating the software power needed to do that. Not only are the combinatorics very ugly, but you need each fibre to be at the same branch location, which is either highly improbable or needs some seriously complicated scheduling to "realign" the IPs of different fibres. In terms of graphics, this repacking messes up a lot of things like sample masks, derivatives, texturing, etc. No longer is each fibre spatially confined to one 4x4 block.

    It'll only be useful in very specialized circumstances, and those won't look much like shaders at all.
     
  15. 3dilettante

    Legend Alpha

    Joined:
    Sep 15, 2003
    Messages:
    8,579
    Likes Received:
    4,799
    Location:
    Well within 3d
    A fully associative GPU cache or a set-associative cache?
    The P54 had a 2-way L1 Dcache.
    With 4 times the threads, each Larrabee core probably has a more associative cache.

    In simulations, bad things hardly ever make a difference in performance... ;)

    Perhaps the general rule for DB will be to grin and bear it.
    If pixels start migrating in sequence, overly aggressive packing will wreck locality, and that will start thrashing cache lines.
    If everything is done in sequence the compiler has a good chance of arranging it so cache lines can be cycled between the L1 and L2 in a mostly transparent fashion. Each pixel's memory can also be allocated in such a way to get around stride conflicts on a set-associative cache.

    Perhaps pixel 1's data won't have a cache conflict with pixel 32 because the compiler knows pixel 1 will be done with a given line because it will be unneeded by the time pixel 32's evicting line is loaded if the pixel is hit in-order. If pixel 1 is repacked adjacent to 32, the two pixels are going to bash each other's operands.

    Well, this is x86, though the VPU register count is up in the air.
    Fibers switch cooperatively within a single hardware thread, which means the register pool is the 8, 16, or possibly 32 registers each hardware thread gets.
    Something like 10 fibers on a thread with just 1 register per pixel would blow through an 8-register file, and would be a bit cramped in 16. 10 strands would allow one register operand per pixel as long as they only do 2-source operations.
    Spilling registers in the middle of a fiber would not be a good thing.
    Intel leans on the L1 a lot, and it is obviously a critical resource.

    edit: One register per qquad
     
    #455 3dilettante, Aug 15, 2008
    Last edited by a moderator: Aug 15, 2008
  16. Jawed

    Legend

    Joined:
    Oct 2, 2004
    Messages:
    11,716
    Likes Received:
    2,137
    Location:
    London
    When branching Larrabee pre-emptively switches hardware thread. So a new qquad (or other VPU work, e.g. the Setup hardware thread) switches in.

    So if a hardware thread contains 4 fibres a repeat until with some nasty dependent texturing inside could look like this:

    Code:
    LOOP:
    if f0
    ALU f0 // texture coordinate math
    TEX f0 // fetch
    if f1
    ALU f1
    TEX f1
    if f2
    ALU f2
    TEX f2
    if f3
    ALU f3
    TEX f3
    if f0
    ALU f0 // process result of fetch
    if f1
    ALU f1
    if f2
    ALU f2
    if f3
    ALU f3
    test to set predicate f0
    test to set predicate f1
    test to set predicate f2
    test to set predicate f3
    if not (f0 or f1 or f2 or f3) jmp EXIT
    jmp LOOP
    EXIT: // program continues after loop...
    ALU f0
    ALU f1
    ALU f2
    ALU f3
    
    Undoubtedly, as some fibres finish their loops, this hardware thread will be able to hide less and less of the TEX latency caused by the remaining fibres. Luckily there's 3 other hardware threads running on the core too. If one fibre runs for 1000 iterations more than any other then you've definitely got a problem :razz:

    If there was no TEX within that loop then the fibres would be completely independent and Larrabee would be seeing the full benefit of having 16-wide hardware threads. Other forms of latency (e.g. evaluating branch destination) are entirely hidden by hardware thread switching.

    So the moral of the story is don't put dependent TEX inside a variable duration loop. ATI will definitely run this kind of code far far better - but it has 4x the number of elements per hardware thread, so it is usually far behind on the more common cases of incoherent DB with no TEX.

    I remembered the formal name for this, Conditional Routing - been bugging me for over a day now:

    http://cva.stanford.edu/publications/2004/kapasi-thesis/

    By flushing state to memory and then using gather, driven by the predicates of the fibres, you can assemble state for all the fibres that share coherent branching. This is only required for the section of code inside the DB.

    Also, because Larrabee is circular-scheduling, the IPs are aligned, as I showed in the code snippet above.

    But you're not allowed to do that stuff inside DB - you have to perform the fetch before entering the DB.

    And anyway, after the DB has completed you can scatter state back to the original mapping of elements to fibres. It's reversible.
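
    As a toy illustration of that reversibility (plain Python standing in for predicate-driven gather/scatter; the function names are made up):

```python
# Gather the elements whose predicate is set, run the branch body on the
# packed data, then scatter the results back to their original slots.
def gather(state, pred):
    idx = [i for i, p in enumerate(pred) if p]
    return [state[i] for i in idx], idx

def scatter(state, packed, idx):
    out = list(state)
    for v, i in zip(packed, idx):
        out[i] = v
    return out

state = [10, 20, 30, 40, 50, 60]
pred  = [1, 0, 1, 1, 0, 0]
packed, idx = gather(state, pred)
packed = [v * 2 for v in packed]    # the DB clause, on packed elements only
restored = scatter(state, packed, idx)
print(restored)  # [20, 20, 60, 80, 50, 60] - untaken elements untouched
```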

    I'm not arguing it's computationally trivial - merely that incoherence can be so expensive that it can be worth doing.

    Jawed
     
  17. 3dilettante

    Legend Alpha

    Joined:
    Sep 15, 2003
    Messages:
    8,579
    Likes Received:
    4,799
    Location:
    Well within 3d
    Where is it stated that Larrabee preemptively switches on branches?


    But-but-it's x86, so it can do anything! ;)

    I wonder if it's possible through some pathological code scheduling and inopportune scatter writes to inadvertently evict a qquad's state back to main memory.
     
  18. Jawed

    Legend

    Joined:
    Oct 2, 2004
    Messages:
    11,716
    Likes Received:
    2,137
    Location:
    London
    I was thinking of the L2 rather than the L1. Larrabee's special cache instructions appear to be targeted at controlling the L2.

    So, erm, I've got no idea whether a comparison with P54 is useful. Did it have L2?

    And now that we've found out they're planning a maximum of about 10 fibres per hardware thread, i.e. >30 fibres per core (assuming 3 hardware threads are doing pixel shading), we can get a sense of the register file size.

    At least 2 registers per element: 32 fibres * 16 elements * 2 registers * vec4 * 4 bytes ~ 16KB.
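
    Spelling out that back-of-envelope sum (only the figures in the line above):

```python
fibres, elements, regs = 32, 16, 2       # >30 fibres, 16-wide, 2 regs/element
components, bytes_each = 4, 4            # vec4 of fp32
total_bytes = fibres * elements * regs * components * bytes_each
print(total_bytes, total_bytes // 1024)  # 16384 bytes, i.e. 16 KB
```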

    Jawed
     
  19. Jawed

    Legend

    Joined:
    Oct 2, 2004
    Messages:
    11,716
    Likes Received:
    2,137
    Location:
    London
    Section 3.2:
    "Switching threads covers cases where the compiler is unable to schedule code without stalls". Because there's no branch prediction and everything is done with predication, a stall arises when jumping - but switching hardware threads obviates the stall. This is statically compilable - just like on ATI.

    Actually, hmm, "pre-emptively" isn't the correct term. Hmm, lol, I was thinking of the compiler pre-empting the branch stall - the core just acts dumb here, I don't think it's switching threads of its own accord. Sorry about that.

    Which is why I decided to do the dependent texturing example - because it does look like a disaster on Larrabee. I can't think of a decent solution to this if the program is allowed to have an arbitrary loop count.

    Clearly this is likely to come in general computation, e.g. doing sparse gathers inside a loop. Ct seems to revel in doing this, so I guess I'm missing something.

    Hmm, I think in the general case this is the default behaviour of Larrabee.

    For example if you want to send register data from VPU to the scalar unit, you have to write it to memory. Obviously, that just means it goes out to cache most of the time.

    Jawed
     
  20. 3dilettante

    Legend Alpha

    Joined:
    Sep 15, 2003
    Messages:
    8,579
    Likes Received:
    4,799
    Location:
    Well within 3d
    The L2 was off-die.

    That would indicate 64 64-byte vector registers per hardware thread. That's downright RISC-like.


    I was making a distinction between cache and RAM. I think Larrabee would do best to not force an upcoming qquad's state to memory hundreds of cycles away.
    Perhaps there is a way the core or compiler can ensure the furthest it can go is the L2.

    That's reminiscent of Xenon, where moving data between pipes requires a similar trip. The latency from that is pretty significant in the Xbox implementation.
    Perhaps that is something we can expect to improve with LarrabeeII.
     