C++ implementation of Larrabee new instructions

Discussion in 'Architecture and Products' started by nAo, Mar 25, 2009.

  1. frogblast

    Newcomer

    Joined:
    Apr 1, 2008
    Messages:
    79
    Likes Received:
    4
    My guess is that they have 3 variants because only the last argument in the instruction encoding can come from memory and/or have format conversion applied. The trio of instructions gives you flexibility as to which of the arguments that applies to.
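
    A minimal scalar sketch of why three orderings are useful, assuming the digits work the way the names suggest: the first two digits name the operands that are multiplied, the last digit names the operand that is added, operand 1 is also the destination, and operand 3 is the one that may come from memory. `Vec16`, `splat` and the `madd*` functions are hypothetical helpers written for this post, not the prototype library's own definitions.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

// Hypothetical 16-lane float vector used only for this sketch.
using Vec16 = std::array<float, 16>;

// Fills every lane with the same value.
Vec16 splat(float x) {
    Vec16 v{};
    v.fill(x);
    return v;
}

// Assumed convention: operand 1 is the destination, operand 3 is the one
// that could be a memory operand and/or have a format conversion applied.

// v1 * v3 + v2: destination and memory-capable operand are multiplied.
Vec16 madd132(const Vec16& v1, const Vec16& v2, const Vec16& v3) {
    Vec16 r{};
    for (std::size_t i = 0; i < r.size(); ++i) r[i] = v1[i] * v3[i] + v2[i];
    return r;
}

// v2 * v1 + v3: the memory-capable operand is the addend.
Vec16 madd213(const Vec16& v1, const Vec16& v2, const Vec16& v3) {
    Vec16 r{};
    for (std::size_t i = 0; i < r.size(); ++i) r[i] = v2[i] * v1[i] + v3[i];
    return r;
}

// v2 * v3 + v1: the destination acts as the accumulator.
Vec16 madd231(const Vec16& v1, const Vec16& v2, const Vec16& v3) {
    Vec16 r{};
    for (std::size_t i = 0; i < r.size(); ++i) r[i] = v2[i] * v3[i] + v1[i];
    return r;
}

// Usage: the same three inputs give three different results.
void madd_demo() {
    const Vec16 a = splat(2.0f), b = splat(3.0f), c = splat(5.0f);
    assert(madd132(a, b, c)[0] == 13.0f);  // 2 * 5 + 3
    assert(madd213(a, b, c)[0] == 11.0f);  // 3 * 2 + 5
    assert(madd231(a, b, c)[0] == 17.0f);  // 3 * 5 + 2
}
```

    With only one operand slot able to reach memory, picking the variant is how you decide whether the memory value ends up as a multiplicand or as the addend.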
     
  2. Jawed

    Legend

    Joined:
    Oct 2, 2004
    Messages:
    10,873
    Likes Received:
    767
    Location:
    London
    So it appears that LOG2, EXP2, RECIP and RSQRT are intrinsic transcendentals that operate on all 16 lanes of a vector. That appears to be quite a serious investment in die area! Though we can't tell what latency these instructions will have.

    Jawed
     
  3. Nick

    Veteran

    Joined:
    Jan 7, 2003
    Messages:
    1,881
    Likes Received:
    17
    Location:
    Montreal, Quebec
    Not necessarily. Just like G80, they could use fewer execution units than vector components and process them partially sequentially. So it's a die area versus latency versus throughput tradeoff. Four execution units seem like a reasonable choice. Note that Intel has done this before with SSE: before Core 2 they only had 64-bit wide execution units, and every 128-bit wide operation took an extra cycle to process both the lower and upper half.
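
    As a rough model of that tradeoff (not a claim about the actual hardware), the sketch below pushes a 16-lane reciprocal square root through a configurable number of execution units and counts the passes. The names `rsqrt_sequenced` and `passes_for` are made up for illustration, and one pass is only a stand-in for whatever the real issue cost would be.

```cpp
#include <array>
#include <cassert>
#include <cmath>
#include <cstddef>

constexpr std::size_t kLanes = 16;
using Vec16 = std::array<float, kLanes>;

// Passes needed when `units` execution units share kLanes lanes (rounded up).
constexpr std::size_t passes_for(std::size_t units) {
    return (kLanes + units - 1) / units;
}

// Computes 1/sqrt(x) for every lane, handling `units` lanes per pass.
// `passes` receives the number of passes the loop actually took.
Vec16 rsqrt_sequenced(const Vec16& in, std::size_t units, std::size_t& passes) {
    assert(units > 0 && units <= kLanes);
    Vec16 out{};
    passes = 0;
    for (std::size_t base = 0; base < kLanes; base += units) {
        const std::size_t end = base + units < kLanes ? base + units : kLanes;
        for (std::size_t lane = base; lane < end; ++lane) {
            out[lane] = 1.0f / std::sqrt(in[lane]);
        }
        ++passes;
    }
    return out;
}
```

    Under this model four units take four passes for a full vector and sixteen units take one, which is the area versus throughput knob being described.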
     
  4. codedivine

    Regular

    Joined:
    Jan 22, 2009
    Messages:
    271
    Likes Received:
    0
    "Math utility functions do not correspond directly to Larrabee new instructions, but are added for programming support."

    I think these will likely be executed in a scalar fashion in h/w.
     
  5. Jawed

    Legend

    Joined:
    Oct 2, 2004
    Messages:
    10,873
    Likes Received:
    767
    Location:
    London
    The instructions I listed aren't in the set of "math utility operations", they're there right next to ADD, MAD132, etc.

    Because the math utility operations section is specifically about instructions that take multiple cycles it seems to me like it's a very strong hint that these four transcendentals are single-cycle intrinsic 16-wide vector operations.

    What's bugging me is why the presentations at GDC haven't appeared yet :sad:

    Jawed
     
  6. rpg.314

    Veteran

    Joined:
    Jul 21, 2008
    Messages:
    4,298
    Likes Received:
    0
    Location:
    /
    Is GDC under NDA à la Apple's WWDC? Doesn't look like that to me. I would have thought that Intel would want to popularize the techniques; that's why the vector ISA intrinsics were released. Or maybe not many people care about rasterizing in software, so there's no point in telling them about it.
     
  7. Jawed

    Legend

    Joined:
    Oct 2, 2004
    Messages:
    10,873
    Likes Received:
    767
    Location:
    London
    From Formal Verification of IA-64 Division Algorithms:

    http://www.cl.cam.ac.uk/~jrh13/slides/tphols-15aug00/slides.pdf

    slide 9, it appears that frcpa has 5 cycles of latency. (IA64 seems to have two floating point units.) I know essentially nothing about IA64, so I'm unclear whether the floating point units have a latency of 5 cycles as that slide can be interpreted in multiple ways.

    There's more background here:

    http://download.intel.com/technology/itj/q41999/pdf/ia64fpbf.pdf
    http://www.ncsu.edu/wcae/ISCA2003/submissions/cornea.pdf

    This:

    http://www.pdc.kth.se/~pek/ia64-profiling.txt

    says that FPU latency is 4 cycles, for Itanium 2, which was ~1GHz it seems.

    So, what's the latency of Larrabee's vector ALUs? It's 8 cycles in ATI and, er, I forget in NVidia, 12? That's purely the math, not fetching operands or storing.

    Jawed
     
  8. Andrew Lauritzen

    Moderator Veteran

    Joined:
    May 21, 2004
    Messages:
    2,526
    Likes Received:
    454
    Location:
    British Columbia, Canada
    A shader with 32 live registers at any point is pretty rare... note that on modern NVIDIA hardware 10+ registers is considered a lot (see the D3D tutorial slides from this year's GDC), so I don't think 32 is exactly starving anyone at the moment ;)

    I'm also confused as to why people think loop unrolling would increase the live register count at all... if anything, it may actually optimize out the loop iterator, but by no means should it increase the register count of the program since registers used internal to the loop are necessarily only live for one iteration, and thus can be trivially reused on the next.
     
  9. nAo

    nAo Nutella Nutellae
    Veteran

    Joined:
    Feb 6, 2002
    Messages:
    4,325
    Likes Received:
    93
    Location:
    San Francisco
    In a worst case scenario loop unrolling might slightly increase register pressure but it's not something I would particularly worry about. As Andrew says, in some cases it might actually decrease register pressure.
     
  10. MfA

    MfA
    Legend

    Joined:
    Feb 6, 2002
    Messages:
    6,684
    Likes Received:
    449
    Is branching completely unimproved over desktop x86 on Larrabee? (i.e. no zero-overhead looping, no data-dependent branch hints.) Also, does it have register renaming?

    Writing assembly with branches where you know the direction of the branch dozens of cycles ahead of time, and having no alternative but to let the processor butt its head into a wall of unnecessary branch misses, always irks me (even if after a while it learns to avoid the wall occasionally).
     
    #50 MfA, Mar 29, 2009
    Last edited by a moderator: Mar 29, 2009
  11. Nick

    Veteran

    Joined:
    Jan 7, 2003
    Messages:
    1,881
    Likes Received:
    17
    Location:
    Montreal, Quebec
    At the shader level each register would correspond to one scalar value, not a 4-component vector. So with 10 registers you wouldn't even have enough to do a MAC operation on a whole vector, let alone execute a complex shader. So 32 hardware registers really isn't too many if each of them stores 16 scalars from 16 pixels.
    There's loop unrolling to lower the loop test-and-branch overhead, and there's loop unrolling to hide latency. The latter form rapidly increases register usage.
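
    To make the distinction concrete, here is a small sketch using a plain sum as a stand-in for a shader loop; the function names are hypothetical. The rolled version keeps one live accumulator whose adds form a serial dependency chain. The latency-hiding version keeps four independent accumulators live at once so that consecutive adds do not wait on each other, which is where the extra registers go. Unrolling only to cut the test-and-branch overhead would keep a single accumulator and add no live values.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// One accumulator: every add depends on the result of the previous add.
float sum_rolled(const std::vector<float>& x) {
    float acc = 0.0f;
    for (float v : x) acc += v;
    return acc;
}

// Unrolled by four with four independent accumulators. The four adds in the
// loop body have no dependencies on each other, at the cost of four live
// values instead of one.
float sum_unrolled4(const std::vector<float>& x) {
    float a0 = 0.0f, a1 = 0.0f, a2 = 0.0f, a3 = 0.0f;
    const std::size_t n4 = x.size() - x.size() % 4;
    std::size_t i = 0;
    for (; i < n4; i += 4) {
        a0 += x[i];
        a1 += x[i + 1];
        a2 += x[i + 2];
        a3 += x[i + 3];
    }
    // Leftover elements when the length is not a multiple of four.
    for (; i < x.size(); ++i) a0 += x[i];
    return (a0 + a1) + (a2 + a3);
}
```

    Note that the second version reorders the floating-point additions, so for general inputs the two results can differ in the last bits.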
     
  12. Nick

    Veteran

    Joined:
    Jan 7, 2003
    Messages:
    1,881
    Likes Received:
    17
    Location:
    Montreal, Quebec
    It has 4-way HyperThreading. By the time the result of the branch is needed, it's ready.
    Why would you need that in an in-order processor?
     
  13. MfA

    MfA
    Legend

    Joined:
    Feb 6, 2002
    Messages:
    6,684
    Likes Received:
    449
    From what they say in the papers/presentations it sounds like it switches on branch mispredicts (to avoid the bubble) but not when the branch enters the pipeline. It would be nice if they did this, but even then the vertical multithreading wouldn't always be enough (you need on the order of 10-20 cycles, it's not just about instruction latency since branches are done higher up in the pipeline).
     
  14. Barbarian

    Regular

    Joined:
    Jun 27, 2005
    Messages:
    289
    Likes Received:
    15
    Location:
    California, USA
    Larrabee's design is supposed to closely match Pentium1's short execution pipeline. If I remember correctly, a mispredicted branch used to cost 4 cycles.
    Of course Pentium1 had only static prediction. I don't know if Larrabee has anything beefier than that.
     
  15. nAo

    nAo Nutella Nutellae
    Veteran

    Joined:
    Feb 6, 2002
    Messages:
    4,325
    Likes Received:
    93
    Location:
    San Francisco
  16. Andrew Lauritzen

    Moderator Veteran

    Joined:
    May 21, 2004
    Messages:
    2,526
    Likes Received:
    454
    Location:
    British Columbia, Canada
    The typical shaders I see spit out of FXC are 1-8 "vector" temporary registers (xyzw - although it's not packing those at all, so after scalarizing it's often less than 4 * # registers), which fits fine into 32 registers. I just compiled a very complex shader with significant control flow and it uses 13 FXC "vector" registers, and many of those it appears to be using only one or two components, so I don't think the situation is as bad as you seem to imply. In fact I think 32/HW thread is a pretty reasonable number for most shaders. Time will tell though :)

    Ah I see... in this context I'd consider that more directly related to software fibering (i.e. "unrolling" statically or dynamically the implicit outer loop over pixels), rather than unrolling shader loops, which is how I interpreted your original comment. And yes, that can certainly increase register pressure, but it's not exactly hard to spill/restore the register context at a really low cost on typical fiber switches.
     
  17. Rufus

    Newcomer

    Joined:
    Oct 25, 2006
    Messages:
    246
    Likes Received:
    60
    I think it's pretty clear that some of the ops, most likely the transcendentals, will be multi-cycle ops even if they're logically a single instruction.
     
  18. Rufus

    Newcomer

    Joined:
    Oct 25, 2006
    Messages:
    246
    Likes Received:
    60
    Interesting thing that just hit me from their slides: none of the vector instructions can take constants. If you want a constant (like a looped (Va+Vb)/2 or something) you have to use an existing X86 instruction to write the immediate value into memory, and then do a broadcast load from memory on the vector instruction that uses it.
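
    A scalar sketch of that pattern for the (Va+Vb)/2 case, written as a multiply by 0.5. The helper names are hypothetical; `mul_broadcast` stands in for a vector multiply whose last operand is a single float read from memory and replicated to all 16 lanes, and the `static const` stands in for the scalar store of the immediate.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

// Hypothetical 16-lane float vector used only for this sketch.
using Vec16 = std::array<float, 16>;

// Lane-wise add of two vector registers.
Vec16 add(const Vec16& a, const Vec16& b) {
    Vec16 r{};
    for (std::size_t i = 0; i < r.size(); ++i) r[i] = a[i] + b[i];
    return r;
}

// Multiplies every lane of `a` by one scalar read from memory, modelling a
// vector multiply whose last operand is a broadcast load.
Vec16 mul_broadcast(const Vec16& a, const float* mem) {
    assert(mem != nullptr);
    const float s = *mem;  // one 32-bit read, replicated to all lanes
    Vec16 r{};
    for (std::size_t i = 0; i < r.size(); ++i) r[i] = a[i] * s;
    return r;
}

// (va + vb) / 2 with the constant living in memory rather than in the
// instruction stream.
Vec16 average(const Vec16& va, const Vec16& vb) {
    static const float half = 0.5f;  // stand-in for the stored immediate
    return mul_broadcast(add(va, vb), &half);
}
```

    In a loop the constant only has to be written to memory once, so the per-iteration cost is the broadcast operand on the multiply.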
     
  19. rpg.314

    Veteran

    Joined:
    Jul 21, 2008
    Messages:
    4,298
    Likes Received:
    0
    Location:
    /
    LRB is an in-order core. I don't think it will have register renaming.
     
  20. nAo

    nAo Nutella Nutellae
    Veteran

    Joined:
    Feb 6, 2002
    Messages:
    4,325
    Likes Received:
    93
    Location:
    San Francisco
    Why would you unroll the latter loops? Unless they are incredibly short it doesn't make much sense.
     