Larrabee delayed to 2011 ?

Discussion in 'Architecture and Products' started by rpg.314, Sep 22, 2009.

  1. Voxilla

    Regular

    Joined:
    Jun 23, 2007
    Messages:
    832
    Likes Received:
    505
    For tapping into the potential FLOPS of a current CPU you still need to write architecture-specific SIMD (i.e. SSE) code.
    You would not want to do this in assembler but with intrinsics, which are very close to assembler:
    they just abstract away registers while still allowing normal C++ variables, flow control, etc.
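    As a sketch of what that looks like (a hypothetical 4-float add kernel; _mm_loadu_ps, _mm_add_ps and _mm_storeu_ps are real SSE intrinsics, the function name is made up):

    ```cpp
    #include <xmmintrin.h>  // SSE intrinsics

    // Add two arrays of 4 floats; the compiler handles register allocation.
    void add4(const float* a, const float* b, float* out) {
        __m128 va = _mm_loadu_ps(a);             // unaligned load of 4 floats
        __m128 vb = _mm_loadu_ps(b);
        _mm_storeu_ps(out, _mm_add_ps(va, vb));  // ADDPS, then store
    }
    ```

    Note how the __m128 values behave like ordinary local variables even though they end up in XMM registers.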
     
  2. nAo

    nAo Nutella Nutellae
    Veteran

    Joined:
    Feb 6, 2002
    Messages:
    4,400
    Likes Received:
    440
    Location:
    San Francisco
    Would you prefer to write shader-like code (which is implicitly parallelized) or to write your kernels via intrinsics (and have full control..)?
     
  3. MfA

    MfA
    Legend

    Joined:
    Feb 6, 2002
    Messages:
    7,610
    Likes Received:
    825
    With shared memory and guaranteed execution domains (ie. CUDA) the difference is rather arbitrary.
     
  4. Humus

    Humus Crazy coder
    Veteran

    Joined:
    Feb 6, 2002
    Messages:
    3,217
    Likes Received:
    77
    Location:
    Stockholm, Sweden
    Correct. Off the top of my head, I can't think of any assembly language code in our codebase. If there is any, it's not significant. However, there is some use of architecture-specific compiler intrinsics, such as for SSE and AltiVec. These are used in performance-critical parts of the code.

    Writing SSE code using intrinsics is generally orders of magnitude less productive than writing shader code, even if you have to spend time optimizing the high-level code to generate better machine code (something you may need to do with intrinsics code too, btw). Anyone intending to write any significant portion of code for Larrabee using the intrinsics would be insane. The x86 ISA has no particular advantage I can see as a developer, especially since it's not even the same x86. Larrabee is best programmed using OpenCL or HLSL.
     
  5. Andrew Lauritzen

    Andrew Lauritzen Moderator
    Moderator Veteran

    Joined:
    May 21, 2004
    Messages:
    2,629
    Likes Received:
    1,227
    Location:
    British Columbia, Canada
    Not totally true... the abstraction penalties with respect to work group sync and control flow are pretty hurtful in some cases. Put another way, there are operations that are cheap in hardware that are not (yet?) efficiently expressed in the DC/OpenCL/CUDA abstraction. That's not to say that the abstractions are not useful, but they're not totally "there" yet either.
     
    #285 Andrew Lauritzen, Dec 7, 2009
    Last edited by a moderator: Dec 7, 2009
  6. silent_guy

    Veteran Subscriber

    Joined:
    Mar 7, 2006
    Messages:
    3,754
    Likes Received:
    1,382
    Could you give examples of the kinds of code where SSE is useful in a game? CPU-based physics definitely qualifies. Sound processing maybe? But that's about as far as my imagination takes me...

    I suspect those intrinsics are very similar to the ones used to program TI DSPs in something that tries to pretend it's still C. :wink:
     
  7. Voxilla

    Regular

    Joined:
    Jun 23, 2007
    Messages:
    832
    Likes Received:
    505
    For programming GPUs, shader code is definitely the way to go. It can still pay off to look at the generated assembler to see if it's optimal and rearrange expressions accordingly.
     
  8. nAo

    nAo Nutella Nutellae
    Veteran

    Joined:
    Feb 6, 2002
    Messages:
    4,400
    Likes Received:
    440
    Location:
    San Francisco
    Why wouldn't you use the same strategy on a CPU or on a LRB-like architecture? In the end, writing shaders is way easier, and the programming model is likely to let you get very good performance with minimal effort, at least with fairly regular kernels/data.
     
  9. pcchen

    pcchen Moderator
    Moderator Veteran Subscriber

    Joined:
    Feb 6, 2002
    Messages:
    3,018
    Likes Received:
    582
    Location:
    Taiwan
    They are very close to assembly, but you don't have to worry about register allocation or even scheduling. For example, it's like

    Code:
    __m128 a, b, c;
    ...
    a = _mm_add_ps(b, c);
    Basically _mm_add_ps maps directly to the instruction ADDPS. Most intrinsics work this way, although there are also a few "combo" intrinsics, such as _mm_load1_ps (which loads a single float and broadcasts it into all 4 words of a variable). There is no such instruction, so it's done with a load plus a shuffle instruction.
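    A minimal sketch of such a combo intrinsic in use (the wrapper function name is made up; _mm_load1_ps is the real intrinsic):

    ```cpp
    #include <xmmintrin.h>

    // _mm_load1_ps has no single-instruction equivalent: compilers emit a
    // scalar load plus a shuffle that replicates the value into all 4 lanes.
    __m128 broadcast(float x) {
        return _mm_load1_ps(&x);  // yields {x, x, x, x}
    }
    ```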

    Intel has also provided a C++ library with an overloaded 4D vector type whose operators use SSE under the hood, so you can just write a = b + c and it gets turned into _mm_add_ps. This is in theory more readable, but in practice C++ operator overloading covers only a small fraction of the available operations, so it's not that useful.
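    A rough sketch of how such an overloaded type works (Vec4 is an illustrative name, not the actual class from Intel's library):

    ```cpp
    #include <xmmintrin.h>

    // Thin wrapper over an SSE register with overloaded arithmetic.
    struct Vec4 {
        __m128 v;
        explicit Vec4(__m128 x) : v(x) {}
        explicit Vec4(float x) : v(_mm_set1_ps(x)) {}  // broadcast ctor
    };

    // a + b compiles down to a single ADDPS, a * b to MULPS.
    inline Vec4 operator+(Vec4 a, Vec4 b) { return Vec4(_mm_add_ps(a.v, b.v)); }
    inline Vec4 operator*(Vec4 a, Vec4 b) { return Vec4(_mm_mul_ps(a.v, b.v)); }
    ```

    The limitation the post mentions shows up as soon as you need a shuffle, a horizontal sum or a compare-and-mask: there is no natural C++ operator for those, so you fall back to raw intrinsics anyway.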

    There are also auto-vectorizing compilers (I believe both the Intel C++ compiler and GCC have such options now), but they are still not very good.
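    For what it's worth, the loops those vectorizers handle best look something like this (a hypothetical saxpy-style kernel; unit-stride accesses and independent iterations are what make the transformation straightforward):

    ```cpp
    // With e.g. gcc -O3 this loop is a good auto-vectorization candidate:
    // unit stride, no cross-iteration dependences, trip count known at entry.
    // The compiler turns the body into MULPS + ADDPS over 4-wide chunks.
    void saxpy(float* out, const float* x, const float* y, float a, int n) {
        for (int i = 0; i < n; ++i)
            out[i] = a * x[i] + y[i];
    }
    ```

    Anything with gathers, conditionals or potential aliasing tends to defeat the vectorizer, which is why intrinsics are still needed in practice.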
     
  10. rpg.314

    Veteran

    Joined:
    Jul 21, 2008
    Messages:
    4,298
    Likes Received:
    0
    Location:
    /
    \Shameless plug

    Try Eigen: not only will it remove temporaries, it will also SIMDify and unroll loops automagically wherever possible.
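    The temporary removal relies on expression templates: a + b builds a lightweight node, and the actual loop only runs on assignment, so no intermediate array is ever allocated. A stripped-down sketch of the idea (all names illustrative, none of this is Eigen's actual API):

    ```cpp
    #include <cstddef>

    // Node representing an unevaluated elementwise sum.
    template <typename L, typename R>
    struct Sum {
        const L& l;
        const R& r;
        float operator[](std::size_t i) const { return l[i] + r[i]; }
    };

    // Fixed-size array; assignment from a Sum runs one fused loop.
    struct Arr {
        float data[4];
        float operator[](std::size_t i) const { return data[i]; }
        template <typename L, typename R>
        Arr& operator=(const Sum<L, R>& e) {
            for (std::size_t i = 0; i < 4; ++i) data[i] = e[i];
            return *this;
        }
    };

    // a + b and (a + b) + c build nested Sum nodes instead of temporaries.
    inline Sum<Arr, Arr> operator+(const Arr& a, const Arr& b) { return {a, b}; }
    template <typename L, typename R>
    Sum<Sum<L, R>, Arr> operator+(const Sum<L, R>& s, const Arr& b) { return {s, b}; }
    ```

    Eigen layers SIMD and loop unrolling on top of the same mechanism, which is how it beats a naive overloaded vector class.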
     
  11. Voxilla

    Regular

    Joined:
    Jun 23, 2007
    Messages:
    832
    Likes Received:
    505
    If you write code that needs to run on a CPU you cannot use a shader language.
    With OpenCL for CPUs that may change.
     
  12. nAo

    nAo Nutella Nutellae
    Veteran

    Joined:
    Feb 6, 2002
    Messages:
    4,400
    Likes Received:
    440
    Location:
    San Francisco
    That was my point. Just because you cannot do it right now it doesn't mean you won't be able to do it in the future (Cuda, OpenCL, etc..)

    See http://www.cercs.gatech.edu/tech-reports/tr2009/git-cercs-09-01.pdf
     
  13. PeterT

    Regular

    Joined:
    May 14, 2002
    Messages:
    702
    Likes Received:
    14
    Location:
    Austria
    Not really, much of it seems (to me) to be straight out of 1999.
     
  14. Humus

    Humus Crazy coder
    Veteran

    Joined:
    Feb 6, 2002
    Messages:
    3,217
    Likes Received:
    77
    Location:
    Stockholm, Sweden
    It is used extensively in the occlusion culling code. Some use in terrain code as well, and some in the DX10 backend.
     
  15. Lazy8s

    Veteran

    Joined:
    Oct 3, 2002
    Messages:
    3,100
    Likes Received:
    19
    The unfavorable performance per area and power consumption of what Intel considers its most basic x86 design, the Atom core, compared to natively mobile cores like ARM's makes me inclined to believe that x86 overhead would become a design limit for any processor focused on being lean, like a GPU, and that it was a significant factor in Larrabee's failure.
     
  16. compres

    Regular

    Joined:
    Jun 16, 2003
    Messages:
    553
    Likes Received:
    3
    Location:
    Germany
    We have argued here before that making the cores x86 was a bad decision in this case. I did not think, however, that it would delay or sink the project, and we probably won't ever know for sure.
     
  17. Barbarian

    Regular

    Joined:
    Jun 27, 2005
    Messages:
    289
    Likes Received:
    15
    Location:
    California, USA
    I personally thought LRBni was the first decent vector instruction set in Intel's history. After the back-to-back failures of MMX and SSE, I had really lost faith that Intel could produce a decent vector set (we're up to SSE 4.2 and it's still being tinkered with and still lacking).
    I didn't think the x86 "overhead" was such a big deal; after all, they at least picked what is IMHO the last decent straight-up x86 implementation (the Pentium 1, pre-MMX). The ISA is certainly archaic, but at least they were getting some value out of it: short pipeline, memory ops, flexible address generation, etc.
     
  18. rpg.314

    Veteran

    Joined:
    Jul 21, 2008
    Messages:
    4,298
    Likes Received:
    0
    Location:
    /
    If you ask me, apart from LRBni (and to a limited extent SSE1 and SSE2), all the ISA extensions to come out of Intel have been pretty much damaged goods.

    Really? I am still waiting for someone to give me one example of apps that scale to O(100) hw threads and need full cache coherency in hw because of extensive inter thread communication. The killer app of lrb, rasterization, doesn't.

    They need 64-bit support, so that means SSE and SSE2. Of course the lovely x87 FPU and the BCD instructions aren't gonna go away. Short pipeline? Not sure it helps; LRB is gonna be clocked at >2GHz. As for the memory ops and address generation, I could be wrong, but it seems that none of that happens on the x86 side. The addresses seem to be generated on the vector units, and memory ops on the VPU seem to be more powerful than whatever x86 has to offer.

    The real gem in LRB for me (apart from its sexy VPU) is the unification of the cache hierarchy, context storage and shared memory, bringing maximum flexibility and increasing utilization of all the kinds of on-chip memory pools. This thing driven by something like an ARM Cortex-A8 (without the pesky Thumb, Jazelle, VFP, TrustZone and NEON bits) would be way more area efficient, with power efficiency being somewhat higher too.
     
  19. MfA

    MfA
    Legend

    Joined:
    Feb 6, 2002
    Messages:
    7,610
    Likes Received:
    825
    Compared to TRIPS even that is rather traditional ... but it's revolutionary for Intel.
     