Larrabee delayed to 2011?

It would be interesting if some of the real game developers here gave some insight into this. But I thought the era of coding even small sections of code in ASM was long gone.

To tap into the potential FLOPS of a current CPU you still need to write specific SIMD (i.e. SSE) code.
You would not want to do this in assembler but with intrinsics, which are very close to assembler;
they just abstract away the registers while still allowing normal C++ variables, flow control, etc.
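A minimal sketch of what that looks like in practice (a hypothetical scale-and-add over float arrays; the loop, pointers and scalar variables are ordinary C++, only the arithmetic goes through the SSE intrinsics):

Code:
#include <xmmintrin.h>  // SSE intrinsics (4 x float per __m128)

// Hypothetical example: dst[i] = a[i] * scale + b[i]; assumes n is a multiple of 4.
void scale_add(float* dst, const float* a, const float* b, float scale, int n)
{
    __m128 s = _mm_set1_ps(scale);        // broadcast scale into all 4 lanes
    for (int i = 0; i < n; i += 4) {      // plain C++ loop and index math
        __m128 va = _mm_loadu_ps(a + i);  // load 4 floats
        __m128 vb = _mm_loadu_ps(b + i);
        __m128 r  = _mm_add_ps(_mm_mul_ps(va, s), vb);
        _mm_storeu_ps(dst + i, r);        // store 4 results
    }
}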
 
Would you prefer to write shader-like code (which is implicitly parallelized) or to write your kernels via intrinsics (and have full control..)?
 
With shared memory and guaranteed execution domains (i.e. CUDA), the difference is rather arbitrary.
 
It would be interesting if some of the real game developers here gave some insight into this. But I thought the era of coding even small sections of code in ASM was long gone.

Correct. Off the top of my head, I can't think of any assembly language code in our codebase. If any exists, it's not significant. However, there is some use of architecture-specific compiler intrinsics, such as for SSE and AltiVec. These are used in performance-critical parts of the code.

Writing SSE code using intrinsics is generally orders of magnitude less productive than writing shader code, even if you have to spend time optimizing the high-level code to generate better machine code (something you may need to do with intrinsics code too, btw). Anyone intending to write any significant portion of code for Larrabee using the intrinsics would be insane. The x86 ISA has no particular advantage I can see as a developer, especially since it's not even the same x86. Larrabee is best programmed using OpenCL or HLSL.
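To make the gap concrete, a rough sketch (illustrative only; in HLSL or OpenCL you write just the scalar per-element body and the parallelization across elements is implicit, while the intrinsics route makes you manage 4-wide lanes, loads and stores yourself):

Code:
#include <xmmintrin.h>

// Shader-style: one element's worth of scalar work (shown as plain C++);
// the compiler/runtime runs it across all elements in parallel.
float madd(float a, float b, float scale)
{
    return a * scale + b;
}

// Intrinsics style: the same math, but you deal with the 4-wide lanes
// yourself (plus loads, stores, alignment and loop tails in real code).
__m128 madd4(__m128 a, __m128 b, __m128 scale)
{
    return _mm_add_ps(_mm_mul_ps(a, scale), b);
}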
 
With shared memory and guaranteed execution domains (i.e. CUDA), the difference is rather arbitrary.
Not totally true... the abstraction penalties with respect to work-group sync and control flow are pretty painful in some cases. Put another way, there are operations that are cheap in hardware that are not (yet?) efficiently expressed in the DirectCompute/OpenCL/CUDA abstractions. That's not to say that the abstractions are not useful, but they're not totally "there" yet either.
 
Correct. Off the top of my head, I can't think of any assembly language code in our codebase. If any exists, it's not significant. However, there is some use of architecture-specific compiler intrinsics, such as for SSE and AltiVec. These are used in performance-critical parts of the code.
Could you give examples of the kinds of code where SSE is useful in a game? CPU-based physics definitely qualifies. Sound processing, maybe? But that's about as far as my imagination takes me...

Writing SSE code using intrinsics is generally orders of magnitude less productive than writing shader code, even if you have to spend time optimizing the high-level code to generate better machine code (something you may need to do with intrinsics code too, btw).
I suspect those intrinsics are very similar to the ones used to program TI DSPs in something that tries to pretend it's still C. ;)
 
Would you prefer to write shader-like code (which is implicitly parallelized) or to write your kernels via intrinsics (and have full control..)?

For programming GPUs, shader code definitely is the way. It can still pay off to look at the generated assembler to see if it's optimal and rearrange expressions accordingly.
 
For programming GPUs, shader code definitely is the way. It can still pay off to look at the generated assembler to see if it's optimal and rearrange expressions accordingly.
Why wouldn't you use the same strategy on a CPU or on an LRB-like architecture? In the end, writing shaders is way easier, and the programming model is likely to let you get very good performance with minimal effort, at least with fairly regular kernels/data.
 
I suspect those intrinsics are very similar to the ones used to program TI DSPs in something that tries to pretend it's still C. ;)

They are very close to the assembly instructions, but you don't have to worry about register allocation or even scheduling. For example, it looks like this:

Code:
#include <xmmintrin.h>  // header providing the SSE intrinsics

__m128 a, b, c;
...
a = _mm_add_ps(b, c);

Basically, _mm_add_ps maps directly to the ADDPS instruction. Most intrinsics work this way, although there are also a few "combo" intrinsics, such as _mm_load1_ps (which loads a single float and spreads it into all 4 words of a variable). There is no single such instruction, so it's done with a load and a shuffle instruction.
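For instance (a minimal sketch; typical compilers turn this into something like a MOVSS followed by a SHUFPS):

Code:
#include <xmmintrin.h>

__m128 splat(float s)
{
    return _mm_load1_ps(&s);  // all 4 words now hold s; there is no single
                              // instruction for this, so the compiler emits
                              // a scalar load plus a shuffle
}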

Intel has tried to provide a C++ library with a 4D vector type whose operators are overloaded to use SSE, so you just write a = b + c and it gets turned into _mm_add_ps. This is in theory more readable, but in practice plain operator expressions only cover a small part of the computation, so it's not that useful.
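The idea looks roughly like this (a hypothetical minimal wrapper in the spirit of that library; the names Vec4 and operator+ here are illustrative, not Intel's actual API):

Code:
#include <xmmintrin.h>

struct Vec4 {                    // hypothetical 4-float wrapper
    __m128 v;
    explicit Vec4(__m128 x) : v(x) {}
};

inline Vec4 operator+(Vec4 a, Vec4 b)
{
    return Vec4(_mm_add_ps(a.v, b.v));  // a + b compiles down to ADDPS
}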

There are also auto-vectorizing compilers (I believe both the Intel C++ compiler and GCC now have these options), but they are still not very good.
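As a rough idea of what they can handle (assuming flags such as GCC's -O3, which enables its vectorizer, and assuming the compiler can prove the arrays don't overlap):

Code:
// Simple enough that an auto-vectorizer has a chance; the __restrict
// qualifiers tell the compiler the arrays don't alias.
void add_arrays(float* __restrict dst,
                const float* __restrict a,
                const float* __restrict b, int n)
{
    for (int i = 0; i < n; ++i)
        dst[i] = a[i] + b[i];
}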
 
Intel has tried to provide a C++ library with a 4D vector type whose operators are overloaded to use SSE, so you just write a = b + c and it gets turned into _mm_add_ps. This is in theory more readable, but in practice plain operator expressions only cover a small part of the computation, so it's not that useful.

Shameless plug:

Try Eigen: not only will it remove temporaries, it will also SIMDify and unroll loops automagically wherever possible.
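A minimal sketch of what that looks like with Eigen (fixed-size types like Array4f let it unroll fully and map to SSE, and the expression templates avoid temporaries):

Code:
#include <Eigen/Dense>

void example()
{
    Eigen::Array4f a, b, c, d;
    a << 1, 2, 3, 4;
    b << 5, 6, 7, 8;
    c << 9, 10, 11, 12;
    d = a * b + c;  // coefficient-wise; the expression is evaluated in one
                    // pass with no temporaries, and the fixed-size 4-float
                    // type is unrolled and vectorized with SSE
}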
 
Why wouldn't you use the same strategy on a CPU or on an LRB-like architecture? In the end, writing shaders is way easier, and the programming model is likely to let you get very good performance with minimal effort, at least with fairly regular kernels/data.

If you write code that needs to run on a CPU, you cannot use a shader language.
With OpenCL for CPUs, that may change.
 
Could you give examples of the kinds of code where SSE is useful in a game? CPU-based physics definitely qualifies. Sound processing, maybe? But that's about as far as my imagination takes me...

It is used extensively in the occlusion culling code. Some use in terrain code as well, and some in the DX10 backend.
 
The unfavorable performance per area and power consumption of what Intel considers the most basic of x86 designs, the Atom core, compared to natively mobile cores like ARM's, make me inclined to believe that x86 overhead would become a design limit for any type of processor focused on being lean, like a GPU, and that it was a significant factor in Larrabee's failure.
 
The unfavorable performance per area and power consumption of what Intel considers the most basic of x86 designs, the Atom core, compared to natively mobile cores like ARM's, make me inclined to believe that x86 overhead would become a design limit for any type of processor focused on being lean, like a GPU, and that it was a significant factor in Larrabee's failure.

We have argued here before that making the cores x86 was a bad decision in this case. I didn't think, however, that it would delay or sink the project, and we probably won't know for sure.
 
I personally thought LRBni was the first decent vector instruction set in Intel's history. After the back-to-back failures of MMX and SSE, I had really lost faith that Intel would be able to produce a decent vector set (we're up to SSE 4.2 and it's still being tinkered with and still lacking).
I didn't think the x86 "overhead" was such a big deal - after all, they at least picked what is IMHO the last decent straight-up x86 implementation (Pentium 1, pre-MMX). The ISA is certainly archaic, but at least they were getting some value out of it: short pipeline, memory ops, flexible address generation, etc.
 
I personally thought LRBni was the first decent vector instruction set in Intel's history. After the back-to-back failures of MMX and SSE, I had really lost faith that Intel would be able to produce a decent vector set (we're up to SSE 4.2 and it's still being tinkered with and still lacking).

If you ask me, apart from LRBni (and to a limited extent SSE1 and SSE2), all the vector ISAs to come out of Intel have been pretty much damaged goods.

I didn't think the x86 "overhead" was such a big deal
Really? I am still waiting for someone to give me one example of an app that scales to O(100) hardware threads and needs full cache coherency in hardware because of extensive inter-thread communication. The killer app of LRB, rasterization, doesn't.

- after all, they at least picked what is IMHO the last decent straight-up x86 implementation (Pentium 1, pre-MMX). The ISA is certainly archaic, but at least they were getting some value out of it: short pipeline, memory ops, flexible address generation, etc.

They need 64-bit support, so that means SSE and SSE2. Of course, the lovely x87 FPU and the BCD instructions aren't gonna go away. Short pipeline? Not sure it helps; LRB is gonna be clocked at >2 GHz. As for the memory ops and address generation, I could be wrong, but it seems none of that happens on the x86 side. The addresses seem to be generated on the vector units, and the memory ops on the VPU seem to be more powerful than whatever x86 has to offer.

The real gem in LRB for me (apart from its sexy VPU) is the unification of the cache hierarchy, context storage and shared memory, bringing maximum flexibility and increasing utilization of all the kinds of on-chip memory pools. This thing driven by something like an ARM Cortex-A8 (without the pesky Thumb, Jazelle, VFP, TrustZone and NEON bits) would be way more area-efficient, with power efficiency being somewhat higher too.
 