C++ implementation of Larrabee new instructions

I don't know about the SM4 requirements, but yes, by spilling it to memory it is possible. But if you use the volatile keyword in CUDA, for instance, it will force it to become a register, and that can cause a launch failure.
 
Rage3D speculated that RV770 has 64 KB of L1 per SIMD (= 640 KB total on RV770) and 256 KB of L2 per memory controller (= 1 MB total on RV770).

Yeah, but I now know that's wrong (although it seemed really right back then). I think Rys determined the sizes at some point, though I don't remember his measurements... maybe you should summon him in :D
 
On Larrabee the entire 2 KB register file gets flushed to the cache on every fiber change, no matter what.
The Intel paper on Larrabee specifies in paragraph 4.5 that each core is going to run 4 threads, i.e. one thread per hardware context, so there won't be any context switches (notice that the paper also specifies that threads are pinned to a core). The four threads are split into one for setup and three for rasterization/shading/blending, and within each worker thread strands (i.e. fibers) are switched co-operatively to hide the latency of texture operations. This probably means that, depending on some heuristics, the Larrabee runtime will assign x quads to each worker thread, and the worker thread code is unrolled/software-pipelined to accommodate this. The code might look like this for a worker thread with four 'strands' (some very simple pseudo-code here):

Code:
for each batch of quads {
    // strand 1
    rasterize & interpolate quad 1
    send the texture fetches for quad 1 to the TUs
    // strand 2
    rasterize & interpolate quad 2
    send the texture fetches for quad 2 to the TUs
    // strand 3
    rasterize & interpolate quad 3
    send the texture fetches for quad 3 to the TUs
    // strand 4
    rasterize & interpolate quad 4
    send the texture fetches for quad 4 to the TUs

    // strand 1
    texels for quad 1 arrive in the L1
    shade quad 1
    blend quad 1
    // strand 2
    texels for quad 2 arrive in the L1
    shade quad 2
    blend quad 2
    // strand 3
    texels for quad 3 arrive in the L1
    shade quad 3
    blend quad 3
    // strand 4
    texels for quad 4 arrive in the L1
    shade quad 4
    blend quad 4
}

I kept the example very simple, but if the shader contains dependent texture fetches or complex constructs it might be compiled to a sequence of loops iterating over the quads rather than just one.
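
To make that concrete, here is a minimal C++-flavoured sketch of what "a sequence of loops" could look like for a shader with one dependent fetch; every function here is a made-up stub, not anything from the actual Larrabee runtime:

Code:
#include <cstddef>

struct Quad {};                              // placeholder for a 2x2 pixel quad
void issue_fetch_a(Quad&) {}                 // stub: first texture fetch
void issue_fetch_b(Quad&) {}                 // stub: fetch that depends on A's result
void shade_and_blend(Quad&) {}               // stub: rest of the shader + blend

void worker(Quad* quads, std::size_t n)
{
    constexpr std::size_t kStrands = 4;      // strands per worker thread (assumed)
    for (std::size_t i = 0; i + kStrands <= n; i += kStrands) {
        // loop 1: issue fetch A for every strand before anyone waits on it
        for (std::size_t s = 0; s < kStrands; ++s)
            issue_fetch_a(quads[i + s]);
        // loop 2: by now A's texels should be in the L1, so the dependent
        // fetch B can be issued for every strand
        for (std::size_t s = 0; s < kStrands; ++s)
            issue_fetch_b(quads[i + s]);
        // loop 3: consume B's results, shade and blend each quad
        for (std::size_t s = 0; s < kStrands; ++s)
            shade_and_blend(quads[i + s]);
    }
}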

Anyway, considering that the runtime cannot know the exact latency of the texture operations, there might be inefficiencies: if too few strands are assigned to a thread, the thread will stall waiting for the TUs; if there are too many strands, the code might not look good (too many spills due to the increased register pressure, underused TUs, etc.).

However, since the four threads are all running on the same core, it is likely that a large part of the inefficiencies associated with the code will be absorbed by the thread-switching mechanism.
 
This is sort of OT, but is it just me, or has Intel got the worst naming semantics ever? Every time I look at their opcode/intrinsic/whatever listings, I feel like my tongue is breaking, inside my head.

Reminds me of this old article I once read on "Writing bad code for dummies":
There was a section about keyword naming:
"Write your variable names as long and complicated as possible, with fewest characters possible different, preferably such that are hard to notice"
eg:
345fewrtrtderewrt
345fewrtrtberewrt

The article was from back in the early '90s, so maybe Intel adopted the ideology for all their subsequent naming schemes...
 
I looked over the ISA (or whatever you call this release), and here are some random thoughts of mine.

Has it been announced how large a virtual address space LRB will support? I have to assume it has >32-bit support (you can already buy 4 GB GPUs), but is it 40 bits, 48 bits, or greater? Either way, the scatter and gather instructions have a limited address range. They're defined as:
SCATTERD - Scatter Doubleword Vector to Memory
Down-converts and stores elements in doubleword vector v1 to the memory locations pointed to by base address m + index vector index * scale.
Notice that the array base is most likely 64 bits, but the index offsets are only 32-bit integers. If you're doing normal 4-byte accesses, that limits you to 16 GB of virtual address space you can gather/scatter across. Nothing you'd ever hit when doing graphics, but certainly something the compiler will have to take into consideration, and it could become an issue in larger HPC systems.
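
As a scalar illustration (the names here are mine, not the instruction's actual operand encoding), the addressing the description implies is just:

Code:
#include <cstdint>

// effective address = 64-bit base + 32-bit index * scale
std::uintptr_t effective_address(std::uintptr_t base,  // full 64-bit pointer
                                 std::int32_t  index,  // only 32 bits wide
                                 std::uint8_t  scale)  // 1, 2, 4 or 8
{
    return base + static_cast<std::int64_t>(index) * scale;
}

// With 4-byte elements (scale = 4) the window reachable around 'base' is
// 2^32 * 4 bytes = 16 GB, which is the limit mentioned above.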

The other limitation is on the shuffle instruction:
SHUF128x32 - Shuffle Vector Dqwords Then Doublewords
Shuffles 128-bit blocks of the vector read from vector v2, and then 32-bit blocks of the result.
From a hardware perspective this implementation makes sense. You have an initial 4x4x128-bit crossbar followed by 4 individual 4x4x32-bit crossbars. That results in a much smaller and faster design than a full 16x16x32-bit crossbar. However, to do an any-to-any shuffle takes 4 instructions, and you need to either pre-set or set 4 mask registers. Again, you're extremely unlikely to hit this in graphics, but it's a limitation for general-purpose usage.
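
Here's a rough scalar model of that two-level structure (the control encoding is invented purely for illustration, and I'm assuming the same 32-bit pattern is applied inside every 128-bit block, which is consistent with needing several of these plus masks for a fully general permute):

Code:
#include <array>
#include <cstdint>

using Vec512 = std::array<std::uint32_t, 16>;   // 16 dwords = 512 bits

Vec512 shuf128x32_model(const Vec512& src,
                        const std::array<int, 4>& block_sel, // which 128-bit block (0..3) feeds each slot
                        const std::array<int, 4>& lane_sel)  // dword order (0..3) inside every block
{
    Vec512 tmp{}, out{};
    // stage 1: 4x4 crossbar over 128-bit blocks
    for (int b = 0; b < 4; ++b)
        for (int i = 0; i < 4; ++i)
            tmp[b * 4 + i] = src[block_sel[b] * 4 + i];
    // stage 2: 4x4 crossbar over 32-bit lanes, the same pattern in each block
    for (int b = 0; b < 4; ++b)
        for (int i = 0; i < 4; ++i)
            out[b * 4 + i] = tmp[b * 4 + lane_sel[i]];
    return out;
}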

Now these two instructions are amazingly useful. They basically take the input/output stage of lots of parallel algorithms and turn it into a single instruction. I'm really curious as to how they're implemented and how they perform (a scalar sketch of how I read their semantics follows below):
COMPRESS{D,Q} - Pack and Store Vector to Unaligned Memory
EXPAND{D,Q} - Load Unaligned and Unpack to Vector
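
For what it's worth, this is the semantics I read into those two descriptions, written as scalar C++; how the untouched lanes behave under the mask is my assumption:

Code:
#include <array>
#include <cstdint>

using Vec = std::array<std::uint32_t, 16>;

// COMPRESSD-like: pack the lanes whose mask bit is set and store them
// contiguously (possibly unaligned) starting at dst; returns how many were written.
int compress_store(const Vec& v, std::uint16_t mask, std::uint32_t* dst)
{
    int n = 0;
    for (int i = 0; i < 16; ++i)
        if (mask & (1u << i))
            dst[n++] = v[i];
    return n;
}

// EXPANDD-like: read contiguous values from src and spread them into the
// lanes whose mask bit is set, leaving the other lanes unchanged.
void expand_load(Vec& v, std::uint16_t mask, const std::uint32_t* src)
{
    int n = 0;
    for (int i = 0; i < 16; ++i)
        if (mask & (1u << i))
            v[i] = src[n++];
}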

Getting 3 options for the float MAD is kind of silly. V1 = V1*V3 + V2 is usefully different than V1 = V2*V3 + V1, but the compiler should be able to generate the 3rd variant. Also it's interesting that the float MADs have 3 variants plus 3 negate variants, while int MAD only has a single variant. It's also interesting that they appear to only be implementing a standard multiply-add, not a fused multiply-add.
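
To spell out why the third variant is redundant (plain per-lane C++ standing in for the vector instructions):

Code:
#include <array>

using Vec = std::array<float, 16>;

// Form A: the destination is one of the multiplicands (V1 = V1*V3 + V2).
void madd_a(Vec& v1, const Vec& v2, const Vec& v3)
{
    for (int i = 0; i < 16; ++i) v1[i] = v1[i] * v3[i] + v2[i];
}

// Form B: the destination is the addend (V1 = V2*V3 + V1).
void madd_b(Vec& v1, const Vec& v2, const Vec& v3)
{
    for (int i = 0; i < 16; ++i) v1[i] = v2[i] * v3[i] + v1[i];
}

// The "third" form, V1 = V1*V2 + V3, is just form A with the two source
// operands swapped, so the compiler can synthesize it:
void madd_third(Vec& v1, const Vec& v2, const Vec& v3)
{
    madd_a(v1, v3, v2);   // computes v1 = v1*v2 + v3
}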

The MADD233 instruction clearly exists to support a specific algorithm, but I can't figure out what it is off the top of my head. It's Vec4 = Vec4*A + B, where A and B are scalars.
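
Going only by that "Vec4 = Vec4*A + B" reading, the shape of it would be something like the sketch below; which lanes of the second operand A and B actually come from is a pure guess on my part:

Code:
#include <array>

using Vec = std::array<float, 16>;

// Guessed MADD233-style operation: each 4-element block of v2 is scaled by
// one element of the corresponding block of v3 and offset by another.
// The lane choices (1 for the scale, 0 for the offset) are made up.
Vec madd233_guess(const Vec& v2, const Vec& v3)
{
    Vec out{};
    for (int block = 0; block < 4; ++block) {
        const float a = v3[block * 4 + 1];   // assumed per-block scale
        const float b = v3[block * 4 + 0];   // assumed per-block offset
        for (int i = 0; i < 4; ++i)
            out[block * 4 + i] = v2[block * 4 + i] * a + b;
    }
    return out;
}

Whatever the exact lanes are, the shape is a per-block scale-and-bias, which is as far as my speculation goes.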

I'm curious about their transcendental support. They list EXP2, LOG2, RECIP, and RSQRT, but no mention of precision (often the hardware instruction is only a close approximation, not exact). Also no SIN/COS support?

Nothing else jumps out at me, other than their god-awful names.
 
I'm curious about their transcendental support. They list EXP2, LOG2, RECIP, and RSQRT, but no mention of precision (often the hardware instruction is only a close approximation, not exact). Also no SIN/COS support?

http://software.intel.com/en-us/articles/prototype-primitives-guide/
Are you looking for cos_ps, cos_pd, etc.?

They also mention : "Math utility functions do not correspond directly to Larrabee new instructions, but are added for programming support."
 
Notice that the array base is most likely 64 bits, but the index offsets are only 32-bit integers. If you're doing normal 4-byte accesses, that limits you to 16 GB of virtual address space you can gather/scatter across. Nothing you'd ever hit when doing graphics, but certainly something the compiler will have to take into consideration, and it could become an issue in larger HPC systems.

I find this limiting, especially for vectorizing compilers: you want gather/scatter operations to accept an array of pointers. A gather/scatter operation with indices can only be used for accessing arrays which you know fall into the range of the indices (depending on the language you are using, determining the size of an array at compile time can range from 'easy' to 'impossible').
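
The difference, from the compiler's point of view, in plain scalar code (just to illustrate the two shapes):

Code:
#include <cstddef>
#include <cstdint>

// What a vectorizer would like to hand to a gather: a vector of full 64-bit
// pointers, i.e. the loop below, where each load can go anywhere.
float sum_through_pointers(const float* const* ptrs, std::size_t n)
{
    float sum = 0.0f;
    for (std::size_t i = 0; i < n; ++i)
        sum += *ptrs[i];                  // arbitrary 64-bit addresses
    return sum;
}

// What a base + 32-bit-index gather can express: everything has to live in
// one array whose extent is known to fit inside the index range.
float sum_through_indices(const float* base, const std::int32_t* idx, std::size_t n)
{
    float sum = 0.0f;
    for (std::size_t i = 0; i < n; ++i)
        sum += base[idx[i]];              // limited window around 'base'
    return sum;
}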

However, to do an any-to-any shuffle takes 4 instructions, and you need to either pre-set or set 4 mask registers. Again, you're extremely unlikely to hit this in graphics, but it's a limitation for general-purpose usage.

Or you can write your data to memory and use a gather, which should be simpler but not necessarily faster.
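
I.e. something along these lines, with scalar C++ standing in for the store and the gather:

Code:
#include <array>
#include <cstdint>

using Vec = std::array<std::uint32_t, 16>;

// Arbitrary 16-lane permutation via memory: spill the vector, then "gather"
// it back with the permutation as the index vector.
Vec permute_via_memory(const Vec& v, const std::array<int, 16>& perm)
{
    std::uint32_t buf[16];
    for (int i = 0; i < 16; ++i)          // store the vector to a scratch buffer
        buf[i] = v[i];
    Vec out{};
    for (int i = 0; i < 16; ++i)          // gather with the permutation indices
        out[i] = buf[perm[i]];
    return out;
}

Presumably the store-to-load round trip through the L1 is why it isn't necessarily faster than the multi-instruction shuffle route.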

Getting 3 options for the float MAD is kind of silly. V1 = V1*V3 + V2 is usefully different than V1 = V2*V3 + V1, but the compiler should be able to generate the 3rd variant. Also it's interesting that the float MADs have 3 variants plus 3 negate variants, while int MAD only has a single variant. It's also interesting that they appear to only be implementing a standard multiply-add, not a fused multiply-add.

You mean that they implemented only multiply-accumulate (MAC, 3 operands) instructions instead of multiply-add (MAD, 4 operands)? Because the only difference between a MAD/MAC and FMAD/FMAC is that the fused ones do not round the result of the multiplication. In the instruction descriptions there is no mention of rounding the result of the multiplications so those are all fused multiply-accumulate instructions.

[edit] I realized it is impossible to tell from the descriptions if the hardware instructions will be 3 operand MACs or 4 operand MADs, my bad. Anyway they are fused.
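
A tiny standalone example of the difference that "not rounding the result of the multiplication" makes:

Code:
#include <cmath>
#include <cstdio>

// A fused multiply-add keeps the exact product and rounds once at the end;
// a separate multiply + add rounds the product first. With these inputs the
// separate version typically prints 0, the fused one about -1.49e-08 (-2^-26).
int main()
{
    const float e = 1.0f / 8192.0f;            // 2^-13, exactly representable
    const float a = 1.0f + e;
    const float b = 1.0f - e;                  // exact product a*b = 1 - 2^-26
    const float c = -1.0f;

    const float prod     = a * b;              // 1 - 2^-26 rounds to 1.0f
    const float separate = prod + c;           // 0.0f
    const float fused    = std::fma(a, b, c);  // -2^-26, single final rounding

    std::printf("separate: %g\nfused: %g\n", separate, fused);
    return 0;
}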

I'm curious about their transcendental support. They list EXP2, LOG2, RECIP, and RSQRT, but no mention of precision (often the hardware instruction is only a close approximation, not exact).

Probably the precision is enough that you can get the full 32-bit result with two Newton-Raphson iterations, like their SSEx brethren.
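
For reference, the iteration in question, written out for the reciprocal (the starting estimate is just a parameter here; on hardware it would come from the RECIP-style instruction):

Code:
// Newton-Raphson on f(x) = 1/x - a gives x' = x * (2 - a*x); each step
// roughly doubles the number of correct bits, so a ~12-bit estimate reaches
// full single precision in about two steps.
float refine_recip(float a, float estimate)
{
    float x = estimate;
    x = x * (2.0f - a * x);   // first step
    x = x * (2.0f - a * x);   // second step
    return x;
}
// The matching RSQRT iteration would be x' = x * (1.5f - 0.5f * a * x * x).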
 
Kind of surprising there is a fixed amount at all (very much unlike GPUs). In theory the task-switch instructions might have a couple of bits to indicate the number of registers used, though.
There's a fixed number of hardware threads, 4, so it seems reasonable.

Jawed
 
32 is quite a lot, too many even if they aren't being used... but maybe you can choose to use fewer. That all depends on the really interesting instructions.
It's not a lot really. What shader nowadays uses only 32 values (or eight 4-component vectors)? And if possible you probably also want to use them for interpolation parameters and such. Even if you then still have 'too many' registers, there's always loop unrolling.
 
It's not a lot really. What shader nowadays uses only 32 values (or eight 4-component vectors)? And if possible you probably also want to use them for interpolation parameters and such. Even if you then still have 'too many' registers, there's always loop unrolling.
I assume the compiler will try to execute 4 wide SIMD code in a 4x4 configuration rather than 16x1.
 
Probably the precision is enough that you can get the full 32-bit result with two Newton-Raphson iterations, like their SSEx brethren.
It's probably higher than that. As far as I know it's not that hard to design fast hardware that is accurate to 1-2 ulp (32-bit). That would make them immediately usable for graphics and most other applications that use 32-bit floats. Applications that require exact results work with double precision anyway, and these are implemented in the 'math utility' functions. Newton-Raphson works great for division and square root, and for everything else I imagine they use gather to look up polynomial coefficients.
 
We are talking about 32 × 512-bit registers though... you have space for 16 × 32-bit values per register per HW thread, or am I missing your point here?
The point is: how many pixels are processed per thread? If it's 16, then you only have 32 scalars per pixel. And it can be even more pixels per thread with unrolling. So either way, 32 registers is never going to be too many (nor too few, for that matter).

It's software; you can do whatever you want to maximize efficiency.
 
The point is: how many pixels are processed per thread? If it's 16, then you only have 32 scalars per pixel. And it can be even more pixels per thread with unrolling. So either way, 32 registers is never going to be too many (nor too few, for that matter).

It's software; you can do whatever you want to maximize efficiency.
Exactly. And that would be 32 live registers, not just 32 registers.
 
That's interesting; the registers referred to in the opcodes could then be virtual registers, and each fiber switch would just change 'banks of registers', so to speak.
 
You mean that they implemented only multiply-accumulate (MAC, 3 operands) instructions instead of multiply-add (MAD, 4 operands)? Because the only difference between a MAD/MAC and FMAD/FMAC is that the fused ones do not round the result of the multiplication. In the instruction descriptions there is no mention of rounding the result of the multiplications so those are all fused multiply-accumulate instructions.

[edit] I realized it is impossible to tell from the descriptions if the hardware instructions will be 3 operand MACs or 4 operand MADs, my bad. Anyway they are fused.

This is quite interesting because that's essentially what they did in AVX. Originally AVX was supposed to have a 4-operand FMA, but later they revised it to a 3-operand FMA, with three variants, just like this. I don't know the rationale behind this decision, but I think it may be related to x86 instruction encoding or some hardware issues.
 