C++ implementation of Larrabee new instructions

nAo · Mar 25, 2009

Overview
This .inl file provides a C++ implementation of the Larrabee new instructions. It lets developers experiment with writing Larrabee code without a Larrabee compiler and without Larrabee hardware.

Prototype Primitives Guide

rpg.314 · Mar 25, 2009

thanks

rpg.314 · Mar 25, 2009

Some good stuff there, but I am somewhat disappointed at only 32 vector registers. But perhaps Intel knows better. *shrugs*

Jawed · Mar 25, 2009

Hmm, the mask register is for predication and is 8 way. Could that imply 8 hardware threads, not 4?

EDIT: Hmm, no, it seems it's just to make manipulating masks easier, e.g. while constructing a mask for nested control flow. And masks can be moved to/from integer registers (i.e. registers outside of the vector unit).
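A minimal sketch of how a mask for nested control flow could be built, in plain C++ rather than with the prototype library. It assumes 16 float lanes per vector and a 16-bit mask with one bit per lane; `cmp_lt`, `masked_add` and `nested_if_add` are hypothetical helpers made up for this illustration, not names from the .inl file.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

constexpr std::size_t kLanes = 16;           // assumed vector width: 16 floats
using Vec  = std::array<float, kLanes>;
using Mask = std::uint16_t;                  // one predicate bit per lane

// Hypothetical helper: set bit i where a[i] < b[i].
inline Mask cmp_lt(const Vec& a, const Vec& b) {
    Mask m = 0;
    for (std::size_t i = 0; i < kLanes; ++i)
        if (a[i] < b[i]) m = static_cast<Mask>(m | (1u << i));
    return m;
}

// Hypothetical helper: lanes whose mask bit is set receive a[i] + b[i];
// all other lanes keep their old value in dst.
inline void masked_add(Vec& dst, Mask m, const Vec& a, const Vec& b) {
    for (std::size_t i = 0; i < kLanes; ++i)
        if (m & (1u << i)) dst[i] = a[i] + b[i];
}

// Per-lane equivalent of:  if (x < hi) { if (lo < x) { x += step; } }
// The inner mask is the outer mask ANDed with the inner condition.
inline Mask nested_if_add(Vec& x, const Vec& lo, const Vec& hi, const Vec& step) {
    const Mask outer = cmp_lt(x, hi);
    const Mask inner = static_cast<Mask>(outer & cmp_lt(lo, x));
    masked_add(x, inner, x, step);
    return inner;
}
```

Each level of nesting only needs one more mask value to be live, which is one plausible reason to have several mask registers.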

Jawed

rpg.314 · Mar 25, 2009

It also talks about scalar registers. Do you think they are the same as the existing scalar registers of the Pentium, or are they something else?

MfA · Mar 25, 2009

I'd rather have seen the specs on the task switching and texture sampling instructions than the whole lot of that.

MfA · Mar 25, 2009

rpg.314 said:
Some good stuff there, but I am somewhat disappointed at only 32 vector registers. But perhaps Intel knows better. *shrugs*

Kind of surprising there is a fixed number at all (very much unlike GPUs). In theory the task switch instructions might have a couple of bits to indicate the number of registers used though.

DeanoC · Mar 25, 2009

MfA said:
Kind of surprising there is a fixed number at all (very much unlike GPUs). In theory the task switch instructions might have a couple of bits to indicate the number of registers used though.

It's 32*16*4 float registers (excluding the FPU stack) per core, which is quite a lot really...
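Spelled out as a small sketch, using only the figures quoted in this thread (32 vector registers per hardware thread, 16 float lanes per register, 4 hardware threads per core). The constant and function names are made up for illustration.

```cpp
#include <cstddef>

// Figures as quoted in the thread; names are hypothetical.
constexpr std::size_t kVectorRegs     = 32;  // vector registers per hardware thread
constexpr std::size_t kLanesPerReg    = 16;  // floats per vector register
constexpr std::size_t kThreadsPerCore = 4;   // hardware threads per core

// Total float-sized register slots per core.
constexpr std::size_t floats_per_core() {
    return kVectorRegs * kLanesPerReg * kThreadsPerCore;
}

// Same quantity in bytes.
constexpr std::size_t bytes_per_core() {
    return floats_per_core() * sizeof(float);
}

static_assert(floats_per_core() == 2048, "32 * 16 * 4 float slots");
```

With 4-byte floats that comes to 8 KiB of vector register state per core.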

MfA · Mar 25, 2009

32 is quite a lot, too many even if they aren't being used ... but maybe you can choose to use fewer. That all depends on the really interesting instructions.

rpg.314 · Mar 25, 2009

DeanoC said:
It's 32*16*4 float registers (excluding the FPU stack) per core, which is quite a lot really...

That's 2k registers.

GT200 has 16k of them, and RV770 has 128 regs/thread * 4 float4 * 128 threads at minimum.

Edit: RV770 corrected.

So in comparison, it isn't quite a lot.

But maybe it doesn't need them: it has caches to hide latency, so it doesn't need that many execution contexts.

hoho · Mar 25, 2009

rpg.314 said:
But maybe it doesn't need them: it has caches to hide latency, so it doesn't need that many execution contexts.

I was just about to ask how big GPU caches are and what latency they have.

nAo · Mar 25, 2009

hoho said:
I was just about to ask how big GPU caches are and what latency they have.

AFAIK GPUs have caches for textures, vertices and constants, but register space is just that: a big register file/memory, not a cache.

hoho · Mar 25, 2009

I know that, but is spilling register contents to L1 cache so bad? I think on most x86 CPUs it adds a latency of around 2-4 cycles, which shouldn't be that bad.

Gubbi · Mar 25, 2009

They could implement a really simple dirty bit per register, and on a context switch only spill to memory the registers that have been written since the last switch. That would reduce context switching time and eliminate pollution of the caches with inactive register state.
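A minimal software sketch of that idea, assuming 32 vector registers of 16 floats each and a 32-bit dirty mask. `ThreadContext` and `spill_dirty` are hypothetical names for illustration and say nothing about how the hardware actually does it.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

constexpr std::size_t kNumRegs = 32;         // assumed vector registers per thread
constexpr std::size_t kLanes   = 16;         // assumed floats per register
using VecReg  = std::array<float, kLanes>;
using RegFile = std::array<VecReg, kNumRegs>;

struct ThreadContext {
    RegFile regs{};
    std::uint32_t dirty = 0;                 // bit r set => register r written since last spill

    // Every register write also marks the register dirty.
    void write(std::size_t r, const VecReg& v) {
        assert(r < kNumRegs);
        regs[r] = v;
        dirty |= (std::uint32_t{1} << r);
    }
};

// On a context switch, copy only the dirty registers to the save area,
// clear the dirty bits, and report how many registers were spilled.
inline std::size_t spill_dirty(ThreadContext& ctx, RegFile& save_area) {
    std::size_t spilled = 0;
    for (std::size_t r = 0; r < kNumRegs; ++r) {
        if (ctx.dirty & (std::uint32_t{1} << r)) {
            save_area[r] = ctx.regs[r];
            ++spilled;
        }
    }
    ctx.dirty = 0;
    return spilled;
}
```

This only works if the save area still holds the values from the previous spill, so the clean registers there are already up to date.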

Cheers

MfA · Mar 25, 2009

hoho said:
I know that, but is spilling register contents to L1 cache so bad? I think on most x86 CPUs it adds a latency of around 2-4 cycles, which shouldn't be that bad.

On fiber switch the entire register set needs to get swapped ...

hoho · Mar 25, 2009

Hm, let me get this straight:

On current GPUs there is a huge register file where threads/fibers reserve as many registers as they need. When a thread gets swapped, those registers are not flushed unless some other thread needs a few and there are no free ones available.
On Larrabee the entire 2k register file gets flushed to cache on every fiber change no matter what.

Are there really no rename registers on Larrabee that could be used for kind of a double (quad?) buffer for fiber changes?

MfA · Mar 25, 2009

Not the entire register set for all the threads, just the one for the single thread.

They did say the fiber switch happens automatically; they could simply have an extra cache for the registers of inactive fibers that they are not telling us about.

nAo · Mar 25, 2009

hoho said:
I know that, but is spilling register contents to L1 cache so bad? I think on most x86 CPUs it adds a latency of around 2-4 cycles, which shouldn't be that bad.

Latency is bad only when you cannot hide it.

rpg.314 · Mar 25, 2009

hoho said:
Hm, let me get this straight:

On current GPUs there is a huge register file where threads/fibers reserve as many registers as they need. When a thread gets swapped, those registers are not flushed unless some other thread needs a few and there are no free ones available.
On Larrabee the entire 2k register file gets flushed to cache on every fiber change no matter what.

Are there really no rename registers on Larrabee that could be used for kind of a double (quad?) buffer for fiber changes?

No. The number of threads launched on GPUs is determined by registers/thread, so there is no spill, and the shader will refuse to launch if you exceed the limits.
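As a rough sketch of that model: a fixed register file is divided evenly among threads, so the thread count falls as registers/thread rises. `max_resident_threads` is a hypothetical helper, and the 16384-register file in the usage note is only an illustrative input based on the "16k" figure mentioned earlier.

```cpp
#include <cstddef>

// Hypothetical model: a fixed-size register file split evenly among threads.
// A result of 0 means not even one thread fits, i.e. the shader cannot launch.
constexpr std::size_t max_resident_threads(std::size_t regfile_regs,
                                           std::size_t regs_per_thread) {
    return regs_per_thread == 0 ? 0 : regfile_regs / regs_per_thread;
}
```

Under this model, `max_resident_threads(16384, 32)` gives 512 threads and `max_resident_threads(16384, 128)` gives 128, so heavier shaders leave fewer threads to hide latency with.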

I should have remembered that LRB can use the L1 cache as an extended register file, so it probably doesn't need more registers. As for what happens at a fiber switch, AFAIK we still don't know how it is implemented. Dumping the entire context to L1 would be very costly. Perhaps there are more execution contexts, and a fiber switch is a context switch triggered by a software instruction rather than decided by a thread scheduler as in classic GPUs.

nAo · Mar 25, 2009

rpg.314 said:
No. The number of threads launched on GPUs is determined by registers/thread, so there is no spill, and the shader will refuse to launch if you exceed the limits.

I believe R7xx GPUs can actually spill registers to external memory when you need too many of them. IIRC SM4 hardware has to support up to 4096 temporaries per shader instance!
