C++ implementation of Larrabee new instructions

Some good stuff there, but I am somewhat disappointed at only 32 vector registers. But perhaps Intel knows better. *shrugs*
 
Hmm, the mask register is for predication and is 8 way. Could that imply 8 hardware threads, not 4?

EDIT: Hmm, no, it seems it's just to make manipulating masks easier, e.g. while constructing a mask for nested control flow. And masks can be moved to/from integer registers (i.e. registers outside of the vector unit).

Jawed
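To make the "constructing a mask for nested control flow" point concrete, here is a minimal sketch in plain C++. It assumes a 16-lane vector with one mask bit per lane; `Mask16`, `MaskStack` and its methods are hypothetical names for illustration, not Larrabee instructions or intrinsics.

```cpp
#include <cstdint>
#include <vector>

// Hypothetical 16-lane predicate mask: bit i set means lane i is active.
using Mask16 = std::uint16_t;

// Tracks the active-lane mask through nested if/else blocks.
class MaskStack {
    struct Frame {
        Mask16 outer;  // lanes active just before this `if`
        Mask16 cond;   // per-lane result of the `if` condition
    };
    Mask16 active_ = 0xFFFF;  // all lanes start active
    std::vector<Frame> frames_;

public:
    Mask16 active() const { return active_; }

    // Enter an `if`: a lane stays active only if it was already
    // active and its condition bit is set.
    void push_if(Mask16 cond) {
        frames_.push_back({active_, cond});
        active_ = static_cast<Mask16>(active_ & cond);
    }

    // Switch to the `else` side: lanes that were active outside
    // the `if` but failed the condition.
    void flip_else() {
        const Frame& f = frames_.back();
        active_ = static_cast<Mask16>(f.outer & ~f.cond);
    }

    // Leave the if/else: restore the enclosing mask.
    void pop() {
        active_ = frames_.back().outer;
        frames_.pop_back();
    }
};
```

Since the mask here is just an integer, the bookkeeping is ordinary scalar AND/NOT work, which fits the remark that masks can be moved to and from integer registers.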
 
It also talks of scalar registers. Do you think they are the same as the existing scalar registers of the Pentium, or are they something else?
 
I'd rather have seen the specs on the task switching and texture sampling instructions than the whole lot of that.
 
Some good stuff there, but I am somewhat disappointed at only 32 vector registers. But perhaps Intel knows better. *shrugs*
Kind of surprising there is a fixed amount at all (very much unlike GPUs). In theory the task switch instructions might have a couple of bits to indicate the amount of registers used though.
 
Kind of surprising there is a fixed amount at all (very much unlike GPUs). In theory the task switch instructions might have a couple of bits to indicate the amount of registers used though.
It's 32*16*4 float registers (excluding the FPU stack) per core, which is quite a lot really...
 
32 is quite a lot, too many even if they aren't being used ... but maybe you can choose to use fewer. That all depends on the really interesting instructions.
 
It's 32*16*4 float registers (excluding the FPU stack) per core, which is quite a lot really...

That's 2k registers.

GT200 has 16k of them and RV770 has 128 regs/thread * 4 float4 * 128 threads at min.

Edit: RV770 corrected.

So in comparison, it ain't quite a lot. ;)

But maybe it doesn't need them, because it has caches to hide latency and so doesn't need that many execution contexts. :???:
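For anyone who wants the arithmetic spelled out, here is a minimal sketch. Reading 32*16*4 as 32 vector registers x 16 float lanes x 4 hardware threads is an assumption about what was meant, and the GT200 and RV770 figures are simply the ones quoted in this thread, not verified specs. All constant names are made up for illustration.

```cpp
#include <cstdint>

// Larrabee, per core: 32 vector registers x 16 floats each
// x 4 hardware threads (assumed reading of "32*16*4").
constexpr std::uint32_t kLarrabeeFloatsPerCore = 32u * 16u * 4u;

// GT200: "16k" registers, as quoted in the thread.
constexpr std::uint32_t kGt200Regs = 16u * 1024u;

// RV770: 128 regs/thread x 4 floats per float4 x 128 threads,
// as quoted in the thread.
constexpr std::uint32_t kRv770Floats = 128u * 4u * 128u;

static_assert(kLarrabeeFloatsPerCore == 2048u, "2k floats per core");
static_assert(kGt200Regs == 16384u, "16k registers");
static_assert(kRv770Floats == 65536u, "64k floats");
```

On those numbers the Larrabee core holds 8x fewer registers than the GT200 figure and 32x fewer than the RV770 figure, which is the comparison being made.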
 
I was just about to ask how big GPU caches are and at what latency they work :)
AFAIK GPUs have caches for textures, vertices and constants, but register space is just that: a big register file/memory, not a cache.
 
I know that, but is spilling register contents to L1 cache so bad? I think on most x86 CPUs it adds a latency of around 2-4 cycles or so, which shouldn't be that bad.
 
They could implement a really simple dirty bit per register, and on a context switch only spill to memory the registers that have been written since the last switch. That would reduce context switching time and avoid polluting the caches with inactive register state.

Cheers
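The dirty-bit idea can be modelled in a few lines of C++. This is only a software sketch of the suggestion, not how any hardware is known to work; `ThreadRegs`, `write` and `spill_dirty` are hypothetical names, and 16 floats per vector register is an assumption.

```cpp
#include <array>
#include <bitset>
#include <cstddef>

constexpr std::size_t kNumRegs = 32;  // vector registers per thread
constexpr std::size_t kLanes = 16;    // assumed floats per vector register

using VecReg = std::array<float, kLanes>;

// Hypothetical model of one thread's registers plus its save area in memory.
struct ThreadRegs {
    std::array<VecReg, kNumRegs> regs{};   // live register contents
    std::array<VecReg, kNumRegs> saved{};  // copy held in memory/cache
    std::bitset<kNumRegs> dirty;           // one dirty bit per register

    // Every register write marks that register dirty.
    void write(std::size_t r, const VecReg& value) {
        regs[r] = value;
        dirty.set(r);
    }

    // On a context switch, copy out only the registers written since
    // the last switch; clean registers already match their saved copy.
    // Returns how many registers were actually spilled.
    std::size_t spill_dirty() {
        std::size_t spilled = 0;
        for (std::size_t r = 0; r < kNumRegs; ++r) {
            if (dirty.test(r)) {
                saved[r] = regs[r];
                ++spilled;
            }
        }
        dirty.reset();
        return spilled;
    }
};
```

A fiber that touched two registers since its last switch would spill two registers instead of all 32, which is where the saving in switch time and cache traffic would come from.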
 
I know that, but is spilling register contents to L1 cache so bad? I think on most x86 CPUs it adds a latency of around 2-4 cycles or so, which shouldn't be that bad.
On fiber switch the entire register set needs to get swapped ...
 
Hm, let me get this straight:

On current GPUs there is a huge register file where threads/fibers reserve a bunch of registers as they need. When a thread gets swapped those registers are not flushed unless some other thread needs a few and there are no free ones available.
On Larrabee the entire 2k register file gets flushed to cache on every fiber change, no matter what.

Are there really no rename registers on Larrabee that could be used for kind of a double (quad?) buffer for fiber changes?
 
Not the entire register set for all the threads, just the one for the single thread.

They did say the fiber switch happened automatically; they could simply have an extra cache for the registers of inactive fibers that they are not telling us about.
 
I know that, but is spilling register contents to L1 cache so bad? I think on most x86 CPUs it adds a latency of around 2-4 cycles or so, which shouldn't be that bad.
Latency is bad only when you cannot hide it.
 
Hm, let me get this straight:

On current GPUs there is a huge register file where threads/fibers reserve a bunch of registers as they need. When a thread gets swapped those registers are not flushed unless some other thread needs a few and there are no free ones available.
On Larrabee the entire 2k register file gets flushed to cache on every fiber change, no matter what.

Are there really no rename registers on Larrabee that could be used for kind of a double (quad?) buffer for fiber changes?

No. The number of threads launched on GPUs is determined by registers/thread, so there is no spill, and a shader will refuse to launch if you exceed the limits.

I should have remembered that LRB can use the L1 cache as an extended register file, so it probably doesn't need more registers. As for what happens at a fibre switch, AFAIK we still don't know how it is implemented. Dumping the entire context to L1 would be very costly. Perhaps there are more execution contexts, but a fibre switch is triggered by a software instruction instead of being decided by a thread scheduler as in classic GPUs.
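The "threads are determined by registers/thread" rule is easy to sketch. This models only the claim as stated in this post (fixed register file, no spilling); `max_resident_threads` is a hypothetical helper, and the register file size passed in is whatever figure you believe for a given chip.

```cpp
#include <cstdint>

// Number of threads that fit in a fixed register file when each thread
// reserves regs_per_thread registers up front. Returns 0 when not even
// one thread fits, i.e. the shader would refuse to launch.
std::uint32_t max_resident_threads(std::uint32_t regfile_regs,
                                   std::uint32_t regs_per_thread) {
    if (regs_per_thread == 0 || regs_per_thread > regfile_regs) {
        return 0;
    }
    return regfile_regs / regs_per_thread;
}
```

With the 16k figure quoted earlier, 16 registers per thread gives 1024 resident threads and 128 registers per thread gives 128, so heavier shaders get fewer threads to hide latency with.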
 
No. The number of threads launched on GPUs is determined by registers/thread, so there is no spill, and a shader will refuse to launch if you exceed the limits.
I believe R7xx GPUs can actually spill registers to external memory when you need too many of them. IIRC SM4 hardware has to support up to 4096 temporaries per shader instance!
 