Larrabee at Siggraph

nAo · Aug 7, 2008

A minor note: according to this thread on Ace's hardware forums, Larrabee's vector ISA is called LRBni, and it's different from AVX, but I guess we already knew that.

bowman · Aug 7, 2008

nAo said:
A minor note: according to this thread on Ace's hardware forums, Larrabee's vector ISA is called LRBni, and it's different from AVX, but I guess we already knew that.

That's not the final name; it's short for 'Larrabee New Instructions', similar to 'Prescott New Instructions', 'Katmai New Instructions' and so on.

3dilettante · Aug 7, 2008

The initials fit.

The forking of vector sets is unfortunate. It sounds like Larrabee's vectorization capabilities are better than AVX, while AVX has a niftier and possibly more extendable encoding.

Killer-Kris · Aug 7, 2008

ShaidarHaran said:
What you say is all true. I just think a design from the early 90's would be more at home in a museum than tomorrow's multi-teraflop C-GPU

More than anything, such an old chip would be an excellent starting point. How many transistors did the original Pentium have, like 3M? Now how many of those can you fit into the G80's transistor budget? And those would be truly scalar units!!!

Of course that's a poor comparison, because adding the vector extensions, MT, x86-64, and an L2 cache would leave it a little more bloated than 3M transistors per core, but it gets the ball rolling.

3dilettante · Aug 7, 2008

Going by 1.4 billion for Nvidia's chip, and sandpile.org's listing of 4 million for the P54, 350 original Pentiums would fit.

They wouldn't be too useful, since they lack the other 26 million transistors Larrabee's cores have, and there's no other logic or interconnect to actually talk to them.
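
The division above can be reproduced in a few lines. This is only a sketch using the figures quoted in the thread (1.4 billion for the Nvidia chip, 4 million for the P54, and 4 + 26 = 30 million per Larrabee core); none of them are verified here, and the constant and function names are made up for illustration. It also assumes that everything outside the cores costs zero transistors, which is exactly the weakness pointed out above.

```python
# Figures as quoted in the thread; treat all of them as unverified assumptions.
NVIDIA_CHIP_BUDGET = 1_400_000_000  # "1.4 billion for Nvidia's chip"
P54_TRANSISTORS = 4_000_000         # sandpile.org listing cited in the post
LARRABEE_CORE = 30_000_000          # 4M + "the other 26 million" per core


def cores_that_fit(budget: int, per_core: int) -> int:
    """Whole cores that fit if interconnect, memory controllers, etc. were free."""
    return budget // per_core


print(cores_that_fit(NVIDIA_CHIP_BUDGET, P54_TRANSISTORS))  # 350
print(cores_that_fit(NVIDIA_CHIP_BUDGET, LARRABEE_CORE))    # 46
```

Even the second number is an upper bound, since it leaves nothing for the logic that connects the cores to each other and to memory.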

nAo · Aug 7, 2008

Half of that 30M transistor budget should be cache..

3dilettante · Aug 7, 2008

nAo said:
Half of that 30M transistor budget should be cache..

How useful would it be without that cache?

nAo · Aug 7, 2008

3dilettante said:
How useful would it be without that cache?

LOL. I was hinting at the fact that caches are denser than logic, that's it!

3dilettante · Aug 7, 2008

nAo said:
LOL. I was hinting at the fact that caches are denser than logic, that's it!

True, but I was basing my numbers off transistor count, not area. So I still think the 350 cores would be a little useless, unless we say everything but the cores takes 0 transistors...

nAo · Aug 7, 2008

3dilettante said:
True, but I was basing my numbers off transistor count, not area. So I still think the 350 cores would be a little useless, unless we say everything but the cores takes 0 transistors...

Agreed. 32+ cores is more likely

bowman · Aug 7, 2008

3dilettante said:
Going by 1.4 billion for Nvidia's chip, and sandpile.org's listing of 4 million for the P54, 350 original Pentiums would fit.

They wouldn't be too useful, since they lack the other 26 million transistors Larrabee's cores have, and there's no other logic or interconnect to actually talk to them.

Larrabee cores are 30M? Where did you hear this?

3dilettante · Aug 7, 2008

Where was the source on the 30 million number?
It sounds reasonable, but I can't find the source of the figure.
I didn't see it in the larrabee pdf, but I may have overlooked it.

WaltC · Aug 7, 2008

3dilettante said:
They wouldn't be too useful, since they lack the other 26 million transistors Larrabee's cores have, and there's no other logic or interconnect to actually talk to them.

Thanks...

Took the words right out of my mouth. I think that for the sake of "conceptual simplicity" some of us just might be emphasizing the "simplicity" notion a tad too much. Things like cache and the glue logic to make the whole shebang work probably would take at least--oh, at least a dozen or so transistors, I should think...

ArchitectureProfessor · Aug 7, 2008

3dilettante said:
The hints of Larrabee's vector functionality and scatter/gather memory access capability seem to indicate its vector functionality exceeds the limitations of current x86 SSE, so why expect current x86 tools to do it justice?

Not the current tools, but slightly updated versions of the current tools.

As Larrabee's vectors are easier than current x86 vectors (SSE), modifying the Intel C compiler (ICC) to support these new vectors shouldn't require a total re-write of the whole compiler. In fact, the existing auto-vectorization should map well to Larrabee. With a bit more tweaking, the ICC compiler should be able to vectorize even more loops (for example, ones with conditionals) to create more efficient vector code. The rest of the non-vector aspects of the compiler's code generation would be basically unchanged.

So, not zero effort in updating the tools, but much easier than making an entire EPIC compiler for a new ISA.
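
To make the "loops with conditionals" point concrete, here is a rough sketch of the usual way such a loop is vectorized: compute a per-lane mask from the comparison, do the arithmetic on every lane, then use the mask to select the result. It is plain Python standing in for vector hardware, not LRBni code and not what ICC actually emits; the function names and the width of 16 are assumptions chosen for illustration.

```python
VECTOR_WIDTH = 16  # assumed lane count, for illustration only


def scalar_loop(a, b):
    """Reference version: a branch per element."""
    out = []
    for x, y in zip(a, b):
        if x > y:
            out.append(x - y)
        else:
            out.append(0.0)
    return out


def masked_loop(a, b, width=VECTOR_WIDTH):
    """Same result, processed in chunks of `width` lanes with no per-element branch."""
    out = []
    for i in range(0, len(a), width):
        va, vb = a[i:i + width], b[i:i + width]
        mask = [x > y for x, y in zip(va, vb)]   # vector compare produces a mask
        diff = [x - y for x, y in zip(va, vb)]   # arithmetic runs on all lanes
        out.extend(d if m else 0.0 for d, m in zip(diff, mask))  # masked select
    return out


a = [3.0, 1.0, 5.0, 9.0]
b = [1.0, 2.0, 5.0, 4.0]
print(masked_loop(a, b) == scalar_loop(a, b))  # True
```

The compiler's job in this scheme is to recognize that the `if` can be turned into a mask, which is a local change to the vectorizer rather than a new code generator.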

nAo · Aug 7, 2008

Hope the Intel guys are working on a Larrabee implementation of the Ct language; unfortunately they don't mention it at all in their SIGGRAPH paper.

heliosphere · Aug 7, 2008

nAo said:
Hope the Intel guys are working on a Larrabee implementation of the Ct language; unfortunately they don't mention it at all in their SIGGRAPH paper.

They do mention Ct, but they don't mention working on an implementation if that's what you meant.

"Besides high throughput application programming, we anticipate that developers will also use Larrabee Native to implement higher level programming models that may automate some aspects of parallel programming or provide domain focus. Examples include Ct style programming models [Ghuloum et al. 2007], high level library APIs such as Intel® Math Kernel Library (Intel® MKL) [Chuvelev et al. 2007], and physics APIs. Existing GPGPU programming models can also be re-implemented via Larrabee Native if so desired [Buck et al. 2004; Nickolls et al. 2008]."

3dilettante · Aug 7, 2008

bowman said:
Larrabee cores are 30M? Where did you hear this?

It was back in this thread.

http://forum.beyond3d.com/showthread.php?p=1199833&highlight=30m#post1199833

At the time, I hadn't seen the larrabee paper, so I thought it was from there, but after reading it I don't recall seeing that figure.

bowman · Aug 7, 2008

Oh! Hm. Anyone know the transistor count for 256KB of cache, PCI-E bus and some memory controllers?

Time to bust out the calculator..

3dilettante · Aug 7, 2008

bowman said:
Oh! Hm. Anyone know the transistor count for 256KB of cache, PCI-E bus and some memory controllers? Time to bust out the calculator..

Can't speak to the controllers, but 256KiB of 6T SRAM is about 12.5 million transistors.

edit: this doesn't count any cache tags, just data arrays
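
For anyone who wants to check the figure, the arithmetic is just bytes × 8 bits × 6 transistors per bit. A minimal sketch, assuming a plain 6T cell and counting only the data array (no tags, decoders, or sense amplifiers); the function name is made up for illustration.

```python
def sram_data_transistors(size_bytes: int, transistors_per_bit: int = 6) -> int:
    """Transistors in the data array alone for an SRAM of the given size."""
    return size_bytes * 8 * transistors_per_bit


print(sram_data_transistors(256 * 1024))  # 12582912, i.e. roughly 12.5 million
```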

ArchitectureProfessor · Aug 8, 2008

3dilettante said:
edit: this doesn't count any cache tags, just data arrays

Roughly speaking, cache tags are around 10% or less of the size of the data array.

The details: For 64-byte blocks, the worst-case tag overhead is around 12.5%, which is a full 64-bit tag for a 64-byte block. But assuming a 48-bit physical address space and a 6-bit block offset, you're down to 42 bits or 8%. For the 256KB cache, the index would likely be around 10 bits, so now you're down to only 6% overhead for the tags.
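
The three percentages can be worked out as tag bits divided by data bits per block. This is a sketch of that arithmetic only, using the numbers given above (64-byte blocks, 48-bit physical addresses, 6-bit block offset, 10-bit index); it ignores valid, dirty, and coherence state bits, and the function name is hypothetical.

```python
def tag_overhead(tag_bits: int, block_bytes: int = 64) -> float:
    """Tag storage as a fraction of the data storage for one cache block."""
    return tag_bits / (block_bytes * 8)


worst_case = tag_overhead(64)            # full 64-bit tag per block
physical = tag_overhead(48 - 6)          # 48-bit address minus 6-bit block offset
indexed = tag_overhead(48 - 6 - 10)      # also minus a 10-bit set index

print(worst_case)  # 0.125
print(physical)    # 0.08203125
print(indexed)     # 0.0625
```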
