Larrabee at SIGGRAPH

At the time, I hadn't seen the Larrabee paper, so I thought it was from there, but after reading it I don't recall seeing that figure.
The paper says 10 cores have the same area and power consumption as a Core 2 Duo (with 4MB cache), so that's where I got the info from.

If, as Jawed suggested, this does not include a texture unit, then the final core would be substantially bigger.
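
For what it's worth, a quick back-of-the-envelope in Python (the ~143 mm² die size for a 65 nm Conroe with 4 MB L2 is my assumption, not a number from the paper):

```python
# Rough per-core area implied by "10 cores in the area of a Core 2 Duo".
# The 143 mm^2 figure is my assumption for a 65 nm Conroe with 4 MB L2,
# not a number from the paper.
core2duo_area_mm2 = 143.0
in_order_cores = 10

print(f"Implied area per core: ~{core2duo_area_mm2 / in_order_cores:.0f} mm^2")
# -> ~14 mm^2 per core, before any texture units are added
```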
 
Thanks...;) Took the words right out of my mouth. I think that for the sake of "conceptual simplicity" some of us just might be emphasizing the "simplicity" notion a tad too much. Things like cache and the glue logic to make the whole shebang work probably would take at least--oh, at least a dozen or so transistors, I should think...;)


Actually, I think some of us are over-complicating a statement regarding why a museum piece might make a good starting point for such a project.
 
Actually, I think some of us are over-complicating a statement regarding why a museum piece might make a good starting point for such a project.

Yes, indeed. Considering we have absolutely nothing in the way of hardware to look at, let alone analyze, the complication rate so far *everywhere* concerning the finer points of Larrabee is astounding...;)
 
I'm confused over Larrabee's bus architecture(s).

At first I thought, based on older articles, that it would have a 1024-bit external memory bus. Apparently that's not the case. Larrabee has an internal ring bus that's 512 bits wide in each direction, for communication between the cores.

The external memory bus, meanwhile, has apparently not been announced, and speculation is that it could be 128-bit or 256-bit.

Someone set me straight if that's not the case.
 
The paper says 10 cores have the same area and power consumption as a Core 2 Duo (with 4MB cache), so that's where I got the info from.

If, as Jawed suggested, this does not include a texture unit, then the final core would be substantially bigger.


The wording in the paper is a little bit strange:

Table 1: Out-of-order vs. in-order CPU comparison: designing the processor for increased throughput can result in ½ the peak single-stream performance, but 20x the peak vector throughput with roughly the same area and power. This difference is 40x in FLOPS, since the wide VPU supports fused multiply-add but SSE doesn't. These in-order cores are not Larrabee, but are similar.


The test design in Table 1 is not identical to Larrabee. To provide a more direct comparison, the in-order core test design uses the same process and clock rate as the out-of-order cores and includes no fixed function graphics logic. This comparison motivates design decisions for Larrabee since it shows that a wide VPU with a simple in-order core allows CPUs to reach a dramatically higher computational density for parallel applications.

So they are saying that the in-order cores in this comparison are not Larrabee cores, but similar ones.
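
For what it's worth, the 20x/40x figures fall straight out of lane counting. Here's a quick Python sketch; the core counts and vector widths are my reading of the comparison (2 out-of-order cores vs. 10 in-order cores in the same area), not anything spelled out in the quote:

```python
# Lane counting behind the paper's "20x vector throughput, 40x FLOPS" claim.
ooo_cores, ooo_lanes = 2, 4    # Core 2 Duo: 2 cores, 4-wide SSE
io_cores, io_lanes = 10, 16    # test design: 10 in-order cores, 16-wide VPU

vector_ratio = (io_cores * io_lanes) / (ooo_cores * ooo_lanes)
print(f"Peak vector throughput: {vector_ratio:.0f}x")  # 20x

# FLOPS double again because the wide VPU has fused multiply-add
# (2 FLOPs per lane per clock) while SSE doesn't.
flops_ratio = (io_cores * io_lanes * 2) / (ooo_cores * ooo_lanes)
print(f"Peak FLOPS: {flops_ratio:.0f}x")  # 40x
```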
 
I'm confused over Larrabee's bus architecture(s).

At first I thought, based on older articles, that it would have a 1024-bit external memory bus. Apparently that's not the case. Larrabee has an internal ring bus that's 512 bits wide in each direction, for communication between the cores.

The external memory bus, meanwhile, has apparently not been announced, and speculation is that it could be 128-bit or 256-bit.

Someone set me straight if that's not the case.

The speculation is that the external bus could be 128- or 256-bit. I doubt it will be anywhere near 1024-bit; they've said they're trying to reduce the need for bandwidth with the tiling technique, so with GDDR5 they're not going to need that. I'd think. :???:
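
To put rough numbers on that (the 1 GHz ring clock and 4 Gbit/s GDDR5 per-pin rate below are pure assumptions on my part, nothing announced):

```python
# Back-of-the-envelope bus bandwidth; clock and per-pin rates are assumed.
ring_width_bits = 512
ring_clock_ghz = 1.0
print(f"Ring bus: ~{ring_width_bits / 8 * ring_clock_ghz:.0f} GB/s "
      f"per direction at {ring_clock_ghz} GHz")

gddr5_gbit_per_pin = 4.0
for bus_bits in (128, 256):
    print(f"{bus_bits}-bit external bus: "
          f"~{bus_bits * gddr5_gbit_per_pin / 8:.0f} GB/s")
# -> ring: ~64 GB/s each way; 128-bit: ~64 GB/s; 256-bit: ~128 GB/s
```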
 
Doesn't Intel hint at pretty high power consumption?
I mean, OK, a Larrabee-like core should consume about a tenth of what a Penryn core would use.

But the ALUs are likely to be busier, the bus can be pretty much a power hog, etc.

The only figure we have is around ~300 watts, and that's from Intel's own mouth.

To me it looks like perf per mm² for Larrabee could be competitive/good enough (now).
Power consumption and thermal dissipation could be more of a concern.

Intel based its estimate on a 24-core Larrabee running @ 1 GHz. That could be OK against current GPUs, but by the time Larrabee is released, even if Intel packs in more cores, they need to hit their frequency target (~2 GHz).

I think that's basically the reason why Larrabee won't be here in early 2009; even @ 45nm, a 16/24-core Larrabee clocked high enough could have a really huge impact on the GPGPU market.
Intel has a lot fewer "software" reasons to hold back the launch of Larrabee as a general-purpose accelerator than as a GPU.

I'm looking forward to their upcoming presentation to learn more about these potential issues (i.e. power/frequencies).
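
A quick sanity check on those figures, assuming the paper's 16-wide VPU with FMA (none of the resulting numbers are confirmed by Intel):

```python
# Peak-FLOPS sanity check; 16-wide VPU and FMA (2 FLOPs/lane/clock)
# are assumptions taken from the paper, not Intel product figures.
def peak_gflops(cores, clock_ghz, lanes=16, flops_per_lane=2):
    return cores * lanes * flops_per_lane * clock_ghz

print(f"24 cores @ 1 GHz: {peak_gflops(24, 1.0):.0f} GFLOPS")  # ~768
print(f"24 cores @ 2 GHz: {peak_gflops(24, 2.0):.0f} GFLOPS")  # ~1536
```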
 
At first I thought, based on older articles, that it would have a 1024-bit external memory bus.

A 1024-bit external bus would just be too many pins. You need something like one power and one ground pin for each signal pin, so that would be in the 3000-pin package range. Although some of IBM's multi-chip modules have that many pins, today's desktop parts are more like 500 pins. So, 3000 seems way too high to be cost effective.

I've heard that Larrabee will have around 150 GB/second of memory bandwidth (somewhere in the 100 GB/second to 200 GB/second range). Various next-generation high-speed DRAMs are in the 3 to 4 gigabit/second per pin range. So, a 256-bit bus would be in the range of 100 GB/second to 128 GB/second. That would keep the pin count (and thus package cost) reasonable.

I would also guess the number of memory controllers would scale up and down linearly with the number of cores. So, this much bandwidth might be for the Larrabee with the largest number of cores.
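
The arithmetic behind those estimates, as a small sketch (the 3x signal-to-package-pin multiplier and the 3-4 Gbit/s per-pin rates are just the rough figures from above):

```python
# Pin count and bandwidth for a few candidate bus widths, using roughly
# one power + one ground pin per signal pin (~3x total) and an assumed
# next-gen DRAM rate of 3-4 Gbit/s per pin.
for bus_bits in (256, 512, 1024):
    package_pins = bus_bits * 3
    low_gb_s = bus_bits * 3 / 8
    high_gb_s = bus_bits * 4 / 8
    print(f"{bus_bits:>4}-bit bus: ~{package_pins} package pins, "
          f"{low_gb_s:.0f}-{high_gb_s:.0f} GB/s")
# -> 256-bit: ~768 pins, 96-128 GB/s; 1024-bit: ~3072 pins
```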
 
I'm looking forward to their upcoming presentation to learn more about these potential issues (i.e. power/frequencies).

As Larrabee hasn't fully "taped out" (meaning the actual design is still being tweaked and debugged), even Intel probably doesn't know the actual power/frequency numbers. They have estimates and targets, but lots of things can still go wrong with the low-level implementation. A chip's frequency is only as fast as its slowest path, so Intel is quite wise not to talk actual frequency numbers quite yet...
 
Although some of IBM's multi-chip modules have that many pins, today's desktop parts are more like 500 pins. So, 3000 seems way too high to be cost effective.

Agreed that 1024-bit is generally out of the question, but ultimately the pin-count they're comfortable with will depend largely on the die-size of the final product. They probably won't want to go much larger in packaging than they have to, but it should be a larger package overall than the equivalent mainline CPU offerings at the time.

Intel and AMD are at roughly 800 and 1000 pins on their desktop offerings, GPUs completely aside.
 
They have plans for socket compatible Larrabee processors as well with QPI interfaces, so I guess it will have to be compatible with one of the QPI sockets (LGA1366 or LGA1567). Still not sure how they're going to implement memory for such a product, though.
 
The Polaris cores wouldn't fall under my idea of similar. They're extremely stripped-down VLIW FMADD units with a switch-fabric interface. I wonder if four or five of those could fit in the area of a Larrabee core.
 
Intel and AMD are at roughly 800 and 1000 pins on their desktop offerings, GPUs completely aside.

You're right; my estimate of 500 pins was low by 2x. It looks as if Core 2 Duo uses either a 500-pin or a 775-pin socket, while Nehalem has 775-pin and 1366-pin options, which is more than I thought Intel could build at a reasonable cost. AMD has had higher pin counts than Intel in the recent past because of on-chip memory controllers, but with Nehalem that is no longer the case.

Ok, so that said, a 512-bit off-chip memory bus wouldn't be totally out of the question for Larrabee (or certainly for Larrabee II). It seems like a 256-bit bus would be the minimum.
 
They have plans for socket compatible Larrabee processors as well with QPI interfaces, so I guess it will have to be compatible with one of the QPI sockets (LGA1366 or LGA1567). Still not sure how they're going to implement memory for such a product, though.

As the QPI products (specifically Nehalem) have their own memory controllers, a significant number of those pins are already for memory (DDR2 or DDR3). However, GDDRx generally has much higher bandwidth per pin. So, mapping Larrabee to a QPI socket might not be that difficult.

I wonder if (when?) we're going to see the various DDR, GDDR, and XDR memory standards unify back into a single DRAM standard. Part of what caused the split was the different bandwidth requirements for GPUs and CPUs. Now even with multi-core CPUs, the memory bandwidth needs are approaching GPU levels. Perhaps DDR3 will help bridge the gap (as it is getting up to 1.6 Gbits/second per pin, which is still off by 2x from GDDR, but 2x better than DDR2).
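
Lining up the per-pin rates mentioned here (the DDR2 and GDDR figures are my extrapolation from the "2x better" and "off by 2x" remarks):

```python
# Per-pin data rates implied by the comparison above; DDR2 and GDDR
# values are extrapolated from the 2x ratios, not quoted specs.
rates_gbit_per_pin = {"DDR2": 0.8, "DDR3": 1.6, "GDDR": 3.2}
for name, rate in rates_gbit_per_pin.items():
    # bandwidth of a standard 64-bit CPU memory channel at that rate
    print(f"{name}: {rate} Gbit/s/pin -> {64 * rate / 8:.1f} GB/s "
          f"per 64-bit channel")
```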
 
As the QPI products (specifically Nehalem) have their own memory controllers, a significant number of those pins are already for memory (DDR2 or DDR3). However, GDDRx generally has much higher bandwidth per pin. So, mapping Larrabee to a QPI socket might not be that difficult.
Perhaps Intel would implement support for non-GDDR DRAM for socket variants. GDDRx thus far has rather restricted capacity and zero expandability. The electrical issues with running the DRAM bus at GDDR speeds going through a possibly multidrop channel onto DIMMs would push the clocks back down again.

Maybe if we start seeing optical interconnects...

I wonder if (when?) we're going to see the various DDR, GDDR, and XDR memory standards unify back into a single DRAM standard.
Not any time soon if the DRAM makers have any say in it. If it's all one DRAM standard, it's going to be all one low-margin standard. It's partially their fault, but even so they have an interest in inventing higher margin RAM products.

Part of what caused the split was the different bandwidth requirements for GPUs and CPUs. Now even with multi-core CPUs, the memory bandwidth needs are approaching GPU levels. Perhaps DDR3 will help bridge the gap (as it is getting up to 1.6 Gbits/second per pin, which is still off by 2x from GDDR, but 2x better than DDR2).
Expandability, reliability, capacity, and power consumption are still things CPU RAM needs to take into account. Technically, GPGPUs should be worrying about it, too, but we know how far that's gotten so far.
 
Single-ended 5+ GHz signalling across 2 sockets with 10+ cm of traces is a little harder than covering a couple of cm between two BGAs ... there's always going to be a huge gap until things go optical.
 