Larrabee, console tech edition; analysis and competing architectures

The question is when the specifications for a new console need to be locked down for a given release date. Looking at a late 2011 / early 2012 timeframe, they could have early alpha machines 2-3 years before a (hypothetical) 2011-12 release. Surely that would be enough time to have 10-12 good launch titles?

The Xbox 360 beta kits were Mac G5s with dual-core CPUs and some ATI GPU. Devs didn't get Xenos and Xenon until 1 month before release.

Alpha kits are usually very far off from what the final product is; having alpha kits with the same architecture 2-3 years before launch would be a blessing for all devs.
 
#Intel needs higher performance per transistor; what could they do in this regard for Larrabee 2?
They could aim at a higher frequency (closer to Cell/Xenon).
Designs show a very non-linear increase in power consumption once you start climbing to the top end of the clock envelope.
If Larrabee's stock clocks are ~2 GHz, increasing to 3 GHz would offer 50% throughput increase at >>50% power increase.
In terms of power consumption on a parallel workload, just adding 50% more cores and counting on Intel's likely greater manufacturing capability seems safer.
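As a crude back-of-the-envelope (assuming dynamic power goes roughly as frequency times voltage squared, and that voltage has to climb more or less with frequency near the top of the clock envelope, so power scales something like f^3; the numbers are only illustrative, not Larrabee data):

Code:
# Toy power model: dynamic power ~ C * f * V^2, with V assumed to scale
# roughly linearly with f near the top of the envelope, so P ~ f^3.
# Purely illustrative numbers, not measurements.

def relative_power(freq_scale, core_scale=1.0):
    # cores add power ~linearly, frequency adds power ~cubically (via voltage)
    return core_scale * freq_scale ** 3

base = relative_power(1.0)                      # baseline cores @ ~2 GHz
clocked = relative_power(1.5)                   # same cores @ ~3 GHz
widened = relative_power(1.0, core_scale=1.5)   # 50% more cores @ ~2 GHz

print("Throughput +50%% via clock: power x%.2f" % (clocked / base))   # ~3.4x
print("Throughput +50%% via cores: power x%.2f" % (widened / base))   # ~1.5x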

#They could make changes to the chip layout. The actual Larrabee layout is still unclear, but Intel makes it sound like they are going with a monolithic shared L2 cache accessed via a large ring bus.
I'm not sure what you mean by monolithic in this context. It's on the same die, which makes the entire design monolithic.
The L2 itself appears to be a multi-banked cache that allows direct fast access of each core to its local bank.
The ring bus appears to handle anything that goes to different banks or other destinations.
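One way to picture that "local bank, otherwise go on the ring" behaviour; the line-interleaved bank mapping, bank count and slice size here are entirely made up for illustration, Intel hasn't described the actual hashing:

Code:
# Hypothetical: 32 cores, each sitting next to its own slice of a banked L2.
# An access is "local" (fast path) if the address maps to the requesting
# core's own bank, otherwise it travels on the ring bus to another bank.

NUM_BANKS = 32
LINE_SIZE = 64  # bytes per cache line

def home_bank(addr):
    # made-up mapping: interleave cache lines across banks
    return (addr // LINE_SIZE) % NUM_BANKS

def classify(core_id, addr):
    return "local bank" if home_bank(addr) == core_id else "via ring bus"

print(classify(core_id=3, addr=3 * LINE_SIZE))   # local bank
print(classify(core_id=3, addr=17 * LINE_SIZE))  # via ring bus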

It could limit the need for a large ring bus by keeping some traffic within a cluster ("grappe"). Thus Intel could end up packing in some more cores, or end up with a smaller, more cost-efficient chip.
Texture units would also be tied to a cluster.
Past 16 cores, Intel starts using multiple short-linked rings, which would keep contention to a more manageable level than trying to scale the same ring up to handle ever higher core counts.
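A rough way to see why one big ring gets crowded: with uniform traffic the average trip grows with the number of stops, so every extra core adds both more messages and longer trips. Just counting hops (toy model, not anything Intel has published):

Code:
# Crude ring-contention intuition: with N stops and uniform random
# destinations, a message travels ~N/4 hops on average (bidirectional ring),
# so the traffic each link must carry grows with N.

def avg_hops_bidirectional_ring(n_stops):
    # average shortest-path distance between distinct stops on the ring
    dists = [min(d, n_stops - d) for d in range(1, n_stops)]
    return sum(dists) / len(dists)

for n in (8, 16, 32, 64):
    print("%2d stops -> avg %.1f hops per message" % (n, avg_hops_bidirectional_ring(n)))
# Splitting 32 stops into two linked 16-stop rings keeps most traffic at
# 16-stop distances, at the cost of extra latency for cross-ring messages.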

In regard to latencies, it's unclear to me whether latency is the same across the whole L2 cache. Does a given core benefit from lower latency when accessing its dedicated part of the L2?
One presentation put the L2 latency at 10 clocks, which would probably apply to a core accessing its local section. The ring bus adds latency, which can vary depending on the collective behavior of the core and its neighbors.
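A minimal latency sketch under those assumptions; the 10-clock figure is the one from the presentation, while the per-hop cost and hop counts are pure guesses:

Code:
# Hypothetical L2 access latency: 10 clocks to the local section (per the
# presentation figure), plus some cycles per ring hop for remote sections.
# The per-hop cost below is an assumption, not an Intel number.

LOCAL_L2_CYCLES = 10
CYCLES_PER_RING_HOP = 2   # assumed

def l2_latency(hops_to_bank):
    return LOCAL_L2_CYCLES + CYCLES_PER_RING_HOP * hops_to_bank

print(l2_latency(0))    # local bank:                10 cycles
print(l2_latency(4))    # nearby bank:               18 cycles
print(l2_latency(16))   # far side of a 32-stop ring: 42 cycles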

#Could Intel change the way the L1 works?
In Rock, Sun has one L1 instruction cache per four cores and one L1 data cache per two cores.
The data cache on Larrabee is already being shared between 4 threads. Sharing it with other cores would worsen things.
Since the L1 is being used as an extended register file for each core, having multiple cores screwing with each other's extended register space may not profile well with actual workloads.
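Just to put the capacity side of that in numbers, taking the 32 KB L1 data cache per core from the SIGGRAPH paper and dividing by the threads that would share it (the multi-core sharing cases are the hypothetical being discussed, not anything announced):

Code:
# Effective L1D capacity per hardware thread if the 32 KB data cache were
# shared across more cores, each core running 4 threads.

L1D_BYTES = 32 * 1024
THREADS_PER_CORE = 4

for cores_sharing in (1, 2, 4):
    per_thread = L1D_BYTES / (cores_sharing * THREADS_PER_CORE)
    print("%d core(s) sharing: %.0f bytes per thread" % (cores_sharing, per_thread))
# 1 core : 8192 bytes/thread  (the current design)
# 2 cores: 4096 bytes/thread
# 4 cores: 2048 bytes/thread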

Could Intel go even further and "share" the L1 instruction cache among several cores (two or four)?
For in-order cores, instruction throughput isn't usually a limiting factor. I'm not sure what Intel would gain by complicating instruction caching.
Also, sharing instructions over 4 cores can lead to restricting the flexibility of the design. Too many different instruction streams will lead to destructive interference. If there's a lot of common execution, it saves space. However, this does set up some dangerous parallels to another design that has 64 basic FMADD units running the same instruction...
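A toy way to see the interference point: a small LRU cache fed by four interleaved fetch streams, once with every core running the same code and once with four unrelated code footprints (cache size and streams invented for the example):

Code:
# Toy LRU cache fed by 4 cores' instruction fetch streams.
# Same code on all cores -> footprints overlap and the hit rate stays high.
# Four unrelated streams -> they evict each other ("destructive interference").
from collections import OrderedDict
import random

def hit_rate(streams, cache_lines=64):
    cache, hits, total = OrderedDict(), 0, 0
    fetches = [line for group in zip(*streams) for line in group]  # interleave cores
    for line in fetches:
        total += 1
        if line in cache:
            hits += 1
            cache.move_to_end(line)
        else:
            cache[line] = True
            if len(cache) > cache_lines:
                cache.popitem(last=False)  # evict least recently used line
    return hits / total

random.seed(0)
shared_code = [random.randrange(48) for _ in range(4000)]      # one 48-line loop
same = [shared_code] * 4
different = [[c * 1000 + random.randrange(48) for _ in range(4000)] for c in range(4)]

print("4 cores, same code     : %.2f hit rate" % hit_rate(same))       # ~1.00
print("4 cores, unrelated code: %.2f hit rate" % hit_rate(different))  # much lower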
 
Ok, first, thanks for your response :)
Designs show a very non-linear increase in power consumption once you start climbing to the top end of the clock envelope.
If Larrabee's stock clocks are ~2 GHz, increasing to 3 GHz would offer 50% throughput increase at >>50% power increase.
In terms of power consumption on a parallel workload, just adding 50% more cores and counting on Intel's likely greater manufacturing capability seems safer.
In fact I was thinking that it could be feasible to aim at a smaller die, as Larrabee is likely to be bigger than Xenon. If you gain 50% in throughput you could cut the chip die size.
But it's likely to imply more pipeline stages, which would cost in number of registers (x4 due to hyperthreading) and scalar performance. Not to mention that it would affect yields in a bad way.
Overall I get your point: pumping up the frequency comes at multiple costs.

I'm not sure what you mean by monolithic in this context. It's on the same die, which makes the entire design monolithic.
The L2 itself appears to be a multi-banked cache that allows direct fast access of each core to its local bank.
The ring bus appears to handle anything that goes to different banks or other destinations.
----------------------------
Past 16 cores, Intel starts using multiple short-linked rings, which would keep contention to a more manageable level than trying to scale the same ring up to handle ever higher core counts.
---------------------------
One presentation put the L2 latency at 10 clocks, which would probably apply to a core accessing its local section. The ring bus adds latency, which can vary depending on the collective behavior of the core and its neighbors.
What I meant by monolithic was "made of only one piece". In the Phenom, for example, the L2 cache is made of four pieces; in Penryn the L2 is made of only one piece.
My understanding of how LRB should look is that the L2 would be made of one piece (as in Penryn). Thus a core may include the L1/scalar part/VPU and be "stuck" directly onto the ring bus. That's why I thought (it might be stupid...) that latency could be the same for any part of the L2. I also thought that a constant access time to the "unified/whatever" L2 might help on the software side, as access to data in the cache would be constant.

That's why I pointed it out and asked about the latencies.

My idea was to have the L2 made of discrete parts, each one tied to a given number of cores.
The idea was to limit the pressure put on the ring bus.
But your comment on the nature of the ring bus made me think that it's not a good idea.
It also helps to understand how the chip could look, as I was thinking about something stupid :oops: (like the actual layout being close to the schematic in Intel's presentation... with the L2 ending up thin and stretched, the bus going around it, and the cores around the bus... thinking again about this now, I laughed at how stupid it was :LOL: )

So the layout could end up something like this (32 cores in this case):
|core|core|core|core|core|core|core|core|L2
|-----------------RING BUS 1------------------|L2
|core|core|core|core|core|core|core|core|L2
|core|core|core|core|core|core|core|core|L2
|------------------RING BUS 2-----------------|L2
|core|core|core|core|core|core|core|core|L2

With the "monolithic" L2 cache on one side or the other. I'm not sure about what you mean by "past16 cores Intel starts to use short link bus".
I'm "drawing" the short link would be between the two bus.
Do you think of some thing like that?
The data cache on Larrabee is already being shared between 4 threads. Sharing it with other cores would worsen things.
Since the L1 is being used as an extended register file for each core, having multiple cores screwing with each other's extended register space may not profile well with actual workloads.

For in-order cores, instruction throughput isn't usually a limiting factor. I'm not sure what Intel would gain by complicating instruction caching.
Also, sharing instructions over 4 cores can lead to restricting the flexibility of the design. Too many different instruction streams will lead to destructive interference. If there's a lot of common execution, it saves space. However, this does set up some dangerous parallels to another design that has 64 basic FMADD units running the same instruction...
OK, I got your point.
 
What I meant by monolithic was "made of only one piece". In the Phenom, for example, the L2 cache is made of four pieces; in Penryn the L2 is made of only one piece.
My understanding of how LRB should look is that the L2 would be made of one piece (as in Penryn). Thus a core may include the L1/scalar part/VPU and be "stuck" directly onto the ring bus. That's why I thought (it might be stupid...) that latency could be the same for any part of the L2. I also thought that a constant access time to the "unified/whatever" L2 might help on the software side, as access to data in the cache would be constant.
The descriptions I've read make me think that Larrabee's L2 is not shared the way Penryn's is.
Each L2 sector behaves independently. The sections behave in much the same way AMD's separate L2 caches do. Data can exist in multiple sections, and the changes are kept coherent like standard separate caches.
The difference is that AMD uses a crossbar that each core is hooked to, which would have less variation in latency than the ring bus.

With the "monolithic" L2 cache on one side or the other. I'm not sure about what you mean by "past16 cores Intel starts to use short link bus".

My interpretation is that the most cores there can be per ring is 16, and that adding more cores means adding more rings, with the rings having a few connections between one another.
 
3dilettante: I read somewhere (don't remember where) that the number of cores on LRB has to be a multiple of 8, not 16. Can you confirm that or am I just remembering it wrong?
 
My interpretation is that the most cores there can be per ring is 16, and that adding more cores means adding more rings, with the rings having a few connections between one another.

3dilettante: I read somewhere (don't remember where) that the number of cores on LRB has to be a multiple of 8, not 16. Can you confirm that or am I just remembering it wrong?
Thanks 3Dilettante it's clear now ;)

Nao, I would tend to think that 16 is 3Dilettante's own estimate (as I never saw that figure in recent Intel presentations).
I heard of multiples of 8 too, like 16/24/32 cores.
Anyway, it would be interesting to learn how you came to this conclusion, 3Dilettante.

I think I read the bus width somewhere (1024 bits? I don't remember); did you estimate from that at what point it could start to saturate?
 
Ok, I found it, I read it on AnandTech:

The final product will be some assembly of a multiple of 8 Larrabee cores

Perhaps 3dilettante is right, and each ring connects a certain number of cores (8 in this case) and these rings are connected to some hub or switch.
 
Ok, I found it, I read it on AnandTech:

The final product will be some assembly of a multiple of 8 Larrabee cores

Perhaps 3dilettante is right, and each ring connects a certain number of cores (8 in this case) and these rings are connected to some hub or switch.

The statement in the SIGGRAPH paper was that they switch to using multiple short linked rings when scaling to more than 16 cores. So I guess they use one ring for 16 cores and switch to three smaller ones for 24 cores?
 
Tim Sweeney argues 1 byte of memory bandwidth per flop.
Intel stated the goal of Larrabee was 2 bytes of read/write bandwidth per flop.


The point is that, beyond cache bandwidth, the system memory bandwidth needed for teraflop chips is achievable.


16 flops per clock x 16 cores x 4 GHz = about a teraflop chip

The 360 had 3 cores clocked at 3.2 GHz... on a modern process an 800 MHz increase to 4 GHz seems reachable.

Two of these chips would achieve less than the 2.5 teraflop goal Tim Sweeney stated, but they would be fully programmable.
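Putting that arithmetic in one place (the 16 flops/clock, 16 cores and 4 GHz figures are the ones above; the bytes-per-flop ratios are the Sweeney and Intel numbers quoted earlier, applied blindly to see what bandwidth they would imply):

Code:
# FLOPS and bandwidth back-of-the-envelope for the hypothetical chip above.

flops_per_clock_per_core = 16
cores = 16
clock_hz = 4.0e9

chip_flops = flops_per_clock_per_core * cores * clock_hz
print("one chip : %.2f TFLOPS" % (chip_flops / 1e12))       # ~1.02 TFLOPS
print("two chips: %.2f TFLOPS" % (2 * chip_flops / 1e12))   # ~2.05 TFLOPS (< 2.5 target)

# Bandwidth implied by the bytes-per-flop rules of thumb quoted above:
for label, bytes_per_flop in (("Sweeney, 1 B/flop", 1), ("Intel goal, 2 B/flop", 2)):
    print("%s -> %.0f GB/s per chip" % (label, chip_flops * bytes_per_flop / 1e9))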


As far as the cost to make such a chip, Microsoft could just offer IBM royalties and not bother with ATI. Although all the rumors constantly mention ATI, so I'm not sure. Heck, I'm not sure of the 16-core thing anyway, but that's the fun of speculation.


This is back from 2009.
graphics.cs.williams.edu/archive/SweeneyHPG2009/TimHPG2009.pdf

Hey Brimstone, I took the liberty of copying your post here so we can discuss it further.

I thought a bit more about it. The whole idea behind Larrabee, or a PowerPC Larrabee, is to run standard x86/PowerPC code, right? It might not be the best idea, but anyway.

I was thinking about how to achieve better density than in Larrabee. I considered packing more resources (SIMD units, texture units) into a core, a more complex one, by no means lightweight, and building the core around them. I thought about it a bit: you would need to support more than 4 hardware threads, to have more resources, etc.

Then I looked at the GPU world. For the sake of density, the L1 cache for four CUs is implemented as one block. 4 CUs is 4x4 16-wide SIMDs, altogether supporting 40 active threads (out of a pool of 4x256?).
Then you have the number of pending memory requests, etc.

I came to the conclusion that you may want to get the SIMDs and tex units out of your CPU core and then have it keep track of things, etc. Then you specialize them, etc.

That's pretty much like reinventing the wheel and creating a GPU :LOL:

At no point in the design has the scalar ISA been that much of a concern.
Even in Larrabee, the usefulness of the cores being compliant with the x86 ISA is questionable relative to what the chip is intended to achieve.

I got a crazy idea. Simple cores tied to a SIMD, or multiple SIMDs tied to a complex core: what if that is the wrong way to build something based on an existing CPU ISA that is intended to do, among other things, graphics?

Actually, what I thought is: what if Intel, for that matter, had taken another road?

The idea I had is that the only way I can see for the ISA to have any relevance is to do what AMD did for a while with its VLIW designs: have MIMD designs act in a vectorized fashion.
For Intel that would be simple CISC x86 cores; for IBM that would be simple RISC PowerPC cores.
Basically it is making some sort of GPU out of CPUs, not sticking stuff onto CPU cores.

So Intel used P55 cores for Larrabee; for IBM it could have been POWER1 or POWER2. In both cases a new design would have been really likely.
Still, those cores are super tiny, and you remove the front end.
Then you create a new front end that forwards the same instructions to those different "cores" (not the proper term, but anyway) with different data, pretty much what AMD GPUs do.

The challenge for IBM or Intel would be to make the front end more complex than the one in even today's CPUs, so that each cluster acts more like an autonomous CPU core than like an "assisted", let's say AMD-style, SIMD.

In a worst-case scenario (complex data dependencies? or I don't know what :LOL: ) the thing would act as a single CPU core. Depending on the number of "cores" in the array, that would be a far more severe drop than the worst-case scenario on AMD hardware (i.e. using 1 ALU out of five).
Put on a purely data-parallel problem, it would act as X x86 or PPC cores.
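To make that best-case/worst-case point concrete, here is a toy model of a shared front end driving N scalar cores in lockstep; the "ISA" and scheduling policy are invented, it only illustrates how convergent code runs N-wide while fully divergent code collapses to roughly one core's throughput:

Code:
# Toy model of a shared front end driving N scalar cores in lockstep.
# Each "core" has its own program counter; each cycle the front end picks one
# PC, and only the cores whose next instruction matches execute. Fully
# convergent code runs N-wide; fully divergent code degrades to ~1 op/cycle.

def run_lockstep(programs):
    # programs: one instruction list per core (hypothetical ISA where every
    # instruction simply advances that core's PC by 1).
    n = len(programs)
    pcs = [0] * n
    cycles = executed = 0
    while any(pc < len(prog) for pc, prog in zip(pcs, programs)):
        # front end issues the next instruction of the first unfinished core
        leader = next(i for i in range(n) if pcs[i] < len(programs[i]))
        current = programs[leader][pcs[leader]]
        for i in range(n):  # every core whose next instruction matches executes
            if pcs[i] < len(programs[i]) and programs[i][pcs[i]] == current:
                pcs[i] += 1
                executed += 1
        cycles += 1
    return executed / cycles  # instructions per cycle across the array

N = 16
convergent = [["op%d" % k for k in range(100)]] * N                    # same code everywhere
divergent = [["core%d_op%d" % (i, k) for k in range(100)] for i in range(N)]

print("convergent code: %.1f ops/cycle" % run_lockstep(convergent))   # ~16.0
print("divergent code : %.1f ops/cycle" % run_lockstep(divergent))    # ~1.0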

It's a bit ridiculous, but that's the only way I can see the ISA being relevant in that kind of design and having a chance to compete with GPUs (Larrabee may end up competing, but does that make x86 relevant to the picture?).
It's kind of the only way I can see traditional CPU cores and GPU/throughput ones sharing the same ISA. You might want the ISA to support a newly integrated DSP or the tex units.

So imagine you would have a SIMD (in AMD parlance) of MIMD x86 or PowerPC cores, including texture units, driven by a really complex CPU front end.


Anyway, people don't write in assembly anymore, and it looks like there should be languages that further hide the differences between the GPU and CPU from the programmer.
Either way, what people might want is not something like Larrabee but a more complex GPU, and nobody cares about the GPU ISA, even if one were to use x86 or PPC for it.

But there might be a good reason for that not to happen; still, I thought it would be funny to bring up the idea, as it is somehow what AMD has been doing until now: MIMD units acting in a vectorized fashion.

EDIT:
It's a bit of an attempt to demonstrate 'by the absurd' that the idea behind Larrabee may not be relevant.
 