Larrabee, console tech edition; analysis and competing architectures

Interesting, Xfire scales slightly better with AA on. :)

This actually makes sense. The larger the computation problem, the easier it is to effectively parallelize it. Really small datasets are the ones that are hardest to get good scaling out of (the non-parallel overheads just kill the performance).
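To put toy numbers on it (the per-frame overhead and workload costs below are invented purely to show the shape of the curve):

Code:
// Toy model of multi-GPU scaling: each frame has a fixed, non-parallel cost
// (sync, command submission, compositing) plus work that splits across GPUs.
// The millisecond figures are made up just to illustrate the trend.
#include <cstdio>

int main() {
    const double overhead_ms = 2.0;                 // per-frame serial cost (made up)
    const double small_ms = 8.0, large_ms = 24.0;   // parallel work: no-AA vs AA frame

    for (int gpus : {1, 2}) {
        double t_small = overhead_ms + small_ms / gpus;
        double t_large = overhead_ms + large_ms / gpus;
        std::printf("%d GPU(s): light frame %.1f ms, heavy (AA) frame %.1f ms\n",
                    gpus, t_small, t_large);
    }
    // Going 1 -> 2 GPUs: the light frame speeds up 10/6 = 1.67x, the heavy
    // AA frame 26/14 = 1.86x. The bigger the parallel part, the closer you
    // get to the ideal 2x, which is why AA-on scales better.
}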
 
Then again, the situation is not exactly better when the PPE is surrounded by multiple SPEs.

Actually, being in the corner of the die is a really bad place to put the hottest component. The heat can spread only in a 90 degree arc. If you put the hottest element more in the center of the chip, it can radiate heat in all directions. Heat dissipation is related to area, so to get the best heat dissipation, you want to spread the temperature as evenly as possible (which Cell's current layout doesn't do so well).

As such, for the current Cell layout, disabling the closest SPU (rather than some other one) should slightly increase power consumption in exchange for slightly better heat distribution.
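For what it's worth, you can see the corner-vs-center effect with a really crude toy model (a little diffusion grid I made up; none of the numbers mean anything physically):

Code:
// A minimal sketch of why a hot block in the corner of a die ends up hotter
// than the same block in the middle: a 2D diffusion grid where the source
// cell injects the same power in both cases.
#include <cstdio>
#include <vector>

// One explicit diffusion step on an n x n grid; edges are treated as
// insulating, heat only leaves through the small uniform "sink" term.
static double step(std::vector<double>& t, int n, int src, double power) {
    std::vector<double> next(t);
    for (int y = 0; y < n; ++y) {
        for (int x = 0; x < n; ++x) {
            double sum = 0.0; int nb = 0;
            if (x > 0)     { sum += t[y * n + x - 1]; ++nb; }
            if (x < n - 1) { sum += t[y * n + x + 1]; ++nb; }
            if (y > 0)     { sum += t[(y - 1) * n + x]; ++nb; }
            if (y < n - 1) { sum += t[(y + 1) * n + x]; ++nb; }
            // Relax toward the neighbour average, plus a small sink.
            next[y * n + x] = 0.8 * t[y * n + x] + 0.2 * (sum / nb) - 0.01 * t[y * n + x];
        }
    }
    next[src] += power;            // the hot unit keeps injecting heat
    t.swap(next);
    return t[src];
}

int main() {
    const int n = 33;
    std::vector<double> corner(n * n, 0.0), center(n * n, 0.0);
    double pc = 0, pm = 0;
    for (int i = 0; i < 2000; ++i) {
        pc = step(corner, n, 0, 1.0);                   // source in the corner
        pm = step(center, n, (n / 2) * n + n / 2, 1.0); // source in the middle
    }
    std::printf("peak temp, corner source: %.2f\n", pc);
    std::printf("peak temp, center source: %.2f\n", pm);
    // The corner source settles noticeably hotter: it only has two
    // neighbours to spread into instead of four.
}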

Why would disabling an SPU *increase* power consumption?
 
I've actually wondered if this was a yield decision or a thermal hot-spot decision.

Check out the following die-heat photo for Cell from their ISSCC 2005 paper "The Design and Implementation of a First-Generation CELL Processor" (attached). If you look at the temperature photo, one of the Cell SPEs (the one next to the PowerPC core) is *really* hot. I saw a presentation from IBM once that said an internal research project re-floorplanned Cell to put the PowerPC core in the middle of the chip (rather than on the edge). It reduced the peak die temperature substantially.

Does anyone know for certain if the "8th SPE" that isn't used is chosen based on whichever of the SPEs is the slowest (or has a fabrication defect), or is it always the one next to the PowerPC core?

One nice thing this diagram shows is how impressively low-power the SPEs are.

Well, if the Cell PPE has a hotspot problem, the XeCPU is really in trouble. The XeCPU has similar PPE-like cores packed right next to each other, with one in the corner bordered by the other two. Then again, the 360 has an above-average failure rate; could this be one of the causes? I somehow doubt it, but who knows.
 
Actually, being in the corner of the die is a really bad place to put the hottest component. The heat can spread only in a 90 degree arc. If you put the hottest element more in the center of the chip, it can radiate heat in all directions. Heat dissipation is related to area, so to get the best heat dissipation, you want to spread the temperature as evenly as possible (which Cell's current layout doesn't do so well).
True, though I'm sure the PPE is more than capable of working under its own heat.
I was talking about the case in which the SPEs are under heavy load, since that was the scenario for PPE overheating in your post.
Why would disabling an SPU *increase* power consumption?
It wouldn't.
Disabling the closest one as opposed to some other would.
 
So, what does this imply for Larrabee?

I think there are two key differences between the SPE model, the XeCPU, and a Larrabee-like many-core CPU design.

First, the SPE vs XeCPU comparison shows us that all the control logic and such of a CPU (rather than a co-processor) really adds up. As such, I think Larrabee's decision to use 4x wider vectors than the SPE is really critical. It seems that one of the reasons the Larrabee design might be competitive is that it amortizes the control logic of a CPU over a larger (wider) data computation path. Conversely, that also means that Larrabee's vector utilization will be critical to its overall performance.
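To make the amortization point concrete with some made-up areas (these are not real numbers for either chip):

Code:
// Toy arithmetic for the "amortize control logic over a wider vector" argument.
// All numbers below are invented for illustration; they are not real die areas.
#include <cstdio>

int main() {
    const double control_area = 6.0;   // mm^2 of fetch/decode/control logic (made up)
    const double area_per_lane = 0.5;  // mm^2 per 32-bit vector lane (made up)

    for (int lanes : {4, 16}) {        // SPE-like 4-wide vs Larrabee-like 16-wide
        double core_area = control_area + lanes * area_per_lane;
        std::printf("%2d lanes: %.1f mm^2 total, %.2f mm^2 of control per lane\n",
                    lanes, core_area, control_area / lanes);
        // The flip side: the wider peak only materializes if the lanes stay busy.
        for (double util : {1.0, 0.5, 0.25})
            std::printf("   utilization %.0f%% -> %.1f useful lane-ops/clock\n",
                        util * 100, lanes * util);
    }
}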

Are you expecting Larrabee to have >90% utilization of its wide vector units? At the moment I am having trouble seeing how you will get high utilization from Larrabee.

Also, on 45nm I am expecting the SPE to be clocked faster than Larrabee. Cell is 3.2 GHz on 90nm, whereas Larrabee is aiming at a modest 2.5 GHz on 45nm. An SPE with its local store is 6.47 mm² on 45nm, while a Larrabee core is indicated to be around 10 mm² without the L2 cache. That is roughly the same size as the Cell PPE core at 45nm, which is 11.32 mm².

Considering utilization, clock speed and area, can Larrabee be competitive against a next-generation Cell?
 
Slightly different jobs though. In brute peak float power, likely not. In sustained float power, possibly not. But if Larrabee includes GPU functions in hardware, like proper texture units, it'll outperform Cell in some areas. In the context of a console, a Larrabee-only console might be a better choice than a Cell-only console. Throw GPUs in there and it's a different matter, with all the complications of the development environment to weigh in too.
 
I'm with Shifty; if the context of the discussion is Larrabee vs a GPU, then although discussion of Cell makes sense due to their both belonging to the class of new massively-multicore/simple CPUs, it doesn't make sense to ask how they would compare to one another on the same node on a per-mm performance basis, since presumably we're talking about Larrabee here as targeting a completely different workload than Cell in the console. Best to stick to predicting Larrabee performance vs GPUs... or if we're going to discuss its merits vs an architecture like Cell, simply take the angle of hypothesizing how well Big-L would perform in the HPC space.
 
And also considering how developer friendliness might be affected by choosing Larrabee versus a Xenon++ and GPU configuration. The software development side of these machines is ever gaining in importance.
 
Well, until they release details on the fixed-function parts of Larrabee it's difficult to compare it to a GPU. Besides, we are discussing its design choices and their performance implications. HPC or raytracing is probably a good target application to discuss with the little information available on Larrabee.
 
I don't see a big difference... as long as they provide you with an already-implemented and fairly optimized modern rendering pipeline.
 
Are you expecting Larrabee to have >90% utilization of its wide vector units? At the moment I am having trouble seeing how you will get high utilization from Larrabee.

The same way a modern GPU gets high SIMD utilization. The G80 basically has wide "warps" that do the same computation on different data in parallel. It is a bit different from traditional vector processing, as I understand it, as it is using the wide-SIMD (or vectors) to work on many pixels or triangles at the same time (rather than using wide vectors to speed up working on a single vertex or pixel).

I'm not sure exactly how the G80 does it, but GPUs have found a way to use fairly wide SIMD units with high utilization.

Edit: finished after some wild key press caused it to post before I was ready.
 
I'm not sure exactly how the G80 does it, but GPUs have found a way to use fairly wide SIMD units with high utilization.
Well, it's not that hard really, given the pixel and vertex shader programming model.
Each vector slot is independent from the others; that's it.
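Roughly, in code form (nothing below is actual G80 or Larrabee behaviour, just the shape of the idea with an invented 16-wide batch and a toy shader):

Code:
// Sketch of "one pixel per SIMD lane" (the GPU/warp style) versus
// "one vector per pixel". Plain scalar C++ loops stand in for the lanes.
#include <cstdio>

constexpr int kLanes = 16;   // assumed vector width, Larrabee-style

struct PixelBatch {          // structure-of-arrays: one entry per lane
    float r[kLanes], g[kLanes], b[kLanes], light[kLanes];
};

// The "shader" is written once, scalar-style, and applied to every lane.
// Because each lane is a different pixel, every lane always has work,
// which is where the high SIMD utilization comes from.
void shade(PixelBatch& p) {
    for (int lane = 0; lane < kLanes; ++lane) {     // this loop is what the
        float l = p.light[lane];                    // hardware does in lockstep
        p.r[lane] *= l;
        p.g[lane] *= l;
        p.b[lane] *= l;
    }
}

int main() {
    PixelBatch batch;
    for (int i = 0; i < kLanes; ++i) {
        batch.r[i] = batch.g[i] = batch.b[i] = 1.0f;
        batch.light[i] = i / float(kLanes);         // per-pixel lighting term
    }
    shade(batch);
    std::printf("pixel 7 shaded red = %.3f\n", batch.r[7]);
    // Contrast: packing one pixel's r,g,b(,a) into a vector gives at most 4
    // useful slots, no matter how wide the vector is.
}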
 
Larrabee supports 4 threads per core.

Would it be very difficult for STI to do the same for the SPEs?

And the Waternoose?
 
Larrabee supports 4 threads per core.

Would it be very difficult for STI to do the same for the SPEs?
Doesn't really fit with the concept of having your program and data stored in the local memory of the SPU.

The SPU branch penalties are fairly low when they happen and the memory latency for accessing data in main memory should mainly be taken care of by double buffering streamable data using DMA transfers.
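For reference, the double-buffering pattern being described looks roughly like this. The dma_get()/dma_wait() helpers are hypothetical stand-ins for the real MFC intrinsics, implemented here as plain memcpys so the sketch actually runs on a PC; the chunk size is arbitrary:

Code:
// Double-buffered streaming: while the SPU crunches buffer A, the next chunk
// is being DMA'd into buffer B, so main-memory latency hides behind compute.
#include <cstddef>
#include <cstdio>
#include <cstring>
#include <vector>

constexpr std::size_t kChunk = 4096;              // floats per transfer (arbitrary)

// Hypothetical DMA wrappers: start an async copy into "local store" and wait
// on its tag. Here they are synchronous memcpys purely for illustration.
static void dma_get(float* local, const float* src, std::size_t count, int /*tag*/) {
    std::memcpy(local, src, count * sizeof(float));
}
static void dma_wait(int /*tag*/) {}

static float process(const float* buf, std::size_t count) {   // dummy per-chunk work
    float sum = 0.0f;
    for (std::size_t i = 0; i < count; ++i) sum += buf[i];
    return sum;
}

int main() {
    std::vector<float> main_mem(kChunk * 8, 1.0f);             // pretend main RAM
    static float ls[2][kChunk];                                // the two local buffers
    int cur = 0;
    float total = 0.0f;

    dma_get(ls[cur], main_mem.data(), kChunk, cur);            // prime buffer 0
    for (std::size_t off = kChunk; off < main_mem.size(); off += kChunk) {
        int next = cur ^ 1;
        dma_get(ls[next], main_mem.data() + off, kChunk, next); // fetch ahead...
        dma_wait(cur);                                          // ...wait on the old tag
        total += process(ls[cur], kChunk);                      // compute overlaps fetch
        cur = next;
    }
    dma_wait(cur);
    total += process(ls[cur], kChunk);                          // drain the last buffer
    std::printf("sum = %.0f\n", total);
}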
 
New information has emerged about Larrabee (*shoot me if I'm wrong*), so I thought it might be appropriate to bump this thread back into life considering what was gleaned from the SIGGRAPH conference.

Intel has been quoted as saying that they have approached console vendors regarding the use of this technology in the next generation of consoles, so for that reason I think this thread becomes relevant again.

So to start things off once again:

1. What are the possible implementations of Larrabee for a next-generation console? For example, will Larrabee be used in the traditional sense in a GPU + CPU configuration, or will it be packaged in a Fusion-type arrangement with Sandy Bridge cores?

2. How many Larrabee or Larrabee/Sandy Bridge cores are likely to be implemented within a ~225W TDP window and approximately the same silicon budget as the current generation, at a 32nm process node to begin with? (I'm assuming a 2011 launch.)
 
If Intel were to put Larrabee cores on the same die as a mainstream x86, Nehalem may not be the best choice.

Sandy Bridge is the mainstream core that is supposed to have a ring bus, something Nehalem lacks.

225W for a single chip seems pretty extreme for the console environment.
 
If Intel were to put Larrabee cores on the same die as a mainstream x86, Nehalem may not be the best choice.

Sandy Bridge is the mainstream core that is supposed to have a ring bus, something Nehalem lacks.

225W for a single chip seems pretty extreme for the console environment.

Whoops, you're right there; the timeframe seems appropriate.

I was using 225W as a rough guideline. It seems about right considering the initial power draw of the current generation.
 
I suppose 225W is being considered as a total CPU+GPU thermal envelope, but dealing with that heat from one chip strikes me as a Bad Move. Thermal reduction over the life of the platform will be slowed by diminishing process-shrink returns, so the box will be looking at being hot for all SKU iterations, even at mainstream prices. This would need an expensive cooling solution, either pushing the cost up or decreasing profit margins significantly at the lower price end.

If Larrabee's to be used, I see it more in a Wii-like role, in a platform that's 'good enough' but not 'cutting edge' in performance, yet very developer friendly, for a lower-cost platform. This would fit in better with the New World Order of mainstream, wide gaming appeal. If we're gonna have a box that everyone's playing, keeping it cheap to suit all pockets makes sense. A single CPU/GPU solution would fit very nicely into that system, offering developers supreme flexibility in processing budget, whether to throw the majority of that power at visuals or to rein back on the visuals for a title that uses the processing power for other things.

Also, as I understand it, the preferred rendering model will be TBDR. This will reduce RAM BW requirements, good for reducing overall system costs. Will that actually become a preferred rendering model for other architectures though? I'm not seeing a huge difference between Larrabee and Cell in overall processor architecture. A Cell-based PlayStation could presumably manage much the same. What about other GPUs? Would a honking great nVidia GPGPU require more RAM BW overall, or will we be seeing more flexibility in how the GPU IHVs handle rendering, offering effective low-bandwidth solutions?

And could Larrabee be paired up with eDRAM for some very fast framebuffer options?
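On the TBDR bandwidth point, a back-of-the-envelope comparison (all the figures are invented round numbers, not claims about any actual console):

Code:
// Rough framebuffer bandwidth: immediate-mode rendering with overdraw versus
// a tile-based deferred renderer that keeps the working tile on chip and
// writes each pixel out once. Resolution, overdraw and formats are made up.
#include <cstdio>

int main() {
    const double pixels   = 1280.0 * 720.0;   // 720p target
    const double fps      = 60.0;
    const double overdraw = 4.0;              // average shaded samples per pixel
    const double color_b  = 4.0;              // bytes per color write
    const double z_b      = 4.0;              // bytes per depth access (each way)

    // Immediate mode: every overdrawn sample touches external color + Z.
    double imm = pixels * fps * overdraw * (color_b + 2.0 * z_b);

    // TBDR: color/Z traffic stays in the on-chip tile; external RAM sees one
    // final color write per pixel (ignoring the binned geometry traffic,
    // which is the cost a TBDR adds back).
    double tbdr = pixels * fps * color_b;

    std::printf("immediate mode framebuffer traffic: %.2f GB/s\n", imm  / 1e9);
    std::printf("TBDR framebuffer traffic:           %.2f GB/s\n", tbdr / 1e9);
}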
 
As always this thread is more questions than answers. I think we will have to wait until late 2009 for Larrabee and DX11 parts to be released; all console tech will almost certainly be based on a derivative of those. Questions will soon be answered.
 
What could a Larrabee v2 look like?

Some thoughts/questions about how Intel could make Larrabee better.

It's likely that Larrabee will have a higher memory/ALU ratio than its competitors.
Intel's process can serve as a leveler, but it had better bring performance advantages.

# Intel needs higher performance per transistor; what could they do in this regard for a Larrabee 2?
They could aim at a higher frequency (closer to Cell/Xenon).
Obviously they're facing power consumption/thermal dissipation issues; the chip might end up smaller, but from an economic point of view that could make sense (more chips per wafer, lower yields?).
To aim for higher speed Intel may need a longer pipeline, which could hurt some aspects of performance. Do you think Intel could find a "middle ground"? (Say ~ten stages: it could be clocked higher, while not suffering as much from branch mispredicts etc. as a longer pipeline design.)

# They could make changes to the chip layout. The actual Larrabee layout is still unclear, but Intel makes it sound like they are going with a monolithic shared L2 cache accessed via a large ring bus. I'm missing knowledge here (insight welcome).
Could the bus end up being large? Power hungry?
For Cell we discovered that the EIB doesn't shrink that well. A huge bus could end up limiting the cost reductions from die shrinks. (?)
If there's any truth to my speculation, Intel may end up reworking the chip layout.
Intel may consider using a cluster of, say, 4 or 8 cores sharing 1 or 2 MB of L2 cache.
It could limit the need for a large ring bus by keeping some traffic within the cluster.
Thus Intel could end up packing in some more cores, or with a smaller, more cost-efficient chip.
Texture units would also be tied to a cluster.

In regard to latencies, it's unclear to me whether latency is the same across the whole L2 cache. Does a given core benefit from lower latency when accessing its dedicated part of the L2?
If yes, would a cluster layout prevent achieving the same trick? (I guess it's convenient for software to have a constant access time to a given piece of data within the L2.)
(NB: in changing the layout I don't mean changing the cache "hierarchy", i.e. one core would still have read/write access to only 256 KB of L2.)

# Could Intel change the way the L1 caches work?
In Rock, Sun has one L1 instruction cache per 4 cores and one L1 data cache per two cores.
I also read that in POWER7 IBM may allow multiple SIMD units to act as a single wider one.
Could it be a win for Intel to rework the L1 instruction/data cache ratio (the data cache being bigger than the instruction cache)?
Could Intel go even further and share the L1 instruction cache among some cores (two or four)?

# Then come the fixed-function units. Do some of you think that Intel may include another kind of fixed-function unit on top of the texture samplers? Which, and why?
Could they change the way the texture samplers are shared (currently one per core)?
While changing the L2 layout they could end up having the texture samplers tied to a cluster rather than a core. Could you see some advantage in this change?
The same question goes for the hypothetical added fixed-function units.
 