Intel pushing Larrabee console deal with Microsoft

If we see a 32-core LRB1 in '09, I wouldn't rule out a 64+ core LRB2 by '11/'12. Perhaps more realistically, for a high-end console you could have two 48-core chips rather than one gigantic one (for yields, heat dissipation, memory bandwidth scaling - I'm not a chip architect so maybe I'm completely off here).

Each LRB core is going to be larger than a Cell SPU or even the PPU.
If IBM is predicting a 32-SPU Cell, I am doubtful Intel is going to have 64+ LRB cores in the same time frame unless it is a really large chip, like >400mm2.
 
EDRAM is definitely a bit inflexible to work with, but then it's not too different from Larrabee-style tile-based rendering, except your tiles are larger and your EDRAM is your cache...
EDRAM makes a big difference (for the worse) when you are fabbing your chip.
We can simply live without it; I hope LRB leads the way from this standpoint in the console market. (PowerVR is already taking excellent care of the mobile market.)
 
Sorry, I don't get this. Where do you put the line between the Xbox EDRAM, where developers have to tile themselves, and the on-chip cache on Larrabee, where developers (okay, maybe also the "driver", but mad-scientist types would do it themselves) have to tile? To me, the differences between what is done on Xenon and what is described in the Larrabee PDF are minor and numeric.
1) You, as a developer, don't need to manually tile a thing on LRB (that would in theory have been possible on 360 as well, but MS took another route; see the binning sketch below for the mechanics that are common to both).
2) The difference is in how you fab your chip + EDRAM; it's generally not a walk in the park.
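For illustration only, here is a rough C sketch of the screen-space binning step that any tiled approach shares, whether a developer does it by hand against 360-style EDRAM tiles or the LRB software stack does it behind the scenes. The tile sizes, resolution, and structures are made-up placeholders.

```c
/* Minimal sketch of the screen-space binning any tiled renderer performs:
 * each triangle's bounding box decides which tile bins reference it.
 * Sizes and limits below are illustrative only. */
#define TILE_W 64
#define TILE_H 64
#define TILES_X (1280 / TILE_W)
#define TILES_Y (720 / TILE_H)
#define MAX_TRIS_PER_TILE 4096

typedef struct { float x[3], y[3]; } tri2d_t;   /* post-transform, screen space */
typedef struct { int count; int tri_index[MAX_TRIS_PER_TILE]; } tile_bin_t;

static tile_bin_t bins[TILES_Y][TILES_X];

static float min3(float a, float b, float c) { float m = a < b ? a : b; return m < c ? m : c; }
static float max3(float a, float b, float c) { float m = a > b ? a : b; return m > c ? m : c; }

/* Add one triangle to every tile its bounding box overlaps. */
static void bin_triangle(const tri2d_t *t, int tri_index)
{
    int x0 = (int)min3(t->x[0], t->x[1], t->x[2]) / TILE_W;
    int x1 = (int)max3(t->x[0], t->x[1], t->x[2]) / TILE_W;
    int y0 = (int)min3(t->y[0], t->y[1], t->y[2]) / TILE_H;
    int y1 = (int)max3(t->y[0], t->y[1], t->y[2]) / TILE_H;

    /* Clamp to the visible tile grid; off-screen triangles bin nothing. */
    if (x0 < 0) x0 = 0;
    if (y0 < 0) y0 = 0;
    if (x1 >= TILES_X) x1 = TILES_X - 1;
    if (y1 >= TILES_Y) y1 = TILES_Y - 1;

    for (int ty = y0; ty <= y1; ++ty)
        for (int tx = x0; tx <= x1; ++tx) {
            tile_bin_t *bin = &bins[ty][tx];
            if (bin->count < MAX_TRIS_PER_TILE)
                bin->tri_index[bin->count++] = tri_index;
        }
}
```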

EDRAM that doesn't affect logic is like ray tracing: it's the technology of the future, and it always will be.
 
archie4oz said:
Then again I liked buffer math and thought that PixelPlanes was cool...
Ultra-wide MIMD 6502 FTW! :p

Screen Based Deferred Renderer (or at least that's what the scotch bottle is telling me).
Actually I meant Scanline in that context. To be fair, I didn't even coin the abbreviation myself; nAo did it *points finger*.
 
Based on some of the replies to my post the other night, I was thinking: if a custom, Microsoft-tailored Larrabee 2 (in 2011-2012) could offer a few powerful OoO cores with good single-threaded performance, plus a bunch more (improved) simple cores of the kind first-gen Larrabee will have (64 to 128 instead of 16-48), then Microsoft could have an awesome, Cell2-rivaling/beating CPU/GPGPU that also serves as the front end for a pixel pumper: a monster ATI GPU, or just a rasteriser with EDRAM like PS2's GS but generations ahead, or something like a Xenos 2. Thus, a two-chip console. The ATI chip would not do anything except render to the screen; it wouldn't need to calculate anything on the front end (vertex shaders, geometry shaders), which would all be in the hands of the custom Larrabee 2. The ATI chip could even be a split die on one package, logic + EDRAM. Just guessing.


You're describing a console with two monster GPGPU/GPU chips, one made by Intel and the other made by AMD, with different feature sets and instruction sets, competing on the same board :). That looks unlikely.

Cell 2 + GPU, while still a bit weird for my liking, at least makes more sense, as you would expect the SPUs to pack more general-purpose flops per area than a GPU or LRB, and the GPU to pack more pixel flops per area than LRB. It's what you're looking for.

Regarding Larrabee, I think there would be a single one in a console, not two, for the same reasons I don't believe at all in two-GPU consoles. What you gain in die size (or not, as you need two dies) is offset by the more complex board, interconnects, etc., and you don't solve anything regarding power. I think Larrabee in a console would be quite a huge chip, with some amount of redundancy, with or without OoOE cores, and it would balance its hugeness by being the only processing chip in the console (with the benefits of simplicity, the effectiveness of die shrinks, and the lead Intel has in silicon process).
 
Each LRB core is going to be larger than a Cell SPU or even the PPU.
If IBM is predicting a 32-SPU Cell, I am doubtful Intel is going to have 64+ LRB cores in the same time frame unless it is a really large chip, like >400mm2.
Assuming same time frame, same process, and same size, then obviously not.
From the Cell roadmap, it looks like the 32iv is set for '10/'11, probably at 45nm. I think Intel would be up to a 32nm LRB by '11/'12.
I would also think that an LRB core will be smaller than a PPU core (cache size and OoO buffers), so we can take into account the 4 PPU cores sitting in Cell as well, maybe making up for the size difference between LRB and SPU cores.
If we consider LRB for either discrete graphics or a single-chip console, it's not completely unreasonable to expect a 400+mm2 chip. (Wasn't G80 something like 480mm2?)
All in all, I stand by my original suggestion.


Cell 2 + GPU, while still a bit weird for my liking, at least makes more sense, as you would expect the SPUs to pack more general-purpose flops per area than a GPU or LRB, and the GPU to pack more pixel flops per area than LRB. It's what you're looking for.
The way graphics are trending, there's less and less difference between general-purpose flops and "pixel flops". General purpose might be too broad, but I'd say anything that falls into the category of vectorizable throughput computation should run well on LRB, and I think Intel has picked the right time to invest in this as the future direction for GPUs.

Regarding Larrabee, I think there would be a single one in a console, not two, for the same reasons I don't believe at all in two-GPU consoles. What you gain in die size (or not, as you need two dies) is offset by the more complex board, interconnects, etc., and you don't solve anything regarding power. I think Larrabee in a console would be quite a huge chip, with some amount of redundancy, with or without OoOE cores, and it would balance its hugeness by being the only processing chip in the console (with the benefits of simplicity, the effectiveness of die shrinks, and the lead Intel has in silicon process).
One chip to rule them all, eh :p

I completely agree with your argument with regard to power, and since you need to keep a console within a reasonable power budget, if you can exhaust that with a single chip, then that's probably the way to go.
I was wondering whether going with a bigger single die/chip complicates design and manufacturing. Layout gets more complicated: in particular, the ring would have to grow, and latencies across the chip get higher. Your external connections would have to grow to facilitate more memory channels to feed the chip (and although NUMA adds its own share of load-balancing issues, at least it's easier to scale).
In the end I think you're right: because of power limits, a single-chip solution is the most likely. I just wouldn't mind seeing two for the performance - assuming you won't be able to pack twice the number of cores on one die, and won't be able to scale the memory interface (internal and external) linearly.
 
It really isn't hard to run faster than a PPU on a per-clock basis. XCPU cores are significantly faster, though I expect LRB performance on scalar code to be better than the XCPU's (clock for clock).
 
I think with any (new) architecture there's always room for improvement.

...<snip>...

As for concrete examples, I think it's much too early to speculate (we don't even have Larrabee 1 specs). I think memory architecture is going to play an increasingly important role as we move to more and more parallel architectures, both for sheer throughput and for the latency of communication and synchronization, so I expect to see some work there.

For purely parallel problems, adding more hardware contexts to each core is a fairly easy way to turn a latency-bound problem into a bandwidth-bound one.

What is going to be interesting is when a massively threaded application has threads that need to communicate. That is, a significant amount of synchronization and data exchange has to occur.

Speculative lock elision will obviously boost basic mutex performance.

However, the demand-loaded nature of caches will be a problem when cores have to exchange data. Imagine one core (the producer) writing to a FIFO, and another (the consumer) reading from it. In a processor with a conventional MOESI coherency protocol, the producing core will get the cache line (with intent to write, invalidating all other copies) and put it in an exclusive state when writing it. The reading core then needs to obtain the cache line. A whole bunch of coherency traffic goes on.
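A minimal sketch of that pattern, assuming a plain C11 single-producer/single-consumer ring buffer; nothing here is LRB-specific, it just shows why every hand-off drags the slot's cache line (and the index lines) back and forth between the two cores under an invalidate-based protocol.

```c
/* Hypothetical single-producer/single-consumer FIFO using C11 atomics.
 * Under a MOESI-style protocol, each push pulls the slot's cache line
 * (and the head index) into the producer's cache in an exclusive state,
 * invalidating the consumer's copy; each pop pulls it back.  The line
 * therefore ping-pongs between the two cores on every hand-off. */
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

#define FIFO_SIZE 256  /* power of two */

typedef struct {
    int slots[FIFO_SIZE];
    _Atomic size_t head;   /* written by producer, read by consumer */
    _Atomic size_t tail;   /* written by consumer, read by producer */
} fifo_t;

/* Producer side: write the slot, then publish the new head. */
static bool fifo_push(fifo_t *f, int value)
{
    size_t head = atomic_load_explicit(&f->head, memory_order_relaxed);
    size_t tail = atomic_load_explicit(&f->tail, memory_order_acquire);
    if (head - tail == FIFO_SIZE)
        return false;                          /* full */
    f->slots[head & (FIFO_SIZE - 1)] = value;  /* line goes exclusive here */
    atomic_store_explicit(&f->head, head + 1, memory_order_release);
    return true;
}

/* Consumer side: read the slot, then publish the new tail. */
static bool fifo_pop(fifo_t *f, int *out)
{
    size_t tail = atomic_load_explicit(&f->tail, memory_order_relaxed);
    size_t head = atomic_load_explicit(&f->head, memory_order_acquire);
    if (tail == head)
        return false;                           /* empty */
    *out = f->slots[tail & (FIFO_SIZE - 1)];    /* line migrates back */
    atomic_store_explicit(&f->tail, tail + 1, memory_order_release);
    return true;
}
```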

Special loads and stores that bypass the data caches could help significantly. Stores to FIFOs (or similar structures) could even be handled by a special smaller and therefore faster L2-look-aside SRAM structure to lower load-to-use latencies.

There would still be some latency incurred from accessing lower-level caches, so the next step would be to add OoO capabilities. Data-capture schedulers are tiny, dense SRAM structures (look at an Athlon core floorplan or the new ARM Cortex-A9); decouple the super-wide SIMD registers from the main ROB to save space.

Cheers
 
The PPU is in-order. For its perf/W or perf/mm^2 of Si it's just a crappy design (as is the XCPU).
Yep, I was referring to the next-generation PPU. I thought they had plans to add out-of-order execution to bring the single-threaded perf up to scratch (seeing as they have the SPUs for heavy throughput work).

For purely parallel problems, adding more hardware contexts to each core is a fairly easy way to turn a latency-bound problem into a bandwidth-bound one.
As demonstrated by modern GPUs. How many algorithms does this map well to, though, and how well does it scale as the ratio of compute to external bandwidth worsens? As the number of cores and the request demand on external memory increase, there should be less and less idle time on the memory interface. In that situation, doesn't it become redundant to try to hide latency with more contexts, when you should rather try to optimize locality and reuse and minimize external (read: slow and energy-inefficient) traffic?
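A toy roofline-style calculation of that point, with made-up numbers: once attainable throughput is pinned at bandwidth times arithmetic intensity, extra contexts only hide latency that is no longer the limiter, and locality/reuse is the only lever left.

```c
/* Toy roofline estimate: once attainable throughput is capped by
 * bandwidth * arithmetic_intensity, adding more hardware contexts only
 * hides latency that is no longer the bottleneck.  All figures below
 * are illustrative placeholders, not real LRB numbers. */
#include <stdio.h>

int main(void)
{
    const double peak_gflops   = 1000.0;  /* hypothetical peak compute       */
    const double bandwidth_gbs = 100.0;   /* hypothetical external bandwidth */

    /* flops performed per byte fetched from external memory */
    const double intensities[] = { 0.5, 1.0, 4.0, 10.0, 40.0 };

    for (int i = 0; i < 5; ++i) {
        double ai         = intensities[i];
        double mem_bound  = bandwidth_gbs * ai;
        double attainable = mem_bound < peak_gflops ? mem_bound : peak_gflops;
        printf("intensity %5.1f flops/byte -> %7.1f GFLOP/s (%s-bound)\n",
               ai, attainable,
               mem_bound < peak_gflops ? "bandwidth" : "compute");
    }
    return 0;
}
```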


Speculative lock elision will obviously boost basic mutex performance.
If I understand SLE correctly, it should avoid a reader even needing exclusive access to anything. Today a reader would, at a minimum, have to write to the lock primitive. This is definitely something that looks interesting and practical in the short term.
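A trivial pthreads example of why that matters; whether real SLE hardware would elide this particular lock is an assumption, the point is just that today even a pure reader writes the mutex word on the way in and out.

```c
/* Readers that never touch shared *data* still write the lock word when
 * acquiring/releasing a plain mutex, so the line holding the mutex bounces
 * between cores.  Speculative lock elision would execute the critical
 * section speculatively and skip those writes when there is no conflict. */
#include <pthread.h>

static pthread_mutex_t table_lock = PTHREAD_MUTEX_INITIALIZER;
static int table[1024];

int lookup(int key)
{
    pthread_mutex_lock(&table_lock);    /* write to lock word: line goes exclusive */
    int value = table[key & 1023];      /* the only shared-data access is a read   */
    pthread_mutex_unlock(&table_lock);  /* another write to the lock word          */
    return value;
}
```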

Special loads and stores that bypass the data caches could help significantly. Stores to FIFOs (or similar structures) could even be handled by a special smaller and therefore faster L2-look-aside SRAM structure to lower load-to-use latencies.
How feasible is it to automatically and transparently treat shared cache lines differently (either via heuristic analysis or program hinting), rather than changing the programming model?
I saw this paper: http://portal.acm.org/citation.cfm?...l=Portal&dl=ACM&CFID=2303156&CFTOKEN=37913594 at SPAA this year, but it doesn't look terribly scalable to me..
I also heard murmurs of transactional memory support, but no one from Intel would comment either way on this :p
 
At the cost of streaming transformed vertices through memory in frame-sized chunks.
A cost that, IMO, one would be pleased to pay compared to the complexity and costs of having EDRAM on board. On the other hand, it seems that NVIDIA and AMD treat TBDR designs like the boogeyman; they can't even name them (perhaps, as you suggested a long time ago, it's all due to patent wars and stuff..).
 
Special loads and stores that bypass the data caches could help significantly.
I don't see how that would help with the FIFO case, unless you mean bypassing the local cache. So what you would need is to be able to lock cache lines to a specific processor.
 
I don't see how that would help with the FIFO case, unless you mean bypassing the local cache. So what you would need is to be able to lock cache lines to a specific processor.

Yes, bypassing the data caches for both cores, but no, not locking the cache line to a specific processor. I want the cache line to reside in the L2. I want semantics, or at least hints, on the loads and stores to tell the L2 apparatus that the data just stored is likely to be read by some other core (or possibly by a different context on the same core).

This alone would save on coherency traffic to and from the data caches. The next step is to make a special chunk of logic to handle these loads and stores, i.e. a subset of the L2 with much faster access times to lower communication latencies.
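Purely to make the idea concrete, here is what such hinted accesses might look like from the software side; store_shared_hint and load_shared_hint are invented names standing in for hypothetical hardware support, with plain atomic fallbacks so the sketch compiles.

```c
/* Hypothetical hinted loads/stores (invented names): the hint tells the L2
 * apparatus that the stored data is expected to be read soon by another core
 * or context, so it should stay in the shared L2 (or a faster look-aside
 * structure) rather than sitting exclusive in the writer's data cache. */
#include <stdatomic.h>

/* Fallback definitions so this compiles today: ordinary atomic accesses.
 * Real support would map these to special load/store encodings. */
static inline void store_shared_hint(_Atomic int *addr, int value)
{
    atomic_store_explicit(addr, value, memory_order_release);
}

static inline int load_shared_hint(_Atomic int *addr)
{
    return atomic_load_explicit(addr, memory_order_acquire);
}

/* Producer publishes a FIFO slot without pulling the line into its own data
 * cache exclusively; the consumer reads it without a cache-to-cache transfer. */
void publish_slot(_Atomic int *slot, int value) { store_shared_hint(slot, value); }
int  consume_slot(_Atomic int *slot)            { return load_shared_hint(slot); }
```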

Cheers
 
A cost that, IMO, one would be pleased to pay compared to the complexity and costs of having EDRAM on board. On the other hand, it seems that NVIDIA and AMD treat TBDR designs like the boogeyman; they can't even name them (perhaps, as you suggested a long time ago, it's all due to patent wars and stuff..).

Unless the GigaPixel IP cannot be used because it somehow infringes on IMG Tech IP (and we know how angry Simon gets when you touch his IP), NVIDIA should have the basic building blocks for a TBDR...
 
MfA said:
At the cost of streaming transformed vertices through memory in frame-sized chunks.
Unlike most other GPUs, LRB should be able to walk the scene trees itself (in fact, if it's anything like what it's cut out to be, it could be damn good at it as well), so why would it have to do "geometry capture" at the top-scene level?
And dated as it is, we talked about stuff like that back in the "Realizer" patent days.
 
Unlike most other GPUs, LRB should be able to walk the scene trees itself (in fact, if it's anything like what it's cut out to be, it could be damn good at it as well), so why would it have to do "geometry capture" at the top-scene level?
And dated as it is, we talked about stuff like that back in the "Realizer" patent days.
Unless Intel provides a scenegraph API, that means you "need to manually tile" ... where it happens is irrelevant.

Intel could go a long way toward providing something close enough to a scenegraph API for our purposes by creating an OpenGL extension for per-display-list bounding volumes (roughly sketched below).

PS. like this ... or rather, like this.
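Just to make that suggestion concrete, a hypothetical sketch of what a per-display-list bounds extension could look like to the application. glDisplayListBoundsEXT and its function-pointer type are inventions for illustration; glGenLists/glNewList/glEndList/glCallList are standard GL.

```c
/* Hypothetical "per display list bounding volume" extension sketch.
 * glDisplayListBoundsEXT is an invented entry point; with bounds attached,
 * an implementation (e.g. a LRB-style software pipeline) could cull whole
 * display lists against the view frustum before walking them. */
#include <GL/gl.h>

typedef void (*PFNGLDISPLAYLISTBOUNDSEXTPROC)(GLuint list,
                                              GLfloat minx, GLfloat miny, GLfloat minz,
                                              GLfloat maxx, GLfloat maxy, GLfloat maxz);
/* Would be fetched via the usual extension-loading mechanism if it existed. */
static PFNGLDISPLAYLISTBOUNDSEXTPROC glDisplayListBoundsEXT;

static GLuint build_object(void)
{
    GLuint list = glGenLists(1);
    glNewList(list, GL_COMPILE);
    /* ... emit the object's geometry here ... */
    glEndList();

    /* Attach an axis-aligned bounding box for the whole list (unit cube here). */
    if (glDisplayListBoundsEXT)
        glDisplayListBoundsEXT(list, -1.f, -1.f, -1.f, 1.f, 1.f, 1.f);
    return list;
}

static void draw_object(GLuint list)
{
    /* With bounds known, the implementation is free to skip the call when
     * the box lies entirely outside the current view frustum. */
    glCallList(list);
}
```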
 