Larrabee at Siggraph

Discussion in 'Architecture and Products' started by nAo, Jun 2, 2008.

  1. aaronspink

    Veteran

    Joined:
    Jun 20, 2003
    Messages:
    2,641
    Likes Received:
    64
Almost all caches of any reasonable size have multiple banks; the issue is how much those banks are exposed. The more exposed they are, the higher the latency, and the greater the area and power impact as well.

But fundamentally, anything you can do with a local store can be done equally well with a cache. The whole local store architecture of CELL is one of its biggest drawbacks. In general, local stores significantly complicate programming as well.
     
  2. MfA

    MfA
    Legend

    Joined:
    Feb 6, 2002
    Messages:
    7,610
    Likes Received:
    825
    I wonder how they do the texture cache ... will they simply add a couple of cycles of extra latency so they can merge read accesses or will they use multiple banks with narrow ports?
     
  3. Gubbi

    Veteran

    Joined:
    Feb 8, 2002
    Messages:
    3,661
    Likes Received:
    1,114
    It's certainly different.

If one looks at an SPE through 'normal' programming model glasses, an SPE is just a processor with a two-level register file, the local store forming the second level, and no cache. I suspect a future CELL would do well to have a big fat shared cache in front of the memory interface. I still don't think it would beat processors with normal cache hierarchies.

    Cheers
     
  4. 3dilettante

    Legend Alpha

    Joined:
    Sep 15, 2003
    Messages:
    8,579
    Likes Received:
    4,799
    Location:
    Well within 3d
    The buzz prior to the release of details screamed something a little different.
I believe the memory model is a good one, but the naive parroting of sound bites about it was nearly as silly as the "OMG realtime raytracer!!!" fluff articles floating around the net.
     
  5. nAo

    nAo Nutella Nutellae
    Veteran

    Joined:
    Feb 6, 2002
    Messages:
    4,400
    Likes Received:
    440
    Location:
    San Francisco
SPU patents describe such a cache, though no CELL variant out there has one, afaik.
On the other hand, with or without a cache the programming model would still be the same, unless the SPU ISA gets extended.
     
  6. Andrew Lauritzen

    Andrew Lauritzen Moderator
    Moderator Veteran

    Joined:
    May 21, 2004
    Messages:
    2,629
    Likes Received:
    1,227
    Location:
    British Columbia, Canada
    That's an extremely difficult question to answer in any sort of precise way you know ;) Not only does it depend entirely on the algorithm used (and indeed you might use a different algorithm for it on different architectures), but if you take it all the way to database-land where only IO matters (which is not going to be an unreasonable model in the future I think), the complexities are entirely dependent on your ability to touch non-local data as infrequently as possible. In this land, large caches/local stores win hands-down as increasing block sizes reduces the number of "passes" (in GPU parlance) over the data set.
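A toy sketch of that pass-count argument (my own illustration, not from the post): if each pass can reduce every block of in-core items to one partial result, a bigger block directly cuts the number of passes over the data set.

```python
import math

def passes(n_items, block_size):
    """Number of full passes over the data when each pass reduces
    every block of `block_size` items to a single partial result."""
    p = 0
    remaining = n_items
    while remaining > 1:
        remaining = math.ceil(remaining / block_size)
        p += 1
    return p

# 16M items: a much larger in-core block drops a whole pass
print(passes(1 << 24, 256))    # small local store/cache -> 3 passes
print(passes(1 << 24, 65536))  # larger on-chip block    -> 2 passes
```

For purely IO-bound work, each saved pass is a full read (and often write) of the out-of-core data set, which is why large caches/local stores win in that model.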

    The concept here is similar to many algorithms actually: you want to extract the "minimum" amount of parallelism out of your problem to keep all of the cores busy in general, and run the serial algorithm on the in-core data set. This is precisely how reductions, scans, segmented scans, sorts and many other primitives are implemented on parallel processors, and that's entirely what you're doing when you use CUDA's local store in most cases. It's just that the programming model has you looking out from inside some nested loops so it doesn't make it clear that you're really just writing SIMD code over a block of data in local store.
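A minimal sketch of that pattern (my own, not from the post; plain Python threads standing in for cores): extract just one block per core, run the serial algorithm on each in-core block, then combine the partials serially.

```python
from concurrent.futures import ThreadPoolExecutor

def block_sum(block):
    # the serial algorithm, run over the "in-core" data set
    s = 0
    for x in block:
        s += x
    return s

def parallel_sum(data, n_cores=4):
    # extract the "minimum" parallelism: one block per core
    size = -(-len(data) // n_cores)  # ceil division
    blocks = [data[i:i + size] for i in range(0, len(data), size)]
    with ThreadPoolExecutor(max_workers=n_cores) as pool:
        partials = list(pool.map(block_sum, blocks))
    return block_sum(partials)  # final serial combine of partials

print(parallel_sum(list(range(1000))))  # 499500
```

This is the same shape as a CUDA block reduction: the inner loop is the serial code over the local-store block, and the outer structure only exists to keep the cores busy.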

    So while multi-banked memories like CUDA's local store are useful, so are bigger caches, more ALUs and any number of other things that you could spend the transistors on :).
     
  7. TimothyFarrar

    Regular

    Joined:
    Nov 7, 2007
    Messages:
    427
    Likes Received:
    0
    Location:
    Santa Clara, CA
In the terms you have described, it seems to me that for similar ALU capacity between Larrabee and, say, an NVidia-style GPU, Larrabee might only have a 1.5 to 2.0x size advantage in the in-core data set, and be at a disadvantage in terms of utilization of out-of-core data bandwidth. So I'm still skeptical of this idea that Larrabee is going to be a huge win. For the problems I would like to use Larrabee to solve, it still seems as if out-of-core bandwidth utilization is more important.

    Perhaps I'm way off base here, but taking my other simple example, I don't really expect a huge win for Larrabee in overall time for sorting 16M elements on shipping hardware with similar ALU capacity.

However, there is one area in which it seems as if Larrabee might have quite an advantage, and that is general scatter where the scatter has some locality.

NVidia and ATI GPUs have a readable write-combining cache for each ROP/OM unit, right? It seemed as if NVidia at one time had plans to expose this surface cache in CUDA (.surf in the PTX spec), but that never materialized (and neither did programmable blending, for which it might have been used). If the Larrabee model (SIMD + scatter/gather + R/W caching) ends up being the best thing since sliced bread, it seems like other GPUs adopting an accessible surface cache might be a good evolution (to give bandwidth reduction on scatter). Of course latency could be far better on Larrabee, but the overall memory bandwidth reduction could be similar.
     
  8. Davros

    Legend

    Joined:
    Jun 7, 2004
    Messages:
    17,879
    Likes Received:
    5,330
  9. nAo

    nAo Nutella Nutellae
    Veteran

    Joined:
    Feb 6, 2002
    Messages:
    4,400
    Likes Received:
    440
    Location:
    San Francisco
It might be a case of history repeating. David Kirk gave a similar interview 3 or 4 years ago (we simulated a unified architecture, etc.) and we know how that ended. Perhaps we should expect them to release something very close to Larrabee in 2010 ;)
     
  10. Andrew Lauritzen

    Andrew Lauritzen Moderator
    Moderator Veteran

    Joined:
    May 21, 2004
    Messages:
    2,629
    Likes Received:
    1,227
    Location:
    British Columbia, Canada
    ... enough said about the intelligence of that article. ;)
     
  11. Scali

    Regular

    Joined:
    Nov 19, 2003
    Messages:
    2,127
    Likes Received:
    0
    Some things that jumped out at me in that article:

    I think this is a pretty weird statement to make. The information that Intel released was aimed at the point that they could scale better BECAUSE they reduced bandwidth requirements, among other things.

    This might go for nVidia, but Intel works within a different environment. For example, nVidia didn't decide to design the G80 10 years ago. The time wasn't right. nVidia didn't have the expertise yet, there was no major API that would require it (nor would the rest of the PC be up to the task of driving it), and it would be impossible to manufacture with the state of chip manufacturing at the time.

    Intel is ahead in chip manufacturing, and Intel has expertise in areas that nVidia doesn't have yet (and vice versa). So it might not be the right design for nVidia at this point. But it could be for Intel. It could also be the right design for nVidia some time in the future.

This is something that I've been saying as well. Cuda is the real power of G80 and beyond, and we have yet to see what AMD's answer to it will be.
     
  12. Jawed

    Legend

    Joined:
    Oct 2, 2004
    Messages:
    11,708
    Likes Received:
    2,132
    Location:
    London
  13. Jawed

    Legend

    Joined:
    Oct 2, 2004
    Messages:
    11,708
    Likes Received:
    2,132
    Location:
    London
  14. 3dilettante

    Legend Alpha

    Joined:
    Sep 15, 2003
    Messages:
    8,579
    Likes Received:
    4,799
    Location:
    Well within 3d
    The ring bus is physically situated on top of the L2s, which would save space. It might hint at some kind of rough correlation between bus width and L2 tile size. An L2 that didn't match or exceed the physical width of the bus would leave space that would need filling.

    I wonder if the ring bus sits a metal layer or two above the signal lines needed to signal the cache and link it to its local core.


    Other things to note:
    Intel's slide seems to indicate Larrabee's moved to MOESI.
This probably makes sense: with so many cores modifying data on that ring bus, it's better that the modifications not write back out over memory.
    This moves Larrabee:
    1) in an AMD direction
    2) in a direction not matching Intel's QPI

    There's a "td" box situated at the ends or at the crosslinks between ring busses. Some kind of directory?


There's a block for "fixed function" that is separate from the PCI-E, display unit, textures, and memory controller.
    I wonder what's left that they haven't mentioned yet.
     
  15. Jawed

    Legend

    Joined:
    Oct 2, 2004
    Messages:
    11,708
    Likes Received:
    2,132
    Location:
    London
    Isn't it normal for any kind of bus to run over logic - with only repeater logic consuming area in "islands"?

What would happen as process shrinks kick in? Would the L2 shrink more rapidly than the bus?

    Presumably also good since with core scaling you get multiple rings and ultimately, I guess, multiple chips linked together.

    No idea :sad:

    ---

    In rummaging around www.intel.com I found this paper on game physics:

    Game Physics Performance on the Larrabee Architecture

    Doesn't really provide any architecture insights though.

    Jawed
     
  16. 3dilettante

    Legend Alpha

    Joined:
    Sep 15, 2003
    Messages:
    8,579
    Likes Received:
    4,799
    Location:
    Well within 3d
    Cell's EIB doesn't route over the local stores. The EIB has a fair amount of dedicated logic sitting right in the center of the die.

    The coherent bus in Larrabee might have made the case for distributing it amongst the caches.

    That's what I'm curious about.
    SRAM compacts pretty well with process. Logic less so. Interconnect beyond the lowest levels scales more slowly, and the higher layers are at higher geometries.
    It might depend on just where the bus is running.

    There might be a design-specific inflection point where the work in compacting all the signal lines balances with the challenges of running it at speed versus the space savings of keeping the L2 physically small and the desire to have more L2.
     
  17. Jawed

    Legend

    Joined:
    Oct 2, 2004
    Messages:
    11,708
    Likes Received:
    2,132
    Location:
    London
    And probably "saves space" by having EIB control logic passed over by interconnects.

    All I'm saying is that in terms of the die as a whole, "space saving", by laying a bus over logic is normal.

Now it might be that laying a bus over RAM is the easiest configuration. I don't know; I suppose it's a question of the impact of repeater logic islands on L2 latency (due to increasing the radius of the L2).

    Jawed
     
  18. 3dilettante

    Legend Alpha

    Joined:
    Sep 15, 2003
    Messages:
    8,579
    Likes Received:
    4,799
    Location:
    Well within 3d
    I wouldn't expect them to route completely around the EIB logic, since the EIB logic deals with the interconnect directly. I don't have a high res shot of Cell's EIB section, but it looks like part of the path the signals go through is in the logic block that takes up die space.

    Routing the interconnect over the dedicated EIB logic is a bit different than lofting it over non-dedicated silicon.
    Perhaps Cell also routes some of its bus over other silicon, I haven't seen a diagram for that.

    But it's also not required. The fact that IBM aggregated the EIB's logic in one place instead of distributing it indicates there are other considerations.

    Given the likely size of the L2 caches and the relatively simple ring bus scheme, I'm not sure there would be enough to be prohibitive.
    My question is what happens when the SRAMs shrink, and if the ring bus will scale accordingly or if Intel will relax a bit and allow the cache capacity to go up to take pressure off of the ring bus designers.
     
  19. Jawed

    Legend

    Joined:
    Oct 2, 2004
    Messages:
    11,708
    Likes Received:
    2,132
    Location:
    London
    I've got a high res shot of Cell, but I don't know how it would help you to discern anything about the routing of the bus lines...

    A cache is also "non-dedicated silicon", so the question of whether any "area saving" applies is moot.

    I'm merely saying that it seems to be normal to route an interconnect over un-related logic.

    If the cache increases in capacity then that affects latency. Clearly, it's too early to tell how sensitive to cache latency Larrabee will be. Arguably, as the number of cores rises, any slight increase in L2 latency will be overwhelmed by ring-induced cache-coherency latency, and other scaling factors. So maybe it doesn't matter so much?

In the end it seems extremely unlikely to me that the bulk of the ring bus in Larrabee is formally restricted to the area of the die occupied by L2 - I don't think there's much value in assigning a 1:1 scaling question-mark over these two components (RAM + bus) of this subsystem. If the bus overspills the RAM in future, it will lower the density of the logic it flies over - but this is just another version of the scaling problem for physical I/O, where analogue stuff scales poorly.

    Jawed
     
  20. 3dilettante

    Legend Alpha

    Joined:
    Sep 15, 2003
    Messages:
    8,579
    Likes Received:
    4,799
    Location:
    Well within 3d
    That was my point. Cell has a contiguous area of the die devoted to the ring bus and its logic. I'm not privy to the details of the design, but I have a hard time accepting it is wholly made up of repeater blocks.

The dominant factor for that is area, and latency roughly scales with the square root of the physical area of the cache.
    If the SRAMs shrank, we'd expect better latency.
    If the cache capacity were expanded to give roughly equivalent area, we'd have the same latency with more capacity.
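A back-of-the-envelope sketch of that scaling (my own, assuming wire delay across the array dominates): latency tracks the cache's linear dimension, i.e. the square root of its area, which covers both cases above.

```python
import math

def rel_latency(rel_area):
    # wire delay across the array grows with its linear dimension,
    # i.e. roughly sqrt(area); latency relative to the baseline cache
    return math.sqrt(rel_area)

# shrink the SRAM to half the area -> ~0.71x the latency
print(round(rel_latency(0.5), 2))
# double the capacity at the same area-per-bit -> ~1.41x the latency
print(round(rel_latency(2.0), 2))
# grow capacity exactly as cells shrink (same area) -> same latency
print(round(rel_latency(1.0), 2))
```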

Perhaps I'm reading too much into the part of the slides that said that the ring bus is physically layered on top of the L2.

    It might require the redesign or rerouting of all the logic it flies over, possibly at the expense of poorer density in logic that already scales worse than SRAM.
    Depending on how large an L2 tile is compared to its directly linked compute core, the penalty may be worse if the logic expands.

The SRAMs might not require too many additional layers for their signalling; the more complex logic of the cores might have uses for the interconnect at the altitude of the ring bus, plus whatever margin of safety is needed to keep the two layers from interfering with one another.
     