Larrabee at Siggraph

armchair_architect · Aug 20, 2008

PeterT said:
In the terminology you use here, are the manual shared memory management and coalescing requirements of CUDA part of the programming model or just of NVs current implementation?

Coalescing would be implementation. You could relax those rules significantly and (a) existing code would still run without change, and (b) apps that do uncoalesced access would automatically go faster. Whether global memory is cached or not is also in this category.

The on-chip shared memory is part of the programming model. You can implement it with normal cache and good cache eviction policy controls (like Larrabee apparently has). But your code has to be written to deal with it explicitly or you get no benefit, and it affects your algorithms and data structures. So I'd call it part of the programming model.

I wouldn't be upset if shared memory evolved in the Larrabee direction. The important thing is to be able to guarantee very low latency and very high bandwidth to a chunk of data used by a group of cooperating threads. Generic caching isn't good enough, IMHO.

MfA · Aug 20, 2008

armchair_architect said:
The on-chip shared memory is part of the programming model. You can implement it with normal cache and good cache eviction policy controls (like Larrabee apparently has). But your code has to be written to deal with it explicitly or you get no benefit.

You have to make sure your shared data fits in the cache and you don't trash the cachelines with poorly chosen access patterns ... that's a lot less work than managing a local store.

That said, Larrabee has hardly implemented something which is a pure superset of NVIDIA's shared memory ... scatter/gather within a vector will bring bandwidth for the cache to it's knees, but works at full bandwidth with shared memory.

PS. oops nm, Larrabee can scatter/gather at full speed within a vector ... still NVIDIA with it's banked architectures will in general achieve far higher bandwidth when kernels perform irregular accesses inside the shared memory.

Tridam · Aug 20, 2008

nAo said:
Hopefully they released some more details..

No more details actually. Pat Gelsinger confirmed in a round table that they plan to take the performance lead at launch even in D3D and OGL emulation. He also said that Larrabee will be high end so they don't mind if it sucks lot of power even at idle hehe

It was also said to me that everything running on Nehalem will run on Larrabee implying Larrabee vector ISA is a superset of SSE 4.2.

Most interesting part about Larrabee at this IDF was probable Larry Seiler singing alleluia on stage while Pat, a bit ashamed, was wondering why he invited him on stage

3dilettante · Aug 20, 2008

Tridam said:
He also said that Larrabee will be high end so they don't mind if it sucks lot of power even at idle hehe

That's a complete inversion of Intel's previously stated strategy of targeting the volume segments first, one that I had just come around to being the most prudent.

I also feel the pain of power supplies everywhere.

aaronspink · Aug 21, 2008

MfA said:
You have to make sure your shared data fits in the cache and you don't trash the cachelines with poorly chosen access patterns ... that's a lot less work than managing a local store.

IF you can manage a local store YOU CAN EASILY manage cache blocking esp if you have cache placement attributes. Caches are always easier to deal with than local stores.

Then enable you to block what is required and no have to worry about blocking what isn't required for performance. Caches aren't around in just about every silicon processing design for backwards compatibility when they could of easily saved area and power by using a control store.

That said, Larrabee has hardly implemented something which is a pure superset of NVIDIA's shared memory ... scatter/gather within a vector will bring bandwidth for the cache to it's knees, but works at full bandwidth with shared memory.

PS. oops nm, Larrabee can scatter/gather at full speed within a vector ... still NVIDIA with it's banked architectures will in general achieve far higher bandwidth when kernels perform irregular accesses inside the shared memory.

possibly, possibly not. Scatter/Gather is primarily used to speed up the instruction stream by enabling the instruction stream to continue executing while the scatter gather takes place.

While optimization to reduce the bandwidth overhead of the scatter/gather are nice, you always will have a bandwidth overhead with scatter/gather type workloads/designs with modern architectures and components.

This includes having highly banked caches/local stores. While they can reduce some of the overhead they aren't magic. you still have to deal with bank conflicts, conflicting accesses, etc.

aaronspink · Aug 21, 2008

3dilettante said:
That's a complete inversion of Intel's previously stated strategy of targeting the volume segments first, one that I had just come around to being the most prudent.

I also feel the pain of power supplies everywhere.

the pain of power supplies? We're already in the 300W+ range already with graphics, whats a little more! Some configs are actually already pulling down upwards of 700W+!

ShaidarHaran · Aug 21, 2008

aaronspink said:
the pain of power supplies? We're already in the 300W+ range already with graphics, whats a little more! Some configs are actually already pulling down upwards of 700W+!

What graphics card consumes more than 225W under full load?

nAo · Aug 21, 2008

MfA said:
PS. oops nm, Larrabee can scatter/gather at full speed within a vector ... still NVIDIA with it's banked architectures will in general achieve far higher bandwidth when kernels perform irregular accesses inside the shared memory.

Good luck with having the data you need already there, plus we know next to nothing about LRB cache architecture.

aaronspink · Aug 21, 2008

ShaidarHaran said:
What graphics card consumes more than 225W under full load?

4870x2 has shown testing numbers that put it well beyond 225W under load.

most SLI/CF configs easily exceed 225W and most 400+ watts. Tri-SLI 280 is probably close to 700W.

PeterT · Aug 21, 2008

MfA said:
You have to make sure your shared data fits in the cache and you don't trash the cachelines with poorly chosen access patterns ... that's a lot less work than managing a local store.

I agree, especially since at least on current NV cards you also have to worry about your local shared memory access patterns.

pcchen · Aug 21, 2008

PeterT said:
I agree, especially since at least on current NV cards you also have to worry about your local shared memory access patterns.

I think with Larrabee you still have to worry about cache access patterns. For example, it would be pretty bad if you load your vectors where each elements are from different cache lines.

Jawed · Aug 21, 2008

It seems the intention is to program Larrabee so that fetches from disparate L2 lines have their fetch latency hidden by other threads/fibres. L1 is used to cohere these fetches.

Jawed

3dilettante · Aug 21, 2008

aaronspink said:
the pain of power supplies? We're already in the 300W+ range already with graphics, whats a little more! Some configs are actually already pulling down upwards of 700W+!

And those numbers are already insipid.
What would one call opening the field to a SKU class that draws in more than that?

We can fully expect the competition to widen its wattage numbers if Larrabee goes that route, meaning things get even more stupid than a tri GT200s.

By the way, if I start seeing 3 slot coolers (edit: cards), I will kill somebody.

ShaidarHaran · Aug 21, 2008

aaronspink said:
4870x2 has shown testing numbers that put it well beyond 225W under load.

I miscalculated the total potential power delivery previously (8 pin + 6 pin). I was working off dual 6 pin numbers.

aaronspink said:
most SLI/CF configs easily exceed 225W and most 400+ watts. Tri-SLI 280 is probably close to 700W.

That's multiple cards though. I don't think anyone would argue that it would be difficult to exceed 225W with multiple high-end graphics cards

Enforcer · Aug 21, 2008

There is one thing im worried about:
- nothing was told about multi-chip solutions

- Larrabee has its own memory interface, like all GPU today. Many chips+RAM can work independently
- Larrabee's fully programmable architecture should help (ex good scaling without AFR, efficient memory management), but nothing mentioned about it
- If multi-chip solution possible, some kind of high speed interconnect needed,but i didnt find anything about it either
- Intel brings QuickPath Interconnect (QPI) bus for next gen CPU's. It's logical to expect the same bus for Larrabee interconnect. If it was the case it should already be mentioned, right?
(oops i just realised how slow it is! 12.8Gb/sec in each direction (Nehalem)... it woulndn't help with anything more serious than FB transfer in AFR)

- if Larrabee targets high-end, it should be able to compete with AMD/Nvidia multi-chip solutions
- i think Larrabee can get marketing advantage by providing an alternative to AFR (is cashe/bus coherency mechanism helps?)
- multi-chip scalability is essential to long-term success

MfA · Aug 21, 2008

aaronspink said:
This includes having highly banked caches/local stores. While they can reduce some of the overhead they aren't magic. you still have to deal with bank conflicts, conflicting accesses, etc.

At least you have the luxury of being able to deal with them, at worst it will get as bad as Larrabee's cache ... being consistently slow when accessing multiple vectors is not a plus.

MfA · Aug 21, 2008

PeterT said:
I agree, especially since at least on current NV cards you also have to worry about your local shared memory access patterns.

You have to worry about it more on Larrabee (it needs a cycle for every vector accessed in a vectorized read, whereas NV cards need a cycle for every bank conflict which is a number up to 16 times smaller).

nAo · Aug 21, 2008

Mfa, you keep forgetting that you data don't automagically end up on chip on their own, you are comparing apples to oranges. A multibanked local memory is cool, but you have to get your data there in the first place.

Andrew Lauritzen · Aug 21, 2008

nAo said:
Mfa, you keep forgetting that you data don't automagically end up on chip on their own, you are comparing apples to oranges. A multibanked local memory is cool, but you have to get your data there in the first place.

Indeed - non-coalesced reads/writes in CUDA are not happy things... so much so that you generally have to use the local storage to do your scatter/gather coalesce and throw out any data that you don't need. It starts to sound pretty similar to hardware cache lines at this point.

So as nAo says, certainly banked local memories are neat, but arguably they're yet another highly non-linear point in the already impossibly-difficult optimization space of writing fast CUDA code. People *do* run into bank conflicts in CUDA and they are an insipid beast that is hardware dependent but requires software workarounds... arguably it's easier to deal with "just try to keep your data contiguous and you pay something when it is not", since that's in-line with the physics of the situation anyways. If you want to multi-"bank" your memory layout in software too, you're always welcome to do that, and in a way that's a lot more explicit and with fewer hidden performance cliffs.

Anyways in the end I think the two models are actually more similar than they may seem. Incoherent memory accesses have a cost - and will continue to - so they're always undesirable, and its in your best interest to minimize them and capture coherence wherever possible. This is very true for all parallel processors, and even true to a lesser extent for CPUs.

nAo · Aug 21, 2008

Andrew Lauritzen said:
Anyways in the end I think the two models are actually more similar than they may seem. Incoherent memory accesses have a cost - and will continue to - so they're always undesirable, and its in your best interest to minimize them and capture coherence wherever possible. This is very true for all parallel processors, and even true to a lesser extent for CPUs.

Yep, it seems to me that LRB memory model allows it to scale much better in the future.
LRB2 might be able to fetch 2 cache lines per clock cycle during a gather operation (ok, it wouldn't be cheap but it can be definitely done) and this would benefit any past or future application, while CUDA memory model won't scale perf that nicely.

Larrabee at Siggraph

armchair_architect

MfA

Tridam

3dilettante

aaronspink

aaronspink

ShaidarHaran

hardware monkey

nAo

Nutella Nutellae

aaronspink

PeterT

pcchen

Moderator

Jawed

3dilettante

ShaidarHaran

hardware monkey

Enforcer

MfA

MfA

nAo

Nutella Nutellae

Andrew Lauritzen

Moderator

nAo

Nutella Nutellae

Similar threads