Larrabee at Siggraph

Discussion in 'Architecture and Products' started by nAo, Jun 2, 2008.

  1. Jawed

    Legend

    Joined:
    Oct 2, 2004
    Messages:
    11,708
    Likes Received:
    2,132
    Location:
    London
At the same time it seems like Larrabee takes heed of some of the techniques for using its L2 cache from Cell. With Cell the programmer divvies up LS into a number of regions for different types of access. The L2 cache control instructions in Larrabee sound like they're doing much the same - treating L2 as "directly addressable" and allowing the programmer to specify the size and behaviour of regions.

    The paper gives an example of marking some L2 cache lines as low-priority so that a thread can stream data through L2 without trampling all over other lines.

    Jawed
     
  2. 3dilettante

    Legend Alpha

    Joined:
    Sep 15, 2003
    Messages:
    8,579
    Likes Received:
    4,799
    Location:
    Well within 3d
    Another thing is that the tiling scheme for rasterizing is designed to keep as much as possible local to a core's L2 with minimal writing to common structures.
    It's somewhat ironic that all the noise was "coherent caching!!!!111!!!!" and Intel's eventual qualifier was "now try not to use it".

    Not knowing details, I wonder about the exact implementation. Is it setting a particular cache line, or setting whatever cache line is hit by a special load/prefetch?

    The latter involves less work and less intelligence on the part of the cache controller, but is Larrabee able to specify that demand loads be done in this manner, or only prefetches?
     
  3. nAo

    nAo Nutella Nutellae
    Veteran

    Joined:
    Feb 6, 2002
    Messages:
    4,400
    Likes Received:
    440
    Location:
    San Francisco
    I don't find it ironic at all. When coherency is low-hanging fruit and it improves performance a lot while keeping your overall software-renderer architecture highly scalable and future-proof, it would be idiotic not to exploit it.
     
  4. 3dilettante

    Legend Alpha

    Joined:
    Sep 15, 2003
    Messages:
    8,579
    Likes Received:
    4,799
    Location:
    Well within 3d
    It actually does meet the definition of irony when everyone is fed the idea that coherent caching is an unqualified second coming and it is then shown that it has to be used judiciously.
     
  5. aaronspink

    Veteran

    Joined:
    Jun 20, 2003
    Messages:
    2,641
    Likes Received:
    64
    FYI, I already have a 3-slot cooler and I installed it myself. It gives me 2x+ better cooling at <23 dB of noise, something that just isn't possible with 2-slot cooling. It also enables me to get ~40% more performance.
     
  6. Enforcer

    Newcomer

    Joined:
    Apr 17, 2008
    Messages:
    32
    Likes Received:
    0
    Some thoughts:
    Incoherent reads are up to 16 times slower than on Nvidia.
    (Coherent reads can be slower on Nvidia if data isn't partitioned properly, but that's highly unlikely: why would anyone use shared memory that way? Nvidia supports one broadcast per cycle, i.e. many threads reading the same address.)

    Is it 33% of logic or (logic+L1+L2)?

    Intel's Atom has 47 million transistors with 512 KB of L2, a 4 W TDP, and speeds up to 2.2 GHz. What can we expect from a 45nm Larrabee?
    2 Tflops = 32 cores x 2 GHz isn't impossible...

    Simulated FEAR performance looks very promising (~120 average FPS 1600x1200x4AA at 1GHz x24 cores, ex. GTX280 has avg 140fps in fear internal benchmark),
    however 50% time wasted on Rasterization+DepthTest (rendering shadow volumes i guess) looks kinda scary, isnt it?
    Too bad the results aren't directly comparable as we dont know which frames were used...

    Ring bus:
    Only 128 GB/sec at 2 GHz??
    That's lower than the memory bandwidth of current cards (140 GB/s on the GTX 280).
    What about scaling beyond 24-32 cores? Can they increase the bus width in the future?
    R600 had a 512+512 ring bus too...

    Nothing was told about multi-chip communication and scaling...
     
  7. 3dilettante

    Legend Alpha

    Joined:
    Sep 15, 2003
    Messages:
    8,579
    Likes Received:
    4,799
    Location:
    Well within 3d
    After-market or some kind of special edition card I haven't seen yet?
     
  8. Jawed

    Legend

    Joined:
    Oct 2, 2004
    Messages:
    11,708
    Likes Received:
    2,132
    Location:
    London
    Larrabee can hide the latency of incoherent reads as it uses L1 to accumulate the data before the thread resumes. NVidia takes incoherent reads on the chin.

    The price Intel pays appears to be fully manual context switching, though, with VPU cycles lost - so Larrabee can't actually hide all that latency.

    So, overall, it looks like we'll just have to benchmark it :lol:

    It seems to me that a core is defined as logic+L1+L2. Each L2 is only used by its core. Cores can only access foreign L2s under the cache-coherency protocol, which is effectively a request to fetch data to make a local copy.

    If you take account of the fact that a ring bus normally supports multiple packets per direction per clock (between non-overlapping start-end segments) then you get more bandwidth. Also, the average trip length per direction is rather less than half the circumference.

    Interestingly, with the huge amount of bandwidth that Larrabee saves in render target operations, it means that texturing will take up a far larger proportion of the overall bandwidth of each frame than we see on FF GPUs.

    So the TUs, while they're likely to be equally distributed around the ring, will also incur the highest average ring-bus trip lengths. They'll be fetching texels from all MCs and providing results back to all cores. That's my impression, anyway.

    The TUs have their own cache. I dare say I'm assuming this cache is distributed per TU - though they're likely to be able to share texels amongst themselves.


    So, the TUs will be using the ring bus pretty heavily and lowering the effective bandwidth somewhat because of the relatively long trips they'll incur - and per texture result that is:
    • a request packet from the core
    • a TU cache coherency request and response, if the TUs share texels
    • a TU fetch command to multiple MCs (fetch + pre-fetch)
    • texels fetched from memory by multiple MCs
    • texture results returned to requesting core
    Jawed
     
  9. 3dilettante

    Legend Alpha

    Joined:
    Sep 15, 2003
    Messages:
    8,579
    Likes Received:
    4,799
    Location:
    Well within 3d
    Possible, but with the caveat that Intel up until now has not considered the L2 to be part of the core for its CPUs, although this is possibly changing with Nehalem.

    The Larrabee paper doesn't quite define the relationship, but some of the wording seems to indicate that in the authors' minds there is a distinction of sorts.
     
  10. aaronspink

    Veteran

    Joined:
    Jun 20, 2003
    Messages:
    2,641
    Likes Received:
    64
    The Thermalright HR-03 cooler. Aftermarket, of course; I haven't seen a graphics card shipped with anything but the most basic low-cost design in forever. As a general rule, the vendor-supplied coolers are some of the worst-designed and worst-built things in the heatsink universe: horrible quality control, internally bent/blocked fins, missing solder, etc.
     
  11. MfA

    MfA
    Legend

    Joined:
    Feb 6, 2002
    Messages:
    7,610
    Likes Received:
    825
    The headaches it causes are irrelevant. It's about cost/benefit, and being consistently slow is never an advantage. Sure, cache vs. shared memory is only an apples-to-apples comparison when your algorithm has a local data set which fits in shared memory ... but I assumed that much was obvious.
    You could argue that, but it's not a tenable argument. Let's say for a moment that Larrabee could implement its cache as banked with zero area overhead ... would you still argue they shouldn't, just because you are too lazy to make use of it?
    In the end a multi-banked architecture will always be as fast or faster; no amount of tweaking your data layout or other software tricks can change that.
     
  12. Kaotik

    Kaotik Drunk Member
    Legend

    Joined:
    Apr 16, 2003
    Messages:
    10,244
    Likes Received:
    4,462
    Location:
    Finland
    Not sure if it's been mentioned here already, but Larrabee utilizes some sort of ring bus, apparently similar to what ATI used.
    A few slides from the presentation at IDF showed it (images not preserved here).

    According to Larry Seiler, the first products will appear in late 2009 or 2010, and the first product will be a PCI Express add-in card, which will work alongside a CPU and possibly a GPU.
     
  13. Andrew Lauritzen

    Andrew Lauritzen Moderator
    Moderator Veteran

    Joined:
    May 21, 2004
    Messages:
    2,629
    Likes Received:
    1,227
    Location:
    British Columbia, Canada
    I'm just saying it's not as simple as what you're saying. Indeed in my experience getting data into local memory is usually more of an issue than operating on it there, and generally that's the step where you do your data conversions, coalescing, etc.

    In reality though everything has a cost, be it hardware or otherwise. Typically I'd prefer to have a larger cache/local memory, and thus more chance of fitting bigger blocks of the problem into it, than too much cleverness like banking, etc. And seriously, optimizing CUDA has nothing to do with laziness... it is quite literally beyond the ability of mere mortals, as several recent papers have demonstrated :)

    Sure, but nowadays it's all about the hardware cost/trade-off, and no argument is very useful (particularly when comparing architectures) without that component. Honestly it's not useful to say things like "Core 2 is faster than Atom"... clearly the actual hardware trade-offs matter.

    In this case, while it's fun to imagine an architecture with a different bank for every byte (or even bit!) of memory, I can't see that being a good use of transistors ;) Thus given the types of applications that I see, I don't think a multi-banked memory architecture would be a huge win, and it's probably best to spend the transistors elsewhere (such as on more local memory). But hey, I'm happy to be proven wrong on that one, and I certainly don't know the details of how much hardware it takes :)
     
    #513 Andrew Lauritzen, Aug 22, 2008
    Last edited by a moderator: Aug 22, 2008
  14. aaronspink

    Veteran

    Joined:
    Jun 20, 2003
    Messages:
    2,641
    Likes Received:
    64
    While we're going off into la-la land, let's make the local store infinite in size as well.

    There are LOTS of cases where the local store will be slower. And guess what, local stores never scale either. Local stores are as much of a dead end on GPUs as they were on CPUs.
     
  15. TimothyFarrar

    Regular

    Joined:
    Nov 7, 2007
    Messages:
    427
    Likes Received:
    0
    Location:
    Santa Clara, CA
    How about we take this to a practical example, like optimally sorting 16 million objects (say {sort key, object id} pairs). Does the cache (Larrabee) vs banking+local store (NVidia) argument still apply?

    Assuming the same number of registers, if the compute-to-cache or compute-to-local-store ratios are 1.5 to 2.0 in favor of Larrabee, how much of that advantage is lost to poor cache utilization, and which architecture ends up with better performance in terms of utilization of uncached bandwidth to main memory?
     
  16. crystall

    Newcomer

    Joined:
    Jul 15, 2004
    Messages:
    149
    Likes Received:
    1
    Location:
    Amsterdam
    Well, the whole description of their software renderer screams "there's no free lunch". The synchronization between the setup and work threads is done without using real thread-synchronization primitives; however, it is doable only because the 4 threads (1 setup and 3 workers) have been physically pinned to a specific core (or at least, that's how I understand their description). They probably use compare-and-exchange instructions without a LOCK prefix for this purpose.

    In other words, Larrabee is a general-purpose solution, but it cannot be used in the usual way when doing a GPU's work if you want it to perform decently.

    So-called non-temporal loads/stores have already been used, besides the usual non-temporal prefetches. The mechanism is simple: the cache line involved in the transfer has its (pseudo-)LRU counter set to a value which makes it automatically eligible for eviction on the next cache miss. I guess we'll see something along these lines on Larrabee too.
     
  17. crystall

    Newcomer

    Joined:
    Jul 15, 2004
    Messages:
    149
    Likes Received:
    1
    Location:
    Amsterdam
    I'd take those results with a *massive* grain of salt, for two reasons: first, they are simulated, and second, most of the data needed to evaluate them is missing. How many TUs were simulated? Which layout was used for the ring network above 16 cores? The details on it are sketchy (multiple short linked rings)... Was AF used? It is not mentioned, so my guess is no. What are the supposed die area and power consumption of the various setups (8, 16, 24 cores, etc.)?

    Another thing which bugs me is this quote from the paper when describing how they obtained the simulated results:

    "We wrote assembly code for the highest-cost sections"

    To me it is crystal clear that all the stages of the renderer will go through a JIT compiler, not only the shaders, as all the functionality that has been moved from hardware to software requires JIT compilation. That quote suggests that they do not yet have such a compiler in place, or that its output is not good enough - something I find hard to believe, as Intel has an excellent compiler team - but in both cases it seems to me that their software stack still lacks some *huge* parts, and we cannot easily guess at the potential performance of Larrabee. We simply do not have enough data.
     
  18. MfA

    MfA
    Legend

    Joined:
    Feb 6, 2002
    Messages:
    7,610
    Likes Received:
    825
    If you think you can give a realistic estimate of the area cost, by all means ... contribute something useful.

    Having multiple banks is not about local store vs cache ... it's orthogonal (works with both).
     
  19. Jawed

    Legend

    Joined:
    Oct 2, 2004
    Messages:
    11,708
    Likes Received:
    2,132
    Location:
    London
    A cache will have to be multi-banked just to support multiple concurrent reads and writes, won't it?

    Jawed
     
  20. MfA

    MfA
    Legend

    Joined:
    Feb 6, 2002
    Messages:
    7,610
    Likes Received:
    825
    You are confusing ports and banks I think.
     