Larrabee delayed to 2011 ?

Panajev2001a · May 18, 2010

MfA said:
If SPUs were 16 wide.

Which is another of the "SPU v2 I would like" requests nAo was making

(increased vector width) IIRC.

MfA · May 18, 2010

Meh ... if cache to cache communication is fast enough you can use neighbouring processors to work on large vectors, but it's impossible to split up a SIMD to handle divergent workloads.

rpg.314 · May 18, 2010

The article didn't have much meat to it, given the amount of words expended.

You nailed it.

Panajev2001a · May 18, 2010

MfA said:
Meh ... if cache to cache communication is fast enough you can use neighbouring processors to work on large vectors, but it's impossible to split up a SIMD to handle divergent workloads.

It depends on what's the most common scenario they want to optimize for... I think they expect bad thread divergence to be low and thus handle divergence the same way nVIDIA handled it so far.

MfA · May 18, 2010

Just about the time where people want to switch to software renderers with micropolygons.

rpg.314 · May 18, 2010

I think the converged pipeline means LRBni 2.0 on top of a fat ooo x86 core.

I know it sounds odd.

aaronspink · May 18, 2010

MfA said:
Meh ... if cache to cache communication is fast enough you can use neighbouring processors to work on large vectors, but it's impossible to split up a SIMD to handle divergent workloads.

Except you would have to worry about sync issues between the two SPUs. And from a design perspective, even if you can only use say 8 lanes on average, 1x16 << power than 2x8, and also << area as well. And that rolls over into interconnect pressure, scheduling complexity, etc.

Realistically, for reasonably data parallel workloads, you want your SIMD size equal to your memory/cache block size for maximal efficiency.

MfA · May 18, 2010

Raytracing just for instance is not a reasonably data parallel workload (not for the hard stuff, ie. non primary/shadow rays).

Andrew Lauritzen · May 18, 2010

MfA said:
Raytracing just for instance is not a reasonably data parallel workload (not for the hard stuff, ie. non primary/shadow rays).

Right but it's not clear to me that you can do much better than CPU-like designs for those highly-irregular workloads anyways. 16-wide SIMD seems like a good sweet spot for most workloads.

MfA · May 18, 2010

Raytracing is not very highly data parallel, but it's still massively parallel ... I'd still take a Larrabee over an I7 for this kind of problem, the more cores the better.

Jawed · May 18, 2010

Ray tracing is mostly a random walk through cache lines, though.

One of the nice things about Larrabee's SIMD-16 is that all lanes are peers without going through shared memory - they're just a swizzle away (though swizzle bandwidth is limited and it can take a few swizzles to get to where you want). Neither ATI's nor NVidia's architecture exposes such high-bandwidth parallelism on that kind of scale - which coincides with the branching granularity of the SIMD.

In this case you can have a work item process an entire cache line (node in the data structure) across all 16 lanes. i.e. make the entire SIMD work on evaluating the node simultaneously. In a similar fashion to the parallel rasterisation algorithm.

Voila, no branching incoherency penalty, as there's only one path per work item, not 16 in the conventional data-parallel style of ray-/path-tracing.

Sure, each hardware thread needs to have multiple work items to hide memory latencies, but each work item is independent and notionally fully utilises the SIMD-16.

Jawed

Andrew Lauritzen · May 18, 2010

MfA said:
Raytracing is not very highly data parallel, but it's still massively parallel ... I'd still take a Larrabee over an I7 for this kind of problem, the more cores the better.

You can get quite a few i7 cores now in multi-socket systems and OOE logic and higher frequencies make a big difference. If you're not using the SIMD I would imagine that you're not going to do a lot better than a good CPU.

3dilettante · May 18, 2010

The cores per dollar argument is pretty strong in Larrabee's favor. Its target price bracket would have put 32 cores at a price approximately half of a single high-end hexacore's price, absent the cost of the board and not even counting the additional cost of a dual-socket. Any more sockets and we're wandering into orders of magnitude more costs.

We might expect 12 i7 cores in a DP configuration which could cost 5-10 times a single Larrabee card.
The octal-core and Xeon MP setups are in another pricing tier beyond that.

Andrew Lauritzen · May 19, 2010

3dilettante said:
The cores per dollar argument is pretty strong in Larrabee's favor.

Accepting the many assumptions in your post, sure, but we were just discussing the suitability of various architectures for MIMD-style code. The majority of the power of GPUs is in their SIMD units (Larrabee included, but obviously less than other GPUs) and they perform significantly less well with heavily divergent code/data structures.

I'm obviously a huge fan of GPU/Larrabee-like architectures, but it's worth noting places where latency-optimized processors shine, and they're going wide at a similar rate to GPUs.

3dilettante · May 19, 2010

The assumption was that a Larrabee card was released at its target clocks at its target price range.
That's 32 ~2GHz cores, with a price ceiling of $500-600.

With a Gulftown hexacore, the price of the chip alone can meet or exceed the board (by a lot with an extreme edition).
With more than one, we have that multiplier plus possibly a small dual-socket board premium.

Intel's market segmentation charges a significant premium for socket counts and core counts that are higher.

We can have much more modest prices with quad cores, but with a drop in core count.
Within a reasonable dual-socket realm, we can have 8 or 12 cores in total.
The 8 core scenario sounds the most cost effective, but that is a factor of 4 disadvantage in core count and a factor of 8 in thread count.
The costs ramp very quickly beyond that.

Larrabee, for all its possible weaknesses, would have been a lot of silicon for a very depressed price thanks to what it was targeting. This probably explains why those involved with managing Intel's Xeon product lines were not so enthusiastic about it.

kyetech · May 19, 2010

Questions

So whats the projected core count on i7 / ix cores in 2012?
What will the trannie budget for for the LRB3 be?
IF they are revising/streamlining the LRB core what do you think the core count will be for LRB3 in 2012? 64 / 96 / 128?
And more generally if they target late 2012 what process will they make it on? 22nm ? (It will prob need to).

Andrew Lauritzen · May 19, 2010

3dilettante said:
The assumption was that a Larrabee card was released at its target clocks at its target price range.
That's 32 ~2GHz cores, with a price ceiling of $500-600.

Not to dispute, but where are you getting these numbers from?

3dilettante said:
With a Gulftown hexacore, the price of the chip alone can meet or exceed the board (by a lot with an extreme edition).

You can't really consider the street price of these things when comparing architectural efficiency. I imagine the margins for CPUs and GPUs are pretty different.

3dilettante said:
Intel's market segmentation charges a significant premium for socket counts and core counts that are higher.

Again, see above. While this may be relevant to an end user building a system, I was discussing overall efficiency of an architecture for a given workload.

3dilettante said:
The 8 core scenario sounds the most cost effective, but that is a factor of 4 disadvantage in core count and a factor of 8 in thread count.

Not sure you can directly compare "thread count" like that between these architectures...

3dilettante said:
Larrabee, for all its possible weaknesses, would have been a lot of silicon for a very depressed price thanks to what it was targeting.

That much I completely agree with

Just give some credit to CPUs where it is due.

rpg.314 · May 19, 2010

kyetech said:
Questions

So whats the projected core count on i7 / ix cores in 2012?

What will the trannie budget for for the LRB3 be?

IF they are revising/streamlining the LRB core what do you think the core count will be for LRB3 in 2012? 64 / 96 / 128?

And more generally if they target late 2012 what process will they make it on? 22nm ? (It will prob need to).

I'd expect ~10Tflops sp. Whichever way with whatever pipelines they get there. :???:

3dilettante · May 19, 2010

Andrew Lauritzen said:
Not to dispute, but where are you getting these numbers from?

The clock range for Larrabee was initially put out in slides that had it ranging from 1.5 GHz to 2.5 GHz. Granted, those slides were old and did not plan on an 32-core variant.

There was also a goal of 1 TFLOPS in DP.
Running at 2 GHz at 32 cores for SP is 2 FMADD*16*32*2GHz = 2 TFLOP SP = 1 TFLOP DP.

You can't really consider the street price of these things when comparing architectural efficiency. I imagine the margins for CPUs and GPUs are pretty different.

Why can't I buy two Larrabees?
Saying that an unbounded number of i7 cores in an unbounded number of sockets is not a fair comparison to Larrabee, which apparently is being assumed to be singular.

A realistic implementation is not going to be an 8-socket board populated with Nehalem EX processors.

Again, see above. While this may be relevant to an end user building a system, I was discussing overall efficiency of an architecture for a given workload.

Efficiency in what terms?
Cost? Multisocket is going to lose there.
Perf/Watt?
Larrabee is rumored to have had power issues, yes. A quad-socket of 130-150W CPUs is not going to be far behind.

Not sure you can directly compare "thread count" like that between these architectures...

It's 4 hardware threads per core for Larrabee, 2 for Nehalem. There are significant differences in implementation, but nearly an order of magnitude difference should count for something.

Andrew Lauritzen · May 19, 2010

3dilettante said:
Why can't I buy two Larrabees?
Saying that an unbounded number of i7 cores in an unbounded number of sockets is not a fair comparison to Larrabee, which apparently is being assumed to be singular.

I'm not saying that at all... I'm say cost to produce is more relevant for comparing the efficiency of a processor than cost to consumer (which includes profit margins).

3dilettante said:
It's 4 hardware threads per core for Larrabee, 2 for Nehalem. There are significant differences in implementation, but nearly an order of magnitude difference should count for something.

The different hardware threads are just to cover various latencies, etc. They cannot all execute an instruction in the same clock. See the Larrabee architecture paper:

Switching threads covers cases where the compiler is unable to schedule code without stalls. Switching threads also covers part of the latency to load from the L2 cache to the L1 cache, for those cases when data cannot be prefetched into the L1 cache in advance. Cache use is more effective when
multiple threads running on the same core use the same dataset, e.g. rendering triangles to the same tile.

To consider the benefit of these "HW thread" implementations then, you need to consider the memory architecture, which is very different between the two. Sure Larrabee theoretically has twice as many hardware threads with which to hide latencies, but GPU memory latencies are typically far more than 2x longer than CPUs.

Again, I'm not really arguing with you per se, just pointing out that the real power of throughput architectures is the SIMD. They make trade-offs that make them much less impressive when running (even multi-threaded) scalar C code.

Larrabee delayed to 2011 ?

Panajev2001a

MfA

rpg.314

Panajev2001a

MfA

rpg.314

aaronspink

MfA

Andrew Lauritzen

Moderator

MfA

Jawed

Andrew Lauritzen

Moderator

3dilettante

Andrew Lauritzen

Moderator

3dilettante

kyetech

Andrew Lauritzen

Moderator

rpg.314

3dilettante

Andrew Lauritzen

Moderator