Larrabee delayed to 2011 ?

I'm not saying that at all... I'm saying cost to produce is more relevant for comparing the efficiency of a processor than cost to the consumer (which includes profit margins).
I suppose since neither Larrabee nor at-cost Xeons and multi-socket boards are commercially procurable (unless you're building a top500 supercomputer), we can call this a draw.

The different hardware threads are just to cover various latencies, etc. They cannot all execute an instruction in the same clock. See the Larrabee architecture paper:
Only 1/4 of the threads can execute a pair of instructions in a given clock (one thread issues per core), which works out to 64 instructions per clock across 32 dual-issue cores. For the usage we are debating, this is more achievable since there is so little peak issue width to waste.
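Spelling out that arithmetic (a back-of-envelope sketch, assuming the commonly described 32-core part):

/* Larrabee: 32 cores x 4 threads = 128 HW threads; one thread per
   core can issue each clock (1/4 of them), on a dual-issue pipeline. */
int lrb_threads   = 32 * 4;   /* 128 hardware threads                 */
int lrb_insts_clk = 32 * 2;   /*  64 instructions per clock, peak     */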

For 8 i7 cores, some combination of threads can execute at most 32 instructions per clock in aggregate, maybe 40 with optimal macro-op fusion. That peak is harder to reach, and it matters less for a highly thread-parallel workload.
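The same sketch for the dual-socket Nehalem side (4-wide decode per core; the fusion bonus assumes every decode group contains one fusible pair, which is the best case):

/* Dual-socket Nehalem: 8 cores, 4-wide decode each                   */
int i7_insts_clk = 8 * 4;     /* 32 instructions per clock, peak      */
int i7_fused_clk = 8 * 5;     /* ~40 if macro-op fusion lets the 4    */
                              /* decode slots carry 5 instructions    */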

To consider the benefit of these "HW thread" implementations, then, you need to consider the memory architecture, which is very different between the two. Sure, Larrabee theoretically has twice as many hardware threads with which to hide latencies, but GPU memory latencies are typically far more than 2x longer than a CPU's.
Larrabee's not sufficiently threaded to cover any significant amount of memory latency, and neither is Nehalem.
Larrabee's hardware prefetch capabilities were never disclosed and may not have been developed at all. Given the penalty prefetching can impose on bandwidth-heavy workloads, I wouldn't expect it to be as aggressive as Nehalem's.

In terms of memory accesses, Nehalem has 48 entries in its load buffer, which works out to an aggregate 384 outstanding loads across 8 cores. Larrabee is in-order and would not attempt that level of memory speculation. At bare minimum, I'd expect at least one outstanding load per thread, for at least 128.
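As a sanity check on those figures (assuming 8 Nehalem cores and 128 Larrabee hardware threads, per the earlier sketch):

/* Outstanding loads, aggregate across the chip(s)                    */
int i7_loads  = 8 * 48;       /* 48-entry load buffer per core = 384  */
int lrb_loads = 128 * 1;      /* >= 1 pending load per HW thread = 128,
                                 a conservative lower bound           */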

On the other hand, if Larrabee has a memory bus roughly equivalent to other GPUs, it would take 2-4 Nehalems to match its bandwidth.
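For a rough sense of scale (both figures are assumptions: Larrabee's memory system was never disclosed, and the GPU number is typical of GDDR5 boards of that era):

/* Ballpark memory bandwidth in GB/s (assumed figures)                */
float gpu_bw = 120.0f;        /* contemporary GDDR5 board (assumed)   */
float i7_bw  =  32.0f;        /* triple-channel DDR3-1333, per socket */
                              /* 120 / 32 ~= 3.75 -> "2-4 Nehalems"   */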

Again, I'm not really arguing with you per se, just pointing out that the real power of throughput architectures is the SIMD. They make trade-offs that make them much less impressive when running (even multi-threaded) scalar C code.
I've already made note of how horrid Larrabee would be on x86 code without vector instructions, in part because taking the VPU away leaves very little else it can run.
However, even inefficient usage of the VPU, like say predicating 3/4 of the lanes off, would still give it some interesting strengths.
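To illustrate what predicating lanes off means in practice, here's a minimal scalar sketch of masked 16-wide SIMD semantics (plain C, not actual LRBni code; masked_add is a made-up name):

/* One 16-wide VPU op with lanes masked off: the op still occupies a
   full issue slot, but only the enabled lanes do useful work.        */
void masked_add(float dst[16], const float a[16], const float b[16],
                unsigned mask)                /* one bit per lane     */
{
    for (int lane = 0; lane < 16; ++lane)
        if (mask & (1u << lane))
            dst[lane] = a[lane] + b[lane];    /* predicated-on lanes  */
}
/* mask = 0x000F -> 4 of 16 lanes active, i.e. 1/4 of peak throughput */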

edit:
Even with 15/16 lanes off, if only because I don't know what other FP math support it would have.
Nehalem could do an FP MUL, ADD, and store per clock.
Larrabee can do one VPU op and a vector store.
If it's an FMADD, then each Larrabee core has similar throughput, although in restricted circumstances.
If there is no dependent add, then it's either an FMUL and store, or an FADD and store.
That's 2/3 the issue capability of a Nehalem core, and in the case of a dual-socket i7 that's with 4 times as many cores.
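Putting the per-core issue comparison in one place (a sketch based on the split above; the core counts are the ones used earlier in the thread):

/* Per-core FP issue per clock                                        */
int i7_fp      = 3;           /* FP MUL + FP ADD + store              */
int lrb_fmadd  = 3;           /* FMADD (mul+add counted as 2) + store */
int lrb_nodep  = 2;           /* FMUL-or-FADD + store -> 2/3 of i7    */
int core_ratio = 32 / 8;      /* but 4x the cores vs a dual-socket i7 */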
 
However, even inefficient usage of the VPU, like say predicating 3/4 of the lanes off, would still give it some interesting strengths.
Sure, I'm just not sure that ray tracing with 1-4 lanes active with more cores but smaller caches per core is one of those strengths. Maybe it is though, we'll see.

That's 2/3 the issue capability of a Nehalem core, and in the case of a dual-socket i7 that's with 4 times as many cores.
Sure, but at a lower clock rate and requiring more parallelism to fill all of its hardware threads (which never scales perfectly). I agree it still has some advantages even with 1 lane, but it's in the same ballpark. It's with good VPU use that it starts to look particularly impressive :)
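Folding clock rate into the earlier issue sketch (both clocks are assumptions here; Larrabee's shipping clock was never disclosed):

/* Aggregate scalar issue rate, G-instructions/s (assumed clocks)     */
float lrb_issue = 32 * 2 * 1.5f;   /* ~96 at an assumed 1.5 GHz       */
float i7_issue  =  8 * 4 * 3.0f;   /* ~96 at an assumed 3.0 GHz       */
/* same ballpark on scalar code; the 16-wide VPU is the differentiator */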
 
An Update On Our Graphics-related Programs

http://blogs.intel.com/technology/2010/05/an_update_on_our_graphics-rela.php

1. Our top priority continues to be around delivering an outstanding processor that addresses every day, general purpose computer needs and provides leadership visual computing experiences via processor graphics. We are further boosting funding and employee expertise here, and continue to champion the rapid shift to mobile wireless computing and HD video - we are laser-focused on these areas.

2. We are also executing on a business opportunity derived from the Larrabee program and Intel research in many-core chips. This server product line expansion is optimized for a broader range of highly parallel workloads in segments such as high performance computing. Intel VP Kirk Skaugen will provide an update on this next week at ISC 2010 in Germany.

3. We will not bring a discrete graphics product to market, at least in the short-term. As we said in December, we missed some key product milestones. Upon further assessment, and as mentioned above, we are focused on processor graphics, and we believe media/HD video and mobile computing are the most important areas to focus on moving forward.

4. We will also continue with ongoing Intel architecture-based graphics and HPC-related R&D and proof of concepts.
Jawed
 
So,

1. Our top priority continues to be around delivering an outstanding processor that addresses every day, general purpose computer needs and provides leadership visual computing experiences via processor graphics. We are further boosting funding and employee expertise here, and continue to champion the rapid shift to mobile wireless computing and HD video - we are laser-focused on these areas.
More GMA crap. :mad:

2. We are also executing on a business opportunity derived from the Larrabee program and Intel research in many-core chips. This server product line expansion is optimized for a broader range of highly parallel workloads in segments such as high performance computing. Intel VP Kirk Skaugen will provide an update on this next week at ISC 2010 in Germany.
Hmm... smells nice. The new server line could be something big. Wider SMT? LRB1 being sold as an on-socket accelerator? :LOL:

3. We will not bring a discrete graphics product to market, at least in the short-term. As we said in December, we missed some key product milestones. Upon further assessment, and as mentioned above, we are focused on processor graphics, and we believe media/HD video and mobile computing are the most important areas to focus on moving forward.
No LRB this year for sure.
4. We will also continue with ongoing Intel architecture-based graphics and HPC-related R&D and proof of concepts.
More experimental stuff.
 
So,

More GMA crap. :mad:

To be honest, the HD Graphics on Arrandale/Clarkdale isn't too bad; it's very close to 780G/880G and is only beaten by MCP79/89.

Sandy Bridge should offer almost double the performance, and IMHO that should be plenty for most people.

No LRB this year for sure.
More experimental stuff.

By short term I'm sure they mean ~5 years. And according to Anand it's been canned completely; it simply isn't on the roadmap anymore.

Sony might be feeling a little burnt from RSX, but who knows...

What exactly is wrong with RSX? :???:
 
How well does GMA HD perform mm2 for mm2?

Would be funny if Intel massively scaled up their IGP to compete with Fusion; it would be an indictment of Larrabee in a way.
 
Charlie holds firm to his previous article, and somehow I think he is right: the statement is no different from the one made in fall 2009, and people are over-reading it.
Charlie's paper is here
 
BSN: Why Intel Larrabee Really Stumbled: Developer Analysis

http://www.brightsideofnews.com/new...lly-stumbled-developer-analysis.aspx?pageid=1
This is not a summary by any means, just a few bits I thought would be thought-provoking.

They can take advantage of their existing world-dominating x86 architecture and its incredible performance.
So, you are gaining cache locality for the frame buffer [where you need it less], but losing cache locality for textures and shaders, where it is potentially 1000s of times more critical.
This is called cache coherence, and it requires a lot of on-chip communication. In fact, for large numbers of cores, it requires really quite a huge amount of on-chip communication.
The inter-core communication saturated the extremely costly [in transistor terms] 1024-bit bi-directional bus [similar to the one ATI used in the R600 GPU], and efficiency dropped with each added core. Effectively, with Larrabee, Intel created a non-scalable chip architecture.

The biggest shock for Larrabee was the discovery that it takes longer to implement something in software on a multi-core CPU than custom hardware.
 
I for one would LOVE an LRB as an HPC part if it came with a decent OCL environment. LRB as a GPU is quite a different thing.

However, before I even touch it with a 10-foot pole, I would be curious about the longevity of its architecture. It has to last ~2 years in the market, which is clearly not enough to sustain R&D. And the last thing I want to do is code for a dead-end architecture like Cell.

It becomes a question of how different LRB3 will be.
 
I for one would LOVE an LRB as an HPC part if it came with a decent OCL environment. LRB as a GPU is quite a different thing.

However, before I even touch it with a 10-foot pole, I would be curious about the longevity of its architecture. It has to last ~2 years in the market, which is clearly not enough to sustain R&D. And the last thing I want to do is code for a dead-end architecture like Cell.

It becomes a question of how different LRB3 will be.

Personally I think the "SCC" architecture is probably a better fit for the HPC market. However, it's probably not that suitable for the consumer market (although it'd be interesting to see how an "SCCed Larrabee" performs on GPU workloads).
 
Yeah, not sure I agree with all of the analysis in that article. Certainly some of it is true, but other parts are a bit questionable/speculative (for instance, the note about the cache coherency of tiling vs. immediate-mode rendering misses the critical role that tile sizes play in this trade-off). I also have big issues with the discussion of fixed-function vs. programmable hardware, and with the "sell" at the end of the article of doing GPU stuff *and* moving CPU stuff to the GPU. That last part misses fundamental points about the algorithmic efficiency of solving various problems on different architectures.
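On the tile-size point, a quick footprint check shows why it matters (the tile and pixel format sizes are my assumptions; the 256 KB L2 per core is from the Larrabee paper):

/* Binned frame-buffer tile vs Larrabee's 256 KB per-core L2          */
int tile_bytes = 128 * 128 * (4 + 4);   /* 128x128 px, 32-bit color   */
                                        /* + 32-bit depth = 128 KB    */
/* A tile that fits in L2 keeps frame-buffer traffic on-chip, so the  */
/* coherency cost of tiling depends heavily on the tile size chosen.  */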

Not sure I buy his scaling arguments either.
 