Intel's Single Chip Cloud Computer

Discussion in 'Architecture and Products' started by Jawed, May 24, 2010.

  1. dkanter

    Regular

    Joined:
    Jan 19, 2008
    Messages:
    360
    Likes Received:
    20
    How can LRB have a 'stock' clock, when there are no SKUs?

    You're missing my point. I think they probably would have traded off density for better power consumption.

    I also don't think it necessarily means breaking TDP. Depends on your cooling. If you cool well, you lower Tj and your transistors run cooler, consume less power, etc. etc.

    Penryn hit markets in late 2007. If LRB had hit markets in late '08 (which would have taken a miracle), it probably would have been much better. They would have had a density advantage, a performance advantage, a big power advantage (HKMG + normal process shrink gains) and perhaps, just perhaps, those advantages could have made up for other issues.

    However, that's clearly impossible scheduling-wise. A more interesting question is: what if they had skipped 45nm and gone straight to 32nm at the end of 2010? Again, while their competition was still on 40nm. They'd have a half-node density advantage, mature yields, a substantial transistor performance advantage, etc.

    The point is that Intel should have figured out how to leverage their process technology advantages more heavily (to make up for software disadvantages), rather than falling victim to an overly aggressive schedule and coming to market at parity.

    David
     
  2. DavidC

    Regular

    Joined:
    Sep 26, 2006
    Messages:
    347
    Likes Received:
    24
    Windsor is 90nm though.

    Yeah, you're right, the values are wrong.

    Brisbane 154 million, 126mm2

    Still, not that impressive, because Isaiah performs far worse than Windsor.
     
  3. Jawed

    Legend

    Joined:
    Oct 2, 2004
    Messages:
    10,873
    Likes Received:
    767
    Location:
    London
    Yet it appears to be rather important as the basis of quite a few "forward looking" multi-core projects.

    Maybe it was a "too simple to fail" ethos?

    L1 bandwidth is vast, so I suppose it's then a question of L2 bandwidth. EDIT: just noticed that BLAS3 is shown with >50x scaling at 64 cores in figure 20 of Seiler - although that's based on 1GHz cores and we don't know how core clock and ring bus clocks scale, nor what off-chip bandwidth is like. Though that graph is based solely on a simulation.
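
    For a sense of what that figure implies (a back-of-envelope sketch, assuming the ~50x-at-64-cores number read off the graph): parallel efficiency is just speedup over core count, and inverting Amdahl's law gives the serial fraction the simulation would imply.

    ```c
    /* Back-of-envelope check on the >50x BLAS3 scaling in Seiler's figure 20.
     * Inputs are approximate values read off a graph, not measurements. */
    #include <stdio.h>

    int main(void) {
        double S = 50.0;  /* observed speedup (approximate) */
        double N = 64.0;  /* simulated core count */

        /* Amdahl: S = 1 / ((1-p) + p/N), solved for the parallel fraction p. */
        double p = (1.0 - 1.0 / S) * N / (N - 1.0);

        printf("parallel efficiency: %.1f%%\n", 100.0 * S / N);         /* ~78%   */
        printf("implied serial fraction: %.2f%%\n", 100.0 * (1.0 - p)); /* ~0.44% */
        return 0;
    }
    ```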

    But yes, it seems inexcusable to have struggled to reach 1 TFLOP single precision like that.

    Jawed
     
  4. Jawed

    Legend

    Joined:
    Oct 2, 2004
    Messages:
    10,873
    Likes Received:
    767
    Location:
    London
    How much of its problems are to do with software (drivers) though?

    I get the distinct impression that of the hardware and software bites that Intel took, the software one was a tougher chew. Larrabee is essentially dependent upon per-state optimisation for traditional D3D/OGL pipeline rendering. Separating tasks by category and partitioning categories to cores gets them so far...

    Jawed
     
  5. EduardoS

    Newcomer

    Joined:
    Nov 8, 2008
    Messages:
    131
    Likes Received:
    0
    Yeah, Isaiah sucks, but I was just pointing at its density, especially the density of its cache. VIA would need many more billions to make a decent processor, but as an example of density it's fine.
     
  6. Jawed

    Legend

    Joined:
    Oct 2, 2004
    Messages:
    10,873
    Likes Received:
    767
    Location:
    London
    So what you're basically saying is that the chump change it would have cost Intel to do the job properly just wasn't worth doing?

    I'm not trying to use transistor density as an absolute metric, just a first approximation. We don't have much of a good idea how big the chip is...

    So what you're saying is that Intel can only do CPUs. Everything else is scrap that occupies production lines that would otherwise be rotting.

    Everyone's power-limited. That's no excuse, it's the norm.

    I don't see how Larrabee, if it launched summer 2009, could have been competitive in terms of die size/power - it was rumoured to be GTX285 performance, best-case. Also I wouldn't take it as read that it was 45nm. Back then there were real doubts it could be, on the basis that Intel was 45nm capacity constrained.

    But I've always had the view that the performance didn't matter until version 3 (as seen by consumers). Intel didn't have to compete for the halo with the first couple of iterations, in my view.

    Jawed
     
  7. dkanter

    Regular

    Joined:
    Jan 19, 2008
    Messages:
    360
    Likes Received:
    20
    Um, no offense, but you do realize that skilled circuit designers who can work with a high performance 45nm process don't grow on trees... even in Oregon?

    Designing custom circuits takes time, people and money. Money cannot buy you people, and it sure as hell can't buy you time. Intel's advanced circuit teams had probably gone off to work on other things by then... like designing SRAM or PLLs for 22nm.

    Now what would you do? Take away resources from an essential and undoubtedly profitable project and put them on a questionable one?

    I'm saying that other products have to use the process that is optimized for CPUs, yes. Although with 32nm, there is the SOC process which is much better suited to SOCs. So ironically, a 32nm LRB might have turned out much better.

    Summer is too early, I was thinking 4Q.

    Intel has to sell LRB at at least a mild profit.

    David
     
  8. DavidC

    Regular

    Joined:
    Sep 26, 2006
    Messages:
    347
    Likes Received:
    24
    How did we go from SCC to Larrabee? Of course, I won't deny there's a connection. The "development platform" role that SCC plays probably could have been filled by Larrabee instead. At least some of the lessons they learn from SCC will certainly migrate to Larrabee as well.

    Yes, it probably had hardware problems as well. Intel can't always avoid that.

    The funny thing is that on the SCC they promote cache coherency via software schemes, while on Larrabee it was all hardware. I wonder if that will change in a future Larrabee?

    Just trying to be on topic. :)
     
  9. Jawed

    Legend

    Joined:
    Oct 2, 2004
    Messages:
    10,873
    Likes Received:
    767
    Location:
    London
    Depends what proportion of the circuit/node optimisation effort is computational, I'd say. I wouldn't expect Intel to be researching new stuff specifically for Larrabee/SCC, but some kind of libraries for re-use seem unavoidable.

    Optimisation and verification should be mostly a computing problem, not a manpower problem, shouldn't it?

    Is there any evidence that 32nm works that well?

    So, even less competitive.

    I've never honestly expected anyone at Intel to presume the first chip would be competitive as a halo product. A simple graph over time of the performance of ATI and NVidia chips shows the folly. They were not due to slam into the wall for quite a while after the Larrabee project started - though NVidia's considerably closer with its architecture.

    Eventually, yes.

    Anyway, I'm now convinced that discrete GPUs have about 5-7 years left. The lion's share of the revenue growth is in integrated and mobile, and the tide of performance there is rising very rapidly - even if there's a 20x range between HD5870 and the crap on an 890G motherboard. Llano and Sandy Bridge are both going to put a huge dent in that range. So, arguably, Larrabee as a discrete board was doomed anyway.

    I certainly don't think Larrabee's principles are doomed - just they'll show up somewhere else. This does reinforce the idea, though, that it was always going to be treated like a runt within Intel.

    Atom seems to have been somewhat "accidental". While it was aimed at being low power, it was hardly optimal in its first incarnation, was it?

    Jawed
     
  10. Jawed

    Legend

    Joined:
    Oct 2, 2004
    Messages:
    10,873
    Likes Received:
    767
    Location:
    London
    There's also ring versus mesh and explicit message routing coupled with dedicated message buffers versus cache-coherency-based messaging in Larrabee.

    It seems that Intel's prowess is solely in cutting-edge, high-performance consumer CPUs. It's almost as if Intel has painted its fabs into a corner, where it can't do anything else well, other than these consumer CPUs. (Well, that's where the revenue is, isn't it?...)

    Well, Larrabee's scheme is based on hardware steered by cache intrinsics: locked lines, temporal hints at different cache levels, etc. It's not pure hardware.

    The message passing buffers in SCC could easily live inside locked lines in L2... The MPBs are still software managed (e.g. allocation per node within each local MPB). Though the MPBs are tile-shared, not core-local, so Intel is amortising routing costs over two cores. Why not 4 cores? etc.
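
    For flavour, here is a minimal single-threaded sketch of that software-managed pattern: each sender owns a flag-plus-payload slot in the receiver's MPB, and the receiver polls it. This is not the SCC's real programming interface (that's Intel's RCCE library); every name and size below is hypothetical, and the flush is a no-op stand-in for the cache-line flush a real non-coherent MPB requires.

    ```c
    /* Illustrative sketch, NOT the real SCC/RCCE API: message passing through
     * a shared buffer that hardware does not keep coherent. Each sender owns
     * one slot in the receiver's message-passing buffer (MPB) and signals
     * with a flag the receiver polls. Names and sizes are hypothetical. */
    #include <stdint.h>
    #include <string.h>
    #include <stdio.h>

    #define NCORES  48
    #define PAYLOAD 31                 /* one 32-byte slot: 1 flag + 31 bytes */

    typedef struct {
        volatile uint8_t flag;         /* 0 = empty, 1 = full */
        uint8_t payload[PAYLOAD];
    } slot_t;

    /* On the real chip each tile's MPB is a distinct on-die SRAM; here it is
     * just an array. mpb[dest][src] is src's slot in dest's buffer. */
    static slot_t mpb[NCORES][NCORES];

    /* Stand-in for the cache-line flush/invalidate the real hardware needs,
     * since the MPB sits outside the coherence domain. */
    static void flush_line(volatile void *addr) { (void)addr; }

    void mpb_send(int src, int dest, const void *msg, size_t len) {
        slot_t *s = &mpb[dest][src];
        while (s->flag) flush_line(s);            /* wait for slot to drain */
        memcpy(s->payload, msg, len);
        s->flag = 1;                              /* publish the message */
        flush_line(s);
    }

    void mpb_recv(int dest, int src, void *msg, size_t len) {
        slot_t *s = &mpb[dest][src];
        while (!s->flag) flush_line(s);           /* poll for arrival */
        memcpy(msg, s->payload, len);
        s->flag = 0;                              /* release the slot */
        flush_line(s);
    }

    int main(void) {
        char out[PAYLOAD];
        mpb_send(3, 7, "hello", 6);               /* core 3 -> core 7 */
        mpb_recv(7, 3, out, 6);
        printf("%s\n", out);
        return 0;
    }
    ```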

    I started the thread specifically to compare and contrast these chips, not to be solely about SCC.

    Jawed
     
  11. DavidC

    Regular

    Joined:
    Sep 26, 2006
    Messages:
    347
    Likes Received:
    24
    Yes, it definitely wasn't optimal; Moorestown will improve it a lot. Though it was the first in a line of LPIA products. If we look at Larrabee that way, the follow-ups could have been impressive.

    Gotcha. :)

    Yes, but they are related in that they are both many-core products. SCC is made as a learning platform for future many-core products, and Larrabee is targeted towards high flops and extremely parallel apps. If the future Larrabee derivatives change a lot, they might just adopt SCC's cache coherency scheme. Who knows? Maybe they'll decide it turns out better.

    Just my 2 cents.
     
  12. Blazkowicz

    Legend Veteran

    Joined:
    Dec 24, 2004
    Messages:
    5,607
    Likes Received:
    256
    The thread title has just shocked me, it's terribly funny nonsense.
    I might as well build a Single Computer Computer Network :D
     
  13. 3dilettante

    Legend Alpha

    Joined:
    Sep 15, 2003
    Messages:
    8,122
    Likes Received:
    2,873
    Location:
    Well within 3d
    If something is overclocked, as the 1 TFLOP demo was stated as requiring, I assumed there was some base clock to go over.

    I suspect that would have been secondary to the convenience (necessity?) of trying to get a pipeline first designed to run at ~200 MHz to the 1.5-2 GHz range.

    The range of cooling solutions for the target market is what it is. Perhaps as a boutique product, Intel could have afforded to splurge for Larrabee more so than either Nvidia or AMD do, Fermi having perhaps the highest-class stock cooler and the most generous case requirements in quite some time.


    RV770 was released in mid '08 and it was denser than the Larrabee that was put on display. It was barely edged out by the overclocked Larrabee in SGEMM.
    I would have been curious how the 600mm2 chip would have performed, given its 2.5x die size advantage.

    Is this more of a political and product-based question?
    The product lines with first dibs on the leading-edge processes did not seem particularly interested in Larrabee coming to the party. The latest announcement of an HPC initiative including Larrabee is actually a change of pace, since Intel, or one portion of it, had ruled that out earlier. The proposed socket version of Larrabee was ruled out by one of the execs linked with the Xeon lines.

    It would make sense; the idea of a 600mm2 chip sold for a fraction of the price of a high-end Xeon, with 5-10 times the performance for certain workloads, would give Intel's high-margin lines a headache.

    My interpretation of the spokesman's statements was that this was not the case.

    Mostly the ones where the core itself is secondary to the primary focus of the design, or perhaps the project was not really high enough priority to warrant the resources to make a new one.
    For this latest project, it is the messaging network.

    For Larrabee, it was the VPU and cache, and I always got the subjective impression it did not rate that highly (Intel's schizophrenic raytracing/rasterizer statements; the high end, no volume, no, I mean high end target market; the wandering target performance level; the use of a core at clocks way above its original design envelope).

    As far as the first product is concerned, maybe it was more of a "do enough to just get it working but not enough to make anything else we make money from look bad" ethos.
     
  14. cho

    cho
    Regular

    Joined:
    Feb 9, 2002
    Messages:
    416
    Likes Received:
    2
  15. Npl

    Npl
    Veteran

    Joined:
    Dec 19, 2004
    Messages:
    1,905
    Likes Received:
    6
    Hmm, x86 cores without an FPU? That means no backwards compatibility with existing code.
    Wouldn't it be way more interesting to look into existing solutions than Intel's research projects (or at least look into them first)? I don't see what's new there except the blessing and curse of the x86 legacy.
    E.g. there is the TILEPro64 with 64 (32-bit) cores on 90nm, and the Tile-Gx100 with 100 (64-bit) cores on 45nm, scheduled to arrive in 2011.
    I would love to see a technological breakdown of those before looking at research projects that may or may not materialize someday.
     
  16. Blazkowicz

    Legend Veteran

    Joined:
    Dec 24, 2004
    Messages:
    5,607
    Likes Received:
    256
    You can run FPU code on a processor with no FPU, such as a 486SX or earlier: exception handling catches the FP instruction and integer-based emulation kicks in, successfully running the instruction (that takes something like 2,000 cycles).
    Old stuff.
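
    Roughly, that trap-and-emulate path has the shape sketched below, assuming a kernel-side handler. With CR0.EM set, any x87 opcode raises #NM (device-not-available, vector 7) and lands in the handler, which decodes the instruction and emulates it with integer code. The types and soft-float helpers here are hypothetical stubs; a real emulator (e.g. the Linux math-emu code) decodes the full D8..DF escape-opcode space.

    ```c
    /* Sketch of an #NM trap handler emulating x87 instructions in integer
     * code. fpu_state_t and the extern helpers are hypothetical; only the
     * opcode encodings (D8 C0+i = FADD st(0),st(i), D8 C8+i = FMUL) are real. */
    #include <stdint.h>

    typedef struct {
        uint64_t  st[8];       /* emulated x87 register stack (raw bits) */
        int       top;         /* stack top pointer */
        uintptr_t ip;          /* faulting instruction pointer */
    } fpu_state_t;

    /* Hypothetical integer-only soft-float helpers. */
    extern uint64_t softfloat_add(uint64_t a, uint64_t b);
    extern uint64_t softfloat_mul(uint64_t a, uint64_t b);
    extern uint8_t  fetch_byte(uintptr_t addr);

    /* Decode the x87 opcode, emulate it, skip the instruction, resume.
     * Hundreds to thousands of cycles per op, hence the 486SX slowness. */
    void handle_device_not_available(fpu_state_t *f) {
        uint8_t op  = fetch_byte(f->ip);       /* x87 escape opcodes: D8..DF */
        uint8_t mod = fetch_byte(f->ip + 1);

        switch (op) {
        case 0xD8:
            if ((mod & 0xF8) == 0xC0)          /* FADD st(0), st(i) */
                f->st[f->top] = softfloat_add(f->st[f->top],
                                              f->st[(f->top + (mod & 7)) & 7]);
            else if ((mod & 0xF8) == 0xC8)     /* FMUL st(0), st(i) */
                f->st[f->top] = softfloat_mul(f->st[f->top],
                                              f->st[(f->top + (mod & 7)) & 7]);
            break;
        /* ...remaining escape opcodes D9..DF elided... */
        }
        f->ip += 2;                            /* resume past the instruction */
    }
    ```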
     
  17. brain_stew

    Regular

    Joined:
    Jun 4, 2006
    Messages:
    556
    Likes Received:
    0
    Anandtech have a new article up, and it seems as though both this and the Larrabee project are merging, with the aim of hitting the HPC space with a 22nm chip codenamed "Knights Corner".

    http://www.anandtech.com/show/3749/intel-mic-22nm-50-cores-larrabee-for-hpc-announced
     
  18. larrabee

    Newcomer

    Joined:
    Dec 21, 2009
    Messages:
    29
    Likes Received:
    0
    Optimization of design layouts requires strong intuition from skilled engineers. A lot of time can be wasted trying every layout, and a lot of performance can be gained, or vice versa. It's hard to guess or predict the time needed for testing and designing a chip, and Larrabee is very different from anything Intel has ever built, so it's definitely a lot of work.
     
  19. spacemonkey

    Newcomer

    Joined:
    Jul 16, 2008
    Messages:
    163
    Likes Received:
    0
  20. bbot

    Regular

    Joined:
    Apr 20, 2002
    Messages:
    705
    Likes Received:
    4
    Intel Knights Corner? Why bother making a 50-core chip that has a performance of 1 teraflops, single-precision? Wasn't the cancelled 32-core IBM Cell chip going to have the same performance? And doesn't the AMD Cypress(?) GPU already have a performance of 2.72 teraflops, single-precision?
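
    Those peak numbers all fall out of the same formula: units x lanes per unit x flops per lane per cycle (2 for multiply-add) x clock. A quick sketch: the Cypress line reproduces the 2.72 TFLOPS figure cited above (320 VLIW5 units at 850 MHz), while the 50-core line assumes a Larrabee-style 16-wide FMA VPU with the clock chosen purely to land on 1 TFLOPS.

    ```c
    /* Peak single-precision throughput from the usual formula. The Cypress
     * numbers are public specs; the 50-core entry is an assumed configuration
     * (16-lane SP VPU with FMA, clock back-solved to hit 1 TFLOPS). */
    #include <stdio.h>

    static double peak_gflops(int units, int lanes, int flops_per_lane,
                              double ghz) {
        return units * lanes * flops_per_lane * ghz;
    }

    int main(void) {
        /* Cypress: 320 VLIW5 units = 1600 lanes, FMA, 850 MHz. */
        printf("Cypress:  %.0f GFLOPS\n", peak_gflops(320, 5, 2, 0.850));

        /* Hypothetical 50-core Larrabee derivative: 16-wide SP VPU with FMA.
         * A ~625 MHz clock already puts it at the quoted 1 TFLOPS. */
        printf("50 cores: %.0f GFLOPS\n", peak_gflops(50, 16, 2, 0.625));
        return 0;
    }
    ```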
     