Intel's Single Chip Cloud Computer

Jawed

http://techresearch.intel.com/UserFiles/en-us/File/SCC_Sympossium_Mar162010_GML_final.pdf

I know this isn't a GPU, but I can't help feeling there's a lot of stuff in here that's relevant to Larrabee.

The chip is huge (567mm² on 45nm), has 48 P54c-based x86 cores, 12MB of L2, 1.3 billion transistors and 48 million transistors for 2 cores + gubbins to make a "tile".

Actually one thing really leaps out at me in that list: this chip is bigger than GF100, is on a notionally similar node (45 versus 40nm), but the transistor count in GF100 is ~2.3x higher :oops:

I suppose the key thing is that there's no cache coherency in hardware. Each tile has a 16KB message-passing buffer (MPB) which is shared by the 2 cores, and the router gubbins in each tile deals with getting packets from MPB to MPB. Sort of seems like "software DMA" instead of the DMA that Cell has. And dependent upon a mesh rather than a ring.
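
For the curious, here's a toy Python model of that MPB-style handoff. It's a minimal sketch with made-up names, just to illustrate software-managed message passing through a small shared buffer in place of hardware coherency:

Code:
# Toy model of SCC-style message passing: each tile exposes a small
# message-passing buffer (MPB); software copies data in and raises a
# flag, and the receiver polls and copies out. No hardware coherency.
MPB_SIZE = 16 * 1024  # 16KB per tile, shared by its 2 cores

class Tile:
    def __init__(self):
        self.mpb = bytearray(MPB_SIZE)
        self.flag = False  # "data ready" flag, managed purely by software

def send(src_data: bytes, dst: Tile):
    assert len(src_data) <= MPB_SIZE
    dst.mpb[:len(src_data)] = src_data  # the mesh routes the packet MPB-to-MPB
    dst.flag = True

def recv(tile: Tile, n: int) -> bytes:
    while not tile.flag:  # receiver polls; no coherency traffic to snoop
        pass
    tile.flag = False
    return bytes(tile.mpb[:n])

t = Tile()
send(b"hello from core 0", t)
print(recv(t, 17))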

Jawed
 
This is probably going to be very popular in the HPC market, since it's basically a cluster-on-a-chip (you can use normal cluster-oriented toolchains).
 
The chip is huge (567mm² on 45nm), has 48 P54c-based x86 cores, 12MB of L2, 1.3 billion transistors and 48 million transistors for 2 cores + gubbins to make a "tile".
TL;DR

So 256KB of L2 per core and 48 cores.

It's kinda odd that LRB had 32 cores with the same cache in a similar area. IOW, LRB spent ~2x the area on x86 as on vector ALUs. Perhaps that's something they should also look into fixing.

Maybe this chip has loads of I/O.
 
http://techresearch.intel.com/UserFiles/en-us/File/SCC_Sympossium_Mar162010_GML_final.pdf

Actually one thing really leaps out at me in that list: this chip is bigger than GF100, is on a notionally similar node (45 versus 40nm), but the transistor count in GF100 is ~2.3x higher :oops:

Jawed

Yeah, but GF100 is clocked at 700MHz for the core and 1401MHz for the shaders. This chip can run at 1300MHz for the CPU cores and 2600MHz for the mesh, maybe even higher.

That, and (45/40)² = 1.27.
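
A quick sanity check of that density gap in Python. Caveat: GF100's die size isn't official; ~530mm² and ~3.0B transistors (2.3x SCC's 1.3B) are the commonly quoted ballpark figures:

Code:
# Rough density comparison, normalizing for the node difference.
# Assumed figures: GF100 ~3.0B transistors on ~530mm2 (40nm);
# SCC 1.3B transistors on 567mm2 (45nm). Ballpark only.
scc_density   = 1.3e9 / 567   # transistors per mm2
gf100_density = 3.0e9 / 530

node_factor = (45 / 40) ** 2  # ideal area scaling, ~1.27x
print(f"SCC:   {scc_density / 1e6:.1f} Mtrans/mm2")
print(f"GF100: {gf100_density / 1e6:.1f} Mtrans/mm2")
# Even crediting SCC the full ideal node scaling, GF100 stays ~2x denser:
print(f"After node scaling: {gf100_density / (scc_density * node_factor):.1f}x")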
 
You mean GF100? :p

Perhaps you'd like to explain why P54c in this is different from the P54c in Larrabee.

What sort of efficiencies in implementation, if this were to become a product, would you expect to see?
 
Thanks for the link, Jawed. That'll make a nice reading for the afternoon sitting outside in the sun. :)
 
Yeah, but GF100 is clocked at 700MHz for the core and 1401MHz for the shaders. This chip can run at 1300MHz for the CPU cores and 2600MHz for the mesh, maybe even higher.

That, and (45/40)² = 1.27.

GPUs are denser than CPUs. There's even a density difference between ATI and Nvidia GPUs.

The cores themselves don't seem to be modified in any way, unlike those in Larrabee and Atom. The SCC cores are pure P54C cores plus what's required for the many-core development platform. Larrabee features vector units, while Atom has additions for better general-purpose performance.
 
SRAM (of which there is a lot in ATI's register files) is denser than logic.

Different processes are optimized for different goals, as are the masks; it's not about GPUs vs. CPUs. VIA's Isaiah is a pretty dense CPU too.

I am merely pointing out that transistor counts and die sizes are not directly comparable.

The differences in optimizations mostly lie in that GPU and CPU design goals are different. GPUs can probably sacrifice clock speed for higher density, while CPUs need to be as fast as possible on single threads, and one of the easiest ways to get there is higher clock speed.

Density isn't that impressive on Isaiah:

VIA Isaiah 65nm, 1MB L2

63mm2
94 million transistors

Athlon 64 X2 4200+ 65nm, 1MB L2

118mm2
221 million transistors
 
Density varies tremendously between random logic, I/O and analog, and memories. Here's a good example from Tukwila:

4 CPUs: 430M devices, 276mm2
L3 cache: 1.4B devices, 191mm2
sysint: 152M devices, 107mm2
I/Os: 39M devices, 123mm2
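
Working those Tukwila figures out per block makes the spread obvious. A quick pass in Python:

Code:
# Device density per block of Tukwila, from the figures above.
blocks = {
    "4 CPUs":   (430e6, 276),
    "L3 cache": (1.4e9, 191),
    "sysint":   (152e6, 107),
    "I/Os":     (39e6,  123),
}
for name, (devices, mm2) in blocks.items():
    print(f"{name:9s} {devices / mm2 / 1e6:5.2f} Mdevices/mm2")
# Roughly 7.3 for cache vs 0.3 for I/O: a ~23x spread on one die.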

One reason that the SCC isn't very dense is that nobody cares about making it dense. It's a research project to look at new circuit techniques and architectures. In contrast, real products need to be dense for the sake of margins.

As others have pointed out, GPUs are definitely designed on more density-focused SoC processes. Intel's process technology is incredibly dense, but if you are aiming for 2.4GHz for a significant portion of a chip, compared to, say, 750MHz for most ATI chips, there's going to be a hell of a density difference.

DK
 
As others have pointed out, GPUs are definitely designed on more density-focused SoC processes. Intel's process technology is incredibly dense, but if you are aiming for 2.4GHz for a significant portion of a chip, compared to, say, 750MHz for most ATI chips, there's going to be a hell of a density difference.
So how many process generations of advantage does Intel need when building "Larrabee" to get remotely close to the density required to be competitive? That's after taking into account the supposed bandwidth efficiencies of the Larrabee architecture...

This appears to be the reason Larrabee as a discrete GPU was canned. With discrete GPUs (desktop + mobile) on the cusp of being a shrinking market in favour of processor-integrated graphics, it seems the Larrabee architecture will only reappear once it becomes viable in PIGs. In theory the relatively bandwidth-constrained nature of PIGs should make that time come earlier, I suppose.

Jawed
 
Someone feel free to correct my figures and math:

45nm Penryn 410M transistors at 107mm2 = 3.8Mtrans/mm2
45nm Larrabee 1.7B transistors at >600mm2 ~ 2.8Mtrans/mm2
40nm Cypress 2.15B transistors at 334mm2 = 6.4Mtrans/mm2
32nm Westmere 1.17B transistors at 240mm2 = 4.9Mtrans/mm2
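
Re-deriving those densities (taking 600mm2 for Larrabee, since only ">600" is rumored):

Code:
# Recomputing Mtrans/mm2 from the numbers above.
chips = [
    ("45nm Penryn",   410e6,  107),
    ("45nm Larrabee", 1.7e9,  600),  # die size only rumored as >600mm2
    ("40nm Cypress",  2.15e9, 334),
    ("32nm Westmere", 1.17e9, 240),
]
for name, trans, mm2 in chips:
    print(f"{name:14s} {trans / mm2 / 1e6:4.1f} Mtrans/mm2")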

Larrabee's initial design may not have been as dense in part because it may not have had the resources dedicated to really match Penryn's optimized density, and because Penryn is stacked with really dense cache, whereas Larrabee shifted away from the logic/cache ratio CPUs typically have.

Going by the limited data points, we wouldn't expect "parity" of sorts without 1.5-2 nodes of advantage.
Larrabee I may not be a good data point because of its various issues. The chip's rumored target FLOPS level was missed by about a factor of 2, going by the demo.

Whether or not discrete will shrink to the point it cannot support the necessary R&D is a market determination. The implicit argument that "good enough" will win the day makes me suspect the projected performance of a discrete Larrabee versus ASIC competition would not have been favorable for an extended time frame.
 
Larrabee's initial design may not have been as dense in part because it may not have had the resources dedicated to really match Penryn's optimized density, and because Penryn is stacked with really dense cache, whereas Larrabee shifted away from the logic/cache ratio CPUs typically have.
At the same time, P54c is "relatively simple" and the entire core is quite compact in comparison with contemporary x86 cores. Additionally, 4-way in-order SMT, while adding complexity, is still less complex than out-of-order execution, isn't it? Additionally, because each Larrabee core is stamped out so many times, making each core as small as possible has got to be a top priority.

Going by the limited data points, we wouldn't expect "parity" of sorts without 1.5-2 nodes of advantage.
At the same time processor integrated graphics is where the volume will be - technically it already is, as Intel already has dominance in graphics unit share simply with IGP. So then it's AMD versus Intel PIG, with both capable of revising chips yearly, with nodes that are ~1 year apart, and with AMD's graphics API expertise.

Larrabee I may not be a good data point because of its various issues. The chip's rumored target FLOPS level was missed by about a factor of 2, going by the demo.
You're referring to ~1 TFLOPS single-precision matrix multiply? Well the problem there is: what's the architecture/algorithm theoretically capable of?...

Whether or not discrete will shrink to the point it cannot support the necessary R&D is a market determination. The implicit argument that "good enough" will win the day makes me suspect the projected performance of a discrete Larrabee versus ASIC competition would not have been favorable for an extended time frame.
The other side of the coin is cloud based graphics rendering (well, game execution).

Jawed
 
At the same time, P54c is "relatively simple" and the entire core is quite compact in comparison with contemporary x86 cores. Additionally, 4-way in-order SMT, while adding complexity, is still less complex than out-of-order execution, isn't it? Additionally, because each Larrabee core is stamped out so many times, making each core as small as possible has got to be a top priority.
The P54 lineage was from the 94-96 timeframe, and it was asked to run at an order of magnitude higher clocks, which doesn't suggest core size was a high priority.
Intel has great engineering, and perhaps did dedicate the same kind of resources to Larrabee's implementation as it does to its mass-produced x86s, but that's a lot to put down on an untested and admittedly niche initial product.

At the same time processor integrated graphics is where the volume will be - technically it already is, as Intel already has dominance in graphics unit share simply with IGP. So then it's AMD versus Intel PIG, with both capable of revising chips yearly, with nodes that are ~1 year apart, and with AMD's graphics API expertise.
My comparison was for the density gap between a Larrabee-based chip and an ASIC. Intel's IGP parts are ASICs, which makes me suspect they could be denser than a CPU, though until Intel puts the IGP on-die it is using older nodes for the IGP.

You're referring to ~1 TFLOPS single-precision matrix multiply? Well the problem there is: what's the architecture/algorithm theoretically capable of?...
The old slides had a target of 1 TFLOPS DP, and not overclocked.
On SGEMM, other x86s get over 90% of peak (and RV770 thanks to prunedtree).
Either Larrabee's unique architecture means it gets terrible utilization on SGEMM, or it was nowhere near where it should have been when it demoed.
 
Additionally, because each Larrabee core is stamped out so many times, making each core as small as possible has got to be a top priority.

That could be true, but since Larrabee was originally slated to release in early 2009, it was probably beset with massive problems and not what anyone would call elegant at all. Usually the well-optimized ones don't have 2+ year delays. :)
 
A few notes:

1. Mainstream x86 CPUs at Intel are heavily co-optimized with circuit technology and, to some extent, process. That means if Intel believes Sandy Bridge or Nehalem needs 'special circuit x', it will happen. That's not true for LRB, since LRB volume was unknown and the circuit design wizards were probably off designing 22nm or 16nm circuits.

2. transistors/mm2 isn't really a useful metric. The question is about GFLOP/s per mm2, and then the overhead used for rasterization on LRB. (A rough sketch of those numbers follows this list.)

3. Density is as much about what process is used as ASIC vs. CPU. Intel heavily uses process/circuit/design co-optimization in creative ways to benefit their high volume applications (e.g. avoiding immersion till 32nm). For example, almost all CPU processes have much more restrictive design rules than ASIC processes; this can reduce density, but increase frequency, yield or lower manufacturing costs/investment. Intel is very public about the fact that their design rules are much more restrictive than TSMC's, and that what other fabs might offer as 'recommendations', they may require. For instance, Intel tends to emphasize having multiple vias whenever possible. Interestingly enough, who had problems with vias again?

4. You guys are all forgetting about the fact that LRB was almost certainly power limited. NV certainly is...

5. Time to market matters. LRB was on 45nm and if it had come out earlier, would have had a substantial process technology advantage, which would have made up for architectural issues. It was late and had no density advantages over the competition... in fact, it's possible that the competing 40nm chips were more dense.
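
On point 2, here's a back-of-envelope GFLOP/s-per-mm2 comparison. Caveat: the Larrabee figures (32 cores, 1GHz, >600mm2) are assumptions pieced together from rumors, not official specs; RV770's 1.2 TFLOPS on 256mm2 is public:

Code:
# Back-of-envelope peak SP compute density. Larrabee numbers are
# assumed/rumored (32 cores x 16 SP lanes x 2 flops via FMA at 1GHz).
def peak_gflops(units, lanes, flops_per_clock, ghz):
    return units * lanes * flops_per_clock * ghz

lrb   = peak_gflops(32, 16, 2, 1.0)    # ~1024 GFLOPS SP
rv770 = peak_gflops(160, 5, 2, 0.75)   # 800 SPs at 750MHz = 1200 GFLOPS

print(f"Larrabee: {lrb / 600:.1f} GFLOP/s per mm2 (>600mm2 die)")
print(f"RV770:    {rv770 / 256:.1f} GFLOP/s per mm2 (256mm2 die)")

And that's before carving out area (or cycles) for LRB's software rasterization.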
 
2. transistors/mm2 isn't really a useful metric. The question is about GFLOP/s per mm2, and then the overhead used for rasterization on LRB.
At least in SGEMM, Larrabee at 600+mm2 and over stock clock barely edged out RV770...
I'm not sure what other numbers are available for comparison.

4. You guys are all forgetting about the fact that LRB was almost certainly power limited. NV certainly is...
The SGEMM demo had Larrabee overclocked to break a TFLOP, which I would assume meant breaking standard TDP. Perhaps they weren't able to tweak voltage, so that overclock was the best that could be done without that knob.
At any rate, whatever issues Larrabee had limited it to less than half of desired peak, and this assumes more modest clocks than the early 2.5 GHz top end for 24 cores, which would have left an even wider gap with 32 on die.
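
For reference, the clock arithmetic behind that, assuming the rumored 32 cores with 16-wide SP vectors and FMA (2 flops per lane per clock):

Code:
# Clock required for 1 TFLOPS SP with 32 cores x 16 lanes x 2 flops (FMA).
flops_per_clock = 32 * 16 * 2  # = 1024
print(f"{1e12 / (flops_per_clock * 1e9):.2f} GHz for 1 TFLOPS")  # ~0.98 GHz
# At the early-rumored 2.5GHz with 24 cores, peak would have been:
print(f"{24 * 16 * 2 * 2.5:.0f} GFLOPS")  # 1920 GFLOPS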

5. Time to market matters. LRB was on 45nm and if it had come out earlier, would have had a substantial process technology advantage, which would have made up for architectural issues. It was late and had no density advantages over the competition... in fact, it's possible that the competing 40nm chips were more dense.
How much earlier?
RV770 was denser from a transistor standpoint and it was at 55nm. Nvidia's GPUs had a density deficit for that generation compared to AMD, even if process normalized.
 
How much earlier?
RV770 was denser from a transistor standpoint and it was at 55nm. Nvidia's GPUs had a density deficit for that generation compared to AMD, even if process normalized.

It's possible that the demo version merely had 16 cores enabled.

Wasn't Larrabee originally supposed to launch in early 2009? That would have given it a three-quarter head start before AMD's 40nm variants were out in the high end.

Makes sense if it wasn't delayed. Their big-chip/new-architecture parts usually take a year to surface from the time the shrink chips on the new process tech roll out. Penryn was 2008.
 
VIA Isaiah 65nm, 1MB L2

63mm2
94 million transistors

Athlon 64 X2 4200+ 65nm, 1MB L2

118mm2
221 million transistors

One or both transistor counts are wrong, or counted differently. The Brisbane figure is likely wrong, since that value is often quoted for Windsor, which has twice the cache; Propus, on the other hand, has twice as complex cores and is around 300 million.

BTW, a Brisbane and Isaiah:
[die shot: AMD Athlon 64 X2 "Brisbane"]

[die plot: VIA Isaiah architecture]
 