Analyst: Intel's Larrabee chip needs reality check

I don't need to know what Nvidia is cooking to know there is more than one recipe besides graphics à la Intel.
You can put in less salt and more pepper, but the base ingredients are still the same; it's not a matter of how, it's a matter of when. And that's the last crappy metaphor of the day ;)
 
He did say "market", not the "design", or its age. ;)
For all we know, Core 2's design was basically ready during the late "Prescott"/early Pentium D market dominance, and just went into debug/fab-process refinement mode from there until mid-2006.
That would explain the very early appearance of "Conroe" engineering samples out in the open, much like what happened with "Nehalem" (albeit to a much smaller degree, given that the first ones are headed to a completely new motherboard/socket infrastructure, and will be limited to the very high end for now).

The point being, there's nothing to suggest that designing a modern x86 takes any less time than designing a modern GPU.
However, its market life is considerably longer. That's partly because large margins and an overwhelming commercial position limit Intel's willingness to shed profits by cutting production on a given design after just 8 or 10 months, and partly because the CPU market itself, being much larger than the one for dedicated GPUs, has far more consumer/business inertia: most buyers won't even consider upgrading as often as a PC gamer does, for instance.

New CPUs are introduced into the market on roughly a quarterly basis. If we aren't going to talk about designs but about releases, I think the CPU market releases new products at least at the same rate as the GPU market.
 

Well, it's a proven and well-working approach, whereas the huge-cache approach is not. CPUs have to deal with lots of fairly random memory accesses, so a huge cache makes a lot of sense there. GPUs, on the other hand, typically have a steady and predictable flow of data from memory with excellent locality and a short lifespan. A texel you fetch is likely to be reused for neighboring pixels, but unlikely to be reused for pixels a dozen pixels away in any direction. The advantage of, say, a 2MB cache over a 4KB cache is pretty small for normal texture access patterns.
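To make the texel point concrete, here's a toy C++ sketch (nothing Larrabee-specific; all names and the clamp addressing are made up for illustration). Bilinear filtering of one pixel touches a 2x2 texel footprint, the neighbouring pixel's footprint overlaps it, and pixels a dozen away hit a disjoint footprint - which is why a few KB of cache already captures most of the reuse:

#include <algorithm>
#include <cmath>
#include <cstdint>

struct Texel { uint8_t r, g, b, a; };

// Hypothetical texture: 'texels' is a width*height array in row-major order.
Texel fetch(const Texel* texels, int width, int height, int u, int v) {
    u = std::min(std::max(u, 0), width - 1);   // clamp addressing
    v = std::min(std::max(v, 0), height - 1);
    return texels[v * width + u];              // the cached access
}

// Bilinear sample at (s, t) in texel space: it only touches the 2x2 block
// at (u0, v0), and the neighbouring pixel's sample overlaps that block.
float sample_red(const Texel* texels, int w, int h, float s, float t) {
    int u0 = static_cast<int>(std::floor(s));
    int v0 = static_cast<int>(std::floor(t));
    float fu = s - u0, fv = t - v0;
    float r00 = fetch(texels, w, h, u0,     v0    ).r;
    float r10 = fetch(texels, w, h, u0 + 1, v0    ).r;
    float r01 = fetch(texels, w, h, u0,     v0 + 1).r;
    float r11 = fetch(texels, w, h, u0 + 1, v0 + 1).r;
    // A working set this small is why a 2MB cache buys little over a 4KB
    // one for plain texture access patterns.
    return (r00 * (1 - fu) + r10 * fu) * (1 - fv)
         + (r01 * (1 - fu) + r11 * fu) * fv;
}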

Depends on whether you program Larrabee close to the metal or using a JIT of some sort. For D3D graphics a JIT is planned. Ct is JIT based too.

Sure. Which is why JIT makes a lot of sense, and why x86 doesn't. But this product is being marketed for its x86 abilities. People will be told to write to the metal to take advantage of all the power, only to be outperformed by JIT applications in the next generation.

Texels have a dedicated cache in Larrabee.

Why? Wasn't a huge unified cache the way to go?
Of course, dedicated texture cache makes a lot of sense. But then that of course throws much of the argument for the unified cache out of the window.

New CPUs are introduced into the market on roughly a quarterly basis. If we aren't going to talk about designs but about releases, I think the CPU market releases new products at least at the same rate as the GPU market.

Of course we're talking about designs. Releases are uninteresting. But the GPU market released two new designs during the time Core 2 has been on the market. Of course any design is older than its time on the market, but how long a design existed within the company before being released to the market is uninteresting. Comparing that age would penalize companies that have several design teams, even if they release new designs at the same rate as another company, just because they have more products in the pipeline at a given time. It's the throughput, not the latency. The GPU market is running at twice the throughput on designs.
 
Are you financially associated with this in any way?
Yes. If Intel releases a free software renderer that also runs optimally on regular CPUs, then I'll likely have to find another source of income. But here I try to be a sport and argue why I believe Larrabee could work.
These assertions are weird. Nobody in their right mind wants to use Larrabee as a vanilla x86 processor.
Why not? All those GPGPU applications not directly related to graphics only reach a fraction of the hardware's efficiency. Often they are only a couple of times faster than when running on a CPU. So imagine running them on a massively parallel CPU like Larrabee at high efficiency. For all I know there are even GPGPU projects that simply don't exist because performance turned out lower than what can be achieved on the CPU.
It sucks as a vanilla x86 processor by just about any metric you'd like to mention - price, size, performance, power draw, price/performance, et cetera et cetera.
To build an x86 system with the same performance you're looking at a cluster that costs tens of thousands and needs its own room with dedicated cooling. So how could Larrabee's price/performance be terrible?
The only reason to use Larrabee is if you want to take advantage of its parallel vector units. And those parallel vector units are NOT part of its x86 legacy. No programmers are well acquainted with the tools necessary to access, analyze and debug this functionality - it doesn't exist yet (outside Intel).
SSE has existed for years, and its performance went up from 2 FLOPS per clock for Pentium III to 32 FLOPS per clock for Core 2 Quad. And they don't plan on stopping there. AVX and FMA will each double the performance again.
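To spell out how I read those numbers (my own back-of-the-envelope, not an official breakdown): Pentium III's SSE hardware was only 64 bits wide internally, so a 4-wide single-precision op takes two clocks, giving ~2 FLOPS/clock per core; Core 2 can issue a 4-wide add and a 4-wide mul per clock, so 8 FLOPS/clock per core, times 4 cores = 32; 8-wide AVX doubles that again, and FMA (two flops per lane per op) doubles it once more. The inner loop that reaches those peaks looks the same at every step - a vectorised multiply-accumulate - only the vector width changes. A minimal SSE version, assuming n is a multiple of 4:

#include <xmmintrin.h>  // SSE intrinsics

void madd(const float* a, const float* b, float* acc, int n) {
    for (int i = 0; i < n; i += 4) {               // 4 floats per SSE register
        __m128 va = _mm_loadu_ps(a + i);
        __m128 vb = _mm_loadu_ps(b + i);
        __m128 vc = _mm_loadu_ps(acc + i);
        vc = _mm_add_ps(vc, _mm_mul_ps(va, vb));   // one mul + one add per lane
        _mm_storeu_ps(acc + i, vc);
    }
}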

So clearly Intel knows something you don't. These advances have been applauded by every developer who cares about floating-point performance. Larrabee simply allows us to skip ahead and enter the TFLOPS era.

For the tools, the wider vector units are just a detail. What they should allow from day one is taking any existing x86 application and instantly porting it to Larrabee as a first prototype. That should greatly simplify development compared to writing things from scratch on the GPU.
How can you even suggest that writing massively parallel vector code is equivalent to a "hello world" program using Visual C++ or some other tool programmers are already "well acquainted" with? Just what the heck do you mean?
That's not exactly what I'm suggesting. I'm well aware of the challenges of writing multi-threaded vector code. But having fully general-purpose cores for which already a ton of software has been written is about the best starting point you can get. The step to take your multi-threaded vector application from the CPU to Larrabee should be really minor compared to writing GPGPU stuff from scratch.
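For illustration, this is the sort of ordinary multi-threaded x86 code I mean as a starting point - a minimal sketch using standard threads and a loop the compiler can vectorise, nothing Larrabee-specific about it:

#include <algorithm>
#include <cstddef>
#include <thread>
#include <vector>

// Scale one chunk of the array; compilers vectorise this inner loop.
void scale_range(float* data, std::size_t begin, std::size_t end, float k) {
    for (std::size_t i = begin; i < end; ++i)
        data[i] *= k;
}

// Split the job across however many cores are available. The point being
// made here is that code already structured like this is a far better
// porting starting point than a from-scratch GPGPU rewrite.
void scale_parallel(float* data, std::size_t n, float k, unsigned threads) {
    std::vector<std::thread> pool;
    std::size_t chunk = (n + threads - 1) / threads;
    for (unsigned t = 0; t < threads; ++t) {
        std::size_t begin = t * chunk;
        std::size_t end   = std::min(n, begin + chunk);
        if (begin < end)
            pool.emplace_back(scale_range, data, begin, end, k);
    }
    for (auto& th : pool)
        th.join();
}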
I don't care at all about Larrabee as a graphics processor; I'm from computational science, and I care about it from a scientific computation angle. It may, depending on quite a few things, be useful there.
But that has nothing whatsoever to do with being x86 compatible.
Sure. All I'm saying here is that having fully general-purpose cores is not a bad thing and x86 is not the worst choice of ISA.
x86 is there as a market lock in feature. Compare to OpenCL for an alternative take on performance coprocessing for personal computers. There are more ways than one to skin a cat, and it is no great mystery why Intel tries this path.
Stating the obvious.
Sorry about the tone, but you really seem anxious to sell the x86 part of Larrabee, and short of personal gain, I just can't see why anyone would do that. Feel free to explain.
Hey I can't help it that x86 dominates the market. But given that situation the choice seems like an excellent one to me. And in my experience x86 is not half as bad as some think it is. Also note that I'm not saying NVIDIA and AMD are doomed. Just because I'm excited about what Larrabee may bring doesn't mean I'm not interested in other technology.
 
Certainly. But that's not my point. For all I care it could be PowerPC, IA64, SPARC, ARM, MIPS or XYZ that dominates the market. I seriously doubt it would be significantly faster than x86 though. Other factors determine performance nowadays, and even performance / area.

If my memory serves correctly, Intel's x86 architectures didn't win out in the end by being intrinsically similar to, or even nearly as efficient or performant as, those other architectures.

It was due almost entirely to their manufacturing process. You can afford to be 30-40% less efficient with transistor use if you manufacture your chips on significantly smaller process nodes.

The fact that Microsoft Windows ended up with the lion's share of the OS market didn't hurt any either. Although PPC and Alpha still had a chance with NT, if they could have transitioned to, and manufactured their processors on, process nodes similar to Intel's.

Regards,
SB
 
Originally Posted by Humus
The GPU market is still operating at twice the speed of the CPU market. Core 2 is over two years old now.
So are the designs for GT200 and RV770! It could be argued that the RV770 design is even older than that!

Core2 can trace its roots back to the P3, with a major design change in going to two cores and other evolutionary changes. What can RV770 trace its roots back to? I'm sure Core2 is more similar to the P3 than RV770 is to R300.

Heck, Core2 is probably more similar to the P3 than RV770 is to R580.

Regards,
SB
 
Well, it's a proven and well-working approach, whereas the huge cache approach is not.
Quake 1 (just to pick a popular example) begs to differ.
Why? Wasn't a huge unified cache the way to go?
Of course, dedicated texture cache makes a lot of sense. But then that of course throws much of the argument for the unified cache out of the window.
Not really.
The unified cache argument is based on the assumption that your cache provides data to a unified pool of ALUs + control logic, which is flexible and powerful enough to replace many tasks that were previously performed by fixed-function units (HiZ, rasterizer, input assembly, color and Z compression tags, ROP tile caches, pre- and post-transform vertex caches, primitive-setup mini cache, etc.).

LRB has TMUs for the simple reason that Intel's designers thought texture addressing, sampling and filtering don't map well to LRB's general-purpose cores. Hence it's obvious that once you have a fixed-function hardware block that needs a continuous stream of data, you can't starve it by making the ill-advised decision of sharing its cache with another, completely independent and data-hungry computational unit (an LRB core in this case).

It's evident that this wouldn't be the case if LRB didn't have any hardware TMUs!
 
Core2 can trace its roots back to the P3, with a major design change in going to two cores and other evolutionary changes. What can RV770 trace its roots back to? I'm sure Core2 is more similar to the P3 than RV770 is to R300.

XB360 is where RV770 can trace its roots back to.

Heck, Core2 is probably more similar to the P3 than RV770 is to R580.

From an available-programming-model perspective, Core2 is much more different from the P3 than RV770 is from R580: 64-bit ISA support, virtualization support, SSE4, etc.

Fundamentally, everyone in the semiconductor industry is at roughly the same cadence because, to a first order, everyone is bound to roughly an 18-24 month process cycle.
 
If my memory serves correctly, Intel's x86 architectures didn't win out in the end by being intrinsically similar to, or even nearly as efficient or performant as, those other architectures.

It was due almost entirely to their manufacturing process. You can afford to be 30-40% less efficient with transistor use if you manufacture your chips on significantly smaller process nodes.

The fact that Microsoft Windows ended up with the lion's share of the OS market didn't hurt any either. Although PPC and Alpha still had a chance with NT, if they could have transitioned to, and manufactured their processors on, process nodes similar to Intel's.
All true points when talking about single-threaded scalar CPUs. The reality for Larrabee, though, is that being x86 compatible isn't going to cost it 30-40% in performance (or area) that has to be compensated for entirely with process leadership. The performance comes from the wide vector units, which have their own modern ISA. Together with the cache (which, as discussed, falls in the same category as having huge register files and numerous buffers/queues), they determine the bulk of the die area. x86 is merely the ISA of the fully generic portion, and it lowers the entry threshold.

If Larrabee fails to impress at rasterization, I believe it will much rather be due to factors other than x86. And if I'm right, this also means that future generations can correct those flaws and still reap the benefit of being compatible with the dominant CPU ISA.
 
The performance comes from the wide vector units, which have their own modern ISA.
Just how modern?
Each instruction can reference a memory operand, which is one source of complexity that does impact the size of the implementation.

Together with the cache (which as discussed falls in the same category as having huge register files and numerous buffers/queues) they determine the bulk of the die area. x86 is merely the ISA of the fully generic portion and lowers the entry threshold.
The L2 for Atom is twice the size of Larrabee's, and that L2 is a little less than half of Atom's die. So one Atom core's area is greater than one L2's area.
Larrabee's L2 is, at least numerically, half of that L2. Cache scales better in size than logic, so Larrabee's might be an even smaller proportion of the overall core+L2 area.
The vector unit is 1/3 of the die right off the bat. It is probably a bit bigger than it would be if it had a load/store ISA, but let's not consider that.

Let's just say 1/4 L2 + 1/3 VPU. I'll round that area up and say those two take 60% of the Larrabee core+cache area.
40% is the rest.
That leaves the x86 portion's contribution to overall bloat at 12-16% over what a more svelte ISA could bring, not taking into account SRAM scaling better than logic, the uncertainty over how much smaller the VPU could have been, and the fact that Larrabee's front end is a bit heftier due to some of the duplicated hardware and buffers inherent to having more threads.
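Spelling the estimate out (my reconstruction, applying the 30-40% "x86 tax" figure quoted earlier in the thread to the remaining 40% of the tile):

// L2 share  ~ 1/4 = 25%, VPU share ~ 1/3 = 33% -> rounded up together to ~60%
// everything else (scalar core, front end, ...) = 40%
// x86 bloat over the whole core+L2 tile:
//   40% * 30% = 12%,   40% * 40% = 16%
constexpr double rest       = 0.40;            // non-L2, non-VPU share of the tile
constexpr double tax_low    = 0.30;            // lower bound on the x86 overhead
constexpr double tax_high   = 0.40;            // upper bound on the x86 overhead
constexpr double bloat_low  = rest * tax_low;  // 0.12
constexpr double bloat_high = rest * tax_high; // 0.16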
 
Just how modern?
Each instruction can reference a memory operand, which is one source of complexity that does impact the size of the implementation.

It adds one AGU to the VPU, which really isn't a lot of work.

It's pretty clear that all the computational oomph is coming from the vector extensions, which are 4-operand - like any other sensible vector extension would be. The old arcane x86 cruft is essentially only doing housekeeping chores: calculating addresses, counting, etc. Which it is more than capable of doing.

I'm guessing Larrabee is developed by a skunkworks-like group with limited resources, money- and time-wise. They needed a two-way superscalar core (one instruction for the VPU every cycle, one for housework) which could be easily modified to accommodate the new VPU, so they dusted off the old Pentium architecture and ran with it.

I'm sure we'll see more optimized x86 architectures used in future Larrabee implementations if the concept takes off.


Cheers
 
It adds one AGU to the VPU, which really isn't a lot of work.
Supporting all x86 addressing modes needs a bit more than just an AGU, unless vector instructions can never cause an exception on a memory access (and they skipped a few crufty addressing modes).

It's pretty clear that all the computational oomph is coming from the vector extensions, which are 4-operand - like any other sensible vector extension would be. The old arcane x86 cruft is essentially only doing housekeeping chores: calculating addresses, counting, etc. Which it is more than capable of doing.
Capable of doing so, but with a core-area penalty that might make the difference between 24 and 30 cores in roughly the same die area. Other fudge factors might push it to a round 32, or possibly a secondary slim vector pipe could have snuck into the space.

The wattage penalty would be unknown, though using such an old core created before the days of active cooling might have a few issues anyway. (edit: make that early days of active cooling for PCs, I had a fan on my 486)
 
GPUs on the other hand typically have a steady and predictable flow of data from memory with excellent locality and short life span.
Texels, yes (and as you now know, there's a dedicated texel cache in Larrabee). But caching for the back end is much more interesting, which is why the Siggraph paper's description of tiled pixel processing, from setup to OM, is so intriguing. What they've designed (or, if you prefer, the compromise they've found) looks like it's going to work really nicely.
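For anyone who hasn't read that part of the paper, a very rough sketch of the binned/tiled back end idea as I understand it - all structures and sizes invented for illustration, the point being that a tile's colour/depth working set stays on chip until the tile is finished:

#include <cstdint>
#include <vector>

constexpr int TILE = 64;                       // 64x64-pixel tile (illustrative)

struct Primitive { /* post-setup triangle data */ };

struct Tile {
    uint32_t color[TILE * TILE];               // stays cache-resident while the
    float    depth[TILE * TILE];               // tile is being worked on
    std::vector<const Primitive*> bin;         // primitives overlapping this tile
};

// Placeholder for rasterisation, shading and output merging against the
// on-chip tile.
void shade_and_blend(Tile& /*t*/, const Primitive& /*p*/) {}

void render_tile(Tile& t, uint32_t* framebuffer, int fb_pitch, int x0, int y0) {
    for (const Primitive* p : t.bin)           // all pixel work for this tile
        shade_and_blend(t, *p);                // happens against on-chip data
    for (int y = 0; y < TILE; ++y)             // one write-out to memory when
        for (int x = 0; x < TILE; ++x)         // the tile is done
            framebuffer[(y0 + y) * fb_pitch + (x0 + x)] = t.color[y * TILE + x];
}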

A texel you fetch is likely to be reused for neighboring pixels, but unlikely for pixels a dozen pixels away in any direction. The advantage of say 2MB cache over say 4KB cache is pretty small for normal texture access patterns.
The scale has changed now - R600 has hundreds of KB of texel cache - though a good part of that usage is down to pre-fetching I guess. The L1s are tens of KB. Larrabee is described as having "32KB per core" of texture cache, but it seems as if pre-fetching is not part of their plan.

The paper also says "The texture units perform virtual to physical page translation and report any page misses to the core, which retries the texture filter command after the page is in memory." This makes me think that texture filtering in Larrabee is intimately controlled by the core and that the filtering hardware is directly associated (private) with the core. So, from the point of view of the context that's trying to get filtered texture results, it sounds like it can get pretty awkward.
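In code, the interaction described there might look roughly like this from the core's side. Every name here is hypothetical - the paper only says the TMU does the translation, reports the miss and the core retries:

#include <cstdint>

struct FilterRequest { uint64_t texture_va; float s, t; /* lod, etc. */ };
struct FilterResult  { bool page_miss; uint64_t miss_va; float rgba[4]; };

// Hypothetical stand-ins for the TMU command interface and the paging path;
// real hardware/runtime code would sit behind these.
FilterResult tmu_issue(const FilterRequest&) { return {false, 0, {0, 0, 0, 0}}; }
void bring_page_into_memory(uint64_t /*virtual_address*/) { /* map/fetch the page */ }

FilterResult sample_with_retry(const FilterRequest& req) {
    for (;;) {
        FilterResult r = tmu_issue(req);       // TMU does the VA->PA translation
        if (!r.page_miss)
            return r;                          // filtered texels came back
        bring_page_into_memory(r.miss_va);     // the core services the miss...
        // ...and simply re-issues the same filter command, as the paper says.
    }
}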

I do get a sense that texturing could be relatively weak in Larrabee. It would be sorta ironic, given how Larrabee has dedicated texturing hardware.

Sure. Which is why JIT makes a lot of sense, and why x86 doesn't. But this product is being marketed for its x86 abilities. People will be told to write to the metal to take advantage of all the power, only to be outperformed by JIT applications in the next generation.
That argument applies to graphics. Being able to code to the metal for HPC apps, and having a chip that can run existing HPC apps without change will provide a range of flexibility and smoothness of transition. CUDA et al don't offer anything like that.

Sure, Larrabee in blades (as the only x86 core) is prolly a long way off - but still it's cleaner than the x86 + Cell or x86 + GPU model that people are experimenting with now.

x86 is valuable because it's a commodity. That counts for a lot in non-consumer applications because it results in seriously cheap systems. Larrabee is just another step on that road.

Why? Wasn't a huge unified cache the way to go?
Of course, dedicated texture cache makes a lot of sense. But then that of course throws much of the argument for the unified cache out of the window.
They've gone to the lengths of dedicated texturing hardware so trying to filch cache from the core would prolly be a rod for their own back. Particularly as one of the bullet points on their justification for dedicated texturing hardware is "Loading texture data into the VPU for filtering requires an impractical amount of register file bandwidth".

Though shortly afterwards it says "Larrabee can also perform texture operations directly on the cores when the performance is fast enough in software". Not sure if that is meant to suggest something like fp16/fp32 texture filtering or merely that the cores will do un-filtered fetches of >32b texels directly from memory rather than going via the texture units.

Also, finally, the relatively loose structure of on-die cache (rather than wodges of registers) means the architecture is going to be more open to non-traditional rendering techniques. It's fundamental to Larrabee.

Jawed
 
The L2 for Atom is twice the size of Larrabee's[...]
How much bloat is there in Larrabee's L2 due to the multi-way (at least 8-way) cache coherency with other L2s? Also, how much bloat is there due to the special features? We know that the cache has extra prioritisation features (weightings for LRU?) and line locking.

I haven't the foggiest what sort of "bloat" this stuff would all add.

Jawed
 
I'm guessing Larrabee is developed by a skunkworks-like group with limited resources, money- and time-wise. They needed a two-way superscalar core (one instruction for the VPU every cycle, one for housework) which could be easily modified to accommodate the new VPU, so they dusted off the old Pentium architecture and ran with it.

I'm sure we'll see more optimized x86 architectures used in future Larrabee implementations if the concept takes off.


Cheers
While I'm not able to discuss the technical side with you, I think that is really interesting.
Intel really has to land a deal in the console space.
Even if the contract is barely profitable, it could be enough to cover R&D; efforts on the software front would be shared; and they would be sure that the product will have some support. In short, it would reassure investors, and Intel would be more likely to invest further in the project.


I've a question, and I'm sure I don't have the technical knowledge to say whether it's relevant/possible, so if it doesn't make sense, tell me ;)

In fact it is related to both Atom and Larrabee.
Do Atom and Larrabee really need to be compatible with every line of x86 code ever written
(given the market segments they are supposed to aim at)?
Would it be possible to implement a custom subset of the x86 ISA that is "compatible enough", implying only minor tweaks to the existing tools and allowing for some neat hardware gains?

If so, could Larrabee and Atom share this new ISA, and should they fuse?
 
The paper also says "The texture units perform virtual to physical page translation and report any page misses to the core, which retries the texture filter command after the page is in memory." This makes me think that texture filtering in Larrabee is intimately controlled by the core and that the filtering hardware is directly associated (private) with the core. So, from the point of view of the context that's trying to get filtered texture results, it sounds like it can get pretty awkward.

From the paper:
"Finally, the on-die inter-processor network allows fixed function
logic agents to be accessed by the CPU cores and in turn to access
L2 caches and memory. As with memory controllers, these would
typically be spread around the ring network to reduce congestion.
"

The wording seems to indicate that the texturing agents can be positioned in different ways.
I'm not sure how to reconcile this with the 32KB per core statement.
 
It might mean that there are N TMUs for M cores, and each TMU has 32KB * M/N of texture cache, which might be pre-partitioned and statically assigned to certain cores (in this case a TMU would only serve a subset of all the cores).
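A toy model of that arrangement (all numbers illustrative, built only on the paper's "32KB per core" figure; none of this is from the paper itself):

#include <cstdio>

constexpr int M = 16;                  // cores (illustrative)
constexpr int N = 4;                   // TMUs  (illustrative)
constexpr int KB_PER_CORE = 32;        // the "32KB per core" statement

constexpr int cores_per_tmu = M / N;                        // 4 cores share a TMU
constexpr int kb_per_tmu    = KB_PER_CORE * cores_per_tmu;  // 128KB of cache per TMU

int tmu_of_core(int core)  { return core / cores_per_tmu; } // which TMU serves it
int pool_of_core(int core) { return core % cores_per_tmu; } // which 32KB pool it owns

int main() {
    for (int c = 0; c < M; ++c)
        std::printf("core %2d -> TMU %d, pool %d (%d KB)\n",
                    c, tmu_of_core(c), pool_of_core(c), KB_PER_CORE);
}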
 
A static assignment of a portion of texture cache to each core would make sense, as this would tally with the core being involved in texture paging, i.e. the core appears to be responsible for fetching pages when they're not present and then re-requesting the filtering operation.

Jawed
 
Anything besides M=N might mean the TMUs are not private to each core.
If I understand Nao properly, I think that some TMUs could be aggregated into a bigger unit.
You could have, for example, a texture unit every 4 cores along the bus, aggregating 4 TMUs and 128KB of cache split into 4 pools.
Is that what you think, Nao?
 