Analyst: Intel's Larrabee chip needs reality check

In a GPU setup, there might be 2-4 of those.
For Larrabee, 16, 32, 64?
It makes for interesting tradeoffs.
RV770 has 10 clusters. I interpret this to mean it has 10 sequencers (all independent), each running user-defined programs. The ALUs and TUs pick up the instructions in these programs that are targeted at them; the rest of the instructions (control flow, register manipulation, cache control, etc.) run on the sequencer itself.
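As a toy sketch of the dispatch model I have in mind (all names and "instructions" below are made up for illustration, not real RV770 microcode):

```cpp
// One sequencer per cluster walks the program and hands off ALU/TU work,
// executing control-flow and housekeeping instructions itself.
#include <cstdio>
#include <vector>

enum class Unit { Sequencer, Alu, Tex };

struct Instruction {
    Unit target;        // which block the instruction is aimed at
    const char* text;   // placeholder for the real encoding
};

static void run_cluster(const std::vector<Instruction>& program) {
    for (const Instruction& inst : program) {
        switch (inst.target) {
            case Unit::Alu:       std::printf("ALU <- %s\n", inst.text); break;
            case Unit::Tex:       std::printf("TU  <- %s\n", inst.text); break;
            case Unit::Sequencer: std::printf("SEQ :: %s\n", inst.text); break; // branches etc.
        }
    }
}

int main() {
    // Ten independent clusters would each run their own copy of this loop.
    run_cluster({{Unit::Sequencer, "loop_start"},
                 {Unit::Tex,       "sample r0, t0"},
                 {Unit::Alu,       "mad r1, r0, c0, c1"},
                 {Unit::Sequencer, "branch_if_not_done loop_start"}});
}
```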

By the time Larrabee arrives ATI could be at ~32 clusters.

Jawed
 
MIPS has branch delay slots. An architecture wart on the side of a somewhat clean architecture. BDS are in general a very bad design.
Dunno. On MIPS I, and maybe II, when there was only a 3- or 4-stage pipeline, it made a significant contribution to keeping the pipeline full, and the whole raison d'être of the MIPS project was to avoid spending money on silicon when you could spend it on software instead.

By the time of MIPS III there's no doubt it was becoming baggage, but they wanted to maintain compatibility as far as possible, so they were stuck with it.

I've also played with the Jaguar RISC implementation quite a bit, which has a slightly different take on it (the BDS are uninterruptible, so they don't have to jump through quite as many hoops with the interrupts).
 
I expect the 16-way SIMD to be a significant chunk of the logic. If, on the other hand, the L2 cache is large, I count that as mistake number two for a GPU design.
Why? Instead of building a pile of large register files like in RV770 (2.5MB in total, it seems), the memory will be organised as a cache (2MB).

GPUs are riddled with little buffers/caches. Nasty things happen when they fill or run dry. Agglomeration of these dedicated functional blocks into general cache, Larrabee style, looks pretty damn smart to me (given that Larrabee has a cache control protocol that is smarter than typical CPU caching). It's the same argument as unified versus dedicated VS and PS pipes.

Speaking of the 16-way SIMD. If Larrabee II comes with even newer and shinier 32-way SIMD, old apps will likely operate at half the speed of what the chip could potentially do. For ATI and Nvidia this is not a problem.
Depends on whether you program Larrabee close to the metal or through a JIT of some sort. For D3D graphics a JIT is planned, and Ct is JIT-based too.
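To sketch what I mean (plain C++ and SSE here purely for illustration, nothing Larrabee-specific):

```cpp
// Why a JIT (or width-agnostic source) sidesteps the "old apps run at half
// rate on wider SIMD" problem: the scalar loop can be re-vectorised for
// 16- or 32-wide hardware, while the hand-written 4-wide SSE version is
// stuck at its coded width forever.
#include <xmmintrin.h>
#include <cstddef>

// Width-agnostic: a JIT targeting the actual machine can emit vector code
// of whatever width the hardware offers.
void scale_generic(float* data, std::size_t n, float k) {
    for (std::size_t i = 0; i < n; ++i)
        data[i] *= k;
}

// Hard-coded 4-wide SSE: correct, but it will never use more than 4 lanes,
// no matter how wide the next chip's SIMD is.
void scale_sse(float* data, std::size_t n, float k) {
    __m128 kv = _mm_set1_ps(k);
    std::size_t i = 0;
    for (; i + 4 <= n; i += 4)
        _mm_storeu_ps(data + i, _mm_mul_ps(_mm_loadu_ps(data + i), kv));
    for (; i < n; ++i)   // scalar tail
        data[i] *= k;
}
```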

Jawed
 
There were many non-technical reasons for x86's success.
Certainly. But that's not my point. For all I care it could be PowerPC, IA64, SPARC, ARM, MIPS or XYZ that dominates the market. I seriously doubt it would be significantly faster than x86 though. Other factors determine performance nowadays, and even performance / area.
A lot of structures that are not in the decoder are affected by supporting x86 at speed.
Such as? Beyond the decoders, x86 is really just a RISC architecture as far as I know. And the decoders are only marginally larger than those for other ISAs, and are still dwarfed by other units.
 
It would be quite a gamble [for NVIDIA] to copy Intel when not even Intel knows how well its gamble will pay off once present in final silicon.
Ever since the advent of shaders there has been talk about GPUs becoming more like CPUs. And with multi-core CPUs things even started to converge from the other end as well. So it's not a giant leap to try to see where the convergence will end. Using its expertise, NVIDIA can pretty much ensure that any new architecture is going to perform as expected while still offering many new features at minimal cost. It's a far greater gamble for Intel, who has no point of comparison and no alternatives to evaluate.
 
so are the designs for GT200 and RV770! It could be argued that the RV770 design is even older than that!

He did say "market", not "design" or its age. ;)
For all we know, Core 2's design was basically ready during the late "Prescott"/early Pentium D market dominance, and just went on debug/fab process refining mode from there until mid-2006.
That explains the very early appearance of "Conroe" engineering samples out in the open, much like what happened with "Nehalem" (albeit to a much smaller degree, given that the first ones are headed to completely new motherboards/socket infrastructure, and will be limited to the very high end for now).

The point being, there's nothing to suggest that designing a modern x86 takes any less time than designing a modern GPU.
However, its market life is considerably longer. That's partly because fat margins and an overwhelming commercial position mean Intel has little incentive to cut production on a given design after just 8 or 10 months, and partly because the CPU market itself, being much larger than the one for dedicated GPUs, has far more consumer and business inertia: most buyers won't even consider upgrading as often as a PC gamer does.
 
Dunno. On MIPS I, and maybe II, when there was only a 3- or 4-stage pipeline, it made a significant contribution to keeping the pipeline full, and the whole raison d'être of the MIPS project was to avoid spending money on silicon when you could spend it on software instead.

Exposing the microarchitecture is generally a bad idea if you intend to build an ISA with any kind of longevity.

Cheers
 
GPUs are riddled with little buffers/caches. Nasty things happen when they fill or run dry. Agglomeration of these dedicated functional blocks into general cache, Larrabee style, looks pretty damn smart to me (given that Larrabee has a cache control protocol that is smarter than typical CPU caching). It's the same argument as unified versus dedicated VS and PS pipes.
Having a general-purpose cache that is transparent to the user is a very nice feat. However, as the paper Intel published at SIGGRAPH clearly shows, it just doesn't work for graphics. Larrabee has cache-control instructions because you want cache blocks to be pinned, or stores to go to memory without polluting the caches, or recently loaded lines to be evicted immediately; the rasterizer must choose an appropriate tile size, depending on the render target, to fit the L2; back-end threads are pinned to a specific core so that non-atomic instructions can be used for synchronization between them; and so on.

This is not general purpose programming with a general purpose cache, it is 'emulating' hardware buffers and behavior using special purpose instructions and cache-aware code.
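A rough analogue of those cache-control tricks, using plain SSE hints rather than Larrabee's actual instructions (which I don't have to hand), just to make the idea concrete:

```cpp
// The point: the programmer, not the cache, decides what gets kept and what
// bypasses it. 'dst' is assumed 16-byte aligned (required by _mm_stream_ps),
// and n a multiple of 4 -- purely an illustrative sketch.
#include <xmmintrin.h>
#include <cstddef>

void copy_without_polluting_cache(float* dst, const float* src, std::size_t n) {
    for (std::size_t i = 0; i + 4 <= n; i += 4) {
        // Non-temporal prefetch: bring the line in, but mark it for early
        // eviction so it doesn't displace the working set.
        _mm_prefetch(reinterpret_cast<const char*>(src + i + 16), _MM_HINT_NTA);
        __m128 v = _mm_loadu_ps(src + i);
        // Streaming store: write-combine straight to memory, no cache fill.
        _mm_stream_ps(dst + i, v);
    }
    _mm_sfence();  // make the streaming stores globally visible
}
```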
 
Since the release of the first C2D (Nehalem hasn't launched yet), what exactly has changed in Intel's CPUs other than FSB, cache and clock speeds? Not much as far as I can tell; hopefully someone with a bit more knowledge can explain. Meanwhile, during that same period, we have seen three generations of GPUs.

http://www.xbitlabs.com/articles/cpu/display/intel-wolfdale_2.html

Not really that much of a difference, but there's more to the Penryn core than SSE4.1, a faster FSB, higher clock speeds and a larger cache.
 
Exposing the microarchitecture is generally a bad idea if you intend to build an ISA with any kind of longevity.
Sure, but back in the 1980s people didn't realise the days of making heavy revisions to the ISA were coming quickly to an end.
 
Why? Instead of building a pile of large register files like in RV770 (2.5MB in total, it seems), the memory will be organised as a cache (2MB).
Why is that better if both designs are going to be using that storage predominantly to house a lot of thread context?

GPUs are riddled with little buffers/caches. Nasty things happen when they fill or run dry. Agglomeration of these dedicated functional blocks into general cache, Larrabee style, looks pretty damn smart to me (given that Larrabee has a cache control protocol that is smarter than typical CPU caching). It's the same argument as unified versus dedicated VS and PS pipes.
The counterargument is that a handful of generally slower, one-size-fits-all cache ports under contention isn't particularly nice either.

Certainly. But that's not my point. For all I care it could be PowerPC, IA64, SPARC, ARM, MIPS or XYZ that dominates the market. I seriously doubt it would be significantly faster than x86 though. Other factors determine performance nowadays, and even performance / area.
RISC designs contemporaneous with the i386 and Pentium could do the same with a 30-40% economy in transistor count, and often led significantly in performance.
That's an awfully big factor to discount.

Such as? Beyond the decoders x86 is really just a RISC architecture as far as I know. And the decoders are only marginally larger than those for other ISAs and are still dwarfed in comparison to other units.
If Larrabee used the Pentium Pro, your point would be valid. The P54 is all CISC, and its decoders are massively larger than an equivalent RISC decoder.
On top of that, x86 has much more complex addressing modes, flag handling, exceptions, x87, paging and segments, and various other things that add small but measurable amounts of transistor complexity.
 
This is not general purpose programming with a general purpose cache, it is 'emulating' hardware buffers and behavior using special purpose instructions and cache-aware code.
You're bound to a certain cache size and have to adapt the software to it, that's true, but it's still much more flexible and adaptive than what GPUs offer today. For any algorithm you have in mind, the entire cache is available for your temporary storage needs. The sizes of buffers and queues can be adjusted by changing a line of code, changing an application setting, or even adjusted dynamically.
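A minimal sketch of that "change one line" point, with made-up names (the 256 KB per-core L2 figure is from the SIGGRAPH paper; nothing else here is real driver code):

```cpp
// The tile a software renderer works on is just a buffer, so its size can be
// picked at startup, or changed at run time, to fit whatever share of the L2
// you want to give it.
#include <cstddef>
#include <vector>

struct TileConfig {
    std::size_t l2_bytes_per_core;   // e.g. 256 * 1024
    std::size_t bytes_per_pixel;     // colour + depth, say 8
};

std::size_t pick_tile_side(const TileConfig& cfg, double share_of_l2 = 0.5) {
    const std::size_t budget =
        static_cast<std::size_t>(cfg.l2_bytes_per_core * share_of_l2);
    std::size_t side = 16;
    // Largest power-of-two tile that still fits in the budget.
    while ((side * 2) * (side * 2) * cfg.bytes_per_pixel <= budget)
        side *= 2;
    return side;
}

int main() {
    TileConfig cfg{256 * 1024, 8};
    const std::size_t side = pick_tile_side(cfg);
    std::vector<unsigned char> tile(side * side * cfg.bytes_per_pixel);
    (void)tile;                      // render into 'tile' here
}
```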
 
You're bound to a certain cache size and have to adapt the software to it, that's true, but it's still much more flexible and adaptive than what GPUs offer today. For any algorithm you have in mind, the entire cache is available for your temporary storage needs. The sizes of buffers and queues can be adjusted by changing a line of code, changing an application setting, or even adjusted dynamically.
Yes, and this would be an advantage, as a GPU cannot magically re-allocate its post-transform cache for frame-buffer data, for example. But a cache also has two significant drawbacks:
- you can thrash it even with unrelated data, which effectively reduces its utilization (i.e. its effective size)
- it requires coherency traffic, which can quickly burn quite a bit of your interconnect bandwidth even if the data is not shared

GPU internal buffers have neither of these problems: vertex data is not going to push out texel data you might be using soon, and moving fragments out of the rasterizer doesn't require extra bandwidth to inform everybody else that you are doing it.
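A toy illustration of the first drawback, just to show what I mean by unrelated data shrinking the effective cache size:

```cpp
// Streaming a large, unrelated buffer through a shared cache evicts the
// working set you actually care about -- nothing is "shared", yet the hot
// data still gets pushed out.
#include <vector>

float sum_working_set(const std::vector<float>& hot) {
    float s = 0.0f;
    for (float v : hot) s += v;          // hot data we'd like to keep cached
    return s;
}

void stream_unrelated(const std::vector<float>& cold) {
    volatile float sink = 0.0f;
    for (float v : cold) sink += v;      // one pass over cold data, but it
                                         // still displaces 'hot'
}

int main() {
    std::vector<float> hot(32 * 1024, 1.0f);        // 128 KB: fits in a 256 KB cache
    std::vector<float> cold(4 * 1024 * 1024, 1.0f); // 16 MB: far larger than the cache
    sum_working_set(hot);                // warm the cache
    stream_unrelated(cold);              // thrash it
    sum_working_set(hot);                // mostly cache misses again
    return 0;
}
```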

The only way to (partially) work around these problems is complex, non-general-purpose code. Look at the back-end in the SIGGRAPH paper: one setup thread and three work threads are pinned to a core, and they use a non-shared counter as a semaphore to save on synchronization overhead. The setup thread uses cache-control instructions to avoid polluting the cache with vertex data that won't be needed soon. The work threads run a loop that the JIT has unrolled to a certain batch size to cover the execution and texture-fetch latencies (which are hard to estimate, by the way, and may change dynamically, potentially causing lower utilization).
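My reading of that counter scheme, sketched in portable C++ (this is not code from the paper; std::atomic is used here for correctness, whereas the paper's whole point is that threads sharing one core can drop the atomics and use plain loads and stores):

```cpp
// The setup thread publishes how many batches it has produced; each work
// thread owns a fixed share of them and simply watches the counter.
#include <atomic>
#include <cstddef>

struct BackEnd {
    std::atomic<std::size_t> produced{0};       // written by the setup thread only
    static constexpr std::size_t kWorkers = 3;  // three work threads per core

    // Setup thread: call after a batch's data has been written out.
    void publish_batch() { produced.fetch_add(1, std::memory_order_release); }

    // Work thread 'id' (0..2): process every batch assigned to it, in order.
    template <typename Fn>
    void work_loop(std::size_t id, std::size_t total_batches, Fn process) {
        for (std::size_t b = id; b < total_batches; b += kWorkers) {
            while (produced.load(std::memory_order_acquire) <= b) {
                // spin: on a 4-way threaded core this mostly cedes issue
                // slots to the sibling hardware threads
            }
            process(b);                          // batch b is now ready
        }
    }
};
```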

This might be cool, but it is not exactly the kind of general-purpose code you find every day. You are using a special-purpose CPU with special-purpose code to work around problems you don't have on a GPU, while trying to reap the benefits of the advantages you do have over a GPU. Potentially consuming much more power per unit of work, I might add.
 
It's about the tool chain. You can create absolutely anything for Larrabee from the day it launches, you just have to write the application (or scale an existing app). You already have the O.S. type functionality, runtime libraries, compilers, powerful debug and profiling tools, frameworks, etc. And developers are already well acquainted with them.

Are you financially associated with this in any way?
These assertions are weird. Nobody in their right mind wants to use Larrabee as a vanilla x86 processor. It sucks as a vanilla x86 processor by just about any metric you'd like to mention: price, size, performance, power draw, price/performance, et cetera.
The only reason to use Larrabee is if you want to take advantage of its parallel vector units. And those parallel vector units are NOT part of its x86 legacy. No programmers are well acquainted with the tools necessary to access, analyze and debug this functionality; they don't exist yet (outside Intel).
How can you even suggest that writing parallel vector code is equivalent to a "hello world" program using Visual C++ or some other tool programmers are already "well acquainted" with? Just what the heck do you mean?

I don't care at all about Larrabee as a graphics processor; I'm from computational science, and I care about it from a scientific-computing angle. It may, depending on quite a few things, be useful there.
But that has nothing whatsoever to do with being x86 compatible.

x86 is there as a market lock-in feature. Compare with OpenCL for an alternative take on performance coprocessing for personal computers. There is more than one way to skin a cat, and it is no great mystery why Intel is trying this path.

Sorry about the tone, but you really seem anxious to sell the x86 part of Larrabee, and short of personal gain, I just can't see why anyone would do that. Feel free to explain.
 
No one is saying that P54 is a state of the art CPU, but last time I checked NVIDIA has designed ZERO of them, so while I'm sure they are very smart and they can do it, I don't see them being as good as Intel from the get-go.
Nvidia has a few CPU products, so it has at least licensed and implemented a few.

The argument <company X has no experience with Y> has to be a reflexive one.
The qualifier is that one of the two problems is harder and is less established than the other.
There are dozens to hundreds of general-purpose CPU designs, depending on how loose we define the term.
The pool of massively parallel products is smaller, and the number of general-purpose massively parallel products is about nil as of now.

On the other hand you might say that NVIDIA doesn't need to design such a general-purpose core anyway, but I disagree on this. It will eventually happen, probably sooner rather than later.
I think we can make a distinction between having to design a general-purpose core and following the model Intel is using.
 
Why is that better if both designs are going to be using that storage predominantly to house a lot of thread context?
I'm not saying it's better, merely another way of doing something similar. I don't like the way Larrabee loses cycles to context switching, for what it's worth.

The counterargument is that a handful of generally slower, one-size-fits-all cache ports under contention isn't particularly nice either.
With a fixed SIMD width and well-defined data types the cache is hardly a match made in hell. Plus, high-performance graphics programming forces developers to pack their data into "cache-line-friendly" organisations, whether it's vertex attributes, texture resources or render targets.
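The stock illustration of what "cache-line friendly" means in practice (generic C++, nothing Larrabee-specific; 64-byte lines and 16-wide SIMD assumed):

```cpp
// For 16-wide SIMD you want each attribute contiguous (structure-of-arrays),
// so one 64-byte line feeds 16 lanes of a single attribute instead of
// dragging in attributes you aren't using.
#include <cstddef>

constexpr std::size_t kWidth = 16;   // SIMD width / vertices per packet

// Array-of-structures: gathering 16 x-coordinates touches 16 different lines.
struct VertexAoS { float x, y, z, w, u, v, pad[10]; };   // 64 bytes per vertex

// Structure-of-arrays: 16 x-coordinates are exactly one 64-byte cache line.
struct VertexPacketSoA {
    float x[kWidth];
    float y[kWidth];
    float z[kWidth];
    float w[kWidth];
    float u[kWidth];
    float v[kWidth];
};

static_assert(sizeof(float) * kWidth == 64, "one attribute == one cache line");
static_assert(sizeof(VertexAoS) == 64, "one vertex == one cache line in AoS");
```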

And register-file porting is still prone to grinding down to super-low throughput when programming with irregular register accesses (e.g. indexed addressing or "inter-thread" sharing).

GPUs can never fully exercise all their buffers. That means there's always memory going to waste. This is just like the unified shader architecture argument. Sure it costs more to make it work, but once you've scaled up past the break-even mark, the general cache model is going to win.

What we don't know is where the break-even mark is. I'm willing to accept that Larrabee could end up 18 months too early (in terms of performance per mm² or per watt). But this is a marketing war now, CUDA versus Larrabee, so the latter can't come soon enough.

A similar example is double precision in CUDA: it's so slow right now it's practically a joke.

Jawed
 
Nvidia has a few CPU products, so it has at least licensed and implemented a few.
Licensing is not exactly designing from the ground up. On a completely different note, they can barely design a working chipset... and they got at least one graphics product completely wrong. They can make mistakes too, even at what they are very good at.

The qualifier is that one of the two problems is harder and is less established than the other. There are dozens to hundreds of general-purpose CPU designs, depending on how loose we define the term. The pool of massively parallel products is smaller, and the number of general-purpose massively parallel products is about nil as of now.
Yep, it's harder and less established, and I'd put my money on Intel getting it (more) right the first time rather than NVIDIA, because it's a major departure from how graphics hardware has been done in the last 5 years, while software graphics pipelines are well understood. We shall see in a couple of years, but again, I expect the next NVIDIA architecture to be far more similar to LRB than G8x is now; that's the only way forward, and there is no going back (in the short term).

I think we can make a distinction between having to design a general-purpose core and following the model Intel is using.
Unless you know what NVIDIA is cooking, I don't see how we can make such a distinction, ISA aside.
 