Intel would be *insane* to reuse the same opcodes for the Larrabee vectors and existing instructions. Just as Intel used available opcode space to add MMX and SSEx, it likely did the same with these instructions. One of the nice things about a variable-length instruction set is that you can always make more opcode space.
If Intel's not reusing the same opcodes, then why isn't it supporting SSE and MMX?
Is there just a huge gap in the opcode space (full of more compact instructions) that Larrabee can't use?
The downside to using variable-length instructions is that the extra bytes expand the code footprint.
The REX prefix alone is enough to negate the benefit of doubling the register count with x86-64 in a number of cases.
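To make that concrete, here are the raw encodings (standard x86/x86-64 encodings, shown as C byte arrays purely for illustration):

    /* add eax, ebx -- 2 bytes, same in 32-bit and 64-bit mode */
    unsigned char add32[] = { 0x01, 0xD8 };
    /* add rax, rbx -- the REX.W prefix (0x48) makes it 3 bytes */
    unsigned char add64[] = { 0x48, 0x01, 0xD8 };

And since touching r8-r15 at all requires a REX prefix, code that actually uses the extra registers pays the byte tax on nearly every instruction.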
You surely mean patents (not copyright). That aside, I would imagine that Intel would be pretty aggressive about getting patents on the new vector instructions. Nobody without a patent deal with Intel will be able to make a binary-compatible Larrabee.
SSE instructions themselves cannot be patented. If they could, then Transmeta's software emulation would have been illegal.
The hardware or microcode implementation of those instructions is covered by copyright, possibly also by circuit-topography (mask work) protection. That protection is only good for a decade, so any x86 from Intel over 10 years old can be freely copied by anybody.
The various cross-licensing agreements for x86 allow the few manufacturers out there to freely share the copyright and ignore patent issues on particular implementation details of any designs that are more recent.
If Larrabee's non-vector ISA is the same as the 10+ year old x86, that part can be freely implemented on any competing design. Larrabee's vector extensions would still be exclusive, but if they are truly GPU-like, it would be a nearly 1 to 1 mapping to GPU functionality anyway.
In particular, AMD could map this at will to some future GPGPU, if it wished.
The x86 segment can reside in the command processor section of the GPU.
In the event there is a 4-chip implementation, there would be at least 4 x86-equivalent core segments tied to full GPU SIMD arrays.
The total count of x86 cores would be lower, but in the case of consumer graphics, it is much more likely that the GPU segment is going to be exercised more heavily.
How about this for a bold statement: I predict that NVIDIA will eventually be forced to re-invent itself as a software-only company working on middleware and graphics engines for game developers. It will fail at that and follow SGI into oblivion.
Considering Nvidia doesn't count either of those fields as a main business, that is at the very least a bold statement.
Those extra transistors in the L2 cache won't burn very much power at all. In Intel's 45nm, transistors don't leak that much and not that many transistors in the cache are active (switching) in any given cycle. However, if you took that area and turned it into some fixed-function pipeline, those transistors would be switching like mad each cycle, consuming much more of your power budget than cache SRAM.
But if the extra cache is not necessary, then those transistors don't do anything useful either.
If a judicious amount of specialized hardware can significantly accelerate some portion of the workload, that means the chip can complete that phase more quickly and then idle the units.
For dynamic power, it isn't the number of transistors. It is the number of times a transistor switches. That is why cache memory is reasonably power efficient. You can increase the size of a second-level cache without it using that much more of your power budget.
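For reference, the standard first-order model makes the point explicit:

    P_{dyn} = \alpha \, C \, V^{2} \, f

where \alpha is the activity factor (the fraction of the capacitance that actually switches in a cycle), C is the total switched capacitance, V the supply voltage, and f the clock frequency. A bigger L2 adds a lot of C but almost nothing to \alpha, since only the accessed subarray toggles in any given cycle.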
That said, you could argue that fixed-function hardware could do the same computation with fewer transistors switching (as compared to Larrabee's vector units). That is likely true for some operations. How much, I really don't know.
This is *not* how cache coherence works. It is much more efficient than that. First, TLBs are totally unaffected by cache coherence.
Entries evicted from the TLB migrate into the caches as data. Once it is in cache, it is subject to the same coherency issues as any other cache line.
AMD just got screwed by this.
I should have been clearer and stated that it would need to update TLB entries: the hardware sets bits recording whether a page has been accessed or written to, and those updates would necessitate invalidating stale copies of the entries hiding out in other caches.
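For anyone following along, these are the architecturally defined x86 page-table-entry bits in question (shown here as C constants):

    #define PTE_PRESENT  (1ULL << 0)  /* page is mapped                  */
    #define PTE_WRITABLE (1ULL << 1)  /* page may be written             */
    #define PTE_ACCESSED (1ULL << 5)  /* set by hardware on first access */
    #define PTE_DIRTY    (1ULL << 6)  /* set by hardware on first write  */

Because the hardware itself flips the Accessed and Dirty bits, every such update is a store to the in-memory page tables, and any stale copies of that entry have to be dealt with.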
Second, the cache coherence protocol doesn't update all the other caches in the system. The coherence protocol just enforces the invariant that a given 64B block of data is either (1) read-only in one or more caches or (2) read/write in a single cache. It does this by invalidating any other copies of the block before a processor is allowed to write its copy. No data is transferred until another processor reads the block (which just becomes a normal miss).
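A minimal sketch of that invariant, as a write-invalidate protocol would maintain it (MSI-style states; names are illustrative, not Intel's actual implementation):

    #include <stddef.h>

    enum line_state { INVALID, SHARED, MODIFIED };

    /* Before a core may write a 64B block, all other copies are
       invalidated -- a small message, not a data transfer.  Data
       moves only when another core later misses on the block.   */
    void obtain_write_permission(enum line_state copy[], size_t ncaches,
                                 size_t writer)
    {
        for (size_t i = 0; i < ncaches; i++)
            if (i != writer)
                copy[i] = INVALID;   /* invalidation message  */
        copy[writer] = MODIFIED;     /* sole read/write copy  */
    }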
That's broadcast cache coherence. You can't invalidate all possible copies of a line without sending a message to that effect over the ring bus to every cache that might have a copy.
For the read case in the absence of some overarching directory, any cache misses must snoop all 24 other caches to make sure the line isn't already there. That injects unpredictable latency to the operation and it burns bandwidth.
Thank goodness there's the massive ring bus, though one would hope it has better things to do with its time than pass dozens of coherency packets. At a minimum, each message must contain 4 bytes for a 32-bit address, plus however many bytes make up the rest of the packet.
It's almost double that for 64-bit mode, and we don't know if Larrabee is capable of 1 or 2 memory operations a cycle.
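To put rough numbers on it, a coherency message might look something like this; the layout is purely illustrative, since nobody outside Intel knows the real packet format:

    #include <stdint.h>

    struct snoop_msg {
        uint8_t  type;    /* invalidate, read-shared, ack, ... */
        uint8_t  src;     /* originating ring stop             */
        uint16_t flags;
        uint32_t addr;    /* 4 bytes for a 32-bit address; a   */
                          /* 64-bit address doubles this field */
    };

Call it roughly 8 bytes, times up to 24 snoops per miss in a naive broadcast.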
This is how cache coherence works in Intel's existing multi-core systems (in fact, this is basically how all cache-coherent multiprocessor systems from AMD, Sun, IBM, HP, etc. work).
One need only look at K8's lousy scaling beyond 4 sockets, or at how Cell's non-coherent scheme allows it to excel in select workloads, to see the downside. There is a bandwidth cost, which the ring bus should handle in non-pathological cases. There is also a latency cost, which is not so easily handled unless the ring bus has very low latency, even under load.
There is no need to "turn it off". Once you've burned the design complexity to implement it, it really is just a win and doesn't get in the way. If all processors are working on private data, it has basically zero impact. Only when true sharing of data happens does it kick in.
No single core knows for certain the status of all other cache lines. There must be some amount of coherency traffic involved.
If there is a lot, it is possible that wildly broadcasting on the ring bus can cause individual ring stops to saturate, which forces delays down each segment.
Why would it need to pass thread contexts? A GPU might, but Larrabee won't. I suspect that most of the algorithms in Larrabee will be work-list type algorithms.
Why would it? Because x86 can do it right now. Threads get thrown every which way in today's systems at the whims of some software scheduler.
Larrabee supports x86 system behavior, so it can't ignore that portion of the specification.
In the simplest implementation, each thread just pulls off the next task from the queue, does the task, repeat. In more sophisticated implementations, the work list is hierarchical (one per core, plus a global queue that supports work stealing). This allows for great load balance. Plus, the task information probably fits in a single 64B cache block (a few pointers, maybe a few index or loop counts).
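A sketch of the simplest case, with a hypothetical 64-byte task descriptor and a trivial shared queue (a real implementation would use per-core deques with work stealing and atomic operations):

    #include <stddef.h>
    #include <stdint.h>

    /* Hypothetical task descriptor sized to one 64B cache block. */
    struct task {
        void   (*run)(void *);   /* the work to perform           */
        void    *data;           /* pointers to input/output      */
        uint32_t start, end;     /* e.g. a range of tiles/indices */
        uint8_t  pad[40];        /* pad to 64 bytes (on x86-64)   */
    };

    struct queue { struct task *tasks; size_t head, tail; };

    static struct task *queue_pop(struct queue *q)
    {
        /* would be an atomic fetch-and-add in real code */
        return q->head == q->tail ? NULL : &q->tasks[q->head++];
    }

    /* Each hardware thread just loops: pull a task, run it, repeat. */
    static void worker(struct queue *q)
    {
        struct task *t;
        while ((t = queue_pop(q)) != NULL)
            t->run(t->data);
    }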
That's closer to a GPU than a full x86 system context.
Larrabee must function as both a GPU and a system CPU. The system half demands a far heftier amount of context and protection, which leads to a non-trivial amount of additional cost.
Having a big, fast, shared way for the components of the chip to talk to each other sounds like a *good* thing. It is optimized for 64B block transfers, making it easy to move data between caches or to and from memory. I don't see a problem here. Remember, cache hits don't touch the ring, so assuming some locality of reference, this should be plenty of bandwidth.
It is plenty of bandwidth. The question is whether it could have been engineered to be smaller and save a lot of fast-switching transistors if it didn't have an outsize worst-case data load.
The more we chat back and forth (which I'm really enjoying, BTW), the more I'm coming to realize how radical a departure Larrabee is from what is currently done in GPUs. Like I said earlier, my background is on the general-purpose multi-core side of things. I'm only beginning to realize how GPUs have forced game developers and such to think in one specific way about computation. I really do think that something like Larrabee is going to spur innovation as those constraints are lifted.
That would be the APIs. The hardware is significantly more flexible.