Intel would be *insane* to reuse the same opcodes for the Larrabee vectors and existing instructions. Just as Intel used available opcode space to add MMX and SSEx, it likely did the same with these instructions. One of the nice things about a variable-length instruction set is that you can always make more opcode space.
If Intel's not reusing the same opcodes, then why isn't it supporting SSE and MMX?
Is there just a huge gap in the opcode space (full of more compact instructions) that Larrabee can't use?
The downside to using variable-length instructions is that the extra bytes expand the code footprint.
The REX prefix alone is enough to negate the benefit of doubling the register count with x86-64 in a number of cases.
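To make that concrete, here are the raw encodings (standard x86/x86-64 encodings, shown as C byte arrays purely for illustration):

    /* add eax, ebx -- 2 bytes, same in 32-bit and 64-bit mode */
    unsigned char add32[] = { 0x01, 0xD8 };
    /* add rax, rbx -- the REX.W prefix (0x48) makes it 3 bytes */
    unsigned char add64[] = { 0x48, 0x01, 0xD8 };

And since touching r8-r15 at all requires a REX prefix, code that actually uses the extra registers pays the byte tax on nearly every instruction.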
You surely mean patents (not copyright). That aside, I would imagine that Intel would be pretty aggressive about getting patents on the new vector instructions. Nobody without a patent deal with Intel will be able to make a binary-compatible Larrabee.
SSE instructions themselves cannot be patented. If they could, then Transmeta's software emulation would have been illegal.
The hardware or microcode implementation of those instructions is covered by copyright, possibly also by circuit-topography (mask work) protection. That protection is only good for a decade, so any x86 from Intel over 10 years old can be freely copied by anybody.
The various cross-licensing agreements for x86 allow the few manufacturers out there to freely share the copyright and ignore patent issues on particular implementation details of any designs that are more recent.
If Larrabee's non-vector ISA is the same as the 10+ year old x86, that part can be freely implemented on any competing design. Larrabee's vector extensions would still be exclusive, but if they are truly GPU-like, it would be a nearly 1 to 1 mapping to GPU functionality anyway.
In particular, AMD could map this at will to some future GPGPU, if it wished.
The x86 segment can reside in the command processor section of the GPU.
In the event there is a 4-chip implementation, there would be at least 4 x86-equivalent core segments tied to full GPU SIMD arrays.
The total count of x86 cores would be lower, but in the case of consumer graphics, it is much more likely that the GPU segment is going to be exercised more heavily.
How about this for a bold statement: I predict that NVIDIA will eventually be forced to re-invent itself as a software-only company working on middleware and graphics engines for game developers. It will fail at that and follow SGI into oblivion.
Considering Nvidia doesn't count either of those fields as a main business, that is at the very least a bold statement.
Those extra transistors in the L2 cache won't burn very much power at all. In Intel's 45nm, transistors don't leak that much and not that many transistors in the cache are active (switching) in any given cycle. However, if you took that area and turned it into some fixed-function pipeline, those transistors would be switching like mad each cycle, consuming much more of your power budget than cache SRAM.
But if the extra cache is not necessary, then those transistors don't do anything useful either.
If a judicious amount of specialized hardware can significantly accelerate some portion of the workload, that means the chip can complete that phase more quickly and then idle the units.
For dynamic power, it isn't the number of transistors. It is the number of times a transistor switches. That is why cache memory is reasonably power efficient. You can increase the size of a second-level cache without it using that much more of your power budget.
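For reference, the standard first-order model makes the point explicit:

    P_{dyn} = \alpha \, C \, V^{2} \, f

where \alpha is the activity factor (the fraction of the capacitance that actually switches in a cycle), C is the total switched capacitance, V the supply voltage, and f the clock frequency. A bigger L2 adds a lot of C but almost nothing to \alpha, since only the accessed subarray toggles in any given cycle.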
That said, you could argue that fixed-function hardware could do the same computation with fewer transistors switching (as compared to Larrabee's vector units). That is likely true for some operations. How much, I really don't know.
This is *not* how cache coherence works. It is much more efficient than that. First, TLBs are totally unaffected by cache coherence.
Entries evicted from the TLB migrate into the caches as data. Once it is in cache, it is subject to the same coherency issues as any other cache line.
AMD just got screwed by this.
I should have been clearer and stated that it would need to update TLB entries: the hardware sets bits recording whether a page has been accessed or written to, and those updates would necessitate invalidating stale copies of the entries hiding out in other caches.
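For anyone following along, these are the architecturally defined x86 page-table-entry bits in question (shown here as C constants):

    #define PTE_PRESENT  (1ULL << 0)  /* page is mapped                  */
    #define PTE_WRITABLE (1ULL << 1)  /* page may be written             */
    #define PTE_ACCESSED (1ULL << 5)  /* set by hardware on first access */
    #define PTE_DIRTY    (1ULL << 6)  /* set by hardware on first write  */

Because the hardware itself flips the Accessed and Dirty bits, every such update is a store to the in-memory page tables, and any stale copies of that entry have to be dealt with.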
Second, the cache coherence protocol doesn't update all the other caches in the system. The coherence protocol just enforces the invariant that a given 64B block of data is either (1) read-only in one or more caches or (2) read/write in a single cache. It does this by invalidating any other copies of the block before a processor is allowed to write its copy. No data is transferred until another processor reads the block (which just becomes a normal miss).
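A minimal sketch of that invariant, as a write-invalidate protocol would maintain it (MSI-style states; names are illustrative, not Intel's actual implementation):

    #include <stddef.h>

    enum line_state { INVALID, SHARED, MODIFIED };

    /* Before a core may write a 64B block, all other copies are
       invalidated -- a small message, not a data transfer.  Data
       moves only when another core later misses on the block.   */
    void obtain_write_permission(enum line_state copy[], size_t ncaches,
                                 size_t writer)
    {
        for (size_t i = 0; i < ncaches; i++)
            if (i != writer)
                copy[i] = INVALID;   /* invalidation message  */
        copy[writer] = MODIFIED;     /* sole read/write copy  */
    }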
That's broadcast cache coherence. You can't invalidate all possible copies of a line without sending a message to that effect over the ring bus to every cache that might have a copy.
For the read case in the absence of some overarching directory, any cache misses must snoop all 24 other caches to make sure the line isn't already there. That injects unpredictable latency to the operation and it burns bandwidth.
Thank goodness there's the massive ring bus, though one would hope it has better things to do with its time than pass dozens of coherency packets. At a minimum, each message must contain 4 bytes for a 32-bit address, plus however many bytes make up the rest of the packet.
It's almost double that for 64-bit mode, and we don't know if Larrabee is capable of 1 or 2 memory operations a cycle.
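To put rough numbers on it, a coherency message might look something like this; the layout is purely illustrative, since nobody outside Intel knows the real packet format:

    #include <stdint.h>

    struct snoop_msg {
        uint8_t  type;    /* invalidate, read-shared, ack, ... */
        uint8_t  src;     /* originating ring stop             */
        uint16_t flags;
        uint32_t addr;    /* 4 bytes for a 32-bit address; a   */
                          /* 64-bit address doubles this field */
    };

Call it roughly 8 bytes, times up to 24 snoops per miss in a naive broadcast.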
This is how cache coherence works in Intel's existing multi-core systems (in fact, this is basically how all cache-coherent multiprocessor systems from AMD, Sun, IBM, HP, etc. work).
One need only look at K8's lousy scaling beyond 4 sockets, or at how Cell's non-coherent scheme allows it to excel in select workloads, to see the downside. There is a bandwidth cost, which the ring bus should handle in non-pathological cases. There is also a latency cost, which is not so easily handled unless the ring bus has very low latency, even under load.
There is no need to "turn it off". Once you've burned the design complexity to implement it, it really is just a win and doesn't get in the way. If all processors are working on private data, it has basically zero impact. Only when true sharing of data happens does it kick in.
No single core knows for certain the status of all other cache lines. There must be some amount of coherency traffic involved.
If there is a lot, it is possible that wildly broadcasting on the ring bus can cause individual ring stops to saturate, which forces delays down each segment.
Why would it need to pass thread contexts? A GPU might, but Larrabee won't. I suspect that most of the algorithms in Larrabee will be work-list type algorithms.
Why would it? Because x86 can do it right now. Threads get thrown every which way in today's systems at the whims of some software scheduler.
Larrabee supports x86 system behavior, so it can't ignore that portion of the specification.
In the simplest implementation, each thread just pulls off the next task from the queue, does the task, repeat. In more sophisticated implementations, the work list is hierarchical (one per core, plus a global queue that supports work stealing). This allows for great load balance. Plus, the task information probably fits in a single 64B cache block (a few pointers, maybe a few index or loop counts).
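A sketch of the simplest case, with a hypothetical 64-byte task descriptor and a trivial shared queue (a real implementation would use per-core deques with work stealing and atomic operations):

    #include <stddef.h>
    #include <stdint.h>

    /* Hypothetical task descriptor sized to one 64B cache block. */
    struct task {
        void   (*run)(void *);   /* the work to perform           */
        void    *data;           /* pointers to input/output      */
        uint32_t start, end;     /* e.g. a range of tiles/indices */
        uint8_t  pad[40];        /* pad to 64 bytes (on x86-64)   */
    };

    struct queue { struct task *tasks; size_t head, tail; };

    static struct task *queue_pop(struct queue *q)
    {
        /* would be an atomic fetch-and-add in real code */
        return q->head == q->tail ? NULL : &q->tasks[q->head++];
    }

    /* Each hardware thread just loops: pull a task, run it, repeat. */
    static void worker(struct queue *q)
    {
        struct task *t;
        while ((t = queue_pop(q)) != NULL)
            t->run(t->data);
    }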
That's closer to a GPU than a full x86 system context.
Larrabee must function as both a GPU and a system CPU. The system half demands a far heftier amount of context and protection, which leads to a non-trivial amount of additional cost.
Having a big, fast, shared way for the components of the chip to talk to each other sounds like a *good* thing. It is optimized for 64B block transfers, making it easy to move data between caches or to and from memory. I don't see a problem here. Remember, cache hits don't touch the ring, so assuming some locality of reference, this should be plenty of bandwidth.
It is plenty of bandwidth. The question is whether it could have been engineered to be smaller and save a lot of fast-switching transistors if it didn't have an outsize worst-case data load.
The more we chat back and forth (which I'm really enjoying, BTW), the more I'm coming to realize how radical a departure Larrabee is from what is currently done in GPUs. Like I said earlier, my background is on the general-purpose multi-core side of things. I'm only beginning to realize how GPUs have forced game developers and such to think in one specific way about computation. I really do think that something like Larrabee is going to spur innovation as those constraints are lifted.
That would be the APIs. The hardware is significantly more flexible.