It will just read the old data without any extra latency. Like practically every other multi-core CPU, x86 doesn't guarantee instant visibility of remote writes: you're only guaranteed to see data written by another core after some amount of time, but you will see those writes in the same order as they were issued.
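To make that concrete, here's a minimal C11 sketch (the variable names and the producer/consumer framing are mine, not from any Larrabee material): the consumer may spin for an unbounded time before it sees the flag, but once it does, the ordering guarantee means it also sees the data.

#include <stdatomic.h>
#include <pthread.h>
#include <stdio.h>

int payload;                 /* ordinary data                  */
atomic_int ready;            /* flag published after the data  */

void *producer(void *arg) {
    (void)arg;
    payload = 42;                                            /* write 1 */
    atomic_store_explicit(&ready, 1, memory_order_release);  /* write 2 */
    return NULL;
}

void *consumer(void *arg) {
    (void)arg;
    /* May spin for a while -- there's no bound on how quickly the other
     * core's writes become visible -- but once ready == 1 is observed,
     * the writes are seen in issue order, so payload is already 42. */
    while (!atomic_load_explicit(&ready, memory_order_acquire))
        ;
    printf("saw payload = %d\n", payload);   /* never stale data */
    return NULL;
}

int main(void) {
    pthread_t p, c;
    pthread_create(&c, NULL, consumer, NULL);
    pthread_create(&p, NULL, producer, NULL);
    pthread_join(p, NULL);
    pthread_join(c, NULL);
    return 0;
}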
The question isn't whether a given core is updated within a stretch of time, but whether coherency traffic is ever generated. Coherency packets are of non-zero length and depending on the granularity of the interconnect, even moderate coherency traffic can take a disproportionate amount of the interconnect's transactions/cycle.
All true. As long as the code working set fits in the 512KB level-two cache, code size hopefully shouldn't be too much of a problem. With four threads to hide instruction cache misses, this doesn't seem like too big a deal.
That would depend on the implementation of its threading. I haven't seen any public source indicating just how Intel has implemented it.
SMT could plausibly get around an instruction cache miss. FMT would not.
Sure, if the TLB mapping changes, all the processors need to know about it. But why would these mappings change frequently? You need to get this right for correctness, but it isn't a performance issue.
Correctness is the greatest enemy of performance.
More to the point, Larrabee's being positioned as either an accelerator or an independent processor means it must maintain correctness to a degree GPUs currently do not.
AMD's TLB erratum showed that a single x86 core's misbehavior can lead to a system-crashing machine check error.
Who knows what kind of crap GPUs get away with on a regular basis, if only because there's a fair amount of undefined or "you're on your own" parts of their specifications.
Possibly, Larrabee in GPU mode could turn off a lot of this error checking, but then it has a fair amount of hardware that is simply never used for that workload.
You're 100% right...
...so that is why Larrabee *does* use a full-map directory cache coherence protocol. On an L2 cache miss, the request accesses an on-chip directory (co-located with the on-chip memory controller) to see if the block is cached elsewhere on the chip. If the directory reports it is not cached elsewhere, no further messages are sent. If it is cached elsewhere, only those caches with the block are contacted. The directory is banked several ways to prevent it from becoming a bottleneck.
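To sketch what that looks like in C-ish form: this is the textbook full-map scheme, with structure names and sizes that are my own illustration, not anything from Intel documentation.

#include <stdint.h>
#include <stdbool.h>

#define NUM_CORES 32

/* One full-map directory entry per memory block: a presence bit for
 * each core's L2, plus a dirty bit if one core holds the block modified. */
struct dir_entry {
    uint32_t sharers;   /* bit i set => core i's L2 may hold the block */
    bool     dirty;
};

/* Invoked when a core misses in its L2.  The request goes to whichever
 * directory bank owns this address (co-located with a memory controller). */
void handle_l2_miss(struct dir_entry *e, int requester, bool is_write)
{
    uint32_t others = e->sharers & ~(1u << requester);

    if (others == 0) {
        /* No other cache holds the block: satisfy from memory and send
         * no coherence messages on the ring at all. */
    } else if (is_write) {
        /* Invalidate only the cores whose presence bits are set --
         * never a broadcast to every L2 on the chip. */
        for (int c = 0; c < NUM_CORES; c++)
            if (others & (1u << c))
                ; /* send_invalidate(c) would go here */
        e->sharers = 1u << requester;
        e->dirty = true;
        return;
    } else if (e->dirty) {
        /* Exactly one core holds modified data: forward the read to it
         * alone and mark the block shared again. */
        e->dirty = false;
    }
    e->sharers |= 1u << requester;
}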
This makes sense, but I haven't seen any statements to this effect.
Can you source this?
That is a very good question. I assume they did the math and decided they needed that much bandwidth. Another reason is that they might have over-designed it for this generation with the anticipation of going to 64 or 128 cores in the next generation (without needing to reengineer the on-chip ring part of the chip). They will need to do something for the 32nm shrink...
Discussions on R600's internal ring bus indicate there are other reasons for its massive size. A ring bus scales linearly in hardware cost for the number of clients, but it also introduces a significant dynamic component to latency and system behavior.
Pathological cases and heavy loading seriously degrade the effectiveness of the ring bus, which is why they tend to be significantly larger than they need to be in the general case.
If Intel wants to scale to 64 or 128 cores, I wonder if it would do something like increase the bus width further, or maybe divide things up so they aren't all on a single ring.
This is just going to be a repeat of how Intel's chips killed off most of the workstation vendors' low-volume, high-cost (but fast) chips (HP's PA-RISC, DEC's Alpha, Sun's SPARC, SGI's MIPS, etc.). Intel was able to narrow the performance gap, and once Intel's chips were within a factor of 3x in performance at 1/10th the cost, the high-end workstation market started to die.
x86 wasn't the sole reason for this success.
It was cheap x86 paired with specialized 3D cards that allowed x86 to even pretend to matter in the 3D workstation market. That's what butchered a multi-billion dollar market the RISCs (and Itanium, early on) were counting on.
In 1995 SGI's market cap was $7 billion (maybe $10 billion in real dollars). Just ten years later SGI was de-listed from the stock market and declared bankruptcy. Of course, proof by analogy is a dangerous game. NVIDIA just reminds me of SGI in the mid-1990s. (BTW, who were the early engineers behind NVIDIA and ATI? Engineers jumping ship from SGI...).
On a side note, one of the final straws that broke the camel's back was the fact that SGI was counting on the timely release of Itanium-based products as it transitioned away from its older lines.
Intel's past missteps in delivering a new, less-established architecture on time and with the promised features and performance are not likely to be forgotten in the consumer graphics market, or anywhere else for that matter.
GPUs have been wildly successful in that they have finally cracked the parallelism nut. Parallel computing has long been the holy grail of computer design, and the combination of highly-parallel graphics algorithms with custom GPU hardware finally broke the chicken-and-egg cycle of "no high-volume parallel hardware" and "no high-value parallel computations". As a computer architect, I find that very exciting.
I'm not sure it's that revolutionary; rather, the market finally grew a segment for a workload that was embarrassingly parallel, lax in correctness, and cost-conscious.
Here's another big claim. Once GPU/CPUs just become more generic MIMD/SIMD cores, what Microsoft does with DirectX just won't matter as much. You want some new feature, just buy a cross-platform library from a company that does it. No need to wait for Microsoft to bless it. This is why Intel's acquisition of Havok is so interesting...
This has been addressed already, but market forces and development concerns will resist this.
As Larrabee is just general-purpose x86 cores, the hardware will support whatever version of DirectX comes along. As such, I'm not sure the above quote makes any sense for Larrabee. Intel can (and will) just update the software portion of its drivers or whatever and that will be that. Sure, you could imagine that some of the features of DX11 might be more efficient in later generations of Larrabee hardware, but it would likely run pretty well on the first generation of Larrabee, too. If Microsoft comes out with some new physics acceleration standard (as they have been rumored to do), Intel will just write the software and Larrabee will likely do just fine.
That was the same argument Transmeta used for its processors. Just emulating x86 cost Transmeta half of a native implementation's performance.
Features that are implemented at a significant performance deficit simply do not get used, as a number of feature-rich and "future-proof" GPUs have discovered.
GPUs already devote some amount of specialized resources to emulating the graphics pipeline state machine at speed.
You're counting on Larrabee to emulate specialized hardware emulating a graphics pipeline state machine at speed.
It's possible, but can it be done acceptably?
One of the reasons IGPs are slow is they use the system's main DDR (or whatever) DRAM, which has much less bandwidth than GDDRx memory. The GDDR family of memory has really high bandwidth, but it is more expensive per bit. Near-term GPUs will have, what, 50GB/second or 100GB/second (or more) of memory bandwidth?
Last year's R600 had 105 GB/sec. It wasn't used to all that great effect, but it was there.
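For the curious, the arithmetic behind those figures is just bus width times effective transfer rate; the specific numbers below are my ballpark assumptions, not quotes from any datasheet.

#include <stdio.h>

/* bytes/sec = (bus width in bits / 8) * effective transfer rate */
static double bw_gb(double bus_bits, double mt_per_sec) {
    return bus_bits / 8.0 * mt_per_sec * 1e6 / 1e9;
}

int main(void) {
    /* Dual-channel DDR2-800 system memory: 128-bit bus, 800 MT/s. */
    printf("IGP via system DDR2: %.1f GB/s\n", bw_gb(128, 800));
    /* R600-class board: 512-bit GDDR at roughly 1650 MT/s. */
    printf("R600-class GDDR:     %.1f GB/s\n", bw_gb(512, 1650));
    return 0;
}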
Oh, there is one more option. Intel has been playing around with die stacking.
...
Consider a future system with four layers of DRAM on top of a CPU.
...
My first thought is that it would be a thermal nightmare. It seems possible to have a layer of low-power DRAM or SRAM underneath the die, since the active logic layers are going to have to be directly adjacent to the cooling solution. The depth of the memory stack in this case is still restricted by yields and the measurable amount of heat even SRAM and DRAM can generate, particularly if under a 150W core.
Even that brings up bus signaling, packaging, and mechanical concerns that haven't (to my knowledge) been sorted out just yet.
How will TSMC, IBM, and AMD's 45nm processes compare to Intel's? I'm actually not sure. What I do know is that Intel finally has its ducks in a row (good process technology, 64-bit x86, a return to reasonable micro-architectures, good multi-cores, advanced system interconnects, and on-chip memory controllers). Intel has basically fixed all of the blunders that allowed AMD to catch up (especially 64-bit x86 and on-chip memory controllers).
Most likely inferior from a circuit performance perspective, according to some recent comparisons. What does it mean for designs that don't require bleeding-edge circuit performance, however?
Interesting. I've never before known that masks have special rules for them. Are there any rules that say Intel must give anyone their masks? Can't they just keep them fully in house?
It's not clear to me that AMD has much use for the masks themselves, as opposed to what they implement. It's not like AMD can use Intel's masks anyway.
In addition, anything that is patented is still under patent, right? Just because the specific mask loses protection, does that invalidate the patent?
The patented item itself, most likely not.
But both Intel and AMD have agreed they won't be going after each other if their x86 implementations happen to infringe somewhere.
AMD and Intel have an IP cross-licensing agreement. Yet AMD is behind Intel right now? Why? Well, it isn't for intellectual property reasons.
Maybe a little. Even if AMD weren't resource-limited and were able to execute on its design and manufacturing goals, there would still be a delay between Intel releasing the specifications of its extensions and when they could be integrated into a design.
Larrabee doesn't have a SW-managed cache. Whether this is good or not is a matter of opinion. Given careful coding, a SW-managed cache will probably perform better, but a HW-managed cache is much more forgiving of code that doesn't put in that effort.
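A toy illustration of the difference (the function names and tile size are made up): the hardware-managed version just touches memory and lets the cache decide what stays resident, while the software-managed style stages data through a buffer the programmer controls, the way a local store would be used.

#define TILE 256

/* Hardware-managed cache: index the big array and let the cache sort
 * it out.  Forgiving, but you pay tag checks and you can't stop other
 * threads' traffic from evicting your working set. */
float sum_hw(const float *big, int n) {
    float s = 0.0f;
    for (int i = 0; i < n; i++)
        s += big[i];
    return s;
}

/* Software-managed style: copy a tile into a buffer you control (a
 * local store, or a locked-down cache region), work out of it, then
 * move on.  More code, but the working set is explicit. */
float sum_sw(const float *big, int n) {
    float tile[TILE];
    float s = 0.0f;
    for (int base = 0; base < n; base += TILE) {
        int len = (n - base < TILE) ? n - base : TILE;
        for (int i = 0; i < len; i++)   /* explicit copy-in */
            tile[i] = big[base + i];
        for (int i = 0; i < len; i++)
            s += tile[i];
    }
    return s;
}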
It transparently expends resources the programmer can't see but that still impact performance.
There are a number of variables I don't know about concerning Intel's ring bus design.
What is its minimum granularity, for example?
The old slides put it at 256 bytes a cycle; it sounds like that would most likely be two 128-byte rings going in opposite directions (otherwise they could have claimed 512 B/cycle).
Is there a side-band bus that contains the packet information, or does packet identifier data take up some of the bandwidth?
This is a non-trivial issue, because it is the difference between a ring bus that can pass four cache lines per cycle and one that can only pass two (some rough numbers below).
What about coherency traffic?
That is also non-trivial, since even moderate traffic can cause increasing amounts of utilization to be lost, depending on the granularity of the bus transfers.
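Putting some numbers on the header question above -- these are purely my own assumptions, not anything from Intel's slides:

#include <stdio.h>

int main(void) {
    const int ring_bytes_per_cycle = 256;  /* two 128-byte rings (assumed) */
    const int line = 64;                   /* cache line size              */

    /* Side-band headers: all 256 B/cycle carries payload. */
    printf("side-band headers: %d lines/cycle\n",
           ring_bytes_per_cycle / line);

    /* In-band headers: suppose each 128-byte ring slot carries one
     * 64-byte line plus routing/coherence overhead.  Half the raw
     * bandwidth is gone. */
    printf("in-band headers:   %d lines/cycle\n",
           ring_bytes_per_cycle / (2 * line));
    return 0;
}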
Larrabee's cache applies more generally, though, so using more temporary storage than you have HW registers doesn't hurt as much as on G80; spilling to memory is much cheaper.
I'm unsure about G80, but the thread on R600's design indicates something like this is already done.
Sure, but if Intel can get, say, a 10% higher clock at the same power consumption, they can shrink the die by roughly 10% at the same overall throughput. Even if Intel's process is more expensive, if it is better in terms of power or frequency, Intel might be able to make a chip with the same performance more cheaply than TSMC could.
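A back-of-the-envelope version of that trade-off, with made-up numbers for the clock gain and wafer-cost premium:

#include <stdio.h>

int main(void) {
    /* Throughput ~ cores * clock; die area ~ cores.  All numbers are
     * illustrative, not actual Larrabee, Intel, or TSMC figures. */
    double clock_gain = 1.10;              /* 10% higher clock on Intel's process */
    double area_ratio = 1.0 / clock_gain;  /* cores needed for same throughput    */
    double intel_cost_per_mm2 = 1.05;      /* assume Intel wafers cost 5% more,
                                              with TSMC normalized to 1.0         */

    printf("relative die area: %.2f\n", area_ratio);
    printf("relative die cost: %.2f\n", area_ratio * intel_cost_per_mm2);
    /* ~0.95: the smaller die more than pays for the pricier wafer.
     * This ignores yield effects, which would favor the smaller die
     * even further. */
    return 0;
}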
The calculation would be more complicated at the modest clock ceiling of both Larrabee and GPUs.
Dialing back on clock speeds and voltages tends to avoid that really steep power consumption curve at the topmost speed bins.
The upshot is that the lower regions of the power curve for different products aren't as far apart between products and processes.
Slides indicate Larrabee's TDP is north of 150 W, so it most likely will not be sipping power either.
So one thing I forgot to mention. I agree that TSMC's wafers are inexpensive, but the quality of the process is also much lower.
One of my friends (who has designed for a wide variety of foundry and internal processes) once said:
"TSMC hopes that their 65nm process will be as fast as AMD's 90nm process"
So did AMD. Zing!!
(I know Arun covered that one, but I couldn't resist)
I can easily believe this since TSMC does ASICs, not MPUs. Now, given that Intel's process is always faster than AMD's, let's just say that in terms of speed TSMC is generally one generation behind Intel's nominal, i.e. TSMC@65 = Intel@90 for most generations. I'd further guess that Intel's 45nm is more than one generation ahead of TSMC's 45nm, by virtue of the new gate stack.
It's more than one generation ahead if one needs the upper reaches of the speed curve.
Larrabee's not pushing the envelope when it comes to raw clock speeds, and there are elements in G80 and G92 that nip at the lower reaches of Larrabee's clock range.
RE: Software versus hardware managed caches:
SW managed is going to be lower power, since it's effectively behaving like a second-level set of registers. You only access the particular cells you need, and you skip the tag check. It could also be faster, since there's no TLB or tag check on the critical path.
DK
That, and if the software or the designer knows it can get away with minimal coherency, coherency can just be turned off for that portion of execution.
But the advantage of hardware-managed caches is that they also support a much more dynamic style of caching. Such dynamic caching works pretty well in CPUs and provides a model in which you don't need to worry about software-managed caching and the explicit copy-in/copy-out operations. Plus, you can do cache coherence and really fine-grained synchronization using shared memory locations.
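As a concrete example of that last point (plain C11 atomics, nothing Larrabee-specific): coherent shared memory lets you build a fine-grained lock out of a single shared word, with the coherence protocol doing the heavy lifting.

#include <stdatomic.h>

/* One tiny lock per bucket/tile/whatever.  The hardware coherence
 * protocol is what makes hammering this single shared word from many
 * cores behave correctly: the cache line just migrates as needed. */
static atomic_flag bucket_lock = ATOMIC_FLAG_INIT;

static void lock_bucket(void) {
    /* Spin until the flag was previously clear; acquire ordering keeps
     * the critical section's accesses from moving above the lock. */
    while (atomic_flag_test_and_set_explicit(&bucket_lock,
                                             memory_order_acquire))
        ;
}

static void unlock_bucket(void) {
    atomic_flag_clear_explicit(&bucket_lock, memory_order_release);
}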
I don't want to come out as if I don't value hardware cache coherency, but there are instances where I believe the ability to simply turn it off is much more valuable in a many-core system.
You can skip striding through the cache (then praying one of the other threads doesn't screw it up) and the core doesn't get in the way of its large number of neighbors.
Creatively combining cache locking and software control can allow Larrabee to emulate a local store or private buffer and keep traffic off of the ring bus and directory.
There are known cases where coherency can be minimal to non-existent, and a lot of graphics work apparently gets along fine without it.
Ok, here is a bombshell tidbit for you all. Rumor has it that Larrabee supports speculative locking. (!!!)
Speculative locking (also known as speculative lock elision, or SLE) is like a limited form of transactional memory (which was mentioned in some posts earlier in this thread) applied to locking. Instead of actually acquiring a lock, the hardware just speculates that it can complete the critical section without interference. It checkpoints its register state and starts to execute the lock-based critical section. If no conflicts occur, it just commits its writes atomically. In this non-conflict case, the processor just completed a critical section *without ever acquiring the lock*! Conflicts are detected by using cache coherence to determine when some other thread tries to access the same data (and at least one of the accesses is a write; read-read is not a conflict). On a conflict, a rollback occurs by restoring the register checkpoint and invalidating blocks in the cache that were written while speculating.
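In pseudo-C the flow looks something like the sketch below. Every primitive here is invented for illustration -- none of them are documented Larrabee instructions, and this is just the generic SLE idea, not a claim about how Intel actually implements it.

#include <stdbool.h>

/* Hypothetical primitives -- stand-ins for whatever the hardware might
 * expose internally; none of these are real, documented instructions. */
void checkpoint_registers(void);
void restore_checkpoint(void);
bool conflict_detected(void);          /* another core touched our data      */
void commit_speculative_writes(void);  /* publish buffered writes atomically */
void discard_speculative_writes(void);
void do_critical_section(void);
void acquire_lock(volatile int *lock);
void release_lock(volatile int *lock);

void critical_section_with_sle(volatile int *lock)
{
    if (*lock == 0) {                      /* lock looks free: elide it    */
        checkpoint_registers();            /* save register state          */
        do_critical_section();             /* writes stay buffered in the  */
                                           /* cache, not yet visible       */
        if (!conflict_detected()) {
            commit_speculative_writes();   /* done -- lock never acquired  */
            return;
        }
        discard_speculative_writes();      /* conflict: throw away writes  */
        restore_checkpoint();              /* roll back register state     */
    }
    acquire_lock(lock);                    /* fall back to the real lock   */
    do_critical_section();
    release_lock(lock);
}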
Does this mean that Larrabee's cores can suppress coherency when speculatively writing to cache?
Writes made in this speculative mode could otherwise lead to spurious invalidations or incorrectly forwarded values.
In other words, can Larrabee's cores turn their coherency off?
Will the Larrabee processor run Windows, or will it just be a video card?
Don't know about any modern Windows, but it has been posited as being capable of being either an accelerator or a system processor.