Larrabee: Samples in Late 08, Products in 2H09/1H10

Lossy z-buffer compression could be used for early optimistic rough granularity z-culling only ... for example hierarchical z is basically a lossy z compression (keeping the min or max of the lower 4 pix per level depending on the depth compare mode).
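To make that concrete, here is a minimal C sketch of building one Hi-Z level for a LESS depth test (pass if incoming z < stored z), where each entry keeps the *max* (farthest) of its 2x2 source quad; with a GREATER test you would keep the min. Names and layout are purely illustrative, not anything Larrabee- or vendor-specific.

Code:
#include <stddef.h>

/* Keep the farthest of four depth values (LESS depth test). */
static float max4(float a, float b, float c, float d)
{
    float m = a;
    if (b > m) m = b;
    if (c > m) m = c;
    if (d > m) m = d;
    return m;
}

/* src is w x h (w, h even); dst is (w/2) x (h/2). */
void build_hiz_level(const float *src, float *dst, size_t w, size_t h)
{
    for (size_t y = 0; y < h; y += 2)
        for (size_t x = 0; x < w; x += 2)
            dst[(y / 2) * (w / 2) + x / 2] =
                max4(src[y * w + x],       src[y * w + x + 1],
                     src[(y + 1) * w + x], src[(y + 1) * w + x + 1]);
}

A tile can then conservatively reject a fragment whose z is >= the stored max for that tile; the per-pixel values cannot be recovered from that summary, which is the "lossy" part.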
AFAIK real hw does much more exotic stuff when it comes to compressing Hi-Z data :)
 
Lossy z-buffer compression could be used for early optimistic rough granularity z-culling only ... for example hierarchical z is basically a lossy z compression (keeping the min or max of the lower 4 pix per level depending on the depth compare mode).

That's not really lossy compression, it's just a z-test acceleration structure and it is fully* orthogonal to z-buffer compression—you can have one without the other. Lossy z-buffer compression would mean unrecoverable loss of z-buffer data.

* I think some schemes might use the Hi-Z data as part of the Z-buffer compression (still lossless), but it's not mandatory.

Heh net is too slow here, wanted to point out what nAo was gonna say but I was too late.
 
That's not really lossy compression, it's just a z-test acceleration structure and it is fully orthogonal to z-buffer compression—you can have one without the other. Lossy z-buffer compression would mean unrecoverable loss of z-buffer data.
I know a couple of (Hi-Z) implementations that wouldn't work properly without a full resolution uncompressed z-buffer.
 
It will just read the old data without any extra latency. x86 has relaxed cache coherency, like practically every other multi-core CPU. This means that you're only guaranteed to see data written by another core after a certain amount of time, and that you will see the writes in the same order as they were issued.
The question isn't whether a given core is updated within a stretch of time, but whether coherency traffic is ever generated. Coherency packets are of non-zero length and depending on the granularity of the interconnect, even moderate coherency traffic can take a disproportionate amount of the interconnect's transactions/cycle.

All true. As long as the code working set fits in the 512KB level-two cache, hopefully code size shouldn't be too much of a problem. With four threads to hide instruction cache misses, this doesn't seem like too big a deal.
That would depend on the implementation of its threading. I haven't seen any public source indicating just how Intel's implemented threading.
SMT could plausibly get around an instruction cache miss. FMT would not.

Sure, if the TLB mapping changes, all the processors need to know about it. But why would these mappings change frequently? You need to get this right for correctness, but it isn't a performance issue.
Correctness is the greatest enemy of performance. ;)
More to the point, Larrabee's being shown as either an accelerator or as an independent processor means it must have functionality to maintain correctness to a degree GPUs currently do not.

AMD's TLB issue means a single x86 core's bad operation leads to a system-crashing machine check error.

Who knows what kind of crap GPUs get away with on a regular basis, if only because there's a fair amount of undefined or "you're on your own" parts of their specifications.
Possibly, Larrabee in GPU mode could turn off a lot of this error checking, but then it has a fair amount of hardware that is simply never used for that workload.

You're 100% right...

...so that is why Larrabee *does* use a full-map directory cache coherence protocol. On a L2 cache miss, the request accesses an on-chip directory (co-located with the on-chip memory controller) to see if it is cached elsewhere on the chip. If the directory reports it is not cached elsewhere, no further messages are sent. If it is cached elsewhere, only those caches with the block are contacted. The directory is banked several ways to prevent it from becoming a bottleneck.
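To illustrate the kind of full-map scheme being described (a sketch of the general technique only, not a claim about Larrabee's actual implementation; the names and the 32-core assumption are mine), each directory entry tracks a state plus a presence bit per core, and only the flagged caches ever see a message:

Code:
#include <stdint.h>

enum dir_state { DIR_UNCACHED, DIR_SHARED, DIR_MODIFIED };

struct dir_entry {
    enum dir_state state;
    uint32_t       sharers;   /* bit i set => core i's cache may hold the line */
};

/* On an L2 miss from core `requester`, return the mask of caches that must be
 * probed or invalidated. If the block is cached nowhere else, no coherence
 * messages are sent at all -- the point made above. */
uint32_t caches_to_probe(const struct dir_entry *e, int requester, int is_write)
{
    uint32_t others = e->sharers & ~(1u << requester);
    if (e->state == DIR_UNCACHED || others == 0)
        return 0;                      /* memory suffices, no messages        */
    if (e->state == DIR_MODIFIED)
        return others;                 /* fetch/writeback from the one owner  */
    return is_write ? others : 0;      /* shared: invalidate only on a write  */
}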
This makes sense, but I haven't seen any statements to this effect.
Can you source this?

That is a very good question. I assume they did the math and decided they needed that much bandwidth. Another reason is that they might have over-designed it for this generation with the anticipation of going to 64 or 128 cores in the next generation (without needing to reengineer the on-chip ring part of the chip). They will need to do something for the 32nm shrink... :)

Discussions on R600's internal ring bus indicate there are other reasons for its massive size. A ring bus scales linearly in hardware cost for the number of clients, but it also introduces a significant dynamic component to latency and system behavior.
Pathological cases and heavy loading seriously degrade the effectiveness of the ring bus, which is why they tend to be significantly larger than they need to be in the general case.
If Intel wants to scale to 64 or 128 cores, I wonder if it would do something like increase the bus width further, or maybe divide things up so they aren't all on a single ring.

This is just going to be a repeat of how Intel's chips killed most of the workstation vendors' low-volume, high-cost (but fast) chips (HP's PA-RISC, DEC's Alpha, Sun's SPARC, SGI's MIPS, etc.). Intel was able to narrow the performance gap, and once Intel's chips were within a factor of 3x in performance but 1/10th the cost, the high-end workstation market started to die.
x86 wasn't the sole reason for this success.

It was cheap x86 paired with specialized 3D cards that allowed x86 to even pretend to matter in the 3D workstation market. That's what butchered a multi-billion dollar market the RISCs (and Itanium, early on) were counting on.

In 1995 SGI's market cap was $7 billion (maybe $10 billion in real dollars). Just ten years later SGI was de-listed from the stock market and declared bankruptcy. Of course, proof by analogy is a dangerous game. NVIDIA just reminds me of SGI in the mid-1990s. (BTW, who were the early engineers behind NVIDIA and ATI? Engineers jumping ship from SGI...).
On a side note, one of the final straws that broke the camel's back was the fact SGI was counting on the timely release of Itanium-based products as it transitioned away from its older lines.
Intel's missteps in timeliness in delivering a new and less-established architecture with the features and performance promised are not likely to be forgotten in the consumer graphics market, or anywhere else for that matter.

GPUs have been wildly successful in that they have finally cracked the parallelism nut. Parallel computing has been the holy grail of computer design. The combination of highly-parallel graphics algorithms with custom GPU hardware really broke the chicken-and-egg cycle of "no high-volume parallel hardware" and "no high-value parallel computations". Now GPUs have broken that cycle. As a computer architect, that is very exciting.
I'm not sure it's that revolutionary; rather, the market finally grew a segment for a workload that was embarrassingly parallel, lax in correctness, and cost-conscious.

Here's another big claim. Once GPU/CPUs just become more generic MIMD/SIMD cores, what Microsoft does with DirectX just won't matter as much. You want some new feature, just buy a cross-platform library from a company that does it. No need to wait for Microsoft to bless it. This is why Intel's acquisition of Havok is so interesting...
This has been addressed already, but market forces and development concerns will resist this.

As Larrabee is just general-purpose x86 cores, the hardware will support whatever version of DirectX comes along. As such, I'm not sure the above quote makes any sense for Larrabee. Intel can (and will) just update the software portion of its drivers or whatever and that will be that. Sure, you could imagine that some of the features of DX11 might be more efficient in later generations of Larrabee hardware, but it would likely run pretty well on the first generation of Larrabee, too. If Microsoft comes out with some new physics acceleration standard (as they have been rumored to do), Intel will just write the software and Larrabee will likely do just fine.
That was the same argument Transmeta used for its processors. Just emulating x86 cost Transmeta half of a native implementation's performance.

Features that are implemented at a significant performance deficit simply do not get used, as a number of feature-rich and "future-proof" GPUs have discovered.

GPUs already devote some amount of specialized resources to emulating the graphics pipeline state machine at speed.

You're counting on Larrabee to emulate specialized hardware emulating a graphics pipeline state machine at speed.
It's possible, but can it be done acceptably?

One of the reasons IGPs are slow is they use the system's main DDR (or whatever) DRAM, which has much less bandwidth than GDDRx memory. The GDDR family of memory has really high bandwidth, but it is more expensive per bit. Near-term GPUs will have, what, 50GB/second or 100GB/second (or more) of memory bandwidth?
Last year's R600 has 105 GB/sec. It wasn't used to all that great effect, but it was there.

Oh, there is one more option. Intel has been playing around with die stacking.
...
Consider a future system with four layers of DRAM on top of a CPU.
...
My first thought is that it would be a thermal nightmare. It seems possible to have a layer of low-power DRAM or SRAM underneath the die, since the active logic layers are going to have to be directly adjacent to the cooling solution. The depth of the memory stack in this case is still restricted by yields and the measurable amount of heat even SRAM and DRAM can generate, particularly if under a 150W core.

Even that brings up bus signaling, packaging, and mechanical concerns that haven't (to my knowledge) been sorted out just yet.


How will TSMC, IBM, and AMD's 45nm process compare to Intel's? I'm actually not sure. What I do know is that Intel finally has its ducks in a row (good process technology, 64-bit x86, a return to reasonable micro-architectures, good multi-cores, advanced system interconnects, and on-chip memory controllers). All of Intel's blunders that allowed AMD to catch up (especially 64-bit x86 and on-chip memory controllers), they have basically fixed.
Most likely inferior from a circuit performance perspective, according to some recent comparisons. What does it mean for designs that don't require bleeding-edge circuit performance, however?

Interesting. I've never before known that masks have special rules for them. Are there any rules that say Intel must give anyone their masks? Can't they just keep them fully in house?
It's not clear to me that AMD has much use for the masks themselves, just for what they implement. It's not like AMD can use Intel's masks anyway.

In addition, anything that is patented is still under patent, right? Just because the specific mask loses protection, does that invalidate the patent?
The patented item itself, most likely not.
But both Intel and AMD have agreed they won't be going after each other if their x86 implementations happen to infringe somewhere.

AMD and Intel have an IP cross-licensing agreement. Yet AMD is behind Intel right now. Why? Well, it isn't for intellectual property reasons.
Maybe a little. Even if AMD weren't resource-limited and were able to execute its design and manufacturing goals, there would be a time delay between Intel's releasing the specifications of its extensions and when they could be integrated into a design.

Larrabee doesn't have a SW-managed cache. Whether this is good or not is a matter of opinion :). Given careful coding, a SW-managed cache will probably perform better, but a HW-managed cache is much more forgiving of code that doesn't put in that effort.
It transparently expends resources the programmer can't see but that still impact performance.

There are a number of variables I don't know about concerning Intel's ring bus design.
What is its minimum granularity, for example?
The old slides put it at 256 bytes a cycle; it sounds like it would most likely be two 128-byte rings going in opposite directions (otherwise they could have said it had 512 B/cycle).
Is there a side-band bus that contains the packet information, or does packet identifier data take up some of the bandwidth?

This is a non-trivial issue, because it is the difference between a ring bus that can pass 4 cache lines per cycle and one that can pass 2.
What about coherency traffic?
That is also non-trivial, since even moderate traffic can cause increasing amounts of utilization to be lost, depending on the granularity of the bus transfers.

Larrabee's cache applies more generally too though, so using more temporary storage than you have HW registers doesn't hurt as much as on G80 -- spilling to memory is much cheaper.
I'm unsure about G80, but the thread on R600's design indicates something like this is already done.

Sure, but if Intel can get, say, a 10% higher clock at the same power consumption, they can shrink the die by 10% at the same overall throughput. Even if Intel's process is more expensive, if it is better in terms of power or frequency, Intel might be able to make a chip with the same performance more cheaply than TSMC.

The calculation would be more complicated at the modest clock ceiling of both Larrabee and GPUs.
Dialing back on clock speeds and voltages tends to avoid that really steep power consumption curve at the topmost speed bins.
The upshot is that the lower regions of the power curve aren't as far apart between products and processes.

Slides indicate Larrabee's TDP is north of 150 W, so it most likely will not be sipping power either.

So one thing I forgot to mention. I agree that TSMC's wafers are inexpensive, but the quality of the process is also much lower.

One of my friends (who has designed for a wide variety of foundry and internal processes) once said:

"TSMC hopes that their 65nm process will be as fast as AMD's 90nm process"
So did AMD. Zing!!
(I know Arun covered that one, but I couldn't resist)

I can easily believe this since TSMC does ASICs, not MPUs. Now, given that Intel's process is always faster than AMD's, let's just say that in terms of speed TSMC is generally one generation behind Intel's nominal, i.e. TSMC@65 = Intel@90 for most generations. I'd further guess that Intel's 45nm is more than one generation ahead of TSMC's 45nm, by virtue of the new gate stack.
It's more than one generation ahead if one needs the upper reaches of the speed curve.
Larrabee's not pushing the envelope when it comes to raw clock speeds, and there are elements in G80 and G92 that nip at the lower reaches of Larrabee's clock range.

RE: Software versus hardware managed caches:

SW managed is going to be lower power, since it's effectively behaving like a 2nd level set of registers. You only access the particular cells you need, and you don't do the tag check. It could be faster since you don't have any TLB on the critical path or the tag check.

DK
That, and if the software or the designer knows they can get away with minimal coherency, they can just turn it off for that portion of execution.

But the advantage of hardware-managed caches is that they also support much more dynamic caching. Such dynamic caching works pretty well in CPUs and provides a model in which you don't need to worry about software-managed caching and the explicit copy-in/copy-out operations. Plus, you can do cache coherence and really fine-grained synchronization using shared memory locations.
I don't want to come out as if I don't value hardware cache coherency, but there are instances where I believe the ability to simply turn it off is much more valuable in a many-core system.
You can skip striding through the cache (then praying one of the other threads doesn't screw it up) and the core doesn't get in the way of its large number of neighbors.
Creatively combining cache locking and software control can allow Larrabee to emulate a local store or private buffer and keep traffic off of the ring bus and directory.
There are known cases where coherency can be minimal to non-existent, and a lot of graphics work apparently gets along fine without it.

Ok, here is a bombshell tidbit for you all. Rumor has it that Larrabee supports speculative locking. (!!!)

Speculative locking (also known as speculative lock elision, or SLE) is like a limited form of transactional memory (which was mentioned in some posts earlier in this thread) applied to locking. Instead of actually acquiring a lock, the hardware just speculates that it can complete the critical section without interference. It checkpoints its register state and starts to execute the lock-based critical section. If no conflicts occur, it just commits its writes atomically. In this non-conflict case, the processor just completed a critical section *without ever acquiring the lock*! Conflicts are detected by using cache coherence to determine when some other thread tries to access the same data (and at least one of the accesses is a write; read-read is not a conflict). On a conflict, a rollback occurs by restoring the register checkpoint and invalidating blocks in the cache that were written while speculating.
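For a feel of what the software-visible pattern looks like, here is a hedged sketch in C using the RTM-style intrinsics (_xbegin/_xend) that x86 later shipped; it only illustrates the elide-then-fall-back idea described above and says nothing about Larrabee's actual interface. Compile with -mrtm on a compiler that supports these intrinsics.

Code:
#include <immintrin.h>   /* _xbegin, _xend, _xabort, _XBEGIN_STARTED */
#include <stdatomic.h>

static atomic_int lock;  /* 0 = free, 1 = held; names are illustrative */

void critical_section(void)
{
    if (_xbegin() == _XBEGIN_STARTED) {
        /* Speculative path: read the lock (pulling it into the read set in
         * shared state) but do NOT write it. If another thread acquires the
         * lock or touches a speculatively written line, the hardware aborts
         * and control comes back from _xbegin() with a failure code. */
        if (atomic_load(&lock) != 0)
            _xabort(0xff);           /* someone holds it: stop speculating */
        /* ... critical-section work; writes are buffered speculatively ... */
        _xend();                     /* commit all the writes atomically   */
        return;
    }
    /* Fallback: conflict or capacity problem -- take the lock for real. */
    while (atomic_exchange(&lock, 1) != 0)
        while (atomic_load(&lock) != 0)
            ;                        /* test-and-test-and-set spin         */
    /* ... critical-section work ... */
    atomic_store(&lock, 0);
}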
Does this mean that Larrabee's cores can suppress coherency when speculatively writing to cache?
Writes made in this speculative mode could lead to invalid invalidations or incorrect forwarded values otherwise.
In other words, can Larrabee's cores turn their coherency off?

Will the Larrabee processor run Windows, or will it just be a video card?

Don't know about any modern Windows, but it has been posited as being capable of being either an accelerator or a system processor.
 
Ok, with all that said, let me respond to your specific comments:

Write buffers are just used to cache blocks while they are gathering permissions from the coherence protocol. (It's one of the hardware optimizations that causes the instruction re-ordering and memory ordering model issues discussed earlier in this thread.) There aren't that many of them, but they aren't a specifically scarce resource. That said, even the paper you reference does talk about the possibility of buffering the speculative state in the cache (rather than a separate buffer).

My point was that write buffers are a limited resource. One that is micro-architecture specific. You could end up in a situation where you have more stores in your speculative critical section than you have write buffer entries.

The other idea about using the caches for speculative state doesn't seem all that hot to me. You would have to write back dirty cache lines prior to doing a speculative store. You then store and end up with a cache line in a "speculative dirty" state, which upon correct completion will have to be written to memory again at some point. You also need to be able to track which cachelines are in a speculative dirty state and clear the speculative bits upon correct exit automagically.

Non-temporal stores bypass the primary *data cache*. They are just a hint to the hardware to not bother caching them.

No, on a lot of current CPUs (P4, Core 2) nontemporal stores end up in the posted memory write buffer (write-combine buffer) at the memory interface, bypassing the ordinary write buffers of the CPU core (no reason to support store-to-load forwarding when we told the core we won't be needing the data anymore). You can of course conservatively demote non-temporal stores into regular ones, but at a performance penalty.
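As a concrete illustration, a minimal sketch of issuing streaming (non-temporal) stores from C with the SSE2 intrinsics; nothing here is specific to any of the CPUs being discussed.

Code:
#include <emmintrin.h>   /* SSE2: _mm_stream_si128, _mm_sfence */
#include <stddef.h>

/* Fill a 16-byte-aligned buffer with non-temporal stores. The data drains
 * through the write-combining buffers to memory without allocating lines in
 * the data cache; a loop of plain stores would pull every line into L1/L2. */
void fill_streaming(__m128i *dst, __m128i value, size_t count)
{
    for (size_t i = 0; i < count; ++i)
        _mm_stream_si128(&dst[i], value);
    _mm_sfence();        /* order the streaming stores before later stores */
}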

Nope. You only need the block in read-only (aka "shared") state if you only read it. Only the blocks that are *written* speculatively need to be in exclusive state.

If you have a read-only copy of the block, before some other thread writes it, it must invalidate you. That is the fundamental rule of coherence. As such, you will see an invalidation (allowing you to detect a conflict) if some other processor attempts to write the block.

You are of course right. Brainfart on my part. The read cacheline would go from shared to invalid upon a write from another core. You would still have to be able to track these so that you can detect meddling and abort+retry the critical section.

I still don't see it as a huge win. Entering the critical section speculatively has almost the same cost as just test-and-setting the damn lock, and has worse performance under contention (like LLSC has). You also need microarchitectural support for detecting writes to cachelines that have been touched speculatively, which will cost somewhere else once the die-diet kicks in.

Cheers
 
AP said:
As long as your bandwidth utilization is, say, less than 50% and you have a reasonably random distribution of addresses, the latency should be largely independent of the number of threads you throw at the system. Sure, there may be bursts, but because the system is a closed system (a thread won't generate another miss until its current one has been serviced), there is a limit to the number of misses that can be queued at any one time.
If you only use <50% of your framebuffer bandwidth, then your framebuffer is oversized by a factor of at least 2. In other words, it's wider than it should be.

Since pads and other memory controller logic are a non-trivial amount of area (and power!), you end up spending quite a lot of $ to oversize your FB bandwidth capacity, all of which to make up for the fact that you're not handling your memory traffic right.

You're also not maximizing your bandwidth, since you're leaving a factor of 2 performance on the table.

Am I missing something about the internals of DRAM? I have to admit, I've never worked with a super-detailed DRAM model, but from what I know about DRAM, I can't think of something that would cause such problems.

(Disclaimer: there may be mistakes in my math below - read at your own risk)

Here's an easy example: Let's pick Samsung K4U52324QE GDDR4, since Samsung is kind enough to post specs publicly. GDDR4 is BL8 DRAM. That means that any memory read/write will occupy 8 data cycles, regardless of whether you fill them up with work or not. That puts your minimum data granularity at 32 bits * 8 = 32 bytes.

Let's say you are finely interleaving reads and writes of 64 bytes, so each read/write operation you want to do requires 2 transactions. This is not out of the ordinary, as you are both texturing and writing out pixels at the same time. Moreover, let's also assume everything hits the same DRAM page, avoiding the overhead of changing pages.

Write-to-write and read-to-read commands take up (BL/2) * tCK each.

Write-to-read command switches take up [WL + (BL/2)]* tCK + tCDLR, and read-to-write take [ CL + (BL/2) + 2 - WL] * tCK (table on page 63).

Fill in the numbers:
BL=8
WL = 6
CL = 18 (for high-clock GDDR4)
tCDLR = 8 tCK
tCK = 1.6ns

Gives you:
Read-to-read / write-to-write time = 6.4 ns
Read-to-write = 28.8 ns
Write-to-read = 28.8 ns

Thus, doing your 64 B reads or writes in that manner takes 35.2 ns, solely sitting and waiting for the DRAMs. At 1.5 GHz, that's 52 clocks. That doesn't include the time to compute the address, return the data, re-schedule the stalled thread, etc.

Moreover, even though your DRAMs are 100% busy, your FB efficiency is at a pitiful ((64 B / 35.2 ns) / (32 B / 6.4 ns)) = 36%.

If you're hitting different pages (which would be the case if you have any reasonable kind of cache), you can easily cripple yourself down to 20% efficiency and 50+ ns of latency just in the DRAMs.
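For anyone who wants to play with the numbers, a small C program that just reproduces the arithmetic above (the timing parameters are the ones quoted from the datasheet; the formulas are as given in this post):

Code:
#include <stdio.h>

int main(void)
{
    const double BL = 8, WL = 6, CL = 18;      /* GDDR4 parameters quoted above */
    const double tCK = 1.6;                    /* ns */
    const double tCDLR = 8 * tCK;              /* ns */

    double rd_to_rd = (BL / 2) * tCK;                    /* == wr-to-wr         */
    double wr_to_rd = (WL + BL / 2) * tCK + tCDLR;
    double rd_to_wr = (CL + BL / 2 + 2 - WL) * tCK;

    /* Alternating 64 B reads and writes: two 32 B bursts per access, plus the
     * bus-turnaround penalty when switching direction. */
    double access_64B = rd_to_rd + rd_to_wr;             /* 6.4 + 28.8 ns       */
    double peak_64B   = 2 * rd_to_rd;                    /* back-to-back bursts */

    printf("rd-to-rd / wr-to-wr: %.1f ns\n", rd_to_rd);  /* 6.4  */
    printf("rd-to-wr:            %.1f ns\n", rd_to_wr);  /* 28.8 */
    printf("wr-to-rd:            %.1f ns\n", wr_to_rd);  /* 28.8 */
    printf("64 B access:         %.1f ns = %.1f clocks at 1.5 GHz\n",
           access_64B, access_64B * 1.5);                /* 35.2 ns, ~52 clocks */
    printf("efficiency:          %.0f%%\n",
           100.0 * peak_64B / access_64B);               /* 36%                 */
    return 0;
}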

Clearly, there is a need to be smarter about your memory traffic. But being smarter either means more complex code needs to run (which steals FLOPS from shading), or area dedicated to complicated fixed-function logic, or you need to deal with significantly more latency and the resources associated with that.
 
My point was that write buffers are a limited resource. One that is micro-architecture specific. You could end up in a situation where you have more stores in your speculative critical section than you have write buffer entries.

In which case you just fall back on actually acquiring the lock. In such a case, you didn't really gain much, but you didn't really lose anything either (you can just write the lock and then commit all the speculative state and keep going with the critical section).

The other idea about using the caches for speculative state doesn't seem all that hot to me. You would have to write back dirty cache lines prior to doing a speculative store.

Sure. This, of course, doesn't stall the processor (as the store can be buffered while the writeback of the block occurs). So, the only cost is the additional L1->L2 traffic. When you consider all the L1<->L2 traffic for cache misses and writebacks of dirty data, the extra traffic introduced by writing blocks speculatively is basically negligible. Of course the exact number depends on the amount of time spent in critical section, etc., but the data I've seen indicates it isn't a big problem.

If you're really worried about the writeback traffic, you can use a hybrid to buffer some writes in the store buffer and then overflow into the cache when that fills. That has the advantage of reducing the write-back traffic for critical sections with a small number of stores while still allowing larger critical sections.

You then store and end up with a cache line in a "speculative dirty" state, which upon correct completion will have to be written to memory again at some point.

I think you're double-counting here. Of course you have to write back a dirty block at some point. If you write it speculatively once, you just have one extra writeback (not two).

You also need to be able to track which cachelines are in a speculative dirty state and clear the speculative bits upon correct exit automagically.

An SRAM with flash-clear support changes a 6-transistor SRAM cell into an 8-transistor SRAM cell (or maybe 10 transistors). Sure, it won't be a highly optimized standard cell, but it won't be more than 2x the size of a regular SRAM bit. In fact, you also might want a "conditional clear" cell, which would likely take an additional four transistors. Of course, you just need eight bits (two bits per thread) per 512-bit cache block. That is less than 2% overhead for the L1 cache. As the L1 cache area is at most 25% of the chip, we're talking something less than 1% of the chip area (at the most).

The read cacheline would go from shared to invalid upon a write from another core. You would still have to be able to track these so that you can detect meddling and abort+retry the critical section.

That is included in the less than 1% overhead I mentioned above.

Entering the critical section speculatively has almost the same cost as just test-and-setting the damn lock...

There is one key difference. To test-and-set the lock you need *write* permission to the block with the lock. To start a speculation on the lock, you just need *read* permission. As everyone can have read permission to the lock, all the hot locks will be cached read-only by all the caches. In such cases, starting a speculation doesn't cause a cache miss! In contrast, for a test-and-set lock, the last processor to release the lock will have the block with read/write permissions. That means the next processor to acquire the lock (in non-speculative locking) is going to take a cache miss.

This makes speculative locking faster than regular locking even *without* any lock contention.

...and has worse performance under contention (like LLSC has).

All you would need to do is give up on speculating when you fail frequently and go back to locking. You could do this per lock location to dynamically capture which locks are poor candidates for speculation. Alternatively, there have been proposals to resolve conflicts in a smarter manner that basically queue up all contended critical sections and then efficiently hand off the lock from one to the next. This can actually be faster than baseline locking under contention.

I still don't see it as a huge win.

Up to this point, this message has tried to address some of the concerns you raised with the cost of the idea (not the benefit). Let's explore the benefit of the scheme a bit further.

Try this experiment: go up to someone that has written a non-trivial parallel multi-core program. Tell them you want them to extend it to 128 threads on 32 cores (and watch the blood drain away from their face; they might even curl up into the fetal position and start rocking back and forth on the floor).

Then tell them that you're going to give them a hardware widget that will both (1) reduce lock contention and (2) make uncontended locks cheap to acquire. I assure you they won't have a hard time seeing that as a big win. :)
 
Here's an easy example: Let's pick Samsung K4U52324QE GDDR4, since Samsung is kind enough to post specs publicly. GDDR4 is BL8 DRAM. That means that any memory read/write will occupy 8 data cycles, regardless of whether you fill them up with work or not. That puts your minimum data granularity at 32 bits * 8 = 32 bytes.

Sure.

The concern I have with your post is that you're failing to take into account the pipelining of requests to a DRAM. Just because the latency of a request to the DRAM is k cycles doesn't mean you can't issue other requests to it in parallel. I think your post confuses throughput issues with latency issues in several places.

Let's say you are finely interleaving reads and writes of 64 bytes, so each read/write operation you want to do requires 2 transactions. This is not out of the ordinary, as you are both texturing and writing out pixels at the same time. Moreover, let's also assume everything hits the same DRAM page, avoiding the overhead of changing pages.

If changing from read mode to write mode is so costly, a smart memory controller would defer the writes in a buffer (as writes aren't on the critical path) and then burst all the pending reads together. Seems like you picked an unrealistic worst case situation.

Once you can burst some reads and writes together, you can pipeline the requests with the responses. A quote from the GDDR4 document you reference:

Data from any READ burst may be concatenated with data from a subsequent READ command. A continuous flow of data can be maintained. The first data element from the new burst follows the last element of a completed burst. The new READ command should be issued x cycles after the first READ command.... Full-speed random read accesses within a page (or pages) can be performed as shown in Random READ accesses figure.

It can get full speed because this DRAM uses separate pins for address/control from data signals. You can be pipelining new addresses to it in parallel with receiving data.

Write-to-read command switches take up [WL + (BL/2)]* tCK + tCDLR, and read-to-write take [ CL + (BL/2) + 2 - WL] * tCK (table on page 63).

Hang on, much of this latency can be overlapped with other transfers. The "Read to Write" diagram on page 33 clearly shows only 2 "dead" clocks cycles between a read and a write. This is only a couple of nanoseconds of dead time, not the 28ns you calculated in your last post.

Thus, doing your 64 B reads or writes in that manner takes 35.2 ns, solely sitting and waiting for the DRAMs. At 1.5 GHz, that's 52 clocks. That doesn't include the time to compute the address, return the data, re-schedule the stalled thread, etc.

So, I'm confused. Are you talking about uncontended latency of 35ns? If so, if you look back in my post, I used an estimate of 50ns for uncontended latency, so my estimate seems conservative based on what you're saying. If you're talking about throughput, then you're not considering the small dead time between reads and writes (or that consecutive read or write bursts can be done at full DRAM bandwidth).

Clearly, there is a need to be smarter about your memory traffic. But being smarter either means more complex code needs to run (which steals FLOPS from shading), or area dedicated to complicated fixed-function logic, or you need to deal with significantly more latency and the resources associated with that.

Or just a better memory scheduler (which is part of every high-performance DRAM memory controller built today).

Yet, what does this have to do with the difference between Larrabee and a GPU? It seems like all of the above issues are equally true for both systems? I'm really confused what your point is about all this.

My original point was that Larrabee's 128 threads was likely enough to saturate its available DRAM bandwidth. Do you still disagree with that assertion?

Furthermore, since my original comment, we've now concluded that Larrabee likely has vector scatter/gather support. If that is the case, Larrabee will have no problem generating enough outstanding misses to swamp any reasonable GDDR3/4 memory system.
 
The question isn't whether...

And I thought I wrote long posts... :) I'll try to respond to your point below.

The question isn't whether a given core is updated within a stretch of time, but whether coherency traffic is ever generated. Coherency packets are of non-zero length and depending on the granularity of the interconnect, even moderate coherency traffic can take a disproportionate amount of the interconnect's transactions/cycle.

Agreed. Relaxed memory ordering just tolerates latency; it doesn't reduce coherence traffic (or at least not by any significant amount).

I would like to point out, however, that request messages are on the order of 8 bytes (a 40+ bit physical address, source ID, command bits, etc.). A data response has some multi-byte header plus a 64-byte payload. You can send 8 request messages for the cost of one data response (or writeback). As most coherence operations involve transferring data, the traffic due to data can dominate even in broadcast-based systems.

That would depend on the implementation of its threading. I haven't seen any public source indicating just how Intel's implemented threading. SMT could plausibly get around an instruction cache miss. FMT would not.

The whole point of multithreading in today's machines is to tolerate instruction and data cache misses. If one thread blocks on a cache miss, the other ones fill in. This is true for the Pentium 4 hyperthreading, IBM's coarse-grained RS-IV PowerPC core (from the mid-1990s), IBM's Power5, etc.

Correctness is the greatest enemy of performance. ;)

Oh, I like that quote. Well said.

More to the point, Larrabee's being shown as either an accelerator or as an independent processor means it must have functionality to maintain correctness to a degree GPUs currently do not.

Yes, Larrabee is implementing a much stricter semantics than the ill-defined and loosely synchronized GPUs. The question then becomes: (1) how much is the cost and then (2) can the more expressive programming and memory model allow the software to make up for (or exceed) the difference?

This makes sense, but I haven't seen any statements to this effect. Can you source this?

No, I don't have a published source that I can point to that describes Larrabee's directory-based coherence protocol. You'll just need to take my word for it. Sorry.

If Intel wants to scale to 64 or 128 cores, I wonder if it would do something like increase the bus width further, or maybe divide things up so they aren't all on a single ring.

This seems reasonable. For some reason, Intel seems to have a ring-based interconnect fetish. They just love them for on-chip interconnects. Certainly beyond a certain number of cores, doing something like a mesh or grid seems like it would make more sense.

x86 wasn't the sole reason for this success.

In fact, I would say they succeeded in spite of x86. :)

It was cheap x86 paired with specialized 3D cards that allowed x86 to even pretend to matter in the 3D workstation market. That's what butchered a multi-billion dollar market the RISCs (and Itanium, early on) were counting on.

I'm not sure of the relative role of the loss of the 3D CAD workstation market vs servers in the decline of the RISC chips. Certainly, lots of customers were buying the high-end RISC chips for non-3D applications. I do agree that in the 3D area, cheap PC-based GPUs really hurt some of the RISC vendors, most notably SGI/MIPS.

On a side note, one of the final straws that broke the camel's back was the fact SGI was counting on the timely release of Itanium-based products as it transitioned away from its older lines.
Intel's missteps in timeliness in delivering a new and less-established architecture with the features and performance promised are not likely to be forgotten in the consumer graphics market, or anywhere else for that matter.

Itanium was a mess from the start. In addition, I want to re-iterate that Intel isn't one big hive mind. The mistakes the guys in Santa Clara, CA made a decade ago don't really reflect much on a project being led by the Oregon group.

That was the same argument Transmeta used for its processors. Just emulating x86 cost Transmeta half of a native implementation's performance.

Transmeta is an interesting analogy to consider. I agree that their technical approach didn't seem to pan out for them. Part of the reason was that they were banking on an Itanium-like VLIW core to save the day. Turns out VLIW isn't so great compared with a dynamically scheduled super-scalar core (at least for common CPU workloads). This same reality is what doomed Itanium.

Yes, Transmeta's failure does give pause. I think the domains are different (Larrabee isn't really translating anything), but I agree that it could fail as spectacularly as Transmeta did.

(As an aside, I think one of the things that really hurt Transmeta in the end was that their translation-based approach was pretty hard to adapt to a multi-core setting. When the handwriting was on the wall that multi-core was going to happen, I think it was the last nail in the coffin for Transmeta.)


You're counting on Larrabee to emulate specialized hardware emulating a graphics pipeline state machine at speed.
It's possible, but can it be done acceptably?

I think the key will be what sort of special-purpose vector instructions Larrabee uses. Perhaps one way of thinking of Larrabee's vectors is as really fine-grained fixed-function blocks that just happen to operate on 64-byte register inputs under the control of a CPU.

My first thought is that it would be a thermal nightmare. It seems possible to have a layer of low-power DRAM or SRAM underneath the die, since the active logic layers are going to have to be directly adjacent to the cooling solution. The depth of the memory stack in this case is still restricted by yields and the measurable amount of heat even SRAM and DRAM can generate, particularly if under a 150W core.

Yes, the thermal issues are the key issues with die stacking (that, and yield issues caused by mis-stacking, and the fact that you can't fully test all the die before they are in the package). I think the thermal issues can be solved, but it will take innovation. One thing to point out is that it doesn't increase the overall power used by the card, it just concentrates it more. So the box-level and machine-room-level cooling issues aren't made any worse. In fact, putting things in a die stack will actually help the operations per watt metric.


Creatively combining cache locking and software control can allow Larrabee to emulate a local store or private buffer and keep traffic off of the ring bus and directory.

Especially in a multithreaded design, I could see why some sort of cache locking or other mechanisms might be useful. I think the Xbox 360's Xenon has such support in it (but I might be thinking of a different IBM chip).

Does this mean that Larrabee's cores can suppress coherency when speculatively writing to cache? Writes made in this speculative mode could lead to invalid invalidations or incorrect forwarded values otherwise. In other words, can Larrabee's cores turn their coherency off?

This isn't so much turning off cache coherence. It is more that it allows some speculative values to be in the cache. If any other processor requests any of those blocks, it detects a conflict and recovers to prevent any incorrect value forwarding.
 
For whatever reason I don't think that lends much credibility to the project, especially when you realize that it's what's left of the Prescott team.

Intel has two really top-notch design groups with proven track records: Oregon and Israel. Oregon did the Pentium Pro, Willamette/Northwood (which was the fastest CPU of its day), and yes, Prescott (which wasn't so good). They are also doing Nehalem, which looks on-track. So, they have a pretty good track record. FYI, the Israel team did the Pentium with MMX, the Pentium M (the low-power Pentium III), and then Core and Core2. All very nice chips. Did I mention that Itanium and the short-lived Pentium core were both done in Santa Clara?

My main point of the original comment, however, was that Larrabee isn't being done by whatever group does Intel's current IGP.
 
There is one key difference. To test-and-set the lock you need *write* permission to the block with the lock. To start a speculation on the lock, you just need *read* permission. As everyone can have read permission to the lock, all the hot locks will cached read-only by all the caches. In such cases, starting a speculation doesn't cause a cache miss! In contrast, for a test-and-set lock, the last processor to release the lock will have the block with read/write permissions. That means the next processor to acquire the lock (in non-speculative locking) is going to take a cache miss.

This makes speculative locking faster than regular locking even *without* any lock contention.

Pardon for disrupting this excellent hardware discussion, but from a programming perspective, it is easy to get around needing "writes" on locks (and even locks themselves) with the following method (which is supported on anything with an atomic swap instruction),

A = read value at address M
do work
B = atomic read at address M and set at address M = A (single atomic swap opcode)
if (B != A) then handle contention

Often address M is simply a pointer to an object (or a reference counter), where contention means another thread updated the object, and objects are lazily freed long after possible pending threads would be reading from them ...

Basically no new hardware widget is needed: bypass locking and use smart algorithm design to reduce the number of possible contentions.
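A literal transcription of those four steps into C11 atomics might look like this (a sketch only: M, the payload, and the contention handler are placeholders, and correctness still depends on the lazy-free discipline described above).

Code:
#include <stdatomic.h>
#include <stdint.h>

/* Shared version word (or pointer) guarding some object; illustrative only. */
static _Atomic uintptr_t M;

void update_optimistically(void)
{
    uintptr_t A = atomic_load(&M);         /* A = read value at address M     */

    /* ... do work on the object M refers to ... */

    uintptr_t B = atomic_exchange(&M, A);  /* B = old M, set M = A, one swap  */

    if (B != A) {
        /* another thread touched M while we were working: handle contention,
         * e.g. throw the work away and retry */
    }
}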

The ALU to memory access ratio is bound to only increase, favoring fewer memory accesses (and higher latencies) and more ALU work on a smaller local set. This trend grows faster the more cores you add. I don't think the Larrabee model will scale well as this trend continues ... ultimately the same thinking needed to optimize for classic MPI (cluster computing, optimization / limiting of message passing / communication between computing nodes) will be needed to effectively extract the performance from highly parallel machines. NVidia's CUDA/PTX seems to offer much better long term scalability.

BTW, one awesome current example of a scalable parallel algorithm for GPGPU, which is ideal in not needing locking, is Gernot Ziegler's HistoPyramid. It allows for all sorts of stuff traditionally thought of as tough to do with only an SM3.0 feature set, like stream compaction, etc. The basic concepts involved, parallel reduction and independent logarithmic searching via gather, can be used for all sorts of related algorithms (like dynamic allocation using a fragment shader only, i.e. no point rendering for scatter).

The one real advantage that Larrabee might have is the ability to do a cached scatter, however I have a feeling that before Larrabee is released, NVidia and AMD will have hardware support for scatter (read+write) using an optimized 2D cache, and that this will be in future DX and GL standards, probably first used in a programmable ROP, and would sure help in rendering smaller and smaller triangles ... NVidia's PTX already has scatter (read/write to surface cache) in the spec for future hardware.
 
Test and test-and-set is an obvious way to avoid cache misses on a contended lock, and can also be used to avoid convoying. Just test the lock, and if it's already acquired then let the thread do some other work instead of spin-looping.

This is an interesting option for task scheduling. If the lock can be acquired, fill a lock-free queue with new tasks ready for execution. If it can't be acquired, skip and just pop a task off the queue. Only when the queue is empty would you start convoying, but that's a matter of task granularity.

Anyway, speculative locking looks unbeatable for things like shared containers, where a lock on the whole container would kill concurrency and a lock per element would have too much overhead.
 
I'm not sure of the relative role of the loss of the 3D CAD workstation market vs servers in the decline of the RISC chips. Certainly, lots of customers were buying the high-end RISC chips for non-3D applications. I do agree that in the 3D area, cheap PC-based GPUs really hurt some of the RISC vendors, most notably SGI/MIPS.

As a graphics company supplying Irix/Unix people, SGI made the mistake of going NT/Windows and alienated at least one major movie special effects company. Not the first time Microsoft has helped a competitor to get in and later put them (nearly) out of business (SEGA). Alpha was awesome, nearly an ideal classic RISC ISA (well, let's exclude the floating point exception handling argument). Too bad Alpha got killed for IA-64 after DEC got eaten by Compaq; the HP merger didn't help Alpha either.
 
Write buffers are just used to cache blocks while they are gathering permissions from the coherence protocol.
There's something I don't get about this discussion of using the write buffer for enabling speculative lock elision on Larrabee. Write buffers, as somebody already mentioned, are a scarce resource in modern processors for a reason: they are on the critical path because of store-to-load forwarding in the load/store units.

Besides, >10% of the stalls in modern out-of-order processors can be caused by the write buffer being full. Implementing speculative lock elision using the write buffer to cache speculative writes can be successful only if the number of writes between the lock and unlock operations is very small, which is unlikely in real code. This is because writes to non-shared data will need space in the write buffer and cannot be distinguished from writes to shared data (that's a problem with transactional memory too, as transactions can be inflated by non-shared accesses happening during the transaction).

So in order to implement effective speculative lock elision you need two things: an out-of-order core which supports speculative execution and a large write buffer. Larrabee cores, as far as they have been currently described, lack both.

The only other two architectures which are said to implement speculative lock elision are Sun's Rock and Azul's specialized Java cores. Little information is available on Azul's offerings but since they do work only on Java monitors they have a lot of advantages compared to traditional architectures when implementing speculative lock elision, even more so as many enterprise-grade JVMs already implement it in software so the problem is well understood.
Rock on the other hand is an in-order processor which uses scout threading (i.e. speculative execution with checkpointing) to reap most of the benefits of OoO speculative execution without the associated costs. How it will actually perform remains to be seen but the fact that it implements a processor state checkpointing mechanism makes implementing speculative lock elision a natural extension.

Back to Larrabee, creating checkpoints on a machine with large vector registers (and potentially four sets of them for each core) sounds very unlikely. Still, I'd be glad to be surprised by Intel's engineers devising new methods around the problem :)
 
Intel has two really top-notch design groups with proven track records: Oregon and Israel. Oregon did the Pentium Pro, Willamette/Northwood (which was the fastest CPU of its day), and yes, Prescott (which wasn't so good). They are also doing Nehalem, which looks on-track. So, they have a pretty good track record. FYI, the Israel team did the Pentium with MMX, the Pentium M (the low-power Pentium III), and then Core and Core2. All very nice chips. Did I mention that Itanium and the short-lived Pentium core were both done in Santa Clara?

My main point of the original comment, however, was that Larrabee isn't being done by whatever group does Intel's current IGP.

Actually I'd say Intel has 4. Intel SC did Pentium and Itanium 1.

Intel (nee HP) Fort Collins did Itanium 2, which as it turns out was the fastest or 2nd fastest 180nm processor and was beating all comers for a while except for....

The Alpha design team, which can be counted as one (admittedly there are two teams, but I believe Intel combined them). They, of course, did the EV6, EV7 and then the EV8. The EV7 was also one of the two fastest 180nm CPUs, and it's been speculated that if it wasn't for HP not wanting the red-headed stepchild to outshine the fair-haired child (IPF), it would have easily beaten all other 180nm CPUs.

Of course, the Fort Collins guys aren't working on x86, and I don't think the Alpha guys are either.

Intel also has teams in Folsom, Costa Rica and some other places, but I'd agree that they aren't teams that have done flagship processors by themselves.

There also might still be some people at Intel in Austin, but I'm not sure...

David
 
And I thought I wrote long posts... :) I'll try to respond to your point below.
I was away from the internet for a few days, so I had to do some catching up.

I would like to point out, however, that request messages are on the order of 8 bytes (a 40+ bit physical address, source ID, command bits, etc.). A data response has some multi-byte header plus a 64-byte payload. You can send 8 request messages for the cost of one data response (or writeback). As most coherence operations involve transferring data, the traffic due to data can dominate even in broadcast-based systems.
That depends on the granularity of the interconnect.
If the bus is capable of fitting packets together in an arbitrary arrangement, then the impact is minimal.
If for simplicity's sake it is optimized to a certain granularity, the utilization penalty is larger than the packet size would suggest.
This already assumes that the ring-bus is packetized.
I can think of a number of increasingly interesting (but perhaps ultimately pointless) formulations for the ring bus, given that all I know is that Intel hoped for 256 bytes/cycle.

The whole point of multithreading in today's machines is to tolerate instruction and data cache misses. If one thread blocks on a cache miss, the other ones fill in. This is true for the Pentium 4 hyperthreading, IBM's coarse-grained RS-IV PowerPC core (from the mid-1990s), IBM's Power5, etc.
SMT is capable of hiding one thread's failure to fetch instructions in a timely fashion by continuing execution on another transparently.
Fine-grained methods such as round-robin execution would not be able to hide a cache miss; the stalled thread simply wastes its slot whenever it comes up for execution. In Larrabee's case, it's a matter of 10/4 cycles if it has to go to the L2. It's a small penalty, one that might not justify a more complex hybrid scheme unless a fetch misses the L2.

I haven't seen any Intel statement that has set down what exact scheme Larrabee uses.

This seems reasonable. For some reason, Intel seems to have a ring-based interconnect fetish. They just love them for on-chip interconnects. Certainly beyond a certain number of cores, doing something like a mesh or grid seems like it would make more sense.
Ring buses are conceptually simpler, and they have been used in other architectures to provide high-bandwidth internal interconnect.
As part of Intel's research into many-core design, it might be that the conceptual and physical simplicity of the ring bus meant that it could reach the point of productization first.

Polaris showed they've been working on more complex topologies, but sometimes simplicity wins.

I think the key will be what sort of special-purpose vector instructions Larrabee uses. Perhaps one way of thinking of Larrabee's vectors is as really fine-grained fixed-function blocks that just happen to operate on 64-byte register inputs under the control of a CPU.
I'm curious about how it will implement scatter and gather operations.
When gathering operands for a 16-wide vector, increasing arbitrariness in where those elements can be pulled from leads to increasing levels of waste.
Worst-case with a fully generalized gather is pulling 16 elements from 16 different locations, necessitating sixteen 64-byte cache lines to be read into the cache for just one 32-bit value from each, i.e. 1 KB of cache fill for 64 bytes of useful data.
Perhaps gathers will bypass the cache?

Yes, the thermal issues are the key issues with die stacking (that, and yield issues caused by mis-stacking, and the fact that you can't fully test all the die before they are in the package). I think the thermal issues can be solved, but it will take innovation. One thing to point out is that it doesn't increase the overall power used by the card, it just concentrates it more. So the box-level and machine-room-level cooling issues aren't made any worse. In fact, putting things in a die stack will actually help the operations per watt metric.
Temperature can increase leakage current, so a chip with internals that have a higher average temperature might consume more power, depending on how other portions of the chip are altered.

Coupled with the challenges of less than ideal thermal behavior, a chip with higher power density can necessitate more complex cooling or just more energy devoted to cooling.

I think the Xbox 360's Xenon has such support in it (but I might be thinking of a different IBM chip).
Xenon does support cache locking.

This isn't so much turning off cache coherence. It is more that it allows some speculative values to be in the cache. If any other processor requests any of those blocks, it detects a conflict and recovers to prevent any incorrect value forwarding.

Unless there's a massive speculative buffer, I'm curious as to what keeps the speculative work from having more far-ranging side effects.

Writing the speculative values to cache could theoretically force a write invalidate broadcast if the cache controller tried to treat it like any other write. I suppose reads can be allowed to go on as usual, though there could be some hiccups when speculation pushes other cache lines to shared status and then rolls back to the time prior to the accesses.

If Larrabee has some kind of forwarded cache line state like in the CSI interconnect, it could theoretically force a rollback on a line in Forwarding status, leaving no cache line anywhere in that state. Then there's the obligatory reference to a corner case involving undoing updates to the page table descriptor.

It seems that the ability to support speculation in the cache is a subset of the possibilities available once you give a thread the ability to toggle portions of the cache controller's state machine.
 
Pardon for disrupting this excellent hardware discussion, but from a programming perspective, it is easy to get around needing "writes" on locks (and even locks themselves) with the following method (which is supported on anything with an atomic swap instruction)...

OK, let me play devil's advocate here again. If this is so easy, why do *any* programs use locks? Just do the magic lock-free stuff you talked about. <sarcasm>Easy. No problem. That is why most major systems such as operating systems and database management systems don't use locks internally. That is why malloc doesn't use locks.</sarcasm>

Oh, wait. All of the above systems *do* use locks extensively. The standard implementations of malloc use locks, too. I agree that if you're building a simple queue, then by all means, use a lock-free queue! But it gets really complicated after that.

Let me give you a really concrete example. In 2000 some well-known top-notch researchers published a paper titled "DCAS-based concurrent deques". This paper used double-compare-and-swap to build a deque (which is basically a double-ended queue/stack). Yet, later a bug was found in their algorithm. This bug and some other issues are discussed in a follow-on paper by a similar set of authors in 2004: "DCAS is not a silver bullet for nonblocking algorithm design". The abstract of this second paper begins: "Despite years of research, the design of efficient nonblocking algorithms remains difficult.".

Another example: there is a paper from 2006 titled "Split-ordered lists: Lock-free extensible hash tables" that presents the first lock-free implementation of an extensible hash table (one that can grow over time as the number of elements in the hash table increases). If making a new lock-free hash table is a publishable result, I can assure you it isn't obvious how to do it.

Let me sum it up this way: a data structure with a coarse-grain lock is a CS101 assignment. A data structure with fine-grained locking is perhaps a junior- or senior-level project for a CS major. A good lock-free algorithm can earn you a PhD. Why not build speculative locking hardware to turn a PhD-level problem into a CS101 project?
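For reference, the coarse-grain end of that spectrum really is CS101-sized: wrap every operation in one mutex. A hedged sketch (the container and names are made up for illustration):

#include <mutex>
#include <unordered_map>

// Coarse-grained locking: one lock serializes every operation.
// Trivial to get right, but every thread contends on the same mutex.
template <typename K, typename V>
class CoarseLockedMap {
    std::unordered_map<K, V> map_;
    std::mutex m_;
public:
    void put(const K& k, const V& v) {
        std::lock_guard<std::mutex> g(m_);
        map_[k] = v;
    }
    bool get(const K& k, V& out) {
        std::lock_guard<std::mutex> g(m_);
        auto it = map_.find(k);
        if (it == map_.end()) return false;
        out = it->second;
        return true;
    }
};

Speculative locking hardware would, in principle, let code this simple run concurrently whenever two threads touch different parts of the table, without rewriting it.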

...and objects are lazily freed long after any pending threads could still be reading from them ...

I think this is harder to guarantee than it sounds.

Basically no new hardware widget is needed: bypass locking, and use smart algorithm design to reduce the number of possible contentions.

As I've said earlier in this thread, lock-free algorithms are great! I just don't think they are quite as general or as easy to create as you suggest.

I don't think the Larrabee model will scale well as this trend continues ... ultimately the same thinking needed to optimize for classic MPI (cluster computing, optimization / limiting of message passing / communication between computing nodes) will be needed to effectively extract the performance from highly parallel machines. NVidia's CUDA/PTX seems to offer much better long term scalability.

This is just the whole "message passing vs shared memory" debate all over again. That issue has been debated to death, and I'm not sure I want to rehash it here. In the end, the discussion boiled down to the fact that you could implement message passing really well over coherent shared memory (just build in-memory queues for passing messages). Building shared memory over message passing was possible, though less efficient. So, in the end, if you *do* have shared memory, you're welcome to make the software as efficient or scalable as you want, be that using threads-and-locks, a worklist-and-tasks model, or message passing.
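A small sketch of that last point, since it comes up a lot: "message passing over coherent shared memory" can be as simple as a shared in-memory queue that sending threads push into and receiving threads block on. Illustrative C++11 code, nothing Larrabee-specific:

#include <condition_variable>
#include <mutex>
#include <queue>

// A message "channel" layered on shared memory: senders enqueue,
// receivers block until a message arrives.
template <typename Msg>
class Channel {
    std::queue<Msg> q_;
    std::mutex m_;
    std::condition_variable cv_;
public:
    void send(Msg msg) {
        {
            std::lock_guard<std::mutex> g(m_);
            q_.push(std::move(msg));
        }
        cv_.notify_one();
    }
    Msg receive() {
        std::unique_lock<std::mutex> l(m_);
        cv_.wait(l, [this] { return !q_.empty(); });
        Msg msg = std::move(q_.front());
        q_.pop();
        return msg;
    }
};

Going the other direction, i.e. faking coherent shared memory on top of a message-passing fabric, means shipping a message on every remote load or store, which is why it is the less efficient layering.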

BTW, one awesome current example of a scalable parallel algorithm for GPGPU, which is ideal in not needing locking, is Gernot Ziegler's HistoPyramid.

I don't know this algorithm. But let me ask you a question about it: how well would it map to Larrabee (in terms of structure and communication patterns)? I suspect it will map reasonably well under a task-and-queue model.

The one real advantage that Larrabee has is the ability to do a cached scatter. However, I have a feeling that before Larrabee is released, NVidia and AMD will have hardware support for scatter (read+write) using an optimized 2D cache, and that this will be in future DX and GL standards, probably first used in a programmable ROP; it would sure help in rendering smaller and smaller triangles. NVidia's PTX already has scatter (read/write to surface cache) in the spec for future hardware.

Certainly Larrabee is after a moving target. It will be interesting to see what GPUs look like in 2010.
 
Test-and-test-and-set is an obvious way to avoid cache misses on a contended lock, and can also be used to avoid convoying. Just test the lock, and if it's already acquired then let the thread do some other work instead of spin-looping.

Everything I said above is equally true for a test-and-test-and-set lock. In fact, you'd be sort of crazy to use just a test-and-set lock. The overhead for the extra test is so small (versus the really bad behavior of a test-and-set lock under contention) that it is a no-brainer to use a test-and-test-and-set lock over a test-and-set lock in almost all cases.
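For anyone following along, the difference is literally one extra read: spin on an ordinary load and only attempt the atomic operation once the lock looks free. A minimal C++11 sketch (names illustrative):

#include <atomic>

// Test-and-test-and-set spinlock: spin on a plain load (the cache line
// stays in shared state across the waiters) and only do the atomic
// exchange when the lock appears free. A plain test-and-set lock would
// hammer the line with exchanges and bounce it between caches.
class TtasLock {
    std::atomic<bool> locked_{false};
public:
    void lock() {
        for (;;) {
            while (locked_.load(std::memory_order_relaxed)) {
                // "test": read-only spin, no write traffic while held
            }
            if (!locked_.exchange(true, std::memory_order_acquire))
                return;  // the "test-and-set" part succeeded
        }
    }
    void unlock() {
        locked_.store(false, std::memory_order_release);
    }
};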

The reason you take a cache miss when acquiring a contended test-and-test-and-set lock is that whichever processor last locked or unlocked the lock will have that block in its cache in the exclusive state (and thus it won't be cached anywhere else). As speculative locking can totally avoid this write to the lock itself, speculative locking doesn't have this problem.

So let me say explicitly something I only said implicitly before (and probably should have said earlier): under speculative locking, it is sufficient to just "speculatively read" the lock variable and find it "unlocked". If you can do your entire speculative region while the lock remains in the "speculatively read" state in your cache, you know no other processor acquired the lock, so you can commit your speculation atomically. It is counter-intuitive that it actually works, but pretty clever, too.
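None of this existed as a software-visible feature when this thread was written, but for a concrete analogue of "speculatively read the lock and commit if nobody wrote it": Intel's much later RTM extension (the _xbegin/_xend intrinsics) exposes exactly that pattern. The sketch below is purely illustrative of the idea, not of anything Larrabee actually shipped:

#include <immintrin.h>  // _xbegin/_xend/_xabort; needs RTM hardware and -mrtm
#include <atomic>

std::atomic<int> lock_word{0};  // 0 = free, 1 = held

template <typename F>
void elided_critical_section(F body) {
    unsigned status = _xbegin();
    if (status == _XBEGIN_STARTED) {
        // Speculatively *read* the lock; it joins our read set only.
        if (lock_word.load(std::memory_order_relaxed) != 0)
            _xabort(0xff);          // somebody really holds it
        body();
        _xend();                    // commit atomically; lock never written
        return;
    }
    // Fallback path: actually acquire the lock. Its write also aborts
    // any thread still speculating with the lock in its read set.
    while (lock_word.exchange(1, std::memory_order_acquire) != 0) { }
    body();
    lock_word.store(0, std::memory_order_release);
}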

Anyway, speculative locking looks unbeatable for things like shared containers, where a lock on the whole container would kill concurrency and a lock per element would have too much overhead.

I certainly agree with that.
 
Too bad Alpha got killed for IA-64 after DEC got eaten by Compaq; the HP merger didn't help Alpha either.

Yea. The demise of Alpha is certainly sad. I once heard at a conference: "There is an inverse relationship between the cleanliness of an ISA and its commercial success". Sad, but seemingly true (with x86 and PowerPC being the prime examples). ARM might be the exception.

One thing I miss about the Alpha group is that they were very engaged in the academic community, and they wrote nice papers describing what they actually did. I actually use the Alpha 21164->21264->21364 (ev5->ev6->ev7) papers to illustrate the progression from in-order to out-of-order to integrated coherence. Intel is also engaged in the academic community (less so for AMD), but pretty tight-lipped from time to time about exactly how the internals of their actual chips work. For example, try to find something about Intel's branch predictors. From what I understand, it is the secret sauce or the Coke formula of computer architecture. :)
 
There's something I don't get about this discussion of using the write buffer for enabling speculative lock elision on Larrabee. Write buffers as somebody already mentioned are a scarce resource in modern processors for a reason: they are on the critical path because of store-to-load forwarding in the load/store units.

I agree that for strongly ordered systems like x86 and SPARC TSO, the size of the store buffer (aka write buffer) *is* a scarce resource. It needs to support multiple stores to the same location, byte-level valid bits, and it needs to maintain FIFO order. For weaker ordering models (PowerPC, Itanium, Alpha), the store buffer has much more flexibility, allowing it to be set-associative, single-versioning, and non-FIFO in entry freeing. The result is a much more effective write buffer that can be made larger, too.

Ok, but we're talking x86 for Larrabee.

One option would just be to add a little "speculation cache" off to the side of the L1 (accessed in parallel with the L1 and store buffer, but only during speculation). It could have 32 64B entries for a total of 2KB of storage (much smaller than the L1 cache). This would allow writing a significant amount of data within a critical section (recall that stores are less frequent than loads). And empirical evidence says that many critical sections are small (at least for well-tuned programs): for example, a lookup in an O(log n) tree structure or an access to a hash-table bucket. I dunno, I don't think this is too bad a cost to pay.

This is because writes to non-shared data will need space in the write buffer and cannot be distinguished from writes to shared data (that's a problem with transactional memory too as transactions can be inflated by non-shared accesses happening during the transaction).

It is certainly true that non-shared data can take up space in these structures. Yet, you have to roll back all changes you make if you're forced to abort, so you do need to buffer them.

So in order to implement effective speculative lock elision you need two things: an out-of-order core...

Why do you need out-of-order execution?

...which supports speculative execution and a large write buffer.

Or it can do in-cache versioning.

The only other two architectures which are said to implement speculative lock elision are Sun's Rock and Azul's specialized Java cores.

Both of which are basically in-order cores, as I recall, FWIW.

Rock on the other hand is an in-order processor which uses scout threading (i.e. speculative execution with checkpointing) to reap most of the benefits of OoO speculative execution without the associated costs. How it will actually perform remains to be seen but the fact that it implements a processor state checkpointing mechanism makes implementing speculative lock elision a natural extension.

Rock is an interesting design. I'm looking forward to hearing more about it. From what I understand, what you say is exactly right. They already had scout threads, so adding speculative locking and such was easier for them (and perhaps trivial).

Back to Larrabee, creating checkpoints on a machine with large vector registers (and potentially four sets of them for each core) sounds very unlikely.

You don't have to take the checkpoint all at once. You can do it lazily. For example, when you start a checkpoint you have a bit-vector that lists all the registers that have been checkpointed (initially empty). The first time an instruction writes to a register in the speculative region, you insert a micro-operation that stores that register to a reserved region of memory. This region of memory could even be locked into the cache if you'd like. By doing this incremental checkpoint, you only checkpoint what you need (and it doesn't happen all at once). On a checkpoint restore, you can just walk the bit-vector and restore just the registers that were actually overwritten.

Now, I have no idea how Larrabee does this checkpointing. But the above approach is how I would implement it.
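In software terms, the lazy scheme described above looks something like this (a purely illustrative analogue with a hypothetical 32-entry register file, not a claim about Larrabee's actual mechanism):

#include <cstdint>

// Software analogue of an incremental register checkpoint: save a
// register's old value only on its first speculative write, recorded
// in a bit-vector, and restore only those registers on rollback.
struct LazyCheckpoint {
    uint64_t regs[32];         // the "live" register file
    uint64_t saved[32];        // checkpoint storage (could be cache-locked)
    uint32_t saved_mask = 0;   // bit i set => regs[i] already checkpointed

    void begin() { saved_mask = 0; }

    void write_reg(int r, uint64_t value) {
        if (!(saved_mask & (1u << r))) {  // first write since begin()?
            saved[r] = regs[r];           // lazily save the old value
            saved_mask |= (1u << r);
        }
        regs[r] = value;
    }

    void rollback() {
        for (int r = 0; r < 32; ++r)      // walk the bit-vector
            if (saved_mask & (1u << r))
                regs[r] = saved[r];
        saved_mask = 0;
    }
};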
 