Larrabee: Samples in Late 08, Products in 2H09/1H10

As GPU hardware becomes more and more flexible, could the products provided by ClearSpeed be interesting for graphics tasks? (I mean as a shader core)
Well, their current product is certainly very interesting for certain workloads, but its bandwidth is really minimalist *and* it doesn't have advanced latency hiding a la GPUs (one thread per core only). The CSX600 uses only a 64-bit DDR2 memory interface for 25 DP GFLOPS! This has prompted some people to say, perhaps unjustly, that it's a 'toy'. I don't really agree with that, as I'm sure some workloads wouldn't be too horribly limited by it, but it's still an issue.

One issue is that if they used something other than DDR2, power consumption for a reasonable amount of storage might actually be higher than their chip's power consumption! At the very least, increasing the bus width would have been a wise thing to do, I suspect, and ideally they'd move to DDR3 sooner rather than later... Anyway!

Also, the clock is low (~200MHz) and the product is optimised for DP computations (overkill for graphics).
I think their optimisation goal is pretty clear: they know that they can charge a lot of money per chip in the HPC market, so they optimize for perf/watt exclusively rather than perf/mm2. That's a smart thing to do in that space, but it obviously wouldn't be a very reasonable thing to do in the consumer space.

I feel like they have interesting technology, and if granted a bigger silicon budget, access to a better process, and a slightly higher power envelope, they could provide an interesting option.
Yeah, they definitely could do some cool stuff for the HPC market if they improve their current hardware iteration. As for graphics, I'm more skeptical; some of the design trade-offs should be very different and they're too small to properly focus on that market at the same time, imo. Could they do it with enough money and time? Sure, but that's the case for many companies out there, doesn't mean it'll ever happen.

Feel free to ignore this; as stated before, I'm really scared of disrupting this discussion.
Heh, don't be, I don't think anyone would complain if there were more questions like that in this thread/forum, quite the contrary! :)
 
Compared to discrete offerings, IGPs will always have a fraction of the bandwidth, and will share it with more and more CPU cores if current trends continue, especially since many techniques used in modern games tend to be very heavy on bandwidth.
I think it's obvious that the multi-core revolution has focused attention on bandwidth again. As far as I know there's not really anything technical preventing them from reaching bandwidths much closer to what we see on discrete graphics cards.
 
One of my friends (who has designed for a wide variety of foundry and internal processes) once said:
"TSMC hopes that their 65nm process will be as fast as AMD's 90nm process"
I like that kind of insight and anecdotal insider information a lot, but I just can't resist pointing out the incredible irony of that comment given that AMD's 65nm process seems slower (based on Brisbane and Barcelona's clock speeds) than their 90nm process!

I can easily believe this since TSMC does ASICs, not MPUs.
So can I, although I'm curious what process variant at TSMC we're talking about here. GPUs tend to be made on the general-purpose variant nowadays fwiw (there have been instances of GPUs using high-speed variants, but let's just say those aren't the most impressive chips ever...)

Now, given that Intel's process is always faster than AMD's let's just say that generally TSMC in terms of speed is one generation slower than Intel's nominal, i.e. TSMC@65 = Intel@90 for most generations. I'd further guess that Intel's 45nm is more than one generation ahead of TSMC's 45nm, by virtue of the new gate stack.
I'd be willing to concede that, sure, although performance really hasn't gone up that much on 65/45nm at Intel AFAICT; maybe that's just a wrong impression I have based on their chip roadmap. Do you have any idea how much faster Intel's 45nm is compared to their 65nm node?

So I could envision Intel using a smaller die and achieving their cost cutting that way. Of course, the way Intel usually plans things is that on the first generation, they max out the die size (think Itanium, P4, P6 here) and then make money once they shrink it. I bet that the latter is what Intel will do.
Amusingly, it's what NVIDIA does too nowadays: G70->G71, G80->G92, and I have no reason to believe they won't adopt that exact same strategy for the 55->45nm transition. I have actually seen AIB employees saying that NVIDIA told them they adopted Intel's tick-tock strategy...
 
I think it's obvious that the multi-core revolution has focused attention on bandwidth again. As far as I know there's not really anything technical preventing them from reaching bandwidths much closer to what we see on discrete graphics cards.
My understanding is that the burst size on graphics memories tends to be higher than on standard memories, which obviously isn't a huge problem but it should still be considered. In addition to that, there's no way in hell we're going to see GDDR5 DIMMs - and the difference in per-chip bandwidth between DDR3 and GDDR5 is about 3.5-4.5x as far as I can tell.

For IGPs, the 'easy' solution is embedded memory in my opinion (ideally Z-RAM, but hey, I'll survive if it's 'just' eDRAM). Of course, knowing me, I'd be capable of degenerating this thread into an embedded memory lovefest, so I'll just keep my mouth shut here instead...
 
One thing in the CPU manufacturers' favor is that it is much easier for programmers to migrate from two cores to four cores than from one core to two cores. That is, once you already have a program that exploits multiple cores, tuning it to run on more cores is easier than the initial conversion.

...

I suspect that programmers will either find the locks-and-threads model familiar enough or that the various GPGPU languages can be compiled to a multi-core CPU just as well. GPUs will likely still have the edge in peak performance, but real-world performance might go to the CPUs.
I don't think scaling from two cores to four is much easier than going from one core to two. After the initial learning curve your framework will likely allow you to just use four threads instead of two, but the hard part is keeping good performance behavior. With dual-core you only have one extra thread to worry about. You can split a job in two equal parts and expect it to offer a pretty good speedup. Try to split it in four, and the slowest thread will be substantially slower than the fastest thread. So coarse-grain works ok for a very low number of cores but rapidly offers no advantage at all, especially if there are extra dependencies between the tasks. Fine-grain threading works well with a high number of cores, but has too much overhead for a small number of cores. Valve found this out the hard way. Until the day that everybody has an 8+ core CPU, developers have to support single-core, dual-core, quad-core, etc., which all behave differently enough to require different approaches.

Basically, with dual-core you can get away with a lot of algorithms that scale very badly. They're often more efficient than the complicated algorithms that do scale to many cores. So right after implementing dual-core support you have to go back to the drawing board to support quad-core, and you become aware of the changes that will be required to support even more cores...
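To make the coarse-grained split described above concrete, here's a minimal sketch of my own (not from any real engine; the function names and the per-element work are invented), dividing one job into equal chunks with one C++11 thread per chunk:

```cpp
#include <thread>
#include <vector>
#include <cstddef>

// Coarse-grained split: divide one job into equal chunks, one thread each.
// Trivial to write, but nothing guarantees the chunks finish at the same
// time; with more cores, the slowest chunk increasingly limits the speedup.
void process_range(float* data, std::size_t begin, std::size_t end) {
    for (std::size_t i = begin; i < end; ++i)
        data[i] *= 2.0f;  // stand-in for real per-element work
}

void process_parallel(float* data, std::size_t count, unsigned num_threads) {
    std::vector<std::thread> workers;
    std::size_t chunk = count / num_threads;
    for (unsigned t = 0; t < num_threads; ++t) {
        std::size_t begin = t * chunk;
        std::size_t end = (t + 1 == num_threads) ? count : begin + chunk;
        workers.emplace_back(process_range, data, begin, end);
    }
    for (auto& w : workers)
        w.join();
}
```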

So I don't think that the average programmer will ever become familiar enough with locks-and-threads. Heck, using a lock for every resource and a thread for every concurrent task can easily make things slower than staying single-threaded. It requires more low-level knowledge to keep things running smoothly, something a high-level programmer doesn't want to deal with.

In my vision, the near future is frameworks that abstract locks-and-threads into dependencies-and-tasks. The distant future is compilers that analyse dependencies in software and run it on multiple threads, or new languages that are explicitly parallel (inspired by hardware description languages and/or functional languages).
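As a rough idea of what such a framework might look like underneath (purely my own sketch, using C++11 threading primitives for brevity; the class and member names are invented), here is a minimal task pool where the locks live inside the framework and application code only submits tasks:

```cpp
#include <condition_variable>
#include <functional>
#include <mutex>
#include <queue>
#include <thread>
#include <vector>

// Minimal task pool: a fixed set of workers pulls tasks from a shared queue.
// Real "dependencies-and-tasks" frameworks also track which tasks must
// complete before a task may run; that part is omitted here.
class TaskPool {
public:
    explicit TaskPool(unsigned worker_count) {
        for (unsigned i = 0; i < worker_count; ++i)
            workers.emplace_back([this] { run(); });
    }
    ~TaskPool() {
        {
            std::lock_guard<std::mutex> guard(queue_lock);
            stopping = true;
        }
        wakeup.notify_all();
        for (auto& w : workers) w.join();
    }
    void submit(std::function<void()> task) {
        {
            std::lock_guard<std::mutex> guard(queue_lock);
            tasks.push(std::move(task));
        }
        wakeup.notify_one();
    }
private:
    void run() {
        for (;;) {
            std::function<void()> task;
            {
                std::unique_lock<std::mutex> guard(queue_lock);
                wakeup.wait(guard, [this] { return stopping || !tasks.empty(); });
                if (stopping && tasks.empty()) return;
                task = std::move(tasks.front());
                tasks.pop();
            }
            task();  // run outside the lock so tasks can execute in parallel
        }
    }
    std::vector<std::thread> workers;
    std::queue<std::function<void()>> tasks;
    std::mutex queue_lock;
    std::condition_variable wakeup;
    bool stopping = false;
};
```

Dependency tracking (only running a task once the tasks it depends on have completed) would sit on top of something like this, and that is the part that actually replaces the locks in application code.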
 
Easy, no, but is there anything that makes you believe it's harder than designing hardware and writing drivers? I wrote a 'fast' DirectX 9 renderer while I was still studying (fast in relative terms, given the hardware). Frankly, I believe writing a DirectX 10 renderer is simpler because you don't have to worry about fixed-function pipelines. A lot of responsibilities shift into the hands of the game developers.

The advantage of software on a general-purpose CPU is that it's extremely modular and you can work incrementally. Whenever an approach isn't as efficient as expected, you can rewrite and test it in a matter of hours (no matter whether it's a high-level or low-level algorithm).

I know SwiftShader and it is a great proof of concept that it can be done, something I never denied. But as far as I know (I haven't checked lately) it isn't feature complete. By 'hard' I mostly meant that it will be time consuming, as Direct3D 10 requires that you implement everything.

But I am sure that Intel is already working on this problem.
 
I know SwiftShader and it is a great proof of concept that it can be done, something I never denied. But as far as I know (I haven't checked lately) it isn't feature complete.
Once you have an actual graphics pipeline (including dynamically generated shaders using SIMD) and it runs real games, adding more features is more or less just a matter of pasting them on. It's a steep learning curve to learn how the pieces fit together but I'm sure that the people working on Larrabee are long past that and can concentrate on "just coding", in the practical software engineering sense.
By 'hard' I mostly meant that it will be time consuming, as Direct3D 10 requires that you implement everything.

But I am sure that Intel is already working on this problem.
Yes, I would definitely describe it primarily as time consuming rather than hard, once you've already reached a certain level. In software development in general, the last 10% typically takes just as long to implement as the first 90%, but it shouldn't be any more complicated. It's true that Direct3D 10 is especially unforgiving about getting things exactly right, but this counts for NVIDIA and AMD as well. Furthermore, Intel has had efficient software vertex processing implementations for ages. I lost track of whether or not their latest IGPs pass Direct3D 10 conformance, but they should be able to tap experience from there as well...

Anyway, I believe ArchitectureProfessor was actually talking about other things than graphics, like physics. But precisely because it's x86 compatible they can likely take any existing library and adapt it for Larrabee in a matter of months.
 
Scatter-gather vector operations?

ArchitectureProfessor, care to speculate as to what types of memory access instructions Larrabee will have with its vector instructions? ... Even with such small vectors (2 DP, 4 SP) SSE is often performance limited by the number of instructions needed to gather/scatter from memory, swizzle, and do other non-ALU work. With such a large vector, one would think that with Larrabee they would definitely address these issues.

If you assume that Intel's engineers are fairly smart, which they are, I think the most likely answer is that they will have support for scatter-gather.

I agree that Intel is smart and thus they might address these issues by adding scatter/gather support to their vectors.

However, I'm not sure that scatter/gather maps that well to the block-based caches that Larrabee has. As a scatter/gather can have arbitrarily bad (or good) banking behavior, they are just really hard to make fast in general. If you're going to touch 16 different cache blocks (and access the cache 16 different times), it might not be any faster than doing 16 individual load instructions.
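For reference, here's what the "16 individual loads" baseline looks like in scalar code (a trivial sketch of my own; gather16 is just an illustrative name):

```cpp
#include <cstdint>
#include <cstddef>

// Scalar emulation of a 16-wide gather: one ordinary load per lane.
// If the 16 indices land in 16 different 64B cache lines, a hardware
// gather over the same addresses still has to touch 16 lines, which is
// exactly the concern raised above.
void gather16(const float* base, const std::int32_t* indices, float* dst) {
    for (std::size_t lane = 0; lane < 16; ++lane)
        dst[lane] = base[indices[lane]];
}
```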

I would guess at the minimum that Intel would have thought about this problem and added instructions that make packing/unpacking of Larrabee vectors more efficient than packing/unpacking SSE registers.

I think the big thing here is how the cache (or local store in an NVIDIA GPU) is organized. If you're doing a 64B (512-bit) read, does it have one address decoder such that all 64B must be from consecutive addresses in memory? Or does it have a decoder for each 4B column (bank) of the data array, allowing it to access many different locations. Oh, this would also require 16 tag lookups and comparisons, too. In such an organization, a scatter or gather still might take many cycles, but it would be faster than 16 individual instructions. Yet, I'm not sure that having hardware support for that many decoders and that many tag checks would be worth it.

Maybe the above is a good argument against hardware-managed caches (and in favor of scratch pad memories). :)

Let me ask a question. I can see how scatter-gather can be useful for GPGPU-type computations. Can someone say more about what graphics algorithms use scatter-gather? I could also imagine that if Intel could transform the algorithms to avoid scatter-gather, perhaps they won't need hardware support for it.
 
I agree that Intel is smart and thus they might address these issues by adding scatter/gather support to their vectors.

Here is an interesting snippet I just found after Googling around. From the Intel whitepaper on Ct:

In the future, Ct's adaptive compilation strategy will play an even more important role when new, throughput-oriented ISA extensions emerge (such as mask, cast/conversion, swizzle/shuffle, and gather/scatter).

As Ct is being used by the Larrabee team, this might be a strong hint at what Larrabee's vectors support. I have heard that Larrabee supports vector masks, so that is further evidence that they might be talking about Larrabee here.

Even if they don't support scatter/gather directly in a single instruction (which they could), it does indicate that Larrabee will at least have some way to perform scatter/gather more efficiently than earlier vector ISA extensions.
 
In my vision, the near future is frameworks that abstract locks-and-threads into dependencies-and-tasks.

The task-based model of parallelism is a really good one. As larger and larger multicores come along, I think we'll see lots of parallel code being written with task-based parallelism. Intel (and others) have certainly investigated and supported such models.

The distant future is compilers that analyse dependencies in software and run it on multiple threads, or new languages that are explicitly parallel (inspired by hardware description languages and/or functional languages).

I once heard the quip: "The two most used parallel programming languages are SQL and OpenGL". Of course, the point here is that neither is actually a parallel programming language. Yet, both are used efficiently to execute in highly parallel environments. This isn't inconsistent with your long-term view, but it isn't quite the same either.

I think domain-specific specification of parallelism is generally more successful than something too general-purpose (like functional languages). I do like your idea of using hardware description languages (or something like them) to write parallel programs. It is exactly the limitations on what you can do in a hardware description language that make it more likely that parallelizing it will succeed.

For multi-core CPU computations (which may or may not include GPGPU), I think that API calls that say "go uncompress this JPEG" or "go blend this" or whatever allow task-based parallelization under the covers in a way that most programmers won't need to think too hard about. Apple has led the way in this by providing APIs that can be accelerated on a multi-core or GPU, and I expect we'll see more of that over time. Sort of "parallelism as an API-based service" or something like that. Something like that bodes well for both multi-core and GPGPU.
 
So I don't think that the average programmer will ever become familiar enough with locks-and-threads. Heck, using a lock for every resource and a thread for every concurrent task can easily make things slower than staying single-threaded. It requires more low-level knowledge to keep things running smoothly, something a high-level programmer doesn't want to deal with.

Ok, here is a bombshell tidbit for you all. Rumor has it that Larrabee supports speculative locking. (!!!)

Speculative locking (also known as speculative lock elision, or SLE) is like a limited form of transactional memory (which was mentioned in some posts earlier in this thread) applied to locking. Instead of actually acquiring a lock, the hardware just speculates that it can complete the critical section without interference. It checkpoints its register state and starts to execute the lock-based critical section. If no conflicts occur, it just commits its writes atomically. In this non-conflict case, the processor has completed a critical section *without ever acquiring the lock*! Conflicts are detected by using cache coherence to determine when some other thread tries to access the same data (and at least one of the accesses is a write; read-read is not a conflict). On a conflict, a rollback occurs by restoring the register checkpoint and invalidating blocks in the cache that were written while speculating.

Let me give a concrete example. You have a 10,000-node hash table, but with a single lock on the whole table. This single lock is easier to program with than having a lock on each bucket, and it makes it easy to write the code for infrequent operations such as growing the hash table. With a normal lock, if a bunch of threads were all trying to do a bunch of updates to the hash table, they would be totally serialized and the lock would bounce all around the system (which is slow). Under speculative locking, all the threads go ahead and do their updates speculatively. As the odds of any two threads updating the same bucket at the same time are small, you'll likely get an almost perfectly parallel speedup out of this hash table with a single lock on it!
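In code, the situation described above is simply a coarse-grained lock like the following (a hypothetical C++ sketch, not Larrabee code; the class name and the use of std::mutex/std::unordered_map are my own illustration):

```cpp
#include <mutex>
#include <string>
#include <unordered_map>

// One coarse lock over the whole table. With ordinary locking, concurrent
// insert() calls serialize on table_lock. Under speculative lock elision the
// hardware would elide the lock, run the critical sections speculatively, and
// only fall back to really acquiring the lock when two threads touch the same
// cache lines (e.g. the same bucket).
class CoarseLockedTable {
public:
    void insert(const std::string& key, int value) {
        std::lock_guard<std::mutex> guard(table_lock);
        table[key] = value;  // usually a different bucket per thread
    }
private:
    std::mutex table_lock;
    std::unordered_map<std::string, int> table;
};
```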

Speculative locking also has the advantage of avoiding the write to the lock, and thus the overhead of obtaining write permission for it. Instead, all the processors just cache the lock in read-only mode and quickly start speculating when they would otherwise be waiting for a writeable copy of the block with the lock in it.

Anyway, this speculative locking idea is probably one of the best to come out of the research community in a long time. In parallel with the academic research, a small startup company called Azul Systems has actually implemented it in their highly parallel Java offload engines (big boxes that make Java go fast). It apparently works pretty well in that context.

If Larrabee really is going to implement this, it could really help programmers write much more efficient and scalable parallel code.

Of course, you need coherent caches for this to work. Speculative locking is just another reason to support unified shared memory with hardware cache coherence.
 
Speculative locking (also known as speculative lock elision, or SLE) is like a limited form of transactional memory (which was mentioned in some posts earlier in this thread) applied to locking. Instead of actually acquiring a lock, the hardware just speculates that it can complete the critical section without interference. It checkpoints its register state and starts to execute the lock-based critical section. If no conflicts occur, it just commits its writes atomically.

Isn't this just hardware support for LL/SC schemes? Upon attempting to store modified state, you detect whether somebody else worked on the same data and retry on conflict.

Cheers
 
Isn't this just hardware support for LL/SC schemes? Upon attempting to store modified state, you detect whether somebody else worked on the same data and retry on conflict.

It is similar, but speculative locking is much more general. In load-linked/store-conditional (LL/SC), you're acting on a single address, and checking that single address before performing the conditional store. In addition, some ISAs (such as Alpha, but not PowerPC) don't even allow any other memory operations between the LL and SC, so it is really limited to constructing simple atomic operations (swap, compare-and-swap, atomic increment), the kind that are primitives under other ISAs such as x86.
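To illustrate the difference in scope, this is roughly the kind of single-address operation LL/SC (or its x86 compare-and-swap equivalent) is meant for, written here with C++11 atomics as a stand-in (my own sketch):

```cpp
#include <atomic>

// A single-address retry loop: keep trying the compare-and-swap until no
// other thread has modified the counter in between. LL/SC gives you exactly
// this kind of one-word atomic update.
int atomic_increment(std::atomic<int>& counter) {
    int old_value = counter.load();
    while (!counter.compare_exchange_weak(old_value, old_value + 1)) {
        // old_value was reloaded by compare_exchange_weak; just retry.
    }
    return old_value + 1;
}
```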

In speculative locking (and transactional memory), you can start speculating and perform hundreds or thousands of instructions on hundreds of cache blocks. For example, a search of a binary tree or a lookup in some more sophisticated linked-list or graph structure. If any other processor modified something you've read (or read something you've modified) it detects it and recovers by doing the right thing.

The key observation is that locks are mostly not needed, but they have to be used just in case.

Maybe a good analogy is a traffic signal or stop sign. Most of the time we could all just ignore the traffic signal and no accident would occur. We could all just drive right through. Yet, we use traffic signals to slow us down, even in the common case of no other traffic, just in case, to prevent "conflicts" in the intersection (i.e., accidents). Speculative locking and transactional memory (TM) are like adding a magic ability for cars to "roll back and recover" from a crash. If a crash occurs in the intersection, the cars that crashed just revert back in time to their pre-crash state and try again.

Of course, if the traffic is so high at the intersection that you'll always crash, falling back on the traffic signals is probably the right thing to do (which is exactly what speculative locking can do; it always has the option of falling back on acquiring the lock in such cases).
 
I once heard the quip: "The two most used parallel programming languages are SQL and OpenGL". Of course, the point here is that neither is actually a parallel programming language. Yet, both are used efficiently to execute in highly parallel environments. This isn't inconsistent with your long-term view, but it isn't quite the same either.
OpenGL and such is actually what I had in mind for the short-term frameworks. Shaders are tasks, and the dependencies are between the input and output buffers of multiple 'passes' you execute on the data.

Application programmers might actually find it easier to use a software implementation of OpenGL or Direct3D to do some multi-threaded, SIMD-optimized processing than to roll their own framework. Furthermore, people are already getting familiar with using shaders (like you say, it's one of the most used parallel programming languages). They just lack faster alternatives to texture sampling (and CPUs lack scatter/gather).

Introducing new languages and programming paradigms that are inherently parallel is going to be a lot harder and will take much longer to introduce (that's why I consider it the long-term future), but it's the only option to keep things scaling into the poly-core era.
I do like your idea of using hardware description languages (or something like them) to write parallel programs. It is exactly the limitations on what you can do in a hardware description language that make it more likely that parallelizing it will succeed.
Yes, I had this idea when working with SystemC. It's not that hard to program with, and it could easily scale to hundreds of threads. If it were a C++ extension instead of an actual C++ framework and the overhead were a lot lower, it would be an attractive way to do multi-core programming. I've seen interesting movement from C# towards multi-core programming as well.
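For readers who haven't seen it, a SystemC program in this style looks roughly like the following (a toy sketch of my own; module and port names are arbitrary): every SC_THREAD is an independent process and all communication goes through explicit channels, which is what keeps the dependencies visible.

```cpp
#include <systemc.h>

// Two processes connected by a FIFO channel, in hardware-description style:
// no shared state, only reads and writes on the channel.
SC_MODULE(Producer) {
    sc_fifo_out<int> out;
    SC_CTOR(Producer) { SC_THREAD(run); }
    void run() { for (int i = 0; i < 16; ++i) out.write(i * i); }
};

SC_MODULE(Consumer) {
    sc_fifo_in<int> in;
    SC_CTOR(Consumer) { SC_THREAD(run); }
    void run() { for (int i = 0; i < 16; ++i) std::cout << in.read() << std::endl; }
};

int sc_main(int, char*[]) {
    sc_fifo<int> channel(4);          // bounded FIFO between the processes
    Producer producer("producer");
    Consumer consumer("consumer");
    producer.out(channel);
    consumer.in(channel);
    sc_start();                       // run until no more activity
    return 0;
}
```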
 
Speculative locking (also known as speculative lock elision, or SLE) is like a limited form of transactional memory (which was mentioned in some posts earlier in this thread) applied to locking. Instead of actually acquiring a lock, the hardware just speculates that it can complete the critical section without interference. It checkpoints its register state and starts to execute the lock-based critical section. If no conflicts occur, it just commits its writes atomically. In this non-conflict case, the processor has completed a critical section *without ever acquiring the lock*! Conflicts are detected by using cache coherence to determine when some other thread tries to access the same data (and at least one of the accesses is a write; read-read is not a conflict). On a conflict, a rollback occurs by restoring the register checkpoint and invalidating blocks in the cache that were written while speculating.
That sounds truly fascinating! I wonder though, how does this compare to lock-free software algorithms? They too don't acquire a real lock and check for 'collisions' before committing their results, or roll back and try again. That has some overhead and is harder to program than having SLE support, but I can't imagine that the hardware cost is anywhere near free either. Also, since the programmers of Larrabee are very experienced, I'm not sure it's necessary to make it software friendly to that degree (or will it be opened to the public)?

And if you have control over the thread scheduler, as I assume is the case on Larrabee, why not ensure that threads don't get preempted while they're in a contested critical section? I can see some benefit of SLE with tons of tiny shared resources, but there must be some kind of compromise, like locking them in batches or hierarchical locking or something like that? With the right granularity you can have few locks that are only lightly contested, without complicated hardware support.
 
So I don't think that the average programmer will ever become familiar enough with locks-and-threads. Heck, using a lock for every resource and a thread for every concurrent task can easily make things slower than staying single-threaded. It requires more low-level knowledge to keep things running smoothly, something a high-level programmer doesn't want to deal with.

Perhaps not now, but the average programmer of the future (? years from now) will have to. Even on single-processor machines, it hasn't been a good idea to stay single-threaded (or single-process) since the advent of time sharing. Eventually your single thread/process is going to block on I/O (perhaps even just a virtual memory page read), and the CPU will go idle (when you could still be doing useful work).

While threaded (or multi-process, via fork) programming might be a relatively "new" topic for game developers, it has been a way of life for most "server" programmers for a long time. There is a definite trend towards lighter-weight locking as well. Linux, for example, provides the futex system call, where locking is done purely in userland (read: extremely fast), and an actual system call (read: slow) is only made on contention. BTW, while unix related, here's a great overview of methods for concurrent programming: http://www.kegel.com/c10k.html.
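To make the futex point concrete, here is a deliberately simplified C++ sketch of that idea (Linux-specific, loosely modelled on the usual three-state futex mutex; the class name is mine, it assumes std::atomic<int> is lock-free and address-compatible with the kernel's futex word, and a production lock needs more care than this):

```cpp
#include <atomic>
#include <linux/futex.h>
#include <sys/syscall.h>
#include <unistd.h>

// States: 0 = unlocked, 1 = locked, 2 = locked with (possible) waiters.
// The uncontended path is a single atomic operation in userland; the kernel
// is only entered (FUTEX_WAIT / FUTEX_WAKE) when there is contention.
class FutexLock {
public:
    void lock() {
        int c = 0;
        if (state.compare_exchange_strong(c, 1))
            return;                                   // fast path: no syscall
        if (c != 2)
            c = state.exchange(2);                    // announce a waiter
        while (c != 0) {
            // Sleep only if the lock is still in state 2; the kernel
            // rechecks the value atomically before blocking.
            syscall(SYS_futex, &state, FUTEX_WAIT, 2, nullptr, nullptr, 0);
            c = state.exchange(2);
        }
    }
    void unlock() {
        if (state.exchange(0) == 2)                   // a waiter may be asleep
            syscall(SYS_futex, &state, FUTEX_WAKE, 1, nullptr, nullptr, 0);
    }
private:
    std::atomic<int> state{0};
};
```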

Back to Larrabee,

When you say "speculative locking", are you not just referring to the "speculative multithreading/thread-level speculation" in Intel's "Mitosis" system? I wonder how much extra die space gets taken up by the adjustments to the L1 cache to support all the record keeping necessary.

So we are talking about tagging cache lines when they are read/written by a given thread, and, when the caches are later checked for concurrency, if a conflict occurs, invalidating all the cache lines held by any threads which have ordering conflicts and terminating execution of those threads. ISA changes are required to add instructions to start a speculative thread and then later validate that a thread did not get squashed.
 
Perhaps not now, but the average programmer of the future (? years from now) will have to.

I agree 100%. We've actually been trying to fit multithreaded/parallel programming into our undergraduate curriculum somewhere. When it is taught at the university level, it seems to be message passing using MPI over a cluster. That might be good for the small niche of scientific computing, but I really think that students should be exposed to shared-memory multithreaded programming. This is not only the default model for multi-core CPUs, it is also the model that has been used successfully in the server space for a long time (as you pointed out), and students should get some hands-on experience with it.

Sun has actually given us (and several other universities) some Niagara T2000 machines (eight cores, four hardware threads each) to give the students something with enough hardware contexts on which they can really practice multithreaded programming. Right now it looks as if I'm just going to shoehorn it into the undergraduate architecture course, which is already overfull with material. :(

Oh, at the graduate level we have a course that teaches GPU programming (taught mostly from the point of view of graphics rather than GPGPU). NVIDIA is nice enough to donate GPU boards for that class (and fund some other graphics research at my university, too).

When you say "speculative locking", are you not just referring to the "speculative multithreading/thread-level speculation" in Intel's "Mitosis" system?

Thread-level speculation (TLS) is actually an older idea than speculative locking, and they do use similar mechanisms. I think the key difference is that speculative locking uses the familiar threads-and-locks model, but makes it faster by (1) reducing lock contention and (2) reducing lock acquisition overhead. In TLS, you're actually trying to take a sequential program (such as a loop that isn't obviously parallel for all inputs) and run it in parallel speculatively. In TLS, the programmer's model is still just a single-threaded application. Both are cool, and hybrids are possible. However, I think Larrabee is going after speculative locking (SLE) rather than TLS.

I wonder how much extra die space gets taken up by the adjustments to the L1 cache to support all the record keeping necessary.

It isn't trivial, but it likely isn't outrageous. You track conflicts on a per-block (64B for Larrabee) basis, so you only need a pair of bits (per thread context) for each 512 bits of data in the cache; that's roughly 0.4% per context, so still only a few percent even with several hardware contexts per core. But you do need a way to flash-clear these bits (on aborts and commits), so they aren't just totally standard SRAM cells (though probably not as expensive as a CAM cell). Other than that, I think the biggest cost is in terms of design complexity and design verification.

ISA changes are required to add instructions to start a speculative thread and then later validate that a thread did not get squashed.

The original SLE paper had the idea of inferring the locks from the executing program at runtime (by looking for common lock idioms). That is, the original SLE proposal was completely transparent to the software: no extra instructions needed. In practice, the software might need to use a lock that it knows the hardware will recognize and thus optimize, but it wouldn't actually require new instructions. Of course, you could add "acquire lock" and "release lock" instructions, but that isn't strictly necessary.
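For concreteness, the kind of lock idiom meant here is just an ordinary exchange-based spinlock like the one below (a generic C++ sketch, not an actual Larrabee-recognized sequence): the acquiring read-modify-write stores a "locked" value and the release restores the original one, and that paired pattern is what the SLE proposal detects and elides.

```cpp
#include <atomic>

// Classic exchange-based spinlock. SLE-style hardware can recognize the
// acquire's atomic exchange plus the release's restoring store, elide both,
// and speculate through the critical section instead.
class SpinLock {
public:
    void lock() {
        while (flag.exchange(1, std::memory_order_acquire) != 0) {
            // spin until the previous value was 0 (unlocked)
        }
    }
    void unlock() {
        flag.store(0, std::memory_order_release);  // restore "unlocked"
    }
private:
    std::atomic<int> flag{0};
};
```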
 
However, I'm not sure that scatter/gather maps that well to the block-based caches that Larrabee has. As a scatter/gather can have arbitrarily bad (or good) banking behavior, they are just really hard to make fast in general.
Scatter/gather vector operations do not need to be made 'fast'. They are not there for that purpose; they are there to make extracting memory-level parallelism cheap. With a 16-slot vector register you can issue up to 16 parallel memory loads or stores at the decoding/tracking cost of a single instruction. Sixteen independent loads or stores would require significantly more resources for decoding, executing and tracking.
If you're going to touch 16 different cache blocks (and access the cache 16 different times), it might not be any faster than doing 16 individual load instructions.
It might not be faster, but it's certainly much cheaper from both a power and a transistor-budget perspective. Besides, you do not really need to execute all of those loads/stores at once; you can execute them in a pipelined fashion, possibly overlapping them with ALU operations. Vector processors have always worked this way, offering >512-slot vector registers with a significantly narrower execution path for both ALU and memory operations.
I think the big thing here is how the cache (or local store in an NVIDIA GPU) is organized. If you're doing a 64B (512-bit) read, does it have one address decoder such that all 64B must be from consecutive addresses in memory? Or does it have a decoder for each 4B column (bank) of the data array, allowing it to access many different locations.
Most modern GPUs can sustain a large number of outstanding memory operations; that's why they are able to extract huge amounts of bandwidth while covering the very high latencies of GDDRx memory.
Let me ask a question. I can see how scatter-gather can be useful for GPGPU-type computations. Can someone say more about what graphics algorithms use scatter-gather?
In a word: texturing. Texture access patterns are not regular especially when using higher levels of anisotropic filtering.
 