Larrabee: Samples in Late 08, Products in 2H09/1H10

When would a software managed cache be faster? (an honest question)

When you optimize the SW-managed cache to use the complete cache size and don't use the cache for other memory fetches. In that case, you prevent pollution of the performance-critical data / the data for which coherency is known.

A lot of CUDA algorithms use exactly this principle: explicitly load a blob of data into shared memory, process it while using external memory for unpredictable, non-coherent fetches (either via regular global memory fetches or via the texture cache), then write the result back to memory and move on to the next blob.
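
To make that concrete, here is a minimal CUDA sketch of the pattern; the kernel name, the 256-thread block size, and the dummy computation are mine, purely for illustration:

```cuda
// A minimal sketch of the pattern described above; the kernel name, the 256-thread
// block size, and the dummy computation are all made up for illustration.
__global__ void process_blobs(const float* in, float* out, int n)
{
    __shared__ float tile[256];                    // the explicitly managed "cache"
    int i = blockIdx.x * 256 + threadIdx.x;

    // 1) Explicitly stage a blob of data into shared memory.
    if (i < n) tile[threadIdx.x] = in[i];
    __syncthreads();

    // 2) Process it; unpredictable, non-coherent fetches would still go through
    //    global memory or the texture cache here.
    // 3) Write the result back to global memory; the next blob is the next block.
    if (i < n) out[i] = tile[threadIdx.x] * 2.0f;  // stand-in for real work
}
// host-side launch: process_blobs<<<(n + 255) / 256, 256>>>(d_in, d_out, n);
```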

By SW managed cache, we basically mean: just a shared memory that can be accessed by all threads of a multi-processor. There's nothing in HW really that makes it look like a cache.
 
Right, the idea is a SW-managed cache *isn't* a cache, so it can have lower latency by avoiding the tag comparisons and other work that goes into fetching data from a real cache. Nvidia claims the PDC is "as fast as registers", which appears to be true enough to use as a guideline anyway.

This requires that the algorithm be such that the location of a datum can be known a priori -- otherwise you have to perform some kind of address mapping in SW which would be slower than the fixed-function lookups provided by real caches.
 
I'm scared, as I wouldn't want my comment to disrupt such an interesting discussion between people with so much knowledge.

Anyway a question burns my Boeotian lips :oops:

We're speaking about x86 cores doing GPU jobs and the other way around.
My short question is:
As GPU hardware becomes more and more flexible, could the products provided by ClearSpeed be interesting for graphics tasks (I mean as a shader core)?
One of their chips is capable of 33 GFLOPS in DP with a transistor budget of 128 million.
They aim for really low power consumption, hence the low clock (~200 MHz), and the product is optimised for DP computations (overkill for graphics).
Moreover, the process they use so far is 130 nm, so they have some legs here, given a slightly higher power envelope.

I feel like they have interesting technology, and granted a bigger silicon budget, access to a better process, and a slightly higher power envelope, they could provide an interesting option.

Feel free to ignore this; as stated before, I'm really scared of disrupting the discussion.
 
A. Are there any rules that say Intel must give anyone their masks? Can't they just keep them fully in house?

B. In addition, anything that is patented is still under patent, right? Just because the specific mask loses protection, does that invalidate the patent?

C. AMD and Intel have an IP cross-licensing agreement. Yet AMD is behind Intel right now. Why? Well, it isn't for intellectual property reasons.

A. What is the idea behind a patent? Simple! You give up the secrecy of something, the subject of the patent, and the government gives you a monopoly over that idea for a number of years. So, when you patent something, you explain the process or idea in your application filing.

To get an idea: when you patent a protein, you can't state its full structure on paper. It's too much information. What you do is send samples to depositories in several places in the world, where they are stored. If you have a serious intention to study that invention, a sample can be provided to you.

B. Yes, after those 10 years you don't have any rights over that mask. A patent will be under scrutiny at each respective national patent institute, and if it doesn't show any innovation, or just reproduces something in the public domain, you won't get that right to protection. Of course, we don't expect Intel to come up with crackpot ideas.

BTW, it is 10 years after the filing of the application that it becomes public domain. And the monopoly right is only granted a few years after the filing, due to the characteristic slowness of the examination. But the right to license a technology goes back to the day of the filing. That means that after you get your right, you can sue everyone who copied your idea. No one assumes that an Intel, IBM or AMD patent won't be granted.

C. To tell you the truth, I don't really know the details of such agreements. I don't know if there is any legislation that makes it illegal to ask for patent royalties after the expiration of a patent, in the case of an agreement made before the expiration. I will have to check that. But one thing is sure: third parties are certainly not under this obligation. In the case of NVIDIA, they can surely have their little x86-64 inside their GPUs beginning Jan. 1st, 2010, if the filing was done in 1999, but I am not certain about that date. I guess it was even earlier.
 
Obviously, the reason why I gave that low-k example, though, is that presumably NVIDIA concluded the extra performance/power efficiency wasn't worth the cost. Whether that's right or not is another question, but they certainly took their sweet time to adopt it - in fact, they only did so when TSMC made it mandatory!

You're perfectly right that the performance advantage has to be included in the cost-efficiency calculations though, which I forgot to do in my previous post (oops!)

I mostly agree with your skepticism there, so let me take another example: the 111mm² Allendale used in the $113 and $133 E4400 and E4500 SKUs. AFAIK, the E2xxx Series uses a >=80mm² chip with even less cache, so I don't think there are many (if any) other SKUs using Allendale.

Assuming ASPs of $120 and gross margins of 60% (the latter is highly optimistic, I suspect it's nearer 50%, but so be it...), that gives us 40% lower costs. Which is still ~3x higher...

I agree with the ~1.3x frequency advantage (although I'm not sure 'at the same power' is accurate, but no matter), but I'd wager a 2x cost benefit from better defect management is really optimistic in this case.

With Allendale we're talking about a relatively small chip on an extremely mature process (more so than TSMC's 65nm) that still has a fair bit of cache, and it isn't really pushing the envelope in terms of clock speeds or TDP binning. In addition to that, my 2x estimate was (as I mentioned when I gave it) probably substantially too optimistic.

All in all, I'd certainly expect the cost difference to be much smaller than my on-the-back-of-the-envelope 5x calculation. Perhaps in the '20-50% less expensive' kind of range for TSMC wafers, excluding the performance disadvantage. This is all highly approximative and it's really hard to tell though, so take this with a lot of salt and please don't kill me if you disagree with those numbers! :)

So one thing I forgot to mention. I agree that TSMC's wafers are inexpensive, but the quality of the process is also much lower.

One of my friends (who has designed for a wide variety of foundry and internal processes) once said:

"TSMC hopes that their 65nm process will be as fast as AMD's 90nm process"

I can easily believe this since TSMC does ASICs, not MPUs. Now, given that Intel's process is always faster than AMD's, let's just say that generally TSMC in terms of speed is one generation slower than Intel's nominal, i.e. TSMC@65 = Intel@90 for most generations. I'd further guess that Intel's 45nm is more than one generation ahead of TSMC's 45nm, by virtue of the new gate stack.

So I could envision Intel using a smaller die and achieving their cost cutting that way. Of course, the way Intel usually plans things is that on the first generation, they max out the die size (think Itanium, P4, P6 here) and then make money once they shrink it. I bet that the latter is what Intel will do.


RE: Software versus hardware managed caches:

SW managed is going to be lower power, since it's effectively behaving like a 2nd level set of registers. You only access the particular cells you need, and you don't do the tag check. It could be faster since you don't have any TLB on the critical path or the tag check.

DK
 
When you optimize the SW-managed cache to use the complete cache size and don't use the cache for other memory fetches. In that case, you prevent pollution of the performance-critical data / the data for which coherency is known.

Once you have hardware caches, you can always just "block" your algorithm to use a consecutive block of memory (just smaller than the size of the cache). After one iteration the hardware cache will basically act like a software-managed cache. If you want, you can even pre-walk the block of memory to quickly sequentially fetch it into the primary data cache. So, if the hardware cache has almost the same latency as a software-managed scratchpad memory (see my next post), then you can always use it like a scratchpad memory.
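
For what it's worth, a rough sketch of that blocking idea in host code (the tile size and the dummy work are placeholders I picked, nothing Larrabee-specific):

```cuda
// A rough sketch of the blocking idea in host code; the tile size and the dummy
// work are placeholders, not anything Larrabee-specific.
#include <vector>

const int TILE_BYTES = 8 * 1024;                 // chosen smaller than the cache

void process(float* data, int n)
{
    const int tile = TILE_BYTES / (int)sizeof(float);
    for (int base = 0; base < n; base += tile) {
        int end = (base + tile < n) ? base + tile : n;
        volatile float sink = 0.0f;
        for (int i = base; i < end; ++i)         // optional pre-walk: stream the tile in
            sink = sink + data[i];
        for (int i = base; i < end; ++i)         // later passes now hit in the cache,
            data[i] *= 2.0f;                     // much like a scratchpad would
    }
}

int main()
{
    std::vector<float> v(1 << 20, 1.0f);
    process(v.data(), (int)v.size());
    return 0;
}
```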

But the advantage of hardware-managed caches is that they also support a much more dynamic style of caching. Such dynamic caching works pretty well in CPUs and provides a model in which you don't need to worry about software-managed caching and the explicit copy-in/copy-out operations. Plus, you can do cache coherence and really fine-grained synchronization using shared memory locations.
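
A toy host-side sketch of that last point (names and values made up): with hardware coherence, a plain flag in memory is enough to hand a result from one core to another, with no explicit copy-in/copy-out step.

```cuda
// Two host threads handing off data through a coherent shared flag (illustrative only).
#include <atomic>
#include <cstdio>
#include <thread>

std::atomic<bool> ready(false);
int payload = 0;

void producer() { payload = 42; ready.store(true, std::memory_order_release); }

void consumer()
{
    while (!ready.load(std::memory_order_acquire)) { }   // spin; coherence delivers the update
    std::printf("saw %d without any explicit transfer\n", payload);
}

int main()
{
    std::thread a(producer), b(consumer);
    a.join(); b.join();
    return 0;
}
```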
 
Right, the idea is a SW-managed cache *isn't* a cache, so it can have lower latency by avoiding the tag comparisons and other work that goes into fetching data from a real cache. Nvidia claims the PDC is "as fast as registers", which appears to be true enough to use as a guideline anyway.

SW managed is going to be lower power, since it's effectively behaving like a 2nd level set of registers. You only access the particular cells you need, and you don't do the tag check. It could be faster since you don't have any TLB on the critical path or the tag check.

Intel has lots of experience building fast and low-power caches. Let me address these two issues separately (although they are intertwined).

First, latency. The full tag check can be done off the critical path. That is, the tag array lookup is done in parallel with the data array lookup. In a direct-mapped cache, this is easy to do. For a set-associative cache you need to know where to look (or do a *really* wide lookup in the data array, which is not low power). One way to do this is to have a partial tag array. A partial tag array keeps just enough bits to know where in the data array to look in most cases. This partial tag array is really small, so it can be accessed really quickly before the data array lookup. Such a structure can be highly accurate and low power. You still need to do the full tag and TLB lookups, but these can be done in parallel with the data array access.
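
Here's a toy software model of that partial-tag idea, just to show the flow; the widths and entry counts are invented, not Intel's:

```cuda
// A toy model of the partial-tag ("way predictor") idea. The tiny partial-tag array
// picks the likely way before the data array is read; the full tag compare and the
// TLB lookup run in parallel and confirm (or veto) it.
#include <cstdint>
#include <cstdio>

const int WAYS = 8;

struct Set {
    std::uint8_t  partial_tag[WAYS];   // e.g. only 8 low bits of each way's tag
    std::uint32_t full_tag[WAYS];      // full tags, checked off the critical path
};

int predict_way(const Set& s, std::uint32_t tag)
{
    std::uint8_t p = (std::uint8_t)tag;         // the few bits kept in the partial array
    for (int w = 0; w < WAYS; ++w)
        if (s.partial_tag[w] == p) return w;    // likely hit: read this way's data now
    return -1;                                  // predicted miss
}

int main()
{
    Set s = {};
    s.partial_tag[3] = 0xAB;
    s.full_tag[3]    = 0x12AB;
    int w = predict_way(s, 0x12AB);
    bool hit = (w >= 0) && (s.full_tag[w] == 0x12AB);    // full tag compare confirms it
    std::printf("predicted way %d, confirmed hit: %d\n", w, (int)hit);
    return 0;
}
```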

Another option is to just add a cycle of latency to the cache lookups (but still have it fully pipelined). Between smart compiler scheduling (putting independent instructions in the shadow of the load) plus the hardware multithreading, I suspect the CPUs will still be fully utilized even if the cache has an extra cycle of latency (remember, GPU is about overall throughput and not so much about the latency of an individual thread).

Ok, what about the power implications? As tags are only a few dozen bits, doing a tag lookup and comparison is small in comparison to reading a 64-byte (512-bit) vector register out of the cache. The same is true of the TLB lookup. Especially as these lookups can be done off the critical path, the transistors used for those lookups can be slower (and thus use less dynamic and static power).

So, yes, a hardware-managed cache might have a little higher latency and use just a little more power. I think this price is a reasonable one to pay to avoid explicit memory transfer operations and to have the option of the programming models enabled by hardware caches and global on-chip cache coherence.
 
In addition to that, my 2x estimate was (as I mentioned when I gave it) probably substantially too optimistic.

Yea, the 2x for defect tolerance was too optimistic (but optimistic in a way that makes your earlier estimate more conservative, which I appreciate).


All in all, I'd certainly expect the cost difference to be much smaller than my on-the-back-of-the-envelope 5x calculation. Perhaps in the '20-50% less expensive' kind of range for TSMC wafers, excluding the performance disadvantage. This is all highly approximative and it's really hard to tell though, so take this with a lot of salt and please don't kill me if you disagree with those numbers! :)

I think we've done a good job of exploring some of the process issues between TSMC and Intel. I suspect there is a lot more known about TSMC's process (just because of its foundry business model), whereas Intel is likely a bit more vague about the specifics of its 45nm. For example, Intel won't say what metal alloy they use for their metal gates in 45nm.

I really don't think Intel wants to use their most advanced fab process for GPUs. But if push comes to shove, they can likely make some GPUs on their top-end fab with some extreme binning just to prove they can make a leading-edge part.
 
When I said "big-10", I actually meant schools in the Big Ten. I realize it was ambiguous.

That said, your list of two schools is a good list of some of the top CE schools. I wouldn't include Harvard, but I would include Princeton. For computer architecture specifically, I also include UC-San Diego (UCSD) and Georgia Tech.

That's right, I forgot about the southern campus of the UW and Georgia Tech ; )

DK
 
Intel has lots of experience building fast and low-power caches. Let me address these two issues separately (although they are intertwined).

First, latency. The full tag check can be done off the critical path. That is, the tag array lookup is done in parallel with the data array lookup. In a direct-mapped cache, this is easy to do. For a set-associative cache you need to know where to look (or do a *really* wide lookup in the data array, which is not low power). One way to do this is to have a partial tag array. A partial tag array keeps just enough bits to know where in the data array to look in most cases. This partial tag array is really small, so it can be accessed really quickly before the data array lookup. Such a structure can be highly accurate and low power. You still need to do the full tag and TLB lookups, but these can be done in parallel with the data array access.

A cache for the cache controller, that would naturally work. And TLB has to hit before you access the address in physically addressed caches...

Another option is to just add a cycle of latency to the cache lookups (but still have it fully pipelined). Between smart compiler scheduling (putting independent instructions in the shadow of the load) plus the hardware multithreading, I suspect the CPUs will still be fully utilized even if the cache has an extra cycle of latency (remember, GPU is about overall throughput and not so much about the latency of an individual thread).

Ok, what about the power implications? As tags are only a few dozen bits, doing a tag lookup and comparison is small in comparison to reading a 64-byte (512-bit) vector register out of the cache. The same is true of the TLB lookup. Especially as these lookups can be done off the critical path, the transistors used for those lookups can be slower (and thus use less dynamic and static power).

My understanding is pretty much the opposite of what you've said. Could you explain a little more?

TLB lookup is almost always a single cycle and is therefore almost always on the critical path (for the L1 of any processor; GPUs can probably get off easier). The only time it isn't is when you use a cache that can be addressed virtually due to the associativity*page size trick.

L2 TLB probably has some more flexibility given the much longer access times.

Keep in mind that TLBs are CAMs, not SRAM arrays, so are much much more expensive to access.

So, yes, a hardware-managed cache might have a little higher latency and use just a little more power. I think this price is a reasonable one to pay to avoid explicit memory transfer operations and to have the option of the programming models enabled by hardware caches and global on-chip cache coherence.

So I totally agree with you, I just think the costs are slightly higher than you stated.
 
AP said:
The back of the envelope calculation that I gave was just an estimate. Such a memory system can be modeled in more detail using a simple closed queuing model. This handles just that interaction between latency and bandwidth that you're worried about. The number of customers is the number of threads in the system, the queue resource is the DRAM bandwidth. In fact, this model is likely a worst case estimate, because with memory scheduling, the more threads that are queued up, the more efficient the memory scheduling (various banks, open pages, etc.).
I fear we may be talking past each other here.

The point I was trying to convey was that, even if the memory latency for 1 thread is 50 ns, you cannot assume that the latency for N threads is also 50 ns (per thread), for large-ish N. If you're not careful about how you construct your access patterns (between threads!) while trying to maximize bandwidth, you can easily be off by an order of magnitude in terms of latency.

And when you add latency, you need to add yet more threads to cover that latency, which can in turn add more latency and so on.

A simple queuing model will not necessarily show this, but a simple DRAM model will.

Building a system to deal with ~1000 threads is somewhat more expensive than one for ~100 threads.
 
I fear we may be talking past each other here.

The point I was trying to convey was that, even if the memory latency for 1 thread is 50 ns, you cannot assume that the latency for N threads is also 50 ns (per thread), for large-ish N. If you're not careful about how you construct your access patterns (between threads!) while trying to maximize bandwidth, you can easily be off by an order of magnitude in terms of latency.

And when you add latency, you need to add yet more threads to cover that latency, which can in turn add more latency and so on.

A simple queuing model will not necessarily show this, but a simple DRAM model will.

Building a system to deal with ~1000 threads is somewhat more expensive than one for ~100 threads.

Here's a good question:

In general, CPUs tend to have very bursty behavior - i.e. long idle periods, followed by a lot of activity over the bus.

Is that the case for GPUs, or do the larger number of threads effectively average everything out?

I'd suspect the latter, but I don't have the monitoring tools necessary to investigate.

David
 
TLB lookup is almost always a single cycle and is therefore almost always on the critical path (for the L1 of any processor; GPUs can probably get off easier). The only time it isn't is when you use a cache that can be addressed virtually due to the associativity*page size trick.

The associativity*(page size) trick is used quite a bit. For those that don't know the trick: if the cache's associativity * the page size is greater than or equal to the size of the cache (that is, each way is no bigger than a page), then the "index" into the cache is the same for both the virtual and physical addresses. That is, it uses just the bits from the "page offset" of the address, which is independent of the virtual->physical page mapping. In such a system, you can just index the cache with the virtual address without any worries, since those bits match the physical address anyway.

For example, I think Intel's Core 2 Duo has an 8-way set associative, 32-Kbyte cache, and x86 uses 4KB pages. Does it really need to be 8-way for reducing conflict misses? Likely not, but by making it eight-way you can play the trick you mentioned.

Even if you can't play this trick, you can still do a data array lookup in parallel. As with the "partial tag" (or "way predictor" as it is sometimes called) I described above, you can pretty accurately grab the correct block from the cache. In parallel with that is where you do the TLB access and tag lookup. You then use the information from both the tags and the TLB to see if your prediction was correct. Of course, it really isn't a guess. It isn't a predictor as much as a lookup table that can be wrong from time to time.

This sort of speculatively indexed cache can hide the tag and TLB latency. I'm not actually sure which (if any) processors use such tricks. I think the Pentium 4 might have (it used tons of tricks like this to get high clock frequency), but perhaps the Core 2 Duo just does the brute-force thing and does the tag lookup first and then accesses the cache array. A tag array for a 32KB cache with 32-byte blocks is only 4KB or so. Only 2KB for the 64-byte blocks used by Larrabee. Perhaps at a GHz or so in 45nm it just doesn't add that much latency. I dunno.
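
A quick back-of-the-envelope check of those numbers; the ~4 bytes per tag entry is my assumption (a ~24-bit tag for a roughly 36-bit physical address, plus valid/state bits):

```cuda
// Sanity-checking the figures quoted above.
#include <cstdio>

int main()
{
    const int cache_bytes = 32 * 1024;
    const int ways        = 8;
    const int page_bytes  = 4 * 1024;

    // The VIPT trick: each way fits within a page, so the set index lies entirely
    // in the page offset and is the same for virtual and physical addresses.
    std::printf("way size %d B <= page size %d B? %s\n",
                cache_bytes / ways, page_bytes,
                (cache_bytes / ways <= page_bytes) ? "yes" : "no");

    // Tag array size: one ~4-byte entry (assumed) per cache block.
    std::printf("tag array: ~%d KB (32 B lines), ~%d KB (64 B lines)\n",
                (cache_bytes / 32) * 4 / 1024,      // ~4 KB
                (cache_bytes / 64) * 4 / 1024);     // ~2 KB
    return 0;
}
```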

L2 TLB probably has some more flexibility given the much longer access times.

Do you mean the TLB at the L2 cache? Or the second-level TLB used by some x86 cores? If you mean the TLB at the L2 cache, there is no such beast. Once you've translated a virtual address to a physical address as part of the first-level cache access, you don't need to do any sort of translation at the second level. If you meant the second-level TLB, then sure, but it isn't accessed that frequently anyway.

Keep in mind that TLBs are CAMs, not SRAM arrays, so are much much more expensive to access.

Under x86, TLBs aren't actually architected state. They are just caching page table entries managed completely by the hardware. That means that each x86 chip can cache these entries any way it wants, including using set-associative TLBs. That is, a TLB just looks like a smaller cache (that is, it uses SRAMs and not CAMs). I think the Opteron has a four-way set associative TLB, but I can't recall what the Core 2 does. This flexibility also allows the use of a two-level TLB, in which the second-level TLB is larger and prevents a more expensive hardware page table walk.

Note that under some systems with software-managed TLBs, the size and configuration of the TLB is visible to the operating system. Interestingly, the SPARC chips have gradually migrated from software-managed TLBs (in which the software walks the page tables) to a hardware-managed TLB (in which the hardware walks the page tables). Most of the RISC ISAs (for example, MIPS, Alpha, SPARC) used software managed TLBs. At the time, x86's hardware managed table was considered a bad idea. But now hardware TLBs are back in style again.

So I totally agree with you, I just think the costs are slightly higher than you stated.

I may be underestimating the costs of such caches. They certainly aren't free. However, I would say that most of the readers of this board greatly underestimate the benefit of dynamic caching of data and hardware-managed cache coherence. I think this design decision is at the heart of Larrabee, and it is one of Larrabee's key advantages over everything else that has come before it in this space.
 
The associativity*(page size) trick is used quite a bit. For those that don't know the trick: if the cache's associativity * the page size is greater than or equal to the size of the cache (that is, each way is no bigger than a page), then the "index" into the cache is the same for both the virtual and physical addresses. That is, it uses just the bits from the "page offset" of the address, which is independent of the virtual->physical page mapping. In such a system, you can just index the cache with the virtual address without any worries, since those bits match the physical address anyway.

For example, I think Intel's Core 2 Duo has an 8-way set associative, 32-Kbyte cache, and x86 uses 4KB pages. Does it really need to be 8-way for reducing conflict misses? Likely not, but by making it eight-way you can play the trick you mentioned.

Even if you can't play this trick, you can still do a data array lookup in parallel. As with the "partial tag" (or "way predictor" as it is sometimes called) I described above, you can pretty accurately grab the correct block from the cache. In parallel with that is where you do the TLB access and tag lookup. You then use the information from both the tags and the TLB to see if your prediction was correct. Of course, it really isn't a guess. It isn't a predictor as much as a lookup table that can be wrong from time to time.

This sort of speculatively indexed cache can hide the tag and TLB latency. I'm not actually sure which (if any) processors use such tricks. I think the Pentium 4 might have (it used tons of tricks like this to get high clock frequency), but perhaps the Core 2 Duo just does the brute-force thing and does the tag lookup first and then accesses the cache array. A tag array for a 32KB cache with 32-byte blocks is only 4KB or so. Only 2KB for the 64-byte blocks used by Larrabee. Perhaps at a GHz or so in 45nm it just doesn't add that much latency. I dunno.

ISTR that the EV6/7 and K7/8 use way prediction.

Do you mean the TLB at the L2 cache? Or the second-level TLB used by some x86 cores? If you mean the TLB at the L2 cache, there is no such beast. Once you've translated a virtual address to a physical address as part of the first-level cache access, you don't need to do any sort of translation at the second level. If you meant the second-level TLB, then sure, but it isn't accessed that frequently anyway.

I meant the 2nd level TLB, since as you pointed out there is only one TLB for the system.

Under x86, TLBs aren't actually architected state. They are just caching page table entries managed completely by the hardware. That means that each x86 chip can cache these entries any way it wants, including using set-associative TLBs. That is, a TLB just looks like a smaller cache (that is, it uses SRAMs and not CAMs). I think the Opteron has a four-way set associative TLB, but I can't recall what the Core 2 does. This flexibility also allows the use of a two-level TLB, in which the second-level TLB is larger and prevents a more expensive hardware page table walk.

Note that under some systems with software-managed TLBs, the size and configuration of the TLB is visible to the operating system. Interestingly, the SPARC chips have gradually migrated from software-managed TLBs (in which the software walks the page tables) to a hardware-managed TLB (in which the hardware walks the page tables). Most of the RISC ISAs (for example, MIPS, Alpha, SPARC) used software managed TLBs. At the time, x86's hardware managed table was considered a bad idea. But now hardware TLBs are back in style again.

I may be underestimating the costs of such caches. They certainly aren't free. However, I would say that most of the readers of this board greatly underestimate the benefit of dynamic caching of data and hardware-managed cache coherence. I think this design decision is at the heart of Larrabee, and it is one of Larrabee's key advantages over everything else that has come before it in this space.

I agree it's a huge advantage, I'm just not sure to what extent the decision to be CC was made because it's the right idea, versus "it's what's available".

DK
 
I fear we may be talking past each other here.

We certainly may be. I think we agree, but perhaps disagree over the magnitude of the problem.

The point I was trying to convey was that, even if the memory latency for 1 thread is 50 ns, you cannot assume that the latency for N threads is also 50 ns (per thread), for large-ish N.

As long as your bandwidth utilization is, say, less than 50% and you have a reasonably random distribution of addresses, the latency should be largely independent of the number of threads you throw at the system. Sure, there may be bursts, but because the system is a closed system (a thread won't generate another miss until its current one has been serviced), there is a limit to the number of misses that can be queued at any one time.

If you're not careful about how you construct your access patterns (between threads!) while trying to maximize bandwidth, you can easily be off by an order of magnitude in terms of latency.

Larrabee likely has something like a 256-bit or 512-bit memory interface. As GDDR3 or GDDR4 run in the range of 2 to 4 gigabits per second per pin, that would be something around 128 GB/second, which is likely what Larrabee (and other GPUs of the day) will support.

Ok, that means that Larrabee could decide to fetch an entire 64B cache block by using all its memory controllers in parallel (I think they are actually likely doing block interleaving of the memory controllers, but stay with me for a second here). In this parallel cache-block-fetch scheme, there are no bank conflicts. In essence, the access pattern wouldn't matter. In such a system, I really don't see how you could get an order-of-magnitude slowdown.

In fact, if there are 128 threads in the system (without any prefetching), the most you can wait for is 127 threads in front of you. With 128GB/second of bandwidth, the system can service a request in 0.5 ns on average. That means the most queueing you'd expect to see would be 64 ns. This 2x slowdown should be close to the worst case you'd see.
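
A quick check of that arithmetic; the 512-bit bus and 2 Gb/s/pin figures are just the assumptions from above, not confirmed Larrabee specs:

```cuda
// Back-of-the-envelope: service time per 64 B block and worst-case queueing delay.
#include <cstdio>

int main()
{
    const double bus_bits      = 512;
    const double gbits_per_pin = 2.0;                            // GDDR3/4, low end
    const double bw_gbs        = bus_bits * gbits_per_pin / 8;   // ~128 GB/s
    const double service_ns    = 64.0 / bw_gbs;                  // one 64 B block: ~0.5 ns
    const int    threads       = 128;
    const double worst_wait_ns = (threads - 1) * service_ns;     // ~64 ns of queueing
    std::printf("%.0f GB/s, %.2f ns per block, worst-case wait %.1f ns\n",
                bw_gbs, service_ns, worst_wait_ns);
    return 0;
}
```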

Yes, I know that real DRAMs have other overheads, and they need to open and close pages and such. Yes, in a real system there would probably be banking with block interleaving, meaning you could have the unlikely situation in which all the threads bang on the same memory controller. I guess I don't see the problem. More specifically, why would the problem be any better or worse for a GPU? I guess the GPU could access data at sub-block granularity, but it isn't clear that is much of a win with the way DRAM controllers work (they like large chunks to fetch).

And when you add latency, you need to add yet more threads to cover that latency, which can in turn add more latency and so on.

Actually, that was my point. Tolerating latency doesn't matter if you're already bandwidth bound. Once bandwidth is your bottleneck, your only choice is to buy more bandwidth, make bigger caches, or change the software to have better locality.

Once you have enough threads (say 128), you're totally memory bound. You could add threads, but if you're pegging out your DRAM bandwidth at 100%, adding threads won't actually help.

A simple queuing model will not necessarily show this, but a simple DRAM model will.

Am I missing something about the internals of DRAM? I have to admit, I've never worked with a super-detailed DRAM model, but from what I know about DRAM, I can't think of something that would cause such problems.
 
ArchitectureProfessor, care to speculate as to what types of memory access instructions Larrabee will have with its vector instructions?

With CUDA each thread/component of the warp/vector can individually address memory during a read/write (component gather), whereas SSE, and perhaps (or not) Larrabee, has to emulate gather (and scatter as well) with multiple instructions and registers. Of course with CUDA there is a penalty depending on access pattern, but I believe the hardware handles this (incurring extra latency) while another warp/vector is processing.

Even with such small vectors (2 DP, 4 SP), SSE is often performance-limited by the number of instructions needed to gather/scatter from memory, swizzle, and do other non-ALU work. With such a large vector, one would think that with Larrabee they would definitely address these issues.
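
For illustration, here's a minimal CUDA sketch of what per-lane addressing means (the kernel and names are made up):

```cuda
// Each thread of the warp supplies its own index and the hardware performs the
// gather, coalescing what it can and serializing the rest at some latency cost.
__global__ void gather(float* out, const float* table, const int* idx, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = table[idx[i]];   // per-lane address: no per-element extract/insert
}
// With SSE, the same operation takes a scalar load plus a register insert per element.
// host-side launch: gather<<<(n + 255) / 256, 256>>>(d_out, d_table, d_idx, n);
```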

CUDA exposes different memory interfaces: cached read only constants, high latency read/write global device memory, banked+broadcast read/write of shared local memory, and read only 2D texture cache. As I noted in my previous post, read/write access to a 2D surface cache is in the CUDA PTX, but not yet exposed by the drivers or hardware (not in Compute 1.1). BTW, CUDA Compute 1.1 (G92 and higher) does offer an atomic memory instruction.

While probably just a driver issue, perhaps the biggest limitation of CUDA right now (according to what I've read on the CUDA message forums) is startup latency and the fact that only one kernel/program can run at a time (kernels/programs are serialized). Undoubtedly these issues will be addressed in 2008.

From a programming standpoint Larrabee could have an advantage in regards to running multiple different programs on the hardware at one time.

Let's hope that if Intel is going to change the instruction set for Larrabee, they go in and make the integer registers symmetric and remove the BP/EBP/RBP and SP/ESP/RSP addressing-mode exceptions...
 
As Larrabee is just general purpose x86 cores, the hardware will support whatever version of DirectX.
No matter how Larrabee's performance turns out, I believe this is its number one advantage. It could have DirectX 12 support the day Microsoft closes the specification, while NVIDIA and AMD would still be about a year away from mass producing new cards with stable drivers.

Also, in my experience games that use a newer version of DirectX aren't necessarily slower because of it. On the contrary, newer DirectX versions allow you to do things more efficiently. So while your Larrabee hardware wouldn't get any faster when it's updated to DirectX 12, it won't suck at running next-generation games. Again, there are no inherent bottlenecks.
 
Once you have hardware caches, you can always just "block" your algorithm to use a consecutive block of memory (just smaller than the size of the cache). After one iteration the hardware cache will basically act like a software-managed cache. If you want, you can even pre-walk the block of memory to quickly sequentially fetch it into the primary data cache.

I understand all that. It's just that CUDA doesn't have it... So caching in SW is all there is (except for constant and texture caching, but the latter still has relatively high latency and no way to flush, so it's also only for read-only data).
 
ArchitectureProfessor, care to speculate as to what types of memory access instructions Larrabee will have with its vector instructions?

With CUDA each thread/component of the warp/vector can individually address memory during a read/write (component gather), whereas SSE, and perhaps (or not) Larrabee, has to emulate gather (and scatter as well) with multiple instructions and registers. Of course with CUDA there is a penalty depending on access pattern, but I believe the hardware handles this (incurring extra latency) while another warp/vector is processing.

Even with such small vectors (2 DP, 4 SP), SSE is often performance-limited by the number of instructions needed to gather/scatter from memory, swizzle, and do other non-ALU work. With such a large vector, one would think that with Larrabee they would definitely address these issues.

CUDA exposes different memory interfaces: cached read only constants, high latency read/write global device memory, banked+broadcast read/write of shared local memory, and read only 2D texture cache. As I noted in my previous post, read/write access to a 2D surface cache is in the CUDA PTX, but not yet exposed by the drivers or hardware (not in Compute 1.1). BTW, CUDA Compute 1.1 (G92 and higher) does offer an atomic memory instruction.

While probably just a driver issue, perhaps the biggest limitation of CUDA right now (according to what I've read on the CUDA message forums) is startup latency and the fact that only one kernel/program can run at a time (kernels/programs are serialized). Undoubtedly these issues will be addressed in 2008.

From a programming standpoint Larrabee could have an advantage in regards to running multiple different programs on the hardware at one time.

Let's hope that if Intel is going to change the instruction set for Larrabee, they go in and make the integer registers symmetric and remove the BP/EBP/RBP and SP/ESP/RSP addressing-mode exceptions...

If you assume that Intel's engineers are fairly smart, which they are, I think the most likely answer is that they will have support for scatter-gather.

It's very difficult to keep high utilization for any vectors much larger than those in SSE without that particular functionality.

David
 
In theory. But even writing a fast Direct3D 10 software rasterizer isn’t an easy job.
Easy, no, but is there anything that makes you believe it's harder than designing hardware and writing drivers? I wrote a 'fast' DirectX 9 renderer while I was still studying (fast in relative terms, given the hardware). Frankly, I believe writing a DirectX 10 renderer is simpler because you don't have to worry about fixed-function pipelines. A lot of responsibilities shift to the hands of the game developers.

The advantage of software on a general-purpose CPU is that it's extremely modular and you can work incrementally. Whenever an approach isn't as efficient as expected, you can rewrite and test it in a matter of hours (no matter if it's a high-level or low-level algorithm).
 