Larrabee: Samples in Late 08, Products in 2H09/1H10

Actually I'd say Intel has 4

As you mentioned, Intel does have more than two design groups. In addition to Oregon and Israel, Intel does also have the design teams in Santa Clara, the group outside of Boston (the former Alpha group), and Fort Collins (the old HP PA-RISC design group). I would say that right now the shining star groups are Israel followed by Oregon. The other groups are also really good, they just haven't had a chance to really deliver in the past few years.

Perhaps with Tukwila, the former Alpha group will shine again. Yet, I know of some key box-leads from the old Alpha group that have since left Intel, so I'm not sure if this group is still as good as it once was. I think the cancellation of EV8 really hurt the morale of the group.

I agree that the Itanium 2 from the guys at Fort Collins was a floating point performance monster. I'm not sure of the status of the Texas design group (which you mentioned). Intel also has teams in India (focusing on validation, I think), Russia, and perhaps Germany as well?

It is actually amazing how many different groups Intel has.

Intel (nee HP) Fort Collins did Itanium 2, which as it turns out was the fastest or 2nd fastest 180nm processor and was beating all comers for a while except for....

Good on spec benchmarks, partly due to the *huge* on-chip cache. Not as good on real workloads, from what I've heard. If you gave an x86 core of the day the same on-chip cache and higher memory bandwidth, the gap would be much smaller.

I've actually heard rumors that the x86 designers were told that x86 wasn't allowed to have the same memory bandwidth as the Itanium (IPF) chips. x86 parts of the day had about half the memory bandwidth. Intel also prevented the x86 guys from rolling out 64-bit x86. Rumor has it they actually had partial 64-bit support in some versions of the Pentium 4 (using a similar but incompatible 64-bit x86 extension). This support is what allowed Intel to add support for AMD's 64-bit extensions reasonably quickly. It wasn't until AMD did 64-bit and had lots of memory bandwidth that Intel took the shackles off of the x86 guys. Ironically, right around the time of Prescott.

The EV7 was also one of the two fastest 180nm CPUs, and it's been speculated that if it hadn't been for HP not wanting the red-headed stepchild to outshine the fair-haired child (IPF), it would have easily beaten all other 180nm CPUs.

I've heard this as well. I think HP actually had internal benchmark results showing EV7 being much faster than the IPF processors (for example, on the TPC-C database workload benchmark). They just never released any of them. EV7 was a really great design. Interestingly, they had a paper design for an Alpha chip with vectors added to it (using the EV7 memory system). If you squint a bit, Larrabee looks a lot like an EV7+vectors with all the cores on the same chip (ok, except for the ring interconnect business).

Of course, the Fort Collins guys aren't working on x86, and I don't think the Alpha guys are either.

As far as I know, they are both still doing IPF.
 
Something just occurred to me.
Wouldn't speculative locking require a thread ID on all in-core memory accesses?

Otherwise, one of the other three threads running in the core might access one of the speculative cache lines, and it would be indistinguishable from the speculating thread.
 
Architectureprofessor,

Everything you say is fine, but what strikes me most is that there isn't any numeric estimate of performance whatsoever, in any paper. Performance is generally the prime reason why a processor is successful or not.

There isn't any estimate of any kind for Larrabee's performance in this thread. Everything is merely qualitative. Of course, this is a technology discussion subforum, but the market doesn't work on aesthetics alone. If it did, Phenom would be a winner because a lot of people think its architecture is beautiful, but we know its performance is really subpar.

This thread is really informative from a general computer science perspective. But saying that Larrabee has this or that cute or "powerful" detail doesn't really address the issue of real-world performance.

So, if you could elaborate with some statistical or numerical estimates, using everything you describe as presumably present, that would be really informative. Otherwise, I think this will turn into a sociological and philosophical discussion of computers, when we definitely know that they are highly precise and mundane devices.

Daniel,

:) :) :) :) :) :)
 
As you mentioned, Intel does have more than two design groups. In addition to Oregon and Israel, Intel does also have the design teams in Santa Clara, the group outside of Boston (the former Alpha group), and Fort Collins (the old HP PA-RISC design group). I would say that right now the shining star groups are Israel followed by Oregon. The other groups are also really good, they just haven't had a chance to really deliver in the past few years.

Perhaps with Tukwila, the former Alpha group will shine again. Yet, I know of some key box-leads from the old Alpha group that have since left Intel, so I'm not sure if this group is still as good as it once was. I think the cancellation of EV8 really hurt the morale of the group.

Tukwila is designed by the Fort Collins guys. There was actually a big internal debate at Intel over whether to use the DEC-designed Tukwila or the FTC-designed one. The deciding factor was HP's feedback, which, surprise surprise, was for the FTC design (a 4-core version of Montecito with better system architecture).

There's a great discussion of that on RWT actually...

I agree that the Itanium 2 from the guys at Fort Collins was a floating point performance monster. I'm not sure of the status of the Texas design group (which you mentioned). Intel also has teams in India (focusing on validation, I think), Russia, and perhaps Germany as well?

It is actually amazing how many different groups Intel has.

Good on spec benchmarks, partly due to the *huge* on-chip cache. Not as good on real workloads, from what I've heard. If you gave an x86 core of the day the same on-chip cache and higher memory bandwidth, the gap would be much smaller.

I've actually heard rumors that the x86 designers were told that x86 wasn't allowed to have the same memory bandwidth as the Itanium (IPF) chips. x86 parts of the day had about half the memory bandwidth. Intel also prevented the x86 guys from rolling out 64-bit x86. Rumor has it they actually had partial 64-bit support in some versions of the Pentium 4 (using a similar but incompatible 64-bit x86 extension). This support is what allowed Intel to add support for AMD's 64-bit extensions reasonably quickly. It wasn't until AMD did 64-bit and had lots of memory bandwidth that Intel took the shackles off of the x86 guys. Ironically, right around the time of Prescott.

Actually I don't think the issue was memory bandwidth, but cache. I know for a fact that the guys designing the P4 were explicitly told to only target desktop workloads, which is why there was no MP version for such a long time (till Tulsa). They wanted to add more cache but were told no.

So the x86-64 support was actually built into Prescott without management knowing, and it wasn't till much later in the life cycle of the project that they told anyone about it and then argued that it needed to be turned on. There's a great UW paper talking about this, it's sort of a post-mortem of why x86-64 succeeded, and the answer is: Microsoft.

I've heard this as well. I think HP actually had internal benchmark results showing EV7 being much faster than the IPF processors (for example, on the TPC-C database workload benchmark). They just never released any of them. EV7 was a really great design. Interestingly, they had a paper design for an Alpha chip with vectors added to it (using the EV7 memory system). If you squint a bit, Larrabee looks a lot like an EV7+vectors with all the cores on the same chip (ok, except for the ring interconnect business).

Tarantula was the design; they recycled the EV8 core and bolted on a vector unit.

As far as I know, they are both still doing IPF.

My understanding is that the DEC teams (both in CA and MA) worked heavily on CSI and on Tukwila v1, which was cancelled. I think they are now working on Poulson.

David
 
OK, let me play devil's advocate here again. If this is so easy, why do *any* programs use locks? Just do the magic lock-free stuff you talked about. <sarcasm>Easy. No problem. That is why most major systems such as operating systems and database management systems don't use locks internally. That is why malloc doesn't use locks.</sarcasm>

Let me sum it up this way: a data structure with a coarse-grained lock is a CS101 assignment. A data structure with fine-grained locking is perhaps a Junior or Senior level project for a CS major. A good lock-free algorithm can earn you a PhD. Why not build speculative locking hardware to turn a PhD-level problem into a CS101 project?
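
To make that granularity ladder concrete, here is a minimal C++11 sketch (my own illustration, nothing Larrabee-specific) of the same linked-list push written with a coarse-grained lock and then lock-free; the fine-grained middle rung is left out for brevity:

    // Illustrative C++11 sketch: the same linked-list "push" written two ways.
    #include <atomic>
    #include <mutex>

    struct Node { int value; Node* next; };

    // CS101 version: one mutex guards the whole structure.
    struct CoarseStack {
        std::mutex m;
        Node* head = nullptr;
        void push(Node* n) {
            std::lock_guard<std::mutex> g(m);   // serializes every operation
            n->next = head;
            head = n;
        }
    };

    // PhD version (and this is only push; pop adds the ABA problem):
    struct LockFreeStack {
        std::atomic<Node*> head{nullptr};
        void push(Node* n) {
            Node* old = head.load(std::memory_order_relaxed);
            do {
                n->next = old;                  // retry until no other thread
            } while (!head.compare_exchange_weak(old, n,          // raced us
                                                 std::memory_order_release,
                                                 std::memory_order_relaxed));
        }
    };

The pitch for speculative locking is that you get to write the first version and, when critical sections don't actually conflict, the hardware gives you much of the concurrency of the second.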

Thanks for the previous links, they should be an interesting read.

100% agree with that one. Larrabee will be ideal for easy parallelism. ;)

BTW, I wasn't saying that you completely remove locks, just minimize them so that they are no longer a performance problem in the critical paths. Taking your obvious malloc example, anyone using malloc in a critical path in a multiprocessing algorithm might want to have a serious look at pre-allocating single-thread-owned objects (no locking is then needed to allocate and free).
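
A minimal sketch of that pre-allocation idea (hypothetical FreeList, modern C++ thread_local, not taken from any real code base): the shared allocator is only touched on a rare bulk refill, so the hot alloc/free path takes no lock at all.

    // Illustrative sketch of per-thread pre-allocation.
    #include <cstdlib>

    struct Object { char payload[60]; Object* next; };

    class FreeList {
        Object* head_ = nullptr;
        void refill(std::size_t count) {        // rare: one malloc per batch
            Object* block =
                static_cast<Object*>(std::malloc(count * sizeof(Object)));
            for (std::size_t i = 0; i < count; ++i) {
                block[i].next = head_;          // (no error handling here)
                head_ = &block[i];
            }
        }
    public:
        Object* alloc() {                       // hot path: no locks
            if (!head_) refill(1024);
            Object* o = head_;
            head_ = o->next;
            return o;
        }
        void free(Object* o) {                  // hot path: no locks
            o->next = head_;
            head_ = o;
        }
    };

    thread_local FreeList tls_pool;             // each thread owns its own list

The obvious caveat is that objects must be freed by the thread that allocated them; handling cross-thread frees is where general-purpose allocators earn their keep.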

Speaking of experts, one thing we might be able to take a few notes from is Michael Abrash and Mike Sartain's highly optimized Pixomatic software DX7 renderer. Seems like they efficiently managed to do just about everything but advanced texture filtering (mipmapping, trilinear filtering) and programmable shaders in software. However, according to that site, "Pixomatic is roughly an order of magnitude slower than leading-edge consoles and PC graphics cards as we write this in 2003".

Looking back at 2003 gives us the GeForce FX 5900 Ultra at 450 MHz with 3 vertex shaders, 8 pixel pipes, and 4 ROPs. For x86, 2003 gives a 3.0 GHz P4 (with hyper-threading). So the most highly optimized software renderer in 2003 was 10 times slower than a GeForce FX 5900 Ultra. Now take the GeForce 8800 Ultra (G80 chip, 2007): in ALU terms alone, at 1512 MHz with 128 stream processors, it is at worst case easily over 4 times faster than the GeForce FX 5900 Ultra. Using a little grade-school logic, even if you could simply "plop" 40 Intel P4s clocked at 3 GHz onto a chip, you still wouldn't be able to match a single G80 for rendering.

Now consider Larrabee, with only 16-24 cores at 2.5 GHz and in-order 4-way "hyper-threading". It seems like even with hardware texture gather, a software renderer on Larrabee wouldn't match an 8800 Ultra unless it had other serious dedicated GPU features such as triangle setup and some kind of ROP. And in that case you are back to the embarrassingly parallel workflow for rendering, which isn't going to need SLE or a coherent cache.

I'm just saying that for something that is getting marketed as a GPU, Larrabee is offering a CPU-like feature set (cache, SLE, etc.), which really only makes sense if they are replacing key dedicated GPU pipeline hardware with software, which simply doesn't seem like a good idea from a performance perspective.

I don't know this algorithm. But let me ask you a question about it. How well would this map to Larrabee (in terms of the structure and communication patterns)? I suspect it will map reasonably well under a task-and-queue model.

Actually you wouldn't want to map it to Larrabee, it's a solution to a problem that Larrabee doesn't have.
 
If for simplicity's sake the bus is optimized to a certain granularity, the utilization penalty is larger than the packet size would suggest... I can think of a number of increasingly interesting (but perhaps ultimately pointless) formulations for the ring bus, given all that I know is that Intel hoped for 256 bytes/cycle.

Yes, that is true. If it really is four rings, that is 64 bytes per ring. Sending smaller requests on the ring one at a time would likely waste bandwidth. That is a good point.

Fine-grained methods such as round-robin execution would not be able to hide a cache miss for whatever number of cycles that thread comes up for execution.... I haven't seen any Intel statement that has set down what exact scheme Larrabee uses.

I don't think any recent multithreaded high-end CPUs have used round-robin multithreading (including Intel's). They either use SMT, which dynamically steers individual instructions from different threads to the ALUs, perhaps even in the same cycle, or they use "switch on event" multithreading (CMT), in which one thread runs until it stalls and then the core switches to another thread. I would expect Larrabee to use one of these approaches.

Ring buses are conceptually simpler, and they have been used in other architectures to provide high-bandwidth internal interconnect.... Polaris showed they've been working on more complex topologies, but sometimes simplicity wins.

I agree that rings are simpler. I guess I'm just surprised they scaled up to 32 cores!

I'm curious about how it will implement scatter and gather operations.
If working on gathering operands for a 16-wide vector, an increasing level of arbitrariness to where those elements can be pulled from can lead to increasing levels of waste.
Worst-case with fully generalized gather is pulling 16 elements from 16 different locations, necessitating 16 64-byte cache lines to be read into the cache for just one 32-bit value from each.
Perhaps gathers will bypass the cache?

I'm not sure bypassing the cache helps. DRAMs are designed for consecutive bursts. The GDDR4 datasheet referenced earlier in this thread requires bursts of length 8 of 4 bytes each. That is, the minimum you can read out of this DRAM is 32 bytes. Doing a scatter/gather in which you need 32 bits from all over will have similar inefficiencies just because of the way DRAM works. Once you already need to grab bursts of data, why not do block-based cache coherence?
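
Some rough arithmetic on that worst case (my numbers, using the 32-byte minimum burst quoted above and Larrabee's presumed 64-byte cache lines):

    // Back-of-the-envelope for the worst-case 16-element gather (my numbers).
    constexpr int lanes        = 16;
    constexpr int useful_bytes = lanes * 4;    // 16 x 32-bit values   =   64 B
    constexpr int via_cache    = lanes * 64;   // 16 distinct lines    = 1024 B
    constexpr int bypassing    = lanes * 32;   // 16 minimum bursts    =  512 B
    // Efficiency: 64/1024 = 6.25% through the cache, 64/512 = 12.5% bypassing
    // it. Either way the DRAM burst granularity is the real cost, not the cache.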

Temperature can increase leakage current, so a chip with internals that have a higher average temperature might consume more power, depending on how other portions of the chip are altered.

Good point. That just makes the thermal challenges even worse.

Writing the speculative values to cache could theoretically force a write invalidate broadcast if the cache controller tried to treat it like any other write. I suppose reads can be allowed to go on as usual, though there could be some hiccups when speculation pushes other cache lines to shared status and then rolls back to the time prior to the accesses.

I'm not sure I quite follow you. Certainly if you mis-speculate, you might take a hiccup. The same is true for branch prediction, yet if you're mostly right, things work well.

If Larrabee has some kind of forwarded cache line state like in the CSI interconnect, it could theoretically force a rollback on a line in Forwarding status, leaving no cache line anywhere in that state.

Yea, you would need to make sure to handle that case correctly (by properly coordinating the states in the L1 and L2).

Then there's the obligatory reference to a corner case involving undoing updates to the page table descriptor.

My solution to this: don't allow page table descriptor changes within speculative critical sections. Sounds too nasty.
 
Something just occurred to me.
Wouldn't speculative locking require a thread ID on all in-core memory accesses?

Yes, you would need a pair of speculative read/write bits for each hardware thread context in the core. So, for Larrabee's four thread contexts, you would need something like eight bits per 512-bit cache block. Then, on a cache hit, you need to make sure the block wasn't modified speculatively by another thread on the same core.
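
A rough sketch of the per-line bookkeeping I have in mind (the names and layout are made up for illustration; real hardware would keep these bits in the L1 tag array):

    // Hypothetical per-line speculative state for a 4-thread core.
    #include <cstdint>

    struct CacheLineMeta {
        std::uint8_t spec_read  : 4;  // one "speculatively read" bit per thread
        std::uint8_t spec_write : 4;  // one "speculatively written" bit per thread
    };

    // An access by thread t conflicts with another thread's speculation if we
    // write a line someone else has speculatively touched, or read a line
    // someone else has speculatively written.
    inline bool conflicts(const CacheLineMeta& m, int t, bool is_write) {
        std::uint8_t others = static_cast<std::uint8_t>(~(1u << t)) & 0xF;
        if (is_write)
            return (m.spec_read & others) || (m.spec_write & others);
        return (m.spec_write & others) != 0;
    }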

A bit tricky, but it does work.
 
Everything you say is fine, but what strikes me most is that there isn't any numeric estimate of performance whatsoever, in any paper. Performance is generally the prime reason why a processor is successful or not.

There just isn't enough known about Larrabee's actual internals or what the software it runs actually looks like. Using absolute numbers without knowing more would just be picking numbers out of the air.

As for some of the individual sub-features (such as speculative locking), there have been some papers that explore them (and try to quantify their performance). The big problem is determining what the parallel workloads of today and tomorrow will look like. So, academics like myself do the best we can to make convincing arguments backed up by some data.
 
Tukwila is designed by the Fort Collins guys. There was actually a big internal debate at Intel over whether to use the DEC-designed Tukwila or the FTC-designed one. The deciding factor was HP's feedback, which, surprise surprise, was for the FTC design (a 4-core version of Montecito with better system architecture).

Wow, I hadn't heard that. Man, the Alpha group never seems to get lucky.

There's a great UW paper talking about this, it's sort of a post-mortem of why x86-64 succeeded, and the answer is: Microsoft.

Do you have a title or reference to the paper? It sounds interesting.

Tarantula was the design; they recycled the EV8 core and bolted on a vector unit.

Yea. I found a link to the paper: "Tarantula: A Vector Extension to the Alpha Architecture". Pretty cool stuff.
 
Taking your obvious malloc example, anyone using malloc in a critical path in a multiprocessing algorithm might want to have a serious look at pre-allocating single-thread-owned objects (no locking is then needed to allocate and free).

This is a great example of how changing the algorithm can be much more effective than any sort of fancy locking or whatever. Having per-thread malloc buffers is a really good idea. There is a good memory allocator (a malloc() replacement) based on this idea, Hoard, which came out of Emery Berger's PhD thesis at UT-Austin.

Speaking of experts, one thing we might be able to take a few notes from is Michael Abrash and Mike Sartain's highly optimized Pixomatic software DX7 renderer... Now consider Larrabee, with only 16-24 cores at 2.5 GHz and in-order 4-way "hyper-threading". It seems like even with hardware texture gather, a software renderer on Larrabee wouldn't match an 8800 Ultra unless it had other serious dedicated GPU features such as triangle setup and some kind of ROP.

The big difference between multiple Pentiums and what Larrabee is doing is the 64-byte vectors. The vectors give Larrabee huge peak flops, and getting them working well will be necessary for Larrabee to be successful. I just don't know enough about what Larrabee's vector ISA looks like to say for sure. The other issue is memory bandwidth.
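
Just to put rough numbers on "huge peak flops" (my own back-of-the-envelope, assuming the 16-24 cores at 2.5 GHz mentioned above and 16 single-precision lanes per 64-byte vector; whether there is a fused multiply-add is pure guesswork):

    // Back-of-the-envelope peak single-precision flops (all assumptions, not
    // Intel numbers): 64-byte vectors = 16 fp32 lanes, 2.5 GHz, 16-24 cores.
    constexpr double ghz        = 2.5;
    constexpr int    lanes      = 16;                // 64 B / 4 B per float
    constexpr double gflops_24c = 24 * lanes * ghz;  //  960 GFLOPS at 1 op/lane/cycle
    constexpr double gflops_fma = 2 * gflops_24c;    // ~1.9 TFLOPS if there is an FMA
    // For scale, a 3 GHz 2003-era P4 peaks at roughly 4 SSE flops/cycle, i.e.
    // ~12 GFLOPS: the vectors, not the scalar cores, are where the flops live.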

I'm just saying that for something that is getting marketed as a GPU, Larrabee is offering a CPU-like feature set (cache, SLE, etc.), which really only makes sense if they are replacing key dedicated GPU pipeline hardware with software, which simply doesn't seem like a good idea from a performance perspective.

That is exactly the high-risk bet that Intel is making. I agree that it sounds crazy, and it certainly goes against all conventional wisdom in the GPU space. That is what makes it so interesting.

Let me also say this. If Larrabee isn't successful, it won't be because it didn't have enough multi-core-like features (the cache, SLE, and such that you mentioned). Larrabee is probably the best-case scenario, so it should be a good test to see if the idea is fundamentally flawed or not.

Actually you wouldn't want to map it to Larrabee, it's a solution to a problem that Larrabee doesn't have.

Without knowing more about it, that certainly sounds like a point in favor of Larrabee: it doesn't need a solution for a problem it doesn't have. :)
 
Actually I don't think the issue was memory bandwidth, but cache. I know for a fact that the guys designing the P4 were explicitly told to only target desktop workloads, which is why there was no MP version for such a long time (till Tulsa). They wanted to add more cache but were told no.

No Multi-processor Netburst chips until Tulsa? What about Foster MP, Gallatin, Cranford, Potomac and Paxville MP?

Foster MP, Gallatin, Potomac and Paxville MP all had L3 cache.
 
An analogy to speculative locking

I think I finally figured out a way to put all of the concerns raised about speculative locking as an idea into context by way of an analogy.

Let's go back in time 15 years to the early 1990s. At the time, some academics were really hot on out-of-order execution. In out-of-order execution, a processor speculates past branches to uncover instruction-level parallelism, rolling back when the branch predictions are wrong. Doing out-of-order execution requires register checkpointing mechanisms, handling out-of-order memory operations (or preventing them), structures to maintain in-order retirement, precise exceptions, etc. Lots of book-keeping machinery and a big reliance on branch prediction.

Let's say that some brash and obnoxious young architecture professor went on the comp.arch newsgroup (the web didn't really exist in the early 1990s) and reported rumors that Intel---of all companies---was going to embrace out-of-order execution in the Pentium Pro (P6) design. Sure some small companies had done out-of-order for some little things, but x86? No way.

Of course, posters would respond with how complicated out-of-order execution is. They would say that in-order with smart compiler scheduling is just as good. They would argue in favor of VLIW. They would say the x86 ISA would prevent Intel from ever making a high-performance processor that could ever touch the workstation or server market. There would be lots of naysayers.

Yet, in spite of all such objections, Intel did make P6. It was the most successful chip in the history of computing. Its basic design is still at the heart of the Core 2 Duo on which I'm writing this message 15 years later.

I think this is a pretty good analogy to the discussion we're having on speculative locking.

Just to make it explicit: today many academics are hot on speculative locking and transactional memory to increase the parallelism of thread-based codes. Such techniques require extra book-keeping overhead for managing speculative state and rely on the prediction that most critical sections are independent. Yes, with smart lock-free data structures you don't need speculative locking (just as with some algorithms that can be scheduled well, you don't need out-of-order execution). Yet, speculative locking can really help some workloads (just as out-of-order does).

Seems pretty similar to me. In fact, speculative locking is likely *simpler* than out-of-order execution in terms of overall impact on the core.
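
For anyone who hasn't seen the usage model spelled out, it looks roughly like this (a C++ sketch with hypothetical spec_begin/spec_commit/spec_abort primitives; nothing like this has been announced for Larrabee):

    // Hypothetical control flow for speculative lock elision. spec_begin /
    // spec_commit / spec_abort stand in for hardware primitives that are
    // declared here only to show the shape of the code.
    bool spec_begin();    // true  = now running speculatively
                          // false = an abort rolled us back to this point
    void spec_commit();   // atomically publish the buffered speculative writes
    void spec_abort();    // discard buffered writes, resume at spec_begin()

    template <typename Lock, typename Table, typename K, typename V>
    void locked_insert(Lock& lock, Table& table, const K& k, const V& v) {
        if (spec_begin()) {
            // Reading the lock word adds it to our speculative read set, so a
            // thread that really acquires the lock will abort us automatically.
            if (!lock.is_held()) {
                table.insert(k, v);
                spec_commit();        // no conflict: commit without ever locking
                return;
            }
            spec_abort();             // someone really holds the lock; roll back
        }
        // Fallback: the plain coarse-grained critical section.
        lock.acquire();
        table.insert(k, v);
        lock.release();
    }

The programmer still writes the boring CS101 fallback; the speculation is just an optimization layered on top of it.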

Ok, flame away. :)
 
No Multi-processor Netburst chips until Tulsa? What about Foster MP, Gallatin, Cranford, Potomac and Paxville MP?

Foster MP, Gallatin, Potomac and Paxville MP all had L3 cache.

I don't know when they made the "MP" version of this chip. However, just as an FYI, the high-end Itanium 2 from 2006 (90nm) has 24 MB of on-chip cache. :oops: As compared to today's 4MB-8MB of on-chip cache on the Core 2 line (at 65nm), you can see just how big those caches are (and why those chips are so expensive).
 
I don't know when they made the "MP" version of this chip. However, just as an FYI, the high-end Itanium 2 from 2006 (90nm) has 24 MB of on-chip cache. :oops: As compared to today's 4MB-8MB of on-chip cache on the Core 2 line (at 65nm), you can see just how big those caches are (and why those chips are so expensive).

Foster MP was 1MiB L3 in March 2002 on 180nm
Gallatin-2M had, well, 2MiB L3 in November 2002 on 130nm
Potomac had 8MiB L3 in 2005 on 90nm
Later that year Paxville MP (first dual-core Xeon MP chip) launched with up to 2MiB L2 (I got mixed up in my previous post, it doesn't have L3 it seems) on 90nm.
Tulsa had 1MiB L2 per core and up to 16MiB L3 on 65nm in 2006.

That said, Tulsa was a big and expensive chip too! (die shot here)
 
No Multi-processor Netburst chips until Tulsa? What about Foster MP, Gallatin, Cranford, Potomac and Paxville MP?

Foster MP, Gallatin, Potomac and Paxville MP all had L3 cache.

Sorry, I'm only talking about high performance products.

1MB L3 is tiny - compared to the previous generation 'cachecades' which had 2MB, and in fact, most folks stuck with the 'cachecades' instead.

The 4MB Gallatin I'd count, and the 8MB Potomac. You're right, I forgot about those.

However, Paxville MP had no L3 cache and was a steaming pile of shit; I definitely don't count it. AFAICT counting Paxville as a high performance product is valid insofar as counting a car with octagonal wheels as a high performance vehicle is valid.

David
 
TL;DR: Designing GPUs is hard.

AP, clearly one of us is confused. I'd like to think that it isn't me, or at least, that my previous DRAM controller designs do work as intended :)

AP said:
The concern I have with your post is that you're failing to take into account the pipelining of requests to a DRAM. Just because the latency of a request to the DRAM is k cycles doesn't mean you can't issue other requests to it in parallel. I think your post confuses throughput issues with latency issues in several places.

I am accounting for pipelining. The read-to-read command time is 6.4 ns for the selected DRAM. That's a fact. You will not go faster than that. That includes pipelining. The actual time it takes for the read data to return is about 20 ns for that DRAM, give or take. That's a lot longer than 6.4 ns.

The command bus looks a little empty because every read/write command takes 1 clock to issue, but occupies 4 clocks of the data bus (well 8 clocks in DDR anyway). It's also not configured the same as mine, and will have different read-to-write and write-to-read turnaround times (the configuration I picked lets me be symmetric).


If changing from read mode to write mode is so costly, a smart memory controller would defer the writes in a buffer (as writes aren't on the critical path) and then burst all the pending reads together. Seems like you picked an unrealistic worst case situation.

Ah, the magical smart memory controller. More on this later.

If you delay writes, you need to store the write address/data somewhere. In a queue or in a cache for example. If you use a queue, that's dedicated hardware (= area) and only solves a small part of the problem. Moreover, at some point you will need to write that data. That will cause pending reads to be pending for potentially a long time. If you use a cache, you end up pinning down lines in your cache for potentially too long, which results in less cache available for other tasks.

You're also assuming that writes don't happen all that often.

Some rhetorical questions for you: Where do the outputs of a vertex or geometry shader sit? In an on-chip queue? In a cache that can spill to memory? How much output are we talking about to sustain high utilization of the chip? Same question for the pixel data that you write out. Where does it go? How many pixels are you emitting and how much data are we talking about?

For reference, G80 (a ~1.5 year old GPU) can transform ~500 million triangles/sec and write out ~12 billion pixels/sec, simultaneously, at 500 MHz.

Hang on, much of this latency can be overlapped with other transfers. The "Read to Write" diagram on page 33 clearly shows only 2 "dead" clock cycles between a read and a write. This is only a couple of nanoseconds of dead time, not the 28ns you calculated in your last post.
If you look at the command bus, there are 14 dead cycles between the read and the write. That's 22.4 ns. Notice the ellipsis in the timing diagram.

The data bus is a red herring: We know it will be underutilized, even if the command bus is saturated because of the overhead of issuing commands in a non-bursty fashion.

So, I'm confused. Are you talking about uncontended latency of 35ns? If so, if you look back in my post, I used an estimate of 50ns for uncontended latency, so my estimate seems conservative based on what you're saying. If you're talking about throughput, then you're not considering the small dead time between reads and writes (or that consecutive read or write bursts can be done at full DRAM bandwidth).
I'm not sure what you mean by “uncontended”. If you mean that only a single thread (or a very few threads) are issuing requests to the DRAM, then that's far from being the interesting case. It's like saying “my CPU runs really quickly when virtually nothing is running on it”.

The interesting case is when you try to maximize your bandwidth, so that you may maximize your shader computations in real-world scenarios.

Or just a better memory scheduler (which is part of every high-performance DRAM memory controller built today).

Let's start by laying down the groundwork. What bandwidth do you want and/or can afford? Let's pick some numbers. 512-bits to DRAM with GDDR with a 1.6 ns cycle time, same as above. That works out to a peak bandwidth of 80 GB/sec. Let's say we want to be 90% efficient, for a sustained bandwidth of 72 GB/sec. Let's also assume no DRAM refresh needs to happen, that no external clients need to access the memory for trivialities like “displaying a picture on your monitor” and page switches are free.

What do we need to do to hit our sustained rate? There is an overhead of 22.4 ns for changing from reading to writing or vice-versa. Bursted reads or writes take 6.4 ns per command. That means the overhead is equivalent to 3.5 bursted commands, which we need to round up to 4 since the command bus is SDR. We also have:

efficiency = burst_size / (burst_size + overhead)

Filling in numbers, that's:

90% = burst_size / (burst_size + 4)
=> 0.9 * (burst_size + 4) = burst_size
=> 0.1 * burst_size = 3.6
=> burst_size = 36 commands

Thus, to hit 90% efficiency on your DRAMs, you need to issue at least 36 commands in one direction before you can switch. That means that if you are doing a read and then need to do a write, you need to wait over 230 ns. And then, you will need to occupy the DRAMs with 230 ns worth of writes before you can resume reads.
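
The same relationship, parameterized (my own code, just re-expressing the formula above to show how fast the required burst length grows with the efficiency target):

    // Solving e = b / (b + overhead) for the minimum burst length gives
    // b = overhead * e / (1 - e).
    #include <cmath>
    #include <cstdio>
    #include <initializer_list>

    int main() {
        const double overhead = 4.0;   // turnaround cost, in command slots
        for (double e : {0.80, 0.90, 0.95}) {
            double b = std::ceil(overhead * e / (1.0 - e));
            // 80% -> 16 commands, 90% -> 36, 95% -> 76: at 6.4 ns per command,
            // 90% already means ~230 ns committed to one direction.
            std::printf("%.0f%% efficient: >= %.0f back-to-back commands\n",
                        100 * e, b);
        }
        return 0;
    }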

If you do anything else, like try to insert “emergency” read/writes in the stream, you will break the burst and incur the 28.8 ns penalty, lowering your efficiency.

Ok, now to tackle the question of the magic memory controller/scheduler.

What does this scheduler do? It needs to queue up requests for (given the above) ~230 ns, sort those requests by request type, then issue them.

Moreover, it also needs to sort them by page, across banks, etc so that you don't incur the other penalties associated with DRAM access.

So now the question is: How long until a request is serviced? What if you get a single odd request to a lone page? How long does the controller hold on to that request before it gives up trying to find adjacent requests to keep the DRAMs busy? We're already at ~230 ns of wait time for a single page, after all.

Yet, what does this have to do with the difference between Larrabee and a GPU? It seems like all of the above issues are equally true for both systems? I'm really confused what your point is about all this.

My original point was that Larrabee's 128 threads was likely enough to saturate its available DRAM bandwidth. Do you still disagree with that assertion?
My point is: You need to be careful about how you design the hardware and how you use it. You can't just stick a bunch of CPUs together and pretend like it'll be fast.

I disagree with your assertion, unless Larrabee's clock rate is less than ~400 MHz, in which case 128 threads should be plenty.

Furthermore, since my original comment, we've now concluded that Larrabee likely has vector scatter/gather support. If that is the case, Larrabee will have no problem generating enough outstanding misses to swamp any reasonable GDDR3/4 memory system.
Great. That makes things worse, not better.


Edit: Misc typos fixed.
 
Speaking of experts, one thing we might be able to take a few notes from is Michael Abrash and Mike Sartain's highly optimized Pixomatic software DX7 renderer. Seems like they efficiently managed to do just about everything but advanced texture filtering (mipmapping, trilinear filtering) and programmable shaders in software.
Programmable shaders existed even before Pixomatic: Real Virtuality, a far ancestor of SwiftShader. The key to high-performance software rendering is dynamic code generation. Instead of branching everywhere the pipeline can be configured differently, you generate exactly the code needed for the pipeline in a given configuration. The rest of the challenge is to use the CPU's advanced instructions as efficiently as possible.
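
To illustrate the difference (a toy C++ sketch of mine, using compile-time specialization as a stand-in for the run-time machine-code generation that SwiftShader-style renderers actually do):

    // Toy illustration of "generate exactly the code for one pipeline state".
    // Real software renderers generate machine code at run time; templates are
    // used here only as a stand-in to show what disappears: the per-pixel
    // branches on pipeline configuration.

    struct PixelState { bool textured; bool alpha_blend; /* ...dozens more */ };

    // Generic path: every pixel re-tests every pipeline switch.
    unsigned shade_generic(const PixelState& s, unsigned base, unsigned tex,
                           unsigned dst) {
        unsigned c = base;
        if (s.textured)    c = (c + tex) / 2;   // branch per pixel
        if (s.alpha_blend) c = (c + dst) / 2;   // branch per pixel
        return c;
    }

    // Specialized path: the configuration is baked in, the branches are gone,
    // and the compiler can schedule/vectorize the straight-line code freely.
    template <bool Textured, bool AlphaBlend>
    unsigned shade_specialized(unsigned base, unsigned tex, unsigned dst) {
        unsigned c = base;
        if (Textured)   c = (c + tex) / 2;      // resolved at compile time
        if (AlphaBlend) c = (c + dst) / 2;      // resolved at compile time
        return c;
    }
    // The renderer picks (or generates) one such routine per pipeline
    // configuration and runs it over every pixel in that batch.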
I'm just saying that for something that is getting marketed as a GPU, Larrabee is offering a CPU-like feature set (cache, SLE, etc.), which really only makes sense if they are replacing key dedicated GPU pipeline hardware with software, which simply doesn't seem like a good idea from a performance perspective.
The GPU's shader cores execute software as well, and Larrabee will offer competitive GFLOP levels. The only thing that's fundamentally slower on a CPU is texture sampling. It can take up to a hundred clock cycles to take a bilinearly filtered texture sample (including average memory latencies), while a GPU delivers one every clock cycle (per sampler unit). That's why Larrabee will have texture sampler units too - problem solved. Also, for things like triangle setup a GPU has one, maybe two, units, while Larrabee will be able to use all its cores. There's never a bottleneck and never idle units. Future GPUs might handle setup and ROP in their shader units as well...

It's incredible how deep the "software is slow, dedicated hardware is fast" idea goes. I've had over a dozen discussions with people somehow believing that AGEIA's chips have some magic component that evaluates Newton's formulas faster than a CPU's multipliers and adders. But I digress. ;)
 
In fact, speculative locking is likely *simpler* than out-of-order execution in terms of overall impact on the core.

Ok, flame away. :)
If it helps, I'm sold on the idea of speculative locking. :) Keeping 128 threads running without any advanced form of low-overhead hardware locks would be madness.

One immediate application I can think of is shared (vertex) caches. You basically have two options: not using a cache and just processing every element you need and writing the results back to RAM, or using a shared cache and keeping the data on-chip. Obviously the latter is a better option if you want to keep bandwidth needs low. But sharing a software cache between a high number of threads requires a lock on every element. Having speculative locks then becomes a whole lot more interesting than an atomic lock that syncs all 32 cores. And I can't imagine a lock-free algorithm that would have acceptable overhead.

Could someone maybe sum up other hardware lock enhancements? There's a clear need for one on Larrabee, and so far speculative locking seems like the best candidate. Also note that it needs optimal scaling behavior. Future versions might run 1024 threads...
 
Nick,

BTW, probably not obvious from my post, but I'm not trying to knock software rendering.

Slightly OT, but if, say, your software renderer were CPU-bound on bilinear filtering, why not simply use nearest-neighbor sampling on a pre-filtered (pre-upsampled, like inverse mipmapping) dynamic texture cache that adapts to which textures need bilinear filtering (amortizing the filtering cost over many frames)?

So we take texturing off the table, what about everything else? A dedicated GPU handles much of the rest fully in parallel while the shader cores churn away. ALU and TEX numbers for the GPU only account for the shader capacity. If Larrabee is in fact going to be a pure software renderer with a hardware texture unit, "the rest" (all the non-shader parts of the GPU pipeline) will take ALU and memory bandwidth away from the shaders. My point was that it seems, from actual fully optimized software rendering examples, that "the rest" is a significant amount of work.

Given that you probably have more real-world experience with software rendering than anyone else here, care to share any insight into just what type of performance we could expect from a software implementation of DX10 on Larrabee if Larrabee had no other dedicated GPU functionality besides a texture unit?

Also, how would SwiftShader, now in 2007, running on the best quad-core chip compare to a GeForce 8800 Ultra if you turn the texture filtering off and simply push flat-shaded triangles through the pipeline?
 