Larrabee: Samples in Late 08, Products in 2H09/1H10

My source at Intel told me the vector registers are new 16-way (64-byte, 512-bit) vectors. Not only are the vectors much wider than SSE2's XMM registers, there are 32 vector registers (versus the 16 XMM registers).
All my excitement about Larrabee evaporated after reading this statement...

I was hoping for a step in the MIMD direction (that's how I read "many tiny x86 cores"). That would be something remarkably different from current GPUs. Instead, we get wider SIMD units... G80's 16-wide clusters come to mind. It seems to me that programs for Larrabee are going to look more like programs for CUDA than programs for our current CPUs.

Can somebody explain to me what the supposed advantage of the Larrabee model is over the G80 model? What are the truly new opportunities to explore?
I asked the same question. Apparently, it is just about reusing existing tools. Using standard x86 for everything else means they can start with the Intel x86 compiler for scalar code, use the same page tables, and the same I/O device interfaces. They can get a simple kernel up and running quickly. They already have tools for tracing, simulating, and verifying x86. To Intel, x86 is like that quirky uncle that is more endearing than annoying most of the time. They are so used to its quirks that they really don't mind them anymore. Strange but true.
The argument about reusing existing compilers doesn't sound convincing to me. HW implementations of x86 differ in details (like various penalties for complicated addressing modes or instruction length, decoding constraints, etc.), and programs for them need to be optimized differently. Why else would Intel develop its own C++ compilers? I doubt they can just pick an existing compiler, throw it at Larrabee, and expect it to work optimally. I think it's certain that Larrabee will require a new compiler. And the quirks of x86 make writing compilers for it harder, not easier.
 
I was hoping for a step in the MIMD direction (that's how I read "many tiny x86 cores"). That would be something remarkably different from current GPUs. Instead, we get wider SIMD units... G80's 16-wide clusters come to mind. It seems to me that programs for Larrabee are going to look more like programs for CUDA than programs for our current CPUs.

Yes, Larrabee's reliance on SIMD (vectors) is remarkably similar in that way to the 16-wide clusters in the G80.

If you want lots of simple cores, AGEIA's next-generation chip is supposed to have 720 simple cores. Yes, 720. These are just simple, scalar cores with no fancy SIMD units or anything. I was told that they didn't find SIMD that useful for their physics computations. It will also use some sort of hardware caches, but not be cache coherent. The programmer will be required to explicitly force writebacks out of the cache to make them globally visible. There is some sort of support for read-read sharing (caching of the same read-only data in multiple caches), but I think the software is responsible for flushing the caches at synchronization points. My point: if you just want a chip with lots of cores, you might find AGEIA's chip really interesting. It also targets the same 2009/2010 timeframe as Larrabee.

The argument about reusing existing compilers doesn't sound convincing to me.

I think you're underestimating the advantage of having something on day one that can produce code that runs (versus needing to argue about an ISA definition, define it precisely, and then re-target a compiler to it). The compiler will need to be tweaked to generate good code for Larrabee, but tweaking the scheduling pass in a compiler is simpler than retargeting a compiler to a new ISA. It also isn't just the compiler. It is the assembler, debuggers, performance tuning tools, etc. All of these are easier to modify to support some new vector instructions than to re-target to a new ISA.

Plus, Intel really likes the idea of "x86 everywhere". :)
 
AP, clearly one of us is confused. I'd like to think that it isn't me, or at least, that my previous DRAM controller designs do work as intended :)

Bob, you certainly understand the specifics of DRAM controllers, I don't doubt that (now). I'm just trying to understand the system-level effects of what you're saying.

In this message I'll try to ask some more detailed questions about the numbers you gave and such. In my next message I'll try to address the higher-level issues.

I think part of the issue is that for CPU memory systems, you generally don't have 90% utilization of the DRAM bandwidth. Most of single-core CPU memory system design is about minimizing and hiding latency. I realize that is different for GPUs and many-core CPUs, but I think I've been a bit slow to adjust to the difference. I think my conclusions still hold, but let me comment more below.

It's also not configured the same as mine, and will have different read-to-write and write-to-read turnaround times (the configuration I picked lets me be symmetric).

The GDDR4 DRAM you picked actually does have asymmetric read/write turnaround times. I think that was part of the confusion. As shown on page 44 (I accidentally said 34 in my last post), Read->Write transitions don't have much penalty (just a few bus cycles). However, page 48 shows that the Write->Read penalty is the nearly 30ns that you gave in your post. So the Read<->Write transition issues are certainly real, but they aren't as bad as you made them out to be in your first post.

If you delay writes, you need to store the write address/data somewhere. In a queue or in a cache for example. If you use a queue, that's dedicated hardware (= area) and only solves a small part of the problem. Moreover, at some point you will need to write that data. That will cause pending reads to be pending for potentially a long time. If you use a cache, you end up pinning down lines in your cache for potentially too long, which results in less cache available for other tasks.

I think having a queue at each memory controller wouldn't be that bad. With 128 threads in the system and, say, four memory controllers, having a 64-entry queue at each memory controller (4KB) isn't that bad for a system with 8MB of L2 cache.
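 
(A quick sanity check on that figure, assuming 64-byte cache lines: 64 entries x 64 bytes = 4KB per controller, so four controllers buffer at most 16KB of pending writes in total, which is tiny next to an 8MB L2.)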

But, yes, I agree that queueing up a bunch of stores and then doing a large store burst would add significant latency to the reads queued up behind it. It could easily double the latency.

You're also assuming that writes don't happen all that often.

I'm assuming that writes are less frequent than reads. I'm also assuming that "writes" are writebacks from the coherent caches. In that sense, such writes aren't on the critical path.

Some rhetorical questions for you: Where do the outputs of a vertex or geometry shader sit? In an on-chip queue? In a cache that can spill to memory? How much output are we talking about to sustain high utilization of the chip? Same question for the pixel data that you write out. Where does it go? How many pixels are you emitting and how much data are we talking about?

These all go through the coherent memory system, generating cache misses and cache writebacks as needed. With 8MB of cache on chip, you'll get some buffering and write combining effects. If a software queue to communicate between two stages of the graphics pipeline can fit in the on-chip caches, it won't generate DRAM traffic. My rhetorical question to you: how is this any worse than a normal GPU?
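 
To make the "software queue between pipeline stages" idea concrete, here is a minimal sketch (mine, not anything Intel has described) of a bounded single-producer/single-consumer ring buffer; if its capacity is chosen so the storage fits in the on-chip caches, passing work between two stages never has to touch DRAM:

Code:
#include <atomic>
#include <cstddef>

// Minimal single-producer/single-consumer ring buffer.
// Sized so the storage fits comfortably in an on-chip cache,
// so traffic between two pipeline stages stays off the DRAM bus.
template <typename T, size_t Capacity>
class StageQueue {
public:
    bool push(const T& item) {                 // producer stage
        size_t head = head_.load(std::memory_order_relaxed);
        size_t next = (head + 1) % Capacity;
        if (next == tail_.load(std::memory_order_acquire))
            return false;                      // full: producer must wait
        buffer_[head] = item;
        head_.store(next, std::memory_order_release);
        return true;
    }
    bool pop(T& item) {                        // consumer stage
        size_t tail = tail_.load(std::memory_order_relaxed);
        if (tail == head_.load(std::memory_order_acquire))
            return false;                      // empty: consumer must wait
        item = buffer_[tail];
        tail_.store((tail + 1) % Capacity, std::memory_order_release);
        return true;
    }
private:
    T buffer_[Capacity];
    std::atomic<size_t> head_{0};
    std::atomic<size_t> tail_{0};
};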

If you look at the command bus, there are 14 dead cycles between the read and the write. That's 22.4 ns. Notice the ellipsis in the timing diagram.

Where are these 14 dead cycles? "DQ" is the data bus, right? Are we looking at the figure on page 44 titled "READ to WRITE"?

The data bus is a red herring: We know it will be underutilized, even if the command bus is saturated because of the overhead of issuing commands in a non-bursty fashion.

Why is the data bus a red herring? Isn't the utilization of the data bus the key issue? Isn't that the goal?

I'm not sure what you mean by “uncontended”. If you mean that only a single thread (or a very few threads) are issuing requests to the DRAM, then that's far from being the interesting case. It's like saying “my CPU runs really quickly when virtually nothing is running on it”.

Actually, for single-thread CPUs, that is the common case. However, I do appreciate that for many-core and GPU architectures, that isn't the common case.

The interesting case is when you try to maximize your bandwidth, so that you may maximize your shader computations in real-world scenarios.

OK, that seems reasonable.

Let's start by laying down the groundwork. What bandwidth do you want and/or can afford? Let's pick some numbers. 512-bits to DRAM with GDDR with a 1.6 ns cycle time, same as above. That works out to a peak bandwidth of 80 GB/sec. Let's say we want to be 90% efficient, for a sustained bandwidth of 72 GB/sec. Let's also assume no DRAM refresh needs to happen, that no external clients need to access the memory for trivialities like “displaying a picture on your monitor” and page switches are free.

Okay.
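 
(For reference, the peak figure presumably comes from the 512-bit interface transferring data on both clock edges: 2 x 64 bytes per 1.6 ns cycle = 128 bytes / 1.6 ns = 80 GB/sec, and 90% of that is the 72 GB/sec sustained target.)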

What do we need to do to hit our sustained rate?... Thus, to hit 90% efficiency on your DRAMs, you need to issue at least 36 commands in one direction before you can switch.

Ok, but if we halve the read/write turnaround time (as per above) and target, say, 80% utilization, then you only need write bursts of 9 (which seems much more reasonable than 36).

That means that if you are doing a read and then need to do a write, you need to wait over 230 ns. And then, you will need to occupy the DRAMs with 230 ns worth of writes before you can resume reads.

Following your math, if you only need bursts of size 9, then you're only waiting for 54ns, which seems much more reasonable.

Ok, now to tackle the question of the magic memory controller/scheduler.

I'm friends with one of the guys who does AMD's DRAM controllers for their multi-core CPUs. That is pretty much all he works on right now. I agree that it isn't a trivial thing to do.
 
If you want lots of simple cores, AGEIA's next-generation chip is supposed to have 720 simple cores. Yes, 720.

720 cores sounds amazing. But without SIMD it seems impossible to me to feed these cores with instructions from memory. Therefore each core needs some local storage for code and data, like the CELL SPUs. But with 720 cores I think we cannot expect more than 4KB for this. The other option would be a larger shared program store on chip and a single IP per core. But I don't think that building a memory block with hundreds of read ports is a good idea in general.
 
My point is: You need to be careful about how you design the hardware and how you use it. You can't just stick a bunch of CPUs together and pretend like it'll be fast.

Sure, you need to be careful. I agree. But how is sticking a bunch of CPUs together different from sticking a bunch of GPU processing elements together? What is fundamentally different about a GPU? What exactly are you worried about in terms of how Larrabee's model will compare to a GPU model? Too few threads on Larrabee? Too many threads? Block size? What?

I disagree with your assertion, unless Larrabee's clock rate is less than ~400 MHz, in which case 128 threads should be plenty.

I have to say, I don't understand this at all. I would have said the opposite. The faster the cores the *fewer* threads you would need (not the other way around). In the simplest model, each CPU is generating requests. The time between requests is the "compute time" (the computation between generating misses) plus the "wait time" (the service time and queueing delay for the miss). The faster the clock, the smaller the compute time. Thus, if all else is held equal, the system with the same number of threads will generate misses more quickly with faster clocks.

Great. [scatter/gather support] makes things worse, not better.

Again, I would have said the opposite. Scatter/gather means more outstanding misses in parallel. The more outstanding misses in parallel, the less latency-sensitive any one miss is, and the easier it is to drive the DRAM utilization into the 90+ percent range. Again, why does my intuition tell me the opposite of yours?

We're obviously coming to very different conclusions about this. I'd like to try to understand why.
 
BTW, probably not obvious from my post, but I'm not trying to knock software rendering.
I had no reason to think that. I was just pointing out that a CPU FLOP is just as good as a GPU FLOP, and the only thing missing for high-performance graphics is texture samplers.
Slightly OT, but if, say, your software renderer were CPU bound on bilinear filtering, why not simply use nearest-neighbor sampling on a pre-filtered (pre-upsampled, like inverse mipmapping) dynamic texture cache which adapts to which textures need bilinear filtering (amortizing the filtering cost over many frames)?
Bilinear filtering is only part of the texture sampling cost. There's address generation, mipmap computation, reading the actual texels, transposing, converting to floating-point, etc. A shader MUL instruction translates to an SSE multiply instruction, but a TEX instruction translates to dozens of instructions on a CPU, even for point sampling. Pre-filtering is not very interesting, unfortunately. The quality isn't comparable to bilinear filtering, it trashes the caches, and it takes up extra memory.
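 
To give a feel for why even point sampling is expensive in software, here is a rough sketch (mine, not SwiftShader's code) of what a single nearest-neighbor fetch from an RGBA8 texture expands to, before mipmap selection or filtering even enter the picture:

Code:
#include <cstdint>

struct Texture {
    const uint32_t* texels;   // RGBA8 texels, one mip level for simplicity
    int width, height;        // dimensions of this level (power-of-two assumed)
};

// One nearest-neighbor fetch: address generation, wrapping, the texel read,
// and conversion of the packed RGBA8 texel to four floats. A hardware TEX
// unit does all of this (plus mip selection and filtering) in fixed-function
// logic; in software it is already a dozen-plus instructions.
inline void sample_nearest(const Texture& tex, float u, float v, float rgba[4]) {
    // Address generation with wrap addressing.
    int x = static_cast<int>(u * tex.width);
    int y = static_cast<int>(v * tex.height);
    x &= (tex.width  - 1);    // assumes power-of-two dimensions
    y &= (tex.height - 1);
    // Texel read.
    uint32_t t = tex.texels[y * tex.width + x];
    // Unpack and convert to floating point.
    const float s = 1.0f / 255.0f;
    rgba[0] = ((t      ) & 0xFF) * s;
    rgba[1] = ((t >>  8) & 0xFF) * s;
    rgba[2] = ((t >> 16) & 0xFF) * s;
    rgba[3] = ((t >> 24) & 0xFF) * s;
}
 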
So if we take texturing off the table, what about everything else? A dedicated GPU handles much of the rest fully in parallel while the shader cores churn away; ALU and TEX numbers for a GPU only account for the shader capacity. If Larrabee is in fact going to be a pure software renderer with a hardware texture unit, "the rest" (all the non-shader parts of the GPU pipeline) will take ALU time and memory bandwidth away from the shaders. My point was that, judging from actual fully optimized software rendering examples, "the rest" is a significant amount of work.
Instead of these other fixed-function units, Larrabee has extra cores and eliminates potential bottlenecks. Like I said, future GPUs might go down that road as well now that they've already unified the vertex and pixel shader units.
Given that you probably have more real-world experience with software rendering than anyone else here, care to share any insight into just what kind of performance we could expect from a software implementation of DX10 on Larrabee, if Larrabee had no other dedicated GPU functionality besides a texture unit?
Your guess is as good as mine. Larrabee is very different from a CPU and I'm not sure if we should call it software rendering any more. Anyway, given the GFLOPS and the texture samplers I see little reason why it wouldn't succeed as a GPU. But Larrabee obviously isn't Intel's way to go head-on with NVIDIA/AMD. Its versatility is what makes it most interesting. Larrabee will be good at ray-tracing, physics, A.I., weather prediction...
Also, how would SwiftShader, running now in 2007 on the best quad-core chip, compare to a GeForce 8800 Ultra if you turn texture filtering off and simply push flat-shaded triangles through the pipeline?
Honestly, it's not optimized for flat-shaded triangles. ;) But if you did optimize for just that, it would be bandwidth limited. This also points out a significant advantage of Larrabee over current GPUs: it isn't restricted to the standard graphics pipeline. If you're doing something like rendering rectangular GUI elements, you don't have to render triangles; you can just use a (multi-threaded) loop over x and y and interpolate only the components that have a non-zero gradient. The setup cost is exactly what you need and nothing more. GPUs always have parts that are doing nothing useful.
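 
As a toy illustration of that last point (again my sketch, not SwiftShader code): filling a rectangular GUI element with a horizontal gradient needs nothing more than a loop over the pixels, split across threads, and only the one component that actually varies gets interpolated:

Code:
#include <algorithm>
#include <cstdint>
#include <thread>
#include <vector>

// Fill a rectangle with a horizontal alpha gradient. Only the component
// with a non-zero gradient (alpha along x) is interpolated; everything
// else is constant, and there is no triangle setup at all.
// Assumes the rectangle is at least two pixels wide.
void fill_rect_gradient(uint32_t* fb, int pitch,
                        int x0, int y0, int x1, int y1,
                        uint32_t rgb, unsigned threads = 4) {
    auto rows = [&](int yBegin, int yEnd) {
        for (int y = yBegin; y < yEnd; ++y) {
            uint32_t* row = fb + y * pitch;
            for (int x = x0; x < x1; ++x) {
                uint32_t alpha = 255u * (x - x0) / (x1 - x0 - 1);
                row[x] = (alpha << 24) | rgb;
            }
        }
    };
    // Split the rows across a handful of worker threads.
    std::vector<std::thread> pool;
    int span = (y1 - y0 + threads - 1) / threads;
    for (unsigned t = 0; t < threads; ++t) {
        int b = y0 + static_cast<int>(t) * span;
        int e = std::min(y1, b + span);
        if (b < e) pool.emplace_back(rows, b, e);
    }
    for (auto& th : pool) th.join();
}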
 
720 cores sounds amazing. But without SIMD it seems impossible to me to feed these cores with instructions from memory. Therefore each core needs some local storage for code and data, like the CELL SPUs. But with 720 cores I think we cannot expect more than 4KB for this. The other option would be a larger shared program store on chip and a single IP per core. But I don't think that building a memory block with hundreds of read ports is a good idea in general.

Yea, I don't know what AGEIA is planning here. Sounds like a real issue to me. If you assume something like Larrabee's ~10MB of total on-chip cache and divide it among 720 cores, that is only about 14KB per core. Maybe a 4KB instruction cache and an 8KB data cache? Seems pretty tiny to me. Perhaps they could gang four or eight of these cores together to share a dual-ported cache? I dunno.
 
The faster the cores the *fewer* threads you would need (not the other way around).
Fewer threads and faster cores mean that you have fewer opportunities to hide memory latency.
If Larrabee uses in-order execution, what are all those cores going to work on while they are all waiting for outstanding cache misses to be served?
We already know that 128 threads are simply not enough once texturing enters the equation.
 
The faster the cores the *fewer* threads you would need (not the other way around).

Let’s do some math.

If we assume that every thread needs a data fetch (wt = wait time) after its compute block (ct = compute time), we need at least wt/ct threads to keep the core busy. If we have fewer threads, the core will stall while all threads wait for data. If we now double the core clock, ct is reduced to half. Therefore we need wt/(ct/2) threads, or more generally wt/(ct/x), where x is the clock increase factor.
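 
To put rough numbers on that (illustrative values, not Larrabee's): with wt = 200 ns and ct = 100 ns you need about 200/100 = 2 threads per core to stay busy; double the clock (x = 2) and ct drops to 50 ns, so you now need 200/50 = 4 threads, and the core is also generating misses twice as fast.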
 
Fewer threads and faster cores mean that you have fewer opportunities to hide memory latency.

Consider this: if the number of threads is fixed, does having faster cores increase or decrease the number of parallel misses (and correspondingly the DRAM bandwidth utilization)? My point is, slowing down the processors (with the same number of threads) will just reduce the number of outstanding misses (as the processor will take longer to generate the next miss).

Edit: when I say "number of outstanding misses" above, I'm talking about the total number of outstanding misses from all cores, not on a per-core basis.
 
If we assume that every thread needs a data fetch (wt = wait time) after its compute block (ct = compute time), we need at least wt/ct threads to keep the core busy. If we have fewer threads, the core will stall while all threads wait for data. If we now double the core clock, ct is reduced to half. Therefore we need wt/(ct/2) threads, or more generally wt/(ct/x), where x is the clock increase factor.

This I 100% agree with. The number of threads per core you need to keep the core ALUs busy goes up as the processors get faster. I don't think I said anything that was inconsistent with this analysis.

The context of the comments by Bob was in terms of DRAM bandwidth utilization. That is what confused me. If he was saying that four threads per processor isn't enough to keep the CPUs busy, then I see what he was trying to get at and I basically agree (and it isn't inconsistent with my earlier comments).

The faster the processors, the more the core utilization will decrease (the more time waiting on memory). The faster the processors, the more demand that will be placed on the memory system, meaning the utilization of the memory system will go up.
 
The context of the comments by Bob was in terms of DRAM bandwidth utilization.

Sorry I missed this part.

But in the end it is all about balancing. A common rule in the GPU world is that for every bilinear sample you need to fetch one texel. As you don't want your texture units to suffer from low bandwidth, you calculate the clock rate based on the number of texture units and the bandwidth you have. But this doesn't stop you from running the calculation blocks in another clock domain (like G80). The "right" clock rate here depends on your prediction for the ALU:TEX instruction ratio.
 
Programmable shaders existed even before Pixomatic: Real Virtuality, a far ancestor of SwiftShader. The key to high-performance software rendering is dynamic code generation. Instead of using branches everywhere the pipeline can be configured differently, you generate exactly the code needed for the pipeline in a given configuration. The rest of the challenge is to use the CPU's advanced instructions as efficiently as possible.
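 
To give a flavor of what "generate exactly the code needed" means, here is a sketch of mine (not how SwiftShader actually works; a real implementation emits machine code at runtime) using compile-time specializations selected once per pipeline state change, so the per-pixel loop contains no state checks:

Code:
#include <cstdint>

struct PipelineState { bool alphaBlend; bool fogEnable; };

// One fully specialized inner loop per pipeline configuration:
// the enabled features are fixed at compile time, so the per-pixel
// code contains no state checks at all.
template <bool AlphaBlend, bool FogEnable>
void shade_span(uint32_t* dst, const uint32_t* src, int count) {
    for (int i = 0; i < count; ++i) {
        uint32_t c = src[i];
        if (FogEnable)  c = (c >> 1) & 0x7F7F7F7F;          // placeholder fog
        if (AlphaBlend) c = ((c >> 1) & 0x7F7F7F7F) +
                            ((dst[i] >> 1) & 0x7F7F7F7F);   // placeholder blend
        dst[i] = c;
    }
}

using SpanFn = void (*)(uint32_t*, const uint32_t*, int);

// "Code generation" stand-in: pick the specialized routine once, when the
// pipeline state changes, instead of branching per pixel.
SpanFn select_span(const PipelineState& s) {
    if (s.alphaBlend) return s.fogEnable ? shade_span<true, true>
                                         : shade_span<true, false>;
    else              return s.fogEnable ? shade_span<false, true>
                                         : shade_span<false, false>;
}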

Neat, dynamic generation of instructions on 24 Harvard architecture cores.
Would it involve self-modifying code or some kind of code page selection?

I don't think any recent multithreaded high-end CPUs have used round-robin multithreading (including Intel's). They either use SMT, which dynamically steers individual instructions to ALUs, perhaps even issuing from different threads in the same cycle, or they use "switch on event" multithreading (CMT), in which one thread runs until it stalls, then the core switches to another thread. I would expect that Larrabee would use one of these approaches.
High performance or high throughput, CPU or GPU?
Sun's T1 and T2 chips use a modified variant of fine-grained multithreading, based on issuing instructions from a group of threads round-robin and demoting a thread from the active group on long-latency events.
On the GPU front, R600 uses two-way FMT within its SIMD units.

I'm not sure bypassing the cache helps. DRAMs are designed for consecutive bursts. The GDDR4 datasheet referenced earlier in this thread requires bursts of length 8 of 4 bytes each. That is, the minimum you can read out of this DRAM is 32 bytes. Doing a scatter/gather in which you need 32 bits from all over will have similar inefficiencies just because of the way DRAM works. Once you already need to grab bursts of data, why not do block-based cache coherence?
My concern is cache pollution in the L1. It's already shared by 4 threads, and now one thread is pulling in a lot of cache lines for just one operation.
If it is known that the cache lines will be reused prior to being evicted, then storing in cache would make sense.
If a lot of the data is discarded, it might make sense to have some kind of buffer that can bypass the cache.
If a gather is a microcoded instruction, it could either be a string of loads or a string of prefetches to a prefetch buffer.
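 
(Rough numbers, using the 32-byte minimum burst from the quote above: a 16-element gather of 4-byte values that touches 16 different bursts pulls 16 x 32 = 512 bytes from DRAM to deliver 64 useful bytes, i.e. 12.5% efficiency, with or without a cache in the way. So a bypass or prefetch buffer mostly helps with pollution rather than with raw DRAM efficiency.)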


I'm not sure I quite follow you. Certainly if you mis-speculate, you might take a hiccup. The same is true for branch prediction, yet if you're mostly right, things work well.
With branch mispredictions, the scope of side effects is somewhat more contained.
Usually, the processor has to stall if the branch reaches the end of the pipeline, the bottom of the instruction window reaches the branch, or the CPU runs out of load/store entries. That might reach the high double digits of cycles, worst-case.
If the limit is now the capacity of a multi-kilobyte cache, it can potentially be longer.
For reads, it's a hiccup, but like you said, it's not something that doesn't already happen.

If Larrabee does not use a non-coherent buffer for its speculation, the following might be problems:

Writes under standard branch prediction aren't allowed to commit until the branch has been resolved. Speculatively writing to cache, on the other hand, has a wider reach and implications for the supposed uniqueness of the Modified state.

In a MESI protocol, it would be possible for a speculating thread to invalidate all shared copies of a cache line, then discard the line when it finds the speculation failed. I think it might be safe to keep a copy of the shared lines, but the fact that they are not unique might make this unsafe.
In a MESIF protocol, there is a possible way out, so long as the processor tracks the original value of any Forwarding lines it writes to. On roll-back it could keep the Forwarding lines to seed return values to the various cores whose shared entries were invalidated.
This saves a trip to memory, but it also makes one core a potential hot-spot on the ring bus.
The simplest case would be to just invalidate everything and start from memory.

The funny case would be a three stooges situation, where multiple threads try to speculate on the same lock.
If they follow similar code paths, it's only a matter of time until each thread picks up on the coherency traffic of another thread and all threads involved roll back.

A single speculating thread now has the ability to affect the execution of any number of the 128 threads Larrabee is running.
I'm a believer in good thread citizenship, which is why I like the idea of a buffer space or a way to keep some of the side effects from being spilled out by cache coherency until the locking has been confirmed.
 
What is this assertion based on?

I'm guessing nAo is referring to the fact that current GPU hardware already caches texture accesses about as optimally as you could expect Larrabee to do, and even with a good read-only texture cache, 128 threads is simply not enough to hide the latency of cache misses for texture gathers given the number of vector elements Larrabee is going to have on each core.

Larrabee: 32KB split L1 -> 256KB L2 per core (16-way SIMD per core, 4 threads)

8800 Ultra: 8KB L1 texture -> 128KB? L2 per multiprocessor element (8-way SIMD per multiprocessor, 96 8-way SIMD threads)

128KB? -> number collected from B3D speculation?
 
Neat, dynamic generation of instructions on 24 Harvard architecture cores.
Would it involve self-modifying code or some kind of code page selection?
What Harvard architecture? Self-modifying code usually means changing one instruction relatively shortly before executing it. That's not what I'm talking about. Dynamic code generation creates whole new functions. It's essentially an embedded compiler.
 
My concern is cache pollution in the L1. It's already shared by 4 threads, and now one thread is pulling in a lot of cache lines for just one operation.

Yea, cache pollution could certainly be an issue in the L1. Doing some sort of cache bypass as you suggest would help.

If Larrabee does not use a non-coherent buffer for its speculation, the following might be problems

I wouldn't call it non-coherent. It is still coherent; you're just making sure it doesn't leave your cache. Larrabee has private L1s and L2s, with inclusion enforced between the L1 and L2. When the processor is speculating and performs a write, the speculative update goes in the L1. The original non-speculative value is put in the L2. When incoming coherence events occur, the L1 and L2 work together to detect conflicts. So you're really not breaking the "single writer" property (the uniqueness of the Modified state) or anything else of regular cache coherence. For example, the system likely doesn't support multiple "speculative writers" to the same block (the only one that can speculatively write a block is the one with the block in Modified). Only one at a time. As you point out, allowing multiple such speculations could cause lots of problems.

You're keeping all the speculation local to the private caches of the core. Basically, the issues you raised are real issues, but a correct design that takes care of these issues can be built without too much difficulty (but it isn't trivial).
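 
A toy model of that arrangement, as I understand the description above (my sketch, not Intel's design): the speculative version of a line lives in the "L1", the pre-speculation version is preserved in the "L2", and any incoming coherence probe that hits a speculatively written line triggers a rollback.

Code:
#include <cstdint>
#include <unordered_map>

// Per-line bookkeeping for one core's private L1/L2 pair.
struct LineVersion {
    uint64_t l1_value = 0;        // speculative version (local to this core)
    uint64_t l2_value = 0;        // original, non-speculative version
    bool     spec_written = false;
};

class SpeculativeCache {
    std::unordered_map<uint64_t, LineVersion> lines_;
    bool speculating_ = false;

public:
    void begin_speculation() { speculating_ = true; }

    void write(uint64_t addr, uint64_t value) {
        LineVersion& v = lines_[addr];
        if (speculating_ && !v.spec_written) {
            v.l2_value = v.l1_value;      // preserve the old version in "L2"
            v.spec_written = true;
        }
        v.l1_value = value;               // new version lives in "L1"
    }

    void probe(uint64_t addr) {           // coherence event from another core
        auto it = lines_.find(addr);
        if (speculating_ && it != lines_.end() && it->second.spec_written)
            abort_speculation();          // conflict detected: roll back
    }

    void commit() {                       // speculation succeeded
        for (auto& kv : lines_) kv.second.spec_written = false;
        speculating_ = false;
    }

    void abort_speculation() {            // restore pre-speculation values
        for (auto& kv : lines_) {
            if (kv.second.spec_written) {
                kv.second.l1_value = kv.second.l2_value;
                kv.second.spec_written = false;
            }
        }
        speculating_ = false;
    }
};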

The funny case would be a three stooges situation, where multiple threads try to speculate on the same lock. If they follow similar code paths, it's only a matter of time until each thread picks up on the coherency traffic of another thread and all threads involved roll back.

If necessary, the system could be designed to always ensure that the oldest thread wins any conflicts. That at least ensures that one of the speculators succeeds in the face of conflicts. This was explored in a paper called "Transactional Lock Removal" (TLR) by the same authors as the original SLE paper.

A single speculating thread now has the ability to affect the execution of any number of the 128 threads Larrabee is running.

Such are the perils of shared-memory programming. False sharing or lock contention can have the same problems.

I'm a believer in good thread citizenship, which is why I like the idea of a buffer space or a way to keep some of the side effects from being spilled out by cache coherency until the locking has been confirmed.

Yep, that is basically what the L1 and L2 caches work together to achieve.
 
What Harvard architecture? Self-modifying code usually means changing one instruction relatively shortly before executing it. That's not what I'm talking about. Dynamic code generation creates whole new functions. It's essentially an embedded compiler.

I wasn't thinking of any particular time limit on code being self-modifying, other than its persistence in the instruction cache.

I was curious if the dynamic configuration idea meant Larrabee's threads would alter code paths as they monitored performance events, or if new code would be built in different memory locations and then Larrabee would direct execution to them.

The whole built-in compiler thing isn't particularly new for graphics, so I thought this would be different.
 
Let me sum it up this way: a data structure with a coarse-grain lock is a CS101 assignment. A data structure with fine-grained locking is perhaps a junior- or senior-level project for a CS major. A good lock-free algorithm can earn you a PhD. Why not build speculative locking hardware to turn a PhD-level problem into a CS101 project?

I absolutely agree. Every transistor spent on allowing transactional memory or speculative locks is a transistor well spent in my book.
The attractiveness of transactional memory is not just that it makes multi-threaded programming easier; it also makes it semantically more correct. Algorithms using transactional memory can be correctly composed, as opposed to locks, where this is not always possible.
Think of transactional memory as lock-free programming that can operate on arbitrary amounts of scattered data (as opposed to 32/64-bit CAS).
I think AP's ideas about speculative locks can equally be applied to a hardware transactional memory implementation. The L1 cache is used to store pending results, which are flushed atomically by the hardware. It raises some interesting implementation details, such as what happens if there are not enough cache lines (or not enough associativity) to satisfy a read/write request while inside a transactional memory section. I presume an implementation is always allowed to just abort and go back to the checkpoint, meaning of course the thread will likely never be able to make progress, but this would be easy to test and remedy by the programmer.
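 
To make the coarse-lock-versus-lock-free contrast concrete (my sketch, not anything proposed in the thread): the locked version below is the CS101 exercise, while even the push half of a Treiber-style lock-free stack already demands care, and a correct pop needs ABA/memory-reclamation handling that is deliberately omitted here — exactly the kind of subtlety speculative locks or transactional memory would let you skip.

Code:
#include <atomic>
#include <mutex>

struct Node { int value; Node* next; };

// CS101 version: one coarse lock around the whole structure.
class LockedStack {
    std::mutex m_;
    Node* head_ = nullptr;
public:
    void push(Node* n) {
        std::lock_guard<std::mutex> g(m_);
        n->next = head_;
        head_ = n;
    }
};

// Lock-free push (Treiber stack). Correct on its own, but a matching
// lock-free pop must deal with the ABA problem and with reclaiming
// nodes safely -- the part that makes the full algorithm hard.
class LockFreeStack {
    std::atomic<Node*> head_{nullptr};
public:
    void push(Node* n) {
        n->next = head_.load(std::memory_order_relaxed);
        while (!head_.compare_exchange_weak(n->next, n,
                                            std::memory_order_release,
                                            std::memory_order_relaxed)) {
            // n->next was refreshed with the current head; retry.
        }
    }
};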
 
Consider this: if the number of threads is fixed, does having faster cores increase or decrease the number of parallel misses (and correspondingly the DRAM bandwidth utilization)? My point is, slowing down the processors (with the same number of threads) will just reduce the number of outstanding misses (as the processor will take longer to generate the next miss).
Stop thinking about DRAM utilization for a moment: do you want your cores to be able to do some work, or to stall half of the time?
Regarding the number of threads we need to hide memory latency... do you have a rough idea of how many threads current GPUs can handle at the same time in order to hide such latency? Hint: many more than 128 :)
 