Larrabee: Samples in Late 08, Products in 2H09/1H10

Fewer threads and faster cores mean that you have fewer opportunities to hide memory latency.
If Larrabee uses in-order execution, what are all those cores going to work on if they are all waiting for outstanding cache misses to be served?
We already know that 128 threads are simply not enough once texturing enters the equation.

They could simply use two-stage loads or prefetching and software pipelining. There are actually a LOT of options to hide LOTS of latency with a single thread and a VERY regular workload like graphics. Remember, all these "programs" are pre-compiled by the driver, which can do a lot of tricks and also use realtime feedback to incorporate even more tricks.
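
To make that concrete, here is a minimal sketch of a software-pipelined, two-stage load loop on a single in-order thread (my own illustration, not anything Intel has described; Fragment, texel_address() and shade() are hypothetical stand-ins for real shading code):

Code:
#include <cstddef>

struct Fragment { float u, v, color; };

// Hypothetical address calculation from texture coordinates (assumes a 256-texel-wide texture).
static const float *texel_address(const float *texture, const Fragment &f) {
    return texture + static_cast<std::size_t>(f.v) * 256 + static_cast<std::size_t>(f.u);
}

// Stand-in for the ALU work of a fragment program.
static void shade(Fragment &f, const float *texel) {
    f.color = *texel * 0.5f;
}

void shade_batch(Fragment *frags, std::size_t n, const float *texture) {
    const std::size_t K = 8;  // prefetch distance, tuned to memory latency (assumed value)
    for (std::size_t i = 0; i < n; ++i) {
        // Stage 1: request the texel for iteration i+K long before it is needed (GCC/Clang builtin).
        if (i + K < n)
            __builtin_prefetch(texel_address(texture, frags[i + K]));
        // Stage 2: consume the (hopefully now cached) texel for iteration i.
        shade(frags[i], texel_address(texture, frags[i]));
    }
}

The compiler/driver would pick K based on the expected miss latency and the amount of work per fragment.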

Aaron Spink
speaking for myself inc.
 
Not to quote myself but,



Looks like Larrabee would need like at least 24 times the number of threads per core to be able to hide texture cache misses?

You are assuming a GPU programming paradigm where things like registers are extremely limited and you are effectively working on one poly per thread.

With greater programmability and flexibility you can do things fairly differently. So fragment A needs a texture read, you send off the read and continue working on fragments B, C, D, E, F, G, H, I, J, K, L, M, N, O, and P and then come back and deal with the texture read for A. Unlike GPUs which have VERY limited programmability, an x86 processor can run a thread that is a task chain! Between task chaining, and the hardware multi-threading, something with 4 hardware contexts can act like a device with a GPU "processor"/unit with 64 or 128 threads.
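
A rough sketch of that pattern (my illustration, not Aaron's actual code; it reuses the hypothetical Fragment/texel_address/shade helpers from the software-pipelining sketch earlier in the thread):

Code:
// Phase 1: send off the "reads" (here just prefetches) for fragments A through P.
// Phase 2: come back to A once its data has had time to arrive.
void shade_batch_chained(Fragment *frags, const float *texture) {
    const int BATCH = 16;                  // fragments A..P
    const float *pending[16];

    for (int i = 0; i < BATCH; ++i) {
        pending[i] = texel_address(texture, frags[i]);
        __builtin_prefetch(pending[i]);    // issue the read, don't wait for it
    }
    for (int i = 0; i < BATCH; ++i)
        shade(frags[i], pending[i]);       // by now A's texel should be in cache
}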

Aaron Spink
speaking for myself inc.
 
Why would you want to use speculation when you can simply fetch the data you need and hide latency anyway? Caches are likely to be less dense than a big register file that doesn't require tagging, while hw prefetching is likely to require extra area as well.

General rule of thumb, for a given data size, caches are always more dense than register files.

Aaron Spink
speaking for myself inc.
 
Unfortunately this technique also forces you to group together more stuff in a single computational unit, which means you are going to feel more pain when dynamic branching enters the equation, while a G8x-like architecture will always have the same thread granularity, no matter how many registers per thread you need.

edit: don't want to downplay software pipelining, it's a very powerful technique that I use very often.

Yes, but remember that software pipelining doesn't have to mean loop unrolling and can just as easily mean task pipelining, which doesn't have any of the nastiness of branches...
 
I know that. The working set in GPUs is obviously huge compared to all that though.

Um....
Um..........
Um.................

WHAT!?!?!?

Dude, most server applications have larger working set sizes than GPUs have memory or address space, or anything, basically!

Working set sizes for DBs are upwards of 10s of GBs. Some into the 100s of GBs.

Most server workloads are EXTREMELY latency dominated. And caches make a HUGE difference. And the workloads tend to have very low locality.

Aaron Spink
speaking for myself inc.
 
WHAT!?!?!?

Dude, most server applications have larger working set sizes than GPUs have memory or address space, or anything, basically!
Maybe I should have been a bit clearer, since in retrospect what I meant wasn't really obvious. I was replying to Nick who said: "for the sole reason that the working set fits in the cache."

Obviously, Nick didn't imply that all server working sets would fit 100% into the cache, and neither did I. The only ones that are likely to are the working sets of some synthetic benchmarks, which effectively just makes them much less useful. What I think both of us meant is that part of the total huge working set has much higher reuse, and that's the part you benefit most from big caches for.

In the case of graphics, it's not obvious to me that any part of the working set with *temporal* coherence (i.e. not the framebuffer) is small enough to benefit substantially from a large cache. So what I was implying is that the high-reuse part of the working set generally would be bigger than on server workloads (or maybe 'only' as big), and so having a few megabytes of cache probably wouldn't help much. Whether I'm right on that is another debate completely, of course...

General rule of thumb, for a given data size, caches are always more dense than register files.
You're thinking with a CPU mentality here. But in the case of real GPUs, your register file is, as I said, 6T SRAM with only one read port and one write port. Per 'core' in G80, you have 32KiB of it, and the control logic isn't anywhere near as complex as for a cache. That means per-bit density including overhead would, I suspect, be very comparable to a CPU's L2 cache.

The reason GPU register files can work like that is precisely that you have so many threads, so the register file's latency doesn't matter. However, this brings me back to Larrabee: given my understanding of how Larrabee works, it couldn't do that, and the register file (and potentially the L1 too) would have to be multi-ported. If anyone has an idea of how it could bypass that problem, I'd be very interested indeed.

P.S.: Woah, insane number of consecutive replies, Aaron! Read it all, there's some good stuff in there... :) Don't have time to reply to some of it right now, but I will later if nobody beats me to it.
 
Nick, I'm still not clear on why you say speculative prefetching saves bandwidth. My thinking is you can't fetch less data than a design that only fetches what it needs. Or are you saying that Larrabee will have a larger cache combined with prefetching and it's really the larger cache that saves bandwidth?
Yes. The cache obviously isn't large enough to fit all textures into it, but several megabytes is enough to start getting some reuse. Mipmapping, texture compression, tiling, etc. all contribute to a useful level of spatial and temporal coherence.

The purpose of speculative prefetching would only be to lower average latency. This then translates to requiring less register space.
 
You're thinking with a CPU mentality here. But in the case of real GPUs, your register file is, as I said, 6T SRAM...

I'm pretty sure that all high-performance CPUs have register files with 6T SRAM. As the register file access time can easily become a critical path, it is highly optimized for speed (rather than just area). You're right that it has more ports (10+ ports is common for a four-way superscalar design), which also contributes to its overall area. In contrast, a processor's L2 cache is optimized for density (highly-banked, fewer ports) and low-leakage.

Once you're doing full-custom datapath layout and dynamic logic, the register file design becomes much more interesting. For example, there is an idea of "pitch matching" in which you lay out all your datapath elements and ALUs to have the same width on the chip per bit. This allows the register file to sit right next to the ALUs without a bunch of wires going back and forth. As I mentioned earlier, the Alpha EV6 used a replicated register file to allow a copy of the registers to be close to each cluster of ALUs (with an extra cycle of latency between them). I digress, but lots of effort goes into designing something that is such a critical processor structure.

...with only one read port and one write port.

I could see two read ports, but how does it get away with a single read port? Don't the instructions running on the GPU cores use two input operands?

Per-'core' in G80, you have 32KiB of it and the control logic isn't anywhere as complex as for a cache. That means per-bit density including overhead would, I suspect, be very comparable to a CPU's L2 cache.

The control logic for a cache is small. Sure, a cache needs tag RAMs, but that is the only major overhead. In contrast, the L2 caches are much denser than processor L1 caches (for the latency and bandwidth reasons I mentioned above).

The reason GPU register files can work like that is precisely that you have so many threads, so the register file's latency doesn't matter.

I see that having many threads can hide register file latency. But it doesn't help overall register file bandwidth (hence my question about the number of ports).
 
The working set in GPUs is obviously huge compared to all that though.

Perhaps the working sets are so large because so many darn threads are running at once?

Let me say more. As you have several hundred threads (warps) all running at the same time, it seems very likely that you'd just thrash any caches (this is one of the reasons that CPU thread counts are two or four, but rarely more than that per core). That is, the amount of cache *per thread* becomes very low with too many threads. If you have 16 G80 "multiprocessors" with 24 warps each and 8MB of on-chip cache, that is 8MB / (16 x 24), or only about 21KB per warp (aka, thread). Of course you're not going to have much re-use in such a small amount of cache.

However, if you have a smaller number of threads and the software is really careful to schedule together tasks that access similar data (for example, the same textures or the same part of the frame buffer), then perhaps caching makes sense. Perhaps reducing the number of threads by 4x would reduce the amount of cache you need to get locality from "a few tens of megabytes" down into the ~8MB that Larrabee has.
 
They could simply use two-stage loads or prefetching and software pipelining. There are actually a LOT of options to hide LOTS of latency with a single thread and a VERY regular workload like graphics. Remember, all these "programs" are pre-compiled by the driver, which can do a lot of tricks and also use realtime feedback to incorporate even more tricks.
Why do you think the workloads modern graphics hardware has to cope with are so regular?
While this was true before the introduction of programmable hardware (when it really was possible to prefetch all the data needed), things are quite different now, and I gave a few examples in a previous post to show this.
Moreover, we are just talking about rasterization here, which tends to be more regular than, say, ray tracing :)
I thought the idea was that Larrabee has to be good at ray tracing as well; this sounds to me like a call for a heavily multithreaded processor.
All the hardware-accelerated ray tracers I'm aware of use this approach.
 
General rule of thumb, for a given data size, caches are always more dense than register files.
I should have worded it better, and even though I might be completely wrong on this,
I'm not sure that all the on-chip memory GPUs use to save the state of a thread has to be register-file-type memory.
In the end, the vast majority of this data is not needed at any given time, so it would make more sense to move it to a real (multiported) register file only when it is needed (and it seems to me that this transfer can be somewhat pipelined and started in advance).
If this is possible, then register memory could be denser than cache memory (for a given process, of course).
 
Perhaps the working sets are so large because so many darn threads are running at once?

Let me say more. As you have several hundred threads (warps) all running at the same time, it seems very likely that you'd just thrash any caches (this is one of the reasons that CPU thread counts are two or four, but rarely more than that per core). That is, the amount of cache *per thread* becomes very low with too many threads. If you have 16 G80 "multiprocessors" with 24 warps each and 8MB of on-chip cache, that is 8MB / (16 x 24), or only about 21KB per warp (aka, thread). Of course you're not going to have much re-use in such a small amount of cache.

Keep in mind that the G80 doesn't have cached access to global memory (according to CUDA), only a 2D-localized texture cache and a constant cache. The constant cache is designed to be used when all threads of a warp hit the same constant (broadcast). I'm going to guess that texture accesses (if not random) are going to have some locality even with a high number of warps, because the warps are often running fragment programs which are localized in 2D (with respect to framebuffer output position) as well.
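
For what it's worth, a rough sketch of why 2D-localized texture accesses cache well: textures are normally stored tiled rather than scanline by scanline, so neighbouring texels in both x and y land in the same cache line (the 4x4 tile of 4-byte texels below, exactly one 64-byte line, is just an assumed layout for illustration):

Code:
#include <cstddef>
#include <cstdint>

// Convert an (x, y) texel coordinate into an offset in a tiled texture layout.
std::size_t tiled_offset(uint32_t x, uint32_t y, uint32_t width_in_tiles) {
    const uint32_t tile_x = x / 4, tile_y = y / 4;            // which 4x4 tile
    const uint32_t in_x   = x % 4, in_y   = y % 4;            // position inside the tile
    const std::size_t tile_index = tile_y * width_in_tiles + tile_x;
    return tile_index * 16 + in_y * 4 + in_x;                 // 16 texels per tile
}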
 
I could see two read ports, but how does it get away with a single read port? Don't the instructions running on the GPU cores use two input operands?

I see that having many threads can hide register file latency. But it doesn't help overall register file bandwidth (hence my question about the number of ports).

From the CUDA docs,

"Generally, accessing a register is zero extra clock cycles per instruction, but delays may occur due to register read-after-write dependencies and register memory bank conflicts. The delays introduced by read-after-write dependencies can be ignored as soon as there are at least 192 active threads per multiprocessor to hide them. The compiler and thread scheduler schedule the instructions as optimally as possible to avoid register memory bank conflicts. They achieve best results when the number of threads per block is a multiple of 64."
 
I could see two read ports, but how does it get away with a single read port? Don't the instructions running on the GPU cores use two input operands?
Three input operands at full speed, actually (MADs). But the trick is to have as many banks as there are input operands.

The reason why that works differs slightly from GPU to GPU. On R5xx/R6xx, I would suspect they benefit from the fact that their batch size is larger than the ALU width. On G8x, the batch size and the effective ALU width are identical. So threads are likely statically allocated to one bank, and to achieve full efficiency you need one thread from each bank that's ready to execute instructions. Another theory is that registers inside a thread are allocated to different banks, and some CUDA docs do seem to imply that, but they're sufficiently vague that I'm not sure that's exactly how it works.
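
To make the bank idea concrete, here's a toy model (my guess at the scheme, not a documented NVIDIA design): if a MAD's three source registers fall into three distinct single-ported banks, all three reads can be issued in the same cycle.

Code:
#include <array>
#include <cstdint>

constexpr int BANKS = 4;            // assumed bank count
constexpr int REGS_PER_BANK = 8;

struct BankedRegFile {
    std::array<std::array<uint32_t, REGS_PER_BANK>, BANKS> bank{};

    // Assumed allocation policy: a register's bank is encoded in its low bits.
    uint32_t read(int reg) const { return bank[reg % BANKS][reg / BANKS]; }

    // True if a MAD's three source registers hit three distinct banks,
    // i.e. the single-ported banks can serve all three reads in one cycle.
    static bool conflict_free(int a, int b, int c) {
        return (a % BANKS) != (b % BANKS) &&
               (a % BANKS) != (c % BANKS) &&
               (b % BANKS) != (c % BANKS);
    }
};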
 
You are assuming a GPU programming paradigm where things like registers are extremely limited and you are effectively working on one poly per thread.

With greater programmability and flexibility you can do things fairly differently. So fragment A needs a texture read, you send off the read and continue working on fragments B, C, D, E, F, G, H, I, J, K, L, M, N, O, and P and then come back and deal with the texture read for A. Unlike GPUs which have VERY limited programmability, an x86 processor can run a thread that is a task chain! Between task chaining, and the hardware multi-threading, something with 4 hardware contexts can act like a device with a GPU "processor"/unit with 64 or 128 threads.

BTW, the 24 number should have been more like 6.

What you are saying is classic in-order RISC programming with a large register file (place load ops long before you need them)?

So with Larrabee you have,

32 regs x 16 wide x 4 threads x 24 cores = 49152 scalars
32 regs x 4 threads x 24 cores = 3072 max possible independent 16-wide SIMD loads

And with G80 effectively,

8192 "scalar regs" x 16 cores = 131072 scalars
8192 scalars / 32 scalars per warp / 24 warps per core = ~10 32-wide SIMD regs per warp
and in terms similar to the Larrabee calculation above,
10 regs x 24 threads x 16 cores = 3840 max possible independent 32-wide SIMD loads

Basically the same thing happens automatically with G80, just with 1/3 the regs per thread (if the programs don't exceed the 10-reg max for 24 warps). Still, G80 can keep more loads in flight at 2x the SIMD width, and again it is a 2006/2007 chip compared to a 2009/2010 Larrabee...
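
Redoing that back-of-the-envelope math in code, just so the assumptions are explicit (the figures are the ones from the post above, not official specs):

Code:
#include <cstdio>

int main() {
    // Larrabee (assumed): 32 vector regs, 16-wide, 4 hw threads, 24 cores.
    int lrb_scalars = 32 * 16 * 4 * 24;                    // 49152 scalar register slots
    int lrb_loads   = 32 * 4 * 24;                         // 3072 independent 16-wide SIMD loads

    // G80 (per CUDA docs): 8192 scalar regs and 24 warps per multiprocessor,
    // 32 scalars per warp, 16 multiprocessors.
    int g80_scalars       = 8192 * 16;                     // 131072 scalar register slots
    int g80_regs_per_warp = 8192 / 32 / 24;                // ~10 32-wide SIMD regs per warp
    int g80_loads         = g80_regs_per_warp * 24 * 16;   // 3840 independent 32-wide SIMD loads

    std::printf("Larrabee: %d scalars, %d loads in flight\n", lrb_scalars, lrb_loads);
    std::printf("G80: %d scalars, %d regs/warp, %d loads in flight\n",
                g80_scalars, g80_regs_per_warp, g80_loads);
    return 0;
}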

However, your point, if I understand it right, does seem to suggest one reason Larrabee won't be as far behind as some of us expected.
 
I have to admit, I also wasn't really satisfied with the last posts on the copyright/patent/mask right issues. A link to something posted on-line would really help clarify things.

I wish I could find something on the net, but there is no reasonable analysis posted anywhere. People just speak randomly about having an x86 license, but almost no one really bothers to look at the patents or the laws, or to check the reliability of such claims. Otherwise, I am sure you could infer a lot of market strategies and future trends if more people were used to looking at the legal side of these inventions.

I hope I can have more motivation to post about this later.
 
From the CUDA docs,
"The compiler and thread scheduler schedule the instructions as optimally as possible to avoid register memory bank conflicts. They achieve best results when the number of threads per block is a multiple of 64."

Yikes! A banked register file. Of course, they do have the threads to tolerate bank conflicts, but this still sounds pretty nasty (both from a performance point of view and from a design complexity point of view). I'm sure they've found reasonable engineering solutions to minimize the impact, and it probably isn't *that* bad.

However, an SRAM memory with a banked interface is going to be less dense than a single-ported SRAM. The extra wires for the additional address bits and such aren't free. Of course, all SRAMs of any size are internally banked (to avoid long bit and word lines), but actually allowing an SRAM to take in multiple addresses and spit out multiple data words is going to result in a somewhat larger structure.

Of course, the real question is how such a banked register file (G80) compares to a multi-ported register file (presumably what Larrabee will use). I would say that the banking is likely a bit cheaper (in terms of area), but having, say, three ports (two read, one write) isn't that expensive either (you can overlay the wires over the SRAM cells in many cases).

Just thinking out loud, another option would be for Larrabee to use a multi-SRAM-cell register file. This was used in the IBM RS64-IV two-way multithreaded processor from the late 1990s. The observation was that on any given cycle, the processor would always read register values from the same thread (it was switch-on-cache-miss multithreaded). So, in this case, you don't need a full bit-line for each bit in the register file. You only need one bit-line for each pair of bits. I don't recall all the details, but this allowed them to build a register file with twice the bits (for two threads) without doubling the area. As multi-ported register files are often wire-limited anyway, adding the extra bits doesn't need to cost that much, but YMMV.

I found the ISSCC paper from 1998 by Storino et al that describes it: "To accomplish a dual-thread operation the registers file must have dual storage elements for each bit. the natural inclination would be to have multiplicity in write ports, read ports, and storage elements, significantly enlarging the area and lower the performance... Given the orthogonal nature of threads, it is not necessary to read or write identical word locations of the separate threads in the same cycle. Through this observation the hardware implementation is reduced significantly because both write and read ports are shared. By sharing ports, only duplicate memory elements were required. No extra decoders are needed because there is only one additional bit for thread selection."

I'm not saying that GPUs can't (or don't) play such tricks. They likely do. My main point is that trying to reason (and argue) about the area of various sorts of register files, caches, scratch memories, and secondary caches is more subtle than it might appear.
 