Larrabee: Samples in Late 08, Products in 2H09/1H10

It doesn't switch between blocks or threads though; it deals with one block completely before switching to the next, and it only switches between warps (max 24 per multiprocessor). In traditional terms the warp is the thread.

It still doesn't matter though: prefetching won't help one iota, scatter/gather won't help. Those thousands of registers on GPUs are very expensive; if simple tricks could have saved two thirds of them, GPUs would have done it already.
 
In traditional terms the warp is the thread.
Yep, but this is the terminology that CUDA docs use and that also AP used.
It still doesn't matter though: prefetching won't help one iota, scatter/gather won't help. Those thousands of registers on GPUs are very expensive; if simple tricks could have saved two thirds of them, GPUs would have done it already.
100% spot on
 
Oops, that wasn't right either ... it does switch between warps from different blocks. Ah bah, thinking of it in terms of thread counts isn't terribly enlightening anyway. The maximum number of outstanding texel-quad requests might be nice to know, but that's impossible to tell for Larrabee.
 
Here is something really interesting from the CUDA docs about texturing,

"The texture cache is optimized for 2D spatial locality, so threads of the same warp that read texture addresses that are close together will achieve best performance. Also, it is designed for streaming fetches with a constant latency, i.e. a cache hit reduces DRAM bandwidth demand, but not fetch latency."
 
You can assume I meant non-speculative, software prefetching.
Yes, but why keep speculative hardware prefetching out of the equation? It works perfectly on CPUs and supersedes software prefetching in effectiveness.
Dunno about "graphics" but small polygon rendering certainly not.
Deferred shading.
 
Yep, and also in 2010 you don't want a GPU that slows down to a crawl because you're doing this:

output.colour = tex2D(sampler0, tex2D(sampler1, uv).xy);
Even with dependent texture reads there's a large amount of coherency. With proper mipmapping the next texture sample is always one texel away from the previous one (else you end up with shimmer). The only thing the prefetch unit has to do is predict whether the next texel is up, down, left, or right...

You all seem to be assuming that Larrabee has to be able to hide the latency of a cache miss on every read. Hit rate on Larrabee is likely going to be higher than 90% if it's even remotely comparable to modern CPUs.

4 pixel quads per thread, 4 threads per core, and a 9:1 ALU:TEX ratio is already 144 cycles of latency hiding. And with a 10% miss rate you can even hide 1440 cycles of latency before slowing down to a crawl...
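Spelling out that arithmetic, for whoever wants it; all the inputs are the assumptions above, nothing Intel has published:

Code:
/* back-of-the-envelope: cycles of latency hiding per core */
#include <stdio.h>
int main(void)
{
    int quads_per_thread = 4;    /* pixel quads in flight per hardware thread */
    int threads_per_core = 4;    /* Larrabee hardware threads per core */
    int alu_per_tex      = 9;    /* assumed 9:1 ALU:TEX instruction ratio */
    double miss_rate     = 0.10; /* assumed texture cache miss rate */
    int cycles = quads_per_thread * threads_per_core * alu_per_tex; /* 144 */
    /* if only 1 in 10 accesses misses, each miss can take ~10x longer
       before the ALUs actually starve */
    printf("latency hiding: %d cycles, ~%.0f per miss at %.0f%% misses\n",
           cycles, cycles / miss_rate, miss_rate * 100.0);
    return 0;
}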
 
It still doesn't matter though: prefetching won't help one iota, scatter/gather won't help. Those thousands of registers on GPUs are very expensive; if simple tricks could have saved two thirds of them, GPUs would have done it already.
A GPU's gigantic register files are practically the equivalent of a CPU's gigantic caches. Both are used to store temporary results. The only difference is that with a GPU nothing is speculative, while a cache has a hit/miss rate. GPUs could shrink their register files if they used caches with speculative prefetching that lower average texture sample time.

Could anyone give me some insight on the density of a GPU's register file, versus a CPU's cache? If the latter is denser, even by a small amount, speculative prefetching and a 10% miss rate doesn't sound so bad...
 
Even with dependent texture reads there's a large amount of coherency. With proper mipmapping the next texture sample is always one texel away from the previous one (else you end up with shimmer). The only thing the prefetch unit has to do is predict whether the next texel is up, down, left, or right...
Mip maps are a luxury we often don't have: think about a full screen refraction effect done via EMBM.
Other examples off the top of my head
1) sample precomputed irradiance from a cube map using a normal read from a normal map
2) perform tone mapping and/or color correction with 2D or 3D LUTs
3) gather an occlusion term reprojecting a depth map to light space in a jungle-like scene.

Speculation is going to be completely useless in these cases.

A GPU's gigantic register files are practically the equivalent of a CPU's gigantic caches. Both are used to store temporary results. The only difference is that with a GPU nothing is speculative, while a cache has a hit/miss rate. GPUs could shrink their register files if they used caches with speculative prefetching that lower average texture sample time.
Why would you want to use speculation when you can simply fetch the data you need and hide latency anyway? Caches are likely to be less dense than a big register file that doesn't require tagging, while hw prefetching is likely to require extra area as well.
 
Mip maps are a luxury we often don't have: think about a full screen refraction effect done via EMBM.
Unless you want a shimmer fest there will be high spatial and temporal coherence for this effect.
Other examples off the top of my head
1) sample precomputed irradiance from a cube map using a normal read from a normal map
2) perform tone mapping and/or color correction with 2D or 3D LUTs
3) gather an occlusion term reprojecting a depth map to light space in a jungle-like scene.
All of these use fairly small textures that might largely fit inside Larrabee's caches. So you actually end up using less bandwidth than a GPU and requiring less latency hiding.

Furthermore, actual applications always have a mix of predictable and less predictable access patterns, small and large textures, mipmapping and no mipmapping, etc. With a cache you can have a sweet balance between bandwidth, hit rate and low latency. A miss gets compensated with a few hits. A GPU has only one answer to all these different situations: burn bandwidth and have a gigantic register file.
Speculation is going to be completely useless in these cases.
It's never going to be completely useless. I've yet to see a shader with completely random accesses in a large texture. And even a badly predicted prefetch might load data that is going to be used two accesses later.
Why would you want to use speculation when you can simply fetch the data you need and hide latency anyway?
Saving bandwidth and requiring fewer registers. Oh, and lowering total execution latency. If you want Larrabee to do 10 collision tests for your physics engine you'd rather have the result in 100 ns instead of 10 ms. It's ok for a GPU to lag a couple of frames behind for graphics, but that's unacceptable for other real-time applications.
Caches are likely to be less dense than a big register file that doesn't require tagging, while hw prefetching is likely to require extra area as well.
Thanks for the info. What kind of registers do GPUs use anyway? I assume it's not master-slave edge-triggered latches using transmission-gate multiplexers (21 transistors)? SRAM cells have only 6 transistors and are extremely size optimized. It would be interesting to know how many MB of register space a GPU has versus a CPU cache of the same timeframe.

And what about other parameters like power consumption? Density might not be the biggest issue since transistor budgets keep increasing, but bandwidth does not. Aren't we getting close to the point where adding more registers to a GPU to hide latency wouldn't help because we're bandwidth limited anyway? Doesn't it make sense then to spend those transistors on caches that reduce both bandwidth and average latency?
 
You all seem to be assuming that Larrabee has to be able to hide the latency of a cache miss on every read. Hit rate on Larrabee is likely going to be higher than 90% if it's even remotely comparable to modern CPUs.
The hit rates will be significantly lower; Larrabee is not comparable to modern CPUs from that point of view. The per-thread cache will be significantly smaller than in regular desktop offerings and will be hit much harder if scatter/gather is used. Besides, for tackling current graphics you really need a cache optimized for the fact that most of your data is bi-dimensional. Or you need to reorder your data, which is not always possible.
 
Saving bandwidth and requiring fewer registers. Oh, and lowering total execution latency. If you want Larrabee to do 10 collision tests for your physics engine you'd rather have the result in 100 ns instead of 10 ms. It's ok for a GPU to lag a couple of frames behind for graphics, but that's unacceptable for other real-time applications.
I was taking some of your points very seriously until I read that one. I really don't think you'd find the number of threads to affect latency too badly if you did the math (I know that number was in jest but still)... Especially if you could put your compute threads on high priority in the scheduler, which you likely could. ;)

More seriously, regarding caching+prefetching for GPUs: I'm sure it would help, especially once you realize how big the memory burst sizes are going to be with GDDR5, so the chance you hit the right memory block is obviously higher. However, if you do waste it (which will always happen, otherwise it's not a *speculative* prefetch) then you've also just lost a bunch of bandwidth. So my point is this: you argue it'd save bandwidth. Personally, I'm pretty damn sure it'd *waste* bandwidth overall.

Look at it this way: if bigger caches improved perf/mm2 or perf/watt, you'd already see those in current products because that's a 5-minute change. But you don't. Reuse just doesn't go up that much in graphics. Can prefetching improve the hit rate? Yes, but that won't make you reuse the data magically. On the other hand, when your bandwidth utilisation could be maximised anyway, doing prefetching to hide latency *will* waste bandwidth. And it also won't hide latency as well or as systematically as registers+threads. yay?

One positive point for Larrabee I thought I'd mention though: if you do texture filtering in the shader core, your ALU:TEX ratio is higher, so that doesn't hurt for hiding average latency... It wouldn't help with a chain of random dependent texture fetches, but errr, I doubt it's supposed to either! :) (oh and btw, fwiw, current GPUs are already smart enough to issue multiple loads per thread/warp to maximise latency hiding and I suspect small LUTs that are frequently used will likely also remain in there if your cache thrashing isn't out of this world...)

Thanks for the info. What kind of registers do GPUs use anyway?
afaik, it's multi-banked 6T single-port SRAM (1 read port+1 write port though for a variety of reasons but that's not much more expensive than a shared port iirc).

Aren't we getting close to the point where adding more registers to a GPU to hide latency wouldn't help because we're bandwidth limited anyway? Doesn't it make sense then to spend those transistors on caches that reduce both bandwidth and average latency?
I don't think we're going to be horribly bandwidth limited in the 2009 timeframe personally given that we could see NV/AMD use 6GHz effective GDDR5 on 384-bit busses or something crazy like that (yes, nearly 300GB/s!) *if* they needed it. Which is not completely obvious given how much it is but we'll see how it goes.

EDIT: Added a bunch of stuff.
 
The hit rates will be significantly lower; Larrabee is not comparable to modern CPUs from that point of view. The per-thread cache will be significantly smaller than in regular desktop offerings and will be hit much harder if scatter/gather is used.
The L2 cache is supposedly shared by all cores for read-only access. So you do get one huge cache for texture sampling, with higher hit rates as a result (combined with speculative prefetching).

Hit rates are indeed likely lower than for actual CPUs, but still much higher than for GPUs. I've heard about 96-99% for a Core 2, depending on the workload, but Larrabee should do fine with 90% if the TEX:ALU ratio isn't too high.
Besides, for tackling current graphics you really need a cache optimized for the fact that most of your data is bi-dimensional. Or you need to reorder your data, which is not always possible.
The fixed-function texture sampler might include address bit interleaving. But I don't think it's that crucial. If the cores each work on a tile in the render target then they'll need a square-shaped region in texture space anyway. So it doesn't matter much if you go horizontally or vertically; cache lines are equally likely to be accessed again. A GPU's texture cache might not benefit from coherency within a tile, but Larrabee very likely will. Frankly, it depends on it. :D
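For reference, by address bit interleaving I mean something like the classic Morton/Z-order layout below; this is the textbook version, not a claim about what Larrabee's sampler actually does:

Code:
/* Morton (Z-order) address interleaving: texels that are neighbours in 2D
   end up close together in the 1D address space, so a cache line covers a
   small square block instead of a long horizontal strip. */
#include <stdint.h>
static uint32_t part1by1(uint32_t v)  /* spread the low 16 bits apart */
{
    v &= 0x0000FFFF;
    v = (v | (v << 8)) & 0x00FF00FF;
    v = (v | (v << 4)) & 0x0F0F0F0F;
    v = (v | (v << 2)) & 0x33333333;
    v = (v | (v << 1)) & 0x55555555;
    return v;
}
uint32_t texel_address(uint32_t x, uint32_t y)
{
    return part1by1(x) | (part1by1(y) << 1);
}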
 
I was taking some of your points very seriously until I read that one. I really don't think you'd find the number of threads to affect latency too badly if you did the math (I know that number was in jest but still)... Especially if you could put your compute threads on high priority in the scheduler, which you likely could. ;)
I wasn't talking about the latency caused by the number of threads, but the sheer number of cycles you're waiting for memory accesses. Even if a thread has a GPU shader unit all to itself it's going to take tens of thousands of clock cycles to complete execution. With a cache and a reasonable hit rate it might take only a few hundred cycles. If you need to do multiple 'passes' of data processing the total latency until the results are returned to the host CPU can be unacceptable. This is exactly the reason why some GPGPU projects simply fail. The CPU is still better at processing small datasets with many dependencies. Larrabee, being somewhere in between the two, can offer a significant benefit here for other workloads. Not to mention raytracing...
More seriously, regarding caching+prefetching for GPUs: I'm sure it would help, especially once you realize how big the memory burst sizes are going to be with GDDR5, so the chance you hit the right memory block is obviously higher. However, if you do waste it (which will always happen, otherwise it's not a *speculative* prefetch) then you've also just lost a bunch of bandwidth. So my point is this: you argue it'd save bandwidth. Personally, I'm pretty damn sure it'd *waste* bandwidth overall.
Not necessarily. I agree that prefetching a line that is never used is a waste, but first of all prefetches only happen on idle cycles. So whereas a GPU would just give its memory controllers a rest, Larrabee would fetch data to avoid future misses. It becomes interesting when you have, for example, a shader with few texture accesses but tons of temporary variables. The GPU would run out of registers and run at a fraction of its capacity. Larrabee, which has plenty of registers and L1 space, would keep running at full speed and have a very high hit rate.

So as TEX:ALU ratios keep decreasing and caches get bigger, current GPU architectures get less interesting.

Anyway, I'm not claiming in any way that Larrabee will be faster than 2009's fastest GPU. I'm just saying that prefetching and such can ensure that it won't suck entirely. ;)
Look at it this way: if bigger caches improved perf/mm2 or perf/watt, you'd already see those in current products because that's a 5-minute change. But you don't.
R600 has 256 kB of shared L2 texture cache, touted by some as revolutionary...
Reuse just doesn't go up that much in graphics. Can prefetching improve the hit rate? Yes, but that won't make you reuse the data magically. On the other hand, when your bandwidth utilisation could be maximised anyway, doing prefetching to hide latency *will* waste bandwidth. And it also won't hide latency as well or as systematically as registers+threads. yay?
The problem is that 256 kB is not nearly enough for reuse. In the CPU world we see 1.5 GHz server CPUs beat 3 GHz Celerons for the sole reason that the working set fits in the cache. The tipping point for graphics is likely around a few dozen MB. But you're going to start to see some effect from a few MB too. Even if you can store just a repetitive detail texture in the cache it's going to save a lot of bandwidth. And with some smart task decomposition and scheduling you can maximize reuse.
afaik, it's multi-banked 6T single-port SRAM (1 read port+1 write port though for a variety of reasons but that's not much more expensive than a shared port iirc).
Ok, thanks.
I don't think we're going to be horribly bandwidth limited in the 2009 timeframe personally given that we could see NV/AMD use 6GHz effective GDDR5 on 384-bit busses or something crazy like that (yes, nearly 300GB/s!) *if* they needed it. Which is not completely obvious given how much it is but we'll see how it goes.
That's indeed impressive. I wonder how much data that is per round trip (i.e. the register space you need for hiding the latency of one access, assuming full bandwidth utilization). I just think there's a point where you want to start keeping things on-chip to lower average latency and save register space.
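A quick bandwidth-delay estimate; the ~300 GB/s comes from the GDDR5 speculation above, and the round-trip latency is just an assumed figure:

Code:
/* data "in flight" during one memory round trip at full utilization */
#include <stdio.h>
int main(void)
{
    double bandwidth_bps = 300e9;  /* ~300 GB/s, from the GDDR5 speculation */
    double latency_s     = 500e-9; /* assumed ~500 ns round-trip latency */
    double in_flight     = bandwidth_bps * latency_s; /* 150,000 bytes */
    printf("~%.0f KB must be in flight to cover one round trip\n",
           in_flight / 1024.0);
    return 0;
}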
 
Look at it this way: if bigger caches improved perf/mm2 or perf/watt, you'd already see those in current products because that's a 5-minute change.

Perhaps having chip-wide coherent caching, which GPUs don't have (and which isn't a 5-minute change), is the secret here? It is hard to cache stuff if you can't guarantee that some other core isn't going to change it. Without cache coherence, it is harder to buffer intermediate results and such without dedicated (and smaller) hardware queues.
 
Could anyone give me some insight on the density of a GPU's register file, versus a CPU's cache? If the latter is denser, even by a small amount, speculative prefetching and a 10% miss rate doesn't sound so bad...
If it makes you feel better, current GPUs use a "cache" to reorder operand fetches (and stores?) between the register file and the ALUs. R600 apparently has 32KB total of this cache, split into four, one per SIMD cluster (16x vec4+scalar). It's also used for streamout and prolly other things.

In R600 the register file is part of the virtual memory system, so it is actually a "cache" - but arguably on today's graphics workloads it's prolly rare that the register file finds portions of itself swapped out.

Jawed
 
Those thousands of registers on GPUs are very expensive; if simple tricks could have saved two thirds of them, GPUs would have done it already.

If you count Larrabee's registers the same way as the G80's registers, Larrabee also has thousands of registers. 32 cores x 4 threads x 32 vector registers x 16 words = 65536 floating-point registers on Larrabee. :oops: That is 256 KB of registers across the chip (8 KB per core).

From what I understand, the G80 has 24 warps per multiprocessor, but shares something like 256 registers among them, meaning the average warp only gets something like 8 or 10 registers.

Larrabee isn't skimping on registers, just the number of "warps" it has.
 
I was taking some of your points very seriously until I read that one. I really don't think you'd find the number of threads to affect latency too badly if you did the math (I know that number was in jest but still)... Especially if you could put your compute threads on high priority in the scheduler, which you likely could. ;)
All threads in Larrabee are compute threads - you're making a meaningless distinction.

Look at it this way: if bigger caches improved perf/mm2 or perf/watt, you'd already see those in current products because that's a 5-minute change.
What's R600's centralised 256KB of L2 then, and why does it seemingly have L2<->L2 coherency in a multi-GPU system (or if not, it's prolly coming in R7xx)? When all prior ATI GPUs were content with a few 10s of KB of L1, and absolutely no L2.

But you don't. Reuse just doesn't go up that much in graphics. Can prefetching improve the hit rate? Yes, but that won't make you reuse the data magically.
R600 heavily prefetches texels, driven by the rasteriser - it isn't doing per-TEX instruction prefetching. Prefetching allows the memory system to reorder fetches and maximise burstiness. R600 can do this because it links rasterisation and attribute (texture coordinate) interpolation, doing these ahead of the actual execution of the TEX instruction.

3-D rendering texture caching scheme

Jawed
 
Software pipelining & such

I wanted to point out another possible technique that might help overcome Larrabee's smaller number of threads (aka warps) by using the larger number of registers per thread that Larrabee has (roughly 10 per warp on G80 vs 32 per thread on Larrabee).

Loop transformations such as loop unrolling and software pipelining can really help hide latency and generate lots of parallel misses (giving benefits similar to multiple threads).

Consider the following computation:

Code:
Load A1
Load A2
Compute A3 = f(A1, A2)
Store A3
Repeat

Now, if you're going to do this same computation multiple times on different data (which a GPU certainly would...), you can unroll and re-schedule the loop:

Code:
Load A1
Load B1
Load C1
Load A2
Load B2
Load C2
Compute A3 = f(A1, A2)
Compute B3 = f(B1, B2)
Compute C3 = f(C1, C2)
Store A3
Store B3
Store C3
Repeat

Sure, this uses 3x the number of live registers, but Larrabee has 3x the number of registers per warp/thread. Such a loop unrolling can generate 3x the number of parallel misses. While these three misses are outstanding, one of the other three threads runs. Of course, a GPU would get the same benefits by putting A, B, and C in different threads/warps. The point I'm trying to make is that once you consider tricks like this, the number of parallel memory fetches outstanding from Larrabee could be very similar.
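Here's roughly what that unrolled loop looks like in plain C; it's scalar for readability (on Larrabee each of A/B/C would really be a 16-wide vector), and the array names and the multiply standing in for f() are placeholders:

Code:
/* unrolled-by-3 version of the loop above: all six loads are issued before
   any result is consumed, so up to three misses can be outstanding at once */
void process(const float *x, const float *y, float *out, int n)
{
    int i;
    for (i = 0; i + 2 < n; i += 3) {
        float a1 = x[i],     a2 = y[i];      /* iteration A loads */
        float b1 = x[i + 1], b2 = y[i + 1];  /* iteration B loads */
        float c1 = x[i + 2], c2 = y[i + 2];  /* iteration C loads */
        out[i]     = a1 * a2;  /* compute + store A */
        out[i + 1] = b1 * b2;  /* compute + store B */
        out[i + 2] = c1 * c2;  /* compute + store C */
    }
}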

Edit: clarified warp vs thread a bit
 