Larrabee: Samples in Late 08, Products in 2H09/1H10

From what I understand, the G80 has 24 warps, but it shares something like 256 registers among them.
768 registers under CUDA, per SIMD, but seemingly only 512 under graphics workload - CUDA is a bit of a misdirection when it comes to the way G80's internal resources are configured for graphics.

Jawed
 
768 registers under CUDA, per SIMD, but seemingly only 512 under graphics workload - CUDA is a bit of a misdirection when it comes to the way G80's internal resources are configured for graphics.

My impression was that under CUDA you can write warps that use 32 registers, but the hardware can't run 24 of them at a time (it would run fewer of them). The same is true for graphics computations, based on how many live registers the shader needs (or something like that). So, it might have 768 virtual registers (32 registers x 24 warps), but how many does the hardware actually support? It might be 512, but my impression was that it was fewer than that. The 256 I said above was a guess based on an 8KB register file.

Anyone know for sure how many hardware registers the G80 has?
 
The CUDA occupancy spreadsheet is used to calculate how much utilization a given register allocation and thread count can give.

There are limits to the number of registers per thread, but the physical number of registers is listed at 8192 32-bit registers per multiprocessor.
At 16 multiprocessors per chip, that adds up to 131072 32-bit registers, or the equivalent of 8192 of those 16-element vector registers, about double that of a 32-core Larrabee.

There seems to be a limitation in that each thread can only address 64 of the registers, which might be due to a 6-bit register identifier per instruction.
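
To make the register-limited side of that calculation concrete, here's a rough sketch using the figures from this thread (8192 registers per multiprocessor, 24 resident warps of 32 threads, a 64-register cap per thread); it's just illustrative, not the actual occupancy spreadsheet:

Code:
#include <stdio.h>

/* Back-of-the-envelope register-limited occupancy, using the figures
   quoted in this thread: 8192 32-bit registers per multiprocessor,
   up to 24 resident warps of 32 threads, 64 registers per thread max. */
int main(void)
{
    const int regs_per_mp         = 8192;
    const int max_warps_per_mp    = 24;
    const int threads_per_warp    = 32;
    const int max_regs_per_thread = 64;   /* possibly a 6-bit register id */

    for (int regs = 8; regs <= max_regs_per_thread; regs *= 2) {
        int warps = regs_per_mp / (regs * threads_per_warp);
        if (warps > max_warps_per_mp)
            warps = max_warps_per_mp;
        printf("%2d regs/thread -> %2d resident warps (%.0f%% occupancy)\n",
               regs, warps, 100.0 * warps / max_warps_per_mp);
    }
    return 0;
}

At 64 registers per thread that comes out at 4 resident warps, i.e. the 17% occupancy figure Jawed mentions further down.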

This does leave out the integer register files for Larrabee, but the count is dominated by the FP regs.

As has been mentioned, some of the calculations seem "off".
 
Unless you want a shimmer fest there will be high spatial and temporal coherence for this effect.
We would all like to render things to perfection, but sometimes it's not possible (not enough GPU time, not enough memory, etc..).

All of these use fairly small textures, that might largely fit inside Larrabee's caches. So you actually end up using less bandwidth than a GPU and requiring less latency hiding.
Fairly small? Shadow maps and 3D LUTs can require megabytes of memory.

It's never going to be completely useless. I've yet to see a shader with completely random accesses in a large texture. And even a badly predicted prefetch might load data that is going to be used two accesses later.
Deferred shadow mapping on complex depth buffers.


If you want Larrabee to do 10 collision tests for your physics engine you'd rather have the result in 100 ns instead of 10 ms. It's ok for a GPU to lag a couple frames behind for graphics, but that's unacceptable for other real-time applications.
All this discussion started because I wrote "..if texturing enters in the equation.."


And what about other parameters like power consumption? Density might not be the biggest issue when transistor budgets keep increasing but bandwidth does not.
Don't know much about this, but in theory the GPU only needs to use a small subset of all the available register space at any time; I guess this fact can be exploited to reduce power consumption.

Aren't we getting close to the point where adding more registers to a GPU to hide latency wouldn't help because we're bandwidth limited anyway? Doesn't it make sense then to spend those transistors on caches that reduce both bandwidth and average latency?
In my everyday experience GPUs are mostly bandwidth limited when working on stuff that uses extremely simple shaders that don't require many live registers (particles, shadow maps..).
Typically you need a lot of registers for shaders that are long and complex, with a high ALU:TEX ratio, and those don't require a lot of bandwidth to begin with.
 
Don't know much about this, but in theory the GPU only needs to use a small subset of all the available register space at any time; I guess this fact can be exploited to reduce power consumption.
That's not right. The GPU can't hide random ALU thread stalls if the register file (and the SIMD itself) isn't full to capacity with other threads waiting to work.

Sure, the count of defined registers per shader program might mean less than 100% occupancy in the register file, but that's an artefact of register file granularity - not a power saving feature.

Jawed
 
There are limits to the number of registers per thread, but the physical number of registers is listed at 8192 32-bit registers per multiprocessor.

Ok, if I'm figuring this correctly, I was off by a factor of two. With 16-wide warps and 24 warps per multiprocessor, that gives ~21 registers per warp (aka Larrabee thread). So Larrabee's 32 registers per thread isn't that much more than the G80's (and Larrabee still has a smaller number of overall threads, as has been pointed out). Plus, the G80 is more flexible in its thread allocation (fewer registers to threads that don't need them and, from what was said above, up to 64 for threads that can actually use that many registers).
 
That's not right. The GPU can't hide random ALU thread stalls if the register file (and the SIMD itself) isn't full to capacity with other threads waiting to work.
I don't see why the whole register file has to be active when, for a few clock cycles, your core only needs a small chunk of it that is likely to be allocated contiguously.

Sure, the count of defined registers per shader program might mean less than 100% occupancy in the register file, but that's an artefact of register file granularity - not a power saving feature.
I'm not following you here. The occupancy of the register file is determined by the number of registers a single thread needs and by the number of threads in a block.
Typically the whole register file is used (I guess you can have unused slots due to alignment restrictions on the number of registers or threads that the hw supports).
But again, at any given time the GPU is working on a SMALL subset of the whole thing; it's not like a warp uses registers scattered everywhere in the register file.
Register allocation is probably deterministic.
 
But again, at any given time the GPU is working on a SMALL subset of the whole thing; it's not like a warp uses registers scattered everywhere in the register file.
Register allocation is probably deterministic.
Sorry, I thought you were talking about occupancy when in fact you were merely referring to reads and writes.

Jawed
 
In my everyday experience GPUs are mostly [memory] bandwidth limited when working on stuff that uses extremely simple shaders that don't require many live registers (particles, shadow maps..). Typically you need a lot of registers for shaders that are long and complex, with a high ALU:TEX ratio, and those don't require a lot of [memory] bandwidth to begin with.

Well, if we're not memory bandwidth bound in most cases, then it goes back to just hiding enough latency to keep the ALUs busy. For complex shaders (which have more computation per memory access), a smaller number of threads would be able to keep the ALUs busy.

Let me make one more comment about this. I think part of the Larrabee bet is that shaders are generally getting more complex. On simple shaders (especially ones that are too big to cache), I think whatever Gxx is out at the time will likely toast Larrabee. However, on more complex shaders (which are taking more and more of the GPUs resources over time) Larrabee will be much more competitive.
 
I wanted to point out another possible technique that might help overcome Larrabee's smaller number of threads (aka warps) by exploiting the larger per-thread register count that Larrabee has (10 per warp on the G80 vs 32 per thread on Larrabee).
People use this 'trick' all the time on the PS2's vector units and on Cell SPUs to hide.. arithmetic latency :)

While these three misses are outstanding, one of the other three threads runs. Of course, a GPU would get the same benefits by putting A, B, and C in different threads/warps. The point I'm trying to make is that once you consider tricks like this, the number of parallel memory fetches outstanding from Larrabee could be very similar.
Unfortunately this technique also forces you to group together more stuff in a single computational unit, which means you are going to feel more pain when dynamic branching enters the equation, while a G8x-like architecture will always have the same 'thread granularity', no matter how many registers per thread you need.

edit: don't want to downplay software pipelining, it's a very powerful technique that I use very often.
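
For what it's worth, here's a minimal scalar-C sketch of the unrolling being described (the function and array names are made up for illustration; on Larrabee this would be vector code with the compiler hoisting the loads): the three independent fetches are issued up front so their misses overlap, at the cost of roughly three times the live registers.

Code:
/* Illustrative only: an unrolled/software-pipelined loop that keeps three
   independent fetches in flight, trading extra live registers for
   memory-level parallelism, as described in the quote above. */
void shade_batch(const float *texels, const int *addr, float *out, int n)
{
    int i;
    for (i = 0; i + 3 <= n; i += 3) {
        /* issue all three loads first so their misses overlap */
        float a = texels[addr[i + 0]];
        float b = texels[addr[i + 1]];
        float c = texels[addr[i + 2]];

        /* ...then do the ALU work while the data streams in */
        out[i + 0] = a * a + 1.0f;
        out[i + 1] = b * b + 1.0f;
        out[i + 2] = c * c + 1.0f;
    }
    for (; i < n; i++)                     /* leftover elements */
        out[i] = texels[addr[i]] * texels[addr[i]] + 1.0f;
}

On an in-order core the loads have to be hoisted like this by the compiler or by hand; a G80-style scheduler gets the same overlap by running A, B and C as separate warps, which is exactly the point being made above.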
 
Plus, the G80 is more flexible in its thread allocation (fewer registers to threads that don't need them and, from what was said above, up to 64 for threads that can actually use that many registers).
64 is what you'd call well over a very tall performance cliff - you're running G80 at 17% occupancy. With 128 pixels per block (4 warps) and only 1 block active, you're now able to handle way less texturing latency. Of course, if your shader does no texturing, then that's fine.

Jawed
 
Sure, this uses 3x the number of live registers, but Larrabee has 3x the number of registers per warp/thread. Such a loop unrolling can generate 3x the number of parallel misses. While these three misses are outstanding, one of the other three threads runs.
You appear to have missed the part where I suggested 4 pixel quads per thread can result in 144 cycles of latency hiding for a 9:1 ALU:TEX ratio. ;)
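My reading of where that 144 comes from, spelled out (treat the breakdown as an assumption on my part):

Code:
#include <stdio.h>

/* Assumed breakdown of the 144-cycle figure: 4 pixel quads = 16 pixels
   per hardware thread, 9 ALU instructions per TEX, so 16 x 9 = 144
   independent ALU clocks available to cover each texture fetch. */
int main(void)
{
    int pixels_per_thread = 4 * 4;   /* 4 pixel quads */
    int alu_per_tex       = 9;       /* 9:1 ALU:TEX ratio */
    printf("%d cycles of latency hiding\n", pixels_per_thread * alu_per_tex);
    return 0;
}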
So Larrabee's 32 registers per thread isn't that much more than the G80's (and Larrabee still has a smaller number of overall threads, as has been pointed out).
Larrabee has 8 kB worth of L1 cache that can be used for temporaries (if needed). Heck, even L2 cache isn't so horrible, compared to reducing the number of pixels per thread and having less latency tolerance.
 
You appear to have missed the part where I suggested 4 pixel quads per thread can result in 144 cycles of latency hiding for a 9:1 ALU:TEX ratio. ;)

I saw it, and I agree. I think this goes along with the idea that Larrabee will be most competitive for more complex shaders (more ALU per TEX). We're on the same page here. ;)
 
I wasn't talking about the latency caused by the number of threads, but the sheer number of cycles you're waiting for memory accesses. Even if a thread has a GPU shader unit all to itself it's going to take tens of thousands of clock cycles to complete execution. With a cache and a reasonable hit rate it might take only a few hundred cycles. If you need to do multiple 'passes' of data processing the total latency till the results are returned to the host CPU can be unacceptable.
While that is correct in theory, I am very skeptical it's the problem in practice. Let us assume that it takes 50K cycles, and that I need 6 passes. That's 300K cycles, which corresponds to 0.5ms on a G92. If that was the only latency between the CPU and the GPU, you could do dozens of roundtrips per frame! And 300K cycles is just an insanely pessimistic number for *any* workload.

Not necessarily. I agree that prefetching a line that is never used is a waste, but first of all prefetches only happen on idle cycles.
But that's precisely my point: what if there aren't enough idle cycles because your application is very bandwidth-intensive? I have a very hard time believing your system-wide efficiency will remain good in that kind of circumstance.

I'd refer you to some of Bob's points in his posts on memory controllers in this thread. They also imply something else if you can read between the lines: by not being as sensitive to latency, you can achieve greater bandwidth efficiency for the overall system. This is another key advantage of having that many registers and threads, AFAICT.

So as TEX:ALU ratios keep decreasing and caches get bigger, current GPU architectures get less interesting.
I don't disagree with that overall rule of thumb, but I suspect you're significantly overestimating that effect's magnitude. YMMV, however.

R600 has 256 kB of shared L2 texture cache, touted by some as revolutionary...
I honestly wouldn't call that revolutionary, but once again YMMV. I'd be very curious as to whether AMD kept that texture cache size in RV670; I wouldn't be surprised at all if that was one of the many ways they lowered the transistor count.

In the CPU world we see 1.5 GHz server CPUs beat 3 GHz Celerons for the sole reason that the working set fits in the cache.
I know that. The working set in GPUs is obviously huge compared to all that though.
The tipping point for graphics is likely around a few dozen MB. But you're going to start to see some effect from a few MB too.
I'm a big fan of embedded memory (but not SRAM due to the low density) myself, but I don't think it'll ever help as much for texturing as for the framebuffer.
Even if you can store just a repetitive detail texture in the cache it's going to save a lot of bandwidth. And with some smart task decomposition and scheduling you can maximize reuse.
Indeed, that can't hurt. I'd point out that textures are only a major bandwidth problem when they're uncompressed though, and I would suspect that detail textures would tend to be DXTC or 3DC. I'd be much more interested in caching large LUTs myself (there aren't many of those in current games, although STALKER has one for lighting).

That's indeed impressive. I wonder how much data that is between a round-trip (i.e. the register space you need for hiding the latency of one access, assuming full bandwidth utilization). I just think there's a point where you want to start keeping things on-chip to lower average latency and save register space.
See my above points (including memory controller efficiency) as to why trying to 'save register space' might not be the best strategy.

Anyway, as I said above, I'm a big fan of both eDRAM and TBDR architectures (and even combinations of both). Yes, it's more expensive, but it still makes sense economically: your chip costs more, but your RAM costs less. So you increase your ASPs for a given segment of the market to the detriment of Samsung's.

However, I am not convinced even using a huge eDRAM-based L3 cache for texturing could deliver a significant performance boost, or would allow current architectures to get away with fewer registers. That's just my informed opinion though, I don't have raw data at my disposal obviously.
 
I'd point out that textures are only a major bandwidth problem when they're uncompressed though, and I would suspect that detail textures would tend to be DXTC or 3DC.

Well, then isn't good support for texture compression the key?

I did some reading up on texture compression based on the earlier comments in this discussion, and it looks like most texture compression is on 4x4 (16 pixels) to/from either 64 bits or 128 bits. As 16 4-byte pixels is the same size as a Larrabee vector, perhaps Larrabee might include a "decompress" instruction that decompresses a 64-bit value in a non-vector register into a 512-bit vector register? It seems the decompression operations are easy enough (if you throw enough hardware at it) that such an operation could be done in the 4-cycle latency typical of Larrabee's other vector instructions. Or better yet, an instruction in which you say exactly which pixel you want from the 4x4 grid and it extracts just that single 32-bit pixel. This could even be done in a SIMD fashion. Would that help?
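
To make concrete what such a decompress operation would have to do, here's a scalar C sketch of unpacking one 64-bit DXT1 block into 16 RGBA8 texels (the helper names are mine; a hardware instruction, or the texture units, would do the equivalent work in parallel):

Code:
#include <stdint.h>

/* Expand an RGB565 colour to a packed 0xAABBGGRR texel (alpha = opaque). */
static uint32_t rgb565_to_rgba8(uint16_t c)
{
    uint32_t r = ((c >> 11) & 0x1f) * 255 / 31;
    uint32_t g = ((c >>  5) & 0x3f) * 255 / 63;
    uint32_t b = ( c        & 0x1f) * 255 / 31;
    return 0xff000000u | (b << 16) | (g << 8) | r;
}

/* Per-channel weighted blend of two packed texels. */
static uint32_t blend_rgba8(uint32_t a, uint32_t b, int wa, int wb)
{
    uint32_t out = 0;
    for (int s = 0; s < 32; s += 8) {
        uint32_t ca = (a >> s) & 0xff, cb = (b >> s) & 0xff;
        out |= (((ca * wa + cb * wb) / (wa + wb)) & 0xff) << s;
    }
    return out;
}

/* Decode one 64-bit DXT1 block (two RGB565 base colours followed by
   sixteen 2-bit indices) into 16 RGBA8 texels in raster order. */
void dxt1_decode_block(const uint8_t blk[8], uint32_t out[16])
{
    uint16_t c0 = (uint16_t)(blk[0] | (blk[1] << 8));
    uint16_t c1 = (uint16_t)(blk[2] | (blk[3] << 8));
    uint32_t idx = (uint32_t)blk[4]         | ((uint32_t)blk[5] << 8)
                 | ((uint32_t)blk[6] << 16) | ((uint32_t)blk[7] << 24);

    uint32_t pal[4];
    pal[0] = rgb565_to_rgba8(c0);
    pal[1] = rgb565_to_rgba8(c1);
    if (c0 > c1) {                                   /* 4-colour mode */
        pal[2] = blend_rgba8(pal[0], pal[1], 2, 1);  /* (2*c0 + c1) / 3 */
        pal[3] = blend_rgba8(pal[0], pal[1], 1, 2);  /* (c0 + 2*c1) / 3 */
    } else {                                         /* 3-colour mode */
        pal[2] = blend_rgba8(pal[0], pal[1], 1, 1);  /* (c0 + c1) / 2 */
        pal[3] = 0;                                  /* transparent black */
    }
    for (int i = 0; i < 16; i++)
        out[i] = pal[(idx >> (2 * i)) & 3];
}

(Bilinear neighbours that straddle block boundaries are a separate headache, as the next reply points out.)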
 
Well, then isn't good support for texture compression the key?
No, the problem is that if what you want is a lookup into an INT8 or FP16 texture and you actually need that many discrete values, it doesn't matter how good your block compression is, you won't achieve that in 4 bits... :)

As for doing that in an instruction, it's more likely that the texture sampler hardware would handle it TBH. That's for a variety of reasons, including the fact there's no guarantee all of the texels you want to bilinearly filter will be in the same block!
 
No, the problem is that if what you want is a lookup into an INT8 or FP16 texture and you actually need that many discrete values, it doesn't matter how good your block compression is, you won't achieve that in 4 bits... :)

Where are such textures used? I assume these aren't normal textures but more of a lookup table (LUT)? Or are they? I really don't know how textures are used (as I've said before, I don't know that much about graphics, but I'd like to learn more).

As for doing that in an instruction, it's more likely that the texture sampler hardware would handle it TBH.

Assuming Larrabee actually has dedicated texture sampler hardware...
 
Where are such textures used? I assume these aren't normal textures but more of a lookup table (LUT)? Or are they? I really don't know how textures are used (as I've said before, I don't know that much about graphics, but I'd like to learn more).
Oh, it was a generic comment. They could be used for many things (but aren't incredibly frequent right now AFAIK). The 'classic' DX9 game has at least these textures for every model:
- Color texture (4 to 8-bit per pixel, DXT1 or DXT5)
- Normal texture (8-bit per pixel, 3DC)
- Height texture (8-bit per pixel, uncompressed)

But you can add many things to that, such as information on specular lighting (4-bit to 12-bit) or even more detailed data on the lighting model (but that's not very frequent yet). There are tons of small things you can do, but since it's all programmable obviously it varies a lot from engine to engine...

Regarding LUT-like workloads, there are also many things. STALKER keeps one to handle lighting in a rather original way. You could also create a LUT at runtime that keeps spherical harmonics-like information for an ambient occlusion hack. And those are just some really basic examples because I don't have much inspiration right now.

BTW, one of my favorite introductions to graphics and graphics hardware is this presentation by NVIDIA's Stuart Oberman (ex-AMD, main guy behind 3DNow! and K6's FPU): http://rnc7.loria.fr/oberman_invited.pdf - it's not super extensive but it's pretty good IMO.

Assuming Larrabee actually has dedicated texture sampler hardware...
Which it does according to all public and private data at my disposal. But filtering is done in the shader core, I think.

EDIT: Just making sure I was properly understood in my previous posts. Bandwidth for DXTC/S3TC textures is very far from negligible; what I was saying is that on modern HW, it's pretty damn hard to be limited by bandwidth in texture-heavy workloads where all textures are compressed. Obviously a single Vec4 INT8 texture takes 8x more bandwidth than a DXT1 texture, yet the sampler/filterer hardware will handle it at the same speed.
 
While that is correct in theory, I am very skeptical it's the problem in practice. Let us assume that it takes 50K cycles, and that I need 6 passes. That's 300K cycles, which corresponds to 0.5ms on a G92. If that was the only latency between the CPU and the GPU, you could do dozens of roundtrips per frame! And 300K cycles is just an insanely pessimistic number for *any* workload.
Ok, you have me convinced. Larrabee could finish tasks faster but that's likely not a big deal.
But that's precisely my point: what if there aren't enough idle cycles because your application is very bandwidth-intensive?
If Larrabee's caches do what they're supposed to do then there will be periods of some misses followed by periods of many hits. CPU RAM access behaviour is typically very bursty.

Here's a scenario: Suppose you have a very bandwidth intensive situation, say a huge shadow map. Then the threads won't be able to hide all the latency, and some start to stall. This frees up cycles for speculative prefetching, and when the threads continue at full speed again they'll have extra data in the caches they can access at low latency, till they run out of data again and the cycle restarts. So this automatically averages out to a balanced situation where you're making good use of available bandwidth and the threads stay busy.

Several speculative prefetches can go to waste, but not all of them. And as soon as there is some reuse of its cached data Larrabee can do better than a GPU because it saves bandwidth for other things and saves register space for long complex shaders.
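
Just to illustrate the software side of that idea, a minimal C sketch using the GCC/Clang prefetch hint (the names and the shadow-map addressing here are made up and grossly simplified; Larrabee's actual prefetch machinery isn't public, so treat this as the general technique rather than how Larrabee does it):

Code:
/* Hint the next iteration's shadow-map texel toward the cache while the
   current one is being compared; a wasted hint costs little, a correct
   one hides most of the miss latency. */
void filter_shadow(const float *shadow_map, const int *texel_addr,
                   const float *receiver_depth, float *visibility, int n)
{
    for (int i = 0; i < n; i++) {
        if (i + 1 < n)
            __builtin_prefetch(&shadow_map[texel_addr[i + 1]], 0, 1);

        float d = shadow_map[texel_addr[i]];
        visibility[i] = (receiver_depth[i] <= d) ? 1.0f : 0.0f;
    }
}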
I don't disagree with that overall rule of thumb, but I suspect you're significantly overestimating that effect's magnitude. YMMV, however.
If it doesn't happen this decade it will happen in the next. I fully agree Intel is taking a risky bet. But if they succeed...
 