Larrabee: Samples in Late 08, Products in 2H09/1H10

...a software implementation takes only three additions, three compares and two 'and' operations in the inner loop. With 24 cores and single-issue 16-wide SIMD at 2 GHz that's 96 gigapixel/s. A GeForce 8800 GTX is capped at 13.8.
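(For reference, the inner loop being quoted is the classic half-space/edge-equation test; a rough scalar sketch under an assumed fill rule and memory layout, not actual Larrabee code:)

```c
/* One span of the half-space rasterizer referred to above: per pixel,
   three additions step the edge equations, three sign tests plus two ANDs
   decide coverage. Layout and fill rule are assumed for illustration. */
void raster_span(int cx1, int cx2, int cx3,   /* edge values at span start  */
                 int dx1, int dx2, int dx3,   /* per-pixel edge increments  */
                 int width, unsigned *pixels, unsigned color)
{
    for (int x = 0; x < width; ++x) {
        if ((cx1 > 0) & (cx2 > 0) & (cx3 > 0))   /* 3 compares + 2 ands */
            pixels[x] = color;
        cx1 += dx1;                              /* 3 additions */
        cx2 += dx2;
        cx3 += dx3;
    }
}
```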
I wrote sw rasterizers in the past and it's not the amount of computation you need to perform to evaluate edge equations that worries me, but flow control.
The numbers you quoted tell me that a 24-core Larrabee in 2010 would need to use over 10% of its computational power to match a four-year-older GPU at rasterizing huge quads. But no one really cares about that case; efficiency (for a CPU) is likely to go down dramatically with lots and lots of small primitives (though SMT can help in this case).
I'm sure people can come up with extremely clever rasterizers that exploit SIMD ALUs and SMT, still I don't see why a tiny fixed-function hw rasterizer wouldn't be regarded as useful, if not even completely necessary, in order to make Larrabee a competitive part.

[edit] A couple of years ago I started designing a rasterizer for CELL (Olano style) that I never had the time to implement for real. I don't remember the exact numbers I came up with anymore, but I do remember that peak performance was going to be quite pathetic compared to RSX... in a best-case scenario.
 
I wrote sw rasterizers in the past...
In the past shading consisted of one or two blend stages, so obviously rasterization itself was a significant cost compared to that. Heck, before texturing it was the only cost. Nowadays, with shaders that have hundreds of operations, it's of lesser importance (in my experience).
...it's not the amount of computation you need to perform to evaluate edge equations that worries me, but flow control.
Could you elaborate?
The numbers you quoted tell me that a 24-core Larrabee in 2010 would need to use over 10% of its computational power to match a four-year-older GPU at rasterizing huge quads.
Huge quads... and short shaders. A GPU's rasterizer units are obviously overdimensioned to avoid becoming a bottleneck, so rasterization very likely takes much less than 10% for Larrabee on average. But it has the potential of beating a (four-year-old) GPU in situations where rasterization is its bottleneck.
But no one really cares about that case; efficiency (for a CPU) is likely to go down dramatically with lots and lots of small primitives (though SMT can help in this case).
I don't see how efficiency would be much better in this case for a GPU? Could you please define efficiency in a little more detail, and point me to the problematic things (that GPUs have no problems with)?
...still I don't see why a tiny fixed-function hw rasterizer wouldn't be regarded as useful, if not even completely necessary, in order to make Larrabee a competitive part.
Because it will be bandwidth limited much sooner than rasterization limited. Because a dedicated unit could be a bottleneck for short shaders and a waste of transistors for long shaders. And because as I already suggested it's possible to add instructions that benefit rasterization, so all of the cores can do reasonably efficient rasterization.

I think you're still a bit stuck in the idea that dedicated fixed-function hardware is by definition smaller and faster than programmable hardware. This is true if you're only doing one task. If you have a mix of different tasks that require somewhat similar operations, it makes a lot of sense to unify them in the same units and put the whole chip to use at all times, independent of the workload balance.
A couple of years ago I started designing a rasterizer for CELL (Olano style) that I never had the time to implement for real. I don't remember the exact numbers I came up with anymore, but I do remember that peak performance was going to be quite pathetic compared to RSX... in a best-case scenario.
I have a hard time believing that rasterization would be the sole cause of that. But I wouldn't mind being proven wrong with some convincing numbers.
 
Then again, GPUs might choose a very similar software-oriented route...
ROPs are pretty much guaranteed to be software by the time Larrabee appears (commercially) and by the time of your gen 3 Larrabee I'd be surprised if there's more than a hair's breadth between GPUs and Larrabee for the D3D pipeline - the major difference being how x86 code runs. And in AMD's case, well...

Jawed
 
Going from Pentium M to Core 2 the only 'major' changes are doubling the L1 cache bus width to 128 bit, doubling the width of the SSE execution units, and issuing four operations per clock.
Why did you start with Banias? There were x86 cores prior to 2003.

True, but my point is that Larrabee might evolve even more rapidly.
From your earlier examples, it would seem to be quite the opposite, since Merom "only" changed three things about Banias.
That's 3 minor architectural nitpicks in three years.

The stability of the ISA doesn't free the hardware to evolve more rapidly.
In modern x86 there is a relatively fixed hardware cost for translating x86 into an ever-changing internal representation. The vast majority of the silicon doesn't care at all how stable the ISA is.

If there is any faster evolution to Larrabee, it will be due to market and manufacturing factors that have little to do with the ISA itself.
 
What's the actual impact of extra read/write ports (but keeping the same total bus width) on size and performance?

I'm probably not qualified to give a really good answer on that, but perhaps the professor would have a good answer. If gather wasn't so expensive, I'm sure you would have seen it in SSE by now...

With 16-wide SIMD units, I expect it can transpose a 4x4 matrix in one instruction. This would make AoS-SoA conversion very fast, and indeed make scatter/gather less of a necessity for everything except texture sampling.
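For illustration, the AoS-to-SoA conversion with today's SSE transpose macro (a sketch; the one-instruction 16-wide version is speculation on my part):

```c
#include <xmmintrin.h>

/* AoS -> SoA for four xyzw vertices using the standard SSE 4x4 transpose.
   On SSE this macro expands to a handful of unpack/move instructions; the
   point above is that a 16-wide unit might do the equivalent in one. */
void aos_to_soa(const float in[16], float x[4], float y[4], float z[4], float w[4])
{
    __m128 r0 = _mm_loadu_ps(in + 0);    /* x0 y0 z0 w0 */
    __m128 r1 = _mm_loadu_ps(in + 4);    /* x1 y1 z1 w1 */
    __m128 r2 = _mm_loadu_ps(in + 8);    /* x2 y2 z2 w2 */
    __m128 r3 = _mm_loadu_ps(in + 12);   /* x3 y3 z3 w3 */
    _MM_TRANSPOSE4_PS(r0, r1, r2, r3);
    _mm_storeu_ps(x, r0);
    _mm_storeu_ps(y, r1);
    _mm_storeu_ps(z, r2);
    _mm_storeu_ps(w, r3);
}
```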

Perhaps Larrabee would have a full generalized vector permute, but you would need a 4bit*16 = 64bit value for that, so my guess is that vector permute will require an integer register (and not work from an immediate) which would be more expensive. Still, most AoS code will not be able to average above 50% utilization of the 16-wide vectors (meaning actual ALU work done vs. permutation/moving stuff around) even with scatter/gather, so AoS is really not a good path to think about at all.

I'm not sure I follow here. Render targets are never read directly after being written. The first read as a texture could be in a very different location than the last write. So pixel order doesn't matter, at this level.

Perhaps not immediately, but you are going to need to have texture access to render targets for obvious stuff (like any deferred shading paths or image space post processing paths), so there is no sense in doing another pass just to re-order pixels for 2D locality required by the texture fetch units.

Not necessarily. The software is not restricted to implementing one rendering pipeline implementation, so you can specialize for the type of primitive. Generic scatter/gather would be handy though.

I don't think Larrabee will have single-cycle scatter; instead it will require 16 cycles to scatter, so single pixels will still be slow, especially if blending. But perhaps at least it won't be as bad as the multiple extra instructions needed to scatter on SSE.

And I think it can have an advantage in terms of efficiency. It never has a bottleneck, never has bubbles because of register pressure, doesn't have to mask away unused components, can reduce overdraw to zero, specialize for specific scenarios, etc.

Let's not get too optimistic here... For one thing, current GPUs already cache textures about as optimally as you could expect Larrabee to if it has a separate texture cache (if not separate, then texture caching will probably be suboptimal on Larrabee), so Larrabee won't have any advantage in texture fetch performance and will suffer from the exact same texture bottleneck that current GPUs have. Masking is also a given for shaders that have any divergent branches, and as I described before, the branch granularity will probably need to be large (32, 48, or 64) to have enough instructions after a texture load to hide the latency of going to L2 (and beyond) with only 4 in-order threads per core. Overdraw is application specific, and is only going to increase as GPU performance goes up, as artists and programmers can get away with more transparent effects (adding a fully programmable ROP will accelerate this).

Don't get me wrong, I still think Larrabee will be a very interesting platform to program on.

Attribute interpolation is at most two multiply-add operations, I see no issues there. And I don't think there's any part of the pipeline we haven't discussed yet that isn't obviously fairly efficient to implement on a CPU. :)

At least to be perspective correct you need more than two MACs, right?

One thing to remember here is that a software GPU is going to need to take ALU and memory bandwidth to do all the work and routing that a normal GPU does in hardware. At least in this simple case of just shader (after interpolation) and ROP, you probably have an extra (very rough estimate):

- 2x load from RT (get dest)
- 4x unpack
- subtract (1-alpha)
- 3x mul (1-alpha)*dest
- 3x mac (alpha)*src+dest ... finish blending for RGB
- 2x pack
- 2x store to RT

perhaps 17 instructions (averaging a little over 1 op/pix, though) to do the ROP alone for an RGBA FP16 render target when doing simple alpha blending. Now in the case of a simple particle engine (a common blending case), which involves one texture lookup and perhaps a single multiply by an interpolated color value (from the vertex shader), a majority of the ALU work would be for the ROP and attribute interpolation.
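For illustration, a scalar model of that blend sequence for one pixel; half_to_float/float_to_half are assumed helpers standing in for the FP16 unpack/pack steps, not real instructions:

```c
typedef unsigned short half16;            /* raw FP16 bits */

extern float  half_to_float(half16 h);    /* assumed conversion helpers */
extern half16 float_to_half(float f);

/* Read the FP16 destination, blend src over it, write it back.
   The vectorized version of exactly this sequence is what the
   instruction count above refers to. */
void blend_pixel_fp16(half16 dst[4], const float src[4])
{
    float d[4];
    for (int c = 0; c < 4; ++c)           /* load + unpack */
        d[c] = half_to_float(dst[c]);

    float inv_a = 1.0f - src[3];          /* subtract: 1 - alpha */
    for (int c = 0; c < 3; ++c)           /* mul + mac: alpha*src + (1-alpha)*dest */
        d[c] = src[3] * src[c] + inv_a * d[c];
    /* destination alpha left untouched, matching the op count above */

    for (int c = 0; c < 4; ++c)           /* pack + store */
        dst[c] = float_to_half(d[c]);
}
```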

The point is that the extra software GPU work can easily exceed the amount of work actually done in the shader itself. Of course, TEX:ALU ratio issues might help Larrabee hide some of the extra ALU work done in software, but not all of it (in this simple particle engine case the particle texture might mostly be in L1 all the time). Even if Larrabee's peak Gflops is the same as GPUs, many clock cycles will get eaten by non-shader work.

[edit] Of course, this is just a simple example which doesn't cover the other important overheads like early-Z, stencil, and everything prior.
 
Itanium's write-through L1 data cache has 2 read ports and 2 write ports at a 1 cycle latency.

The low latency can be credited to its small size (16KB), and I've seen others also credit the fact IA-64 only has one memory addressing mode in helping shave off a cycle.

x86 processors have avoided heavy multiporting of their L1 caches in order to maintain cycle time. Phenom has a pseudo-dual ported cache, and Core2's is dual ported.

Extra ports do increase the physical size of the cache for a given capacity.
 
Now for some other parts of the pipeline,

Stencil would require pack (+load) and unpack (+store) from 16 pix * 8 bit = 128-bit aligned addresses, with a similar code structure to that needed for the ROP. Predicates again would be mandatory to mask pixels. Depth compare would be similar to the stencil and ROP code, without any pack/unpack instructions. Also, without dedicated hardware compression of depth, Z bandwidth usage could be excessive compared to GPUs; doing decompression in software would negate any bandwidth gains with ALU cost.

Some type of software hierarchical Z to cull full 4x4 pixel blocks (or larger) prior to shading could be used for early-Z optimizations. The problem here is that SoA doesn't map well to the hierarchical-Z write problem (you need a parallel min/max reduction, which could be 4 de-interleave and 4 min/max instructions, all serially dependent, followed by a final scalar load, min/max and store). Also, on the early-Z cull side, you would need to compute the min Z for the pixel blocks and load/compare against the hierarchical Z buffer.
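A rough SSE sketch of that reduction for one 4x4 tile (a max-Z HiZ entry is assumed here; the exact scheme is just one possibility):

```c
#include <xmmintrin.h>

/* Max-Z reduction of a 4x4 tile held as four SSE registers, followed by the
   scalar read-modify-write of the coarse HiZ entry mentioned above. */
float tile_max_z(const float z[16], float *hiz_entry)
{
    __m128 m = _mm_max_ps(_mm_max_ps(_mm_loadu_ps(z + 0), _mm_loadu_ps(z + 4)),
                          _mm_max_ps(_mm_loadu_ps(z + 8), _mm_loadu_ps(z + 12)));
    /* horizontal max: two shuffle/max pairs, serially dependent */
    m = _mm_max_ps(m, _mm_shuffle_ps(m, m, _MM_SHUFFLE(1, 0, 3, 2)));
    m = _mm_max_ps(m, _mm_shuffle_ps(m, m, _MM_SHUFFLE(2, 3, 0, 1)));

    float tile_max = _mm_cvtss_f32(m);
    if (tile_max > *hiz_entry)            /* final scalar load, compare and store */
        *hiz_entry = tile_max;
    return tile_max;
}
```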

Lots of overhead here already just for pixel parts of the pipeline ...

To fetch vertex attributes you basically need gather. Perhaps Larrabee would use its texture units for this? It would make some sense; the hardware could share format conversion, for instance. So separate hardware wouldn't be needed for this.

The path from vertex shader to pixel shader gets really messy (no time to cover that now). And it seems that with Larrabee you have to eat all the bandwidth of this overhead through the common memory subsystem (instead of perhaps having dedicated GPU hardware paths for it). This could be a serious problem for small triangles...
 
Why did you start with Banias? There were x86 cores prior to 2003.
Because it shows that in five years time relatively little has changed.
From your earlier examples, it would seem to be quite the opposite, since Merom "only" changed three things about Banias. That's 3 minor architectural nitpicks in three years.
Exactly. They can keep the same architecture, add more cores, update the software with new functionality. Done.

NVIDIA/ATI goes back to the drawing board for every major generation and it takes years to see the light of day. Like I've said before, Larrabee could support the next version of DirectX the day Microsoft finalizes the specifications. That's what I call rapid evolution.
The stability of the ISA doesn't free the hardware to evolve more rapidly.
I'm not talking about the evolution of the hardware, I'm talking about the evolution of it as a whole.
 
Because it shows that in five years time relatively little has changed.
Because the code they targeted has barely changed.
That is not a luxury the graphics cards had.

*edit* fixed bad quote formatting

NVIDIA/ATI goes back to the drawing board for every major generation and it takes years to see the light of day.
You've pointed out one very major transition.
There is a much more clear chain of inheritance in the graphics units that preceded G80.

I'm willing to give the GPUs just one major internal reorganization as a freebie in the face of the changes brought by the PPro and Netburst cores.

Like I've said before, Larrabee could support the next version of DirectX the day Microsoft finalizes the specifications. That's what I call rapid evolution.
It worked for Transmeta, right?
Outside of the engineers charged with redesigning the hardware every generation (the number of redesigns and the actual cost of which I think you are overestimating), who else is going to care?

What good is it if Larrabee can be made to run DX14 applications, when by that time nothing from Larrabee's time frame can be expected to perform acceptably?
 
If gather wasn't so expensive, I'm sure you would have seen it in SSE by now...
It took them ten years to add a packusdw instruction with SSE4 (packssdw, packsswb and packuswb existed since MMX). Plenty more examples... ;)

So I don't think there's a very strong correlation between how (in)expensive something is to implement and actually adding it. They just need to see a need for it. Several SSE instructions are useful for video encoding but almost nothing else, and are fairly complicated. So the complexity of scatter/gather alone doesn't justify not adding it. So far they've just been happy with the idea that most things can use SoA data. With Core 2's 128-bit wide SSE execution units and multi-issue, the need for scatter/gather has definitely grown.

By the way, the first scatter/gather implementation for SSE could be real simple. They could just serialize the memory operations. This already saves code size and I'm sure it can be faster than today's extract/insert instructions. If the instruction is used frequently they can implement it for L1 cache. And if it becomes really popular they could optimize it for L2 as well. Lots of options in between too (like doing two fetches per clock instead of four).
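For comparison, here's roughly what a 4-wide gather costs today with plain SSE loads and shuffles, the pattern a serialized gather instruction would replace:

```c
#include <xmmintrin.h>

/* Software 4-wide gather with SSE: four scalar loads plus shuffles to pack
   the elements into one register. */
__m128 gather4(const float *base, const int idx[4])
{
    __m128 a = _mm_load_ss(base + idx[0]);
    __m128 b = _mm_load_ss(base + idx[1]);
    __m128 c = _mm_load_ss(base + idx[2]);
    __m128 d = _mm_load_ss(base + idx[3]);
    __m128 ab = _mm_unpacklo_ps(a, b);    /* a b . . */
    __m128 cd = _mm_unpacklo_ps(c, d);    /* c d . . */
    return _mm_movelh_ps(ab, cd);         /* a b c d */
}
```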
Perhaps Larrabee would have a full generalized vector permute, but you would need a 4bit*16 = 64bit value for that, so my guess is that vector permute will require an integer register (and not work from an immediate) which would be more expensive.
This is how SSSE3's pshufb works, on a byte level.
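A quick SSSE3 example, with the shuffle control coming from a register rather than an immediate:

```c
#include <tmmintrin.h>

/* pshufb: each byte of 'ctrl' selects a source byte, so the permutation
   pattern lives in an xmm register, not in the instruction encoding. */
__m128i reverse_bytes(__m128i v)
{
    const __m128i ctrl = _mm_setr_epi8(15, 14, 13, 12, 11, 10, 9, 8,
                                        7,  6,  5,  4,  3,  2, 1, 0);
    return _mm_shuffle_epi8(v, ctrl);
}
```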
Let's not get too optimistic here... For one thing, current GPUs already cache textures about as optimally as you could expect Larrabee to if it has a separate texture cache (if not separate, then texture caching will probably be suboptimal on Larrabee), so Larrabee won't have any advantage in texture fetch performance and will suffer from the exact same texture bottleneck that current GPUs have.
So far I was assuming that the texture samplers share a large L2 cache with speculative prefetching. So yeah, that's an optimistic assumption. Without it there would be no reuse and high latency, and I can hardly imagine it being able to sustain high performance... It's all just speculation, and I fully realize we'll have to wait till it's presented to see how they solved these issues (or not).
Overdraw is application specific, and is only going to increase as GPU performance goes up, as artists and programmers can get away with more transparent effects (adding a fully programmable ROP will accelerate this).
I'm not sure if this is true. Games don't add more transparent effects just because they can afford to. Crysis still looks a lot like Far Cry. Sure, the ROPs are tasked more but in comparison to the longer shaders I expect it to be relatively equal or even less.
At least to be perspective correct you need more than two MACs, right?
Right, for the generic case a division per pixel and another multiply per interpolant is needed. But with adjacent pixel quads the interpolation itself can be done with just an addition.
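Per pixel it looks roughly like this (the per-triangle plane setup is assumed):

```c
typedef struct { float a, b, c; } plane;        /* value(x, y) = a*x + b*y + c */

static float eval(plane p, float x, float y)
{
    return p.a * x + p.b * y + p.c;             /* two multiply-adds */
}

/* Perspective-correct interpolation: one reciprocal per pixel, then one
   extra multiply per interpolant on top of the plane evaluations. */
void interpolate_pixel(const plane *attr_over_w, int count,
                       plane one_over_w, float x, float y, float *out)
{
    float w = 1.0f / eval(one_over_w, x, y);     /* the per-pixel division */
    for (int i = 0; i < count; ++i)
        out[i] = eval(attr_over_w[i], x, y) * w; /* extra multiply per interpolant */
}
```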

Doesn't G80 use multiple instructions for interpolation and perspective correction as well?
One thing to remember here is that a software GPU is going to need to take ALU and memory bandwidth to do all the work and routing that a normal GPU does in hardware.
Correct me if I'm wrong but doesn't a GPU have a lot of queues? For Larrabee, caches bind everything together and they should offer insane bandwidths for on-chip data. Some overhead is unavoidable, but I don't think GPUs have a significant advantage here. If this was really an issue we'd still have GPUs with combined sampler and ALU pipelines, and non-unified pixel and vertex shaders...
...perhaps 17 instructions (averaging a little over 1 op/pix, though) to do the ROP alone for an RGBA FP16 render target when doing simple alpha blending. Now in the case of a simple particle engine (a common blending case), which involves one texture lookup and perhaps a single multiply by an interpolated color value (from the vertex shader), a majority of the ALU work would be for the ROP and attribute interpolation.
Very true. However, like I said before such short shaders would be bandwidth limited anyway (except if the particle texture fits in a shared cache). Compared to how a GPU would perform this isn't bad at all. And for the many situations where alpha blending is disabled it becomes less than one operation per pixel. A GPU would have idle ALUs in the ROPs.
The point is that the extra software GPU work can easily exceed the amount of work actually done in the shader itself.
You have to look at the whole equation here. If a GPU had instead of ROPs that are idle half of the time, shader units that can perform the ROP's task, you'd have a higher total utilization. And I wouldn't be surprised at all if GPU designers are already looking at doing exactly that.
Even if Larrabee's peak Gflops is the same as GPUs, many clock cycles will get eaten by non-shader work.
Yes, it definitely needs more GFLOPS to keep up with the high-end GPUs. I personally don't expect that to happen for the first version though. But that wouldn't make it a failure. If the 'unify everything' approach works they can fairly easily crank up the GFLOPS and create a winner. It would take another year or three for the next major GPU architecture, but Larrabee would get faster with every new process and more advanced software.
 
By the way, the first scatter/gather implementation for SSE could be real simple. They could just serialize the memory operations.

That's exactly what I am thinking (16 times the cycles of an aligned 512-bit fetch). Now the question is whether it does this in the background while another thread runs, or whether you have to do it manually with 16 loads (and a special instruction which fetches one of the 16 scalar slots)...

I'm not sure if this is true. Games don't add more transparent effects just because they can afford to. Crysis still looks a lot like Far Cry. Sure, the ROPs are tasked more but in comparison to the longer shaders I expect it to be relatively equal or even less.

You are somewhat right; they don't add transparent effects now because at the resolutions of today's LCDs there isn't enough performance left for the overdraw. Because of this, doing transparent effects at, say, 1/4 the screen size, without per-pixel shading, and one upsample + merge with the full-size framebuffer is common. When GPU performance increases, I'm guessing the next biggest thing is rendering the space between surfaces (atmosphere / participating media), which means lots of transparent effects.

Shader length has to be long simply because the ALU:ROP ratio is very high (it's better to do four lights at one time than to do 4 passes with one light; ROP is expensive).

Doesn't G80 use multiple instructions for interpolation and perspective correction as well?

G80 has a dedicated unit for interpolation in the ALU and uses the ALU for the divide I think, but I would have to check to be sure.

Correct me if I'm wrong but doesn't a GPU have a lot of queues? For Larrabee, caches bind everything together and they should offer insane bandwidths for on-chip data.

The crux of the software pipeline is that you have to ensure it keeps everything in cache. Blowing through L1 is a given; keeping everything in L2 will take some serious work. There cannot be much in between when a vertex gets processed and when its triangle's pixels are rendered, otherwise the intermediate data will get flushed out. This will be tough for programs which are mainly pixel bound. L2 is bound to be shared for texturing as well...

Very true. However, like I said before such short shaders would be bandwidth limited anyway (except if the particle texture fits in a shared cache). Compared to how a GPU would perform this isn't bad at all. And for the many situations where alpha blending is disabled it becomes less than one operation per pixel. A GPU would have idle ALUs in the ROPs.

Yeah, that particle shader would be ROP bound regardless. Still, once you add software early-Z, stencil, depth test, hierarchical-Z update, and ROP, you are probably talking about 64 or more instructions without doing any shading work... which could be enough for 4 texture fetches and 2 RT ROP blends on a current GPU. This software overhead is not a small thing.

You have to look at the whole equation here. If a GPU had instead of ROPs that are idle half of the time, shader units that can perform the ROP's task, you'd have a higher total utilization. And I wouldn't be surprised at all if GPU designers are already looking at doing exactly that.

I agree that ROP will probably go programmable by 2010.
 
By the way, the first scatter/gather implementation for SSE could be real simple. They could just serialize the memory operations. This already saves code size and I'm sure it can be faster than today's extract/insert instructions. If the instruction is used frequently they can implement it for L1 cache. And if it becomes really popular they could optimize it for L2 as well. Lots of options in between too (like doing two fetches per clock instead of four).

You'll still end up with one instruction that can page fault 16 times. I think you're right that Intel will serialize it, or rather tell developers to serialize them in software.

Cheers
 
Itanium's write-through L1 data cache has 2 read ports and 2 write ports at a 1 cycle latency.

The low latency can be credited to its small size (16KB), and I've seen others also credit the fact IA-64 only has one memory addressing mode in helping shave off a cycle.
Yeah, IA-64 supports only indirect addressing without offsets, which removes an adder from the load/store pipe and, coupled with the small L1, allows for single-cycle access.
 
You'll still end up with one instruction that can page fault 16 times.
That doesn't matter. You can only start working with the vector once you have all elements anyway. Like I've said before, it's important to optimize the case where all data is in L1 cache, but the rest (L2, RAM, disk) rapidly becomes irrelevant.
 
From a microarchitectural point of view it most certainly does.
If one of the elements misses L1 cache the accesses can be serialized, much like the already present insert/extract instructions.

Or am I missing something more serious here?
 
If one of the elements misses L1 cache the accesses can be serialized, much like the already present insert/extract instructions.

Or am I missing something more serious here?

I just think you're adding a lot of complexity for something that will never be fast. A sixteen ported D$ would be big and burn power compared to a single ported 512bit-wide D$.

Your gathering load instruction can produce 16 cache misses and, worse, 16 TLB misses. A scattering store is at least as bad. Worst case execution time would be appalling (interrupt response *ugh*).

I think it will be more likely that Larrabee is going to be optimized for locality, using dense vectors/matrices (and swizzled buffers) with one big fat 512 bit port in the D$.

A few primitives could help speed up a software based gathering load. Imagine you have a vector holding the indices for the vector you want to load.

1. Load a 512-bit vector at the address of the first non-zero entry in the index vector.
2. Compute a permute vector based on the address used for the load.
3. Use the permute vector to stuff the floats into the destination register (the gathering register).
4. Mask off the entries in the index vector that fall within the address range loaded in (1).
5. If there are still non-zero entries left in the index vector, repeat.

Scattering store would work similarly.
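A minimal scalar model of that loop, assuming 32-bit elements and a 64-byte (512-bit) load window; the permute/stuff step is modeled with a plain element copy:

```c
#include <stdint.h>

#define LANES     16
#define VEC_BYTES 64   /* one 512-bit load */

/* Scalar model of the gathering-load loop sketched above. 'idx' holds the
   element indices for the 16 lanes; each iteration does one wide load and
   serves every lane whose element falls inside that load window. */
void gather16(const float *base, const uint32_t idx[LANES], float dst[LANES])
{
    int pending[LANES], left = LANES;
    for (int i = 0; i < LANES; ++i) pending[i] = 1;

    while (left > 0) {
        /* step 1: "load" 512 bits starting at the first still-pending element */
        int first = 0;
        while (!pending[first]) ++first;
        uintptr_t lo = (uintptr_t)(base + idx[first]);
        uintptr_t hi = lo + VEC_BYTES;

        /* steps 2-4: serve every lane covered by that load, then mask it off */
        for (int i = 0; i < LANES; ++i) {
            uintptr_t addr = (uintptr_t)(base + idx[i]);
            if (pending[i] && addr >= lo && addr + sizeof(float) <= hi) {
                dst[i] = base[idx[i]];   /* stands in for the permute step */
                pending[i] = 0;
                --left;
            }
        }
        /* step 5: repeat while lanes remain */
    }
}
```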

Cheers
 
A sixteen ported D$ would be big and burn power compared to a single ported 512bit-wide D$.
I was thinking more of 128-bit SSE with four ports or 512-bit Larrabee SIMD with four ports. Both the CPU and Larrabee would benefit from that without extreme hardware cost.
Your gathering load instruction can produce 16 cache misses and, worse, 16 TLB misses. A scattering store is at least as bad. Worst case execution time would be appalling (interrupt response *ugh*).
If you insert/extract the elements serially you get the same number of cache and/or TLB misses. So you don't have to compare performance to a single full-width store, but regard it as several memory accesses. In the best case you get some parallelism and it's faster. In the worst case it's equally fast (slow).

The important thing is to lower average latency. L1 hits are very low latency, but not if you're doing 4-16 of them serially. With a single-access hit ratio of 90% there's a 65% chance (0.9^4 ≈ 0.66) that four accesses can be done in parallel, at the latency of one access. With a latency of 1 cycle for a single access that's an average latency of only about 2 cycles with four-way scatter/gather, instead of 4 cycles.

So with four cache ports Larrabee could do 16-way scatter/gather without bubbles if the ALU:MEM ratio is 4 (or less if not all memory operations are scatter/gather).
 