Larrabee at Siggraph

Discussion in 'Architecture and Products' started by nAo, Jun 2, 2008.

  1. nAo

    nAo Nutella Nutellae
    Veteran

    Joined:
    Feb 6, 2002
    Messages:
    4,400
    Likes Received:
    440
    Location:
    San Francisco
    I don't want to sound like a broken record, but has anyone complaining about Intel not thinking about programming on many-core architectures bothered to read this material?

    Ct: C for Throughput Computing

    Ct: A Flexible Parallel Programming Model for Tera-scale Architectures

    While it's true that until they ship it it's just vaporware, it looks very promising to me.
    They are clearly attacking the problem, and on top of that we (as developers) will probably have support for DX11's compute shaders and OpenCL.
     
    #441 nAo, Aug 14, 2008
    Last edited: Aug 14, 2008
  2. 3dilettante

    Legend Alpha

    Joined:
    Sep 15, 2003
    Messages:
    8,579
    Likes Received:
    4,799
    Location:
    Well within 3d
    In memory, yes. Not knowing the parameters of the cache and just how extensive Larrabee's cache control instructions are, we can't assume that it's necessarily in cache. It's likely that it's usually there, so long as other accesses don't start scatter-gathering across the address space.

    The paper says a tile is sized so that it can be fully loaded into cache, but does that mean the full load of fiber data and emulation code is part of that total?

    Hard to say without implementation information. If we want intermediate values to persist for a fiber across a switch, some time must be taken to save its state, and the new state must be read in.
    Depending on the implementation, and on other factors that should become clearer as more information trickles out, it could incur a higher cost than using a register that is in a known place and is guaranteed to be there.
     
  3. TimothyFarrar

    Regular

    Joined:
    Nov 7, 2007
    Messages:
    427
    Likes Received:
    0
    Location:
    Santa Clara, CA
    I take it we are still missing any info on number of vector registers, ALU latency, and max number of outstanding memory reads. So we really don't know what types of latency Larrabee can hide yet.

    I'm also going to have to agree with Marco here: I bet that nearly all game developers (i.e. the majority of the market for the device) doing future gaming GPGPU on Larrabee are going to use DX11 compute shaders or OpenCL (assuming it ends up a finished, established standard), unless Larrabee gets into a console, and then all bets are off.

    DX11 compute looks to be CUDA rewrapped. So all of the fine-grained sync advantages of Larrabee end up being for driver writers only, the bandwidth "advantage" of the tiled renderer goes away (it's not applicable to compute shaders / GPGPU), and Larrabee is left trading a possible inability, compared to GPUs, to hide latency against the lower latency it gets from its caches.

    Also, developers are going to have to design algorithms around block sizes and shared-register thread-group sizes that perform well on all hardware in order to see performance from compute shaders on every platform (or adapt the algorithm dynamically based on the platform).

    So I think the reality of the situation is that we are relying on great drivers...
     
  4. rpg.314

    Veteran

    Joined:
    Jul 21, 2008
    Messages:
    4,298
    Likes Received:
    0
    Location:
    /
    Sorry for a n00b ques, but what is DB incoherency?

    They said they have dedicated texture units. Isn't interpolation part of that? But yes, they might not have an interpolation instruction as part of the vector ISA.

    Absolutely. It'll make a lot of HPC folks happy, but games won't need it. But I guess ATI will come out ahead of Larrabee in DP speed, because their absolute SP speed will probably be much higher even though their SP/DP ratio is lower.
     
  5. 3dilettante

    Legend Alpha

    Joined:
    Sep 15, 2003
    Messages:
    8,579
    Likes Received:
    4,799
    Location:
    Well within 3d
    Dynamic branching, where pixels within a batch/fiber take different paths.

    The Larrabee paper stated that it's done in software.

    Intel's targeting something like 1 TFLOP DP.
    You expect AMD's DP rate to quadruple?
    Even then, without changing things, AMD's DP flops would be in some ways inferior to Intel's IEEE compliant math.
     
  6. heliosphere

    Newcomer

    Joined:
    Jun 15, 2005
    Messages:
    142
    Likes Received:
    15
    It seems like Intel has some conflicting incentives here. On the one hand they're making a big deal of the fact that Larrabee is x86 and offers the chance of porting existing code; on the other they're introducing a new wide-SIMD instruction set that would benefit from new programming models like Ct (which I haven't actually read much about yet) or, ironically, CUDA. The more they abstract away the underlying architecture with a new programming model, though, the less their advantage from supporting x86.

    It seems like NVIDIA stands to do well if any of CUDA, OpenCL, DirectX 11 compute shaders or any other portable abstraction layer ultimately becomes the most adopted technology. Intel's big advantage comes from persuading people *not* to move to a new programming model but to take existing C++ codebases targeting x86 and adapt them to Larrabee.
     
  7. Jawed

    Legend

    Joined:
    Oct 2, 2004
    Messages:
    11,716
    Likes Received:
    2,137
    Location:
    London
    Agreed: in memory, and with a following wind prolly in cache. What other accesses will be mucking things up though? The four threads in a core shouldn't be stomping over each other's cache lines.

    In general I guess that would be normal. The paper says that for pixel shading tile size hardly makes any difference in performance (sounding eerily similar to the "Xenos performance is fine if you use tiled rendering" argument - we know that's only mostly true...).

    With everything statically compiled, the shader's ALU:TEX is known and therefore the worst-case TEX latency that needs hiding. That decides fibre count. If the shader also does constant buffer fetches, the compiler can work out how many of those are in-flight at any given point in a shader and add that into the cache allocation. If the compiler decides to deploy re-packing for DB then that's more cache to allocate. Whatever's left is for the tile.
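
    To put toy numbers on that reasoning (everything here is an invented illustration; the paper gives no such figures), the fibre count falls out of the ALU:TEX ratio and the latency that needs hiding:

```python
import math

# Toy model: while one fibre waits on its TEX, the other fibres' ALU
# work must cover the latency. All numbers are invented illustrations,
# not Larrabee specifics.
def fibres_needed(alu_ops_per_tex, cycles_per_alu_op, tex_latency):
    alu_work = alu_ops_per_tex * cycles_per_alu_op  # cycles of cover per fibre
    return 1 + math.ceil(tex_latency / alu_work)

# e.g. 10 ALU ops per TEX at 1 cycle each, 200 cycles of texture latency
print(fibres_needed(10, 1, 200))   # -> 21 fibres
print(fibres_needed(100, 1, 200))  # -> 3 fibres for a fatter ALU:TEX shader
```

    A shader with a fatter ALU:TEX ratio needs fewer fibres, which is exactly the lever the static compiler has to play with here.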

    Naturally there's a tipping point. All the GPUs have tipping points.
    It doesn't make sense for the register file to be so small that every fibre switch incurs a flush of the outgoing fibre's registers to memory.


    Registers should only be written to memory because:
    1. the entire population of registers wasn't initially allocated for the shader - D3D10 requires the GPU to support 4096 vec4 fp32 registers per element. So GPUs are expected to page registers over the lifetime of a shader when subjected to this kind of duress
    2. the code specifies the write, e.g. an indexed write - compute shaders in D3D11 should allow this I guess
    A flush shouldn't happen as part of normal scheduling around latency.

    L1 read bandwidth (and presumably write bandwidth) is only 1 operand (32 bit I guess) per element per clock. MAD rate would effectively drop to 1/3rd (1 cycle for the 1st MAD, then 2 cycles to write 2 operands to memory + 3 cycles to read them back = 6 cycles) if two MADs were separated by a TEX :shock:
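
    A quick sanity check on that arithmetic, using only the figures stated above (the 1 operand per element per clock is the paper's L1 figure; the rest follows from it):

```python
# Two MADs separated by a TEX, with operands spilled to memory between
# them, under 1-operand-per-element-per-clock L1 bandwidth.
first_mad = 1  # cycle for the 1st MAD
spill     = 2  # cycles to write its 2 live operands to memory
refill    = 3  # cycles to read 3 operands back for the 2nd MAD
cycles_for_two_mads = first_mad + spill + refill
print(cycles_for_two_mads)      # 6 cycles for 2 MADs
print(2 / cycles_for_two_mads)  # i.e. 1/3 of the 1-MAD/clock peak
```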

    If someone designed a GPU to be this slow on its most fundamental latency-hiding operation, they'd be shot I reckon.

    ---

    I suspect Larrabee has the smallest register file of all these architectures, though. Because attribute interpolation will be "slow" on Larrabee I expect there will need to be relatively few fibres to hide TEX latency, as these instructions will naturally increase the shader's ALU:TEX.

    Predicate evaluation latency shouldn't be huge, either, unless it causes a jump to a page of code that isn't in instruction cache. Other GPUs have to contend with this problem too.

    Jawed
     
  8. Mintmaster

    Veteran

    Joined:
    Mar 31, 2002
    Messages:
    3,897
    Likes Received:
    87
    I understand, but that housekeeping is a lot of work for the x86 part. armchair_architect is echoing my beliefs above.

    Did you miss the part where each qquad's shader is a fibre? Larrabee doesn't have single-cycle instruction latency, so it will be switching qquads every clock. We see similar things from ATI (a different 64 pixels every 4 clocks) and Nvidia.

    At the very least we will see smaller cycles of 8 qquads switching every clock, with maybe new ones swapping in (due to texture fetches) at lower frequency, but there is no mention of this in the paper. Static SW scheduling can do this quite well when there's no DB and texture latency is consistent and predictable, but otherwise scheduling starts benefiting drastically from dedicated logic.

    BTW, what exactly are you envisioning when you say "repacking"?
     
  9. Barbarian

    Regular

    Joined:
    Jun 27, 2005
    Messages:
    289
    Likes Received:
    15
    Location:
    California, USA
    Lots of presentations from today's "Beyond Programmable Shaders" at Siggraph:
    http://s08.idav.ucdavis.edu/

    There are many interesting details about all the major architectures from NVidia, AMD/ATI and Intel.
     
  10. psurge

    Regular

    Joined:
    Feb 6, 2002
    Messages:
    955
    Likes Received:
    52
    Location:
    LA, California
    I agree it looks really nice, but based on the limited details I've read about Ct and the limited stuff available to non-customers on RapidMind, Ct seems extremely similar - both have support for deterministic parallelism, nested data parallelism, and are embedded in C++. Am I missing some reason why Ct is a big advance over what is currently available?
     
  11. Jawed

    Legend

    Joined:
    Oct 2, 2004
    Messages:
    11,716
    Likes Received:
    2,137
    Location:
    London
    2-10 fibres required to hide TEX latency :grin:

    Jawed
     
  12. glw

    glw
    Newcomer

    Joined:
    Aug 29, 2003
    Messages:
    64
    Likes Received:
    0
    Intel's biggest advantage is that if some clever developers come up with a brilliant idea that makes writing parallel programs much easier, those clever developers don't have to layer their brilliant idea on top of everyone else's 'cruft'.

    CUDA or CAL, the Windows-only DirectX, and whatever OpenCL turns into will all 'get in the way'.

    Exposing x86 will assist library and toolkit programmers, even if application developers avoid it.
     
  13. Jawed

    Legend

    Joined:
    Oct 2, 2004
    Messages:
    11,716
    Likes Received:
    2,137
    Location:
    London
    What else is the scalar pipeline doing? Also I suspect there's a lot less work than first imagined.

    Fibres are an entirely software-based construct.

    Let's say we've got a shader, which only uses 1 register:

    Code:
    texld r0, v0, s0
    mad r0, r0, r0, r0
    this translates into LRB native as:

    Code:
    intrp r0, v0 [macro]
    texld r0, r0, s0
    mad r0.x, r0, r0, r0
    mad r0.y, r0, r0, r0
    mad r0.z, r0, r0, r0
    mad r0.w, r0, r0, r0
    I'm too lazy to work out what the interpolation macro comes out as. Let's say it's 10 instructions per dimension, so 20 for a 2D lookup.

    Now let's say that this is compiled for 4 fibres and we're looking at just the VPU code:

    Code:
    intrp r0 [f0r0], v0 [f0v0]
    [texld ...]
    intrp r1 [f1r0], v1 [f1v0]
    [texld ...]
    mad r0.x [f0r0], r0, r0, r0
    mad r0.y [f0r0], r0, r0, r0
    mad r0.z [f0r0], r0, r0, r0
    mad r0.w [f0r0], r0, r0, r0
    intrp r2 [f2r0], v2 [f2v0]
    [texld ...]
    mad r1.x [f1r0], r1, r1, r1
    mad r1.y [f1r0], r1, r1, r1
    mad r1.z [f1r0], r1, r1, r1
    mad r1.w [f1r0], r1, r1, r1
    intrp r3 [f3r0], v3 [f3v0]
    [texld ...]
    mad r2.x [f2r0], r2, r2, r2
    mad r2.y [f2r0], r2, r2, r2
    mad r2.z [f2r0], r2, r2, r2
    mad r2.w [f2r0], r2, r2, r2
    mad r3.x [f3r0], r3, r3, r3
    mad r3.y [f3r0], r3, r3, r3
    mad r3.z [f3r0], r3, r3, r3
    mad r3.w [f3r0], r3, r3, r3
    
    Notice that this is statically compiled, circular fibre-switching. Also notice that my example shader has a disastrous ALU:TEX ratio which will stall on all current GPUs :razz:

    Obviously, we don't know the length of the VPU pipeline, nor what the register read after write latency is. And whether Intel has implemented an in-pipe memory to function like ATI's "previous register", obviating simple consecutive-instruction RAW hazards.

    Let's say you have the following predicates for 4 fibres, each of 16 elements:

    1101010101110010
    0100100101001011
    1001011100101010
    1000101110001101

    you'd re-pack the elements to get:

    1111111111111111
    1111111111111111
    0000000000000000
    0000000000000000

    (by pure luck I got 32 of each :lol: )

    Obviously, whether it's worth doing is a different question... If each clause is 1000 cycles then yes, you'd prolly want to halve execution time and the re-packing cost would be easily amortised.
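
    For what it's worth, here's a toy sketch of that re-packing (hedge: real hardware would presumably do this with vector gathers driven by the predicate masks; plain Python just makes the bookkeeping visible). Pulling the active elements to the front, while remembering where each one came from, yields exactly the packed masks above:

```python
def repack(predicates):
    # Flatten all (fibre, element) slots, pull the active ('1') ones to
    # the front (stable sort), and keep a mapping so the shuffle can be
    # scattered back once the branchy section completes.
    width = len(predicates[0])
    flat = [(bit, (f, e)) for f, row in enumerate(predicates)
                          for e, bit in enumerate(row)]
    flat.sort(key=lambda t: t[0], reverse=True)  # '1' slots first
    bits = ''.join(bit for bit, _ in flat)
    mapping = [src for _, src in flat]  # packed slot -> original slot
    rows = [bits[i:i + width] for i in range(0, len(bits), width)]
    return rows, mapping

preds = ['1101010101110010',
         '0100100101001011',
         '1001011100101010',
         '1000101110001101']
rows, mapping = repack(preds)
print(rows)  # two fully active fibres, two fully idle ones
```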

    Jawed
     
  14. Mintmaster

    Veteran

    Joined:
    Mar 31, 2002
    Messages:
    3,897
    Likes Received:
    87
    Yes it is, but it only works nicely when there isn't DB screwing around with each qquad's instruction sequence. Otherwise, if you want 16-pixel granularity, you need to find a new qquad to perform an instruction on every clock.

    That's impossible for software. As the scheduler is written in the paper, the branching granularity is no better than G70's. Moreover, while a GPU only needs to cover the average texture latency, a circular scheduler needs to cover the maximum latency.

    Okay, that's what I figured, but you are seriously underestimating the software power needed to do that. Not only are the combinatorics very ugly, but you need each fibre to be at the same branch location, which is either highly improbable or needs some seriously complicated scheduling to "realign" the IPs of different fibres. In terms of graphics, this repacking messes up a lot of things like sample masks, derivatives, texturing, etc. No longer is each fibre spatially confined to one 4x4 block.

    It'll only be useful in very specialized circumstances, and those won't look much like shaders at all.
     
  15. 3dilettante

    Legend Alpha

    Joined:
    Sep 15, 2003
    Messages:
    8,579
    Likes Received:
    4,799
    Location:
    Well within 3d
    A fully associative GPU cache or a set-associative cache?
    The P54 had a 2-way L1 Dcache.
    With 4 times the threads, each Larrabee core probably has a more associative cache.

    In simulations, bad things hardly ever make a difference in performance... ;)

    Perhaps the general rule for DB will be to grin and bear it.
    If pixels start migrating in sequence, overly aggressive packing will wreck locality, and that will start thrashing cache lines.
    If everything is done in sequence the compiler has a good chance of arranging it so cache lines can be cycled between the L1 and L2 in a mostly transparent fashion. Each pixel's memory can also be allocated in such a way to get around stride conflicts on a set-associative cache.

    Perhaps pixel 1's data won't have a cache conflict with pixel 32 because the compiler knows pixel 1 will be done with a given line because it will be unneeded by the time pixel 32's evicting line is loaded if the pixel is hit in-order. If pixel 1 is repacked adjacent to 32, the two pixels are going to bash each other's operands.

    Well, this is x86, though the VPU register count is up in the air.
    Fibers switch cooperatively within a single hardware thread, which means the register pool is the 8, 16, or possibly 32 registers each hardware thread gets.
    Something like 10 fibers on a thread with just 1 register per pixel would blow through an 8-register file, and would be a bit cramped in 16. 10 strands would allow one register operand per pixel as long as they only do 2-source operations.
    Spilling registers in the middle of a fiber would not be a good thing.
    Intel leans on the L1 a lot, and it is obviously a critical resource.

    edit: One register per qquad
     
    #455 3dilettante, Aug 15, 2008
    Last edited by a moderator: Aug 15, 2008
  16. Jawed

    Legend

    Joined:
    Oct 2, 2004
    Messages:
    11,716
    Likes Received:
    2,137
    Location:
    London
    When branching Larrabee pre-emptively switches hardware thread. So a new qquad (or other VPU work, e.g. the Setup hardware thread) switches in.

    So if a hardware thread contains 4 fibres a repeat until with some nasty dependent texturing inside could look like this:

    Code:
    LOOP:
    if f0
    ALU f0 // texture coordinate math
    TEX f0 // fetch
    if f1
    ALU f1
    TEX f1
    if f2
    ALU f2
    TEX f2
    if f3
    ALU f3
    TEX f3
    if f0
    ALU f0 // process result of fetch
    if f1
    ALU f1
    if f2
    ALU f2
    if f3
    ALU f3
    test to set predicate f0
    test to set predicate f1
    test to set predicate f2
    test to set predicate f3
    if not (f0 or f1 or f2 or f3) jmp EXIT
    jmp LOOP
    EXIT: // program continues after loop...
    ALU f0
    ALU f1
    ALU f2
    ALU f3
    
    Undoubtedly, as some fibres finish their loops, this hardware thread will be able to hide less and less of the TEX latency caused by the remaining fibres. Luckily there's 3 other hardware threads running on the core too. If one fibre runs for 1000 iterations more than any other then you've definitely got a problem :razz:

    If there was no TEX within that loop then the fibres would be completely independent and Larrabee would be seeing the full benefit of having 16-wide hardware threads. Other forms of latency (e.g. evaluating branch destination) are entirely hidden by hardware thread switching.

    So the moral of the story is don't put dependent TEX inside a variable duration loop. ATI will definitely run this kind of code far far better - but it has 4x the number of elements per hardware thread, so it is usually far behind on the more common cases of incoherent DB with no TEX.

    I remembered the formal name for this, Conditional Routing - been bugging me for over a day now:

    http://cva.stanford.edu/publications/2004/kapasi-thesis/

    By flushing state to memory and then using gather, driven by the predicates of the fibres, you can assemble state for all the fibres that share coherent branching. This is only required for the section of code inside the DB.

    Also, because Larrabee is circular-scheduling, the IPs are aligned, as I showed in the code snippet above.

    But you're not allowed to do that stuff inside DB - you have to perform the fetch before entering the DB.

    And anyway, after the DB has completed you can scatter state back to the original mapping of elements to fibres. It's reversible.
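
    As a toy illustration of that reversibility (plain Python standing in for predicate-driven gather/scatter; the function names are made up):

```python
# Gather the elements whose predicate is set, run the branch body on the
# packed data, then scatter the results back to their original slots.
def gather(state, pred):
    idx = [i for i, p in enumerate(pred) if p]
    return [state[i] for i in idx], idx

def scatter(state, packed, idx):
    out = list(state)
    for v, i in zip(packed, idx):
        out[i] = v
    return out

state = [10, 20, 30, 40, 50, 60]
pred  = [1, 0, 1, 1, 0, 0]
packed, idx = gather(state, pred)
packed = [v * 2 for v in packed]    # the DB clause, on packed elements only
restored = scatter(state, packed, idx)
print(restored)  # [20, 20, 60, 80, 50, 60] - untaken elements untouched
```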

    I'm not arguing it's computationally trivial - merely that incoherence can be so expensive that it can be worth doing.

    Jawed
     
  17. 3dilettante

    Legend Alpha

    Joined:
    Sep 15, 2003
    Messages:
    8,579
    Likes Received:
    4,799
    Location:
    Well within 3d
    Where is it stated that Larrabee preemptively switches on branches?


    But-but-it's x86, so it can do anything! ;)

    I wonder if it's possible through some pathological code scheduling and inopportune scatter writes to inadvertently evict a qquad's state back to main memory.
     
  18. Jawed

    Legend

    Joined:
    Oct 2, 2004
    Messages:
    11,716
    Likes Received:
    2,137
    Location:
    London
    I was thinking of the L2 rather than the L1. Larrabee's special cache instructions appear to be targeted at controlling the L2.

    So, erm, I've got no idea whether a comparison with P54 is useful. Did it have L2?

    And now that we've found out they're planning a maximum of about 10 fibres per hardware thread, i.e. >30 fibres per core (assuming 3 hardware threads are doing pixel shading), we can get a sense of the register file size.

    At least 2 registers per element: 32 fibres * 16 elements * 2 registers * vec4 * 4 bytes ~ 16KB.
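
    Spelling out that back-of-envelope sum (only the figures in the line above):

```python
fibres, elements, regs = 32, 16, 2       # >30 fibres, 16-wide, 2 regs/element
components, bytes_each = 4, 4            # vec4 of fp32
total_bytes = fibres * elements * regs * components * bytes_each
print(total_bytes, total_bytes // 1024)  # 16384 bytes, i.e. 16 KB
```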

    Jawed
     
  19. Jawed

    Legend

    Joined:
    Oct 2, 2004
    Messages:
    11,716
    Likes Received:
    2,137
    Location:
    London
    Section 3.2:
    "Switching threads covers cases where the compiler is unable to schedule code without stalls". Because there's no branch prediction and everything is done with predication, a stall arises when jumping - but switching hardware threads obviates the stall. This is statically compilable - just like on ATI.

    Actually, hmm, "pre-emptively" isn't the correct term. Hmm, lol, I was thinking of the compiler pre-empting the branch stall - the core just acts dumb here, I don't think it's switching threads of its own accord. Sorry about that.

    Which is why I decided to do the dependent texturing example - because it does look like a disaster on Larrabee. I can't think of a decent solution to this if the program is allowed to have an arbitrary loop count.

    Clearly this is likely to come in general computation, e.g. doing sparse gathers inside a loop. Ct seems to revel in doing this, so I guess I'm missing something.

    Hmm, I think in the general case this is the default behaviour of Larrabee.

    For example if you want to send register data from VPU to the scalar unit, you have to write it to memory. Obviously, that just means it goes out to cache most of the time.

    Jawed
     
  20. 3dilettante

    Legend Alpha

    Joined:
    Sep 15, 2003
    Messages:
    8,579
    Likes Received:
    4,799
    Location:
    Well within 3d
    The L2 was off-die.

    That would indicate 64 64-byte vector registers per hardware thread. That's downright RISC-like.


    I was making a distinction between cache and RAM. I think Larrabee would do best to not force an upcoming qquad's state to memory hundreds of cycles away.
    Perhaps there is a way the core or compiler can ensure the furthest it can go is the L2.

    That's reminiscent of Xenon, where moving data between pipes requires a similar trip. The latency from that is pretty significant in the Xbox implementation.
    Perhaps that is something we can expect to improve with LarrabeeII.
     