Larrabee at Siggraph

Discussion in 'Architecture and Products' started by nAo, Jun 2, 2008.

  1. Jawed

    Legend

    Joined:
    Oct 2, 2004
    Messages:
    10,873
    Likes Received:
    767
    Location:
    London
    This comes back to the programmer. e.g. they've designed a way to construct a D3D pipeline that sizes a tile to fit within cache alongside other stuff that's also going to use cache. So the "driver" must assess the pixel shader for its register payload versus the amount of texture latency it needs to hide, and trade those off against tile size. ATI and NVidia don't have a tile size to worry about, but they do have to worry about cache thrashing caused by the raggedness of the progress of the batches - i.e. what's the greatest difference in program counter amongst the extant batches and what effect that has on cache thrashing.
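
    Purely as a toy illustration of that trade-off (nothing from the paper; every constant below is an assumption), the driver's tile-sizing decision might look like this:

```c
/* Toy sketch, not Larrabee's actual driver logic: pick the largest
   power-of-two square tile whose colour + depth footprint fits in
   the L2 budget left over after reserving room for texture results
   and the shader's register spill area. All sizes are assumptions. */
static int pick_tile_side(int l2_bytes, int reserved_bytes,
                          int bytes_per_pixel)
{
    int budget = l2_bytes - reserved_bytes;
    int side = 128;                      /* largest tile considered */
    while (side > 0 && side * side * bytes_per_pixel > budget)
        side /= 2;                       /* halve until it fits     */
    return side;
}
```

    With the paper's 256 KiB L2 slice per core, a hypothetical 64 KiB reserved for texture results and spills, and 8 bytes/pixel (32-bit colour + 32-bit depth), this picks a 128x128 tile (128 KiB). A shader needing more reserved space forces a smaller tile.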

    So in Larrabee the programmer is supposed to configure L2 cache lines to suit the types of fibres running. Once a core gets under way with a phase of rendering I get the impression that the cache lines are pretty much static - e.g. in pixel shading a block of lines for the tile data, another set of lines for texture results (parameters too) and some lines for general scheduling.

    One thing that's occurred to me is that Larrabee's circular fibre scheduling could lead to under-utilisation of the texture units - this is the average versus worst-case latency hiding that Mintmaster was alluding to earlier, I think. Not sure, need to think about it more.

    Depends on the interval between starting the move and the other unit consuming the data - i.e. whether this mostly stays within L1 or often ends up going to L2.

    ---

    So, what happens on interrupts? I've got no idea what happens to x86 SSE registers in this situation, so not sure what to expect in Larrabee and the effect on VPU. Is it likely that Larrabee will turn off interrupts on most cores, e.g. leaving one core as able to accept them?

    Jawed
     
  2. randomhack

    Newcomer

    Joined:
    Apr 4, 2008
    Messages:
    41
    Likes Received:
    0
    How's CUDA simpler than OpenMP?

     
  3. randomhack

    Newcomer

    Joined:
    Apr 4, 2008
    Messages:
    41
    Likes Received:
    0
    edit : removed post
     
    #463 randomhack, Aug 16, 2008
    Last edited by a moderator: Aug 16, 2008
  4. armchair_architect

    Newcomer

    Joined:
    Nov 28, 2006
    Messages:
    128
    Likes Received:
    8
    I think you're confusing HW threads and SW threads.

    EDIT: having finished reading the thread, I guess you're not. Not sure what you meant in this post though.

    They're reserving the term "thread" for HW threads, which do indeed switch every cycle just like on a GPU. Each HW thread has real registers assigned to it (statically). The round-robin HW threads will hide instruction and L1 latency; I imagine they'll also need some non-dependent instructions (like fibers with >16 strands, see below) to fully hide L2 latency, unless their L2 latency is amazingly low.

    Fibers are "SW threads", and switching between them is like a thread switch on a CPU: using normal instructions you write out any live registers to memory (cache, really) including any special registers like condition codes and the vector predicate register, then write out the address this fiber will resume execution at, then read in and jump to the incoming fiber's resume address, which points to code that will read in that fiber's live registers and keep going. It's going to be 10+ instructions for a save+restore cycle assuming only a handful of live registers each in the outgoing/incoming fibers.
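
    The save/restore dance described above can be sketched in plain C. This is a toy model with invented names, plain ints standing in for vector registers, and a struct standing in for the cache-resident spill area:

```c
/* Minimal sketch of a cooperative fibre switch: spill the outgoing
   fibre's live state with ordinary stores, record where it resumes,
   then reload the incoming fibre's state. On real hardware this is
   10+ ordinary load/store instructions. */
struct fibre {
    int regs[4];     /* spill area for live "registers"          */
    int predicate;   /* stand-in for the vector predicate mask   */
    int resume_pc;   /* where this fibre resumes execution       */
};

static void fibre_switch(struct fibre *out, struct fibre *in,
                         int live[4], int *pred, int next_pc)
{
    for (int i = 0; i < 4; i++) out->regs[i] = live[i];  /* spill  */
    out->predicate = *pred;
    out->resume_pc = next_pc;
    for (int i = 0; i < 4; i++) live[i] = in->regs[i];   /* reload */
    *pred = in->predicate;
    /* the caller then jumps to in->resume_pc */
}
```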

    It sounds like they'll also be doing a sort of hybrid, where if the fibers don't need all of the vector registers they'll have multiple sets of strands active and round-robin between them within a fiber. So a fiber can be more than just 16 strands. Very similar to how NVIDIA and AMD both run each instruction through the ALUs for 4 ALU-clocks.
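
    A toy cycle-count model makes the benefit of round-robining strand groups concrete. The latency and instruction counts below are made-up numbers, and real issue rules are certainly more subtle:

```c
/* Toy model: each instruction's result is ready `latency` cycles
   after issue, and the core round-robins strictly between strand
   groups, one per cycle. A group's next dependent instruction can
   only issue every `groups` cycles, so stalls vanish once the
   group count covers the latency. Assumes groups <= 8. */
static int cycles_for(int groups, int instrs_per_group, int latency)
{
    int ready[8] = {0};   /* earliest cycle each group may issue  */
    int left[8];
    for (int g = 0; g < groups; g++) left[g] = instrs_per_group;
    int remaining = groups * instrs_per_group;
    int cycle = 0, g = 0;
    while (remaining > 0) {
        if (left[g] > 0 && ready[g] <= cycle) {
            left[g]--;                     /* issue one instruction */
            remaining--;
            ready[g] = cycle + latency;    /* next dependent waits  */
        }
        cycle++;
        g = (g + 1) % groups;              /* strict round-robin    */
    }
    return cycle;
}
```

    With a 4-cycle latency, one group of 8 dependent instructions takes 29 cycles, while four groups of 2 take 8: the extra strand sets soak up the dependency stalls.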

    Check out Tom Forsyth's course presentation for most of this.
     
    #464 armchair_architect, Aug 16, 2008
    Last edited by a moderator: Aug 16, 2008
  5. armchair_architect

    Newcomer

    Joined:
    Nov 28, 2006
    Messages:
    128
    Likes Received:
    8
    They're more about the cache line size, which happens to match the SIMD width, for reasons that should be obvious (which is why SSE/AVX/Larrabee will all have the same issue). Yes, it does break the abstraction a bit. But like I said, in my experience the 80/20 rule applies: getting the last 20% of performance by optimizing this takes 80% of the time. Given the vast speedup vs. a CPU available even without that 20%, you can certainly ignore it for starters. That's of course not true for all problems, YMMV, etc.
     
  6. armchair_architect

    Newcomer

    Joined:
    Nov 28, 2006
    Messages:
    128
    Likes Received:
    8
    I've looked at it. Ct to me looks like a library version of old vector architectures married to the fancy template meta-programming linear algebra libraries that started popping up a few years ago. Works great for some problems, but doesn't seem as broadly applicable as CUDA. Maybe I just lack imagination though.

    DX11 compute shaders and OpenCL both appear (from what we know) to essentially be the CUDA programming model. Not CUDA-the-language exactly, but very similar decomposition of parallelism and combination thread/memory hierarchy. To be more clear, what I like is the CUDA programming model (which includes DX11 compute and OpenCL), not just NVIDIA's current implementation of that programming model.
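
    For anyone who hasn't used it, the decomposition of parallelism being described can be sketched as plain C loops (a vector add split CUDA-style into blocks of threads; on a GPU both loops would run in parallel rather than serially):

```c
/* CUDA-style two-level decomposition sketched serially: a grid of
   independent blocks, each a group of threads, with the familiar
   block*dim+thread global indexing. */
static void vec_add_blocked(const float *a, const float *b, float *c,
                            int n, int block_dim)
{
    int nblocks = (n + block_dim - 1) / block_dim;  /* round up */
    for (int blk = 0; blk < nblocks; blk++) {       /* the "grid"  */
        for (int thr = 0; thr < block_dim; thr++) { /* one "block" */
            int i = blk * block_dim + thr;          /* global index */
            if (i < n)                              /* guard tail   */
                c[i] = a[i] + b[i];
        }
    }
}
```

    The point of the model is that blocks share nothing with each other, so they can be scheduled across however many cores the hardware has.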
     
  7. Jawed

    Legend

    Joined:
    Oct 2, 2004
    Messages:
    10,873
    Likes Received:
    767
    Location:
    London
    Sod it, I deferred reading most of those decks until this weekend. ARGH, that's what I should have read first.

    Agreed with all that. The paper refers to thread switching as a way to hide L2->L1 latency and to obviate stalls caused by serially dependent instructions.

    Interestingly, it's quite possible that only 1 thread is running, so the core is no longer switching threads each clock - it is merely evaluating which threads it can issue from each clock. So I don't think threads are scheduled in any kind of strict round-robin fashion. That's merely a possibility allowed under SMT.

    My mind boggles at the sheer expense of wasted cycles switching contexts due to texturing :cry:

    Yeah, so they're trading off the granularity of the register file against the total number of strands in flight: a few fibres with lots of strands versus lots of fibres with few strands. The former case wastes fewer cycles on context switching, so it will hurt less when there's a lot of texturing. If the shader's free of dynamic branching then lots of strands obviously won't impact performance.

    OK, well that undoes all my thinking about fibre scheduling. This also means that the register file is prolly gonna be tiny as 3dilletante was originally asserting - much more like souped-up SSE than cut-down GPU SIMD. Oh well.

    Jawed
     
  8. nAo

    nAo Nutella Nutellae
    Veteran

    Joined:
    Feb 6, 2002
    Messages:
    4,325
    Likes Received:
    93
    Location:
    San Francisco
    Weird, to me it looks like some sort of functional programming language....
     
  9. Barbarian

    Regular

    Joined:
    Jun 27, 2005
    Messages:
    289
    Likes Received:
    15
    Location:
    California, USA
    The size of the vector register file is rumored to be 32 entries. Not huge, but not small either.
    In the end it might not actually matter: a low-latency L1 cache plus the mem-op capability of the instruction set should allow an easy L1-as-huge-register-file model.
     
  10. MfA

    MfA
    Legend

    Joined:
    Feb 6, 2002
    Messages:
    6,806
    Likes Received:
    473
    I wonder why only the primary pipeline can do vector loads... I'd expect the secondary pipeline to at least be able to do unformatted loads for reading a fibre's flushed registers; it seems silly to waste the primary pipeline on that.
     
  11. nAo

    nAo Nutella Nutellae
    Veteran

    Joined:
    Feb 6, 2002
    Messages:
    4,325
    Likes Received:
    93
    Location:
    San Francisco
    How do you know that?
     
  12. MfA

    MfA
    Legend

    Joined:
    Feb 6, 2002
    Messages:
    6,806
    Likes Received:
    473
    Well, they said so :)

     
  13. nAo

    nAo Nutella Nutellae
    Veteran

    Joined:
    Feb 6, 2002
    Messages:
    4,325
    Likes Received:
    93
    Location:
    San Francisco
    Ok, fair enough, I didn't remember that part of the paper.
     
  14. Jawed

    Legend

    Joined:
    Oct 2, 2004
    Messages:
    10,873
    Likes Received:
    767
    Location:
    London
    Anything to do with the ability of the VPU to read one operand directly from L1?

    Jawed
     
  15. PeterT

    Regular

    Joined:
    May 14, 2002
    Messages:
    702
    Likes Received:
    14
    Location:
    Austria
    In the terminology you use here, are the manual shared-memory management and coalescing requirements of CUDA part of the programming model, or just of NV's current implementation? As someone who's used both traditional GPGPU and CUDA on HPC problems, that's the part of CUDA that I can't see as part of any future more-or-less "mainstream" parallel programming language/model. I also think those should be mentioned before talking about how well CUDA scales, because a CUDA program that's actually optimized won't come close to porting to another architecture optimally (and when it comes to coalescing and shared memory we're not talking about single-percent optimizations, but potentially orders-of-magnitude differences).
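
    A toy counter shows where those large factors come from. Assuming 64-byte memory segments (an invented round number, loosely in the spirit of current NV hardware), 32 threads reading consecutive 4-byte words touch 2 segments, while the same reads at a stride of 16 words touch 32:

```c
/* Count distinct 64-byte segments touched when `nthreads` threads
   each read one 4-byte word at the given word stride. Each distinct
   segment is one memory transaction in this toy model. */
static int segments_touched(int nthreads, int stride_words)
{
    int seen[64];                                /* crude set, <= 64 ids */
    int count = 0;
    for (int t = 0; t < nthreads; t++) {
        int seg = (t * stride_words * 4) / 64;   /* 64-byte segment id  */
        int found = 0;
        for (int i = 0; i < count; i++)
            if (seen[i] == seg) { found = 1; break; }
        if (!found && count < 64)
            seen[count++] = seg;                 /* new segment fetched */
    }
    return count;
}
```

    A 16x difference in transactions for the same amount of useful data, before any latency effects, is how naive layouts lose an order of magnitude.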

    It's also interesting (and I haven't seen anyone explicitly mention it in this thread) to see the different programming trade-offs between Larrabee and NV GPUs. As I currently understand it, on the former you get automatic/hardware cache management but no hardware thread scheduling; on current NV hardware it's just the other way round. I'm not yet sure which is preferable; ideally you'd have (the option of using) both.
     
  16. Kaotik

    Kaotik Drunk Member
    Legend

    Joined:
    Apr 16, 2003
    Messages:
    8,183
    Likes Received:
    1,840
    Location:
    Finland
    Slightly related: the Larrabee (A Many-Core Intel Architecture for Visual Computing) presentation was apparently too popular at IDF Fall 2008, and a big portion of the press was left outside due to lack of space. Because of this, the presentation will get another run on Thursday.
     
  17. nAo

    nAo Nutella Nutellae
    Veteran

    Joined:
    Feb 6, 2002
    Messages:
    4,325
    Likes Received:
    93
    Location:
    San Francisco
    Hopefully they'll release some more details...
     
  18. kyetech

    Regular

    Joined:
    Sep 10, 2004
    Messages:
    532
    Likes Received:
    0
    Tech demo vid for us laymen to enjoy would be good!
     
  19. nAo

    nAo Nutella Nutellae
    Veteran

    Joined:
    Feb 6, 2002
    Messages:
    4,325
    Likes Received:
    93
    Location:
    San Francisco
    A tech video of emulated hardware? Not exactly exciting :)
     
  20. kyetech

    Regular

    Joined:
    Sep 10, 2004
    Messages:
    532
    Likes Received:
    0
    Just because samples aren't out till November doesn't mean they haven't got anything now?

    Besides, I like seeing pretty graphics, real-time or otherwise.
     


  • About Us

    Beyond3D has been around for over a decade and prides itself on being the best place on the web for in-depth, technically-driven discussion and analysis of 3D graphics hardware. If you love pixels and transistors, you've come to the right place!

    Beyond3D is proudly published by GPU Tools Ltd.