TimothyFarrar
Some points are valid and good to remember; it's not as if you're going to get good performance on Larrabee without good vectorization of the code...
And since it looks like Larrabee will be "serial-scalar done right", it'll prolly work quite well :smile:

Shader-wise I expect them to do a very good job [...]
I was just waiting for a post from Aaron about this matter.

"NVIDIA's approach to parallel computing has already proven to scale from 8 to 240 GPU cores."
How seriously can you take Nvidia's comments when they bend terminology like that? Change it to 30 cores, or 10 three-processor clusters, and it doesn't sound quite so impressive.
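For reference, the arithmetic behind that objection, using the standard GT200 breakdown (10 clusters of 3 multiprocessors, each with 8 scalar ALUs):

```c
#include <stdio.h>

/* GT200 organisation: the "240 cores" are scalar ALUs (SPs),
   grouped 8 to a multiprocessor (SM), 3 SMs to a cluster (TPC). */
int main(void)
{
    const int sps_per_sm  = 8;
    const int sms_per_tpc = 3;
    const int tpcs        = 10;

    int sms = tpcs * sms_per_tpc;   /* 30  */
    int sps = sms * sps_per_sm;     /* 240 */

    printf("%d clusters = %d multiprocessors = %d 'cores'\n",
           tpcs, sms, sps);
    return 0;
}
```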
I certainly wouldn't be surprised if ATI gets near twice the FLOPs per mm2 in the end, even with a process disadvantage.
I think that's quite probable, but it's delivered performance, not peak, that matters.
If Larrabee does more than play games, it can succeed even if it isn't the fastest or most efficient 'GPU'. A 4 TFLOP GPU that you use 10% of the time, or a 2 TFLOP CPU that you use 50% of the time: which would you buy? I know most of you guys would say both.
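A quick back-of-the-envelope on those (purely hypothetical) utilisation figures shows why the slower-but-busier chip can come out ahead:

```c
#include <stdio.h>

/* Average delivered throughput = peak throughput x fraction of time
   the part is actually doing useful work.  Numbers taken straight
   from the post above; they're illustrative, not measurements. */
int main(void)
{
    double gpu = 4.0 * 0.10;   /* 4 TFLOPS peak, busy 10% of the time */
    double cpu = 2.0 * 0.50;   /* 2 TFLOPS peak, busy 50% of the time */

    printf("GPU delivers %.1f TFLOPS on average\n", gpu);   /* 0.4 */
    printf("CPU delivers %.1f TFLOPS on average\n", cpu);   /* 1.0 */
    return 0;
}
```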
A killer video or killer photo* app could sell a LOT of chips, and there are applications that aren't practical today that could become mass-market.
*Have a look at recent SIGGRAPH papers that deal with imaging rather than rendering; there are some amazing things that could be done with lots of FLOPS.
I was just waiting for a post from Aaron about this matter.
GPUs are still too rooted, at a base architectural level, in the idea that what they compute doesn't need to be taken seriously.
"With 16 elements per hardware thread, Larrabee will be paying the lowest cost of any GPU for DB incoherency. Additionally, with an x86 thread per core that's able (amongst other things) to "re-pack" elements to minimise incoherency, DB should end up reasonably useful - though I will admit non-D3D programmers will prolly find themselves having to implement re-packing."

There's a lot of unknowns about Larrabee and DB. That paper just uses simple cycling between the batches, so there's no real scheduling. Who knows how well the x86 thread can handle this task.
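For the curious, here's a minimal sketch of what that "re-packing" could look like; plain C rather than anything Larrabee-specific, and just one possible approach (partitioning element indices by branch direction so that each 16-wide batch is coherent):

```c
#include <stddef.h>

#define VEC_WIDTH 16  /* assumed Larrabee-style vector width */

/* Partition element indices by the branch they will take, so that each
 * group of VEC_WIDTH elements is coherent.  Returns the number of
 * elements that took the branch; they occupy packed[0..n-1], and the
 * rest occupy packed[n..count-1] (in reverse encounter order). */
size_t repack(const int *take_branch, size_t count, size_t *packed)
{
    size_t front = 0, back = count;
    for (size_t i = 0; i < count; ++i) {
        if (take_branch[i])
            packed[front++] = i;   /* branch-taken elements at the front */
        else
            packed[--back] = i;    /* branch-not-taken at the back       */
    }
    return front;
}

/* The caller then walks packed[] in chunks of VEC_WIDTH, running the
 * "taken" code on the first chunks and the "not taken" code on the
 * rest, instead of predicating both sides of the branch in every batch. */
```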
"The SW is responsible for deciding when to switch and for actually performing the switch (consuming issue slots, etc.)."

This operation is likely to be so rare that I wouldn't worry about it consuming slots.
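Purely as an illustration of what "the SW performing the switch" implies (this is not Larrabee code, just the general shape of a software-scheduled loop with made-up names): both the decision of when to switch and the switch itself are ordinary instructions, so they occupy issue slots that could otherwise do shader work.

```c
#include <stdio.h>

#define NUM_BATCHES 4

/* Illustrative sketch only.  The point: choosing the next batch and
 * swapping its state in and out is plain code, executed on the same
 * scalar unit that could otherwise be doing useful work. */
struct batch {
    int pc;        /* where this batch resumes                  */
    int waiting;   /* e.g. blocked on a texture or memory fetch */
    /* ...vector registers, predicate masks, etc. ...           */
};

static struct batch batches[NUM_BATCHES];

/* Stub standing in for "run this batch until it blocks or finishes". */
static void run_batch(struct batch *b)
{
    printf("running batch at pc=%d\n", b->pc);
    b->waiting = 1;   /* pretend it hit a long-latency fetch */
}

int main(void)
{
    /* Simple round-robin: every iteration of this loop is scheduler
     * overhead paid for in instruction issue slots. */
    for (int pass = 0; pass < 2; ++pass)
        for (int i = 0; i < NUM_BATCHES; ++i)
            if (!batches[i].waiting)     /* decide when to switch */
                run_batch(&batches[i]);  /* perform the switch    */
    return 0;
}
```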
"Larrabee (and SSE/AVX) on the other hand expose the SIMD-ness and you have to manage it yourself. The width is baked into your code: they can change the number of execution units (at which point they're relying on ILP with the associated costs) but code doesn't automatically scale. We'll see this with AVX: all the existing SSE code won't magically take advantage of the wider vectors. If you ignore SIMD, you run on the scalar x86 units only, and get none of the benefit of all that throughput. This is why I think Intel's noise about Larrabee being x86 is pointless -- running pure x86 code on it is a waste, and SSE code won't even run (AFAIK) and would waste 3/4 of the throughput even if it did."

Isn't it a bit far-fetched to say that the programming model you describe here will be the only one available? AFAIK Intel is part of the group working on OpenCL, and they also have Ct, which I hope will see the light of day at some point in the future.
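A small example of what "the width is baked into your code" means in practice, using 4-wide SSE intrinsics; recompiling this for a wider machine changes nothing, because the 4 is written into both the loop stride and the instructions chosen:

```c
#include <xmmintrin.h>   /* SSE: 128-bit vectors, i.e. 4 floats */

/* c[i] = a[i] + b[i].  The "4" is the SIMD width, hard-coded into the
 * stride and the intrinsics; an AVX or Larrabee part won't magically
 * run this 8 or 16 wide. */
void add_vec4(const float *a, const float *b, float *c, int n)
{
    int i;
    for (i = 0; i + 4 <= n; i += 4) {
        __m128 va = _mm_loadu_ps(a + i);
        __m128 vb = _mm_loadu_ps(b + i);
        _mm_storeu_ps(c + i, _mm_add_ps(va, vb));
    }
    for (; i < n; ++i)    /* scalar tail: runs on the x86 scalar unit */
        c[i] = a[i] + b[i];
}
```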
"EDIT: Oops, forgot one point I wanted to make. For the reasons above, I'd want to use the CUDA programming model even when programming on Larrabee, and have a compiler+runtime map it to Larrabee vector instructions. Which shouldn't actually be too hard; hopefully someone implements this so I don't have to."

Exactly my thought. I'm no compiler expert, but having CUDA or something CUDA-like sounds quite feasible.
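A rough sketch of that idea in plain C (hypothetical names, no actual compiler behind it): the programmer writes a scalar per-element kernel, CUDA-style, and the "compiler+runtime" layer is what picks the machine width and strides the kernel across the data.

```c
#include <stddef.h>

/* CUDA-style: the programmer writes scalar, per-element code and never
 * mentions a vector width. */
static inline void saxpy_kernel(size_t i, float a,
                                const float *x, float *y)
{
    y[i] = a * x[i] + y[i];
}

#define MACHINE_WIDTH 16   /* assumed Larrabee-style vector width */

/* Hypothetical "runtime": the part a CUDA-like toolchain would
 * generate, mapping the scalar kernel onto the machine width.
 * Written as a plain loop here; a real backend would emit 16-wide
 * vector instructions for the inner loop instead. */
void launch_saxpy(size_t n, float a, const float *x, float *y)
{
    for (size_t base = 0; base < n; base += MACHINE_WIDTH) {
        size_t end = base + MACHINE_WIDTH < n ? base + MACHINE_WIDTH : n;
        for (size_t i = base; i < end; ++i)
            saxpy_kernel(i, a, x, y);
    }
}
```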
"There's a lot of unknowns about Larrabee and DB. That paper just uses simple cycling between the batches, so there's no real scheduling. Who knows how well the x86 thread can handle this task."

I think it's reasonable to assume the x86 part of each core (split across 4 hardware threads) is close to being fully dedicated to "managing" the execution of the 3D pipeline upon the VPU and all associated housekeeping.
"When running with 16 elements per hardware thread, remember that Larrabee switches batches every clock. Achieving that kind of scheduling throughput with a dynamic algorithm (i.e. not a predefined sequence) may not be possible."

I'm not sure how you're defining a batch here, because there is no "per-clock" switching.
"CUDA hides the SIMD-ness of the hardware [...] Larrabee (and SSE/AVX) on the other hand expose the SIMD-ness and you have to manage it yourself."

But doesn't this distinction become moot as soon as you start programming the memory system for maximum performance - now you're trying to maximise cache hits, maximise scatter/gather bandwidth or traverse sparse data sets with maximum coherence. All those tasks become intimately tied into the SIMD width and corresponding path sizes/latencies within the memory hierarchy.
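A toy way to see how the memory side gets tied to the SIMD width (assuming 64-byte cache lines and a 16-wide gather): count the distinct cache lines one vector's worth of indices touches. Coherent indices hit one line; fully scattered ones can hit sixteen, for exactly the same amount of arithmetic.

```c
#include <stddef.h>
#include <stdint.h>

#define SIMD_WIDTH 16   /* assumed vector width     */
#define LINE_BYTES 64   /* assumed cache-line size  */

/* Count the distinct cache lines touched when gathering SIMD_WIDTH
 * floats through idx[].  Sixteen contiguous indices land on one line;
 * sixteen scattered indices can land on sixteen, i.e. 16x the memory
 * traffic for the same vector of work. */
int lines_touched(const float *base, const size_t idx[SIMD_WIDTH])
{
    uintptr_t lines[SIMD_WIDTH];
    int n = 0;
    for (int lane = 0; lane < SIMD_WIDTH; ++lane) {
        uintptr_t line = (uintptr_t)(base + idx[lane]) / LINE_BYTES;
        int seen = 0;
        for (int j = 0; j < n; ++j)
            if (lines[j] == line) { seen = 1; break; }
        if (!seen)
            lines[n++] = line;
    }
    return n;
}
```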