Knights Landing Speculation

Yeah, ISPC is clearly better, and it's better than the GPU computing languages too. If the whole persistent-threads line of research has shown us anything, it's that virtualizing (vs. parameterizing) the SIMD width is far too harmful to the performance of non-trivial kernels.

Unfortunately HPC and other parties have too much sway as far as the standards go, and they are completely in the land of non-coding physicists who just want to trust compiler magic to get them a 2x even on 8+-wide SIMD... and again, I speak from experience here as someone who has rewritten a lot of scientist code :)

Very interesting, it does look like CUDA on CPU.

How much performance gain do you see in your application compared to pthreads/OpenMP with standard Intel compiler optimizations and auto-vectorization on? Thanks.
 
> How much performance gain do you see in your application compared to pthreads/OpenMP with standard Intel compiler optimizations and auto-vectorization on? Thanks.
That depends entirely on how well the auto-vectorization stuff works for a given problem. In my experience, not very well beyond simple cases :)

I'd highly encourage you to check out the examples and play with it a bit. I wrote the "deferred" one and it gets pretty much linear scaling - just like GPUs - while being able to express a more sophisticated "dynamic" tree algorithm via Cilk work stealing (on ICC) as well.
 
> That depends entirely on how well the auto-vectorization stuff works for a given problem. In my experience, not very well beyond simple cases :)

> I'd highly encourage you to check out the examples and play with it a bit. I wrote the "deferred" one and it gets pretty much linear scaling - just like GPUs - while being able to express a more sophisticated "dynamic" tree algorithm via Cilk work stealing (on ICC) as well.

Autovectorization is a disaster. A complete, hopeless, FUBAR disaster.

Compilers can't autovectorize Mandelbrot. MANDELBROT.
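
For reference, the inner loop that defeats them is tiny. This is roughly the kernel from the ispc repo's own mandelbrot example (lightly trimmed here, so treat it as a sketch): each lane's iteration count is data-dependent, which SPMD handles naturally and auto-vectorizers routinely choke on.

    // Roughly the mandelbrot kernel from the ispc examples: each
    // program instance tracks one pixel, and the trip count diverges
    // per lane -- natural in SPMD, hostile to auto-vectorizers.
    static inline int mandel(float c_re, float c_im, uniform int max_iter) {
        float z_re = c_re, z_im = c_im;
        int i;
        for (i = 0; i < max_iter; ++i) {
            if (z_re * z_re + z_im * z_im > 4.0f)
                break;                         // lanes exit independently
            float new_re = z_re * z_re - z_im * z_im;
            float new_im = 2.0f * z_re * z_im;
            z_re = c_re + new_re;
            z_im = c_im + new_im;
        }
        return i;
    }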
 
> Yeah, ISPC is clearly better, and it's better than the GPU computing languages too. If the whole persistent-threads line of research has shown us anything, it's that virtualizing (vs. parameterizing) the SIMD width is far too harmful to the performance of non-trivial kernels.

How so?
 
A pile of ways, but the chief one being basically that "you write the outer loop" rather than trying to virtualize it with overkill SIMD width/parallelism. In the many algorithms that pay a work-complexity hit for additional parallelism, this needlessly cripples performance on any given implementation. Not to mention that the additional hardware resources for keeping around more HW threads than the algorithm needs (to minimally hide latency) are not free.

ISPC instead provides the SIMD width as a kernel constant and lets you decide how to handle it. You can obviously just throw a foreach (0 ... n) around everything and do it GPU style, but you can subsequently do a foreach (0 ... m) without "wasting" (n-m) threads or similar, like you'd have to do in a compute shader (or flush all your local memory and launch a new kernel). Those loops need not be over constant-sized grids either, unlike on GPUs.
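
To make that concrete, here's a minimal sketch (a hypothetical kernel, not one of the shipped examples) of what I mean by just writing the loops you need:

    // Hypothetical sketch: two loops with different trip counts in one
    // kernel. The gang width is just the compile-time constant
    // programCount; no lanes are "reserved" for the larger loop.
    export void sketch(uniform float a[], uniform int n, uniform int m) {
        foreach (i = 0 ... n) {        // GPU-style: spread lanes over n items
            a[i] = a[i] * 2.0f;
        }
        foreach (j = 0 ... m) {        // then a narrower pass over m items,
            a[j] += 1.0f;              // with nothing idle and no relaunch
        }
    }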

Basically, with GPUs, when you start to write non-trivial code you discover that you really need to write warp-level code to be efficient (see persistent threads and such), and that's both ugly in those programming models and technically not possible to write portably outside of CUDA. ISPC's model ultimately makes a lot more sense here.
 
> A pile of ways, but the chief one being basically that "you write the outer loop" rather than trying to virtualize it with overkill SIMD width/parallelism. In the many algorithms that pay a work-complexity hit for additional parallelism, this needlessly cripples performance on any given implementation. Not to mention that the additional hardware resources for keeping around more HW threads than the algorithm needs (to minimally hide latency) are not free.

> ISPC instead provides the SIMD width as a kernel constant and lets you decide how to handle it. You can obviously just throw a foreach (0 ... n) around everything and do it GPU style, but you can subsequently do a foreach (0 ... m) without "wasting" (n-m) threads or similar, like you'd have to do in a compute shader (or flush all your local memory and launch a new kernel). Those loops need not be over constant-sized grids either, unlike on GPUs.

> Basically, with GPUs, when you start to write non-trivial code you discover that you really need to write warp-level code to be efficient (see persistent threads and such), and that's both ugly in those programming models and technically not possible to write portably outside of CUDA. ISPC's model ultimately makes a lot more sense here.

I built some of the examples you provided (I guess you are one of the authors - congrats on this impressive piece of work, btw). Some of the performance improvements are impressive, comparable to what I'd expect from hand-tuned optimizations.

However, in your example code as well as the docs, all the native data types are 32 bits wide. Is this a limitation of the current implementation of ISPC? If so, is there any plan to remove such limitations in a future release? Thanks.
 
I'm really just an ISPC user, but as I was one of the early folks playing with it they ended up using one of my kernels as an example, which is cool. Glad to hear you are impressed so far, but the kudos really goes to Matt Pharr as well as the current ISPC team and contributors.

> However, in your example code as well as the docs, all the native data types are 32 bits wide. Is this a limitation of the current implementation of ISPC? If so, is there any plan to remove such limitations in a future release? Thanks.
I'll admit I'm not super-familiar with the latest ISPC release, so you should probably check with them directly (or check out Embree for an even bigger production use case that targets MIC too). But yeah, it is designed to be somewhat MIC-style: upconvert/downconvert on memory load/store, but operating in registers using mostly 32-bit types.

You can definitely use smaller types (and certainly it's common to use them in memory, as you can see in the example), but there may be some pack/unpack overhead, and the vector width of your kernel won't change. See:
http://ispc.github.io/perfguide.html#avoid-computation-with-8-and-16-bit-integer-types
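
As a rough illustration of that load/store conversion point (a hypothetical kernel, assuming the behavior described in that guide - the gain/clamp details are made up):

    // Hypothetical sketch: 8-bit data in memory, 32-bit math in
    // registers. Loads widen each lane's byte to a 32-bit int; the
    // gang width stays the same as for a pure 32-bit kernel.
    export void brighten(uniform unsigned int8 pixels[], uniform int n,
                         uniform int gain) {
        foreach (i = 0 ... n) {
            int v = pixels[i] * gain;                // arithmetic at 32 bits
            pixels[i] = (unsigned int8)min(v, 255);  // clamp, narrow on store
        }
    }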

As far as byte and word types and doing 2x and 4x as many of those per cycle, I'm not sure if there's a way to express that right now (but again, check with the ISPC guys). While for media-style kernels it's certainly often desirable to use byte operations, mixing SIMD widths is a bit problematic in the SPMD model, so I'm not totally convinced you can do a lot better than intrinsics anyway. How would you - for instance - expect divergent control flow on a 16-wide varying byte comparison to work with nested code that contained 4-wide 32-bit operations? I would think any operations that by necessity ended up "slicing" like that would be really unintuitive in the programming model.

So yeah, I don't think they would claim you can do everything that hand-tweaked ASM can pull off (especially as far as mixing different SIMD widths of varying type sizes goes), but it covers a pretty good set of implementations, and certainly everything the GPU languages can do.
 
> Unfortunately HPC and other parties have too much sway as far as the standards go, and they are completely in the land of non-coding physicists who just want to trust compiler magic to get them a 2x even on 8+-wide SIMD... and again, I speak from experience here as someone who has rewritten a lot of scientist code :)

Perhaps, but the annotated-loop model is being pushed at the ISO C++ committee by Intel, who should know better.

Maybe in another 15 years...
 
> A pile of ways, but the chief one being basically that "you write the outer loop" rather than trying to virtualize it with overkill SIMD width/parallelism. In the many algorithms that pay a work-complexity hit for additional parallelism, this needlessly cripples performance on any given implementation. Not to mention that the additional hardware resources for keeping around more HW threads than the algorithm needs (to minimally hide latency) are not free.
Needing a lot of threads is a limitation of the hardware, not an advantage of ISPC. Offering a SIMD width parameter is useful, though.

MIC-style cores obviously are a lot better for throughput computing than a GPU. It's a shame that the software tools targeting them - i.e. auto-vectorizers, annotated for loops, and Cilk vector extensions - are so crappy.

> Basically, with GPUs, when you start to write non-trivial code you discover that you really need to write warp-level code to be efficient (see persistent threads and such), and that's both ugly in those programming models and technically not possible to write portably outside of CUDA. ISPC's model ultimately makes a lot more sense here.
A warp is a fundamental architectural concept for a GPU, and as such, has to be exposed in a language.

Portability I can understand, but I think with a few syntactic tweaks the ugliness of writing warp-based code would go away. It's a subjective question, though.
 
> Perhaps, but the annotated-loop model is being pushed at the ISO C++ committee by Intel, who should know better.
Like I said, ICC in particular has always been big on the whole auto-vectorization silliness, and it has often been hard to get HPC folks to do things in different ways, even when they are demonstrably better... (don't even get me started on scientists and doubles).

> MIC-style cores obviously are a lot better for throughput computing than a GPU. It's a shame that the software tools targeting them - i.e. auto-vectorizers, annotated for loops, and Cilk vector extensions - are so crappy.
Hey, at least you have ISPC and Cilk (the proper part... spawn/sync/etc.) now, which is progress over a few years ago. You have to know what you're looking for, but at least the tools exist.

> A warp is a fundamental architectural concept for a GPU, and as such, has to be exposed in a language.
Absolutely - that's what I'm arguing, and it's pretty similar to how the ISPC model works. A "varying float" in ISPC is basically a "warp" (if you want to use those terms) at the hardware width. It's the virtualization beyond that (i.e. declare a work group of size 512 just in case!) and being tied to a static size for the entirety of a kernel that causes the inefficiency. Minimally, if you could switch the size of a work group at group sync barriers, that would get you a lot more power, but at that point it's effectively just a bad syntax for writing those loops/syncs :) I prefer the explicit way in ISPC.
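
As a hypothetical sketch of what I mean by the varying value being the "warp" - cross-lane cooperation is just a stdlib call, with no barriers or work-group machinery:

    // Hypothetical sketch: the varying value *is* the "warp". Each lane
    // of the gang accumulates its own partial sum, then reduce_add()
    // combines across lanes -- no barriers, no shared-memory staging.
    export uniform float dot(uniform float a[], uniform float b[],
                             uniform int n) {
        float partial = 0.0f;          // varying: one value per lane
        foreach (i = 0 ... n) {
            partial += a[i] * b[i];
        }
        return reduce_add(partial);    // cross-lane ("warp") reduction
    }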

And while warp-synchronous programming could be made somewhat prettier and more officially supported (hell, the OptiX core is effectively exactly a software warp scheduler - it's basically more like what you want), at that point it's basically the same model as ISPC. i.e. if you're always writing warp-level code, you don't need a lot of the other concepts/mess. I think it would be great if things went that direction, but the CUDA grid model - and its various benefits and flaws - is sort of ingrained at this point.
 