HSA/HSAIL vs. OpenCL/SPIR vs. CUDA/PTX

Nick

Hi all,

The HSA Foundation has recently released the HSAIL specification. I've only glanced over it so far, but it seems similar in nature to OpenCL's SPIR and CUDA's PTX.

HSA has unified memory and can dispatch new work, but that might eventually find its way into OpenCL and CUDA as well. So is this just more fragmentation of the GPGPU ecosystem or is HSA going to make everything else redundant? I know OpenCL can be implemented on top of HSA, but will NVIDIA ever join the HSA Foundation and stack CUDA on top of HSAIL as well? And what about Intel, who only stands by OpenCL?

So basically, who do you think will win? Or will unified computing make this an irrelevant battle?

Cheers,
Nick
 
Is there a "none of the above" option? :p

I think it also depends on how one defines "win". CUDA will probably never see widespread adoption, but I don't think that's the goal. Thus, I don't know if it directly competes with the other APIs you listed. I would consider CUDA a success (it allowed Nvidia to legitimately enter the HPC market), but does that success count as a "win"?

OpenCL (in its current form) seems pretty dead. Maybe if Apple really starts pushing it in mobile (and even then...). Khronos needs to make some major changes for OpenCL to win.

HSA is completely unproven and founded by companies with less than stellar software records. I definitely think it has promise, but I worry about its actual implementations. It's too early imo for any predictions on HSA until we get more "fine print" details.

At this point, we might as well draw names from a hat to predict the winner.
 
FYI, CUDA already supports launching new work on GK110 and GK208 based parts, and plans for unified virtual memory were announced a while ago. I believe OpenCL will also support these things.

Also, HSA doesn't really have a language. As far as I know they're expecting people will build C++ AMP and OpenCL (and other stuff) on top of HSA, but there is no competing official HSA front end. In that sense, HSA doesn't compete with CUDA or OpenCL, it competes against PTX and SPIR.
 
HSA might have a future if they can persuade the sw vendors to target it. If MS, Google, Apple can be bothered to run with it, then it might do something.
 
As pointed out above, HSAIL is not a competitor to OpenCL or CUDA, and I do not think it is fragmenting the market in any way. AMD and the HSA Foundation expect that OpenCL and C++ AMP implementations, and compilers for other higher-level languages, will generate HSAIL. I don't think anyone except compiler and perhaps OS writers will ever write HSAIL.

HSAIL is only a competitor to PTX. It is not even a competitor to OpenCL SPIR. SPIR is essentially a variant of LLVM bytecode and I expect it to sit one level above HSAIL. In other words, I expect HSA-compliant systems will implement the following steps in the toolchain: OpenCL -> SPIR -> HSAIL -> ISA toolchain.
 
None of the above

Is there a "none of the above" option? :p
Unified computing. Basically just using a homogeneous hardware ISA instead of a heterogeneous virtual ISA. It can be used to implement every API, even HSA.

I think Intel is closest to achieving that. The Xeon E5-2670 can do 666 SP GFLOPS at 115 Watt, which is 5.8 GFLOPS/Watt, while the Xeon Phi 5110P can do 9.0 GFLOPS/Watt. But Haswell will add FMA support to the CPU alongside AVX2, so it can probably meet or exceed what Phi can do, without the heterogeneous overhead or other awkwardness of the architecture.
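As a rough check on that arithmetic, here is a minimal sketch. It takes the figures quoted above at face value (they are not verified here) and assumes, purely for illustration, that FMA doubles peak throughput at the same power draw. The constant and function names are made up for this example.

```python
# Figures as quoted in the post, taken at face value (not independently verified).
XEON_E5_2670_SP_GFLOPS = 666.0
XEON_E5_2670_WATTS = 115.0
XEON_PHI_5110P_GFLOPS_PER_WATT = 9.0


def gflops_per_watt(gflops, watts):
    """Peak throughput divided by rated power."""
    return gflops / watts


# Efficiency of the CPU using the quoted numbers.
cpu_now = gflops_per_watt(XEON_E5_2670_SP_GFLOPS, XEON_E5_2670_WATTS)

# Assumption (hypothetical): FMA doubles peak flops with no change in power.
cpu_with_fma = 2 * cpu_now

print(f"CPU today:    {cpu_now:.1f} GFLOPS/W")
print(f"CPU with FMA: {cpu_with_fma:.1f} GFLOPS/W")
print(f"Phi (quoted): {XEON_PHI_5110P_GFLOPS_PER_WATT:.1f} GFLOPS/W")
print("CPU with FMA exceeds Phi:", cpu_with_fma > XEON_PHI_5110P_GFLOPS_PER_WATT)
```

Under those assumptions the doubled figure lands above the quoted Phi number, which is all the "meet or exceed" claim needs. Real sustained efficiency depends on clocks, memory bandwidth and workload, none of which this models.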

AVX2 seems like just the beginning. They could double the width of the vector units to 512-bit to match Phi, and have two clusters per core at half frequency to save power, each dedicated to one thread. To compensate for the loss of Hyper-Threading they could execute AVX-1024 instructions in two cycles.

So the compute API becomes irrelevant. You just have the compiler vectorize/multithread your favorite programming language. Apparently LLVM has pretty good AVX2 support already.
OpenCL (in its current form) seems pretty dead. Maybe if Apple really starts pushing it in mobile (and even then...). Khronos needs to make some major changes for OpenCL to win.
You don't lose the war by losing one battle. OpenCL is the only API that is supported by every vendor. And there is plenty of time to flesh it out. Just look at how long it has taken OpenGL (ES) to make a comeback. OpenCL can survive regardless of whether HSA, CUDA or unified computing gains the upper hand. Anyway, I'd be interested to hear what kind of major changes you'd suggest.
HSA is completely unproven and founded by companies with less than stellar software records. I definitely think it has promise, but I worry about its actual implementations. It's too early imo for any predictions on HSA until we get more "fine print" details.
I fully agree. But isn't it a bit telling that we have to wait for actual implementations and fine print details to evaluate its chances? I mean, it was announced as the summum of heterogeneous computing. Instead, it just seems to join the ranks. If it really was everything it was hyped up to be, there should be less doubt about its future at this point.

AMD is betting the farm on HSA, and that's shaping up to be really risky without a compelling and durable technical advantage.
 
You don't lose the war by losing one battle. OpenCL is the only API that is supported by every vendor. And there is plenty of time to flesh it out. Just look at how long it has taken OpenGL (ES) to make a comeback. OpenCL can survive regardless of whether HSA, CUDA or unified computing gains the upper hand. Anyway, I'd be interested to hear what kind of major changes you'd suggest.

Oh, I don't think OpenCL "the project" is necessarily hopeless, but I do think it'll take a bit more than a few minor changes. My problem with OpenCL is its lack of robustness. The hard part with OpenCL *should be* getting optimal results across a broad range of architectures with minimal code modification. Too often developers have to battle just to get their code working on various architectures (and right now it's only AMD CPU/GPU, Intel CPU/GPU, and NVIDIA GPU... can't wait for the embedded market to catch up!). I'm not expecting perfect results on the first try, but I'd like to be reasonably sure that my code will at least run across multiple architectures.

Thus, I would be in favor of reducing/eliminating some of the undefined behavior (e.g. various pointer behaviors). In addition while I don't know exactly how rigid OpenCL conformance tests are, they don't seem rigid enough to me (iirc nv's and amd's compilers allow some non-legal behavior). This tricks programmers into thinking they are writing good code (and causes confusion when they later discover it doesn't work on x platform). Khronos also needs a better system of incorporating extensions faster into the core to avoid further fragmentation. Releasing two minor versions over 6 years is not acceptable. Lastly, it would be nice if Khronos could get all the major players to at least pretend they support the OpenCL platform (*cough* nv *cough*). Having various architectures at various levels of OpenCL (when only politics is to blame) isn't helpful and, again, only causes more fragmentation.

But isn't it a bit telling that we have to wait for actual implementations and fine print details to evaluate its chances?

I understand where you are coming from. One would hope that all ambiguity would be eliminated by now (and it begs the question, does the HSA foundation even know exactly what HSA is?). Still (and perhaps this is more of the developer talking), I'm excited to see what HSA can muster. I want to believe! :D

...AVX9001 is the future

I feel like this has been discussed to death on rwt (and really on b3d too). I'm not sure what other value I can add to the discussion. From a developer's perspective, I hope you're right! I'm just not convinced yet (either way). I think we're still a few years away before we can really have that "fight".
 
OpenCL is the only API that is supported by every vendor.

One of the things that bothers me about OpenCL is that people think of it as an API. What we really need are full fledged programming models that work on CPUs and GPUs, etc. Calling OpenCL an API is really a legacy perspective, from back when graphics cards were fixed function accelerators. Now, they are simply parallel processors, and we don't just use them with an API, we program them to do completely new things.

C has a library, so I guess you could call C an API, but that'd be weird. I don't think any of these models will really be adopted until the API mindset is expunged and people accept GPUs as legitimate general purpose processors.
 
My problem with OpenCL is its lack of robustness. The hard part with OpenCL *should be* getting optimal results across a broad range of architectures with minimal code modification. Too often developers have to battle just to get their code working on various architectures...
Neither of those things should be the hard part. Developers should be able to concentrate on functionality, not on getting things to work nice with the hardware or the framework. Don't get me wrong, I agree it starts with robustness and stuff actually working across architectures. I'm just saying the bar is higher than that for becoming truly successful.
Thus, I would be in favor of reducing/eliminating some of the undefined behavior (e.g. various pointer behaviors). In addition while I don't know exactly how rigid OpenCL conformance tests are, they don't seem rigid enough to me (iirc nv's and amd's compilers allow some non-legal behavior). This tricks programmers into thinking they are writing good code (and causes confusion when they later discover it doesn't work on x platform). Khronos also needs a better system of incorporating extensions faster into the core to avoid further fragmentation. Releasing two minor versions over 6 years is not acceptable. Lastly, it would be nice if Khronos could get all the major players to at least pretend they support the OpenCL platform (*cough* nv *cough*). Having various architectures at various levels of OpenCL (when only politics is to blame) isn't helpful and, again, only causes more fragmentation.
Agreed. I don't think "Khronos" is to blame though. It's a gathering of promoters, implementers and adopters, and unless their interests align, not much gets done. Apple needed OpenCL for things like Final Cut Pro, and seems to have lost interest since. Other parties try to keep things going, but they lack the leverage and resources Apple had.

So I guess I do agree that OpenCL is pretty dead in that respect, for now. I was just comparing it against the alternatives and noticed it's the only one that has a chance of surviving in the long run, if the issues get sorted out (no matter how long that takes). NVIDIA and Intel aren't HSA Foundation members, and CUDA is proprietary.
I feel like this has been discussed to death on rwt (and really on b3d too). I'm not sure what other value I can add to the discussion. From a developer's perspective, I hope you're right! I'm just not convinced yet (either way). I think we're still a few years away before we can really have that "fight".
Yes, unified computing has been discussed before, but the previous shortsighted observations don't settle the argument. The convergence is ongoing and there's new data to support it with every architecture. I'm afraid that a lot of people are in denial, and this can hit them faster than they expect. If Skylake has 512-bit SIMD units, then that will make the "fight" very obvious, but the way I see it, it will have already been won. GPUs seem to have no defense against it that would revert the slow but steady convergence.
 
One of the things that bothers me about OpenCL is that people think of it as an API. What we really need are full fledged programming models that work on CPUs and GPUs, etc. Calling OpenCL an API is really a legacy perspective, from back when graphics cards were fixed function accelerators. Now, they are simply parallel processors, and we don't just use them with an API, we program them to do completely new things.

C has a library, so I guess you could call C an API, but that'd be weird. I don't think any of these models will really be adopted until the API mindset is expunged and people accept GPUs as legitimate general purpose processors.
It's not about mindset. The API is necessary as long as the O.S. can't just let processes have the GPU execute binary code directly. This requires being able to treat GPU threads the same way as CPU threads, with full support for things like preemption. The problem is that GPU threads are too numerous and too big to be treated that way. It might be possible for the O.S. to handle them differently, but this seems to be a chicken-or-egg issue. They won't do it until the hardware fully supports it, and the hardware manufacturers won't do it until the O.S. offers a suitable solution. For now they both seem quite happy with the API approach, ignoring the cries of application developers...

In the meantime, CPUs don't suffer from this issue, and they are quickly increasing the SIMD processing power. They don't need any specific API at all, and developers can use any programming language they like.
 
I favor Intel's approach of assimilating the GPU's capabilities into the CPU.
It just makes more sense: a seamless homogeneous architecture.

This makes me wonder what will happen to Nvidia if Intel's right. At least AMD can dump HSA and use VEX and whatever future extension there'll be.
 
AVX2 seems like just the beginning. They could double the width of the vector units to 512-bit to match Phi

Would you mind explaining this in a little more detail please? My understanding of the subject is a little hazy. I know both a Phi core and a Haswell core are capable of the same 32 flops per cycle so what other differences account for one being 256-bit wide and the other being 512-bit?

Presumably if Haswell is upped to 512-bit it would then be capable of double the flops per cycle/core as Phi?
 
Would you mind explaining this in a little more detail please? My understanding of the subject is a little hazy. I know both a Phi core and a Haswell core are capable of the same 32 flops per cycle so what other differences account for one being 256-bit wide and the other being 512-bit?

Presumably if Haswell is upped to 512-bit it would then be capable of double the flops per cycle/core as Phi?

The top Xeon Phi model has one full 512-bit SIMD unit in each of its 61 cores, which puts it at a little over 2 TFLOPS in single precision (32-bit).
Haswell has two 256-bit SIMD units per core, so both end up at the same flops per cycle per core.
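To make the per-core numbers concrete, here is a small sketch of the arithmetic. It assumes each SIMD unit can issue one single-precision FMA per lane per cycle and that an FMA counts as two flops. The function name and parameters are hypothetical, chosen only for this example.

```python
def sp_flops_per_cycle(simd_bits, units_per_core, fma=True):
    """Peak single-precision flops per cycle for one core.

    Assumes every SIMD unit issues one operation per 32-bit lane per cycle,
    and that an FMA counts as two flops (one multiply plus one add).
    """
    lanes = simd_bits // 32            # 32-bit floats per vector register
    flops_per_lane = 2 if fma else 1   # FMA = multiply + add
    return lanes * flops_per_lane * units_per_core


phi_core = sp_flops_per_cycle(512, units_per_core=1)         # one 512-bit unit
haswell_core = sp_flops_per_cycle(256, units_per_core=2)     # two 256-bit units
haswell_512_core = sp_flops_per_cycle(512, units_per_core=2) # hypothetical widening

print("Phi core:            ", phi_core)
print("Haswell core:        ", haswell_core)
print("Haswell at 512-bit:  ", haswell_512_core)

# Clock needed for 61 such Phi cores to reach 2 TFLOPS (2000 GFLOPS).
min_clock_ghz = 2000 / (61 * phi_core)
print(f"Clock for 2 TFLOPS on 61 cores: {min_clock_ghz:.2f} GHz")
```

So one wide unit and two half-width units give the same 32 flops per cycle, and widening both Haswell units to 512 bits would double that to 64, assuming the rest of the core could keep them fed.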
 