Microsoft published C++ AMP spec
Microsoft published an open spec (http://blogs.msdn.com/b/somasegar/archive/2012/02/03/c-amp-open-specification.aspx) of C++ AMP (Accelerated Massive Parallelism), which is implemented in Visual Studio 11.
Spec here (PDF) (http://download.microsoft.com/download/4/0/E/40EA02D8-23A7-4BD2-AD3A-0BFFFB640F28/CppAMPLanguageAndProgrammingModel.pdf)
They really seem to be promoting an open standard:
Copyright License. Microsoft grants you a license under its copyrights in the specification to (a) make copies of this specification to develop your implementation of this specification, and (b) distribute portions of this specification in your implementation or your documentation of your implementation.
(even if the following part where the license talks about patents isn't totally clear to me).
Writing an implementation over OpenCL looks like a straightforward process.
(even if the following part where the license talks about patents isn't totally clear to me).
IANAL, but it seems to me that what they are saying is: we promise not to sue you for any patent infringement w.r.t. this spec if you don't sue us :)
It's really important for this kind of thing to be open if it is really going to be successful.
That's how I read it too, under the same restriction. Pretty sure a lawyer would find a way to twist that, but :)
On a separate note, I like AMP quite a bit more than I like OpenCL, in spite of it being currently more limited. It remains to be seen whether any party other than Microsoft will pick up the compiler-writing mantle, though. Some g++amp.exe would be useful...
Is that like the patent protection you can buy from Microsoft?
"Now you have the option to acquire Xandros Desktop offerings together with Microsoft patent assurance. This assurance enables you to use Xandros Desktop software with confidence. This program is available for $50. Learn more by reading Microsoft's covenant."
http://www.microsoft.com/about/legal/en/us/IntellectualProperty/IPLicensing/customercovenant/xandros.aspx
On a separate note, I like AMP quite a bit more than I like OpenCL, in spite of it being currently more limited.
AMP looks promising; however, it doesn't seem to be exactly a direct OpenCL competitor. It is a bit like comparing Java vs. assembler. OK, maybe the difference isn't that huge, but OpenCL exposes a lot of hardware details.
P.S. Does anyone remember the old days when C++ compilers were just front ends that translated the code to C? Having something similar from C++ AMP to OpenCL would be quite useful.
That ties into the limited aspect. That being said, I'm not that keen on the way OpenCL ends up exposing things (oh look, we're really close to the metal... only we're not really that close once one looks at it), and to be honest I have no confidence in its evolutionary path being anything worthwhile.
The whole Khronos boardism means a never-ending tug of war between IHVs (just look at how nicely OpenGL did as a committee-driven effort). Apple had the potential to make things right by being the ultimate shepherd / vetoer, but they seem utterly incapable of doing so. What AMP has going for itself is primarily the same thing that made DX succeed: whilst it is consultation driven, MS ends up calling what happens, when, and how. Once AMP actually ends up firmly matching the feature set exposed by DirectCompute, I'd be surprised if CL earns anything worthwhile on GPUs. For CPUs it's likely that you may end up getting better performance than whatever WARP gives you; however, to be honest, I'd rather use something like ISPC there, if one doesn't want to get down to intrinsics.
On the other hand, this is a question of taste on my part (IMHO most of the programming language / tool warfare falls into this category; ultimately work can be done with almost anything unless it's hugely bad), so I do apologize if I end up sounding like other programming tool nazis :)
rpg.314
07-Feb-2012, 01:23
Promise not to sue != License. If MS really wanted to push AMP as an open standard, then they would have given a license grant predicated on non-litigation.
Well, Microsoft already promised not to sue anyone for patent infringement over any compliant implementation of this spec. I think that's good enough if all you want is to make an AMP implementation. Providing a free license is probably better, but I understand that it's probably too much for Microsoft to do (after all, a free license means one may be allowed to use the patent freely for other projects once one has made an implementation of AMP, and that's certainly not what Microsoft meant to do).
I don't see how there's a future for this. Every time the hardware becomes less limited there will be a new version, further fragmenting the software ecosystem. That will only stop when eventually we're back where we started: C++.
In my opinion, C++/C should have native vector types (e.g. float4) and a few other new features that we have seen pop up in OpenCL C/CUDA C/C++ AMP, etc.
It would also be useful for developing classic CPU software (e.g. with SSE, AVX).
So maybe we will go back to square one, but I hope with some new features gained along the way.
In my opinion, C++/C should have native vector types (e.g. float4)...
Why? Creating your own vector types is trivial.
...and a few other new features that we have seen pop up in OpenCL C/CUDA C/C++ AMP, etc.
Such as?
It would also be useful for developing classic CPU software (e.g. with SSE, AVX).
Any auto-vectorizing compiler worth its salt already uses vector instructions. Visual Studio 11 will finally join the ranks too.
So maybe we will go back to square one, but I hope with some new features gained along the way.
Aside from adding the 'restrict' keyword to the C++ standard, and perhaps adding a 'vectorize' pragma, I can't think of much that would be useful in the long run.
rpg.314
13-Feb-2012, 02:48
Why? Creating your own vector types is trivial.
Then it should be a part of C++ stdlib.
Any auto-vectorizing compiler worth its salt already uses vector instructions. Visual Studio 11 will finally join the ranks too.
Autovectorization is fragile.
Aside from adding the 'restrict' keyword to the C++ standard, and perhaps adding a 'vectorize' pragma, I can't think of much that would be useful in the long run.
Lambdas perhaps...
There are a lot of things that C++ could use. If you look outside your own niche, you'll find them useful.
Then it should be a part of C++ stdlib.
Should? How do you determine which composite type should become a standard? And why exactly?
Autovectorization is fragile.
It shouldn't be any more fragile than GPGPU.
Autovectorization is fragile.
Yup, in all my tests, I have always seen compilers produce horrible and inefficient code compared to hand-written SSE code.
The only GPU compiler that really had to do some sort of autovectorization was the one for AMD GPUs, generating code for VLIW... and one of the reasons they dropped VLIW in HD7xxx was that writing good compilers was really hard, expensive, time-consuming, etc.
Then it should be a part of C++ stdlib.
Should all trivial things be part of the standard library? I don't see a reason for it.
Lambdas perhaps...
C++11 has them, and a metric TON more awesome stuff.
Yup, in all my tests, I have always seen compilers produce horrible and inefficient code compared to hand-written SSE code.
That's all going to change with AVX2. It has vector equivalents of every scalar instruction, so it can trivially parallelize any loop with independent iterations, identical to how GPUs do it.
Yup, in all my tests, I have always seen compilers produce horrible and inefficient code compared to hand-written SSE code.
Could be a stupid question, but did you use intrinsics for the hand-written SSE or straight assembly calls? I know that, at least with later versions of GCC, when writing code with intrinsics it'll be hard to beat the compiler in generated code efficiency. Obviously, hoping it can figure out how to use the SSE instructions by itself is a whole different matter :)
Could be a stupid question but did you use intrinsics for the hand-written SSE or straight assembly calls?
Intrinsics, for instance to intersect a ray with 4 triangles in a single shot, a ray with 4 bounding boxes, etc. GCC cannot even begin to figure out how to auto-vectorize the code written in plain C++. Indeed, it isn't really GCC's fault: you need a native float4 data type to write something where SSE/AVX can be used.
Side note: it is also noticeable how hard it is to read code written with intrinsics compared to something written with a native float4 (for instance in OpenCL C). From my point of view, this alone could be seen as a good reason to introduce native vector data types.
RecessionCone
13-Feb-2012, 19:25
That's all going to change with AVX2. It has vector equivalents of every scalar instruction, so it can trivially parallelize any loop with independent iterations, identical to how GPUs do it.
What is the AVX2 vector equivalent of a store instruction? AFAIK, it doesn't exist.
AVX2 is a step forward, and will help vectorizing compilers, but it still isn't as good a compile target as either GPUs or Larrabee.
Side note: it is also noticeable how hard it is to read code written with intrinsics compared to something written with a native float4 (for instance in OpenCL C). From my point of view, this alone could be seen as a good reason to introduce native vector data types.
Nah, just write your own vector class and use inline operators with intrinsics. Same performance, much cleaner code.
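To make the "write your own vector class" suggestion concrete, here is a minimal sketch. The type name `float4` and the helper `dot` are hypothetical, chosen only to mirror OpenCL C. The bodies use plain scalar arithmetic so the sketch stays portable; on x86 one would presumably swap them for SSE intrinsics such as `_mm_add_ps` and `_mm_mul_ps`, which is an assumption about the poster's intent rather than something shown in the thread.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical four-wide float type; the name float4 is borrowed from OpenCL C.
struct float4 {
    float v[4];

    // Bounds-checked read access to a single lane.
    float operator[](std::size_t i) const {
        assert(i < 4);
        return v[i];
    }
};

// Lane-wise addition. An SSE build could implement this body with _mm_add_ps.
inline float4 operator+(const float4& a, const float4& b) {
    float4 r{};
    for (std::size_t i = 0; i < 4; ++i) r.v[i] = a.v[i] + b.v[i];
    return r;
}

// Lane-wise multiplication. An SSE build could implement this with _mm_mul_ps.
inline float4 operator*(const float4& a, const float4& b) {
    float4 r{};
    for (std::size_t i = 0; i < 4; ++i) r.v[i] = a.v[i] * b.v[i];
    return r;
}

// Horizontal sum of the lane-wise product.
inline float dot(const float4& a, const float4& b) {
    const float4 p = a * b;
    return ((p.v[0] + p.v[1]) + p.v[2]) + p.v[3];
}
```

Call sites then read like the OpenCL C version (`a + b`, `dot(a, b)`), while the intrinsics, if any, stay hidden inside the operators.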
Ethatron
13-Feb-2012, 21:12
Yup, in all my tests, I have always seen compilers produce horrible and inefficient code compared to hand-written SSE code.
The only GPU compiler that really had to do some sort of autovectorization was the one for AMD GPUs, generating code for VLIW... and one of the reasons they dropped VLIW in HD7xxx was that writing good compilers was really hard, expensive, time-consuming, etc.
The HLSL compiler to assembler is also auto-vectorizing, and it's not that complex. The complex piece in the mentioned equation is VLIW.
Of course HLSL is primitive in comparison to C++, and it's easier to have small pattern databases (I guess that would be called peephole auto-vectorization). Making decisions about optimality isn't that straightforward in C++. I don't think the AV itself is really the problem.
What is the AVX2 vector equivalent of a store instruction? AFAIK, it doesn't exist.
Indeed, there's no scatter in AVX2, but that's not an issue in practice because it should be avoided anyway. For best performance you should store results linearly and read sparse data with gather. You can also use the new permute instructions. And you can always fall back to scalar extract and store instructions.
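The difference between the two access patterns can be sketched in plain C++. The function names `gather`, `scatter`, and `invert_permutation` are hypothetical and not from any library. The last helper shows one special case in which a scatter can be rewritten as a gather with linear stores: when the index map is a permutation, so that every destination is written exactly once. It is not a claim that this works for arbitrary index maps.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Gather: sparse reads, linear writes (out[i] = src[idx[i]]).
std::vector<float> gather(const std::vector<float>& src,
                          const std::vector<std::size_t>& idx) {
    std::vector<float> out(idx.size());
    for (std::size_t i = 0; i < idx.size(); ++i) {
        assert(idx[i] < src.size());
        out[i] = src[idx[i]];
    }
    return out;
}

// Scatter: linear reads, sparse writes (dst[idx[i]] = src[i]).
void scatter(const std::vector<float>& src,
             const std::vector<std::size_t>& idx,
             std::vector<float>& dst) {
    assert(src.size() == idx.size());
    for (std::size_t i = 0; i < idx.size(); ++i) {
        assert(idx[i] < dst.size());
        dst[idx[i]] = src[i];
    }
}

// Assumes idx is a permutation of 0..n-1. The inverse map turns
// "scatter through idx" into "gather through inverse".
std::vector<std::size_t> invert_permutation(const std::vector<std::size_t>& idx) {
    std::vector<std::size_t> inverse(idx.size());
    for (std::size_t i = 0; i < idx.size(); ++i) {
        assert(idx[i] < idx.size());
        inverse[idx[i]] = i;
    }
    return inverse;
}
```

With a permutation `idx`, `gather(src, invert_permutation(idx))` produces the same array as `scatter(src, idx, dst)`. When several sources map to one destination, or some destinations are never written, no such inverse exists.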
AVX2 is a step forward, and will help vectorizing compilers, but it still isn't as good a compile target as either GPUs or Larrabee.
Why? I don't think any GPU has actual scatter support, and it might have been a macro in LRB. The problem is you can't achieve memory ordering consistency for scatter without blocking the load ports. So I don't think you lose anything from not actually supporting it.
Or were you thinking of something else that AVX2 is lacking?
RecessionCone
13-Feb-2012, 23:25
Indeed, there's no scatter in AVX2, but that's not an issue in practice because it should be avoided anyway. For best performance you should store results linearly and read sparse data with gather. You can also use the new permute instructions. And you can always fall back to scalar extract and store instructions.
Why? I don't think any GPU has actual scatter support, and it might have been a macro in LRB. The problem is you can't achieve memory ordering consistency for scatter without blocking the load ports. So I don't think you lose anything from not actually supporting it.
Or were you thinking of something else that AVX2 is lacking?
GPUs have real scatter support. Which can't be replaced by permute instructions or linear writes + gathers in the general case. Of course, GPUs don't have memory consistency problems either. ;)
rpg.314
14-Feb-2012, 01:45
Should? How do you determine which composite type should become a standard? And why exactly?
Because a standard library's job is to provide good defaults for code that is widely used. Like STL.
It shouldn't be any more fragile than GPGPU.
Let us agree to disagree.
rpg.314
14-Feb-2012, 01:46
The only GPU compiler that really had to do some sort of autovectorization was the one for AMD GPUs, generating code for VLIW... and one of the reasons they dropped VLIW in HD7xxx was that writing good compilers was really hard, expensive, time-consuming, etc.
No. That is not auto-vectorization. That is more like variable packing. Which is even more fragile.
Because a standard library's job is to provide good defaults for code that is widely used. Like STL.
The STL is built on top of the C++ language. It doesn't add any native types. The main advantage of that is that it doesn't burden the implementation of the compiler, and doesn't dirty the top namespace. So if the standard committee wants to define a standard vector library they can knock themselves out for all I care.
I just don't think it will help auto-vectorization.
Let us agree to disagree.
Actually I'd rather agree with you, but then I'd like to understand your reasoning.
No. That is not auto-vectorization. That is more like variable packing. Which is even more fragile.
So... What ATI has been doing for the longest time is more fragile than auto-vectorization, which in turn is more fragile than GPGPU?
I really don't follow your reasoning. Anything you write in a GPGPU language is basically just an implicit loop with independent iterations, right? So why not write that loop in plain C++, have the compiler detect that the iterations are independent, and then trivially vectorize it? A few keywords like 'restrict' or 'foreach' can help a lot with determining independence, but I don't think you need much else. Definitely not something as invasive and restrictive as C++ AMP.
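As a rough sketch of the "plain loop plus a keyword" idea, the function below uses `__restrict`, which is a compiler extension accepted by GCC, Clang, and MSVC rather than standard C++ (the post itself asks for it to be standardized). The function name `saxpy` is just an illustrative choice. The qualifier promises that `x` and `y` do not alias, which removes one obstacle to treating the iterations as independent; whether a given compiler actually vectorizes the loop is not guaranteed.

```cpp
#include <cassert>
#include <cstddef>

// y[i] = a * x[i] + y[i] for i in [0, n).
// __restrict (non-standard extension) tells the compiler that x and y never
// overlap, so each iteration only touches its own element.
void saxpy(std::size_t n, float a,
           const float* __restrict x, float* __restrict y) {
    assert(n == 0 || (x != nullptr && y != nullptr));
    for (std::size_t i = 0; i < n; ++i) {
        y[i] = a * x[i] + y[i];
    }
}
```

The loop body is what a GPGPU kernel would contain; the `for` statement is the implicit iteration that such a language hides.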
rpg.314
15-Feb-2012, 01:46
The STL is built on top of the C++ language. It doesn't add any native types. The main advantage of that is that it doesn't burden the implementation of the compiler, and doesn't dirty the top namespace. So if the standard committee wants to define a standard vector library they can knock themselves out for all I care.
I just don't think it will help auto-vectorization.
We are not using consistent terminology here.
For me, auto-vectorization is the compiler detecting independent iterations of loops and generating SSE/AVX. Also, I don't consider these instructions to be vector. So obviously, a standard float4 class is not going to help auto-vectorization.
Actually I'd rather agree with you, but then I'd like to understand your reasoning.
Because compilers aren't perfect and they have to be very conservative. Which pales in comparison to the trap of an unintended serial implementation of a potentially vectorizable algorithm. SPMD, by forcing independence of lanes, allows much more robust vectorization.
So... What ATI has been doing for the longest time is more fragile than auto-vectorization,
Of course. Why do you think they got rid of it? It's really hard to pack general variables into VLIWish form while honoring register file constraints.
A few keywords like 'restrict' or 'foreach' can help a lot with determining independence, but I don't think you need much else. Definitely not something as invasive and restrictive as C++ AMP.
There is a lot more C++ needs, and could use. You just have to consider integer codes.
For me, auto-vectorization is the compiler detecting independent iterations of loops and generating SSE/AVX. Also, I don't consider these instructions to be vector.
Then what do you consider them to be, and what would it take to make them "vector"?
Because compilers aren't perfect and they have to be very conservative. Which pales in comparison to the trap of an unintended serial implementation of a potentially vectorizable algorithm. SPMD, by forcing independence of lanes, allows much more robust vectorization.
All you need is a compiler hint that you're expecting a certain loop to be vectorized, and if it isn't then a descriptive warning should be generated (the same way __forceinline works).
I don't think explicit SPMD is a good idea. Compilation should never fail. The thing is, software development is getting harder every year. So we need all the help from compilers we can get, and not have them make things more complicated. A large portion of developers will hardly care whether a loop was vectorized or not. Only if the performance doesn't meet the target, we need gentle tools to get the desired results.
GPGPU hasn't taken a big flight yet because (a) it's very time consuming to learn and then rewrite your algorithms and tune them, and (b) there's a lot of fragmentation due to hardware-specific limitations/capabilities reflected in the languages/APIs, impeding a flourishing software economy. So there's a need to lower the bar and make things device-independent.
There is a lot more C++ needs, and could use. You just have to consider integer codes.
What do you mean?
Ethatron
15-Feb-2012, 15:11
All you need is a compiler hint that you're expecting a certain loop to be vectorized, and if it isn't then a descriptive warning should be generated (the same way __forceinline works).
What you are thinking of is only the low-hanging fruit. Loops don't make up the majority of code to be auto-vectorized. In fact, until the higher shader models there weren't even loops in HLSL, for example.
The challenge for an auto-vectorizer is to look at a blob of seemingly serial code and break it into independent (parallelizable) pieces, which are then overlaid to find out whether the operations can be matched up at the same point in the sequence of operands. Then those operands can run in a vector.
That said, auto-vectorization, the complex part, is fine-grained auto-parallelism of carefully aligned (matched) operations.
The loop thing (and the regular code as well, of course) can also be made more complex by introducing branches in the loop. Again, the majority of loops are not branch-free. They will have branches. Then the compiler needs to be very clever to work out whether it can make the branches more or less independent of the operation streams after the branch, or whether the branches can be converted into simple data moves. Sometimes that is not possible.
When you ask for an auto-vectorizer as part of a compiler which only treats branchless loops, I doubt anyone producing compilers sees the real use in only that. One has to offer the whole thing. The whole thing is easy in HLSL; it's hard in C++ when you think of all the additions (templates, operator overloading, custom types, etc.).
Your loop+no-branches can easily be implemented in a vector library (Havok did this, very elegant), no need to torture a compiler with it.
What you are thinking of is only the low-hanging fruit. Loops don't make up the majority of code to be auto-vectorized. In fact, until the higher shader models there weren't even loops in HLSL, for example.
I'm not thinking of loops in the SPMD program. I'm talking about the implicit loop(s) (http://en.wikipedia.org/wiki/GPGPU#Kernels) that surround it which iterate through the data elements.
The challenge for an auto-vectorizer is to look at a blob of seemingly serial code and break it into independent (parallelizable) pieces, which are then overlaid to find out whether the operations can be matched up at the same point in the sequence of operands. Then those operands can run in a vector.
No, it can do the much simpler job of making sure that multiple instances of the kernel can run in parallel on individual vector lanes.
I'm afraid rpg.314 is right that we don't have consistent terminology here. I fully realize that you can also take the kernel code and try to find sequences of identical operations and put that in a vector. But in light of C++ AMP and how it can evolve back into plain C++ that is not what we're looking for.
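The "implicit loop around the kernel" can be written out explicitly as a small sketch. The helper name `for_each_index` and the example `add_arrays` are hypothetical. The kernel is a callable that handles exactly one data element, and the surrounding loop is what a GPGPU runtime, or in this argument a vectorizing compiler, would spread across lanes. The sketch runs the instances serially and only illustrates the structure.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper: the implicit loop made explicit. Each call kernel(i)
// is one kernel instance; instances are assumed not to depend on each other.
template <typename Kernel>
void for_each_index(std::size_t n, Kernel kernel) {
    for (std::size_t i = 0; i < n; ++i) {
        kernel(i);
    }
}

// Example "kernel launch": c[i] = a[i] + b[i], one element per instance.
std::vector<float> add_arrays(const std::vector<float>& a,
                              const std::vector<float>& b) {
    assert(a.size() == b.size());
    std::vector<float> c(a.size());
    for_each_index(a.size(), [&](std::size_t i) { c[i] = a[i] + b[i]; });
    return c;
}
```

Vectorizing in this sense means running several values of `i` at once, one per lane, rather than searching inside the lambda for matching operations.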
rpg.314
16-Feb-2012, 01:13
Then what do you consider them to be, and what would it take to make them "vector"?
Glorified VLIW. Scatter/gather/predication.
rpg.314
16-Feb-2012, 03:33
All you need is a compiler hint that you're expecting a certain loop to be vectorized, and if it isn't then a descriptive warning should be generated (the same way __forceinline works).
I don't think explicit SPMD is a good idea. Compilation should never fail. The thing is, software development is getting harder every year. So we need all the help from compilers we can get, and not have them make things more complicated. A large portion of developers will hardly care whether a loop was vectorized or not. Only if the performance doesn't meet the target, we need gentle tools to get the desired results.
Who said anything about compilation failing? SPMD will run fine even if there are no independent lanes to work with. But its vectorization is robust.
SPMD is no harder than actually ensuring that the loop you wrote is parallelizable. But for a compiler to make sure that all loops suggested to be parallel actually are is very much harder.
GPGPU hasn't taken a big flight yet because (a) it's very time consuming to learn and then rewrite your algorithms and tune them, and (b) there's a lot of fragmentation due to hardware-specific limitations/capabilities reflected in the languages/APIs, impeding a flourishing software economy. So there's a need to lower the bar and make things device-independent.
There has been a single revision of Direct Compute so far. And no major revisions of OCL. All of these already are device independent. So that assertion doesn't hold.
What do you mean?
Discriminated unions, better type inference, typeclasses...
Glorified VLIW.
It's not VLIW at all; there's only one opcode. Perhaps it's glorified SIMD. In any case it has the foundations for loop vectorization.
Scatter/gather/predication.
Gather is added to AVX2, which is the one where it will start to truly matter. Scatter doesn't seem necessary / worth it to me (yet). And Intel's CPUs can do two blend instructions per clock which is plenty for implementing branches.
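What "implementing branches with blends" means can be sketched in scalar C++. The function names below are hypothetical. In `with_blend`, both sides of the branch are computed for every element and a per-element mask selects the result, which is the pattern a vector blend instruction applies to all lanes at once. The ternary operator stands in for the hardware blend here, and whether a compiler turns it into one is not guaranteed.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Reference version with a real branch per element.
std::vector<float> with_branch(const std::vector<float>& x) {
    std::vector<float> y(x.size());
    for (std::size_t i = 0; i < x.size(); ++i) {
        if (x[i] < 0.0f) {
            y[i] = -x[i];
        } else {
            y[i] = 2.0f * x[i];
        }
    }
    return y;
}

// Blend-style version: evaluate both sides, then select with a mask.
std::vector<float> with_blend(const std::vector<float>& x) {
    std::vector<float> y(x.size());
    for (std::size_t i = 0; i < x.size(); ++i) {
        const float if_true = -x[i];        // "then" side, always computed
        const float if_false = 2.0f * x[i]; // "else" side, always computed
        const bool mask = x[i] < 0.0f;      // per-lane condition
        y[i] = mask ? if_true : if_false;   // the blend
    }
    return y;
}

// Small self-check that the two formulations agree on a given input.
void check_equivalent(const std::vector<float>& x) {
    assert(with_branch(x) == with_blend(x));
}
```

The trade-off is that both sides are always evaluated, so the approach only pays off when the sides are cheap and free of side effects.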
Who said anything about compilation failing?
Compilation of amp-restricted functions can fail for many reasons:
- There's no support for char or short types, and some bool limitations apply as well.
- There's no support for pointers to pointers.
- There's no support for pointers in compound types.
- There's no support for casting between integers and pointers.
- There's no support for bitfields.
- There's no support for variable argument functions.
- There's no support for virtual functions, function pointers, or recursion.
- There's no support for exceptions.
- There's no support for goto.
The list goes on, and there are also device-specific limitations on data size and such.
SPMD is no harder than actually ensuring that the loop you wrote is parallelizable. But for a compiler to make sure that all loops suggested to be parallel actually are is very much harder.
Just because it's complex doesn't mean it's impossible. Writing compilers has always been hard. But you should have a look at the amazing achievements of the LLVM developers (and take a peek at the Polly project). I'd rather let those experts deal with the device limitations as much as possible than have it reflected in the language.
There has been a single revision of Direct Compute so far. And no major revisions of OCL. All of these already are device independent. So that assertion doesn't hold.
Minor revisions also cause fragmentation. We have three versions of OpenCL, plus a bunch of extensions. There's six versions of CUDA, and no doubt Kepler will bring a seventh. And there's already a versioning system in place for C++ AMP, with the mention that "it is likely that C++ AMP will evolve over time, and that the set of features that are allowed inside amp-restricted functions will grow". And HSA's unification of the x86-64 addressing space will also lift numerous limitations.
This fragmentation really isn't helping the adoption of general purpose throughput computing. And an ecosystem in which code can be exchanged (commercial or otherwise) is close to non-existent. I can only see this change for the better when the language has minimal restrictions (preferably none at all) and abstracts the device capabilities. Vendor lock-in isn't going to work anyway and it's all evolving back to generic languages.
Discriminated unions, better type inference, typeclasses....
Meh. It's C++, a lot of things are done explicitly. And these features are not even relevant to vector processing. Also note that general purpose throughput computing doesn't have to be limited to C/C++. Auto-vectorizing compilers can have many front-ends. Yes that's fragmentation too but at least it's driven by language features and not evolving device limitations/capabilities.
rpg.314
17-Feb-2012, 01:42
Compilation of amp-restricted functions can fail for many reasons:
- There's no support for char or short types, and some bool limitations apply as well.
- There's no support for pointers to pointers.
- There's no support for pointers in compound types.
- There's no support for casting between integers and pointers.
- There's no support for bitfields.
- There's no support for variable argument functions.
- There's no support for virtual functions, function pointers, or recursion.
- There's no support for exceptions.
- There's no support for goto.
The list goes on, and there are also device-specific limitations on data size and such.
We are not talking about the same thing here. I said compilation wouldn't fail if the loop wasn't vectorizable.
Just because it's complex doesn't mean it's impossible. Writing compilers has always been hard. But you should have a look at the amazing achievements of the LLVM developers (and take a peek at the Polly project). I'd rather let those experts deal with the device limitations as much as possible than have it reflected in the language.
They know there are a lot of limits to what they can achieve, and what they will achieve will ultimately be less than that.
Their amazing achievements notwithstanding.
By refusing to change the language/programming model to match the evolution of hardware, we are back to automagical parallelization of generic C. There is no reason to believe that success in this regard will be any greater than what has been achieved so far, no matter who works on it.
Minor revisions also cause fragmentation. We have three versions of OpenCL, plus a bunch of extensions. There's six versions of CUDA, and no doubt Kepler will bring a seventh. And there's already a versioning system in place for C++ AMP, with the mention that "it is likely that C++ AMP will evolve over time, and that the set of features that are allowed inside amp-restricted functions will grow". And HSA's unification of the x86-64 addressing space will also lift numerous limitations.
This fragmentation really isn't helping the adoption of general purpose throughput computing. And an ecosystem in which code can be exchanged (commercial or otherwise) is close to non-existent. I can only see this change for the better when the language has minimal restrictions (preferably none at all) and abstracts the device capabilities. Vendor lock-in isn't going to work anyway and it's all evolving back to generic languages.
We will have to agree to disagree again.
JVM and CLR have seen far more revisions than what vendor-neutral GPU compute has seen so far. I simply don't see how these assertions hold up in front of established facts.
Meh. It's C++, a lot of things are done explicitly. And these features are not even relevant to vector processing. Also note that general purpose throughput computing doesn't have to be limited to C/C++. Auto-vectorizing compilers can have many front-ends. Yes that's fragmentation too but at least it's driven by language features and not evolving device limitations/capabilities.
A language that refuses to add features needed by developers who do not care about vector processing is not much of a general-purpose language. Personal preferences aside, vector processing isn't everything.
We are not talking about the same thing here. I said compilation wouldn't fail if the loop wasn't vectorizable.
So you think all the limitations I summed up will go away? Anyhow, under what situation would an SPMD program be considered non-vectorizable and still compile?
There is no reason to believe that success in this regard will be any greater than what has been achieved so far, no matter who works on it.
So you think the lack of success from auto-vectorization is due to fundamental compilation challenges, rather than the lack of wide vectors, gather, and a vector equivalent of every significant scalar operation?
We will have to agree to disagree again.
JVM and CLR have seen far more revisions than what vendor-neutral GPU compute has seen so far. I simply don't see how these assertions hold up in front of established facts.
The JVM and CLR don't affect the language syntax and semantics.
A language that refuses to add features needed by developers who do not care about vector processing is not much of a general-purpose language. Personal preferences aside, vector processing isn't everything.
Your point being?