Old 14-Feb-2012, 01:46   #26
rpg.314

Quote:
Originally Posted by Dade View Post
The only GPU compiler that had to really do some sort of autovectorization was one for AMD GPUs with VLIW code ... and one of the reasons they dropped VLIW in the HD7xxx was because writing good compilers for it was really hard, expensive, time consuming, etc.
No, that is not auto-vectorization. It is more like variable packing, which is even more fragile.
Old 14-Feb-2012, 20:00   #27
Nick

Quote:
Originally Posted by rpg.314 View Post
Because a standard library's job is to provide good defaults for code that is widely used. Like the STL.
The STL is built on top of the C++ language. It doesn't add any native types. The main advantage of that is that it doesn't burden the implementation of the compiler, and doesn't dirty the top namespace. So if the standard committee wants to define a standard vector library they can knock themselves out for all I care.

I just don't think it will help auto-vectorization.
Quote:
Let us agree to disagree.
Actually I'd rather agree with you, but then I'd like to understand your reasoning.
Quote:
Originally Posted by rpg.314 View Post
No, that is not auto-vectorization. It is more like variable packing, which is even more fragile.
So... What ATI has been doing for the longest time is more fragile than auto-vectorization, which in turn is more fragile than GPGPU?

I really don't follow your reasoning. Anything you write in a GPGPU language is basically just an implicit loop with independent iterations, right? So why not write that loop in plain C++, have the compiler detect that the iterations are independent, and then trivially vectorize it? A few keywords like 'restrict' or 'foreach' can help a lot with determining independence, but I don't think you need much else. Definitely not something as invasive and restrictive as C++ AMP.
Old 15-Feb-2012, 01:46   #28
rpg.314

Quote:
Originally Posted by Nick View Post
The STL is built on top of the C++ language. It doesn't add any native types. The main advantage of that is that it doesn't burden the implementation of the compiler, and doesn't dirty the top namespace. So if the standard committee wants to define a standard vector library they can knock themselves out for all I care.

I just don't think it will help auto-vectorization.
We are not using consistent terminology here.

For me, auto-vectorization is the compiler detecting independent iterations of loops and generating SSE/AVX. Also, I don't consider these instructions to be vector instructions. So obviously, a standard float4 class is not going to help auto-vectorization.

Quote:
Actually I'd rather agree with you, but then I'd like to understand your reasoning.
Because compilers aren't perfect and they have to be very conservative. And that pales in front of the trap of an unintended serial implementation of a potentially vectorizable algorithm. SPMD, by forcing independence of lanes, allows much more robust vectorization.

Quote:
So... What ATI has been doing for the longest time is more fragile than auto-vectorization,
Of course. Why do you think they got rid of it? It's really hard to pack general variables into VLIWish form while honoring register file constraints.


Quote:
A few keywords like 'restrict' or 'foreach' can help a lot with determining independence, but I don't think you need much else. Definitely not something as invasive and restrictive as C++ AMP.
There is a lot more C++ needs, and could use. You just have to consider integer codes.
Old 15-Feb-2012, 07:04   #29
Nick

Quote:
Originally Posted by rpg.314 View Post
For me, auto-vectorization is the compiler detecting independent iterations of loops and generating SSE/AVX. Also, I don't consider these instructions to be vector instructions.
Then what do you consider them to be, and what would it take to make them "vector"?
Quote:
Because compilers aren't perfect and they have to be very conservative. And that pales in front of the trap of an unintended serial implementation of a potentially vectorizable algorithm. SPMD, by forcing independence of lanes, allows much more robust vectorization.
All you need is a compiler hint that you're expecting a certain loop to be vectorized, and if it isn't then a descriptive warning should be generated (the same way __forceinline works).

I don't think explicit SPMD is a good idea. Compilation should never fail. The thing is, software development is getting harder every year. So we need all the help from compilers we can get, not have them make things more complicated. A large portion of developers will hardly care whether a loop was vectorized or not. Only when performance doesn't meet the target do we need gentle tools to get the desired results.

GPGPU hasn't taken a big flight yet because (a) it's very time consuming to learn and then rewrite your algorithms and tune them, and (b) there's a lot of fragmentation due to hardware-specific limitations/capabilities reflected in the languages/APIs, impeding a flourishing software economy. So there's a need to lower the bar and make things device-independent.
Quote:
There is a lot more C++ needs, and could use. You just have to consider integer codes.
What do you mean?
Old 15-Feb-2012, 15:11   #30
Ethatron

Quote:
Originally Posted by Nick View Post
All you need is a compiler hint that you're expecting a certain loop to be vectorized, and if it isn't then a descriptive warning should be generated (the same way __forceinline works).
What you are thinking of is only the low-hanging fruit. Loops don't make up the majority of code to be auto-vectorized. In fact, until the higher shader models there weren't any loops in HLSL, for example.
The challenge for an auto-vectorizer is to look at a blob of seemingly serial code, break it into independent (parallelizable) pieces, overlay them, and try to find out whether a match of operations can be achieved at the same moment in the sequence of operands. Those operands can then run in a vector.
That said, the complex part of auto-vectorization is fine-grained auto-parallelism of carefully aligned (matched) operations.

The loop case (and regular code as well, of course) can also be made more complex by introducing branches into the loop. Again, most loops are not branch-free; they will have branches. Then the compiler needs to be very clever about whether it can make the operation streams after a branch more or less independent, or whether the branches can be converted into simple data moves. Sometimes that is not possible.

When you ask for an auto-vectorizer as part of a compiler that only handles branchless loops, I doubt anyone producing compilers sees real use in only that. One has to offer the whole thing. The whole thing is easy in HLSL; it's hard in C++, when you think of all the additions (templates, operator overloading, custom types, etc.).

Your branchless loop can easily be implemented in a vector library (Havok did this very elegantly); there's no need to torture a compiler with it.
Old 15-Feb-2012, 17:17   #31
Nick

Quote:
Originally Posted by Ethatron View Post
What you are thinking of is only the low-hanging fruit. Loops don't make up the majority of code to be auto-vectorized. In fact, until the higher shader models there weren't any loops in HLSL, for example.
I'm not thinking of loops in the SPMD program. I'm talking about the implicit loop(s) that surround it which iterate through the data elements.
Quote:
The challenge for an auto-vectorizer is to look at a blob of seemingly serial code, break it into independent (parallelizable) pieces, overlay them, and try to find out whether a match of operations can be achieved at the same moment in the sequence of operands. Those operands can then run in a vector.
No, it can do the much simpler job of making sure that multiple instances of the kernel can run in parallel on individual vector lanes.

I'm afraid rpg.314 is right that we don't have consistent terminology here. I fully realize that you can also take the kernel code, try to find sequences of identical operations, and put those in a vector. But in light of C++ AMP and how it can evolve back into plain C++, that is not what we're looking for.
Old 16-Feb-2012, 01:13   #32
rpg.314

Quote:
Originally Posted by Nick View Post
Then what do you consider them to be, and what would it take to make them "vector"?
Glorified VLIW. Scatter/gather/predication.
Old 16-Feb-2012, 03:33   #33
rpg.314

Quote:
Originally Posted by Nick View Post
All you need is a compiler hint that you're expecting a certain loop to be vectorized, and if it isn't then a descriptive warning should be generated (the same way __forceinline works).

I don't think explicit SPMD is a good idea. Compilation should never fail. The thing is, software development is getting harder every year. So we need all the help from compilers we can get, not have them make things more complicated. A large portion of developers will hardly care whether a loop was vectorized or not. Only when performance doesn't meet the target do we need gentle tools to get the desired results.
Who said anything about compilation failing? SPMD will run fine even if there are no independent lanes to work with. But its vectorization is robust.

SPMD is no harder than actually ensuring that the loop you wrote is parallelizable. But for a compiler to make sure that all loops suggested to be parallel actually are is much harder.

Quote:
GPGPU hasn't taken a big flight yet because (a) it's very time consuming to learn and then rewrite your algorithms and tune them, and (b) there's a lot of fragmentation due to hardware-specific limitations/capabilities reflected in the languages/APIs, impeding a flourishing software economy. So there's a need to lower the bar and make things device-independent.
There has been a single revision of DirectCompute so far, and no major revisions of OpenCL. All of these are already device-independent. So that assertion doesn't hold.
Quote:
What do you mean?
Discriminated unions, better type inference, typeclasses....
Old 16-Feb-2012, 13:08   #34
Nick

Quote:
Originally Posted by rpg.314 View Post
Glorified VLIW.
It's not VLIW at all; there's only one opcode. Perhaps it's glorified SIMD. In any case it has the foundations for loop vectorization.
Quote:
Scatter/gather/predication.
Gather is added in AVX2, which is the one where it will start to truly matter. Scatter doesn't seem necessary / worth it to me (yet). And Intel's CPUs can do two blend instructions per clock, which is plenty for implementing branches.
Old 16-Feb-2012, 17:35   #35
Nick

Quote:
Originally Posted by rpg.314 View Post
Who said anything about compilation failing?
Compilation of amp-restricted functions can fail for many reasons:

- There's no support for char or short types, and some bool limitations apply as well.
- There's no support for pointers to pointers.
- There's no support for pointers in compound types.
- There's no support for casting between integers and pointers.
- There's no support for bitfields.
- There's no support for variable argument functions.
- There's no support for virtual functions, function pointers, or recursion.
- There's no support for exceptions.
- There's no support for goto.

The list goes on, and there are also device-specific limitations on data size and such.
Quote:
SPMD is no harder than actually ensuring that the loop you wrote is parallelizable. But for a compiler to make sure that all loops suggested to be parallel actually are is much harder.
Just because it's complex doesn't mean it's impossible. Writing compilers has always been hard. But you should have a look at the amazing achievements of the LLVM developers (and take a peek at the Polly project). I'd rather let those experts deal with the device limitations as much as possible than have it reflected in the language.
Quote:
There has been a single revision of Direct Compute so far. And no major revisions of OCL. All of these already are device independent. So that assertion doesn't hold.
Minor revisions also cause fragmentation. We have three versions of OpenCL, plus a bunch of extensions. There's six versions of CUDA, and no doubt Kepler will bring a seventh. And there's already a versioning system in place for C++ AMP, with the mention that "it is likely that C++ AMP will evolve over time, and that the set of features that are allowed inside amp-restricted functions will grow". And HSA's unification of the x86-64 addressing space will also lift numerous limitations.

This fragmentation really isn't helping the adoption of general purpose throughput computing. And an ecosystem in which code can be exchanged (commercial or otherwise) is close to non-existent. I can only see this change for the better when the language has minimal restrictions (preferably none at all) and abstracts the device capabilities. Vendor lock-in isn't going to work anyway and it's all evolving back to generic languages.
Quote:
Discriminated unions, better type inference, typeclasses....
Meh. It's C++, a lot of things are done explicitly. And these features are not even relevant to vector processing. Also note that general purpose throughput computing doesn't have to be limited to C/C++. Auto-vectorizing compilers can have many front-ends. Yes that's fragmentation too but at least it's driven by language features and not evolving device limitations/capabilities.
Old 17-Feb-2012, 01:42   #36
rpg.314

Quote:
Originally Posted by Nick View Post
Compilation of amp-restricted functions can fail for many reasons:

- There's no support for char or short types, and some bool limitations apply as well.
- There's no support for pointers to pointers.
- There's no support for pointers in compound types.
- There's no support for casting between integers and pointers.
- There's no support for bitfields.
- There's no support for variable argument functions.
- There's no support for virtual functions, function pointers, or recursion.
- There's no support for exceptions.
- There's no support for goto.

The list goes on, and there are also device-specific limitations on data size and such.
We are not talking about the same thing here. I said compilation wouldn't fail if the loop wasn't vectorizable.
Quote:
Just because it's complex doesn't mean it's impossible. Writing compilers has always been hard. But you should have a look at the amazing achievements of the LLVM developers (and take a peek at the Polly project). I'd rather let those experts deal with the device limitations as much as possible than have it reflected in the language.
They know there are a lot of limits to what they can achieve, and what they will achieve will ultimately be less than that. Their amazing achievements notwithstanding.

By refusing to change the language/programming model to match the evolution of hardware, we are back to automagical parallelization of generic C. There is no reason to believe that success in this regard will be any greater than what has been achieved so far, no matter who works on it.
Quote:
Minor revisions also cause fragmentation. We have three versions of OpenCL, plus a bunch of extensions. There's six versions of CUDA, and no doubt Kepler will bring a seventh. And there's already a versioning system in place for C++ AMP, with the mention that "it is likely that C++ AMP will evolve over time, and that the set of features that are allowed inside amp-restricted functions will grow". And HSA's unification of the x86-64 addressing space will also lift numerous limitations.

This fragmentation really isn't helping the adoption of general purpose throughput computing. And an ecosystem in which code can be exchanged (commercial or otherwise) is close to non-existent. I can only see this change for the better when the language has minimal restrictions (preferably none at all) and abstracts the device capabilities. Vendor lock-in isn't going to work anyway and it's all evolving back to generic languages.
We will have to agree to disagree again.

The JVM and CLR have seen far more revisions than what vendor-neutral GPU compute has seen so far. I simply don't see how these assertions hold up against established facts.

Quote:
Meh. It's C++, a lot of things are done explicitly. And these features are not even relevant to vector processing. Also note that general purpose throughput computing doesn't have to be limited to C/C++. Auto-vectorizing compilers can have many front-ends. Yes that's fragmentation too but at least it's driven by language features and not evolving device limitations/capabilities.
A language that refuses to add features needed by developers who do not care for vector processing is not a very general-purpose language. Personal preferences aside, vector processing isn't everything.
Old 17-Feb-2012, 04:26   #37
Nick

Quote:
Originally Posted by rpg.314 View Post
We are not talking about the same thing here. I said compilation wouldn't fail if the loop wasn't vectorizable.
So you think all the limitations I summed up will go away? Anyhow, under what situation would an SPMD program be considered non-vectorizable and still compile?
Quote:
There is no reason to believe that success in this regard will be any greater than what has been achieved so far, no matter who works on it.
So you think the lack of success from auto-vectorization is due to fundamental compilation challenges, rather than the lack of wide vectors, gather, and a vector equivalent of every significant scalar operation?
Quote:
We will have to agree to disagree again.

JVM and CLR have seen far more revisions that what vendor neutral GPU compute has seen so far. I simply don't see how these assertions hold up in front of established facts.
The JVM and CLR don't affect the language syntax and semantics.
Quote:
A language that refuses to add features needed by developers who do not care for vector processing is not very a general purpose language. Personal preferences aside, vector processing isn't everything.
Your point being?
