Microsoft published C++ AMP spec

Discussion in 'GPGPU Technology & Programming' started by pcchen, Feb 6, 2012.

  1. Nick

    Veteran

    Joined:
    Jan 7, 2003
    Messages:
    1,881
    Location:
    Montreal, Quebec
    Nah, just write your own vector class and use inline operators with intrinsics. Same performance, much cleaner code.
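    As a sketch of what I mean (the class and names are mine, and the scalar loops just stand in for the SSE intrinsics a real build would use, e.g. _mm_add_ps / _mm_mul_ps):

```cpp
#include <cstddef>

// Minimal sketch of a hand-rolled 4-wide vector class. The scalar
// loops stand in for intrinsics: with the x86 headers, operator+
// would wrap _mm_add_ps and operator* would wrap _mm_mul_ps.
struct float4 {
    float v[4];
    float4() : v{0, 0, 0, 0} {}
    float4(float a, float b, float c, float d) : v{a, b, c, d} {}
};

inline float4 operator+(const float4& a, const float4& b) {
    float4 r;                                   // would be _mm_add_ps
    for (std::size_t i = 0; i < 4; ++i) r.v[i] = a.v[i] + b.v[i];
    return r;
}

inline float4 operator*(const float4& a, const float4& b) {
    float4 r;                                   // would be _mm_mul_ps
    for (std::size_t i = 0; i < 4; ++i) r.v[i] = a.v[i] * b.v[i];
    return r;
}
```

    Because the operators inline, expression code like `d = a * b + c` compiles down to straight-line SIMD with no call overhead.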
     
  2. Ethatron

    Regular Subscriber

    Joined:
    Jan 24, 2010
    Messages:
    676
    The HLSL compiler to assembly is also auto-vectorizing, and it's not that complex. The complex piece in the mentioned equation is VLIW.
    Of course HLSL is primitive in comparison to C++, and it's easier to get by with small pattern databases (peephole auto-vectorization, I guess you'd call it). Making decisions about optimality isn't that straightforward in C++. I don't think the AV itself is really the problem.
     
  3. Nick

    Veteran

    Joined:
    Jan 7, 2003
    Messages:
    1,881
    Location:
    Montreal, Quebec
    Indeed, there's no scatter in AVX2, but that's not an issue in practice because it should be avoided anyway. For best performance you should store results linearly and read sparse data with gather. You can also use the new permute instructions. And you can always fall back to scalar extract and store instructions.
    Why? I don't think any GPU has actual scatter support, and it might have been a macro in LRB. The problem is you can't achieve memory ordering consistency for scatter without blocking the load ports. So I don't think you lose anything from not actually supporting it.
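    A scalar sketch of that pattern (gather8 is my stand-in for one AVX2 _mm256_i32gather_ps; the kernel name and shapes are made up for illustration):

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Sketch of "store linearly, read sparse with gather". gather8
// emulates what one AVX2 gather does: pull 8 sparse elements into a
// contiguous lane-ordered block. The stores stay linear, so no
// scatter instruction is ever needed.
inline void gather8(const float* base, const int32_t* idx, float* out) {
    for (int lane = 0; lane < 8; ++lane)    // one gather instruction
        out[lane] = base[idx[lane]];
}

// Hypothetical kernel: y[i] = 2 * x[perm[i]]. The sparse access is on
// the load side (gather); the store side is a plain linear write.
inline void scaled_permute(const std::vector<float>& x,
                           const std::vector<int32_t>& perm,
                           std::vector<float>& y) {
    for (std::size_t i = 0; i + 8 <= y.size(); i += 8) {
        float lanes[8];
        gather8(x.data(), perm.data() + i, lanes);  // sparse reads
        for (int l = 0; l < 8; ++l)
            y[i + l] = 2.0f * lanes[l];             // linear writes
    }
}
```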

    Or were you thinking of something else that AVX2 is lacking?
     
  4. RecessionCone

    Regular

    Joined:
    Feb 27, 2010
    Messages:
    478
    GPUs have real scatter support. Which can't be replaced by permute instructions or linear writes + gathers in the general case. Of course, GPUs don't have memory consistency problems either. ;)
     
  5. rpg.314

    Veteran

    Joined:
    Jul 21, 2008
    Messages:
    4,298
    Location:
    /
    Because a standard library's job is to provide good defaults for code that is widely used. Like the STL.

    Let us agree to disagree.
     
  6. rpg.314

    Veteran

    Joined:
    Jul 21, 2008
    Messages:
    4,298
    Location:
    /
    No. That is not auto vectorization. That is more like variable packing. Which is even more fragile.
     
  7. Nick

    Veteran

    Joined:
    Jan 7, 2003
    Messages:
    1,881
    Location:
    Montreal, Quebec
    The STL is built on top of the C++ language. It doesn't add any native types. The main advantage of that is that it doesn't burden the implementation of the compiler, and doesn't dirty the top namespace. So if the standard committee wants to define a standard vector library they can knock themselves out for all I care.

    I just don't think it will help auto-vectorization.
    Actually I'd rather agree with you, but then I'd like to understand your reasoning.
    So... What ATI has been doing for the longest time is more fragile than auto-vectorization, which in turn is more fragile than GPGPU?

    I really don't follow your reasoning. Anything you write in a GPGPU language is basically just an implicit loop with independent iterations, right? So why not write that loop in plain C++, have the compiler detect that the iterations are independent, and then trivially vectorize it? A few keywords like 'restrict' or 'foreach' can help a lot with determining independence, but I don't think you need much else. Definitely not something as invasive and restrictive as C++ AMP.
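    For example (a sketch; __restrict is the common compiler spelling of C99's restrict, and the function is just an illustration):

```cpp
// Sketch of the "implicit loop with independent iterations" written
// as plain C++. The __restrict qualifiers tell the compiler the
// arrays don't alias, so each iteration is provably independent and
// the loop vectorizes trivially to SSE/AVX.
inline void saxpy(float* __restrict y, const float* __restrict x,
                  float a, int n) {
    for (int i = 0; i < n; ++i)   // independent iterations -> SIMD
        y[i] = a * x[i] + y[i];
}
```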
     
  8. rpg.314

    Veteran

    Joined:
    Jul 21, 2008
    Messages:
    4,298
    Location:
    /
    We are not using consistent terminology here.

    For me, auto vectorization is the compiler detecting independent iterations of loops and generating SSE/AVX. Also, I don't consider those instructions to be vector. So obviously, a standard float4 class is not going to help auto vectorization.

    Because compilers aren't perfect and they have to be very conservative. And that pales next to the trap of an unintended serial implementation of a potentially vectorizable algorithm. SPMD, by forcing independence of the lanes, allows much more robust vectorization.

    Of course. Why do you think they got rid of it? It's really hard to pack general variables into VLIWish form while honoring register file constraints.


    There is a lot more C++ needs, and could use. You just have to consider integer codes.
     
  9. Nick

    Veteran

    Joined:
    Jan 7, 2003
    Messages:
    1,881
    Location:
    Montreal, Quebec
    Then what do you consider them to be, and what would it take to make them "vector"?
    All you need is a compiler hint that you're expecting a certain loop to be vectorized, and if it isn't then a descriptive warning should be generated (the same way __forceinline works).
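    Something close to this already exists, as a sketch (the pragma and the diagnostic flags are real Clang/GCC features; the function itself is just an example):

```cpp
// Sketch of the hint described above, using existing compiler
// features. With Clang, "#pragma clang loop vectorize(enable)" plus
// the -Rpass-missed=loop-vectorize flag reports when a loop the
// programmer expected to vectorize did not; GCC has
// -fopt-info-vec-missed for the same purpose.
inline void scale(float* dst, const float* src, float s, int n) {
#if defined(__clang__)
#pragma clang loop vectorize(enable)
#endif
    for (int i = 0; i < n; ++i)
        dst[i] = s * src[i];
}
```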

    I don't think explicit SPMD is a good idea. Compilation should never fail. The thing is, software development is getting harder every year. So we need all the help from compilers we can get, and not have them make things more complicated. A large portion of developers will hardly care whether a loop was vectorized or not. Only if the performance doesn't meet the target, we need gentle tools to get the desired results.

    GPGPU hasn't taken a big flight yet because (a) it's very time consuming to learn and then rewrite your algorithms and tune them, and (b) there's a lot of fragmentation due to hardware-specific limitations/capabilities reflected in the languages/APIs, impeding a flourishing software economy. So there's a need to lower the bar and make things device-independent.
    What do you mean?
     
  10. Ethatron

    Regular Subscriber

    Joined:
    Jan 24, 2010
    Messages:
    676
    What you are thinking of is only the low-hanging fruit. Loops don't make up the majority of the code to be auto-vectorized. In fact, until the higher shader models there weren't even loops in HLSL, for example.
    The challenge for an auto-vectorizer is to look at a blob of seemingly serial code, break it into independent (parallelizable) pieces, overlay those, and try to find out whether matching operations can be lined up at the same moment in the sequence of operands. Those operands can then run in a vector.
    That said, the complex part of auto-vectorization is fine-grained auto-parallelism of carefully aligned (matched) operations.

    The loop case (and regular code as well, of course) is also made more complex by branches inside the loop. Again, most loops are not branch-free; they will have branches. The compiler then needs to be very clever to make the operation streams after the branch more or less independent, or to convert the branches into simple data moves. Sometimes that is not possible.

    If you ask for an auto-vectorizer that only handles branchless loops, I doubt any compiler vendor sees real use in just that. One has to offer the whole thing. The whole thing is easy in HLSL; it's hard in C++, once you consider all the additions (templates, operator overloading, custom types, etc.).

    Your loops-without-branches can easily be implemented in a vector library (Havok did this, very elegantly); no need to torture a compiler with it.
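    For what it's worth, the branch-to-data-move case looks roughly like this (a sketch of my own; the per-lane select is exactly what a SIMD blend instruction implements):

```cpp
// Sketch of converting a branch into a data move. Both sides of the
// branch are always evaluated, and the result is selected per
// element, which maps directly to a SIMD blend/select instruction
// (e.g. blendvps on x86) instead of a jump.
inline void select_demo(float* out, const float* in, int n) {
    for (int i = 0; i < n; ++i) {
        float taken = in[i] * 2.0f;    // "then" side, always computed
        float other = in[i] + 1.0f;    // "else" side, always computed
        out[i] = (in[i] > 0.0f) ? taken : other;  // per-lane blend
    }
}
```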
     
  11. Nick

    Veteran

    Joined:
    Jan 7, 2003
    Messages:
    1,881
    Location:
    Montreal, Quebec
    I'm not thinking of loops in the SPMD program. I'm talking about the implicit loop(s) that surround it which iterate through the data elements.
    No, it can do the much simpler job of making sure that multiple instances of the kernel can run in parallel on individual vector lanes.

    I'm afraid rpg.314 is right that we don't have consistent terminology here. I fully realize that you can also take the kernel code and try to find sequences of identical operations and put that in a vector. But in light of C++ AMP and how it can evolve back into plain C++ that is not what we're looking for.
     
  12. rpg.314

    Veteran

    Joined:
    Jul 21, 2008
    Messages:
    4,298
    Location:
    /
    Glorified VLIW. Scatter/gather/predication.
     
  13. rpg.314

    Veteran

    Joined:
    Jul 21, 2008
    Messages:
    4,298
    Location:
    /
    Who said anything about compilation failing? SPMD will run fine even if there are no independent lanes to work with. But its vectorization is robust.

    SPMD is no harder than actually ensuring that the loop you wrote is parallelizable. But it is much harder for a compiler to verify that all loops suggested to be parallel actually are.

    There has been a single revision of Direct Compute so far. And no major revisions of OCL. All of these already are device independent. So that assertion doesn't hold.
    Discriminated unions, better type inference, typeclasses....
     
  14. Nick

    Veteran

    Joined:
    Jan 7, 2003
    Messages:
    1,881
    Location:
    Montreal, Quebec
    It's not VLIW at all; there's only one opcode. Perhaps it's glorified SIMD. In any case it has the foundations for loop vectorization.
    Gather is added to AVX2, which is the one where it will start to truly matter. Scatter doesn't seem necessary / worth it to me (yet). And Intel's CPUs can do two blend instructions per clock which is plenty for implementing branches.
     
  15. Nick

    Veteran

    Joined:
    Jan 7, 2003
    Messages:
    1,881
    Location:
    Montreal, Quebec
    Compilation of amp-restricted functions can fail for many reasons:

    - There's no support for char or short types, and some bool limitations apply as well.
    - There's no support for pointers to pointers.
    - There's no support for pointers in compound types.
    - There's no support for casting between integers and pointers.
    - There's no support for bitfields.
    - There's no support for variable argument functions.
    - There's no support for virtual functions, function pointers, or recursion.
    - There's no support for exceptions.
    - There's no support for goto.

    The list goes on, and there are also device-specific limitations on data size and such.
    Just because it's complex doesn't mean it's impossible. Writing compilers has always been hard. But you should have a look at the amazing achievements of the LLVM developers (and take a peek at the Polly project). I'd rather let those experts deal with the device limitations as much as possible than have it reflected in the language.
    Minor revisions also cause fragmentation. We have three versions of OpenCL, plus a bunch of extensions. There are six versions of CUDA, and no doubt Kepler will bring a seventh. And there's already a versioning system in place for C++ AMP, with the mention that "it is likely that C++ AMP will evolve over time, and that the set of features that are allowed inside amp-restricted functions will grow". And HSA's unification of the x86-64 address space will also lift numerous limitations.

    This fragmentation really isn't helping the adoption of general purpose throughput computing. And an ecosystem in which code can be exchanged (commercial or otherwise) is close to non-existent. I can only see this change for the better when the language has minimal restrictions (preferably none at all) and abstracts the device capabilities. Vendor lock-in isn't going to work anyway and it's all evolving back to generic languages.
    Meh. It's C++, a lot of things are done explicitly. And these features are not even relevant to vector processing. Also note that general purpose throughput computing doesn't have to be limited to C/C++. Auto-vectorizing compilers can have many front-ends. Yes that's fragmentation too but at least it's driven by language features and not evolving device limitations/capabilities.
     
  16. rpg.314

    Veteran

    Joined:
    Jul 21, 2008
    Messages:
    4,298
    Location:
    /
    We are not talking about the same thing here. I said compilation wouldn't fail if the loop wasn't vectorizable.
    They know there are a lot of limits to what they can achieve, and what they actually deliver will ultimately be less than that.
    Their amazing achievements notwithstanding.

    By refusing to change the language/programming model to match the evolution of hardware, we are back to automagical parallelization of generic C. There is no reason to believe that success in this regard will be any greater than what has been achieved so far, no matter who works on it.
    We will have to agree to disagree again.

    The JVM and CLR have seen far more revisions than vendor-neutral GPU compute has seen so far. I simply don't see how these assertions hold up in front of established facts.

    A language that refuses to add features needed by developers who do not care for vector processing is not a very general-purpose language. Personal preferences aside, vector processing isn't everything.
     
  17. Nick

    Veteran

    Joined:
    Jan 7, 2003
    Messages:
    1,881
    Location:
    Montreal, Quebec
    So you think all the limitations I summed up will go away? Anyhow, under what situation would an SPMD program be considered non-vectorizable and still compile?
    So you think the lack of success from auto-vectorization is due to fundamental compilation challenges, rather than the lack of wide vectors, gather, and a vector equivalent of every significant scalar operation?
    The JVM and CLR don't affect the language syntax and semantics.
    Your point being?
     
