OpenCL 3.0 [2020]

Discussion in 'Rendering Technology and APIs' started by DmitryKo, Apr 27, 2020.

  1. DmitryKo

    Regular

    Joined:
    Feb 26, 2002
    Messages:
    902
    Likes Received:
    1,076
    Location:
    55°38′33″ N, 37°28′37″ E
    LLVM back-end is based on platform-independent bytecode, and both Vulkan/OpenCL SPIR-V bytecode and CUDA PTX bytecode are versions of LLVM bytecode. It knows nothing about high-level languages - any architecture-specific back-end will support whatever language the front-end implements.

    Intel thinks otherwise and their HPC efforts are based on SYCL and Clang/LLVM/SPIR-V, not another proprietary API and compiler. Their upcoming Xe-HPC accelerators have won the Aurora supercomputer contract from the US DOE - if they can rival NVIDIA Tesla/DGX and AMD Radeon Pro/Instinct, this could be a game changer for the API landscape.

    I'm certainly not going to ditch Windows for a server OS on my desktop, I have work to do.

    Also I'm not sure why it's Microsoft's fault when NVIDIA and AMD advertise HSA features but cannot implement a proper heterogeneous MMU with a page table entry format that supports page faults in system memory, and have to rely on a Linux-specific kernel component to fix the issues that arise. Microsoft works around it with memory residency API in DXGI and Direct3D 12 to manage large heap allocations.


    CUDA enjoys a large following but code migration tools work both ways, so native SYCL with CUDA conversion could become equally popular.

    Code analysis and optimization tools are computer-architecture specific, they are not defined in or required by any high-level language specification.

    Which severely impact application performance.

    It takes time and the specs are not final yet.


    Yes, it looks like CodeXL repository was archived and developers were directed to ROCm repository or GPUOpen.com website (which relaunches May 11 after a redesign).

    Vulkan only needs to support SPIR-V bytecode as a target for OpenCL/SYCL toolsets.
     
    #21 DmitryKo, May 9, 2020
    Last edited: May 9, 2020
  2. Lurkmass

    Regular Newcomer

    Joined:
    Mar 3, 2020
    Messages:
    320
    Likes Received:
    357
    That's not good enough - you can't assume that every LLVM back-end will support all of the features required to fully implement said languages.

    Intel's SYCL implementation might as well be another proprietary API in practice, since it's still not even classified as a conformant implementation, and DPC++ makes some fairly radical extensions to it as well.

    Well that's the trade off you have to make as most of the high-end compute work is done on Linux.

    Also, even if WDDM supports GPU page faults it still doesn't support system-wide atomics which is one of the key features to unified memory. Stuff like Infinity Fabric and NVLink relies on these mechanisms to offer advanced communication functionality.

    Codeplay's SYCL for CUDA project shows otherwise which is a fork of Intel's LLVM compiler back-end and even Intel wants to convert CUDA into an extended form of SYCL known as DPC++.

    Both AMD and Intel want developers to write CUDA so that they can write these conversion tools. It's lower maintenance for the developers and for vendors like AMD/Intel, so it's a win-win situation all around given the circumstances. The people behind ROCm and the oneAPI compute stack admire CUDA too, so they imitate it as an expression of flattery!


    Maybe CUDA should be standardized after all, since others like AMD or Intel are willing to copy it?


    If it's a part of the standard then an implementation needs to make the feature work to advertise full support.
     
  3. Lurkmass

    Regular Newcomer

    Joined:
    Mar 3, 2020
    Messages:
    320
    Likes Received:
    357
    Radeon Rays 4.0 has officially deprecated OpenCL support.
     
    xpea and DavidGraham like this.
  4. xpea

    Regular Newcomer

    Joined:
    Jun 4, 2013
    Messages:
    435
    Likes Received:
    500
    Another proof that OpenCL is dead.
    AMD is doing the right thing: develop a comprehensive set of robust APIs and frameworks for a wide range of needs. Then maybe, if they put enough effort into it, and keep working on it for several generations of hardware, they will finally be a worthy CUDA alternative...
     
  5. DmitryKo

    Regular

    Joined:
    Feb 26, 2002
    Messages:
    902
    Likes Received:
    1,076
    Location:
    55°38′33″ N, 37°28′37″ E
    According to SYCLcon 2020 presentations and tutorials, at least three of the five language extensions - unified memory access, sub-groups, and in-order queues - will be included in SYCL 2020. It will also support multiple back-ends - i.e. CUDA PTX, Vulkan SPIR-V, and C++ for OpenCL - in addition to standard OpenCL C 1.2, as well as an executable module format with several target binaries.

    SYCL 2020: More than meets the eye
    iwocl.org/wp-content/uploads/iwocl-syclcon-2020-brown-15-slides.pd
    youtu.be/MfAzQMKn-ho

    Data Parallel C++: Enhancing SYCL Through Extensions for Productivity and Performance
    iwocl.org/wp-content/uploads/wocl-syclcon-2020-brodman-22-slides.pdf
    youtu.be/V2PFduPi5QA

    SYCL Tutorial: Unified Shared Memory
    youtu.be/t69Ts6acmR0


    This is not a driver problem, it's the same MMU hardware limitation. Atomics in unified memory require a cache coherence protocol between the CPU and GPU, and NVIDIA only supports hardware coherence on NVLink 2.0 systems, like the POWER9-based supercomputers. On the x86_64 platform, CUDA's system atomics are implemented in software by migrating system memory pages to local graphics memory.

    This won't change until actual GPUs and CPUs have MMUs that both support Gen-Z/CXL, Infinity Fabric, or similar protocols that work over PCIe, with system memory fast enough to match the PCIe 5.0/6.0 bandwidth of 63/126 GByte/s (like Infinity Architecture 3 in Zen 4 EPYC CPUs and CDNA2 GPUs for the HPE/Cray El Capitan supercomputer, or CXL over PCIe 6.0 in Xe-HPC GPUs and Xeon CPUs for the above-mentioned Aurora supercomputer). So far NVIDIA has indicated no plans to implement one of these protocols, and Intel/AMD have no plans to implement NVLink or OpenCAPI.


    If changes to the LLVM infrastructure are required, compiler implementations make them in the open, like Intel does for the DPC++ tools.

    It's because HPC clusters run under Linux. Desktop PCs are hardly "high-end compute" today.

    If that was the case, Intel would simply implement the CUDA API but rename all references to avoid trademark infringement, like AMD did with HIP.

    RTTI is 20x slower than static types and always gets disabled for embedded/kernel code (the primary reason why C++ is slowly being replaced by Rust in systems programming).

    It's a different incompatible helper library, with binaries for Vulkan, Direct3D 12, and Metal.

    It had to be implemented with compute shaders right from the start.
     
    #25 DmitryKo, May 14, 2020
    Last edited: Nov 2, 2020
    BRiT likes this.
  6. JoeJ

    Veteran Newcomer

    Joined:
    Apr 1, 2018
    Messages:
    1,169
    Likes Received:
    1,354
  7. DmitryKo

    Regular

    Joined:
    Feb 26, 2002
    Messages:
    902
    Likes Received:
    1,076
    Location:
    55°38′33″ N, 37°28′37″ E
    If you click on Intel® oneAPI DPC++ Compiler and/or watch introductory videos, it does say that

    DPC++ =
    ISO C++ and
    Khronos SYCL and
    community extensions
     
    #27 DmitryKo, May 15, 2020
    Last edited: May 15, 2020
    JoeJ likes this.
  8. JoeJ

    Veteran Newcomer

    Joined:
    Apr 1, 2018
    Messages:
    1,169
    Likes Received:
    1,354
    oh, sorry then :)
    I like Intels direction and hoped it could strengthen some general compute API progress...
     
  9. Lurkmass

    Regular Newcomer

    Joined:
    Mar 3, 2020
    Messages:
    320
    Likes Received:
    357
    Precisely ...

    SYCL alone is nowhere near powerful enough to express the same feature set as CUDA, and Intel proposes several vendor-specific extensions included in DPC++ to close this gap. Intel's end goal also includes developers migrating from OpenCL to their 'portable' DPC++ standard, even though no other vendor has implemented SYCL, let alone DPC++. Intel also tried to lead others into believing that their oneAPI compute stack was some sort of industry-wide or community-driven project, but anyone can see that it's just another part of their corporate agenda. OpenCL and SYCL are just a means for Intel to achieve their ends (DPC++) ...

    OpenCL is on its deathbed. Hopefully, the alternatives such as DPC++ or HIP turn out to be a better portable solution in the end.

    "Community extensions"? All lies, and not that useful compared to contributing to true community projects like Mesa's Clover stack ...
     
  10. JoeJ

    Veteran Newcomer

    Joined:
    Apr 1, 2018
    Messages:
    1,169
    Likes Received:
    1,354
    I expected that Intel, if successful, would strengthen OpenCL, because SYCL builds on top of it. But I see it is not that simple.
    GPU programming only decreases in accessibility. The situation does not improve. Even after a decade it only becomes worse.
    I considered OpenCL to be the only vendor- and OS-independent standard, with no alternative for many like me. I don't know if my stuff has to run on NV+AMD, Win or Linux, inside a DX, VK or GL application.
    Ditching plans to port some solvers to GPU - they will remain on the multi-threaded CPU, which should be fast enough.

    Well, I want vendor-specific gfx APIs and a vendor-independent general compute API. But I get the opposite of that. :|
     
  11. xpea

    Regular Newcomer

    Joined:
    Jun 4, 2013
    Messages:
    435
    Likes Received:
    500
    ...Well, maybe for you, but not for Nvidia, as their >$1 billion datacenter revenue last quarter proves. CUDA adoption is accelerating because it offers a comprehensive set of tools and nothing else is comparable. They have spent a decade and invested billions in GPU computing, not to give a free pass to lazy competitors, but to sell their own solutions. The day someone comes up with a better HW/SW stack, they will also enjoy the benefit of their work... Isn't that the definition of capitalism, where the best wins?
     
  12. JoeJ

    Veteran Newcomer

    Joined:
    Apr 1, 2018
    Messages:
    1,169
    Likes Received:
    1,354
    After some progress my solvers are no longer the bottleneck. Porting them to GPU is no longer attractive for me actually.

    In general, NV's announcement of C++ support for Ampere is very interesting. This could convince me to ignore the vendor lock-in downside, because you can assume the code will be easy to port to a future standard and easy to maintain.
    I have not looked at this yet, and it's probably not as optimal as CUDA, but for offline applications it's good enough and will improve. I also would not complain about missing features on the C++ side.
    That's interesting for many, I guess. :)
     
  13. Ext3h

    Regular Newcomer

    Joined:
    Sep 4, 2015
    Messages:
    417
    Likes Received:
    475
    What do you even mean by that? Effectively they are only once more raising the supported C++ standard in CUDA code, and adding some syntactic sugar for transforming an STL parallel for into kernel launches, which merely saves some boilerplate code, at no discernible difference to doing that manually. And the same constraints obviously apply: you can only access modules which may go into a kernel, you need to decide on kernel sizes at compile time, etc. Full C++17-level constexpr support is arguably the most compelling part of that update.

    And it only sounds nice until you realize that this is still tied to using memory allocations made explicitly against CUDA, which makes you painfully aware that it's still just implicit copies and/or pinned memory under the hood. That heavily influences your application design, as you struggle with the resulting latency and utilization issues, turning this, regardless of the API, into an almost entirely implementation-specific schedule optimization problem.

    Followed by the realization that you still need to explicitly use LDS and intrinsics to get any reasonable performance for most tasks, at which point calling that stuff "future standard" or "vendor independent" already becomes moot, as not even the boilerplate code you had saved earlier actually goes away. There is still no such thing as compilers doing any form of inter-thread optimizations even just within a wavefront, or any other form of architecture specific optimization even remotely close to what you get on the CPU side.

    In the end, this isn't vendor independent by any means. It may look similar, and individual vendors will try to provide "conversion tools", but those only help you as far as getting a "working prototype". Because that's the point of no return at which upper management is usually convinced to proceed from "evaluation" to "conversion". And then you get to repeat the dreadful cycle of vendor-specific optimizations until you have once again reached the previous performance level. AMD with HIP and Intel with DPC++ know very well that their tools are only good for a rapid prototyping proof of concept.

    Even Intel's "oneAPI" concept is just a beautiful lie, built on the idea that while in the early prototype stage you can easily switch between different hardware stacks to find the one best matching your performance/price targets. Once you've done that, you still end up with a system which is only hypothetically portable, but in reality tailored far too tightly to the actual target platform's constraints. Usually, you would have to rethink the entire memory model.
     
    #33 Ext3h, May 26, 2020
    Last edited: May 26, 2020
  14. JoeJ

    Veteran Newcomer

    Joined:
    Apr 1, 2018
    Messages:
    1,169
    Likes Received:
    1,354
    Only the things mentioned in this post: https://forum.beyond3d.com/posts/2126815/
    I don't know what it means exactly, but I'd hope it's easy to use, easy to port, and easy to reuse and share with a CPU implementation.

    The things you mention are expected, and exposing HW details like LDS or subgroups is welcome to me. I do not request C++ on GPU or hardware abstractions - I only want an alternative to something like Vulkan, just to do some minor things on the GPU that are not graphics related.
    OpenCL would be exactly this.
    But I have heard from many people who were not happy because it's too low-level for them. I guess that's a big reason for the bad adoption. People who are serious about GPU development either use CUDA or work on games. The rest are waiting until the end of time for GPUs to become like CPUs, it seems.
     
  15. Lurkmass

    Regular Newcomer

    Joined:
    Mar 3, 2020
    Messages:
    320
    Likes Received:
    357
    I know what NVC++ is ...

    It's another non-standard C++ compiler, but the major takeaway is that you can now write C++17 parallel STL algorithms and execute them on GPUs. The NVC++ compiler still violates several standard C++ conventions, so it's not a proper C++ implementation, which may mean that only a subset of standard C++ will be supported.

    An Nvidia representative still recommends that you use the NVCC compiler and write CUDA C++ if you want the most optimized kernels for their hardware, which is no different from the other vendors' recommendations. That means using HCC/HIP C++ for AMD and the oneAPI compiler/DPC++ for Intel if you also want the highest performance.
     
  16. DmitryKo

    Regular

    Joined:
    Feb 26, 2002
    Messages:
    902
    Likes Received:
    1,076
    Location:
    55°38′33″ N, 37°28′37″ E
    Please be careful with your wishes. If you look at the ISO C++ committee draft unified executors proposal P0443 / P0761 and note Michael Garland and Jared Hoberock as principal contributors - would you be willing to exchange current CUDA conventions for these 'standard C++23' conventions, beyond using their future heterogeneous implementation of parallel STL?

    AMD and Intel have a hardware advantage here, since their datacenter solutions should be fully heterogeneous with cache coherent unified memory access over fast interconnect buses. This allows for further compiler and runtime library optimisations.

    Portability doesn't imply performance or market viability - even if your code can successfully run on a $200,000 embedded HPC board with six high-end GPUs down to a $50 mobile phone, what kind of end product would need to target both of these platforms?

    The competition is already there, and offers source code compatibility with CUDA.

    The CUDA toolkit has had a C++ compiler since version 1.0, and libcu++ is not a new development either.
     
    #36 DmitryKo, May 29, 2020
    Last edited: May 29, 2020
  17. DmitryKo

    Regular

    Joined:
    Feb 26, 2002
    Messages:
    902
    Likes Received:
    1,076
    Location:
    55°38′33″ N, 37°28′37″ E
    Suddenly, NVIDIA released Windows/Linux R465 drivers which support OpenCL 3.0 (i.e. baseline OpenCL 1.2 with a few optional features from OpenCL 2.x):
    https://developer.nvidia.com/blog/nvidia-is-now-opencl-3-0-conformant/
    There are also OpenCL 3.0 implementations from Intel (Linux/Windows) and Imagination (Linux ARM):
    https://www.khronos.org/conformance/adopters/conformant-products/opencl
     
    #37 DmitryKo, Apr 19, 2021
    Last edited: Apr 19, 2021
    Malo likes this.
  18. JoeJ

    Veteran Newcomer

    Joined:
    Apr 1, 2018
    Messages:
    1,169
    Likes Received:
    1,354
    Pretty cool, although no device-side enqueue, ofc.
    I'd hope AMD would up(or down :)grade to 3.0 too. After they dropped CPU support, I considered CL to be dead for them. I'm not sure if I can rely on future GPU support, so not sure if I should use easy CL or rather bite into using clumsy Vulkan for minor GPGPU stuff.
     
  19. Lurkmass

    Regular Newcomer

    Joined:
    Mar 3, 2020
    Messages:
    320
    Likes Received:
    357
    AMD has no plans to submit their future products for OpenCL conformance testing anymore, so OpenCL is as good as dead for them ... (OpenCL will stop getting bug fixes in a year or two and will be left to rot in their drivers after that)

    OpenCL was amazingly inconsistent, since its source language had so many unresolved corner cases that would lead to tons of trivial bugs popping up in different driver implementations. A SPIR/SPIR-V kernel compiler could have fixed some of the unpredictable behaviours. Features like "C++ for OpenCL" were intended to work with SPIR-V, and virtually no one but Intel is interested in making an OpenCL SPIR-V compiler ...

    Vulkan is getting better to use, since we just recently got pointers in SPIR-V shaders, but they still don't support unstructured control flow ...
     
    DavidGraham and JoeJ like this.
  20. DmitryKo

    Regular

    Joined:
    Feb 26, 2002
    Messages:
    902
    Likes Received:
    1,076
    Location:
    55°38′33″ N, 37°28′37″ E
    This is your personal interpretation of their usual 'there are no immediate plans to support...' mantra. AMD is still devoting their resources to delivering ROCm 4.x software HPC stack to be used in HP/Cray supercomputers, and ROCm specifically targets Radeon Instinct series accelerators (i.e. Fiji, Polaris, Vega and CDNA) and binary compilation only. The ROCm OpenCL runtime was recently updated to v2.2, though SPIR-V is not supported by underlying ROCm compiler infrastructure.
     
    JoeJ likes this.