OpenCL 3.0 [2020]

Discussion in 'Rendering Technology and APIs' started by DmitryKo, Apr 27, 2020.

  1. DmitryKo

    Regular

    Joined:
    Feb 26, 2002
    Messages:
    902
    Likes Received:
    1,076
    Location:
    55°38′33″ N, 37°28′37″ E
    LLVM back-end is based on platform-independent bytecode, and both Vulkan/OpenCL SPIR-V bytecode and CUDA PTX bytecode are versions of LLVM bytecode. It knows nothing about high-level languages - any architecture-specific back-end will support whatever language the front-end implements.

    Intel thinks otherwise and their HPC efforts are based on SYCL and Clang/LLVM/SPIR-V, not another proprietary API and compiler. Their upcoming Xe-HPC accelerators have won the Aurora supercomputer contract from the US DOE - if they can rival NVIDIA Tesla/DGX and AMD Radeon Pro/Instinct, this could be a game changer for the API landscape.

    I'm certainly not going to ditch Windows for a server OS on my desktop, I have work to do.

    Also I'm not sure why it's Microsoft's fault when NVIDIA and AMD advertise HSA features but cannot implement a proper heterogeneous MMU with a page table entry format that supports page faults in system memory, and have to rely on a Linux-specific kernel component to fix the issues that arise. Microsoft works around it with memory residency API in DXGI and Direct3D 12 to manage large heap allocations.


    CUDA enjoys a large following but code migration tools work both ways, so native SYCL with CUDA conversion could become equally popular.

    Code analysis and optimization tools are computer-architecture specific, they are not defined in or required by any high-level language specification.

    Which severely impact application performance.

    It takes time and the specs are not final yet.


    Yes, it looks like CodeXL repository was archived and developers were directed to ROCm repository or GPUOpen.com website (which relaunches May 11 after a redesign).

    Vulkan only needs to support SPIR-V bytecode as a target for OpenCL/SYCL toolsets.
     
    #21 DmitryKo, May 9, 2020
    Last edited: May 9, 2020
  2. Lurkmass

    Regular Newcomer

    Joined:
    Mar 3, 2020
    Messages:
    320
    Likes Received:
    357
    That's not good enough - you can't assume that every LLVM back-end will support all of the features required to fully implement said languages.

    Intel's SYCL implementation might as well be another proprietary API in practice, since it's still not even classified as a conformant implementation, and DPC++ makes some fairly radical extensions to it as well.

    Well that's the trade off you have to make as most of the high-end compute work is done on Linux.

    Also, even if WDDM supports GPU page faults it still doesn't support system-wide atomics which is one of the key features to unified memory. Stuff like Infinity Fabric and NVLink relies on these mechanisms to offer advanced communication functionality.

    Codeplay's SYCL for CUDA project shows otherwise which is a fork of Intel's LLVM compiler back-end and even Intel wants to convert CUDA into an extended form of SYCL known as DPC++.

    Both AMD and Intel want developers to write CUDA so that they can write these conversion tools. It's lower maintenance for the developers and for vendors like AMD/Intel, so it's a win-win situation all around given the circumstances. The people behind ROCm and the oneAPI compute stack admire CUDA too, so they imitate it as an expression of flattery!


    Maybe CUDA should be standardized after all, since others like AMD or Intel are willing to copy it?


    If it's a part of the standard then an implementation needs to make the feature work to advertise full support.
     
  3. Lurkmass

    Regular Newcomer

    Joined:
    Mar 3, 2020
    Messages:
    320
    Likes Received:
    357
    Radeon Rays 4.0 has officially deprecated OpenCL support.
     
    xpea and DavidGraham like this.
  4. xpea

    Regular Newcomer

    Joined:
    Jun 4, 2013
    Messages:
    435
    Likes Received:
    500
    Another proof that OpenCL is dead.
    AMD is doing the right thing: develop a comprehensive set of robust APIs and frameworks for a wide range of needs. Then maybe, if they put enough effort into it, and keep working on it for several generations of hardware, they will finally be a worthy CUDA alternative...
     
  5. DmitryKo

    Regular

    Joined:
    Feb 26, 2002
    Messages:
    902
    Likes Received:
    1,076
    Location:
    55°38′33″ N, 37°28′37″ E
    According to SYCLcon 2020 presentations and tutorials, at least three of the five language extensions - unified memory access, sub-groups, and in-order queues - will be included in SYCL 2020. It will also support multiple back-ends - i.e. CUDA PTX, Vulkan SPIR-V, and C++ for OpenCL - in addition to standard OpenCL C 1.2, as well as an executable module format with several target binaries.

    SYCL 2020: More than meets the eye
    iwocl.org/wp-content/uploads/iwocl-syclcon-2020-brown-15-slides.pd
    youtu.be/MfAzQMKn-ho

    Data Parallel C++: Enhancing SYCL Through Extensions for Productivity and Performance
    iwocl.org/wp-content/uploads/wocl-syclcon-2020-brodman-22-slides.pdf
    youtu.be/V2PFduPi5QA

    SYCL Tutorial: Unified Shared Memory
    youtu.be/t69Ts6acmR0


    This is not a driver problem, it's the same MMU hardware limitation. Atomics in unified memory require a cache coherence protocol between the CPU and GPU, and NVIDIA only supports hardware coherence on NVLink 2.0 systems, like the POWER9-based supercomputers. On the x86_64 platform, CUDA's system atomics are implemented in software by migrating system memory pages to local graphics memory.

    This won't change until actual GPUs and CPUs have MMUs that both support Gen-Z/CXL, Infinity Fabric, or similar protocols that work over PCIe, with system memory fast enough to match the PCIe 5.0/6.0 bandwidth of 63/126 GByte/s (like Infinity Architecture 3 in Zen 4 EPYC CPUs and CDNA2 GPUs for the HPE/Cray El Capitan supercomputer, or CXL over PCIe 6.0 in Xe-HPC GPUs and Xeon CPUs for the above-mentioned Aurora supercomputer). So far NVIDIA has indicated no plans to implement one of these protocols, and Intel/AMD have no plans to implement NVLink or OpenCAPI.


    If changes to the LLVM infrastructure are required, compiler implementations make them in the open, like Intel does for the DPC++ tools.

    It's because HPC clusters run under Linux. Desktop PCs are hardly "high-end compute" today.

    If that was the case, Intel would simply implement the CUDA API but rename all references to avoid trademark infringement, like AMD did with HIP.

    RTTI is 20x slower than static types and always gets disabled for embedded/kernel code (the primary reason why C++ is slowly being replaced by Rust in systems programming).

    It's a different incompatible helper library, with binaries for Vulkan, Direct3D 12, and Metal.

    It had to be implemented with compute shaders right from the start.
     
    #25 DmitryKo, May 14, 2020
    Last edited: Nov 2, 2020
    BRiT likes this.
  6. JoeJ

    Veteran Newcomer

    Joined:
    Apr 1, 2018
    Messages:
    1,169
    Likes Received:
    1,354
  7. DmitryKo

    Regular

    Joined:
    Feb 26, 2002
    Messages:
    902
    Likes Received:
    1,076
    Location:
    55°38′33″ N, 37°28′37″ E
    If you click on Intel® oneAPI DPC++ Compiler and/or watch introductory videos, it does say that

    DPC++ =
    ISO C++ and
    Khronos SYCL and
    community extensions
     
    #27 DmitryKo, May 15, 2020
    Last edited: May 15, 2020
    JoeJ likes this.
  8. JoeJ

    Veteran Newcomer

    Joined:
    Apr 1, 2018
    Messages:
    1,169
    Likes Received:
    1,354
    oh, sorry then :)
    I like Intels direction and hoped it could strengthen some general compute API progress...
     
  9. Lurkmass

    Regular Newcomer

    Joined:
    Mar 3, 2020
    Messages:
    320
    Likes Received:
    357
    Precisely ...

    SYCL alone is nowhere near powerful enough to express the same feature set as CUDA, and Intel proposes several vendor-specific extensions included in DPC++ to close this gap. Intel's end goal also includes developers migrating from OpenCL to their 'portable' DPC++ standard, even though no other vendor has implemented SYCL, let alone DPC++. Intel also tried to lead others into believing that their oneAPI compute stack was some sort of industry-wide or community-driven project, but anyone can see that it's just another part of their corporate agenda. OpenCL and SYCL are just a means for Intel to achieve their ends (DPC++) ...

    OpenCL is on its deathbed. Hopefully, the alternatives such as DPC++ or HIP turn out to be a better portable solution in the end.

    "Community extensions"? All lies, and not that useful compared to contributing to true community projects like Mesa's Clover stack ...
     
  10. JoeJ

    Veteran Newcomer

    Joined:
    Apr 1, 2018
    Messages:
    1,169
    Likes Received:
    1,354
    I expected that Intel, if successful, would strengthen OpenCL, because SYCL builds on top of it. But I see it is not that simple.
    GPU programming only decreases in accessibility. The situation does not improve. Even after a decade it only becomes worse.
    I considered OpenCL to be the only vendor- and OS-independent standard, with no alternative for many like me. I don't know if my stuff has to run on NV+AMD, Win or Linux, inside a DX, VK or GL application.
    Ditching plans to port some solvers to GPU - they will remain on the multi-threaded CPU, which should be fast enough.

    Well, I want vendor-specific gfx APIs and a vendor-independent general compute API. But I get the opposite of that. :|
     
  11. xpea

    Regular Newcomer

    Joined:
    Jun 4, 2013
    Messages:
    435
    Likes Received:
    500
    ...Well, maybe for you, but not for Nvidia, as their >$1 billion datacenter revenue last quarter proves. CUDA adoption is accelerating because it offers a comprehensive set of tools and nothing else is comparable. They have spent a decade and invested billions in GPU computing, not to give a free pass to lazy competitors, but to sell their own solutions. The day someone comes up with a better HW/SW stack, they will also enjoy the benefit of their work... Isn't that the definition of capitalism, where the best wins?
     
  12. JoeJ

    Veteran Newcomer

    Joined:
    Apr 1, 2018
    Messages:
    1,169
    Likes Received:
    1,354
    After some progress my solvers are no longer the bottleneck. Porting them to GPU is no longer attractive for me actually.

    In general, NV's announcement of C++ support for Ampere is very interesting. This could convince me to ignore the vendor lock-in downside, because you can assume the code will be easy to port to a future standard and easy to maintain.
    I have not looked at this yet, and it's probably not as optimal as CUDA, but for offline applications it's good enough and will improve. I also would not complain about missing features on the C++ side.
    That's interesting for many, I guess. :)
     
  13. Ext3h

    Regular Newcomer

    Joined:
    Sep 4, 2015
    Messages:
    417
    Likes Received:
    475
    What do you even mean by that? Effectively they are only once more raising the supported C++ standard in CUDA code, and adding some syntactic sugar for transforming an STL parallel for into kernel launches, which merely saves some boilerplate code, at no discernible difference to doing that manually. And the same constraints obviously apply: you can only access modules which may go into a kernel, you need to decide on kernel sizes at compile time, etc. Full C++17-level constexpr support is arguably the most compelling part of that update.

    And it only sounds nice until you realize that this is still tied to using memory allocations made explicitly against CUDA, which makes you painfully aware that it's still just implicit copies and/or pinned memory under the hood. That heavily influences your application design, as you struggle with the resulting latency and utilization issues, turning this, regardless of the API, into an almost entirely implementation-specific schedule optimization problem.

    Followed by the realization that you still need to explicitly use LDS and intrinsics to get any reasonable performance for most tasks, at which point calling that stuff "future standard" or "vendor independent" already becomes moot, as not even the boilerplate code you had saved earlier actually goes away. There is still no such thing as compilers doing any form of inter-thread optimizations even just within a wavefront, or any other form of architecture specific optimization even remotely close to what you get on the CPU side.

    In the end, this isn't vendor independent by any means. It may look similar, and individual vendors will try to provide "conversion tools", but those only help you as far as getting a "working prototype". Because that's the point of no return at which upper management is usually convinced to proceed from "evaluation" to "conversion". And then you get to repeat the dreadful cycle of vendor-specific optimizations until you have once again reached the previous performance level. AMD with HIP and Intel with DPC++ know very well that their tools are only good for a rapid prototyping proof of concept.

    Even Intel's "oneAPI" concept is just a beautiful lie, built on the idea that while in the early prototype stage you can easily switch between different hardware stacks to find the one best matching your performance/price targets. Once you've done that, you still end up with a system which is only hypothetically portable, but in reality tailored far too tightly to the actual target platform's constraints. Usually, you would have to rethink the entire memory model.
     
    #33 Ext3h, May 26, 2020
    Last edited: May 26, 2020
  14. JoeJ

    Veteran Newcomer

    Joined:
    Apr 1, 2018
    Messages:
    1,169
    Likes Received:
    1,354
    Only the things mentioned in this post: https://forum.beyond3d.com/posts/2126815/
    I don't know what it means exactly, but I'd hope it's easy to use, easy to port, and easy to reuse and share with a CPU implementation.

    The things you mention are expected, and exposing HW details like LDS or subgroups is welcome to me. I do not request C++ on GPU or hardware abstractions - I only want an alternative to something like Vulkan, just to do some minor things on the GPU that are not graphics related.
    OpenCL would be exactly this.
    But I have heard from many people who were not happy because it's too low-level for them. I guess that's a big reason for the bad adoption. People who are serious about GPU development either use CUDA or work on games. The rest are waiting until the end of time for GPUs to become like CPUs, it seems.
     
  15. Lurkmass

    Regular Newcomer

    Joined:
    Mar 3, 2020
    Messages:
    320
    Likes Received:
    357
    I know what NVC++ is ...

    It's another non-standard C++ compiler, but the major takeaway is that you can now write C++17 parallel STL algorithms and execute them on GPUs. The NVC++ compiler still violates several standard C++ conventions, so it's not a proper C++ implementation, which may mean that only a subset of standard C++ will be supported.

    An Nvidia representative still recommends that you use the NVCC compiler and write CUDA C++ if you want the most optimized kernels for their hardware, which is no different from the other vendors' recommendations. That means using HCC/HIP C++ for AMD and the oneAPI compiler/DPC++ for Intel if you also want the highest performance.
     
  16. DmitryKo

    Regular

    Joined:
    Feb 26, 2002
    Messages:
    902
    Likes Received:
    1,076
    Location:
    55°38′33″ N, 37°28′37″ E
    Please be careful with your wishes. If you look at the ISO C++ committee draft unified executors proposal P0443 / P0761 and note Michael Garland and Jared Hoberock as principal contributors - would you be willing to exchange current CUDA conventions for these 'standard C++23' conventions, beyond using their future heterogeneous implementation of parallel STL?

    AMD and Intel have a hardware advantage here, since their datacenter solutions should be fully heterogeneous with cache coherent unified memory access over fast interconnect buses. This allows for further compiler and runtime library optimisations.

    Portability doesn't imply performance or market viability - even if your code can successfully run on a $200,000 embedded HPC board with six high-end GPUs down to a $50 mobile phone, what kind of end product would need to target both of these platforms?

    The competition is already there, and offers source code compatibility with CUDA.

    The CUDA toolkit has had a C++ compiler since version 1.0, and libcu++ is not a new development either.
     
    #36 DmitryKo, May 29, 2020
    Last edited: May 29, 2020
  17. DmitryKo

    Regular

    Joined:
    Feb 26, 2002
    Messages:
    902
    Likes Received:
    1,076
    Location:
    55°38′33″ N, 37°28′37″ E
    Suddenly, NVIDIA released Windows/Linux R465 drivers which support OpenCL 3.0 (i.e. baseline OpenCL 1.2 with a few optional features from OpenCL 2.x):
    https://developer.nvidia.com/blog/nvidia-is-now-opencl-3-0-conformant/
    There are also OpenCL 3.0 implementations from Intel (Linux/Windows) and Imagination (Linux ARM):
    https://www.khronos.org/conformance/adopters/conformant-products/opencl
     
    #37 DmitryKo, Apr 19, 2021
    Last edited: Apr 19, 2021
    Malo likes this.
  18. JoeJ

    Veteran Newcomer

    Joined:
    Apr 1, 2018
    Messages:
    1,169
    Likes Received:
    1,354
    Pretty cool, although no device-side enqueue, ofc.
    I'd hope AMD would up(or down :)grade to 3.0 too. After they dropped CPU support, I considered CL to be dead for them. I'm not sure if I can rely on future GPU support, so not sure if I should use easy CL or rather bite into using clumsy Vulkan for minor GPGPU stuff.
     
  19. Lurkmass

    Regular Newcomer

    Joined:
    Mar 3, 2020
    Messages:
    320
    Likes Received:
    357
    AMD has no plans to submit their future products for OpenCL conformance testing anymore, so OpenCL is as good as dead for them ... (OpenCL will stop getting bug fixes in a year or two and will be left to rot in their drivers after that)

    OpenCL was amazingly inconsistent, since its source language had so many unresolved corner cases that would lead to tons of trivial bugs popping up in different driver implementations. A SPIR/SPIR-V kernel compiler could have fixed some of the unpredictable behaviours. Features like "C++ for OpenCL" were intended to work with SPIR-V, and virtually no one but Intel is interested in making an OpenCL SPIR-V compiler ...

    Vulkan is getting better to use, since we just recently got pointers in SPIR-V shaders, but they still don't support unstructured control flow ...
     
    DavidGraham and JoeJ like this.
  20. DmitryKo

    Regular

    Joined:
    Feb 26, 2002
    Messages:
    902
    Likes Received:
    1,076
    Location:
    55°38′33″ N, 37°28′37″ E
    This is your personal interpretation of their usual 'there are no immediate plans to support...' mantra. AMD is still devoting their resources to delivering ROCm 4.x software HPC stack to be used in HP/Cray supercomputers, and ROCm specifically targets Radeon Instinct series accelerators (i.e. Fiji, Polaris, Vega and CDNA) and binary compilation only. The ROCm OpenCL runtime was recently updated to v2.2, though SPIR-V is not supported by underlying ROCm compiler infrastructure.
     
    JoeJ likes this.