OpenCL 3.0 [2020]

Discussion in 'Rendering Technology and APIs' started by DmitryKo, Apr 27, 2020.

  1. Lurkmass

    Lurkmass Regular

    After not submitting a conformance test in over 5 years, I don't think they can legally advertise OpenCL support on any of their new products. They used to support SPIR 1.2/OpenCL on gfx8 in their older drivers, but not anymore, so OpenCL is probably as good as dead on AMD ... (the ROCm OpenCL driver is a disaster when we take a look at Blender Cycles/Darktable/GIMP/DaVinci Resolve/Autodesk Maya/SideFX Houdini, since it doesn't work with any of these apps)

    I find it hard to believe that any of HP/Cray's customers would use OpenCL only to be able to run code exclusively on a single platform. SYCL is virtually useless without a SPIR-V kernel compiler as well, so projects like hipSYCL will eventually hit the same roadblocks that other OpenCL driver implementations did, preventing it from being production ready ...
     
  2. DmitryKo

    DmitryKo Regular

    SYCL 2020 is certainly not limited to OpenCL / SPIR(-V). There can be multiple back-ends that output LLVM bytecode (such as OpenCL 1.x SPIR, Vulkan/OpenCL 2.x SPIR-V, or CUDA PTX), native GPU binary code, or transcoded source code such as OpenCL C or HIP (which is how hipSYCL on ROCm works). This was one of the significant new features in this release.
    SPIR-V is not required for native binary code compilation either. The OpenCL/SPIR-V target is a design choice made by Codeplay and Intel DPC++ to support multiple CPU, GPU and FPGA architectures.


    Unannounced plans are, well, unannounced. I doubt even AMD's own HPC team knows at this point whether they are going to support current revisions of OpenCL, SPIR-V or SYCL; these decisions could only be made after they ship the complete ROCm software stack.

    As for the quality of AMD driver implementations, unfortunately, it's hardly news. AMD's regular Adrenalin OpenCL driver does not work with the Blender Cycles renderer on Polaris 11 (GCN4) cards either, while the same driver works fine on Vega (GCN5) and Navi 10 (RDNA) cards...
     
    Last edited: Apr 20, 2021
  3. Lurkmass

    Lurkmass Regular

    Blender has announced that they'll be dropping their OpenCL backend in their Cycles rendering engine ...
     
  4. DmitryKo

    DmitryKo Regular

    Yep, Cycles-X for Blender 3.0 will be based exclusively on OptiX and CUDA. They're also considering SYCL, HIP and Metal, but those won't come in the initial release.

    GROMACS, the molecular dynamics package, is also going to deprecate OpenCL in its GPU back-end and rebase the code on SYCL and CUDA - see their IWOCL 2021 session video (4:30) and slides (page 7).

    It's not going to be a problem for Intel - their DPC++ framework is essentially SYCL 2020, and it works down to UHD500 (gen9, Skylake). As for AMD, they still have a full year ahead to make up their minds about their priorities in API support; I guess they could at least update their OpenCL drivers with the latest LLVM/Clang infrastructure to support SPIR-V and 'C++ for OpenCL', so that third-party SYCL runtimes like hipSYCL could target existing AMD hardware (not just the Radeon Instinct series on Linux).


    BTW IWOCL 2021 didn't have major announcements this year; OpenCL WG session (video and slides) and SYCL WG session (video and slides) were basically a retelling of the 2020 presentations above, but there were some sessions on implementing C++20 support and/or C++ standard libraries (libcxx).
     
    Last edited: May 25, 2021
    BRiT likes this.
  5. Lurkmass

    Lurkmass Regular

    AMD were considering releasing an OpenCL extension for ray tracing. They were also experimenting with HIP for Blender's Cycles renderer, and Brecht disagrees with AMD's suggestion to do offline compilation for rendering kernels because he thinks it's too expensive. Metal is unlikely to be a feasible target for Cycles, since its toy shading language doesn't properly support pointers or unstructured control flow. Metal doesn't have a single-source programming model like CUDA or HIP either, so Metal shaders would need to live in a separate file ...

    As for GROMACS, I think they'll be disappointed in the end if they truly expect to "write once and run anywhere" (between AMD/Intel) with SYCL. They are already struggling with performance on more complex kernels, needing vendor-specific code branches, and are seeing large regressions with hipSYCL for a very early implementation of their SYCL backend. I think the team should consider making an abstraction over other potential GPU backends like DPC++ and HIP, including CUDA as well, if they want to get the most performance out of all vendors ...

    AMD have been committed to sticking with their HIP API, with no signs of changing; making an OpenCL SPIR-V compiler would be a monumental task with the end result being lower performance and more maintenance, which is why they don't favour that approach. This direction could change depending on whether other vendors (ideally Nvidia) are willing to implement an OpenCL SPIR-V compiler and OpenCL 2.x too. They have also started shipping their HIP libraries and runtimes on Windows ...
     
    Krteq and BRiT like this.
  6. Alessio1989

    Alessio1989 Regular

    AMD struggles to invest in GPGPU on Windows despite it being a big market (Autodesk and Adobe software are just a small slice of the cake); thanks to WSL something is finally moving... But investing in a butchered dead horse like OpenCL does not make any sense anymore.
     
  7. DmitryKo

    DmitryKo Regular

    I don't really buy it that intermediate representations like LLVM / SPIR / DXIL and SPIR-V would result in lower GPU performance. The recommended object code format in CUDA is PTX (Parallel Thread Execution), [EDIT] an abstracted machine code format, but it's still translated from LLVM IR by the NVPTX / NVVM back-ends in Clang/LLVM and NVCC.

    https://docs.nvidia.com/cuda/nvvm-ir-spec/index.html
    https://docs.nvidia.com/cuda/parallel-thread-execution/index.html#instruction-set

    (LLVM IR is an assembly language for an abstracted general-purpose central processor; SPIR-V is arguably an even higher-level abstraction, suitable for GPU parallel processing. Both are full equivalents of your C/C++, GLSL, HLSL etc. source code, converted to a translator-friendly, machine-readable binary format.)


    Khronos Group maintains an open-source bidirectional SPIR-V to LLVM translator, a SPIR-V back-end for LLVM, the SPIR-V Cross converter (to HLSL/GLSL/Metal), etc. LLVM maintains the Multi-Level IR Compiler Framework (MLIR), which supports SPIR-V to LLVM conversion, and the AMDGPU back-end. AMD just needs to devote more developer resources to these open-source projects, rather than starting their own implementations and abandoning them soon after.

    https://github.com/KhronosGroup/SPIRV-LLVM-Translator
    https://github.com/KhronosGroup/SPIRV-Cross
    https://github.com/KhronosGroup/SPIR/

    https://mlir.llvm.org/docs/SPIRVToLLVMDialectConversion/
    https://mlir.llvm.org/docs/Dialects/SPIR-V/

    https://github.com/KhronosGroup/LLVM-SPIRV-Backend
    https://www.phoronix.com/scan.php?page=news_item&px=Intel-2021-LLVM-SPIR-V-Backend

    https://llvm.org/docs/AMDGPUUsage.html
    https://rocmdocs.amd.com/en/latest/ROCm_Compiler_SDK/ROCm-Native-ISA.html#memory-model

    amdhip64.dll has been included with Radeon Adrenalin drivers since 2019, however ROCm/HIP development is still not supported on Windows, and the HIPCC compiler officially only supports Radeon Instinct MI series, i.e. GCN4 (Polaris 11), GCN5 (Vega) and CDNA.

    https://github.com/ROCm-Developer-Tools/HIP/commit/e2bf34cd5e6444ee04adae7c0c496bf52cff4f31
    https://github.com/illuhad/hipSYCL/issues/78#issuecomment-582810780
    https://github.com/ROCm-Developer-Tools/HIP/issues/84


    GROMACS always based their GPGPU path on CUDA. Their IWOCL 2021 presentation says that support for OpenCL 1.x was implemented on top of their pre-existing CUDA abstractions, and so is support for SYCL 2020 and DPC++.

    AFAIK, AMD's HIPCC compiler also supports OpenCL C 2.x, so it can be made to support C++ for OpenCL, and hipSYCL would be better off directly targeting the OpenCL dialect of C++, rather than jumping through hoops trying to abstract SYCL/OpenCL on top of HIP/CUDA...

    AMD just need to take the OptiX API and call it HippiX.

    https://developer.nvidia.com/blog/how-to-get-started-with-optix-7/

    [EDIT] Clarify that PTX is an abstracted assembly code format
     
    Last edited: May 25, 2021
    Ethatron, Krteq and BRiT like this.
  8. Lurkmass

    Lurkmass Regular

    There's a big difference between SPIR-V and PTX. SPIR-V is designed by a committee where every participating member has to compromise and expose only the features that all vendors can support, which constrains compiler design. The specification for PTX is solely controlled by Nvidia and is designed to be forward compatible with future Nvidia HW; even better, every new iteration of PTX doesn't have to retain backwards compatibility with their older HW either. Your example between SPIR-V and PTX only underlines the apparent tradeoff that compiler designers have to make. Pick your poison (SPIR-V/portability or PTX/performance/simplicity) ...

    Even if SPIR-V doesn't have a performance deficit, it doesn't mean it'll have the same maintenance cost as a compiler that only emits native code. Making an offline compiler takes far less effort and is less error-prone, while making a compiler for an IR takes more resources and introduces more bugs in the process ...

    SPIR-V Cross doesn't support the OpenCL dialect of SPIR-V at all, which has features like pointers to local and private memory or unstructured control flow, along with several other things. Most vendors only support SPIR-V's shader capabilities, so they can't do any of the fun things that are exclusive to its kernel capabilities that we see in OpenCL. None of these projects are relevant to implementing a SPIR-V compiler with kernel capabilities, and AMD aren't avoiding these projects because they want to; it's because making this SPIR-V compiler is too much work for virtually no return. LLVM is just an infrastructure for a collection of different backends. AMD would still have to implement a SPIR-V compiler there ...

    AMD is using LLVM for its HIP-Clang backend, but they aren't going to make a SPIR-V backend. Same goes for Nvidia: they have a CUDA-Clang backend for LLVM, but there's no SPIR-V backend for them either. Only Intel has a SPIR-V compiler in one of the LLVM backends. It's amazing how LLVM can be used by many different frontends (CUDA/DPC++/HIP) to target their unique backends (GCN/PTX/SPIR-V), so LLVM alone isn't going to get us closer to portability ...

    HCC has been deprecated in favour of the HIP-Clang compiler and there is a project that advertises the usage of ROCm on Windows ...

    They should go beyond just their CUDA abstraction. SYCL doesn't have much of a future outside of Intel, and even then DPC++ is their superior version of SYCL. Even the GROMACS team won't dare use SYCL (PTX backend) on Nvidia HW, since they don't totally buy into its portability claims either, so I think they're bound to learn this the really hard way when experimenting with SYCL for AMD/Intel ...

    AMD HCC only supports offline compilation into GCN bytecode. What good are all these source languages if vendors can't agree on one IR to rule all implementations? You do realize that both ARM and x86 support C++ as a source language, but they are by no means consistent with each other even given the same C++ code or compiler. The Clang compiler can detect undefined behaviour in C++ code for this express purpose ...
     
    Krteq and BRiT like this.
  9. DmitryKo

    DmitryKo Regular

    SPIR-V is nothing like a common set of machine instructions to be supported by all vendors. It's rather an intermediate representation for general operations / functions / variables in OpenCL C, 'C++ for OpenCL', and GLSL/HLSL. Each generic operation is translated to specific machine code sequence according to the capabilities of specific vendor's hardware.

    HSAIL format was indeed similar to RISC assembly code, just like PTX bytecode which implements an abstract ISA.

    NVIDIA simply captured a certain version of the LLVM framework (LLVM IR couldn't work as a redistributable binary if they didn't 'freeze' the specs; it's a moving target that would subtly change with each major release of LLVM) and mapped it to their device-independent PTX bytecode.

    And just like SPIR 1.0/2.0 and DXIL, NVVM implements a restricted subset of LLVM, omitting features only required for certain processors and operating systems which would make no sense for GPGPU programming.

    They don't need to directly translate SPIR-V to machine code. SPIR-V can be mapped to LLVM IR, so AMD can either use the bidirectional SPIR-V / LLVM translator from Khronos or the LLVM MLIR framework (used in the TensorFlow runtime) to convert SPIR-V to LLVM, and then translate from LLVM to machine code.

    Of course it will. Intel is working on the Khronos open-source SPIR-V back-end and the DPC++ front-end, so Clang/LLVM will be able to produce SPIR-V redistributables (in addition to CUDA PTX and GCN machine code) from OpenCL C 2.0, 'C++ for OpenCL' and SYCL/DPC++ source code.
    https://clang.llvm.org/docs/SYCLSupport.html
    https://clang.llvm.org/docs/OpenCLSupport.html

    All these vendors are using standard Clang for GPGPU, which supports their preferred distribution formats fairly well. I'd rather see them implement all the different C++ abstractions (CUDA/HIP, 'C++ for OpenCL', and SYCL/DPC++) and make them work with the MLIR SPIR-V dialect, and eventually standard C++23/26 language constructs (i.e. executors)...


    AFAIK it's not bytecode, it's machine language (GCN assembly), since the HIP/ROCm compiler is built on the AMDGPU back-end.

    HIPCC is still the same Clang/LLVM based compiler, though with a different set of C++ template libraries.

    So far most DPC++ extensions ended up in a recent SYCL release, and Intel has open-sourced the DPC++ front-end.

    It is both a compiler infrastructure and an intermediate language specification shared between front- and back-end layers.

    There is no such dilemma, portability and performance are independent variables.

    So I said, 'translator to HLSL/GLSL/Metal'.
     
    Last edited: May 25, 2021
    Krteq, Alessio1989 and BRiT like this.
  10. Lurkmass

    Lurkmass Regular

    SPIR-V is a binary format that needs to be supported by all participating vendors regardless of their native HW instruction format ...

    You're making this more confusing than it has to be. LLVM IR is straight up not designed to be ingested by drivers and is meant solely for internal usage by the Clang compiler. No driver will flat out accept LLVM IR. Meanwhile, GCN/PTX/SPIR-V kernels are intended for driver consumption ...

    Yes, they do, and don't be silly: no drivers will accept either LLVM or MLIR. Their only supported endpoints are GCN/PTX/SPIR-V, depending on the driver. The magic doesn't happen in the compilers like you seem to think; the magic happens in their drivers ...

    Congrats on Intel making their SPIR-V compiler public which only works with their HW ? AMD and Nvidia still have to make SPIR-V compilers for their own hardware if they care about interoperability or portability ...

    AMD and Nvidia still aren't going to make an MLIR or SPIR-V kernel compiler. Source languages like the CUDA kernel language are tied tightly to a specific hardware vendor, so it's useless for other vendors to support them: programs written in them won't run properly on other HW, because they rely on behaviour specific to hardware other than their own. There's a very good reason why AMD developed their own source language (HIP) instead of adopting an existing one like CUDA or the others: no existing language could simultaneously meet all of AMD's goals, such as exposing a portable subset of CUDA, exposing low-level features of their HW, and running properly on both AMD and Nvidia hardware. Here's the obvious rundown on each source language ...

    CUDA: existing kernels don't work properly on AMD HW
    C++ for OpenCL: too many design differences compared to CUDA
    SYCL: ditto from the above
    DPC++: likely won't work properly on AMD HW

    Just because they use the same compiler doesn't mean that they'll be portable. Remember what I said about undefined behaviours in C++ ?

    Means very little in practice. Tons of Vulkan extensions end up in the core Vulkan API, but they still remain optional capabilities. SYCL might as well not have a committee behind it at all, because no other vendor officially supports it! I guess whenever AMD or Nvidia get serious about supporting SYCL, they'll just revert the requirements imposed by Intel, like they did recently for OpenCL ...

    It's mostly a compiler infrastructure and drivers will never ingest LLVM IR ...

    AMD and Nvidia don't share your opinion. Their backends in the Clang compiler only emit GCN or PTX code. If they truly believed you, AMD/Nvidia wouldn't support exposing low-level features like inline GCN/PTX assembly in their source languages, as seen in CUDA/HIP. Your response also doesn't address the other fundamental issue with vendor-agnostic IRs: maintenance cost ...

    SPIR-V Cross doesn't do that either. The purpose behind SPIR-V Cross is to convert one shader format into another shader format. It is impossible to convert OpenCL/SPIR-V compute kernels into less flexible shader formats ...
     
  11. DmitryKo

    DmitryKo Regular

    Of course LLVM and MLIR are designed for external use just like SPIR-V. They all define both a text-based 'assembly language' / 'Toy language' and a 'bytecode/bitcode' binary interchange format.

    NVVM, SPIR 2.x (the OpenCL version preceding SPIR-V) and DXIL are directly based on LLVM IR - all of these specs simply define which subset of the respective LLVM spec they support, excluding features like DLL entry points which only make sense for CPUs.


    Again, PTX binaries do not contain machine code; it's just assembly bytecode for some abstracted ISA.

    CUDA also allows binary code objects (cubin) which do include machine code (presented as SASS assembly language in the tools), just like OpenCL and HIP; but you can only run it on the specific architecture chosen at compile time - again, just like OpenCL and HIP.
    https://docs.nvidia.com/cuda/cuda-binary-utilities/index.html#instruction-set-ref


    So far everyone is using PTX since it offers both portability and performance, owing to the quality of NVidia's proprietary machine code compiler.


    HIP does not have any 'low level features' specific to AMD hardware, it can be directly compiled to CUDA PTX using the native NVCC compiler.

    HIP is just a subset of CUDA 8.x features implemented on top of AMD's open-source OpenCL 2.x driver infrastructure for GCN+ GPUs. AMD just took what they could implement in a viable timeframe, rather than including each and every API and language feature.

    I understand your point that each vendor needs to fully implement the latest version of CUDA (and design their hardware to resemble latest NVidia GPUs as well).

    In reality we will still continue to have different hardware designs and APIs which are slowly converging to a common core, obviously much influenced by CUDA.



    It's just word juggling. There are components/libraries to translate source or intermediary code into machine instructions - call them compiler, driver, compiler driver, or anything you like.

    The translator doesn't use Vulkan SPIR-V extensions until instructed. The back-end would use OpenCL C/C++ model, which doesn't define SPIR-V extensions at all.

    Not a compiler's fault, programmers have to avoid undefined behaviour in the first place.

    Vendors only need to support OpenCL C source code, which the 'Big Three' do.

    Same cost as maintaining the Clang/LLVM compiler infrastructure and the AMDGPU back-end as used by ROCm.

    Yes, though they also tried to support OpenCL but that path was deprecated. Did I have to copy the entire project description?
     
    Last edited: May 25, 2021
    BRiT likes this.
  12. Lurkmass

    Lurkmass Regular

    I'm pretty sure Nvidia designed PTX to be closer to SASS than to LLVM IR. LLVM IR is straight up sub-optimal stuff when we take a look at AMDVLK's open-source SPIR-V shader compiler, which causes very long compilation times! AMDVLK-Pro's proprietary shader compiler or RADV's ACO shader compiler, on the other hand, have super fast compilation times since they aren't based on LLVM at all ...

    PTX can be described as a "pseudo-assembly language", and the line gets even blurrier when we talk about cases like x86 ISA implementations. You can make an assembler that only works on AMD or Intel processors exclusively in several cases because of these properties, so do we now start calling the x86 ISA an IR instead of an assembly language? ;) (Zen 1/2 microcoding every PEXT/PDEP opcode is one example)

    Also, only Nvidia uses PTX, and it only offers portability between Nvidia hardware, so no other hardware vendor can implement PTX with optimal performance ... (neither AMD nor Intel will support PTX ingestion in their drivers)

    HIP kernels absolutely support inline assembly ...

    There's a clear difference between a compiler and the driver. A compiler can produce code for driver ingestion, but the driver always possesses the exclusive capability to launch shaders or kernels for execution based on the ingested code ...

    So you're arguing that we should just blindly trust the programmers to somehow get it right? I hope you realize that years of data with standards like OpenGL/GLSL and OpenCL/C suggest this is not the case. GLSL compilers prior to Vulkan were never consistent across drivers, despite a massive specification backing the source language. Meanwhile on Direct3D, HLSL is a mess with no specification, but the bytecode (DXBC/DXIL) is well defined, which made it far more successful at portability. Khronos Group took a page out of Microsoft's book and standardized an IR (SPIR-V) for Vulkan GLSL so that driver implementations didn't have to roll their own GLSL compilers ...

    It's industry consensus that we reap more benefits in terms of interoperability and consistency if we standardize the IR over the source language so models like OpenGL/GLSL or OpenCL/C aren't going to be sustainable in the future ...

    Are you implying that we should implement SYCL over OpenCL C? Well, I can't see that being helpful at all, since SYCL doesn't address the underlying problem, which is OpenCL C itself being a failure, as I briefed on before. There's very little confidence to be had in SYCL when it has to rely on subpar OpenCL implementations ...

    I think you have it mixed up. It's AMDGPU that was lucky enough to reuse the work of the ROCm team. If it weren't for the ROCm team's contributions to LLVM, the AMDGPU kernel driver wouldn't even have an open-source shader compiler, so I fail to see how it'd be the "same cost" to maintain yet another additional format (SPIR-V compute kernels) ...

    Just checking if we're on the same page since unnecessary ambiguity is being introduced ...
     
  13. DmitryKo

    DmitryKo Regular

    I stand corrected - PTX is indeed a machine code format which implements an abstracted massively parallel ISA, so it's closer to the actual machine code than LLVM / NVVM intermediate representations.
    https://docs.nvidia.com/cuda/parallel-thread-execution/index.html
    Sorry for the confusion on my part.

    PTX assembly is still generated from C++ source via LLVM / NVVM intermediate code in the NVPTX back-end, and all optimisation happens at the front-end / mid-end level. PTX is just simpler to convert into actual machine code at runtime.

    It makes no difference whether you translate from C/ C++ source code or LLVM / SPIR-V / DXIL etc. intermediate representation - the effects of having (mostly unintentional) undefined behaviour in your source code will still be present.

    Undefined behaviour is not a compiler or vendor driver bug, it's just how C-derived languages process obvious programming errors - like uninitialised variables and memory, signed integer overflow and division by zero, dereferencing null or invalid pointers, array/buffer out-of-bounds errors and stack overflow, unsafe pointer casting between incompatible types (like floats to integers), etc.

    The programmer was supposed to prevent these errors from happening, and modern compilers are allowed to assume that undefined behaviour will never happen; they are not required to implement additional safety checks or to ensure code portability across different architectures.

    LLVM IR does not impose type / memory safety either, just like any other assembly language.

    https://blog.llvm.org/2011/05/what-every-c-programmer-should-know.html
    https://doc.rust-lang.org/reference/behavior-considered-undefined.html


    This part of the C standard is definitely not looking pretty today after all the security exploits caused by buffer overruns, but at this point the situation can only be improved by moving to a safe (or safer) language standard - like modern dialects of C++ which greatly improve code safety by preferring STL classes and iterators over legacy constructs like arrays and pointers, or new type-safe and memory-safe dialects like Rust.

    That's the reason why I keep stressing the importance of implementing modern C++20/23 features for GPGPU programming.


    Inline assembly is hardly a CUDA/HIP 'feature' - it's rather a way to hack through compiler/optimiser inefficiencies with hand-written code (and GCNx/xDNA machine ISA changed slightly with each new generation so it's not even 100% portable).

    I do wonder if AMD ever considered abstracted assembly code as an object format... I hope they have SPIR-V in the long term plans, but their graphics team always seems to be several years late on the software release schedule.


    SYCL was designed right from the start to be mapped into OpenCL C for device code. Basically it defines a 3rd party development environment which provides a Clang/LLVM based compiler and a platform-specific runtime library, and the video card driver only has to support OpenCL C (and optionally SPIR-V, PTX, HIP, and C++ for OpenCL, depending on the implementation).

    I got your idea already, but if you really expected Khronos to fully embrace CUDA and rename it, say, CUCL (pun intended) just like they did with Mantle and Vulkan, sorry it won't happen.

    AMDGPU back-end is all they need to maintain. SPIR-V will be translated into LLVM IR bytecode to be consumed by the AMDGPU back-end and rewritten into AMDGCN intrinsics - this is functionally the same as compiling OpenCL C/C++ source code.
    https://github.com/ROCm-Developer-T...main/llvm/include/llvm/IR/IntrinsicsAMDGPU.td

    Unless the driver includes a compiler.

    This specific issue wasn't about GLSL compiler and undefined behaviour in C.


    So which part did I mix up?
     
    Last edited: Jun 3, 2021
    BRiT likes this.
  14. Lurkmass

    Lurkmass Regular

    In practice, there's arguably a lot less undefined behaviour when compiling from an intermediate representation compared to compiling from the source language itself ...

    Sure enough, the bolded part shows why OpenCL C was doomed. Making portability and implementation details the programmer's responsibility was a mistake if vendors are allowed to mostly do whatever they want with their own OpenCL C compilers ...

    At the end of the day we're still left with the big elephant in the room which is OpenCL C itself with no widely accepted alternatives ...

    LLVM IR is mostly used for source-to-binary translation, if we take a look at oneAPI. LLVM is arguably far more comparable to glslang or even AMDIL than to a binary format for driver ingestion. Even when LLVM IR is used for binary-to-binary translation, as we see in AMDVLK, the driver only accepts SPIR-V binaries while LLVM IR is kept exclusively for internal driver use. The same concept applies to DXBC, where AMDIL is used solely for transformation purposes in D3D drivers ...

    I don't know how I can make this any clearer to you, but drivers will absolutely not accept LLVM IR for ingestion. LLVM IR has only ever been used to translate SPIR-V shaders, never SPIR-V kernels. The vast majority of LLVM's usage for compute stems from the fact that vendors are using its infrastructure to translate their own specific source languages (CUDA/DPC++/HIP) into driver-supported formats (GCN/PTX/SPIR-V). With these facts in place, it should be clear to you by now why AMD would need to make a dedicated internal driver compiler for SPIR-V compute kernels, just as Intel had to for oneAPI ...

    I think you're missing the entire picture. Relying on programmers to work around undefined behaviour or implementation-specific details was a big mistake in OpenGL/OpenCL, and making every vendor create their own compiler for a source language was another huge setback, which is why those standards met their demise. Khronos Group had to learn the hard way that neither the programmer nor source-language compilation could be trusted. We've had years of data to concede that OpenCL's model won't work for portability ...

    Well, for starters, ROCm mostly runs on totally different parts of the AMDGPU kernel driver. The work that goes solely into the graphics side of AMDGPU doesn't really help ROCm. Also, it's unlikely that the ROCm team can use LLVM to translate SPIR-V compute kernels to GCN code, as I've elaborated above. It's virtually impossible for it to be the "same cost" to support SPIR-V compute kernels for driver ingestion when it's pretty clear that the ROCm team would need to do lots of extra work to add this capability ...
     
    Last edited: May 28, 2021
  15. JoeJ

    JoeJ Veteran

    But then C or C++ would be doomed as well.
    What would be some examples of problems that often arise?

    I mean, things like those:
    ... are not really a problem. We are all used to avoiding them. The only one we sometimes have to abuse is casting float to int, e.g. to do an atomic max while floating-point atomics are still not supported.
     
  16. Lurkmass

    Lurkmass Regular

    Not all source languages share the goal of portability. Sometimes abusing undefined behavior means gaining higher performance. C++ has its uses outside of portability, since it wasn't designed very well for that purpose. Java is portable by design, but it imposes performance limitations ...

    Unfortunately, we don't have any options for portable compute APIs/languages ... (OpenCL didn't deliver on this promise, and all relevant vendors have better options in terms of features/performance)
     
  17. DmitryKo

    DmitryKo Regular

    Again, these are implementation details for processing obvious programming errors.

    This is from the current draft of the C2x standard,
    http://www.open-std.org/jtc1/sc22/wg14/www/projects#9899

    3.4.3
    undefined behavior

    Behavior, upon use of a nonportable or erroneous program construct or of erroneous data, for which this document imposes no requirements

    Note 1 to entry: Possible undefined behavior ranges from ignoring the situation completely with unpredictable results, to behaving during translation or program execution in a documented manner characteristic of the environment (with or without the issuance of a diagnostic message), to terminating a translation or execution (with the issuance of a diagnostic message).​


    I.e. non-portable or erroneous code can be rejected during translation, silently discarded, or translated according to some architecture-specific behavior - and that behavior can change with every different target machine architecture or new/updated programming environment.

    A Guide to Undefined Behavior in C and C++, Part 1
    https://blog.regehr.org/archives/213
    Part 2
    https://blog.regehr.org/archives/226
    Part 3
    https://blog.regehr.org/archives/232


    Only because machine code translators can decide how to process these errors on a specific machine architecture - e.g. signed integer overflow could silently wrap around to a two's complement representation, or could raise a runtime error when the overflow flag is set.
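    A small C sketch of the distinction: signed overflow is undefined, so the compiler is free to wrap, trap, or optimize on the assumption it never happens, whereas unsigned arithmetic is required by the standard to wrap modulo 2^N - which is how the "silent two's complement wraparound" outcome can be expressed portably:

    ```c
    #include <assert.h>
    #include <limits.h>

    int main(void) {
        /* INT_MAX + 1 computed in signed int would be undefined behavior.
         * Doing the addition in unsigned arithmetic is well-defined and
         * wraps; the result equals INT_MIN's representation on every
         * common two's complement target. */
        unsigned int u = (unsigned int)INT_MAX + 1u;
        assert(u == (unsigned int)INT_MIN);

        /* A checked alternative: detect the overflow before it happens,
         * so the signed addition is never actually performed. */
        int a = INT_MAX, b = 1;
        int would_overflow = (b > 0 && a > INT_MAX - b);
        assert(would_overflow == 1);
        return 0;
    }
    ```

    Compilers exploit exactly this freedom: e.g. GCC and Clang fold the signed expression `x + 1 > x` to `1`, which is only valid because overflow "cannot happen".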

    LLVM IR actually preserves undefined behavior coming from the C/C++ source, using 'undef' and 'poison' values and the 'freeze' instruction to mark potentially undefined results - see below for details:

    Taming Undefined Behavior in LLVM
    https://blog.regehr.org/archives/1496
    https://www.microsoft.com/en-us/research/publication/taming-undefined-behavior-llvm/

    https://llvm.org/docs/LangRef.html#undefined-values
    https://llvm.org/docs/LangRef.html#poison-values

    Alive2 Part 1: Introduction
    https://blog.regehr.org/archives/1722
    Alive2 Part 2: Tracking miscompilations in LLVM using its own unit tests
    https://blog.regehr.org/archives/1737
    Alive2 Part 3: Things You Can and Can’t Do with Undef in LLVM
    https://blog.regehr.org/archives/1837



    On this planet, C and C++ are portable languages which discourage writing unportable code, but do not expressly forbid it, trusting the programmer to understand the implications on their specific machine architecture.

    Relying on programmers to stop making obvious programming errors was surely a mistake ;) That's why modern C/C++ compilers include strict warning levels, code analysis, sanitizers, and debug builds/runtimes, to help the programmer discover and correct these errors.
     
    Last edited: Jun 3, 2021
    Alessio1989 and BRiT like this.
  18. DmitryKo

    DmitryKo Regular

    Just a matter of good software engineering practices.

    Previously, vendors typically forked the LLVM source tree and maintained that fork in their own proprietary repository, without upstreaming their changes to the community or downstreaming recent bugfixes and code improvements.

    Today AMD has an open-source branch in the mirror of the main LLVM tree, and maintains both upstream and downstream patches, so staying up-to-date with the latest version of Clang/LLVM takes much less effort.

    https://github.com/RadeonOpenCompute/llvm-project


    This is not correct. AMD OpenCL drivers used to support SPIR (back on the TeraScale architecture), and all WDDM 2.1+ (user-mode) drivers support DXIL. Both of these are binary formats for driver ingestion, directly based on LLVM IR bytecode (LLVM 3.2/3.4 for SPIR 1.2/2.0, LLVM 3.7 for DXIL).

    Microsoft also developed an OpenCL on Direct3D 12 layer, which translates SPIR-V kernels into DXIL compute with the help of SPIRV-LLVM Translator and SPIRV-Tools from Khronos.


    Clang/LLVM is highly modular. Clang can emit an LLVM IR .bc bitcode file with the -emit-llvm option; llc can then translate the .bc file into assembly language or machine code for any supported architecture, and lli can execute it directly via the JIT.

    You can also use the LLVM-C API or C++ classes like llvm::ExecutionEngine to perform these tasks programmatically in your application.


    I'd rather guess Intel wants that proprietary machine code translator backend to remain closed source.
    LLVM software license doesn't require vendors to publish and open-source their code.
    https://llvm.org/docs/DeveloperPolicy.html#new-llvm-project-license-framework

    I was talking about AMDGPU back-end for Clang/LLVM - what AMDGPU kernel driver had to do with that discussion?

    I disagree. That said, for the time coming they'll be pretty occupied with patching Linux to support cache-coherent GPU memory access...
     
    Last edited: Jun 3, 2021
    BRiT likes this.
  19. Lurkmass

    Lurkmass Regular

    Good practices don't outweigh bad design principles. Vendors designing their own compiler for source languages was the greatest sin committed against portability ...

    You don't see Microsoft asking each vendor to make their own HLSL compiler for Direct3D now, do you?

    LLVM IR =/= SPIR/SPIR-V (SPIR/SPIR-V being forked off of LLVM IR just introduces design divergence between them)

    You forget that SPIR/SPIR-V is developed by the Khronos Group independently of LLVM IR, so technically I am correct!

    Also, when has AMD's OpenCL driver ever supported SPIR on TeraScale? SPIR was only available in AMD drivers for gfx8 GPUs AFAIK

    OpenCL over D3D12 is probably hot garbage. I can't see how Microsoft could fully implement OpenCL C, since DXIL lacked support for pointers last time I checked ... (root descriptors don't count)

    That's not true; Intel open-sourced their offline compiler in addition to open-sourcing their SPIR-V backend in upstream LLVM ...

    When we take a look at the amount of code in the ROCm runtime versus Intel's compute runtime, the Intel runtime has roughly 200k more LOC than the ROCm equivalent. Another way to see that having a SPIR-V backend was a disaster is that Intel still doesn't have a oneAPI backend for their GPUs in TensorFlow, the most popular ML framework, so they're making less progress on outside projects ...
     
  20. Ext3h

    Ext3h Regular

    Which might have more to do with a shift in priorities. TensorFlow is predominant for research applications or for targeting NVidia's platforms, but has hardly any relevance in deployed applications for the integrated solutions which Intel has acquired for so much money. TensorFlow integration of oneAPI was supposed to pave the way for Nervana, which was terminated last year. With the switch from Nervana to Habana, there is already a fully functional TensorFlow integration outside the oneAPI family (SynapseAI). There's no good reason to even try to integrate that with the oneAPI fly trap when large-scale customers have already adopted the existing interfaces, too.

    The other leg for oneAPI is the upgrade chain CPU -> iGPU -> dGPU/FPGA, but neural networks are no longer competitive on these platform families, so this aims solely at GPGPU-like tasks / DSP nowadays. That being said, for half the target audience (the one ending up on the FPGA path, unless they make the ultimate step to an ASIC), half the assumed generalizations in this thread are not even applicable.

    And even though it has been yet another year, nothing has changed about the statement that targeting anything but the CPU still requires a strict focus on a specific target platform, and unless we are talking about NVidia here, none of the competitors actually has a uniform product lineup, but at least a duality of CPU/dGPU, or, like Intel, even an entire zoo of specialized accelerators.

    Intel having 200k more LOC doesn't even mean a thing. Development on Intel's side is still just 2 senior full-time (and a handful of part-time junior) developers hacking away for the last 2 years, and that is by no means an indicator of them running out of resources in any form. Unless you seriously suggest that Intel consists of only about 20-30 senior software developers in total. Heck, pulling in only one or two major customers means this investment in publicity has already amortized itself.

    What you can say for certain, though, is that both Intel and AMD are investing only abysmally small budgets into proper foundation work (and I'm not even talking about standardization, but their proprietary stuff), while the vast majority of developer resources are sunk straight into invisible customer-engineering projects.
    Especially Intel, which should at least be able to keep up with NVidia, if it weren't for a butchered product strategy.
    And AMD has a bad habit of getting its public projects only to a certain level of completion, and then effectively stopping development just before it would have to invest in customer engineering on open-source 3rd-party projects to bring the developed technologies to visible adoption...
     
    fellix likes this.