OpenCL 3.0 [2020]

Discussion in 'Rendering Technology and APIs' started by DmitryKo, Apr 27, 2020.

  1. DmitryKo

    DmitryKo Regular

    DXIL certainly supports memory pointers, right from the first public commit.

    https://github.com/Microsoft/DirectXShaderCompiler/blob/master/docs/DXIL.rst#memory-accesses

    Indexable thread-local and groupshared variables are represented as variables and accessed via LLVM C-like pointers.... The following pointer types are supported:
    • Non-indexable thread-local variables.
    • Indexable thread-local variables (DXBC x-registers).
    • Groupshared variables (DXBC g-registers).
    • Device memory pointer.
    • Constant-buffer-like memory pointer.
    The type of DXIL pointer is differentiated by LLVM addrspace construct.
    Resource descriptor structure does include a memory pointer to the actual 'texture' data, although descriptors use opaque hardware-specific formats. When you construct UAV/SRV/CBV descriptor heaps and link them to your shaders with root descriptors and descriptor tables, these are translated to actual memory addresses for execution.


    SPIR and DXIL are each 'frozen' at a specific LLVM version, but they are still legitimate LLVM IR bitcode, which can be read even by current LLVM releases.

    DXIL uses standard LLVM assembly instructions (such as Add, FAdd, Sub, FSub, Mul, FMul, UDiv, SDiv, FDiv, etc.) and external functions which have to be implemented in LLVM assembly by each individual vendor - an example would be trigonometric functions (Cos, Sin, Tan, Acos, Asin, Atan, Hcos, Hsin, Htan, Exp, Frc, Log, Sqrt, etc.) which are expanded to Taylor series approximations in the HLSL compiler (see DxilExpandTrigIntrinsics.cpp).
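
    To illustrate what a Taylor series expansion of a trigonometric function looks like, here is a minimal C++ sketch. It is not the code from DxilExpandTrigIntrinsics.cpp: the function name taylor_sin and the default term count are assumptions made up for this example, and it only behaves well for small |x| because there is no range reduction.

```cpp
#include <cassert>
#include <cmath>

// Hypothetical helper (not part of any compiler): approximates sin(x) with
// the first `terms` terms of its Taylor series around 0:
//   x - x^3/3! + x^5/5! - ...
// Accurate only for small |x|; a real implementation would reduce the range first.
double taylor_sin(double x, int terms = 8) {
    assert(terms > 0);
    double term = x;  // current term, starts at x^1 / 1!
    double sum = x;
    for (int n = 1; n < terms; ++n) {
        // Each term is the previous one multiplied by -x^2 / ((2n)(2n+1)).
        term *= -x * x / ((2.0 * n) * (2.0 * n + 1.0));
        sum += term;
    }
    return sum;
}
```

    Calling taylor_sin(0.5) should agree closely with std::sin(0.5); the point is only that the expansion needs nothing beyond the multiply, divide and add instructions listed above.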

    SPIR also uses standard LLVM assembly instructions and data types, and only a few 'built-in' functions (see the SPIR specifications registry).


    Not sure why you have to blame Intel and SPIR-V for just about everything that's wrong in this world...

    Intel and Microsoft already had working oneDNN and DirectML forks of TensorFlow 1.1x. Their changes were not merged because TF developers started an overhaul to support multiple pluggable GPU devices - this should be available in TensorFlow 2.5, which is still not ready.
    https://github.com/tensorflow/community/pull/243#issuecomment-837383825

    Call it 'practices' or 'principles', major vendors don't bother with proprietary OpenCL C compilers anymore and maintain Clang/LLVM forks or branches instead (which offers an opportunity to move away from unsafe C-style language constructs to safe C++ STL abstractions like containers, iterators and constructors with move semantics).


    SPIR 1.2/2.0 is a subset of LLVM 3.2/3.4 (as per feature table from the Khronos SPIR page), just like DXIL 1.x is a subset of LLVM 3.7. SPIR-V is indeed defined by Khronos, but it can be mapped to LLVM IR as well.

    It was their design decision to compile DPC++/SYCL source into SPIR-V target, because they wanted to support third-party FPGA accelerators and they even acquired one (and their SPIR-V to machine code translator is also more compact in comparison to LLVM).

    AMD already implemented their machine code translator as an LLVM back-end, so they just need to support intrinsic functions and assembly instructions issued by the SPIRV-LLVM Translator.

    ROCr is just a tiny user-mode runtime. AFAIK the bulk of open-source ROCm work goes into the actual Clang/LLVM compiler front-end and AMDGPU back-end, as well as the ROCk kernel driver and the Linux kernel. Development mostly happens in a proprietary AMD repository though, and public GitHub repositories are updated with bulk commits only once in a while.
     
    Last edited: Jun 12, 2021
  2. Lurkmass

    Lurkmass Regular

    Does DXIL support pointers to global memory ? If it did then why do developers keep asking Microsoft to expose them in HLSL ?

    Descriptors may contain GPU VAs but aside from root descriptors you can't pass this information to the shaders! The resource bindings have to be done and accessed through the root signature which is strictly limited to 64 DWORDs in space and you can't do any of the fun stuff such as creating complex data structures like linked lists as you would normally expect from a real pointer ...

    If D3D12 did truly support pointers like other APIs such as OpenCL or Vulkan did then we wouldn't need painful abstractions such as root signatures and we'd be able to place pointers directly in our resources like SRV/UAV/CBVs ...

    DXIL has been "diverging" from LLVM as well since Microsoft keeps updating its specifications ...

    SPIR-V has its uses, namely being a useful portable abstraction for graphics shaders, but you and I know that it hasn't really lived up to the promise of being a portable abstraction for compute kernels ...

    SPIR-V is literally the best thing that's ever happened to the Khronos Group since it played a big role in their success in developing Vulkan and in it becoming widely adopted across the industry. Anyone would've wished for the same to happen for OpenCL/SYCL but most vendors had their own plans instead ...

    That might be true in the future because vendors plan on deprecating OpenCL! The reality was very different in the past, and still is now: projects like Blender infamously kept blaming the limitations of AMD's OpenCL C compiler for their inability to work around them, until AMD themselves had to intervene and fix the project by submitting patches, which was supposed to be the project's responsibility. OpenCL C compilers matter a lot because that's what drivers will accept, and that's what led to the failure of portability ...

    Last I checked, Intel FPGAs only supported offline compilation so I don't think they support SPIR-V yet if they even plan to ...

    Everybody is aware that ROCm is not a community project which makes it very hard for outsiders to develop patches for them ...
     
  3. DmitryKo

    DmitryKo Regular

    These abstractions exist by design: Resource Binding is a fundamental concept of Direct3D 12 (and the WDDM 2.0 GPUMMU memory management model), and this is not going to change by a GitHub request.

    HLSL (and NVIDIA's Cg) were developed at the time of Direct3D 9 and fixed function shader units with limited parameter memory - though Direct3D 11/12 shader hardware evolved to hugely multithreaded general purpose SIMD processors, SM 2.x/4.x concepts of input / output / constant memory were retained for backward compatibility, and resource views (and resource descriptors) were added to the Direct3D API (and WDDM drivers) for a limited form of random access.

    To entirely retire these abstractions, and expose unified, cache-coherent memory access to system memory and video local memory from both CPU and GPU, as would be possible in future AMD CDNA2 and Intel Xe-HPC GPUs, it would probably take a new shader model and a major Direct3D version (SM 7.0? Direct3D 13?) - maybe even a new C++ derived shader language or a single-source C++ library.

    Also remember how most of the enhancements in WDDM 2.x were actually presented at WinHEC 2006 but it took major vendors a dozen years to actually implement them in hardware and expose the benefits through new APIs (AMD Mantle and Direct3D 12).


    As for DXIL implementation details, LLVM IR supports addrspace (address space) attributes to tag different types of memory (i.e. the thread-local, indexable thread-local, group shared, constant buffer, device memory etc. pointers above), so a global virtual address space could be added in new revisions of HLSL if needed; generic resource pointers are on the roadmap, and the Vulkan SPIR-V target has been maintained as a community contribution.

    Pointer size seems to be limited to 32-bit, but the documentation is still stuck at DXIL 1.2 (SM 6.2) while the most current revision is DXIL 1.6 (SM 6.6), and experimental branches like HLSL 2021 include function template syntax from C++ 98 as well as SM 6.7.

    Root Signatures were designed to contain links to descriptor heaps/tables (UAV/SRV/CBV), which will in turn contain millions of descriptors (with Resource Binding tier 3). While you can also store root constants and root descriptors, only a limited number would fit in the size of the structure.
    https://microsoft.github.io/DirectX-Specs/d3d/ResourceBinding.html#root-signature
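
    A rough sketch of that size budget, assuming the documented costs of 1 DWORD per descriptor table, 2 DWORDs per root descriptor (a 64-bit GPU virtual address) and 1 DWORD per 32-bit root constant against the 64 DWORD limit. The helper names are hypothetical and not part of the D3D12 API.

```cpp
#include <cassert>

// Per-parameter costs in DWORDs, as documented for D3D12 root signatures.
constexpr int kRootSignatureLimit  = 64;
constexpr int kDescriptorTableCost = 1;
constexpr int kRootDescriptorCost  = 2;  // 64-bit GPU virtual address
constexpr int kRootConstantCost    = 1;  // per 32-bit constant

// Hypothetical helper: total size of a root signature layout in DWORDs.
int root_signature_dwords(int tables, int root_descriptors, int root_constants) {
    assert(tables >= 0 && root_descriptors >= 0 && root_constants >= 0);
    return tables * kDescriptorTableCost
         + root_descriptors * kRootDescriptorCost
         + root_constants * kRootConstantCost;
}

// Hypothetical helper: true if the layout fits within the 64 DWORD limit.
bool fits_root_signature(int tables, int root_descriptors, int root_constants) {
    return root_signature_dwords(tables, root_descriptors, root_constants)
           <= kRootSignatureLimit;
}
```

    For example, 2 descriptor tables, 4 root descriptors and 16 root constants add up to 26 DWORDs, while 32 root descriptors alone already consume the whole budget. That is why the bulk of the bindings has to live in descriptor tables.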

    Walking bidirectional lists in system memory with compute shaders is not a good idea IMHO. GPUs are designed to process large contiguous chunks of input data, like buffers or resources (textures), in local video memory, and random access would easily kill the performance.

    You can access system memory buffers using OpenCL 2.0 global address space (Shared Virtual Memory), or similar APIs in CUDA and SYCL. Memory management would be tricky though, as on existing hardware the GPU driver would have to page full 64 Kbyte blocks from system memory to local video memory, even if you only access a single variable.

    https://software.intel.com/content/...opencl-20-shared-virtual-memory-overview.html
    https://rocmdocs.amd.com/en/latest/...-programming-guide.html#shared-virtual-memory
    https://developer.amd.com/fine-grain-svm-with-examples/
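
    A back-of-the-envelope sketch of why pointer chasing hurts here, assuming the 64 Kbyte block size mentioned above and the worst case where every list node lands in a different block. The function names are made up for this illustration and don't correspond to any driver API.

```cpp
#include <cassert>
#include <cstdint>

// Block size quoted in the post; actual granularity depends on hardware and driver.
constexpr std::uint64_t kBlockBytes = 64 * 1024;

// Worst case: every node lives in a different block, so each access
// migrates one full block from system memory to local video memory.
std::uint64_t worst_case_bytes_paged(std::uint64_t nodes) {
    return nodes * kBlockBytes;
}

// How many bytes are moved for each byte of a node that is actually used.
std::uint64_t overhead_factor(std::uint64_t node_bytes) {
    assert(node_bytes > 0 && node_bytes <= kBlockBytes);
    return kBlockBytes / node_bytes;
}
```

    Under these assumptions, walking 1000 scattered 16-byte nodes would move about 65.5 MB to read 16 KB of useful data, an overhead factor of 4096.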

    So the Blender Cycles team, which has several developers from Nvidia on its payroll, released faulty OpenCL code and blamed it on the AMD OpenCL runtime, until AMD developers fixed the errors in their code? Not sure why that's Khronos' or AMD's fault.

    It's still a subset of LLVM 3.7 - adding new HLSL intrinsic functions or supporting additional LLVM data types wouldn't really break bytecode compatibility.

    Since there are no released implementations, you can't really say anything about portability. 'Not lived up to the promise' presumes it has been tested and found to be lacking.

    That's OK as long as they would support C++ source code in either OpenCL or their proprietary APIs.
     
    Last edited: Jun 12, 2021
  4. Ethatron

    Ethatron Regular Subscriber

    Microsoft could experiment with it inside the C++ AMP run-time, without requiring re-tooling (on our side).
     
  5. DmitryKo

    DmitryKo Regular

    Microsoft C++ AMP uses Direct3D 11 runtime, and WDDM / DXGK 1.x don't support shared virtual address space, so it's even more page copying and address patching behind the scenes.

    Vulkan (Mantle) and D3D 12/WDDM 2.0 drivers can allocate a device local, host visible memory pool (limited to the legacy 256 Mbytes until very recently), and a host visible, host cached pool in system memory; CUDA and ROCm/HIP also support 'pinned' host memory. However each pool can only be write cached by its local processor, and memory coherency is implemented by actually disabling (flushing) the CPU cache, which incurs significant overhead.

    HIP Programming Guide - Host Memory - Coherency Controls
    https://rocmdocs.amd.com/en/latest/Programming_Guides/hip-programming-guide.html#host-memory

    Memory management in Vulkan and DX12
    Adam Sawicki (AMD)
    (Powerpoint slides)
    https://gpuopen.com/events/gdc-2018-presentations/

    Differences in memory management between Direct3D 12 and Vulkan
    https://www.asawicki.info/articles/memory_management_vulkan_direct3d_12.php5


    OTOH future GPUs and CPUs would use directory-based coherency protocols like Gen-Z/CXL over PCIe in Intel Xe-HPC parts, or Infinity Architecture in future AMD EPYC Genoa/CDNA2 parts, instead of just snooping the other processor's cache over PCIe:

    https://wccftech.com/amd-next-gen-e...u-accelerator-power-el-capitan-supercomputer/
    https://www.tomshardware.com/news/amd-infinity-fabric-cpu-to-gpu

    https://www.nextplatform.com/2020/04/03/cxl-and-gen-z-iron-out-a-coherent-interconnect-strategy/

    https://www.nextplatform.com/2019/09/18/eating-the-interconnect-alphabet-soup-with-intels-cxl/
     
    Last edited: Jun 15, 2021
  6. JoeJ

    JoeJ Veteran

    If i got that correctly, AMP also lacks a definition of LDS memory, which is why i've ruled it out back then.
    That's also the point where i see issues with something like modern C++ on GPU. Too many abstractions. HW limits like LDS memory size, subgroup size, register file size, etc. seemingly prevent convenient abstractions from becoming possible. :/
     
  7. Lurkmass

    Lurkmass Regular

    Compatibility should never serve as a reason to hinder the future development of a source language. Having pointers in HLSL is one of the oldest requests by developers to Microsoft ...

    On Mantle, you could build similarly complex data structures via a "hierarchical descriptor set" with nested descriptor sets, and it supported pointers as well. Even GLSL with its comparatively backwards design was able to have standardized pointers on Vulkan so why must D3D/HLSL be the last one to hold out ?

    Even if Nvidia is funding them, it's mostly to work on Cycles' CUDA/Optix backend rather than sabotaging the OpenCL backend. Don't you see the problem yet ? With OpenCL, developers have no idea whether what they're doing is even right or wrong. AMD's interference is just a sign that developers ultimately can't be trusted ...

    DXIL stopped being a subset of LLVM, especially with the release of DX12 Ultimate features. DXIL, just like SPIR-V, is diverging from LLVM as Khronos/Microsoft intended ...

    Well, there were implementations from ARM, Intel, and AMD that attempted a similar concept in the past. AMD canned SPIR support in their OpenCL implementation for whatever reason they deemed it unfit. Before that AMD also supported HSAIL but no other vendors were interested so it wasn't a viable option for portability either. That leaves us with ARM's Mali devices, for which we have no idea about the quality of the implementation, and Intel, who keeps adding more SPIR-V extensions, which is bound to introduce more divergence ...

    Soon, your idea is going to be put to the test ...
     
    xpea likes this.
  8. Ethatron

    Ethatron Regular Subscriber

    You said one needs to invent/derive a language. I noted that there is already one available, MS only needs to experiment with DX13 under the hood. The CPU targeting for verification is also readily available. :)

    Ofc it has: tile_static
     
    JoeJ likes this.
  9. JoeJ

    JoeJ Veteran

    Oh, no wonder i did not find it. MS really is creative with renaming things.
    Seems a very attractive option then... :D
     
  10. DmitryKo

    DmitryKo Regular

    Microsoft is not likely to reimplement C++ AMP on top of Direct3D 12 (or 13, if that exists). Herb Sutter has since lost his moustache and is now more interested in improving standard C++, rather than baking another set of proprietary extensions...


    It's not just about HLSL shaders - this is how the entire resource binding API has been designed.

    Because it will be a significant departure from their Direct3D 10/11 programming model, which was largely retained in Direct3D 12 (in hopes of bringing mobile GPUs to the Windows 10 platform, which never materialized). Mantle was redesigned from scratch to support one single architecture, AMD GCN.

    It's still a subset of LLVM 3.7 - however recent LLVM tools cannot generate old versions of the bitcode, so Microsoft is unable to rebase their code on the latest Clang/LLVM 13; they would probably need another clean break for SM 7.0.

    Well, it's still in an early stage, just the Intel DPC++ fork of LLVM/Clang compiler merging the AMD ROCm branch of AMDGPU LLVM backend. But if they could make it work and port their changes back to the LLVM/Clang tree, that would be a significant step toward unification...

    Writing unportable OpenCL code is indeed a problem, but it needs to be solved by better developer tooling.

    At least they port their changes back to open-source repositories.
     
    Last edited: Jun 26, 2021
  11. DegustatoR

    DegustatoR Veteran

  12. Undefined behaviour is often completely non-obvious, and its existence is the reason that companies that use C/C++ invest heavily in tooling / sanitisers to avoid them, most notably memory and lifetime issues. Despite this, Microsoft estimates that 70% of security issues arise from undefined behaviour (memory/lifetime issues). Also, all non-C languages were created in response to the difficulty of avoiding undefined behaviours.

    Undefined behaviours have nothing to do with the obviousness of the error. They’re literally just undefined behaviours from the perspective of the language spec, originally intended to improve performance by leaving the choice to do the right thing in the programmer's hands.
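
    As a small illustration: signed integer overflow is undefined behaviour in C and C++, so the compiler emits no check and it is up to the programmer to avoid it. Below is a minimal sketch of doing that check by hand. The helper name checked_add is made up for this example and is not a standard library function.

```cpp
#include <cassert>
#include <limits>

// Hypothetical helper: adds two ints without ever executing an overflowing
// addition. Returns false (and leaves `out` untouched) if a + b would not
// fit in an int; otherwise stores the sum in `out` and returns true.
bool checked_add(int a, int b, int& out) {
    if (b > 0 && a > std::numeric_limits<int>::max() - b) return false;
    if (b < 0 && a < std::numeric_limits<int>::min() - b) return false;
    out = a + b;
    return true;
}
```

    A plain a + b is faster precisely because the language does not require these comparisons; the price is that nothing defines what happens when the result does not fit.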

    Lastly, LLVM IR is something that is consumed only by the LLVM compiler. IRs are intermediate representations of higher level languages. I have no insight into whether they would ever be consumed by a driver, but if it were it would be purely to compile byte code. I could imagine uses for that, for instance if you wanted to optimise some ML pipeline based on the features of data known only at runtime.
     
  13. Lurkmass

    Lurkmass Regular

    AMD has made their final decision. They're going to introduce their HIP API over the WDDM kernel driver on Windows and Blender is going to be the first application to support this environment ...

    It's only a matter of time before OpenCL disappears for good. There can be no future behind portable source languages or reusable binaries in the long-term ...
     