AMD CDNA Discussion Thread

Part of these exascale systems' price is a $300 million government investment in the SYCL effort... So yeah, it proves that ROCm is nearly useless

Does AMD even support SYCL? The govt can’t make it happen on their own.

Also, SYCL isn't a replacement for ROCm. SYCL has to be compiled into something that will run on the target hardware, and for AMD that's ROCm. So either way ROCm needs to be fixed.
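For context, here's roughly what that single-source model looks like; a minimal SYCL 2020 vector-add sketch, assuming an implementation with a ROCm/HIP backend (e.g. hipSYCL or Codeplay's DPC++ HIP backend) doing the lowering to AMD hardware underneath:

```cpp
// Minimal SYCL 2020 vector add, as a sketch. SYCL itself is just the
// source language; on AMD a SYCL implementation with a ROCm/HIP backend
// would lower this to AMD's stack under the hood.
#include <sycl/sycl.hpp>
#include <vector>

int main() {
    constexpr size_t n = 1 << 20;
    std::vector<float> a(n, 1.0f), b(n, 2.0f), c(n);

    sycl::queue q;  // default selector; may pick an AMD GPU if one is exposed
    {
        sycl::buffer<float> ba(a.data(), sycl::range{n});
        sycl::buffer<float> bb(b.data(), sycl::range{n});
        sycl::buffer<float> bc(c.data(), sycl::range{n});

        q.submit([&](sycl::handler& h) {
            sycl::accessor xa{ba, h, sycl::read_only};
            sycl::accessor xb{bb, h, sycl::read_only};
            sycl::accessor xc{bc, h, sycl::write_only};
            h.parallel_for(sycl::range{n}, [=](sycl::id<1> i) {
                xc[i] = xa[i] + xb[i];
            });
        });
    }  // buffer destructors wait for the kernel and write back to the vectors
    return 0;
}
```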
 
Raw SYCL is a "who cares"; time to board the oneAPI train.

AMD's official stack is HIP+ROCm. Their best bet would be to abandon HIP and build out an official SYCL/ROCm stack, but there's no sign of that. There's no sign of them tossing ROCm in favor of oneAPI either.

How and when exactly is AMD planning to hop on that train?
 
[Image: SYCL landing page]
 
AMD's official stack is HIP+ROCm. Their best bet would be to abandon HIP and build out an official SYCL/ROCm stack, but there's no sign of that. There's no sign of them tossing ROCm in favor of oneAPI either.

How and when exactly is AMD planning to hop on that train?

SYCL has no future for interoperability, since the only big corporation investing in it is Intel, and even then they expect developers to use DPC++, which adds Intel-specific extensions on top of SYCL. So I doubt many vendors, AMD included, would support those extensions, let alone SYCL by itself. There's a ROCm backend for DPC++ being developed by the community, but I doubt it works ...

SYCL isn't that useful for portability purposes either, since AMD refuses to make a SPIR-V kernel compiler, which would ensure more consistent behaviour across vendors. You could have a SYCL implementation on ROCm, but the source code would get compiled into native GCN/CDNA bytecode that can't run anywhere but AMD HW, and that's already what's happening with the community DPC++ effort ...

HIP is arguably the saner solution, since that's what developers are actually using right now and its syntax is more familiar to CUDA, but it doesn't make any guarantees about portability either. The compute world is just going to have to cope with multiple but similar-enough source languages rather than keep hoping for one intermediate bytecode to rule them all ...
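To illustrate that familiarity, a minimal HIP sketch; swap the hip* prefixes for cuda* and it's essentially the same program:

```cpp
// Minimal HIP vector add -- deliberately CUDA-shaped. hipMalloc/hipMemcpy
// and the triple-chevron launch mirror the CUDA runtime API one-for-one.
#include <hip/hip_runtime.h>
#include <vector>

__global__ void vadd(const float* a, const float* b, float* c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) c[i] = a[i] + b[i];
}

int main() {
    const int n = 1 << 20;
    std::vector<float> ha(n, 1.0f), hb(n, 2.0f), hc(n);
    float *da, *db, *dc;
    hipMalloc((void**)&da, n * sizeof(float));
    hipMalloc((void**)&db, n * sizeof(float));
    hipMalloc((void**)&dc, n * sizeof(float));
    hipMemcpy(da, ha.data(), n * sizeof(float), hipMemcpyHostToDevice);
    hipMemcpy(db, hb.data(), n * sizeof(float), hipMemcpyHostToDevice);

    vadd<<<(n + 255) / 256, 256>>>(da, db, dc, n);  // CUDA-style launch

    hipMemcpy(hc.data(), dc, n * sizeof(float), hipMemcpyDeviceToHost);
    hipFree(da); hipFree(db); hipFree(dc);
    return 0;
}
```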

The CUDA, ROCm, and oneAPI software stacks are all meant to be built and specialized for extracting maximum perf from each vendor's unique HW, so forcing them to converge is ill-suited when they have different priorities and standards ...
 
The solution to the MI250X's segmented memory access between the dies is to just launch twice as many compute kernels, so that each die can run its own kernels independently and in parallel for maximum performance. This shouldn't be much of a problem, if at all, since many existing workloads can easily be extended to extract more parallelism ...
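As a sketch of what that means in practice: each GCD of an MI250X enumerates as its own HIP device, so "launch twice as many kernels" comes down to launching on device 0 and device 1 separately. The kernel and the halving scheme here are illustrative, not AMD's API:

```cpp
// Sketch: split one workload across the two GCDs of an MI250X, which
// show up as two separate HIP devices. Kernel launches are asynchronous
// with respect to the host, so both dies end up running in parallel.
#include <hip/hip_runtime.h>

__global__ void process(float* data, int n) {  // hypothetical kernel
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2.0f;
}

int main() {
    const int n = 1 << 24;
    const int half = n / 2;
    float* d[2];

    for (int dev = 0; dev < 2; ++dev) {   // devices 0 and 1 = the two GCDs
        hipSetDevice(dev);
        hipMalloc((void**)&d[dev], half * sizeof(float));
        // ... copy this GCD's half of the input into d[dev] ...
        process<<<(half + 255) / 256, 256>>>(d[dev], half);
    }
    for (int dev = 0; dev < 2; ++dev) {   // wait for both dies, then clean up
        hipSetDevice(dev);
        hipDeviceSynchronize();
        hipFree(d[dev]);
    }
    return 0;
}
```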
 
Then why bother putting two dies together with all the added complexity?
Computation density in HPC. They need tons of FP64 flops in the smallest possible area.
What I don't get is why they sum up memory bandwidth and capacity across the two GCDs when the interface between them is only 200 GB/s bi-directional; that looks silly.

https://www.amd.com/system/files/documents/amd-cdna2-white-paper.pdf
Not a word on spawning threads on the GPU or on lock-free programming. I guess CDNA2 is still in the stone age in these regards?
 
Then why bother putting two dies together with all the added complexity?

For higher compute density per node, of course, since it helps conserve I/O bandwidth between different nodes and cabinets. Supercomputers and servers are connected by very thick cables on high-speed networks, so data traffic congestion becomes a real problem on large systems ...
 
GA100 doesn't require any special cache treatment unless you want to reach the absolute maximum performance for a single-GPU config, but that's a special case of low-level opts to get max perf.
This is a single GPU and it's being programmed accordingly; it has nothing in common with the dual separate GPUs in the MI250.
 
GA100 doesn't require any special cache treatment unless you want to reach the absolute maximum performance for a single-GPU config
Well duh, welcome to the very definition of NUMA land.
it has nothing in common with the dual separate GPUs in the MI250.
You can treat those things as one bigass GPU.
You can treat the entire node as one bigass APU.
Unless you want to reach the absolute maximum performance for a single-GPU config, that is.
 
Well duh, welcome to the very definition of NUMA land.
NUMA has nothing to do with this. Apparently the A100 has full-speed access to all memory banks without any optimizations; the only difference can be in cache latencies.

You can treat those things as one bigass GPU.
Duh, you can treat an 8-GPU DGX system as one bigass GPU, what news!
 
With the different bandwidths between the GCDs and to HBM, it is not "one bigass GPU". In fact there are 8 or 16 independent GPUs attached via IF, and even then there are huge differences within the construct.
Either MI200 is one year too late or the US government has become impatient to build Frontier. Are they even able to get 50% sustained performance? AMD's numbers against A100 point toward less than 50% real FP64 performance...
 