AMD CDNA Discussion Thread

Bondrewd · Nov 9, 2021

ethernity said:
Portfolio will expand overnight

An IP stash yes and AMD will leverage it in funny ways sooner or later.

trinibwoy · Nov 9, 2021

xpea said:
Part of these exascale systems price is $300 million government investment in SYSCL effort... So yeah it proves that ROCm is nearly useless

Does AMD even support SYCL? The govt can’t make it happen on their own.

Also SYCL isn’t a replacement for ROCm. SYCL has to be compiled into something that will run on the target hardware and for AMD that’s ROCm. So either way ROCm needs to be fixed.

Bondrewd · Nov 9, 2021

trinibwoy said:
Does AMD even support SYCL?

No.

trinibwoy said:
Also SYCL isn’t a replacement for ROCm

Raw SYCL is who cares, time to board OneAPI train.

trinibwoy · Nov 9, 2021

Bondrewd said:
Raw SYCL is who cares, time to board OneAPI train.

AMDs official stack is HIP+ROCm. Their best bet would be to abandon HIP and build out an official SYCL/ROCm stack but there’s no sign of that. There’s no sign of them tossing ROCm in favor of oneAPI either.

How and when exactly is AMD planning to hop on that train?

Granath · Nov 9, 2021

trinibwoy · Nov 9, 2021

Granath said:

Yup, notice the names of the organizations building SYCL solutions that would run on CDNA. None of them are named AMD. Question is whether the supercomputer guys will be willing to use these unofficial / unsupported solutions.

Lurkmass · Nov 9, 2021

trinibwoy said:
AMDs official stack is HIP+ROCm. Their best bet would be to abandon HIP and build out an official SYCL/ROCm stack but there’s no sign of that. There’s no sign of them tossing ROCm in favor of oneAPI either.

How and when exactly is AMD planning to hop on that train?

SYCL has no future for interoperability since the only big corporation investing in it is Intel and even then they expect developers to use DPC++ instead which are Intel specific extensions to SYCL so I doubt many vendors including AMD would support those extensions or let alone SYCL by itself. There's a ROCm backend for DPC++ being developed by the community but I doubt it works ...

SYCL for portability purposes is not that useful either since AMD refuses to make a SPIR-V kernel compiler which would ensure more consistent behaviour across vendors. You could have a SYCL implementation on ROCm but the source code will get compiled into native GCN/CDNA bytecode which can't run anywhere but AMD HW and that's already the case that's happening with DPC++ community effort ...

HIP is arguably the more sane solution since that's what developers are actually using right now when it's syntax is more familiar to CUDA but it doesn't make any guarantees about portability either. The compute world is just going to have to cope with multiple but similar enough source languages rather than eventually hoping for one intermediate bytecode to rule them all ...

CUDA, ROCm, and oneAPI software stacks are all meant to be built and specialized for extracting maximum perf behind each vendors unique HW so forcing either to converge is ill-suited when they have different priorities and standards ...

DavidGraham · Nov 9, 2021

Bondrewd said:
A100 programming in some cases is already that due to wonky as balls split L2.

Some cases are not the same as ALL cases.

Bondrewd said:
More, like 400.

Nope, 200GB/s, the two dies are connected by 4 IF links, each capable of 50GB/s up/down, so 200GB/s in total.

Lurkmass · Nov 9, 2021

The solution to MI250X's segmented memory access between the dies is to just launch twice as many compute kernel for maximum performance so that each die can run it's own compute kernels independent of each other in parallel. This shouldn't be much of a problem if at all since many existing workloads can be easily extended to extract more parallelism ...

DavidGraham · Nov 9, 2021

Lurkmass said:
The solution to MI250X's segmented memory access between the dies is to just launch twice as many compute kernel for maximum performance so that each die can run it's own compute kernels independent of each other in parallel

Then why bother putting two dies together with all the added complexity?

pTmdfx · Nov 9, 2021

DavidGraham said:
Some cases are not the same as ALL cases.

Nope, 200GB/s, the two dies are connected by 4 IF links, each capable of 50GB/s up/down, so 200GB/s in total.

AMD's product page says up to 400 GB/s bidirectional in-package bandwidth between the GCDs, though the footnote appears to be incomplete. :mrgreen:

OlegSH · Nov 9, 2021

DavidGraham said:
Then why bother putting two dies together with all the added complexity?

Computation density in HPC. They need tons of FP64 flops in the smallest possible area.
What I don't get is why they sum up memory bandwidth and capacity with the 200GB/s bi-directional interface between the GCDs, that looks silly.

https://www.amd.com/system/files/documents/amd-cdna2-white-paper.pdf
Not a word on on threads spawning on GPU and on lock free programming. I guess RDNA2 is still in stone age in these regards?

Lurkmass · Nov 9, 2021

DavidGraham said:
Then why bother putting two dies together with all the added complexity?

For higher compute density per-node of course since it helps conserve I/O bandwidth between different nodes and cabinets. Supercomputers and servers are connected by very thick cables on high speed networks so data traffic congestion becomes a real problem on large systems ...

Bondrewd · Nov 9, 2021

DavidGraham said:
Some cases are not the same as ALL cases.

Same as here duh.

DavidGraham said:
Nope, 200GB/s

400.
Each IFIS at 25GT/s is 100GB/s bidir much the same way it is on EPYC.

OlegSH · Nov 9, 2021

GA100 doesn't require any special cache treatment unless you want to reach out the absolute maximum performance for a single GPU config, but that's a special case of low level opts to get max perf.
This is a single GPU and it's being programmed accordingly, it has nothing common with the dual separate GPUs in MI250.

Bondrewd · Nov 9, 2021

OlegSH said:
GA100 doesn't require any special cache treatment unless you want to reach out the absolute maximum performance for a single GPU config

Well duh welcome to the very definition of NUMA lands.

OlegSH said:
it has nothing common with the dual separate GPUs in MI250.

You can treat those things as one bigass GPU.
You can treat the entire node as one bigass APU.
Unless you want to reach out the absolute maximum performance for a single GPU config, that is.

OlegSH · Nov 9, 2021

Bondrewd said:
Well duh welcome to the very definition of NUMA lands.

NUMA has nothing to do with this. Apparently A100 has full speed access to all memory banks without any optimizations, the only difference can be in cache latencies.

Bondrewd said:
You can treat those things as one bigass GPU.

Duh oh ah you can treat 8 GPUs DGX system as a one bigass GPU, what a news!

DegustatoR · Nov 9, 2021

Bondrewd said:
You can treat those things as one bigass GPU.
You can treat the entire node as one bigass APU.

And you will likely get way less than A100 performance by treating them this way.
Which kinda makes no sense from any perspective but you proving some point to someone who's not even here.

troyan · Nov 9, 2021

With the different bandwidth between GCDs and HBM is not "one bigass GPU". In fact there are 8 or 16 independent GPUs attached with IF and even then there are huge differences within the construct.
Either MI200 is one year to late or the US goverment has become impatient to build Frontier. Do they even able to get 50% sustained performance? AMDs numbers against A100 point in the direction of less than 50% real FP64 performance...

Bondrewd · Nov 9, 2021

OlegSH said:
the only difference can be in cache latencies.

Totally not a misery point I swear to god.

DegustatoR said:
And you will likely get way less than A100 performance by treating them this way.

Who knows haha.

troyan said:
Either MI200 is one year to late

what the fuck.

troyan said:
AMDs numbers against A100 point in the direction of less than 50% real FP64 performance...

membw says hello.

AMD CDNA Discussion Thread

Bondrewd

trinibwoy

Meh

Bondrewd

trinibwoy

Meh

Granath

trinibwoy

Meh

Lurkmass

DavidGraham

Lurkmass

DavidGraham

pTmdfx

OlegSH

Lurkmass

Bondrewd

OlegSH

Bondrewd

OlegSH

DegustatoR

troyan

Bondrewd