I'll answer your question with another question: what good is FPU data doing sitting over on the GPU? Once it has been processed by the GPU, often you'd want to further process it on the CPU. You don't send a bunch of float data through an FPU (whether it is on a CPU or a GPU), only then to immediately throw away the results. Not every data processing task is going to be streaming a huge amount of data directly from and back to main memory...
With unified memory, "on the GPU" is a bit abstract. This would be vector data, not all FPU data. I can't think of very many cases where you would bulk process data with a vector unit and then directly interact with it on the CPU, barring some sort of reduction. Vector processing more often than not is filling buffers to stream out (see the sketch below). In the case of Ryzen, AVX2 is supported, but not AVX-512, the latter being the kind of SIMD-sized bulk processing I'm talking about. That data could reside in separate pools, since the associated processor is likely the only one to use it. Filling buffers really doesn't require the CPU, just bandwidth and ILP. The real work for a CPU is addressing and logic, not heavy processing. All that prediction hardware and caching is useless when linearly processing giant buffers.
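Roughly what I mean by "filling buffers to stream out", as a minimal sketch with AVX2 intrinsics (the function name and the scale operation are just for illustration):

```cpp
// Minimal sketch of the "fill a buffer and stream it out" pattern.
// The loop is pure bandwidth + ILP: no branches worth predicting,
// no data reuse for the cache hierarchy to exploit.
#include <immintrin.h>
#include <cstddef>

// Scale src[] by k into dst[]; assumes n is a multiple of 8 and dst is 32-byte aligned.
void scale_stream(float* dst, const float* src, float k, std::size_t n) {
    const __m256 vk = _mm256_set1_ps(k);
    for (std::size_t i = 0; i < n; i += 8) {
        __m256 v = _mm256_loadu_ps(src + i);   // load 8 floats
        v = _mm256_mul_ps(v, vk);              // one wide FP op per iteration
        _mm256_stream_ps(dst + i, v);          // non-temporal store: bypass the caches,
                                               // the CPU never reads this data back
    }
    _mm_sfence();                              // make the streaming stores visible
}
```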
Anyhow, any new hardware approach to enable replacing FPUs with GPUs that involves a paradigm shift in how we write software is going to be - pardon the French - fucking dead on arrival. Quad-core CPUs first appeared over ten years ago, and it's taken us until now to get everyday (in most cases this reads "games") software that reliably takes advantage of more cores than that.
I'm not suggesting replacing the FPUs, but the larger vector units. The CPU will still process floating point data. Vector instructions and data tend to live in their own world, with the CPU only accessing that data to make a decision in some instances (a reduction, as sketched below). As mentioned above, said data is generally streamed somewhere.
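A hedged sketch of that "CPU only touches it to make a decision" case: a reduction that collapses a big buffer into one scalar the CPU branches on. The function name and the threshold are mine, purely illustrative:

```cpp
// Sketch of the case where the CPU does look at vector results:
// a reduction feeding a single scalar decision. Assumes AVX2.
#include <immintrin.h>
#include <cstddef>

bool exceeds_threshold(const float* data, std::size_t n, float threshold) {
    __m256 acc = _mm256_setzero_ps();
    for (std::size_t i = 0; i < n; i += 8)        // n assumed to be a multiple of 8
        acc = _mm256_add_ps(acc, _mm256_loadu_ps(data + i));

    // Horizontal sum of the 8 lanes - the only point where the vector data
    // becomes a scalar the CPU actually reasons about.
    __m128 lo = _mm256_castps256_ps128(acc);
    __m128 hi = _mm256_extractf128_ps(acc, 1);
    __m128 s  = _mm_add_ps(lo, hi);
    s = _mm_hadd_ps(s, s);
    s = _mm_hadd_ps(s, s);
    return _mm_cvtss_f32(s) > threshold;          // the CPU-side decision
}
```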
And how are you going to make Intel get on this bandwagon? It's not in their interest. Anyone using Intel's compiler is going to get Intel-approved code spitting out the other end.
It's already in LLVM and Intel's compiler isn't the only option.
Yes they could, but would they? Even Intel itself probably doesn't use the same opcodes for its GPU shaders as for its AVX units. Does the AVX ISA fit graphics processing needs? Is the AVX instruction set inclusive of everything a GPU needs? Is it overly burdened with stuff CUs don't need? There's that aspect to consider as well!
The opcodes can be translated easily enough in hardware, and the GPU ISA is likely a superset of the instructions. AVX instructions would be wave-level instructions to a GPU, not unlike tensor cores with Nvidia. With an FPGA or fixed logic the conversion could be handled easily enough; it's not difficult to turn a 5 into a 9 with binary logic. It would be up to the compiler to support the different target, and there has even been work on supporting multiple code paths through the compiler. I'd have to go find a reference for that, but it's somewhat recent as of this year.
I checked the instructions in AVX and nearly all of them would translate directly to GCN ISA. The permute instructions would be a bit different. The easier solution would be to compile the vector instructions into GCN ISA and spin up a parallel thread to start with.
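To make the "AVX ops become wave-level ops" idea concrete, here's a toy model in plain C++ standing in for what the hardware/compiler would do; it is not real GCN code, just an illustration under my assumptions:

```cpp
// Toy model of the translation: an 8-wide AVX op is just a wave-level op
// with 56 of the 64 GCN lanes masked off via the execution mask.
#include <array>
#include <cstdint>

constexpr int WAVE = 64;                      // GCN wavefront width
using Wave = std::array<float, WAVE>;

// "v_add_f32" across a wavefront, gated by an execution mask (EXEC on GCN).
Wave wave_add(const Wave& a, const Wave& b, std::uint64_t exec) {
    Wave out{};
    for (int lane = 0; lane < WAVE; ++lane)
        if (exec & (1ull << lane))
            out[lane] = a[lane] + b[lane];
    return out;
}

// An AVX2 vaddps over 8 floats maps to the same wave op with only the low
// 8 lanes enabled - which is why most of the ISA translates one-to-one.
// Permutes are the awkward part: AVX shuffles within 8 lanes, while GCN
// permutes across the whole wave, so the mapping isn't as direct.
Wave avx_style_add(const Wave& a, const Wave& b) {
    return wave_add(a, b, 0xFFull);           // EXEC = low 8 lanes
}
```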
CU L1s have very low residency periods when they aren't obligated to write back for something like coherent accesses. GPU memory latencies are measured in hundreds of GPU cycles, and loaded latencies can probably push things an order of magnitude higher.
Part of that is down to pipelining. Take HBM2 with pseudo-channel mode and its latency and power reductions. Then consider that the workload will mirror the current parallel one. When dealing with vectors on a GPU, they will be large enough that the latency is overlapped and becomes less of an issue. GPUs don't need the lower latency since they are focused on throughput, and graphics memory isn't all that different from system memory. The model will entail heterogeneous memory: low-latency, low-bandwidth pools alongside high-latency, high-bandwidth ones.
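A back-of-the-envelope for why overlapped latency stops mattering once the vectors are big enough; the numbers here are illustrative assumptions, not measured figures for any particular GPU:

```cpp
// Little's-law style occupancy estimate: how many wavefronts need to be in
// flight so the SIMD never stalls waiting on memory. Numbers are assumptions.
#include <cstdio>

int main() {
    const double mem_latency_cycles = 400.0;  // assumed loaded HBM2 latency, in GPU cycles
    const double issue_interval     = 4.0;    // assumed cycles between memory ops per SIMD
    const double waves_needed = mem_latency_cycles / issue_interval;
    std::printf("~%.0f wavefronts in flight hide the latency\n", waves_needed);
    // Large vector workloads easily keep that many waves resident,
    // so throughput, not latency, becomes the limit.
    return 0;
}
```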
I've noted who is involved in that research, and why they wouldn't care to let HSA and AMD in particular use it.
AMD is on the board of OpenCAPI. That doesn't mean they can't play politics with it, but OpenCAPI should be agnostic of the HSA stack. Shared memory and low-latency transactions are all that's required.
I'd prefer a more direct paper or other reference. They're focused on data movement, not kernel command movement, which is what HSA's AQL is about. It seems more like a controlled push model for a subset of memory traffic, which compute GPUs have not been doing.
I'm not suggesting moving commands, but altering the threading model slightly. Not unlike how GPUs currently work, but with the CPU side waiting and more efficient hardware synchronization mechanisms added: mechanisms that operate on bit fields and significantly reduce the overhead of locking and synchronization.
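A hedged sketch of the kind of bit-field synchronization I have in mind, done in software with std::atomic since the hardware primitive doesn't exist yet; the names and the polling loop are mine, and real hardware support would replace the spin with something like a monitor/wait:

```cpp
// One atomic word tracks 64 outstanding chunks; the CPU waits on the whole
// bit field instead of taking a lock per chunk.
#include <atomic>
#include <cstdint>
#include <thread>
#include <vector>

std::atomic<std::uint64_t> done_bits{0};

void worker(int chunk) {
    // ... process chunk (stand-in for a GPU wavefront finishing its slice) ...
    done_bits.fetch_or(1ull << chunk, std::memory_order_release);  // one bit set, no lock taken
}

int main() {
    const int chunks = 64;
    std::vector<std::thread> pool;
    for (int c = 0; c < chunks; ++c) pool.emplace_back(worker, c);

    // CPU side: wait until every bit is set - a single word carries the state
    // of 64 producers, versus 64 mutex acquire/release pairs.
    while (done_bits.load(std::memory_order_acquire) != ~0ull)
        std::this_thread::yield();

    for (auto& t : pool) t.join();
    return 0;
}
```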
Zen has PCIe interfaces that it can switch to xGMI; it also has multiple on-package GMI links that are not PCIe, for EPYC integration. The transmission path isn't limiting the protocol or fabric.
All I'm saying is that the signaling standards are the same, with Infinity Fabric being a superset of whatever PCIe standard gets adopted. In current chips this means the signals can be driven faster thanks to guarantees of shorter distances.