Late followup on a few items:
This might be one way AMD could leverage GCN's architecture to satisfy some of the objections Sony's audio engineer had to using the GPU for audio purposes, back in 2013.
The diagram in the patent can be compared to the original GCN architecture diagram, where significant elements have the same position and shape in both. What's stripped out of the proposed compute unit is most of the concurrent wavefront contexts and SIMDs, along with the LDS and export bus.
What's left is a scalar unit that runs a persistent task-scheduling program: it reads messages from a new queue, matches the queue commands against a table of task programs, and starts them executing.
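To make the control flow concrete, here's a minimal sketch of how I picture that loop behaving. Everything in it (names, the queue layout, the task table) is my own illustration, not anything lifted from the patent.

```c
/* Minimal model of the persistent scheduler loop as I read it.
 * All structures and names here are hypothetical. */
#include <stdint.h>
#include <stdio.h>

#define QUEUE_SIZE   64
#define MAX_TASK_IDS 16

typedef struct {
    uint32_t task_id;   /* index into the task-program table */
    uint32_t arg;       /* packed task parameters */
} queue_entry;

typedef void (*task_program)(uint32_t arg);

static void task_mix(uint32_t arg)    { printf("mix    %u\n", arg); }
static void task_reverb(uint32_t arg) { printf("reverb %u\n", arg); }

int main(void)
{
    /* Table the scheduling program matches incoming commands against. */
    task_program table[MAX_TASK_IDS] = { [0] = task_mix, [1] = task_reverb };

    /* Stand-in for the new hardware queue; the host would write entries here. */
    queue_entry queue[QUEUE_SIZE] = { {0, 10}, {1, 11}, {0, 12} };
    uint32_t read_ptr = 0, write_ptr = 3;

    /* The scalar unit's persistent loop: poll, look up, run, repeat. */
    while (read_ptr != write_ptr) {            /* real hardware would wait/sleep */
        queue_entry e = queue[read_ptr++ % QUEUE_SIZE];
        if (e.task_id < MAX_TASK_IDS && table[e.task_id])
            table[e.task_id](e.arg);           /* start the matched task program */
    }
    return 0;
}
```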
There's seemingly only one SIMD, with a modified dual-issue ALU structure and a tiered register file. While there's no LDS, there's a different sort of sharing within the SIMD: a register-access mechanism that makes it easy to load registers across "lanes", plus a significant crossbar that can rearrange outputs in a programmed way. Some elements of the crossbar may be similar to the LDS, which automatically managed accesses between banks and handled conflicts.
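A toy model of the programmed output crossbar as I understand it, where each destination lane selects an arbitrary source lane; the 64-lane width and the any-to-any selection are my assumptions, not something the patent spells out this way.

```c
/* Hypothetical model of a programmed crossbar: each destination lane picks a
 * source lane from a control vector. */
#include <stdint.h>
#include <stdio.h>

#define LANES 64

static void crossbar_permute(const float src[LANES],
                             const uint8_t select[LANES],
                             float dst[LANES])
{
    for (int lane = 0; lane < LANES; ++lane)
        dst[lane] = src[select[lane] % LANES];   /* any-to-any rearrangement */
}

int main(void)
{
    float in[LANES], out[LANES];
    uint8_t sel[LANES];

    for (int i = 0; i < LANES; ++i) {
        in[i]  = (float)i;
        sel[i] = (uint8_t)(LANES - 1 - i);       /* example pattern: reverse lanes */
    }
    crossbar_permute(in, sel, out);
    printf("out[0] = %.0f, out[63] = %.0f\n", out[0], out[63]);
    return 0;
}
```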
The vector register file is not allocated like standard GCN's. Besides the different tiers, the execution model sets aside a range of global registers, plus per-task allocations that are created and released much as they are for standard shaders.
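Roughly how I read that split, sketched as a trivial allocator; the register file size, the size of the global range, and the bump-style allocation (with release omitted) are all illustrative assumptions.

```c
/* Rough, hypothetical model of a register file split into a fixed global
 * range plus per-task allocations handed out as tasks start. */
#include <stdio.h>

#define VGPR_FILE_SIZE 256u
#define GLOBAL_VGPRS    32u  /* assumed size of the globally shared range */

typedef struct {
    unsigned base;
    unsigned count;
} vgpr_block;

static unsigned next_free = GLOBAL_VGPRS;   /* globals occupy registers [0, 32) */

/* Per-task allocation, analogous to a shader requesting VGPRs at launch.
 * Release/reuse is omitted to keep the sketch short. */
static int alloc_task_vgprs(unsigned count, vgpr_block *out)
{
    if (next_free + count > VGPR_FILE_SIZE)
        return -1;                          /* the task would have to wait */
    out->base  = next_free;
    out->count = count;
    next_free += count;
    return 0;
}

int main(void)
{
    vgpr_block task_a;
    if (alloc_task_vgprs(64, &task_a) == 0)
        printf("task A gets VGPRs %u..%u; globals 0..%u are shared by all tasks\n",
               task_a.base, task_a.base + task_a.count - 1, GLOBAL_VGPRS - 1);
    return 0;
}
```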
Once up and running, this CU would run a shader that essentially runs forever, waiting to take host-generated messages directly or to read from a monitored address range. Rather than writing to a standard GPU queue, having the command processor read it, engaging a dispatch processor or the shader launch path, negotiating for a CU, going through the initialization process, setting up parameters, and waiting for the CU to spin up, the host might be able to ping this custom unit with a series of writes or an interrupt. In the absence of CU reservation and real-time queues, GPU dispatch latency can be tens of milliseconds, which Sony's audio group found unacceptable. Even with those measures, a lot of the listed process still has to happen to launch a shader on a reserved CU.
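For the "series of writes" path, something like the following is what I have in mind on the host side: a few stores into the monitored range and a doorbell write, instead of the whole queue/command-processor/dispatch chain. The mailbox layout is invented for illustration and isn't from the patent.

```c
/* Hypothetical host-side submission to the always-resident unit. */
#include <stdint.h>

typedef struct {
    volatile uint32_t task_id;
    volatile uint32_t arg;
    volatile uint32_t doorbell;   /* the unit's scalar loop monitors this word */
} audio_unit_mailbox;

static void submit_task(audio_unit_mailbox *mbox, uint32_t task_id, uint32_t arg)
{
    mbox->task_id = task_id;      /* plain stores into the monitored range... */
    mbox->arg     = arg;
    __sync_synchronize();         /* ...made visible before the doorbell write */
    mbox->doorbell = 1;           /* one write instead of queue + CP + dispatch + init */
}

int main(void)
{
    static audio_unit_mailbox mbox;   /* stand-in for a mapped device region */
    submit_task(&mbox, /*task_id=*/0, /*arg=*/42);
    return 0;
}
```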
Other objections were the generally large minimum concurrency requirements, where a CU's multiple-SIMD architecture requires at least 4 (realistically more like 8) wavefronts before it can reasonably be expected to reach good hardware utilization, while Sony's HSA presentation indicated hopes for a flexible audio pipeline that wouldn't need to batch hundreds of tasks. This stripped-down CU would remove the extra SIMDs Sony wouldn't want to batch for, although it's not clear whether there's still 64-element wavefront granularity or whether that could somehow be reduced. Just doing that and reducing the number of concurrent wavefronts could save a decent amount of area, perhaps a third. More area could be saved if the texture and load/store units could be shrunk for a workload that doesn't need as many graphics functions, though some of those savings would be given back to the more complex register file and ALU arrangement.
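For reference, the batching numbers behind that concurrency objection work out as below; the 4-vs-8 wavefront figures are the ones quoted above, not a measurement.

```c
/* Minimum batch sizes implied by 64-wide wavefronts on a 4-SIMD CU. */
#include <stdio.h>

int main(void)
{
    const int wave_width      = 64;  /* GCN wavefront size */
    const int min_waves       = 4;   /* one per SIMD, bare minimum */
    const int realistic_waves = 8;   /* closer to what's needed to hide latency */

    printf("minimum batch:   %d work items\n", min_waves * wave_width);       /* 256 */
    printf("realistic batch: %d work items\n", realistic_waves * wave_width); /* 512 */
    return 0;
}
```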
The queue method also provides a different and more direct way to get many low-latency tasks programmed into a software pipeline, in a way that isn't as insulated as an API, while the task programs can still abstract away the particulars of the CU.
This could be appealing for one or more custom Sony audio units, or at least more appealing than the existing GPU TrueAudio setup.
As for whether this could be relevant to Navi or a console, I did see one reference to shared VGPRs being added as a resource description for HSA kernels in changes added for GFX10.
https://github.com/llvm-mirror/llvm...3380939#diff-ad4812397731e1d4ff6992207b4d38fa
Although similar wording in a single reference among many thousands of lines of code is slim evidence.
Wouldn't the chiplet design resolve the multi-GPU rendering problem instead?
I mean... can the IO chip virtualise the GPU so that 2 GPU chiplets appear as 1? All the scheduling logic (and unified L2/L3 cache) would be in the IO chip and the chiplets would just have the CU array and a small L1 cache; same for the CPU chiplets with L1 cache...
In fact I wonder if the chiplet design would not also help to realise the full HSA vision with a fully unified memory pool.
One IO controller overmind chip to rule them all (CPU and GPU)...
Is it not a practical solution?
One item to note is that the path between the unified last-level cache and the CUs supports at least several times the bandwidth of the memory bus, and the command processor, control logic, and export paths can all move values or signals amounting to hundreds to thousands of bytes per cycle to and from the CUs. Separating the CUs from their support infrastructure exposes all of that on-die communication, which had previously been considered internal to the die.
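As a back-of-envelope illustration of the gap, with round numbers I'm assuming for a Vega-class GCN part (64 CUs, a 64 B/cycle L2-to-CU path per CU, ~1.5 GHz, ~0.5 TB/s of HBM2); treat these as illustrative, not measured:

```c
/* Rough aggregate on-die L2<->CU bandwidth versus off-chip memory bandwidth,
 * using assumed figures for a Vega-class part. */
#include <stdio.h>

int main(void)
{
    const double cus             = 64.0;
    const double bytes_per_cycle = 64.0;   /* assumed L2->CU read width per CU */
    const double clock_ghz       = 1.5;
    const double memory_tb_s     = 0.5;    /* rough HBM2 figure */

    double on_die_tb_s = cus * bytes_per_cycle * clock_ghz / 1000.0;
    printf("on-die L2<->CU: ~%.1f TB/s vs memory: ~%.1f TB/s (~%.0fx)\n",
           on_die_tb_s, memory_tb_s, on_die_tb_s / memory_tb_s);
    return 0;
}
```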