Anarchist4000
Veteran
It seems the "initial SW scheduler" is actually something specific to the graphics stack, according to bridgman.
HSA queues are virtualized and exposed directly to userspace. There is no ioctl involved in submitting work to the GPU. The userspace process kicks off the work directly. The HSA hardware scheduler provides the virtualization of the HSA controlled compute queues. Graphics, non-HSA compute, UVD, and VCE still use ioctls and kernel controlled queues. The scheduler allows the kernel driver to better arbitrate access to the actual hw rings rather than just first come first serve access to the rings via the ioctl. Among other things this can allow better utilization of the rings than would be possible with just first come first serve access. It also avoids possible stalls imposed by using semaphores to synchronize data access between rings.
Link
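To make the user-mode queue model concrete, here's a rough sketch against the HSA runtime C API (hsa.h, as shipped with ROCm) — an illustration of my reading of it, not the driver's internals, with error handling omitted. Once hsa_queue_create() returns, the ring buffer and doorbell are mapped into the process, so there's no ioctl per submission:

```c
#include <hsa/hsa.h>
#include <stdint.h>
#include <stdio.h>

/* Pick the first agent that supports kernel dispatch (i.e. a kernel agent,
 * typically the GPU). Illustrative only -- no error handling. */
static hsa_status_t find_kernel_agent(hsa_agent_t agent, void *data) {
    hsa_agent_feature_t features;
    hsa_agent_get_info(agent, HSA_AGENT_INFO_FEATURE, &features);
    if (features & HSA_AGENT_FEATURE_KERNEL_DISPATCH) {
        *(hsa_agent_t *)data = agent;
        return HSA_STATUS_INFO_BREAK;   /* stop iterating */
    }
    return HSA_STATUS_SUCCESS;
}

int main(void) {
    hsa_init();

    hsa_agent_t gpu;
    hsa_iterate_agents(find_kernel_agent, &gpu);

    /* The queue's ring buffer and doorbell are visible to this process;
     * after this call the kernel driver is out of the per-dispatch path. */
    hsa_queue_t *queue = NULL;
    hsa_queue_create(gpu, 256, HSA_QUEUE_TYPE_SINGLE,
                     NULL, NULL, UINT32_MAX, UINT32_MAX, &queue);

    printf("queue %llu: ring at %p, doorbell handle %llu\n",
           (unsigned long long)queue->id, queue->base_address,
           (unsigned long long)queue->doorbell_signal.handle);

    hsa_queue_destroy(queue);
    hsa_shut_down();
    return 0;
}
```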
Unless it can pull all packets in parallel, it's a single point aggregating packets. In a design where system memory should be directly accessible to the execution units, why go through the CP to retrieve data? If texturing from system memory, do CUs normally have to submit packets to the CP to retrieve data? Point being, if an agent on the GPU can read system memory or has a coherent cache, the CP can probably be bypassed. The CP makes more sense with a large graphics workload, with work to distribute and synchronize while pulling in fixed-function units, etc.

The packet (command) processor itself does not aggregate packets. The packets are written to the coherent memory, and the packet processor pulls them in upon being notified by other agents through user-mode signalling (doorbell interrupt).
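That write-then-notify flow is visible in how a dispatch actually goes out. A minimal sketch, continuing from the queue created above; kernel_object and kernarg are placeholders for a loaded code object and its argument buffer, which would be set up elsewhere. The packet body is written straight into the ring, the header is published last, and the doorbell store is the only notification the packet processor gets:

```c
#include <hsa/hsa.h>
#include <stdint.h>
#include <string.h>

/* Sketch: enqueue one AQL kernel dispatch packet into a user-mode queue.
 * 'kernel_object' and 'kernarg' stand in for a code object and argument
 * buffer prepared elsewhere. */
void dispatch(hsa_queue_t *queue, uint64_t kernel_object, void *kernarg,
              hsa_signal_t completion)
{
    /* Reserve a slot: plain user-mode atomics, no system call. */
    uint64_t index = hsa_queue_add_write_index_relaxed(queue, 1);
    while (index - hsa_queue_load_read_index_scacquire(queue) >= queue->size)
        ;  /* ring full: wait for the packet processor to catch up */

    hsa_kernel_dispatch_packet_t *pkt =
        (hsa_kernel_dispatch_packet_t *)queue->base_address +
        (index & (queue->size - 1));

    /* Fill the body in place (the ring lives in coherent memory). */
    memset(pkt, 0, sizeof(*pkt));
    pkt->header = HSA_PACKET_TYPE_INVALID << HSA_PACKET_HEADER_TYPE; /* not ready yet */
    pkt->setup = 1 << HSA_KERNEL_DISPATCH_PACKET_SETUP_DIMENSIONS;
    pkt->workgroup_size_x = 64;
    pkt->workgroup_size_y = 1;
    pkt->workgroup_size_z = 1;
    pkt->grid_size_x = 1024;
    pkt->grid_size_y = 1;
    pkt->grid_size_z = 1;
    pkt->kernel_object = kernel_object;
    pkt->kernarg_address = kernarg;
    pkt->completion_signal = completion;

    /* Publish the header last (release order, GCC/Clang builtin), then ring
     * the doorbell: this store on the queue's doorbell signal is the
     * user-mode notification that tells the packet processor to pull the
     * packet in. */
    uint16_t header =
        (HSA_PACKET_TYPE_KERNEL_DISPATCH << HSA_PACKET_HEADER_TYPE) |
        (HSA_FENCE_SCOPE_SYSTEM << HSA_PACKET_HEADER_SCACQUIRE_FENCE_SCOPE) |
        (HSA_FENCE_SCOPE_SYSTEM << HSA_PACKET_HEADER_SCRELEASE_FENCE_SCOPE);
    __atomic_store_n(&pkt->header, header, __ATOMIC_RELEASE);
    hsa_signal_store_relaxed(queue->doorbell_signal, (hsa_signal_value_t)index);
}
```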
It doesn't seem like the HSA specification has been interpreted correctly, and the understanding of the programming model doesn't feel right either.
That would seem to indicate a kernel can queue its own work, although I'm not sure that would ultimately be necessary (a short agent-enumeration sketch follows the quoted list below):
http://www.hsafoundation.com/html/Content/SysArch/Topics/02_Details/agent_and_kernel_agent_entry.htm
- The entry should make a distinction between Agents (= can issue AQL packets into a queue) and kernel agents (= can issue and consume AQL packet from a queue), but otherwise provide similar detail of properties for each.
- Each such topology entry for an agent/kernel agent is identified by a unique ID on the platform. The ID is used by application and system software to identify agents & kernel agents. Each topology entry also contains an agent/kernel agent type (e.g., throughput/vector, latency/scalar, DSP, …) characterizing its compute execution properties, work-group and wavefront size, concurrent number of wavefronts, in addition to a nodeID value.
- One or more agents/kernel agents may belong to the same nodeID or some agents/kernel agents may be represented in more than one node, depending on the system configuration. This information should be equivalent or complementary to existing information on the platform (e.g., derived from ACPI SRAT or equivalent where applicable).
- An agent/kernel agent parameter set may reference other system resources associated with it to express their topological relationship as needed; examples are other agents/kernel agents, memory, cache, and IO resources, or sub-divisible components of the agent/kernel agent itself.
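For what it's worth, that agent/kernel agent distinction and most of those topology properties are directly queryable from the runtime. A rough sketch (error handling omitted; these are the HSA 1.0-era hsa_agent_get_info attributes — later revisions move some of them onto ISA objects):

```c
#include <hsa/hsa.h>
#include <stdint.h>
#include <stdio.h>

/* Walk the platform topology the runtime exposes and report, per agent,
 * whether it is a plain agent or a kernel agent (can consume AQL kernel
 * dispatch packets), which node it belongs to, and a couple of the
 * compute-execution properties the spec entry mentions. */
static hsa_status_t print_agent(hsa_agent_t agent, void *data) {
    (void)data;

    char name[64] = {0};
    hsa_device_type_t device;
    hsa_agent_feature_t features;
    uint32_t node = 0, wavefront = 0, max_workgroup = 0;

    hsa_agent_get_info(agent, HSA_AGENT_INFO_NAME, name);
    hsa_agent_get_info(agent, HSA_AGENT_INFO_DEVICE, &device);
    hsa_agent_get_info(agent, HSA_AGENT_INFO_FEATURE, &features);
    hsa_agent_get_info(agent, HSA_AGENT_INFO_NODE, &node);

    int is_kernel_agent = (features & HSA_AGENT_FEATURE_KERNEL_DISPATCH) != 0;
    if (is_kernel_agent) {
        hsa_agent_get_info(agent, HSA_AGENT_INFO_WAVEFRONT_SIZE, &wavefront);
        hsa_agent_get_info(agent, HSA_AGENT_INFO_WORKGROUP_MAX_SIZE, &max_workgroup);
    }

    printf("%-20s node %u  %s  %s  wavefront %u  max workgroup %u\n",
           name, node,
           device == HSA_DEVICE_TYPE_CPU ? "CPU" :
           device == HSA_DEVICE_TYPE_GPU ? "GPU" : "DSP",
           is_kernel_agent ? "kernel agent" : "agent",
           wavefront, max_workgroup);
    return HSA_STATUS_SUCCESS;
}

int main(void) {
    hsa_init();
    hsa_iterate_agents(print_agent, NULL);
    hsa_shut_down();
    return 0;
}
```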
Why not, as opposed to adding multiple command processors to the GPU that only get utilized in certain circumstances? Seems a bit redundant, as kernel agents on a GPU could probably fetch work entirely separately from the CP (see the soft-queue sketch below). Not too dissimilar from current multi-threading models. Really depends on how the hardware is configured. For a relatively light processing setup, say a SIMD co-processor for a CPU, dedicating the resources might be more practical as opposed to instantiating an agent each time work arrives. Should provide better latency.

There are ARM cores for embedded or control uses, with items like local memory storage and an emphasis on fixed cycle costs for all operations. What is not done with them is taking that and putting it into the CPU host complex. There are dozens of similar cores already driving the GPU, for which the ISA is not a particularly relevant distinction. It may be the case that the ISA is changed, but the role doesn't shift because of it.
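On the "consumer other than the CP" point: the runtime does expose soft queues, whose packets are consumed by whatever the application designates — e.g. an ordinary host thread — rather than a hardware packet processor. A sketch under that reading (the fine-grained region lookup is omitted, and the serving loop is deliberately simplified):

```c
#include <hsa/hsa.h>
#include <stdint.h>
#include <stdio.h>

/* Sketch: a soft queue serviced by a host thread instead of a hardware
 * command processor. 'region' is assumed to be a fine-grained global
 * region found via hsa_agent_iterate_regions (lookup omitted). */
void serve_soft_queue(hsa_region_t region)
{
    hsa_signal_t doorbell;
    hsa_signal_create(-1, 0, NULL, &doorbell);  /* -1 = no packets yet */

    hsa_queue_t *q = NULL;
    hsa_soft_queue_create(region, 16, HSA_QUEUE_TYPE_SINGLE,
                          HSA_QUEUE_FEATURE_AGENT_DISPATCH, doorbell, &q);

    uint64_t read = 0;
    for (;;) {  /* serve forever; a real consumer would have a shutdown path */
        /* Sleep until a producer advances the doorbell to our read index. */
        hsa_signal_wait_scacquire(doorbell, HSA_SIGNAL_CONDITION_GTE,
                                  (hsa_signal_value_t)read, UINT64_MAX,
                                  HSA_WAIT_STATE_BLOCKED);

        hsa_agent_dispatch_packet_t *pkt =
            (hsa_agent_dispatch_packet_t *)q->base_address +
            (read & (q->size - 1));

        printf("agent dispatch: type %u, arg0 %llu\n",
               pkt->type, (unsigned long long)pkt->arg[0]);
        if (pkt->completion_signal.handle != 0)
            hsa_signal_subtract_screlease(pkt->completion_signal, 1);

        /* Release the slot and advance the read index. */
        __atomic_store_n(&pkt->header,
                         (uint16_t)(HSA_PACKET_TYPE_INVALID << HSA_PACKET_HEADER_TYPE),
                         __ATOMIC_RELEASE);
        hsa_queue_store_read_index_screlease(q, ++read);
    }
}
```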
For a large transaction, sure, but for frequent small transactions it would compromise the CP and add overhead. No different than overwhelming the CP by submitting individual triangles as draw calls. Given an appropriate architecture, it would make more sense to have a plurality of CPs if the hardware were broken down into a handful of units. Data amplification would be a concern if squeezing commands through a narrow bus, but in theory these designs have full access to system memory bandwidth, based on the leaks we've seen. My whole understanding of HSA was a single thread being able to alternate between scalar/latency and vector/throughput execution somewhat transparently (see the dispatch-and-wait sketch at the end of this post). A design like that would ideally negate the need for SIMD processors on a CPU: use HSA transparently, as opposed to wider AVX instructions that likely lack the hardware to sustain throughput.

And the cache hierarchy, variable execution time, and lack of a domain-specific microcode store. On top of that, the command processor generates additional signals and instructions based on the contents of the packets, which generally should mean data amplification and a loss of abstraction if this is exposed to the host core complex. This would compromise the CPU's effectiveness in general if hooks for GPU packet processing and arbitrary hardware fiddling were added there.
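To illustrate the "alternate within one thread" reading — this is only my interpretation, and it reuses the dispatch() helper sketched earlier in this post (an assumption of these sketches, not a runtime call) — the latency-to-throughput hand-off is just an enqueue plus a user-mode wait on a completion signal:

```c
#include <hsa/hsa.h>
#include <stdint.h>

/* Hypothetical continuation of the earlier sketches: one host thread does
 * scalar work, hands a small data-parallel chunk to the GPU queue, then
 * waits on a user-mode completion signal and carries on. */
extern void dispatch(hsa_queue_t *queue, uint64_t kernel_object, void *kernarg,
                     hsa_signal_t completion);

void alternate(hsa_queue_t *queue, uint64_t kernel_object, void *kernarg)
{
    hsa_signal_t done;
    hsa_signal_create(1, 0, NULL, &done);

    /* ... latency-sensitive scalar work on the CPU ... */

    dispatch(queue, kernel_object, kernarg, done);   /* throughput part */

    /* The packet processor decrements the signal on completion; waiting with
     * HSA_WAIT_STATE_ACTIVE spins in user mode, BLOCKED lets the runtime
     * put the thread to sleep instead. */
    hsa_signal_wait_scacquire(done, HSA_SIGNAL_CONDITION_EQ, 0,
                              UINT64_MAX, HSA_WAIT_STATE_ACTIVE);

    /* ... back to scalar work on the results ... */

    hsa_signal_destroy(done);
}
```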