AMD Architecture Discussion

Anarchist4000 · Dec 31, 2016

pTmdfx said:
It seems the "initial SW scheduler" is actually something specific to the graphics stack, according to bridgman.

HSA queues are virtualized and exposed directly to userspace. There is no ioctl involved in submitting work to the GPU. The userspace process kicks off the work directly. The HSA hardware scheduler provides the virtualization of the HSA controlled compute queues. Graphics, non-HSA compute, UVD, and VCE still use ioctls and kernel controlled queues. The scheduler allows the kernel driver to better arbitrate access to the actual hw rings rather than just first come first serve access to the rings via the ioctl. Among other things this can allow better utilization of the rings than would be possible with just first come first serve access. It also avoids possible stalls imposed by using semaphores to synchronize data access between rings.
Link

pTmdfx said:
The packet (command) processor itself does not aggregate packets. The packets are written to the coherent memory, and the packet processor pulls them in upon being notified by other agents through user-mode signalling (doorbell interrupt).

Unless it can pull all packets in parallel, it's a single point aggregating packets. In a design where system memory should be directly accessible to the execution units why go through the CP to retrieve data? If texturing from system memory do CUs normally have to submit packets to the CP to retrieve data? Point being if an agent on the GPU can read system memory or has coherent cache the CP can probably be bypassed. The CP makes more sense with a large graphics workload with work to distribute and synchronize while pulling in fixed function units etc.

pTmdfx said:
It doesn't seem like the HSA specification has been interpreted correctly, and the understanding of the programming model doesn't feel right either.

The entry should make a distinction between Agents (= can issue AQL packets into a queue) and kernel agents (= can issue and consume AQL packet from a queue), but otherwise provide similar detail of properties for each.

Each such topology entry for an agent/kernel agent is identified by a unique ID on the platform. The ID is used by application and system software to identify agents & kernel agents. Each topology entry also contains an agent/kernel agent type (e.g., throughput/vector, latency/scalar, DSP, …) characterizing its compute execution properties, work-group and wavefront size, concurrent number of wavefronts, in addition to a nodeID value.

One or more agents/kernel agents may belong to the same nodeID or some agents/kernel agents may be represented in more than one node, depending on the system configuration. This information should be equivalent or complementary to existing information on the platform (e.g., derived from ACPI SRAT or equivalent where applicable).

An agent/kernel agent parameter set may reference other system resources associated with it to express their topological relationship as needed; examples are other agents/kernel agents, memory, cache, and IO resources, or sub-divisible components of the agent/kernel agent itself.

http://www.hsafoundation.com/html/Content/SysArch/Topics/02_Details/agent_and_kernel_agent_entry.htm

That would seem to indicate a kernel can queue its own work, although I'm not sure that would ultimately be necessary.

3dilettante said:
There are ARM cores for embedded or control with items like local memory storage and an emphasis on fixed cycle costs for all operations. What is not done with them is taking that and putting it into the CPU host complex. There are dozens of similar cores already driving the GPU, for which the ISA is not a particularly relevant distinction. It may be the case that the ISA is changed, but the role doesn't shift because of it.

Why not as opposed to adding multiple command processors to the GPU that only get utilized in certain circumstances? Seems a bit redundant as kernel agents on a GPU could probably fetch work entirely separate of the CP. Not too dissimilar from current multi-threading models. Really depends on how the hardware is configured. For a relatively lite processing setup, say SIMD co-processor for a CPU, dedicating the resources might be more practical as opposed to instantiating an agent each time work arrives. Should provide better latency.

3dilettante said:
And the cache hierarchy, variable execution time, and lack of a domain-specific microcode store. On top of that, the command processor generates additional signals and instructions based on the contents of the packets, which generally should mean data amplification and a loss of abstraction if this is exposed to the host core complex. This is compromising the CPU's effectiveness in general if hooks for GPU packet processing and arbitrary hardware fiddling are added there.

For a large transaction, but for frequent small transactions it would compromise the CP and add overhead. No different than overwhelming the CP by submitting individual triangles as draw calls. Given an appropriate architecture, it would make more sense to have a plurality of CPs if the hardware is broken down into a handful of units. Data amplification would be a concern if squeezing commands through a narrow bus. In theory these designs have full access to system memory bandwidth based on leaks we've seen. My whole understanding of HSA was a single thread being able to alternate between scalar/latency and vector/throughput somewhat transparently. A design ideally negating the need for SIMD processors on a CPU. Use HSA transparently as opposed to larger AVX instructions that likely lack the hardware to sustain throughput.

pTmdfx · Jan 1, 2017

Anarchist4000 said:
Unless it can pull all packets in parallel, it's a single point aggregating packets.

It is the structured circular buffers (a.k.a. user mode queues) in the coherent memory which are aggregating the packets. The packet processor reads them and manipulates them based on the atomic queue indexes. How many of them are being read at a time is up to the implementation.

Then on top of the queuing model, you can have an unlimited number of queues. Each queue is a command stream independent of the others, so you can have multiple concurrent packet processors that work on different queues at the same time.

In the ideal world of HSA you can have tens of processes competing for an agent, and queues can have unlimited packets enqueued. Managing queuing resources purely on-chip is simply impractical. Just look at how the modern graphics API shares a similar queue-based execution model with HSA.

Anarchist4000 said:
In a design where system memory should be directly accessible to the execution units why go through the CP to retrieve data? If texturing from system memory do CUs normally have to submit packets to the CP to retrieve data? Point being if an agent on the GPU can read system memory or has coherent cache the CP can probably be bypassed. The CP makes more sense with a large graphics workload with work to distribute and synchronize while pulling in fixed function units etc.

You seem to be deeply confused about how a GPU fundamentally works. The command processors and the scheduling layer are just the work scheduler. Generally speaking, they create work based on the description in the "commands", a.k.a. packets, and by work I mean an invocation of a shader program or compute kernel, which is specified in the packet as a pointer to the program entry point (the first instruction).

The created work is distributed to CUs. But the CP/scheduler does not do the work, does not care how the work is done, and does not instruct the CUs on how to complete it. It just needs to know whether it is completed (in a simplified model).

Each CU is essentially a 40-way SMT CPU core with a super wide vector unit. A thread (i.e. a subset of the created work) is "context-switched" in by the front-end. The CUs are responsible for running the preset shader program from beginning to end. Under HSA, they have full access to the flat coherent memory space. So whatever memory accesses they do are none of the command processors' business.

Now for the definition of an HSA agent: you have to understand that HSA is a data-parallel programming model. An agent is not a CU or a CPU core (these are implementation details) but just something that can run a huge data-parallel problem in general.

HSA agents execute "kernel dispatches", but that doesn't mean the packet processor itself instructs everything. A kernel dispatch represents a problem of arbitrary size. Say, given a 1920 * 1080 dispatch, GCN breaks it down into 32,400 64-wide wavefronts and runs them concurrently on all CUs. Each wavefront runs the kernel program from beginning to end and has its own program counter. They are concurrent, despite being spawned from a single command in the queue. The packet processor does not care how the CUs run them. It just cares about when the dispatch is completed, and then it moves on to the next packet.

A packet is a high-level description of a problem. A packet processor parses the packet to break it down into smaller pieces, and the CUs are the ones that actually solve the problem.
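
As a rough check of the wavefront arithmetic for a 1920 * 1080 dispatch, here is a minimal sketch assuming one work-item per pixel and 64-wide GCN wavefronts. The work-group shapes and the helper name `wavefront_count` are assumptions made for illustration; the real split depends on the work-group size the dispatch packet specifies.

```python
import math

WAVEFRONT_SIZE = 64  # work-items per GCN wavefront


def wavefront_count(grid, workgroup):
    """Number of wavefronts needed for a 2D dispatch.

    grid and workgroup are (x, y) sizes in work-items. The grid is rounded
    up to whole work-groups, and each work-group to whole wavefronts.
    """
    groups = math.ceil(grid[0] / workgroup[0]) * math.ceil(grid[1] / workgroup[1])
    waves_per_group = math.ceil(workgroup[0] * workgroup[1] / WAVEFRONT_SIZE)
    return groups * waves_per_group


# Hypothetical work-group shapes; only the grid size comes from the post.
print(wavefront_count((1920, 1080), (8, 8)))    # 240 * 135 groups, 1 wave each
print(wavefront_count((1920, 1080), (16, 16)))  # 120 * 68 groups, 4 waves each
```

With 8x8 work-groups the grid divides evenly; with 16x16 the last row of work-groups is only partly filled, so a few extra wavefronts are launched.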

Anarchist4000 said:
That would seem to indicate a kernel can queue its own work, although I'm not sure that would ultimately be necessary.

A kernel can enqueue to any queue, as long as it has the address of the user mode queue. As queues are stored in the coherent memory, it is just a matter of atomic operations and signalling, which is the whole point of HSA.
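
To make "atomic operations and signalling" concrete, here is a toy model of a user mode queue: a ring buffer plus a doorbell. It is only a sketch under assumptions. The class and method names are hypothetical, a lock stands in for hardware atomics, `threading.Event` stands in for the doorbell signal, and `None` plays the role of an invalid (not yet published) packet slot. It is not the HSA runtime API.

```python
import threading


class UserModeQueue:
    """Toy user mode queue: ring buffer in shared memory plus a doorbell."""

    def __init__(self, size):
        assert size > 0 and (size & (size - 1)) == 0, "size must be a power of two"
        self.ring = [None] * size           # None marks an unpublished slot
        self.mask = size - 1
        self.write_index = 0                # advanced by producers
        self.read_index = 0                 # advanced by the consumer
        self._atomic = threading.Lock()     # stand-in for hardware atomics
        self.doorbell = threading.Event()   # stand-in for the doorbell signal

    def enqueue(self, packet):
        """Producer side: any agent holding a reference to the queue can call this."""
        with self._atomic:                  # "atomic" reserve of one slot
            if self.write_index - self.read_index == len(self.ring):
                return False                # queue full, caller retries later
            slot = self.write_index
            self.write_index += 1
        self.ring[slot & self.mask] = packet  # publish the packet in memory
        self.doorbell.set()                   # notify the packet processor
        return True

    def consume(self):
        """Packet processor side: drain every packet published so far."""
        self.doorbell.clear()
        drained = []
        while self.read_index < self.write_index:
            idx = self.read_index & self.mask
            packet = self.ring[idx]
            if packet is None:              # slot reserved but not yet written
                break
            self.ring[idx] = None           # mark the slot invalid again
            self.read_index += 1
            drained.append(packet)
        return drained


queue = UserModeQueue(4)
for name in ("dispatch_a", "dispatch_b"):
    queue.enqueue({"kernel": name})
print(queue.consume())
```

The producer never talks to the consumer directly. It only writes memory and rings the doorbell, which is why it does not matter whether the producer is a CPU thread or a kernel running on the GPU.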

Anarchist4000 said:
Why not as opposed to adding multiple command processors to the GPU that only get utilized in certain circumstances? Seems a bit redundant as kernel agents on a GPU could probably fetch work entirely separate of the CP. Not too dissimilar from current multi-threading models. Really depends on how the hardware is configured. For a relatively lite processing setup, say SIMD co-processor for a CPU, dedicating the resources might be more practical as opposed to instantiating an agent each time work arrives. Should provide better latency.

The SIMD/FP "co-processor" of a CPU, say Zen, is a completely different matter. It separates the SIMD execution pipeline from the integer pipeline, but both are still part of the same state machine that appears to run a sequential stream of instructions. GPUs, on the other hand, are multiple state machines running concurrently like multi-core CPUs, and each of them runs its own sequence of instructions at its own pace.

If you cannot get your head around how levels of concurrency work in modern systems, IMO it would be hard to push the discussion in a fruitful direction.

pTmdfx · Jan 1, 2017

Anarchist4000 said:
Should provide better latency.

The real problem with this rationale is that latency is already low enough not to be a bottleneck in its expected use cases.

User-mode queues in the coherent memory already provide an easy low-latency way to enqueue data-parallel work. In an ideal HSA SoC architecture, this should have no critical difference from the system memory and signalling latency among thousands of CPU threads. The packet processor would just be an ordinary first-class coherent client in the coherence domain of memory.

If you have sufficiently large work to run, the dispatch cost is effectively hidden by the execution time that outweighs it. Not to mention that HSA and the new modern APIs already push the bar of "sufficiently large work" down significantly, at least compared to OCL1/OGL/DC.

It has to be understood that concurrency is limited. For anything that would be let down by the HSA dispatch cost, even in its ideal SoC implementation, it might be a sign that the work is not really as concurrent as you expected. In this case, running it on a few or a single CPU thread is not a sin.
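
The amortization argument can be put in numbers with a very simple model: treat the dispatch cost as a fixed overhead added to the kernel's execution time. The 10 microsecond figure below is a hypothetical placeholder, not a measured HSA latency, and `dispatch_efficiency` is a name made up for this sketch.

```python
def dispatch_efficiency(exec_time_us, dispatch_cost_us):
    """Fraction of total time spent on useful work for a single dispatch."""
    return exec_time_us / (exec_time_us + dispatch_cost_us)


DISPATCH_COST_US = 10.0  # hypothetical fixed cost, not a measured figure

for exec_time_us in (1.0, 10.0, 100.0, 1000.0):
    eff = dispatch_efficiency(exec_time_us, DISPATCH_COST_US)
    print(f"{exec_time_us:7.1f} us of work -> {eff:.1%} useful")
```

Under this model the overhead only dominates when the kernel itself is about as short as the dispatch, which is the case where the work is probably too small to be worth sending to a throughput device.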

Anarchist4000 · Jan 1, 2017

pTmdfx said:
It is the structured circular buffers (a.k.a. user mode queues) in the coherent memory which are aggregating the packets. The packet processor reads them and manipulates them based on the atomic queue indexes. How many of them are being read at a time is up to the implementation.

Through the CP according to your definition of the model. That's what I've been getting at with the CPU being able to dispatch work more directly. In the model I'm envisioning with coherent queues the packet processor doesn't need to manipulate them. Any agent can do it directly.

pTmdfx said:
In the ideal world of HSA you can have tens of processes competing for an agent, and queues can have unlimited packets enqueued. Managing queuing resources purely on-chip is simply impractical. Just look at how the modern graphics API shares a similar queue-based execution model with HSA.

Similar, but not all hardware implementations have the same hardware scheduling capabilities. Most GCN chips are capable of queues, but only the more modern Fiji, Polaris, and Carrizo have the hardware implementation. Tonga has it as well but is lacking in the scheduler quantity to my understanding. So while the model is somewhat shared, there are some distinct differences, primarily around the CP to my understanding.

pTmdfx said:
You seem to be deeply confused about how a GPU fundamentally works. The command processors and the scheduling layer are just the work scheduler. Generally speaking, they create work based on the description in the "commands", a.k.a. packets, and by work I mean an invocation of a shader program or compute kernel, which is specified in the packet as a pointer to the program entry point (the first instruction).

The created work is distributed to CUs. But the CP/scheduler does not do the work, does not care how the work is done, and does not instruct the CUs on how to complete it. It just needs to know whether it is completed (in a simplified model).

Each CU is essentially a 40-way SMT CPU core with a super wide vector unit. A thread (i.e. a subset of the created work) is "context-switched" in by the front-end. The CUs are responsible for running the preset shader program from beginning to end. Under HSA, they have full access to the flat coherent memory space. So whatever memory accesses they do are none of the command processors' business.

Because I'm not describing a traditional GPU here. I'm describing a heterogeneous architecture. Even while "just the work scheduler" there is still actual work involved in scheduling, especially scheduling intelligently.

pTmdfx said:
Now for the definition of an HSA agent: you have to understand that HSA is a data-parallel programming model. An agent is not a CU or a CPU core (these are implementation details) but just something that can run a huge data-parallel problem in general.

HSA agents execute "kernel dispatches", but that doesn't mean the packet processor itself instructs everything. A kernel dispatch represents a problem of arbitrary size. Say, given a 1920 * 1080 dispatch, GCN breaks it down into 32,400 64-wide wavefronts and runs them concurrently on all CUs. Each wavefront runs the kernel program from beginning to end and has its own program counter. They are concurrent, despite being spawned from a single command in the queue. The packet processor does not care how the CUs run them. It just cares about when the dispatch is completed, and then it moves on to the next packet.

A packet is a high-level description of a problem. A packet processor parses the packet to break it down into smaller pieces, and the CUs are the ones that actually solve the problem.

The model I was describing had a CPU and CU (or some partitioning) more closely coupled. The implementation details, as you said. Agents are like threads: they don't care what they run, just that they're running when necessary. To clarify, "thread" here does not mean an individual lane of a wave. That thread, while probably just manipulating data, could actually be creating packets or work to be passed to one or more other agents. That wouldn't be unreasonable for some sort of large scale sorting. That could also be an agent scheduling work that ultimately came back to it.

My argument here is that the CP isn't necessarily involved in the process with a CPU/scalar agent being able to create and dispatch packets directly. The traditional GPU method on modern and upcoming hardware may very well be different. The graphics model had the CP pulling packets from the CPU to be distributed in whatever manner was appropriate for the hardware. In the case of recent GCN variations that was filling queues on 4 ACEs with 2 HWS units determining which and where work was scheduled. I'm unsure this is still the case with modern/upcoming hardware.

pTmdfx said:
The SIMD/FP "co-processor" of a CPU, say Zen, is a completely different matter. It separates the SIMD execution pipeline from the integer pipeline, but both are still part of the same state machine that appears to run a sequential stream of instructions. GPUs, on the other hand, are multiple state machines running concurrently like multi-core CPUs, and each of them runs its own sequence of instructions at its own pace.
...
The real problem with this rationale is that latency is already low enough not to be a bottleneck in its expected use cases.

User-mode queues in the coherent memory already provide an easy low-latency way to enqueue data-parallel work. In an ideal HSA SoC architecture, this should have no critical difference from the system memory and signalling latency among thousands of CPU threads. The packet processor would just be an ordinary first-class coherent client in the coherence domain of memory.

If you have sufficiently large work to run, the dispatch cost is effectively hidden by the execution time that outweighs it. Not to mention that HSA and the new modern APIs already push the bar of "sufficiently large work" down significantly, at least compared to OCL1/OGL/DC.

It has to be understood that concurrency is limited. For anything that would be let down by the HSA dispatch cost, even in its ideal SoC implementation, it might be a sign that the work is not really as concurrent as you expected. In this case, running it on a few or a single CPU thread is not a sin.

I'm suggesting a change to the model and expected use cases. Completely removing the larger SIMD/vector instructions from the CPU in favor of execution on a GPU or co-processor to make cores smaller. I'm well aware of how the concurrency works. As for the latency expectations, it would depend on just what was being executed. Say looping through the square root of a dot product or a circumstance where the required instructions weren't present on each processor. Requiring execution to bounce between processors. Assumption being the CPU couldn't effectively calculate the dot product for whatever reason. Attempt to use HSA as an extension of the CPU's instruction set more than the current design. Limit the impact of scheduling. Expectations are it would be roughly equivalent in speed to bouncing between CPU cores each instruction. If a vector/throughput agent was exclusive to a core waiting out the result might be more practical. More along the lines of operating without cache or bandwidth constraints. Going back to the high performance scalar on GPU concept, I'd expect a good deal of work to be dispatched to the GPU. Relegating Zen cores to obscure instructions and prediction with the GPU portion doing most other work. Especially anything bandwidth constrained where the cache and prediction mattered less. That's why I'm arguing for even lower latency as traditional GPU latency hiding would break down.

pTmdfx · Jan 1, 2017

Anarchist4000 said:
Through the CP according to your definition of the model. That's what I've been getting at with the CPU being able to dispatch work more directly. In the model I'm envisioning with coherent queues the packet processor doesn't need to manipulate them. Any agent can do it directly.

Any agent can manipulate the queue by enqueuing to it. The packet processor consumes packets from the queue. It is a producer-consumer model. The whole enqueuing model is memory based, so there is nothing more you can do than write to the memory and trigger a signal. It is also intentionally designed to be memory based so that it is implementation agnostic.

Anarchist4000 said:
Similar, but not all hardware implementations have the same hardware scheduling capabilities. Most GCN chips are capable of queues, but only the more modern Fiji, Polaris, and Carrizo have the hardware implementation. Tonga has it as well but is lacking in the scheduler quantity to my understanding. So while the model is somewhat shared, there are some distinct differences, primarily around the CP to my understanding.

All GCN GPUs are capable of handling queues in hardware. This is what ACEs are all about. Not sure where you got this false impression.

If what you meant is that only the later GPUs have HWS: HWS is just a hardware mechanism that dynamically binds/schedules user mode queues and virtual memory address spaces from all HSA processes (as runlists) to the limited hardware queue slots (8 per ACE) and VMID slots. Regardless of having HWS or not, all GCN GPUs still run queues in hardware.
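
The binding of many user mode queues onto a limited number of hardware slots can be pictured with a small round-robin sketch. This is purely illustrative: the 8 slots per ACE come from the post, but the ACE count, the function name `schedule_rounds`, and the plain round-robin policy are assumptions. The actual HWS policy (priorities, preemption, VMID handling) is not modeled.

```python
from collections import deque

SLOTS_PER_ACE = 8  # from the post; everything else here is assumed


def schedule_rounds(runlist, num_aces=4, rounds=3):
    """Yield, per time slice, the user mode queues bound to hardware slots.

    Queues that do not fit in the current slice wait at the front of the
    line, and queues that just ran go to the back.
    """
    slots = num_aces * SLOTS_PER_ACE
    pending = deque(runlist)
    for _ in range(rounds):
        bound = [pending.popleft() for _ in range(min(slots, len(pending)))]
        yield bound
        pending.extend(bound)  # queues that ran rejoin behind the waiters


# 40 hypothetical user mode queues competing for 4 * 8 = 32 slots.
for time_slice, bound in enumerate(schedule_rounds(range(40), rounds=2)):
    print(time_slice, bound)
```

The point of the sketch is only that the number of user mode queues is decoupled from the number of hardware slots; software sees all 40 queues as live even though only 32 are mapped at any moment.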

Anarchist4000 said:
Because I'm not describing a traditional GPU here. I'm describing a heterogeneous architecture. Even while "just the work scheduler" there is still actual work involved in scheduling, especially scheduling intelligently. [...]

Still speaking in terms of the HSA model here: the packet processor encapsulates the entirety of work scheduling and execution of the agent, so an agent would only see other agents' memory accesses, and communicates through the coherent memory. If you make the CPU lean on the internals of a GPU, the encapsulation is broken.

Even if you turn the packets into a CPU instruction (which carries too much state to be one), it would just be a super long latency synchronous instruction that waits for the completion of the spawned work. To submit the work, it still has to go through the interconnect for the CUs, unless you are going to bind CUs to the CPU. In exchange, you break the asynchronous vendor independent execution model, put some GPU queueing logic in a clock domain that is way more than what it needs, clutter the on-chip interconnect with GPU internal specifics (the CU scheduling layer), and potentially spread CUs everywhere for niche benefits that probably no one needs.

I can run timing-critical and low DLP SIMD routines on CPUs efficiently with all the high performance caches anyway, if the latency of HSA is not low enough for my workload. So why would I ever need it? HSA/OCL at least has quite a bunch of ideal use cases behind it, although no one really delivers (yet). Is there a real use case, one that really needs integration beyond the asynchronous model that is just starting to bloom, other than it seeming better?

Honestly, there are a lot of things on paper, and on paper for good.

3dilettante · Jan 1, 2017

Anarchist4000 said:
Why not as opposed to adding multiple command processors to the GPU that only get utilized in certain circumstances?

The real-time or control implementations reduce sources of variable latency or faults, which highly speculative and wide cores introduce, as well as the wide and unpredictable memory hierarchies/networks needed to support them.
The packet processing step is not a dominant portion of the overall computation, but events that cause a host core to rely on OoO to hide latency or experience unpredictable latencies become a broader stall condition downstream.

The controller cores are small and configured to cater to the specific needs of the GPU domain, so I am not certain of many cases where that's a greater loss than forcing a host CPU orders of magnitude more complex to do this job, when the host domain is hostile to the kind of data flow and communications requirements of the task.
With all the talk of modularizing the interconnect, this is reversing the abstraction process of the GPU versus CPU, exposing elements the command processor encapsulates. If the CPU domain is altered to better handle this, it brings in the risk of cascading changes from somewhere in the GPU all the way into the host CPU core, which is a very unforgiving domain to pin a high rate of evolution on.

Seems a bit redundant as kernel agents on a GPU could probably fetch work entirely separate of the CP.

The GPU as a whole is what is presented as a kernel agent. The CP is not separate from that, but what it and the other controllers do is abstract the particulars from the CUs, while allowing the internals of the GPU to be shared or virtualized.
What the CP does in addition to that is help manage the heavy legacy and specialized state of the CPU, manage the physical device, and in part help give the illusion that there is a "GPU" rather than separate subunits.

Not too dissimilar from current multi-threading models. Really depends on how the hardware is configured. For a relatively lite processing setup, say SIMD co-processor for a CPU, dedicating the resources might be more practical as opposed to instantiating an agent each time work arrives. Should provide better latency.

That goes more to objections raised that HSA in general and its kernel-based architecture in particular is a waste of time, but that is not what AMD is counting on.

For a large transaction, but for frequent small transactions it would compromise the CP and add overhead. No different than overwhelming the CP by submitting individual triangles as draw calls. Given an appropriate architecture, it would make more sense to have a plurality of CPs if the hardware is broken down into a handful of units.

Data amplification would be a concern if squeezing commands through a narrow bus. In theory these designs have full access to system memory bandwidth based on leaks we've seen.

It's a waste of power, worse latency, and additional storage burden, if used in a direct link from CPU to graphics domain. If we are going back to using the main hierarchy and interconnect, it is contention and variability in the control path, which the GPU is less able to hide latency for.
The graphics context is bigger than compute, which is why preemption and context switching were more readily introduced for compute. Having bandwidth, if not being consumed by actual compute, doesn't change adding trips to memory that currently do not happen.

My whole understanding of HSA was a single thread being able to alternate between scalar/latency and vector/throughput somewhat transparently. A design ideally negating the need for SIMD processors on a CPU. Use HSA transparently as opposed to larger AVX instructions that likely lack the hardware to sustain throughput.

HSA is an attempt to provide a virtual machine model that allows devices in a system that have gained programmability to be used similarly to what CPUs traditionally were. To do this, it sets down various minimums for hardware compliance, and a method to dispatch work in a way that abstracts the devices' internals from code that is meant to be portable after it is run through a chain of intermediate translators and compilers. The system at run-time could get the overall context of the HSA program, and potentially have a different device run it, or arbitrate parts of that device out due to other services or virtual clients.

This is some number of layers of abstraction higher than making a CPU drop AVX support, and frankly it would be a poor substitute besides being architecture-breaking.
Less charitably, HSA was AMD's way of codifying what it was doing anyway with GPUs and hoping to get a smattering of other vendors threatened by Intel and the difficulty in getting better process nodes to sign on and give it legitimacy. That was pointedly given as a motivation for the 1.1 revision of HSA, where others complained that it was mainly suited for AMD's implementations.

The main boosters for HSA, like ARM and AMD, have found significant portions of it to be unnecessary for their products.

pTmdfx · Jan 2, 2017

3dilettante said:
The main boosters for HSA, like ARM and AMD, have found significant portions of it to be unnecessary for their products.

I wouldn't say "significant". AMD does keep all the important bits (IMO) of HSA in their ROCm stack, except the HSA IL bits that enable runtime compatibility. It is quite sad to see the status quo though. I have always hoped somehow things would converge into a united front so that we can get real (?) cross-vendor heterogeneous platform libraries. But at least OpenCL 2.0 doesn't seem to be getting there (yet).

Anarchist4000 · Jan 2, 2017

pTmdfx said:
All GCN GPUs are capable of handling queues in hardware. This is what ACEs are all about. Not sure where you got this false impression.

If what you meant is that only the later GPUs have HWS: HWS is just a hardware mechanism that dynamically binds/schedules user mode queues and virtual memory address spaces from all HSA processes (as runlists) to the limited hardware queue slots (8 per ACE) and VMID slots. Regardless of having HWS or not, all GCN GPUs still run queues in hardware.

CI is the first generation with support for user queues, HW scheduling and AQL, but there's a limit on MEC microcode store size so at the moment we can't fit support for PM4 (what graphics uses), AQL (what HSA uses) and HW scheduling (what HSA also uses) in a single image. Kaveri has two MEC blocks so we were able to configure one for AQL+HWS and the other for AQL+PM4, but IIRC the dGPUs only have a single MEC block so that approach won't work there.

VI doesn't have those limits, and it also adds the ability to interrupt the execution of long-running shader threads and roll the context out to memory so you can context-switch even if individual shaders are running for seconds or minutes. We call that Compute Wave Save/Restore (CWSR) and the latest KFD release (part of the Boltzmann stack under GPUOpen) includes CWSR support for Carrizo.
Link

To varying degrees of success. That's why I'm suggesting the model is changing. It's possible this has since changed, but only the newer models had full hardware support despite having queues. That response would also suggest they want all behavior working simultaneously for future designs.

pTmdfx said:
Even if you turn the packets into a CPU instruction (which carries too much state to be one), it would just be a super long latency synchronous instruction that waits for the completion of the spawned work. To submit the work, it still has to go through the interconnect for the CUs, unless you are going to bind CUs to the CPU. In exchange, you break the asynchronous vendor independent execution model, put some GPU queueing logic in a clock domain that is way more than what it needs, clutter the on-chip interconnect with GPU internal specifics (the CU scheduling layer), and potentially spread CUs everywhere for niche benefits that probably no one needs.

That's precisely what I'm advocating. With the reservation features they have implemented the vendor independent model isn't affected. It's still there, just some CUs being re-purposed. Take a couple CUs and dedicate them for audio acceleration or another task. Like what they've currently done with TrueAudio and the CU reservation example on GPUOpen. I'm sure there are other possibilities. The model ultimately still exists, it's just short a few cores. A CU reserved for indirect execution or queuing might be interesting as well.

3dilettante said:
The GPU as a whole is what is presented as a kernel agent. The CP is not separate from that, but what it and the other controllers do is abstract the particulars from the CUs, while allowing the internals of the GPU to be shared or virtualized.
What the CP does in addition to that is help manage the heavy legacy and specialized state of the CPU, manage the physical device, and in part help give the illusion that there is a "GPU" rather than separate subunits.

I'm suggesting it be presented as multiple agents as opposed to one giant agent. While yes that's how we've seen them implemented in the past, it won't necessarily continue. With a configurable fabric specializing segments to reduce contention of resources is likely beneficial.

I'm not advocating completely removing the CP, especially for graphics, just allowing the device to be partitioned for whatever reason. Avoid having to send latency sensitive packets through the CP for a task not ultimately related to graphics. Segregate the device into a compute and graphics portion. Not necessarily just two portions either. In the case of the ARM processor, I'm not expecting a giant core. It may very well be a smaller micro, but ideally I think the design might benefit from a more configurable control node. A design more suitable for handling multiple GPU cores, additional memory or IO pools, or futuristic microcode implementations. It's difficult to say how valid that idea is without knowing just how much wiring is required to properly integrate it. A lot of the actual control and signaling logic I expect to be in the queuing mechanisms. Whatever the geometry distribution mechanism is may reduce the links further. I'm suggesting ARM as it should be easier to integrate with other devices for any partner designs.

pTmdfx · Jan 2, 2017

Anarchist4000 said:
That's precisely what I'm advocating. With the reservation features they have implemented the vendor independent model isn't affected. It's still there, just some CUs being re-purposed. Take a couple CUs and dedicate them for audio acceleration or another task. Like what they've currently done with TrueAudio and the CU reservation example on GPUOpen. I'm sure there are other possibilities. The model ultimately still exists, it's just short a few cores. A CU reserved for indirect execution or queuing might be interesting as well.

There are always possibilities. There are tons of alternatives to implement heterogeneous systems on paper. You are not the first to come up with this either. AMD had a similar patent many years ago — filed after the merger, but way before HSA was a thing.

The point is what definitive benefits they would bring over what you already have, and whether the cost is worth it. At this point, the question is always if the lower dispatch latency — not the kernel execution time — gives an advantage, and it remains unanswered. Let's not forget the dispatch latency is independent of the problem size, and data are still being exchanged through the memory hierarchy.

AMD has apparently answered the question with HSA for you.

Working in user-mode and coherent memory ideally already gives a dispatch latency similar to task-based parallelism on a multi-core CPU. If the necessity is not addressed in the first place, any discussion on the implementation details is irrelevant.
