AMD Architecture Discussion

Discussion in 'Architecture and Products' started by Anarchist4000, Dec 19, 2016.

  1. Anarchist4000

    Veteran Regular

    Joined:
    May 8, 2004
    Messages:
    1,439
    Likes Received:
    359
    Unless it can pull all packets in parallel, it's a single point aggregating packets. In a design where system memory should be directly accessible to the execution units, why go through the CP to retrieve data? When texturing from system memory, do CUs normally have to submit packets to the CP to retrieve data? The point being, if an agent on the GPU can read system memory or has a coherent cache, the CP can probably be bypassed. The CP makes more sense with a large graphics workload, with work to distribute and synchronize while pulling in fixed function units etc.

    That would seem to indicate a kernel can queue its own work, although I'm not sure that would ultimately be necessary.

    Why not, as opposed to adding multiple command processors to the GPU that only get utilized in certain circumstances? Seems a bit redundant, as kernel agents on a GPU could probably fetch work entirely separately from the CP. Not too dissimilar from current multi-threading models. Really depends on how the hardware is configured. For a relatively light processing setup, say a SIMD co-processor for a CPU, dedicating the resources might be more practical as opposed to instantiating an agent each time work arrives. Should provide better latency.

    For a large transaction, but for frequent small transactions it would compromise the CP and add overhead. No different than overwhelming the CP by submitting individual triangles as draw calls. Given an appropriate architecture, it would make more sense to have a plurality of CPs if the hardware is broken down into a handful of units. Data amplification would be a concern if squeezing commands through a narrow bus. In theory these designs have full access to system memory bandwidth, based on leaks we've seen. My whole understanding of HSA was a single thread being able to alternate between scalar/latency and vector/throughput somewhat transparently. A design ideally negating the need for SIMD processors on a CPU. Use HSA transparently as opposed to larger AVX instructions that likely lack the hardware to sustain throughput.
     
  2. pTmdfx

    Newcomer

    Joined:
    May 27, 2014
    Messages:
    249
    Likes Received:
    128
    It is the structured circular buffers (a.k.a. user mode queues) in coherent memory that aggregate the packets. The packet processor reads them and manipulates them based on the atomic queue indexes. How many of them are read at a time is up to the implementation.

    Then, on top of the queuing model, you can have an unlimited number of queues. Each queue is a command stream independent of the others, so you can have multiple concurrent packet processors that work on different queues at the same time.
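To make the queue description above concrete, here is a minimal sketch of a ring buffer driven by monotonically increasing read and write indexes, plus one consumer servicing several independent queues. It is a toy model, not the HSA specification's layout: the names `UserModeQueue` and `drain_round_robin` are made up for illustration, the power-of-two size is an assumption, and plain integer increments stand in for the atomic operations real hardware would use.

```python
from dataclasses import dataclass, field


@dataclass
class UserModeQueue:
    """Toy ring buffer; names and layout are illustrative, not the HSA spec."""
    size: int                 # number of packet slots (assumed power of two)
    write_index: int = 0      # advanced by producers (atomic in real hardware)
    read_index: int = 0       # advanced by the packet processor
    slots: list = field(default_factory=list)

    def __post_init__(self):
        assert self.size > 0 and self.size & (self.size - 1) == 0
        self.slots = [None] * self.size

    def enqueue(self, packet) -> bool:
        # The indexes only ever grow; their difference is the fill level.
        if self.write_index - self.read_index == self.size:
            return False      # ring is full, producer must retry later
        self.slots[self.write_index & (self.size - 1)] = packet
        self.write_index += 1
        return True

    def dequeue(self):
        if self.read_index == self.write_index:
            return None       # nothing pending
        packet = self.slots[self.read_index & (self.size - 1)]
        self.read_index += 1
        return packet


def drain_round_robin(queues):
    """One hypothetical packet processor visiting independent queues in turn."""
    order = []
    pending = True
    while pending:
        pending = False
        for queue_id, queue in enumerate(queues):
            packet = queue.dequeue()
            if packet is not None:
                order.append((queue_id, packet))
                pending = True
    return order
```

Because each queue carries its own indexes, nothing stops a second consumer from draining a different subset of queues at the same time, which is the point made about multiple packet processors.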

    In the ideal world of HSA you can have tens of processes competing for an agent, and queues can have unlimited packets enqueued. Managing queuing resources purely on-chip is simply impractical. Just look at how modern graphics APIs share a similar queue-based execution model with HSA.

    You seem to have a deep confusion about how a GPU fundamentally works. The command processors and the scheduling layer are just the work scheduler. Generally speaking, they create work based on the description in the "command", a.k.a. the packet, and by work I mean an invocation of a shader program or compute kernel that is specified in the packet as a pointer to the program entry point (the first instruction).

    The created work is distributed to CUs. But the CP/scheduler does not do the work, does not care how the work is done, and does not instruct the CUs on how to complete it. It just needs to know whether the work is completed (in a simplified model).

    Each CU is essentially a 40-way SMT CPU core with a super wide vector unit. A thread (i.e. a subset of the created work) is "context-switched" into it by the front-end. The CU is responsible for running the preset shader program from beginning to end. Under HSA, CUs have full access to the flat coherent memory space, so whatever memory accesses they do are none of the command processors' business.

    Then now for the definition of HSA agent, you have to understand that HSA is a data-parallel programming model. An agent is not a CU or a CPU core — these are implementation details — but just something that can run a huge data-parallel problem in general.

    HSA agents execute "kernel dispatches", but that doesn't mean the packet processor itself instructs everything. A kernel dispatch represents a problem of arbitrary size. Say, given a 1920 * 1080 dispatch, GCN breaks it down into 32400 wavefronts (1920 * 1080 work-items at 64 per wavefront) and runs them concurrently on all CUs. Each wavefront runs the kernel program from beginning to end and has its own program counter. They are concurrent, despite being spawned from a single command in the queue. The packet processor does not care how the CUs run them; it just cares about when the dispatch is completed, and then it moves on to the next packet.
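The breakdown of one dispatch into wavefronts is simple arithmetic, sketched below. A GCN wavefront holds 64 work-items, so a 1920 * 1080 grid comes to 32400 wavefronts when workgroups tile the grid evenly. The function name and the workgroup shapes are assumptions for illustration; the second shape shows how partial workgroups at the grid edge add padding.

```python
import math

WAVEFRONT_SIZE = 64  # work-items per GCN wavefront


def wavefronts_for_grid(width, height, workgroup=(8, 8)):
    """Count wavefronts for a 2D dispatch (hypothetical helper).

    Workgroups that hang over the grid edge are still launched whole,
    so the count can exceed width * height / 64.
    """
    wg_x, wg_y = workgroup
    groups = math.ceil(width / wg_x) * math.ceil(height / wg_y)
    waves_per_group = math.ceil(wg_x * wg_y / WAVEFRONT_SIZE)
    return groups * waves_per_group


even = wavefronts_for_grid(1920, 1080, workgroup=(8, 8))      # tiles evenly
padded = wavefronts_for_grid(1920, 1080, workgroup=(16, 16))  # 1080 / 16 is not whole
```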

    A packet is a high-level description of a problem. A packet processor parses the packet to break it down into smaller pieces, and the CUs are the ones that actually solve the problem.

    A kernel can enqueue to any queue, as long as it has the address of the user mode queue. As queues are stored in coherent memory, it is just a matter of atomic operations and signalling, which is the whole point of HSA.

    The SIMD/FP "co-processor" of a CPU, say Zen, is a completely different matter. It separates the SIMD execution pipeline from the integer pipeline, but both are still part of the same state machine that appears to run a single sequential stream of instructions. GPUs, on the other hand, are multiple state machines running concurrently like multi-core CPUs, and each of them runs its own sequential stream of instructions at its own pace.

    If you cannot get your head around how levels of concurrency work in modern systems, IMO it would be hard to push the discussion to a fruitful direction.
     
    #62 pTmdfx, Jan 1, 2017
    Last edited: Jan 1, 2017
  3. pTmdfx

    The real problem with this rationale is that latency is already low enough not to be a bottleneck in its expected use cases.

    User-mode queues in the coherent memory already provide an easy low-latency way to enqueue data-parallel work. In an ideal HSA SoC architecture, this should have no critical difference from the system memory and signalling latency among thousands of CPU threads. The packet processor would just be an ordinary first-class coherent client in the coherence domain of memory.

    If you have sufficiently large work to run, the dispatch cost is effectively hidden by the execution time that outweighs it. Not to mention that HSA and the new modern APIs already push the bar of "sufficiently large work" down significantly, at least compared to OCL1/OGL/DC.
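The amortization argument can be put into numbers with a short sketch. The latency figures used here are placeholders chosen for illustration, not measurements of any HSA implementation, and both function names are hypothetical.

```python
def dispatch_overhead_fraction(dispatch_latency_us, kernel_time_us):
    """Share of total time spent on dispatch rather than execution."""
    return dispatch_latency_us / (dispatch_latency_us + kernel_time_us)


def break_even_kernel_time(dispatch_latency_us, max_overhead=0.05):
    """Kernel run time at which dispatch overhead drops to max_overhead."""
    return dispatch_latency_us * (1.0 - max_overhead) / max_overhead


# Placeholder numbers: a 10 us dispatch in front of kernels of varying length.
for kernel_us in (10, 100, 1000, 10000):
    share = dispatch_overhead_fraction(10, kernel_us)
    print(f"kernel {kernel_us:>6} us -> dispatch is {share:.1%} of total")
```

Lowering the dispatch latency lowers the break-even kernel time in direct proportion, which is one way to read "pushing the bar of sufficiently large work down".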

    It has to be understood that concurrency is limited. For anything that would be let down by the HSA dispatch cost, even in its ideal SoC implementation, it might be a sign that the work is not really as concurrent as you expected. In this case, running it on a few or a single CPU thread is not a sin.
     
    #63 pTmdfx, Jan 1, 2017
    Last edited: Jan 1, 2017
  4. Anarchist4000

    Through the CP according to your definition of the model. That's what I've been getting at with the CPU being able to dispatch work more directly. In the model I'm envisioning with coherent queues the packet processor doesn't need to manipulate them. Any agent can do it directly.

    Similar, but not all hardware implementations have the same hardware scheduling capabilities. Most GCN chips are capable of queues, but only the more modern Fiji, Polaris, and Carrizo have the hardware implementation. Tonga has it as well but is lacking in the scheduler quantity to my understanding. So while the model is somewhat shared, there are some distinct differences, primarily around the CP to my understanding.

    Because I'm not describing a traditional GPU here. I'm describing a heterogeneous architecture. Even while "just the work scheduler" there is still actual work involved in scheduling, especially scheduling intelligently.

    The model I was describing had a CPU and CU (or some partitioning) more closely coupled. The implementation details, as you said. Agents are like threads: they don't care what they run, just that they're running when necessary. To clarify, "thread" here does not mean an individual lane of a wave. That thread, while probably just manipulating data, could actually be creating packets or work to be passed to one or more other agents. That wouldn't be unreasonable for some sort of large scale sorting. That could also be an agent scheduling work that ultimately came back to it.

    My argument here is that the CP isn't necessarily involved in the process with a CPU/scalar agent being able to create and dispatch packets directly. The traditional GPU method on modern and upcoming hardware may very well be different. The graphics model had the CP pulling packets from the CPU to be distributed in whatever manner was appropriate for the hardware. In the case of recent GCN variations that was filling queues on 4 ACEs with 2 HWS units determining which and where work was scheduled. I'm unsure this is still the case with modern/upcoming hardware.

    I'm suggesting a change to the model and expected use cases. Completely removing the larger SIMD/vector instructions from the CPU in favor of execution on a GPU or co-processor to make cores smaller. I'm well aware of how the concurrency works. As for the latency expectations, it would depend on just what was being executed. Say looping through the square root of a dot product or a circumstance where the required instructions weren't present on each processor. Requiring execution to bounce between processors. Assumption being the CPU couldn't effectively calculate the dot product for whatever reason. Attempt to use HSA as an extension of the CPU's instruction set more than the current design. Limit the impact of scheduling. Expectations are it would be roughly equivalent in speed to bouncing between CPU cores each instruction. If a vector/throughput agent was exclusive to a core waiting out the result might be more practical. More along the lines of operating without cache or bandwidth constraints. Going back to the high performance scalar on GPU concept, I'd expect a good deal of work to be dispatched to the GPU. Relegating Zen cores to obscure instructions and prediction with the GPU portion doing most other work. Especially anything bandwidth constrained where the cache and prediction mattered less. That's why I'm arguing for even lower latency as traditional GPU latency hiding would break down.
     
  5. pTmdfx

    Any agent can manipulate the queue by enqueuing to it. The packet processor consumes packets from the queue. It is a producer-consumer model. The whole enqueuing model is memory based, so there is nothing more you can do than write to memory and trigger a signal. It is also intentionally designed to be memory based so that it is implementation agnostic.
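The producer-consumer relationship can be sketched with ordinary threads. This is only an analogy under stated assumptions: a `threading.Condition` stands in for the HSA signal (doorbell), a lock replaces the lock-free atomic index updates a real queue would use, and `SignalledQueue`, `run_demo`, `agent` and `packet_processor` are hypothetical names.

```python
import threading
from collections import deque


class SignalledQueue:
    """Memory-backed queue plus a signal; a lock stands in for atomics."""

    def __init__(self):
        self._packets = deque()
        self._doorbell = threading.Condition()

    def enqueue(self, packet):
        # Producer side: write the packet to memory, then ring the doorbell.
        with self._doorbell:
            self._packets.append(packet)
            self._doorbell.notify()

    def wait_and_dequeue(self):
        # Consumer side: sleep until signalled, then take the oldest packet.
        with self._doorbell:
            while not self._packets:
                self._doorbell.wait()
            return self._packets.popleft()


def run_demo(num_producers=3, packets_each=4):
    queue = SignalledQueue()
    consumed = []
    total = num_producers * packets_each

    def packet_processor():
        for _ in range(total):
            consumed.append(queue.wait_and_dequeue())

    def agent(agent_id):
        for n in range(packets_each):
            queue.enqueue((agent_id, n))

    consumer = threading.Thread(target=packet_processor)
    producers = [threading.Thread(target=agent, args=(i,))
                 for i in range(num_producers)]
    consumer.start()
    for p in producers:
        p.start()
    for p in producers:
        p.join()
    consumer.join()
    return consumed
```

The producers never call into the consumer; they only touch the shared queue and the signal, which is what keeps the model agnostic about what sits on the other side.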

    All GCN GPUs are capable of handling queues in hardware. This is what ACEs are all about. Not sure where you got this false impression.

    If what you meant is that only the later GPUs have HWS, HWS is just a hardware mechanism that dynamically binds/schedules user mode queues and virtual memory address spaces from all HSA processes (as runlists) to the limited hardware queue slots (8 per ACE) and VMID slots. Regardless of having HWS or not, all GCN GPUs still run queues in hardware.
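As a rough illustration of binding many user mode queues to a limited number of hardware slots, the sketch below rotates a runlist through the available slots. This is not a description of AMD's actual scheduler: the plain rotation policy, the function name `schedule_runlist`, and the absence of VMID handling are all simplifying assumptions. Only the figure of 8 slots per ACE comes from the post.

```python
from collections import deque

SLOTS_PER_ACE = 8  # hardware queue slots per ACE, as stated in the post


def schedule_runlist(user_queues, num_aces, rounds):
    """Bind queues from a runlist to hardware slots, one batch per round.

    Hypothetical policy: take queues from the front of the runlist until
    the slots are full, then rotate them to the back for the next round.
    """
    slots = num_aces * SLOTS_PER_ACE
    waiting = deque(user_queues)
    history = []
    for _ in range(rounds):
        bound = [waiting.popleft() for _ in range(min(slots, len(waiting)))]
        history.append(bound)
        waiting.extend(bound)  # bound queues go to the back of the runlist
    return history
```

With 20 user queues and 2 ACEs, for example, 16 queues are resident at a time and the remaining 4 get their turn in the following round.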

    Still speaking in terms of the HSA model here: the packet processor encapsulates the entirety of work scheduling and execution of the agent, so an agent only sees other agents' memory accesses and communicates through coherent memory. If you make the CPU lean on the internals of a GPU, the encapsulation is broken.

    Even if you turn the packets into a CPU instruction (which carries too much state to be one), it would just be an extra-long-latency synchronous instruction that waits for the completion of the spawned work. To submit the work, it still has to go through the interconnect to the CUs, unless you are going to bind CUs to the CPU. In exchange, you break the asynchronous, vendor-independent execution model, put some GPU queueing logic in a clock domain that is far faster than it needs, clutter the on-chip interconnect with GPU-internal specifics (the CU scheduling layer), and potentially spread CUs everywhere for niche benefits that probably no one needs.

    I can run timing-critical, low-DLP SIMD routines on CPUs efficiently with all the high performance caches anyway, if the latency of HSA is not low enough for my workload. So why would I ever need it? HSA/OCL at least has quite a bunch of ideal use cases behind it, although no one really delivers (yet). Is there a real use case that really needs integration beyond the asynchronous model that is just starting to bloom, other than it seeming better?

    Honestly, there are a lot of things on paper, and on paper for good.
     
    #65 pTmdfx, Jan 1, 2017
    Last edited: Jan 1, 2017
    Gubbi likes this.
  6. 3dilettante

    Legend Alpha

    Joined:
    Sep 15, 2003
    Messages:
    8,120
    Likes Received:
    2,867
    Location:
    Well within 3d
    The real-time or control implementations reduce sources of variable latency or faults, which highly speculative and wide cores introduce, as well as the wide and unpredictable memory hierarchies/networks needed to support them.
    The packet processing step is not a dominant portion of the overall computation, but events that cause a host core to rely on OoO to hide latency or experience unpredictable latencies become a broader stall condition downstream.

    The controller cores are small and configured to cater to the specific needs of the GPU domain, so I am not certain of many cases where that's a greater loss than forcing a host CPU orders of magnitude more complex to do this job, when the host domain is hostile to the kind of data flow and communications requirements of the task.
    With all the talk of modularizing the interconnect, this is reversing the abstraction process of the GPU versus CPU, exposing elements the command processor encapsulates. If the CPU domain is altered to better handle this, it brings in the risk of cascading changes from somewhere in the GPU all the way into the host CPU core, which is a very unforgiving domain to pin a high rate of evolution on.

    The GPU as a whole is what is presented as a kernel agent. The CP is not separate from that, but what it and the other controllers do is abstract the particulars from the CUs, while allowing the internals of the GPU to be shared or virtualized.
    What the CP does in addition to that is help manage the heavy legacy and specialized state of the CPU, serves to manage the physical device, and in part helps give the illusion that there is a "GPU" rather than separate subunits.

    That goes more to objections raised that HSA in general and its kernel-based architecture in particular is a waste of time, but that is not what AMD is counting on.

    For a large transaction, but for frequent small transactions it would compromise the CP and add overhead. No different than overwhelming the CP by submitting individual triangles as draw calls. Given an appropriate architecture, it would make more sense to have a plurality of CPs if the hardware is broken down into a handful of units.
    It's a waste of power, worse latency, and additional storage burden, if used in a direct link from CPU to graphics domain. If we are going back to using the main hierarchy and interconnect, it is contention and variability in the control path, which the GPU is less able to hide latency for.
    The graphics context is bigger than compute, which is why preemption and context switching were more readily introduced for compute. Having bandwidth, if not being consumed by actual compute, doesn't change adding trips to memory that currently do not happen.

    HSA is an attempt to provide a virtual machine model that allows devices in a system that have gained programmability to be used similarly to how CPUs traditionally were. To do this, it sets down various minimums for hardware compliance, and a method to dispatch work in a way that abstracts the devices' internals from code that is meant to be portable after it is run through a chain of intermediate translators and compilers. The system at run-time could get the overall context of the HSA program, and potentially have a different device run it, or arbitrate parts of that device out due to other services or virtual clients.

    This is some number of layers of abstraction higher than making a CPU drop AVX support, and frankly it would be a poor substitute besides being architecture-breaking.
    Less charitably, HSA was AMD's way of codifying what it was doing anyway with GPUs and hoping to get a smattering of other vendors, threatened by Intel and by the difficulty of getting better process nodes, to sign on and give it legitimacy. That was pointedly given as a motivation for the 1.1 revision of HSA, where others complained that it was mainly suited for AMD's implementations.

    The main boosters for HSA, like ARM and AMD, have found significant portions of it to be unnecessary for their products.
     
  7. pTmdfx

    I wouldn't say "significant". AMD does keep all the important bits (IMO) of HSA in their ROCm stack, except the HSA IL bits that enable runtime compatibility. It is quite sad to see the status quo though. I have always hoped things would somehow converge into a united front so that we can get real (?) cross-vendor heterogeneous platform libraries. But at least OpenCL 2.0 doesn't seem to be getting there (yet).
     
    #67 pTmdfx, Jan 2, 2017
    Last edited: Jan 2, 2017
  8. Anarchist4000

    To varying degrees of success. That's why I'm suggesting the model is changing. It's possible this has since changed, but only the newer models had full hardware support despite having queues. That response would also suggest they want all behavior working simultaneously for future designs.

    That's precisely what I'm advocating. With the reservation features they have implemented the vendor independent model isn't affected. It's still there, just some CUs being re-purposed. Take a couple CUs and dedicate them for audio acceleration or another task. Like what they've currently done with TrueAudio and the CU reservation example on GPUOpen. I'm sure there are other possibilities. The model ultimately still exists, it's just short a few cores. A CU reserved for indirect execution or queuing might be interesting as well.

    I'm suggesting it be presented as multiple agents as opposed to one giant agent. While yes that's how we've seen them implemented in the past, it won't necessarily continue. With a configurable fabric specializing segments to reduce contention of resources is likely beneficial.

    I'm not advocating completely removing the CP, especially for graphics, just allowing the device to be partitioned for whatever reason. Avoid having to send latency sensitive packets through the CP for a task not ultimately related to graphics. Segregate the device into a compute and a graphics portion. Not necessarily just two portions either. In the case of the ARM processor, I'm not expecting a giant core. It may very well be a smaller micro, but ideally I think the design might benefit from a more configurable control node. A design more suitable for handling multiple GPU cores, additional memory or IO pools, or futuristic microcode implementations. It's difficult to say how valid that idea is without knowing just how much wiring is required to properly integrate it. I expect a lot of the actual control and signaling logic to be in the queueing mechanisms. Whatever the geometry distribution mechanism is may reduce the links further. I'm suggesting ARM as it should be easier to integrate with other devices for any partner designs.
     
  9. pTmdfx

    There are always possibilities. There are tons of alternatives to implement heterogeneous systems on paper. You are not the first to come up with this either. AMD had a similar patent many years ago — filed after the merger, but way before HSA was a thing.

    The point is what definitive benefits they would bring over what you already have, and whether the cost is worth it. At this point, the question is always if the lower dispatch latency — not the kernel execution time — gives an advantage, and it remains unanswered. Let's not forget the dispatch latency is independent of the problem size, and data are still being exchanged through the memory hierarchy.

    AMD has apparently already answered the question for you with HSA.

    Working in user-mode and coherent memory ideally already gives a dispatch latency similar to task-based parallelism on a multi-core CPU. If the necessity is not addressed in the first place, any discussion of the implementation details is irrelevant.
     
    #69 pTmdfx, Jan 2, 2017
    Last edited: Jan 2, 2017