DETAILED DESCRIPTION
In a heterogeneous computing system, memory may be managed by using a distributed array, which is a global set of local memory regions. To use the distributed array, it is first declared along with optional parameters. The parameters may include an indication of whether the distributed array is persistent (meaning that data written to the distributed array during one parallel dispatch is accessible by work items in a subsequent dispatch) or an indication of whether the distributed array is shared (meaning that nested kernels may access the distributed array). A segment in the distributed array is allocated for use and is bound to a physical memory region. The segment is used by a workgroup (which includes one or more work items) dispatched as part of a data parallel kernel, and may be deallocated after it has been used.
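By way of a non-limiting illustration, the declaration and segment lifetime described above may be sketched in C++ as follows; the DistArrayDesc, DistributedArray, and Segment names and the persistent/shared flags are hypothetical names chosen for this sketch and do not denote an existing API.

#include <cstddef>
#include <new>

struct DistArrayDesc {
    std::size_t segment_bytes;  // size of each local segment
    bool persistent;            // data survives into a subsequent dispatch
    bool shared;                // nested kernels may access the array
};

struct Segment {
    void* phys_region = nullptr;  // physical memory region the segment is bound to
    std::size_t bytes = 0;
};

class DistributedArray {
public:
    explicit DistributedArray(DistArrayDesc d) : desc(d) {}

    // Allocate a segment and bind it to a physical memory region.
    Segment allocate_segment() {
        Segment s;
        s.bytes = desc.segment_bytes;
        s.phys_region = ::operator new(desc.segment_bytes);
        return s;
    }

    // Release the segment once the workgroup that used it is finished.
    void deallocate_segment(Segment& s) {
        ::operator delete(s.phys_region);
        s.phys_region = nullptr;
        s.bytes = 0;
    }

private:
    DistArrayDesc desc;
};

int main() {
    // Declare the distributed array with the optional persistent/shared parameters.
    DistributedArray da({/*segment_bytes=*/4096, /*persistent=*/true, /*shared=*/false});

    Segment seg = da.allocate_segment();  // segment bound to a physical region
    // ... dispatch a data parallel kernel whose workgroup uses 'seg' ...
    da.deallocate_segment(seg);           // segment may be deallocated after use
}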
A distributed array provides an abstraction through a uniform read/write interface and can guarantee coherency. Accesses to memory are partitioned, such that the memory access as the user programs it is how the memory access is compiled down to the machine. The properties that the programmer specifies for the memory determine which physical memory it is mapped to. The programmer does not have to specify (as under the OpenCL model) whether the memory is global, local, or private. The implementation of the distributed array maps to these different memory types because it is optimized for the hardware that is present and for where a work item is dispatched.
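By way of further non-limiting illustration, one possible property-driven mapping rule may be sketched as follows; the two PhysicalMemory categories and the selection logic are assumptions made for this sketch rather than a required implementation.

#include <cstddef>

// Two example backing memories and one possible selection rule; both are
// assumptions for this sketch, not a required implementation.
enum class PhysicalMemory { LocalScratchpad, GlobalDram };

PhysicalMemory map_segment(std::size_t bytes, std::size_t scratchpad_bytes,
                           bool persistent, bool consumer_on_same_core) {
    if (bytes > scratchpad_bytes)
        return PhysicalMemory::GlobalDram;   // too large for on-core local memory
    if (persistent && !consumer_on_same_core)
        return PhysicalMemory::GlobalDram;   // must be stored back out for the consumer
    return PhysicalMemory::LocalScratchpad;  // fast on-core placement
}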
With the distributed array, memory may be defined to be persistent, such that it is loaded into local regions and can be stored back out again to more permanent storage if needed. The distributed array may be made persistent if the next workgroup needs the same data; for example, if the output of one workgroup is the input to the next workgroup. Workgroups may be scheduled to run on the same core, so that later workgroups can access the memory in place, eliminating the copy-in/copy-out overhead for those workgroups.
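Continuing the illustrative sketch above, a persistent segment may be reused across two dispatches as follows; the dispatch() helper, the ScheduleHint values, and the producer/consumer kernels are likewise hypothetical placeholders.

#include <cstdio>
#include <cstring>

enum class ScheduleHint { AnyCore, SameCore };

// Stand-in for launching a data parallel kernel whose workgroup uses the segment.
void dispatch(void (*kernel)(Segment&), Segment& seg, ScheduleHint) {
    kernel(seg);
}

void producer_kernel(Segment& seg) {
    std::memset(seg.phys_region, 1, seg.bytes);  // workgroup writes its output into the segment
}

void consumer_kernel(Segment& seg) {
    unsigned char first = static_cast<unsigned char*>(seg.phys_region)[0];
    std::printf("consumer read %d in place\n", static_cast<int>(first));  // same data, no reload
}

int main() {
    DistributedArray da({4096, /*persistent=*/true, /*shared=*/false});
    Segment seg = da.allocate_segment();

    // The producing workgroup writes, and the consuming workgroup (scheduled
    // to the same core) reads the persistent segment without a copy-out to
    // more permanent storage and a copy-in again.
    dispatch(producer_kernel, seg, ScheduleHint::AnyCore);
    dispatch(consumer_kernel, seg, ScheduleHint::SameCore);

    da.deallocate_segment(seg);
}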
The distributed array is described in terms of segments, wherein the distributed array is a representation of a global set of all local memory regions. When bound to a parallel kernel launch, each segment of the distributed array can be accessed from one defined subgroup of the overall parallel launch, that is, a subset of the individual work items. In the described embodiment, the subset would be the parallel workgroup within the overall parallel dispatch. Access from outside that subgroup may or may not be possible, depending on the defined behavior. The segment may be allocated at run time, or may be persistent due to a previous execution. If a segment is allocated, that segment may be explicitly passed into another launch, so that a particular block of data can be identified and passed to a particular consuming task.
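As a further non-limiting sketch reusing the hypothetical DistributedArray and Segment types above, one segment may be allocated per workgroup of a producing launch, and an individual segment may then be passed explicitly into a consuming launch.

#include <vector>

int main() {
    DistributedArray da({4096, /*persistent=*/true, /*shared=*/false});

    // One segment is bound per workgroup of the producing launch; each segment
    // is accessible from the work items of its own workgroup.
    const int num_workgroups = 4;
    std::vector<Segment> segments;
    for (int wg = 0; wg < num_workgroups; ++wg)
        segments.push_back(da.allocate_segment());

    // ... launch the producing kernel; workgroup wg writes into segments[wg] ...

    // A particular block of data can then be identified and handed to a
    // particular consuming task by passing its segment into the next launch,
    // e.g. dispatch(consuming_kernel, segments[2], ScheduleHint::SameCore);

    for (Segment& s : segments)
        da.deallocate_segment(s);
}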
With the distributed array, it is possible to perform a depth-first optimization, in which all consecutive work that relies on one particular block of data is run before moving on to the next block of data. The distributed array is used instead of the current OpenCL-style memory model, in which one large data parallel operation writes a large amount of data to memory, the next operation reads that data back in, and so on. The depth-first optimization changes the order of execution based on the data dependencies, rather than on the original data parallel construction, and allows for a more flexible execution pattern.
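By way of a non-limiting illustration of the reordering only, with Block, stage_a, and stage_b as placeholder names, the depth-first ordering may be contrasted with the OpenCL-style ordering as follows.

#include <vector>

struct Block { /* one block of data held in a distributed array segment */ };
void stage_a(Block&) { /* first data parallel stage */ }
void stage_b(Block&) { /* second stage, consuming stage_a's output */ }

// OpenCL-style ordering: each data parallel stage runs over every block,
// writing all of its output to memory before the next stage reads it back in.
void breadth_first(std::vector<Block>& blocks) {
    for (Block& b : blocks) stage_a(b);
    for (Block& b : blocks) stage_b(b);
}

// Depth-first ordering: all consecutive work that relies on one block runs
// before moving on to the next block, so the block can stay in its segment.
void depth_first(std::vector<Block>& blocks) {
    for (Block& b : blocks) {
        stage_a(b);
        stage_b(b);
    }
}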