When data is copied between CPU and GPU memory, a GPU direct memory access (DMA) copy engine may be used. The GPU DMA copy engine may require the CPU's memory to be properly aligned before data can be transferred. In such cases, data transfers may involve an extra step of copying between the actual CPU source/destination memory and a temporary memory allocation on the CPU that meets the DMA copy engine's constraints. In such data transfer operations, the overall latency of the transfer may be the sum of the latencies of each step in the process.
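For illustration, a minimal CUDA sketch of such a staged transfer is shown below. The buffer size is arbitrary, and CUDA's pinned host allocation (cudaMallocHost) stands in for "a temporary memory allocation that meets the DMA copy engine constraints"; none of the identifiers come from the embodiments described herein.

```cuda
#include <cuda_runtime.h>
#include <cstdlib>
#include <cstring>

int main() {
    const size_t bytes = 64 << 20;         // illustrative 64 MiB payload
    char *src = (char *)malloc(bytes);     // pageable CPU memory; may not meet
    memset(src, 1, bytes);                 // the copy engine's constraints

    char *staging, *dst;
    cudaMallocHost((void **)&staging, bytes);  // pinned allocation the DMA
                                               // engine can access directly
    cudaMalloc((void **)&dst, bytes);

    // Step 1: the CPU copies into the DMA-friendly temporary allocation.
    memcpy(staging, src, bytes);
    // Step 2: the GPU DMA copy engine transfers staging -> GPU memory.
    cudaMemcpy(dst, staging, bytes, cudaMemcpyHostToDevice);
    // Performed serially, the overall latency is the sum of the two steps.

    cudaFree(dst);
    cudaFreeHost(staging);
    free(src);
    return 0;
}
```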
In embodiments described herein, the overall transfer operation may be broken into a series of transfers of smaller chunks. The chunks may then be pipelined so that the different steps of the transfer operation execute concurrently, with each step operating on a different chunk of the larger transfer. The pipelining technique may be applied to any data transfer operation that involves multiple steps, and the different steps may be performed by any hardware or software resources that can function concurrently.
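Continuing the sketch above (still illustrative CUDA; the two staging buffers and the 4 MiB chunk size are assumptions for illustration), the same transfer can be pipelined so that the CPU stages chunk i while the DMA engine is still copying chunk i-1:

```cuda
#include <cuda_runtime.h>
#include <cstdlib>
#include <cstring>

int main() {
    const size_t bytes = 64 << 20, chunk = 4 << 20;  // illustrative sizes
    const int n = (int)(bytes / chunk);              // number of chunks

    char *src = (char *)malloc(bytes);
    memset(src, 1, bytes);

    char *staging[2], *dst;                          // double-buffered staging
    cudaMallocHost((void **)&staging[0], chunk);
    cudaMallocHost((void **)&staging[1], chunk);
    cudaMalloc((void **)&dst, bytes);

    cudaStream_t stream;
    cudaStreamCreate(&stream);
    cudaEvent_t reusable[2];
    cudaEventCreate(&reusable[0]);
    cudaEventCreate(&reusable[1]);

    for (int i = 0; i < n; ++i) {
        char *buf = staging[i & 1];
        size_t off = (size_t)i * chunk;

        // Wait until the DMA that last used this staging buffer has finished
        // (a never-recorded event is treated as already complete).
        cudaEventSynchronize(reusable[i & 1]);

        // Step 1 for chunk i (CPU memcpy) overlaps step 2 for chunk i-1
        // (DMA copy engine), since the async copy was already enqueued.
        memcpy(buf, src + off, chunk);
        cudaMemcpyAsync(dst + off, buf, chunk, cudaMemcpyHostToDevice, stream);
        cudaEventRecord(reusable[i & 1], stream);
    }
    cudaStreamSynchronize(stream);                   // drain the pipeline

    cudaEventDestroy(reusable[0]);
    cudaEventDestroy(reusable[1]);
    cudaFree(dst);
    cudaFreeHost(staging[0]);
    cudaFreeHost(staging[1]);
    free(src);
    return 0;
}
```

Two staging buffers suffice here because only two steps are in flight at once; a deeper pipeline would use one buffer per concurrent stage.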
This pipelining technique may be applied, for example, when copying a large amount of data between the memory arenas of two different accelerator devices. In such cases, the transfer is to be routed through CPU memory, and thus includes two steps: 1) copy the data from source accelerator memory to CPU memory, and 2) copy the data from CPU memory to destination accelerator memory. Since step 1 and step 2 are performed by independent DMA engines (step 1 by the source accelerator's DMA copy engine and step 2 by the destination accelerator's DMA copy engine), the two engines can work concurrently on different chunks, speeding up the overall transfer time by pipelining the data copying steps.
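A hedged CUDA sketch of this two-engine pipeline follows. It assumes two visible devices (device 0 as the source accelerator, device 1 as the destination) with the copy routed through a pinned CPU staging arena; the per-chunk event keeps the destination's host-to-device copy from starting before the source's device-to-host copy of that same chunk has landed, while the two DMA engines overlap on different chunks:

```cuda
#include <cuda_runtime.h>
#include <vector>

int main() {
    const size_t bytes = 64 << 20, chunk = 4 << 20;  // illustrative sizes
    const int n = (int)(bytes / chunk);

    char *staging;
    cudaMallocHost((void **)&staging, bytes);        // pinned CPU staging arena

    char *srcDev, *dstDev;
    cudaStream_t d2h, h2d;

    cudaSetDevice(0);                                // source accelerator
    cudaMalloc((void **)&srcDev, bytes);
    cudaStreamCreate(&d2h);

    cudaSetDevice(1);                                // destination accelerator
    cudaMalloc((void **)&dstDev, bytes);
    cudaStreamCreate(&h2d);

    std::vector<cudaEvent_t> staged(n);
    for (int i = 0; i < n; ++i) {
        size_t off = (size_t)i * chunk;

        cudaSetDevice(0);                            // step 1: source DMA engine
        cudaMemcpyAsync(staging + off, srcDev + off, chunk,
                        cudaMemcpyDeviceToHost, d2h);
        cudaEventCreate(&staged[i]);
        cudaEventRecord(staged[i], d2h);             // chunk i is now in CPU memory

        cudaSetDevice(1);                            // step 2: destination DMA engine
        cudaStreamWaitEvent(h2d, staged[i], 0);      // wait only for this chunk
        cudaMemcpyAsync(dstDev + off, staging + off, chunk,
                        cudaMemcpyHostToDevice, h2d);
    }
    cudaSetDevice(1);
    cudaStreamSynchronize(h2d);                      // destination has all chunks

    for (int i = 0; i < n; ++i) cudaEventDestroy(staged[i]);
    cudaStreamDestroy(h2d);
    cudaFree(dstDev);
    cudaSetDevice(0);
    cudaStreamDestroy(d2h);
    cudaFree(srcDev);
    cudaFreeHost(staging);
    return 0;
}
```

Because each chunk gets its own slot in the staging arena, no buffer-reuse synchronization is needed; steady state has the destination engine copying chunk i while the source engine copies chunk i+1.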
In some cases, data may be transferred between a first memory arena and a fourth memory arena. In such cases, transfer determining module 115 may determine that for the data 131 to be transferred from the first memory arena to the fourth memory arena, the data is to be transferred from the first memory arena to a second memory arena, from the second memory arena to a third memory arena, and from the third memory arena to the fourth memory arena. In response to the determination, the data copying module 125 may perform the following in parallel: copy a third data portion (not shown) from the first memory arena 135 to the second memory arena 140, copy the second data portion 137 from the second memory arena to the third memory arena 145, and copy the first data portion 136 from the third memory arena to the fourth memory arena. As shown in FIG. 5, the total transfer time for such a data transfer is t+t/n seconds, where n is the number of data portions (i.e., chunks). Once the pipeline is loaded, the copying at each stage proceeds concurrently. This concurrent data transfer among heterogeneous memory arenas allows data to be quickly transferred and accessed at the destination. Instead of serially sending one large block of data from arena to arena, the data is broken into smaller chunks that move through the pipeline concurrently, thus greatly reducing transfer time.
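For intuition about where the t+t/n figure comes from, here is a back-of-the-envelope timing model (an assumption for illustration, not taken from FIG. 5): suppose each stage takes t seconds to move the entire buffer, so each chunk takes t/n per stage, and all stages run concurrently once the pipeline fills. An s-stage pipeline over n chunks then completes in

$$T_{\text{pipelined}} = (n + s - 1)\,\frac{t}{n} = t + (s - 1)\,\frac{t}{n}, \qquad T_{\text{serial}} = s\,t.$$

Only the pipeline fill/drain term (s-1)·t/n is paid beyond the concurrent steady state, so the pipelined time approaches t as n grows, versus s·t for the fully serial transfer.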