Processing dependent tasks on a heterogeneous GPU resource architecture
Abstract
In this dissertation, a heterogeneous GPU system is a system that consists of several different types of GPUs. Many problems in science and engineering can be represented as a two-dimensional grid in which the update of each grid point depends on the values of its nearest neighbors. The grid may be too large to be handled on a single computing node, and if a distributed, heterogeneous processor system is used instead, two crucial issues arise: minimizing inter-processor communication and balancing the load. First, a novel partitioning algorithm for heterogeneous processors (NPHP) is proposed; based on the grid shape, it divides the grid into blocks that are as close to square as possible in order to minimize communication cost. Second, a functional performance model with communication (FPMC) is proposed to estimate the absolute speeds of the processors accurately, so that the workload can be divided in proportion to the speeds of the GPUs. Based on these two partitioning algorithms, a heterogeneous GPU system (HG) is implemented. HG differs from other distributed GPU systems in that it can process dependent tasks, that is, tasks in HG can communicate with one another. Furthermore, a dynamic component is designed and implemented in the HG system, so neighbor relationships can change at run time; with this architecture, HG can handle more complex task-dependent applications. To validate the approach, an HG system running heat transfer and Gaussian elimination is implemented. The experimental results demonstrate that the heterogeneous GPU system has a substantial advantage over traditional homogeneous GPU and CPU systems. For the static-neighbor application, heat transfer, HG is at least 8 times faster than an MPI program running on CPUs. For the dynamic-neighbor application, Gaussian elimination, HG achieves a 2.75 times speedup. Several optimizations are also proposed and implemented to improve performance.
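The idea behind speed-proportional workload division can be illustrated with a minimal sketch. This is not the dissertation's FPMC model itself; the function name, the largest-remainder rounding, and the example speeds are illustrative assumptions.

```python
# Hypothetical sketch: divide grid rows among GPUs in proportion to their
# measured relative speeds, as a functional performance model would.
# All names and numbers here are illustrative, not taken from the dissertation.

def partition_rows(total_rows, speeds):
    """Assign contiguous row counts proportional to device speeds.

    Largest-remainder rounding keeps the counts summing to total_rows.
    """
    total_speed = sum(speeds)
    exact = [total_rows * s / total_speed for s in speeds]
    counts = [int(x) for x in exact]
    # Hand the leftover rows to the devices with the largest remainders.
    leftover = total_rows - sum(counts)
    order = sorted(range(len(speeds)),
                   key=lambda i: exact[i] - counts[i], reverse=True)
    for i in order[:leftover]:
        counts[i] += 1
    return counts

# Example: 1000 rows over three GPUs with relative speeds 5 : 3 : 2.
print(partition_rows(1000, [5.0, 3.0, 2.0]))  # [500, 300, 200]
```

A faster GPU simply receives proportionally more rows, which is the load-balancing effect the speed estimates are meant to enable.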
These include NPHP, which reduces communication cost by at least 10%, and FPMC, which improves load balance by 10% on average. A data-reuse optimization in the computing kernel, which uses shared memory to reduce global memory accesses, yields a 7 times speedup.
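The nearest-neighbor update pattern that makes these tasks dependent can be sketched as a plain Jacobi iteration of the 2D heat equation. This is a simplified single-process sketch, not the CUDA kernel or the shared-memory optimization from the dissertation.

```python
# Illustrative sketch of the 2D nearest-neighbor (5-point stencil) update;
# in the distributed HG setting, the rows at block edges would have to be
# exchanged with neighboring GPUs before each step.

def jacobi_step(grid):
    """One Jacobi iteration of the 2D heat equation on the interior points.

    Each interior cell becomes the average of its four nearest neighbors;
    boundary values are held fixed.
    """
    n, m = len(grid), len(grid[0])
    new = [row[:] for row in grid]
    for i in range(1, n - 1):
        for j in range(1, m - 1):
            new[i][j] = 0.25 * (grid[i - 1][j] + grid[i + 1][j]
                                + grid[i][j - 1] + grid[i][j + 1])
    return new

# Example: a 4x4 grid with a hot top boundary.
g = [[0.0] * 4 for _ in range(4)]
g[0] = [100.0] * 4
g = jacobi_step(g)
print(g[1][1])  # 25.0
```

Because every interior update reads values owned by up to four neighboring blocks, partition shape directly controls how much boundary data must be communicated, which is what NPHP's near-square blocks minimize.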
Collections
- OSU Dissertations [11222]