Barnes, Ronald
Fitzmorris, Lucia
2018-05-01
2018-05-01
2018-05
https://hdl.handle.net/11244/299691
In this thesis, I introduce and implement an extension to the Decoupled Software Pipelining (DSWP) algorithm proposed by Rangan et al. The extension is named Learning Assisted Decoupled Software Pipelining (LA-DSWP) because it applies reinforcement learning to the partitioning problem within DSWP. Through experimentation, I test and measure the viability of DSWP and LA-DSWP as optimizations that produce significant program speedup.
As computer architects strive to keep up with public expectations for processor performance growth, they are increasingly turning to processor designs that place multiple independent cores on a single chip. Unlike most prior hardware innovations, multicore designs benefit only programs that are written or compiled with multiple threads in mind. Automatic thread extraction using Decoupled Software Pipelining seeks to extract multiple threads from a single-threaded program~\cite{ottoni-micro-2005}. It does so by allowing loops within the program to execute on multiple cores of a single processor chip simultaneously, without programmer intervention. DSWP focuses on splitting the traversal loops of large recursive data structures into multiple threads in an attempt to increase overall program performance.
Unlike prior implementations of DSWP, this research uses a hardware- and language-independent implementation of DSWP built on the LLVM framework. Rather than relying on custom-built hardware to facilitate communication between program threads, this implementation uses Intel's Threading Building Blocks library to create queues in the memory shared by the on-chip processor cores. As this thesis will show, this design relies heavily on the memory subsystem of the targeted processors and is greatly affected by how that subsystem is designed.
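The queue-based decoupling described above can be illustrated with a minimal sketch. It is not the thesis's LLVM implementation: the names `Node`, `traverse_stage`, `work_stage`, and `run_pipelined` are hypothetical, and Python's `queue.Queue` stands in for the shared-memory queue that the thesis builds with Threading Building Blocks. The sketch shows only the structure of the transformation, in which one thread performs the pointer-chasing traversal and a second thread performs the per-node work. Because of CPython's global interpreter lock, it should not be expected to show a speedup.

```python
import queue
import threading


class Node:
    """One element of a singly linked list (a simple recursive data structure)."""

    def __init__(self, value, next_node=None):
        self.value = value
        self.next = next_node


def build_list(values):
    """Build a linked list holding `values` in order; returns the head (or None)."""
    head = None
    for v in reversed(values):
        head = Node(v, head)
    return head


_DONE = object()  # sentinel that tells the consumer the traversal has finished


def traverse_stage(head, out_q):
    """Stage 1: follow the `next` pointers and forward each node downstream."""
    node = head
    while node is not None:
        out_q.put(node)   # blocks when the bounded queue is full
        node = node.next
    out_q.put(_DONE)


def work_stage(in_q, results, work):
    """Stage 2: apply `work` to each node's value in arrival order."""
    while True:
        node = in_q.get()  # blocks when the queue is empty
        if node is _DONE:
            break
        results.append(work(node.value))


def run_pipelined(head, work, capacity=32):
    """Run the two stages on separate threads connected by a bounded queue."""
    q = queue.Queue(maxsize=capacity)  # assumed stand-in for the TBB queue
    results = []
    producer = threading.Thread(target=traverse_stage, args=(head, q))
    consumer = threading.Thread(target=work_stage, args=(q, results, work))
    producer.start()
    consumer.start()
    producer.join()
    consumer.join()
    return results


def run_sequential(head, work):
    """The original single-threaded loop, kept for comparison."""
    results = []
    node = head
    while node is not None:
        results.append(work(node.value))
        node = node.next
    return results
```

With a single producer and a single consumer on a FIFO queue, `run_pipelined` returns its results in the same order as `run_sequential`. Every node handed between the stages passes through the shared queue, which is one way to see why this style of implementation depends so heavily on the memory subsystem.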
Another novel addition to DSWP explored in this thesis is the application of machine learning to the partitioning process. Instead of partitioning the nodes of a loop's program dependence graph using predefined heuristics, this thesis applies reinforcement learning so that the DSWP agent can make more informed decisions when optimizing a given loop. The DSWP agent collects and analyzes data about each node of a loop in order to partition the loop on a node-by-node basis. This addition constitutes LA-DSWP. Through experimentation on modern Intel processors, this thesis tests the feasibility of LA-DSWP on current hardware. Multiple kernel programs were written to search for program patterns that can achieve performance increases using DSWP partitioning. Experiments were run using the partitioning methods discussed in earlier papers along with the proposed method utilizing machine learning.
Engineering, Electronics and Electrical.
Computer Science.
Learning Assisted Decoupled Software Pipelining (LA-DSWP)