Out-of-order engines are the basis for nearly every high-performance general-purpose processor today because of their ability to mask the penalties of long-latency operations. Unfortunately, these benefits come at the cost of a substantial amount of power-consuming hardware. In addition, the prevalence of loops in code means that this hardware often duplicates its efforts, rescheduling the same instruction sequence thousands of times within a typical program. Although no two iterations of a loop are identical, owing to branches and variable-latency operations, for the SPEC2017 benchmarks tested the most common dynamically scheduled instruction pattern accounts for anywhere from 43% to 88% of the reorderings, and the four most common patterns account for anywhere from 70% to 98%.
To eliminate some of this duplicated work in finding the same dynamic schedule, the execution patterns that the out-of-order engine creates can be recorded in a cache indexed by the branch that immediately precedes each pattern. If the same pattern is seen enough times, much of the dynamic scheduling hardware can be powered off and the previously determined schedule used instead. The powered-off hardware includes the common data bus that broadcasts the output registers written in a particular cycle to every reservation station; in its place, the system replays instructions in the recorded order. Many parameters within this system affect both the fraction of time spent in this replay mode and the relative performance of the system: the start threshold, the stop threshold, the length of time to wait for a replayed instruction to become ready, the mechanism for handling squashes and timeouts, the number of patterns stored per branch, and the length of history considered. While the general trade-offs of adjusting these parameters are similar for most benchmarks, the optimum value of each depends both on the values of the other parameters and on the particular code being run. With this in mind, the proposed system has been shown to spend over 40% of its time in replay mode with only a 2% reduction in performance relative to a standard out-of-order processor.
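The recording and threshold logic above can be sketched as a small behavioral model. This is a hypothetical illustration, not the actual hardware design: the class and field names (`PatternCache`, `observe`, `report_mismatch`) and the per-branch single-pattern layout are assumptions for clarity, while `start_threshold` and `stop_threshold` follow the parameters named in the text.

```python
from collections import defaultdict

class PatternCache:
    """Behavioral sketch of the replay-mode pattern cache (names assumed)."""

    def __init__(self, start_threshold=3, stop_threshold=2):
        self.start_threshold = start_threshold  # sightings before replay engages
        self.stop_threshold = stop_threshold    # failures before replay disengages
        # One recorded schedule per preceding branch, with a repeat count.
        self.entries = defaultdict(lambda: {"count": 0, "pattern": None})
        self.replaying = False
        self.mismatches = 0

    def observe(self, branch_pc, schedule):
        """Record the dynamic schedule the engine produced after a branch.

        Returns True once the pattern has repeated enough times that the
        dynamic scheduling hardware could be powered off and replay used.
        """
        entry = self.entries[branch_pc]
        if entry["pattern"] == tuple(schedule):
            entry["count"] += 1
        else:  # new or changed pattern: start counting again
            entry["pattern"] = tuple(schedule)
            entry["count"] = 1
        if entry["count"] >= self.start_threshold:
            self.replaying = True
        return self.replaying

    def report_mismatch(self):
        """A replayed instruction was not ready in time; after enough of
        these, fall back to normal dynamic scheduling."""
        self.mismatches += 1
        if self.mismatches >= self.stop_threshold:
            self.replaying = False
            self.mismatches = 0
```

A real implementation would also handle squashes and per-branch pattern sets; this sketch only shows how the start and stop thresholds gate entry to and exit from replay mode.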
While this system has been shown to work well, the goal of reducing power by removing duplicated effort means that the pattern cache itself must be kept small, which limits both the length of patterns and the number of patterns that can be stored. Limiting pattern length to reasonable sizes has almost no effect on system operation in most cases, but limiting the number of patterns does. While some benchmarks retain most of their performance with a pattern cache on the order of kilobytes (30% replay utilization with a 1% performance drop in the best case), other benchmarks reach only half of their maximum possible utilization rates. With this in mind, the proposed system shows promising initial results, but for it to be an effective power-saving tool, more work must be done to find ways to limit the size of the pattern cache with smaller reductions in utilization rates.
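The two capacity limits discussed above can be modeled together in a short sketch. The entry count, pattern-length cap, and the least-recently-used eviction policy here are all illustrative assumptions (the text does not specify a replacement policy); the point is only to show where the two limits bite.

```python
from collections import OrderedDict

class BoundedPatternCache:
    """Size-limited pattern cache sketch: caps both the number of stored
    patterns and the length of each pattern (capacities and LRU policy
    are illustrative assumptions)."""

    def __init__(self, max_entries=256, max_pattern_len=64):
        self.max_entries = max_entries
        self.max_pattern_len = max_pattern_len
        self.entries = OrderedDict()  # branch PC -> recorded schedule

    def insert(self, branch_pc, schedule):
        # Length limit: the text notes this rarely hurts, so oversized
        # patterns are simply not cached.
        if len(schedule) > self.max_pattern_len:
            return False
        if branch_pc in self.entries:
            self.entries.move_to_end(branch_pc)
        self.entries[branch_pc] = tuple(schedule)
        # Entry-count limit: evict the least recently used pattern.
        if len(self.entries) > self.max_entries:
            self.entries.popitem(last=False)
        return True

    def lookup(self, branch_pc):
        pattern = self.entries.get(branch_pc)
        if pattern is not None:
            self.entries.move_to_end(branch_pc)  # mark as recently used
        return pattern
```

Under this model, the utilization loss reported for some benchmarks corresponds to lookups missing because their patterns were either too long to cache or evicted before being reused.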