Fault-Tolerant Coherence Protocol for Distributed Shared Memory Systems
Abstract
Distributed Shared Memory (DSM) systems are becoming increasingly significant as they are used more extensively in modern computing environments. DSM gives the illusion of shared memory on a loosely coupled system. When systems are connected across a network, DSM coherence protocols should scale well to larger networks. When real-time applications run on distributed systems, providing a high degree of reliability in an inherently error-prone environment is a formidable task. Regardless, fault tolerance, in terms of highly available data access and uninterrupted service, should be provided. Recovery is the process of restoring a system to its normal operational state after a failure. Reliability ensures the consistency of the data after recovery. Existing DSM systems provide reliability by replicating data, either in stable storage or in the main memories of different processors, but these systems suspend the DSM service during recovery. In time-critical applications, providing uninterrupted DSM service to the greatest possible extent is both a necessity and a challenge. Prior research reported in the literature addresses architectures with a single server and multiple clients. This thesis investigates a solution in which the server is not a single point of failure, faster recovery is possible in the event of a failure, and increased throughput can be obtained during normal operation of the system. It was found that the multi-server protocol gives better performance when the user application exhibits locality of reference. The server was not a single point of failure, and recovery from a single-site failure was approximately 50% faster when two servers, instead of one, were used.
Collections
- OSU Theses