
dc.contributor.advisor	Thomas, Johnson
dc.contributor.author	Hou, Xiaofei
dc.date.accessioned	2018-04-23T19:35:23Z
dc.date.available	2018-04-23T19:35:23Z
dc.date.issued	2017-07
dc.identifier.uri	https://hdl.handle.net/11244/299516
dc.description.abstract	Hadoop offers a platform for processing big data. The Hadoop Distributed File System (HDFS) and MapReduce are two components of Hadoop: Hadoop uses HDFS, a distributed file system, to store data, and MapReduce to process that data for users. Hadoop places data according to the space utilization of datanodes, without considering each datanode's processing capability or how busy it is at run time. Furthermore, datanodes may not be homogeneous, as Hadoop may run in a heterogeneous environment. For these reasons, workload imbalances appear and result in poor performance. We propose a dynamic algorithm that considers the space availability, processing capability, and busy level of datanodes to ensure workload balance between racks. Our results show that the execution time of moved map tasks is reduced by more than 50%.

Furthermore, we propose a method in which Hadoop runs on a Software Defined Network (SDN) to further improve performance by allowing fast, adaptable data transfers between racks. By replacing the classical switches of a Hadoop cluster with OpenFlow switches, we can modify the network topology between racks to enlarge the bandwidth when large amounts of data must be transferred from one rack to another. Our results show that the execution time of moved map tasks is reduced by about 50% when employing our proposed Hadoop cluster Bandwidth Routing algorithm.

Apache YARN is the second generation of MapReduce. YARN has three built-in schedulers: the FIFO, Fair, and Capacity Schedulers. Although these schedulers give users different ways to allocate the resources of a Hadoop cluster to their MapReduce jobs, they do not guarantee that jobs will be executed within a specific deadline. We propose a deadline-constrained scheduling algorithm for Hadoop. The algorithm uses a statistical approach to measure the performance of datanodes and, based on this information, creates several checkpoints to monitor the progress of a job. Based on a job's progress at each checkpoint, the proposed scheduler assigns it to one of several job queues. These queues have different priorities, and the proportion of cluster resources each queue receives depends on its priority. The results of our experiments show that the proposed scheduler ensures jobs complete within a given deadline, whereas the native schedulers cannot guarantee this. Moreover, the average job execution time under the proposed scheduler is 56% and 15% lower than under the Fair and EDF schedulers, respectively.
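The first contribution turns on ranking datanodes by more than free space. The abstract does not give the scoring formula, so the following Java sketch only illustrates the idea under assumed definitions: a weighted combination of free-space ratio, relative processing speed, and (inverted) busy level. All names, fields, and the weights W_SPACE, W_SPEED, and W_BUSY are hypothetical, not the dissertation's implementation.

    import java.util.ArrayList;
    import java.util.Comparator;
    import java.util.List;

    // Hypothetical sketch: rank datanodes by the three factors the proposed
    // balancing algorithm considers (space, processing capability, busy level).
    public class DatanodeRanker {

        // Snapshot of one datanode; fields and units are assumptions.
        record Datanode(String host,
                        double freeSpaceRatio,  // free bytes / capacity, in [0, 1]
                        double relativeSpeed,   // speed vs. cluster average
                        double busyLevel) {}    // running tasks / slots, in [0, 1]

        // Illustrative weights, not the dissertation's tuning.
        static final double W_SPACE = 0.3, W_SPEED = 0.4, W_BUSY = 0.3;

        // Higher score = better target for placing blocks or moved map tasks.
        static double score(Datanode d) {
            return W_SPACE * d.freeSpaceRatio()
                 + W_SPEED * d.relativeSpeed()
                 + W_BUSY  * (1.0 - d.busyLevel());
        }

        public static void main(String[] args) {
            List<Datanode> nodes = new ArrayList<>(List.of(
                new Datanode("dn1", 0.70, 1.2, 0.80),   // fast but busy
                new Datanode("dn2", 0.40, 0.8, 0.20),   // slow but idle
                new Datanode("dn3", 0.90, 1.0, 0.50))); // balanced

            nodes.sort(Comparator.comparingDouble(DatanodeRanker::score).reversed());
            for (Datanode d : nodes) {
                System.out.printf("%s -> %.3f%n", d.host(), score(d));
            }
        }
    }

A scheme like this prefers dn3 (roomy, average speed, moderately busy) over dn1 (fast but saturated), which is the kind of decision a purely space-based placement cannot make.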
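The third contribution's checkpoint mechanism can likewise be sketched. The sketch below assumes, for simplicity, that expected progress grows linearly toward the deadline; the dissertation instead derives expectations statistically from measured datanode performance. Checkpoint times, progress values, and queue names are all illustrative.

    // Hypothetical sketch of the checkpoint idea behind the deadline scheduler:
    // at each checkpoint, compare a job's measured progress against the progress
    // needed to meet its deadline, and promote lagging jobs to a queue that
    // receives a larger share of cluster resources.
    public class DeadlineCheckpointDemo {

        enum Queue { HIGH_PRIORITY, NORMAL }

        // Assumption: linear expected progress over the deadline window.
        static double expectedProgress(long elapsedMs, long deadlineMs) {
            return Math.min(1.0, (double) elapsedMs / deadlineMs);
        }

        // A job behind schedule at a checkpoint moves to the high-priority queue.
        static Queue classify(double progress, long elapsedMs, long deadlineMs) {
            return progress < expectedProgress(elapsedMs, deadlineMs)
                    ? Queue.HIGH_PRIORITY : Queue.NORMAL;
        }

        public static void main(String[] args) {
            long deadlineMs = 600_000;                          // 10-minute deadline
            long[] checkpoints = {150_000, 300_000, 450_000};   // assumed checkpoints
            double[] measured  = {0.20, 0.45, 0.80};            // sample progress

            for (int i = 0; i < checkpoints.length; i++) {
                System.out.printf("t=%ds: progress %.2f (need %.2f) -> %s%n",
                        checkpoints[i] / 1000, measured[i],
                        expectedProgress(checkpoints[i], deadlineMs),
                        classify(measured[i], checkpoints[i], deadlineMs));
            }
        }
    }

Run as written, the job is promoted at the first two checkpoints (0.20 < 0.25, 0.45 < 0.50) and demoted back to the normal queue at the third (0.80 > 0.75), mirroring how checkpoint-driven queue reassignment keeps a job on track for its deadline.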
dc.format	application/pdf
dc.language	en_US
dc.rights	Copyright is held by the author who has granted the Oklahoma State University Library the non-exclusive right to share this material in its institutional repository. Contact Digital Library Services at lib-dls@okstate.edu or 405-744-9161 for the permission policy on the use, reproduction or distribution of this material.
dc.title	Dynamic workload balancing and scheduling in Hadoop MapReduce with software defined networking
dc.contributor.committeeMember	Chan-Tin, Eric
dc.contributor.committeeMember	Crick, Christopher
dc.contributor.committeeMember	Sheng, Weihua
osu.filename	Hou_okstate_0664D_15353.pdf
osu.accesstype	Open Access
dc.type.genre	Dissertation
dc.type.material	Text
thesis.degree.discipline	Computer Science
thesis.degree.grantor	Oklahoma State University

