
dc.contributor.advisor	Thomas, Johnson
dc.contributor.author	Hou, Xiaofei
dc.date.accessioned	2018-04-23T19:35:23Z
dc.date.available	2018-04-23T19:35:23Z
dc.date.issued	2017-07
dc.identifier.uri	https://hdl.handle.net/11244/299516
dc.description.abstract	Hadoop offers a platform for processing big data. The Hadoop Distributed File System (HDFS) and MapReduce are two components of Hadoop: Hadoop uses HDFS, a distributed file system, to store data, and MapReduce to process that data for users. Hadoop places data according to the space utilization of datanodes, without considering each datanode's processing capability or how busy it is at run time. Furthermore, datanodes may not be homogeneous, as Hadoop may run in a heterogeneous environment. For these reasons, workload imbalances appear and result in poor performance. We propose a dynamic algorithm that considers the space availability, processing capability, and busy level of datanodes to ensure workload balance between racks. Our results show that the execution time of moved map tasks is reduced by more than 50%.

Furthermore, we propose a method in which Hadoop runs on a Software Defined Network (SDN) to further improve performance by allowing fast, adaptable data transfers between racks. By replacing the classical switches of a Hadoop cluster with OpenFlow switches, we can modify the network topology between racks to enlarge the bandwidth when large amounts of data must be transferred from one rack to another. Our results show that the execution time of moved map tasks is reduced by about 50% when employing our proposed Hadoop cluster Bandwidth Routing algorithm.

Apache YARN is the second generation of MapReduce. YARN has three built-in schedulers: the FIFO, Fair, and Capacity Schedulers. Although these schedulers give users different ways to allocate the resources of a Hadoop cluster to their MapReduce jobs, they do not guarantee that jobs will be executed within a specific deadline. We propose a deadline-constrained scheduling algorithm for Hadoop. The algorithm uses a statistical approach to measure the performance of datanodes and, based on this information, creates several checkpoints to monitor the progress of a job. Based on a job's progress at each checkpoint, the proposed scheduler assigns it to one of several job queues. These queues have different priorities, and the proportion of cluster resources each queue receives depends on its priority. The results of our experiments show that the proposed scheduler ensures jobs complete within a given deadline, whereas the native schedulers cannot guarantee this. Moreover, the average job execution time under the proposed scheduler is 56% and 15% lower than under the Fair and EDF schedulers, respectively.
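The first contribution turns on ranking datanodes by more than free space. The abstract does not give the scoring formula, so the following Java sketch only illustrates the idea under assumed definitions: a weighted combination of free-space ratio, relative processing speed, and (inverted) busy level. All names, fields, and the weights W_SPACE, W_SPEED, and W_BUSY are hypothetical, not the dissertation's implementation.

    import java.util.ArrayList;
    import java.util.Comparator;
    import java.util.List;

    // Hypothetical sketch: rank datanodes by the three factors the proposed
    // balancing algorithm considers (space, processing capability, busy level).
    public class DatanodeRanker {

        // Snapshot of one datanode; fields and units are assumptions.
        record Datanode(String host,
                        double freeSpaceRatio,  // free bytes / capacity, in [0, 1]
                        double relativeSpeed,   // speed vs. cluster average
                        double busyLevel) {}    // running tasks / slots, in [0, 1]

        // Illustrative weights, not the dissertation's tuning.
        static final double W_SPACE = 0.3, W_SPEED = 0.4, W_BUSY = 0.3;

        // Higher score = better target for placing blocks or moved map tasks.
        static double score(Datanode d) {
            return W_SPACE * d.freeSpaceRatio()
                 + W_SPEED * d.relativeSpeed()
                 + W_BUSY  * (1.0 - d.busyLevel());
        }

        public static void main(String[] args) {
            List<Datanode> nodes = new ArrayList<>(List.of(
                new Datanode("dn1", 0.70, 1.2, 0.80),   // fast but busy
                new Datanode("dn2", 0.40, 0.8, 0.20),   // slow but idle
                new Datanode("dn3", 0.90, 1.0, 0.50))); // balanced

            nodes.sort(Comparator.comparingDouble(DatanodeRanker::score).reversed());
            for (Datanode d : nodes) {
                System.out.printf("%s -> %.3f%n", d.host(), score(d));
            }
        }
    }

A scheme like this prefers dn3 (roomy, average speed, moderately busy) over dn1 (fast but saturated), which is the kind of decision a purely space-based placement cannot make.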
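The third contribution's checkpoint mechanism can likewise be sketched. The sketch below assumes, for simplicity, that expected progress grows linearly toward the deadline; the dissertation instead derives expectations statistically from measured datanode performance. Checkpoint times, progress values, and queue names are all illustrative.

    // Hypothetical sketch of the checkpoint idea behind the deadline scheduler:
    // at each checkpoint, compare a job's measured progress against the progress
    // needed to meet its deadline, and promote lagging jobs to a queue that
    // receives a larger share of cluster resources.
    public class DeadlineCheckpointDemo {

        enum Queue { HIGH_PRIORITY, NORMAL }

        // Assumption: linear expected progress over the deadline window.
        static double expectedProgress(long elapsedMs, long deadlineMs) {
            return Math.min(1.0, (double) elapsedMs / deadlineMs);
        }

        // A job behind schedule at a checkpoint moves to the high-priority queue.
        static Queue classify(double progress, long elapsedMs, long deadlineMs) {
            return progress < expectedProgress(elapsedMs, deadlineMs)
                    ? Queue.HIGH_PRIORITY : Queue.NORMAL;
        }

        public static void main(String[] args) {
            long deadlineMs = 600_000;                          // 10-minute deadline
            long[] checkpoints = {150_000, 300_000, 450_000};   // assumed checkpoints
            double[] measured  = {0.20, 0.45, 0.80};            // sample progress

            for (int i = 0; i < checkpoints.length; i++) {
                System.out.printf("t=%ds: progress %.2f (need %.2f) -> %s%n",
                        checkpoints[i] / 1000, measured[i],
                        expectedProgress(checkpoints[i], deadlineMs),
                        classify(measured[i], checkpoints[i], deadlineMs));
            }
        }
    }

Run as written, the job is promoted at the first two checkpoints (0.20 < 0.25, 0.45 < 0.50) and demoted back to the normal queue at the third (0.80 > 0.75), mirroring how checkpoint-driven queue reassignment keeps a job on track for its deadline.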
dc.format	application/pdf
dc.language	en_US
dc.rights	Copyright is held by the author who has granted the Oklahoma State University Library the non-exclusive right to share this material in its institutional repository. Contact Digital Library Services at lib-dls@okstate.edu or 405-744-9161 for the permission policy on the use, reproduction or distribution of this material.
dc.title	Dynamic workload balancing and scheduling in Hadoop MapReduce with software defined networking
dc.contributor.committeeMember	Chan-Tin, Eric
dc.contributor.committeeMember	Crick, Christopher
dc.contributor.committeeMember	Sheng, Weihua
osu.filename	Hou_okstate_0664D_15353.pdf
osu.accesstype	Open Access
dc.type.genre	Dissertation
dc.type.material	Text
thesis.degree.discipline	Computer Science
thesis.degree.grantor	Oklahoma State University

