
dc.contributor.advisor: Radhakrishnan, Sridhar
dc.contributor.author: Womack, Addison
dc.date.accessioned: 2019-05-08T14:44:51Z
dc.date.available: 2019-05-08T14:44:51Z
dc.date.issued: 2019-05-10
dc.identifier.uri: https://hdl.handle.net/11244/319599
dc.description.abstract: DNA sequencing machines produce tens of thousands to hundreds of millions of reads. Each read consists of letters from the alphabet X = {A, T, C, G, N} and varies in length from 30 to 120 characters or more. The DNA reads are stored in the standard FASTQ file format, which contains not only the reads but also a quality score for each character in each read; the score corresponds to the probability that the identified character is correct. FASTQ files vary in size from hundreds of megabytes to tens of gigabytes. The reads in the FASTQ files are processed as part of many DNA algorithms for various sequence analyses. Because each file is so large, keeping and handling several of these files in main memory for faster processing is not possible on commodity hardware. In this thesis, we propose a lossless compression mechanism named CIGARCoil that operates on FASTQ files and other files that contain DNA reads. The other salient features of CIGARCoil are:
• It is not a reference-based algorithm, in the sense that one does not need to create a reference string before compression can begin. Reference strings are undesirable because they are hard to determine and because they are required for both the compression and the decompression of the file.
• In this thesis, for the first time, we show that each of the reads can be accessed directly on the compressed structure created by CIGARCoil. That is, we provide access to each read without having to uncompress the file.
• Since we can provide direct access to a read on the CIGARCoil compressed structure, we have implemented a [] (square-bracket) array indexing operator. Through this implementation, we can implement a predictive caching mechanism that makes the reads available to the end user based on their access pattern.
We have analyzed our compression mechanism on various well-known FASTQ data sets along with synthetic data sets. In all cases, our compression method produces a compressed file that is smaller than or approximately the same size as those created by the existing DNA compression mechanisms, including BZIP, DSRC2, and LFQC. [en_US]
dc.language: en_US [en_US]
dc.rights: Attribution 4.0 International [*]
dc.rights.uri: https://creativecommons.org/licenses/by/4.0/ [*]
dc.subject: Data Compression [en_US]
dc.subject: DNA Sequencing [en_US]
dc.subject: Computer Science. [en_US]
dc.title: CIGARCoil: A New Algorithm for the Compression of DNA Sequencing Data [en_US]
dc.contributor.committeeMember: Grant, Christan
dc.contributor.committeeMember: Pan, Chongle
dc.date.manuscript: 2019-05-05
dc.thesis.degree: Master of Science [en_US]
ou.group: Gallogly College of Engineering::School of Computer Science [en_US]
shareok.orcid: 0000-0002-8755-8095 [en_US]
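
The abstract describes a FASTQ file as a set of reads, each paired with one quality score per character that reflects the probability that the character was called correctly. The sketch below illustrates that layout only; it is not part of CIGARCoil and does not reproduce the thesis's compression method. It assumes the common four-line record layout and the Phred+33 quality encoding, neither of which is stated in this record, and the names `FastqRecord`, `parse_fastq`, and `prob_correct` are hypothetical.

```python
import io
from typing import Iterator, NamedTuple, TextIO


class FastqRecord(NamedTuple):
    """One FASTQ entry (hypothetical helper type, not from the thesis)."""
    name: str
    read: str
    quality: str


def parse_fastq(handle: TextIO) -> Iterator[FastqRecord]:
    """Yield records, assuming the four-line-per-record FASTQ layout."""
    while True:
        header = handle.readline()
        if not header:          # end of file
            return
        header = header.strip()
        if not header:          # skip stray blank lines
            continue
        read = handle.readline().strip()
        handle.readline()       # the '+' separator line
        quality = handle.readline().strip()
        if len(read) != len(quality):
            raise ValueError(f"read/quality length mismatch in {header!r}")
        yield FastqRecord(header[1:], read, quality)  # drop leading '@'


def prob_correct(qual_char: str, offset: int = 33) -> float:
    """Probability that a base call is correct, assuming Phred+33 encoding.

    Phred score Q = ord(char) - offset, and the error probability is
    10 ** (-Q / 10).
    """
    q = ord(qual_char) - offset
    return 1.0 - 10.0 ** (-q / 10.0)


# A single made-up record over the alphabet {A, T, C, G, N}.
SAMPLE = "@read1\nACGTN\n+\nIIII!\n"
records = list(parse_fastq(io.StringIO(SAMPLE)))

for rec in records:
    for base, qual in zip(rec.read, rec.quality):
        print(base, round(prob_correct(qual), 4))
```

Under these assumptions each record carries as many quality characters as bases, which is part of why FASTQ files grow to the sizes the abstract mentions.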




Except where otherwise noted, this item's license is described as Attribution 4.0 International