Show simple item record

dc.contributor.advisorRadhakrishnan, Sridhar
dc.contributor.authorSelimovic, Dorian
dc.date.accessioned2020-01-03T20:23:28Z
dc.date.available2020-01-03T20:23:28Z
dc.date.issued2019-12-13
dc.identifier.urihttps://hdl.handle.net/11244/323262
dc.description.abstractThe significant drop in DNA Sequencing costs caused by Next-Generation Sequencing has led to the production of massive amounts of raw sequencing data. This data is stored in FASTQ files, which are text files containing a large number of reads, each composed of a short DNA sequence and its associated identifier and quality score. The DNA sequence is a string of fixed length over the alphabet Σ = {A, C, T, G, N}, the identifier is an arbitrary string that is sequencer-dependent, and the quality score is a string of the same length as the DNA sequence, indicating for each base how confident the sequencer was when determining it. These files can range from a few gigabytes to hundreds of gigabytes, which poses a Big Data challenge, as the growth of generated sequencing data now exceeds the decrease of storage hardware price. Therefore, storing and transmitting such data requires more performant compression algorithms than general purpose compressors such as gzip, the de facto standard. Many different specialized compressors have been proposed to tackle this problem. In this thesis, we review currently existing compressors for FASTQ files and we propose a novel compression algorithm for DNA sequences, MATC, for Multiple Attribute Tree Compression. Our algorithm divides DNA sequences into k-mers, i.e., substrings of length k, and performs column-wise compression using a multiple attribute tree. In our case the multiple attribute tree is a complete tree where each node is a k-mer and each leaf represents the sequence formed by the concatenation of its parent k-mers. The tree is then stored using level-order traversal and k-mers are compressed using Huffman encoding. We show that our algorithm offers compression ratios comparable to the current specialized compressors. Moreover, we propose a distributed version of our algorithm, allowing the compression of larger files across a cluster of machines. This allows compression to be processed in the cloud, rather than on commodity hardware, which will become less and less suited to handle the growing size of generated sequencing data.en_US
dc.languageen_USen_US
dc.rightsAttribution 4.0 International*
dc.rights.urihttps://creativecommons.org/licenses/by/4.0/*
dc.subjectComputer Scienceen_US
dc.subjectDNA Sequencing.en_US
dc.subjectData Compression.en_US
dc.titleCompressing Massive Sequencing Data with Multiple Attribute Treeen_US
dc.contributor.committeeMemberGrant, Christan
dc.contributor.committeeMemberKim, Changwook
dc.date.manuscript2019-12-09
dc.thesis.degreeMaster of Scienceen_US
ou.groupGallogly College of Engineering::School of Computer Scienceen_US


Files in this item

Thumbnail
Thumbnail
Thumbnail

This item appears in the following Collection(s)

Show simple item record


Attribution 4.0 International
Except where otherwise noted, this item's license is described as Attribution 4.0 International