Compressing Massive Sequencing Data with Multiple Attribute Tree

dc.contributor.advisorRadhakrishnan, Sridhar
dc.contributor.authorSelimovic, Dorian
dc.contributor.committeeMemberGrant, Christan
dc.contributor.committeeMemberKim, Changwook
dc.date.accessioned2020-01-03T20:23:28Z
dc.date.available2020-01-03T20:23:28Z
dc.date.issued2019-12-13
dc.date.manuscript2019-12-09
dc.description.abstractThe significant drop in DNA Sequencing costs caused by Next-Generation Sequencing has led to the production of massive amounts of raw sequencing data. This data is stored in FASTQ files, which are text files containing a large number of reads, each composed of a short DNA sequence and its associated identifier and quality score. The DNA sequence is a string of fixed length over the alphabet Σ = {A, C, T, G, N}, the identifier is an arbitrary string that is sequencer-dependent, and the quality score is a string of the same length as the DNA sequence, indicating for each base how confident the sequencer was when determining it. These files can range from a few gigabytes to hundreds of gigabytes, which poses a Big Data challenge, as the growth of generated sequencing data now exceeds the decrease of storage hardware price. Therefore, storing and transmitting such data requires more performant compression algorithms than general purpose compressors such as gzip, the de facto standard. Many different specialized compressors have been proposed to tackle this problem. In this thesis, we review currently existing compressors for FASTQ files and we propose a novel compression algorithm for DNA sequences, MATC, for Multiple Attribute Tree Compression. Our algorithm divides DNA sequences into k-mers, i.e., substrings of length k, and performs column-wise compression using a multiple attribute tree. In our case the multiple attribute tree is a complete tree where each node is a k-mer and each leaf represents the sequence formed by the concatenation of its parent k-mers. The tree is then stored using level-order traversal and k-mers are compressed using Huffman encoding. We show that our algorithm offers compression ratios comparable to the current specialized compressors. Moreover, we propose a distributed version of our algorithm, allowing the compression of larger files across a cluster of machines. This allows compression to be processed in the cloud, rather than on commodity hardware, which will become less and less suited to handle the growing size of generated sequencing data.en_US
dc.identifier.urihttps://hdl.handle.net/11244/323262
dc.languageen_USen_US
dc.rightsAttribution 4.0 International*
dc.rights.urihttp://creativecommons.org/licenses/by/4.0/*
dc.subjectComputer Scienceen_US
dc.subjectDNA Sequencing.en_US
dc.subjectData Compression.en_US
dc.thesis.degreeMaster of Scienceen_US
dc.titleCompressing Massive Sequencing Data with Multiple Attribute Treeen_US
ou.groupGallogly College of Engineering::School of Computer Scienceen_US

Files

Original bundle
Now showing 1 - 2 of 2
Loading...
Thumbnail Image
Name:
2019_Selimovic_Dorian_Thesis.pdf
Size:
3.45 MB
Format:
Adobe Portable Document Format
Description:
No Thumbnail Available
Name:
2019_Selimovic_Dorian_Thesis.zip
Size:
2.85 MB
Format:
Unknown data format
Description:
License bundle
Now showing 1 - 1 of 1
No Thumbnail Available
Name:
license.txt
Size:
1.71 KB
Format:
Item-specific license agreed upon to submission
Description:

Collections