Compressing Massive Sequencing Data with Multiple Attribute Tree

Selimovic, Dorian

dc.contributor.advisor	Radhakrishnan, Sridhar
dc.contributor.author	Selimovic, Dorian
dc.contributor.committeeMember	Grant, Christan
dc.contributor.committeeMember	Kim, Changwook
dc.date.accessioned	2020-01-03T20:23:28Z
dc.date.available	2020-01-03T20:23:28Z
dc.date.issued	2019-12-13
dc.date.manuscript	2019-12-09
dc.description.abstract	The significant drop in DNA sequencing costs caused by Next-Generation Sequencing has led to the production of massive amounts of raw sequencing data. This data is stored in FASTQ files, which are text files containing a large number of reads, each composed of a short DNA sequence and its associated identifier and quality score. The DNA sequence is a string of fixed length over the alphabet Σ = {A, C, T, G, N}, the identifier is an arbitrary, sequencer-dependent string, and the quality score is a string of the same length as the DNA sequence, indicating for each base how confident the sequencer was when determining it. These files range from a few gigabytes to hundreds of gigabytes, which poses a Big Data challenge, as the volume of generated sequencing data is now growing faster than storage hardware prices are falling. Storing and transmitting such data therefore requires compression algorithms that outperform general-purpose compressors such as gzip, the de facto standard. Many specialized compressors have been proposed to tackle this problem. In this thesis, we review existing compressors for FASTQ files and propose a novel compression algorithm for DNA sequences, MATC (Multiple Attribute Tree Compression). Our algorithm divides DNA sequences into k-mers, i.e., substrings of length k, and performs column-wise compression using a multiple attribute tree. In our case, the multiple attribute tree is a complete tree in which each node is a k-mer and each leaf represents the sequence formed by concatenating its ancestor k-mers. The tree is then stored in level order, and the k-mers are compressed using Huffman encoding. We show that our algorithm offers compression ratios comparable to those of current specialized compressors. Moreover, we propose a distributed version of our algorithm, allowing larger files to be compressed across a cluster of machines.
This allows compression to run in the cloud rather than on commodity hardware, which will become increasingly ill-suited to handling the growing volume of generated sequencing data.	en_US
dc.identifier.uri	https://hdl.handle.net/11244/323262
dc.language	en_US	en_US
dc.rights	Attribution 4.0 International	*
dc.rights.uri	http://creativecommons.org/licenses/by/4.0/	*
dc.subject	Computer Science	en_US
dc.subject	DNA Sequencing	en_US
dc.subject	Data Compression	en_US
dc.thesis.degree	Master of Science	en_US
dc.title	Compressing Massive Sequencing Data with Multiple Attribute Tree	en_US
ou.group	Gallogly College of Engineering::School of Computer Science	en_US
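
The pipeline outlined in the abstract (split reads into k-mers, arrange them in a tree with one level per k-mer column, store the tree in level order, Huffman-code the k-mers) can be illustrated with a minimal sketch. This is only one possible reading of the abstract and not the thesis's actual implementation: the function names, the nested-dict tree, the (k-mer, child count) serialization and the bit-string Huffman codes are all assumptions made for illustration. It also assumes fixed-length reads, and it merges identical reads and does not preserve read order.

```python
from collections import Counter, deque
from itertools import count
import heapq


def split_kmers(seq, k):
    """Split a read into consecutive, non-overlapping substrings of length k."""
    return [seq[i:i + k] for i in range(0, len(seq), k)]


def build_tree(reads, k):
    """Build a nested-dict tree; depth d holds the d-th k-mer column (assumed layout)."""
    root = {}
    for read in reads:
        node = root
        for kmer in split_kmers(read, k):
            node = node.setdefault(kmer, {})
    return root


def level_order(root):
    """Serialize the tree breadth-first as (k-mer, child count) pairs.

    The first entry is a placeholder for the root and carries its child count.
    """
    entries = [("", len(root))]
    queue = deque([root])
    while queue:
        node = queue.popleft()
        for kmer, child in node.items():
            entries.append((kmer, len(child)))
            queue.append(child)
    return entries


def rebuild_reads(entries):
    """Invert level_order: recover the distinct reads from the entry list."""
    it = iter(entries)
    _, n_root = next(it)
    reads = []
    queue = deque([("", n_root)])
    while queue:
        prefix, n_children = queue.popleft()
        if n_children == 0:
            reads.append(prefix)  # a leaf spells out one full read
            continue
        for _ in range(n_children):
            kmer, m = next(it)
            queue.append((prefix + kmer, m))
    return reads


def huffman_code(symbols):
    """Return a prefix-free bit-string code (dict symbol -> str) for the symbols."""
    freq = Counter(symbols)
    tie = count()  # unique tie-breaker so the dicts are never compared
    heap = [(f, next(tie), {s: ""}) for s, f in freq.items()]
    heapq.heapify(heap)
    if len(heap) == 1:  # a single distinct symbol still needs one bit
        return {s: "0" for s in heap[0][2]}
    while len(heap) > 1:
        f1, _, c1 = heapq.heappop(heap)
        f2, _, c2 = heapq.heappop(heap)
        merged = {s: "0" + b for s, b in c1.items()}
        merged.update({s: "1" + b for s, b in c2.items()})
        heapq.heappush(heap, (f1 + f2, next(tie), merged))
    return heap[0][2]


if __name__ == "__main__":
    reads = ["ACGTAC", "ACGTTT", "ACGGAC", "TTGTAC"]  # toy data, not from the thesis
    entries = level_order(build_tree(reads, k=2))
    code = huffman_code([kmer for kmer, _ in entries[1:]])
    bits = "".join(code[kmer] for kmer, _ in entries[1:])
    print(len(entries) - 1, "tree nodes,", len(bits), "bits of k-mer payload")
```

In this sketch, reads that share a leading run of k-mers share the corresponding tree nodes, so each shared k-mer is stored once; the child counts are what allow the level-order list to be decoded without storing explicit pointers.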