MEDICAL SIGNALS ALIGNMENT AND PRIVACY PROTECTION USING BELIEF PROPAGATION AND COMPRESSED SENSING
Abstract
Advances in human genome sequencing technology have significantly reduced the
cost of data generation, and the resulting data volume overwhelms the computing capacity available for sequence analysis.
Efficiency, efficacy and scalability remain challenging in sequence alignment,
which is an important and foundational operation for genome data analysis. In this
dissertation, I propose a two-stage approach to tackle this problem. In the preprocessing
step, I match blocks of reference and target genome sequences based on the
similarities between their empirical transition probability distributions using belief
propagation. I then conduct a refined match using our recently published SCoBeP
technique. I extract features from the neighborhood of an input nucleotide (the window
of neighboring nucleotides centered on that nucleotide)
and use sparse coding to find a set of candidate nucleotides, then rank these
candidates with Belief Propagation (BP). Our experimental results demonstrated
robustness in nucleotide sequence alignment and are competitive
with those of the SOAP aligner and the BWA algorithm.
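As a rough illustration of the preprocessing idea (comparing blocks by their empirical transition probability distributions), the sketch below estimates a first-order nucleotide transition matrix per block and picks the closest reference block. It is not the implementation used in this work: the names transition_matrix, block_distance and nearest_block, the pseudocount smoothing, and the absolute-difference distance are all assumptions made for the example, and a plain nearest-neighbor search stands in for the belief propagation step.

```python
from collections import Counter

ALPHABET = "ACGT"


def transition_matrix(block, pseudocount=1.0):
    """Row-normalized first-order transition probabilities of one block.

    The pseudocount is an assumed smoothing choice that keeps rows with
    no observed transitions well defined (they become uniform).
    """
    counts = Counter(zip(block, block[1:]))  # adjacent nucleotide pairs
    matrix = []
    for a in ALPHABET:
        row = [counts[(a, b)] + pseudocount for b in ALPHABET]
        total = sum(row)
        matrix.append([v / total for v in row])
    return matrix


def block_distance(p, q):
    """Sum of absolute differences between two transition matrices.

    This distance is an assumption; the abstract does not name the
    similarity measure.
    """
    return sum(abs(x - y) for rp, rq in zip(p, q) for x, y in zip(rp, rq))


def nearest_block(target_block, reference_blocks):
    """Index of the reference block whose transition matrix is closest.

    A plain nearest-neighbor search, used here in place of belief
    propagation.
    """
    t = transition_matrix(target_block)
    dists = [block_distance(t, transition_matrix(r)) for r in reference_blocks]
    return min(range(len(dists)), key=dists.__getitem__)
```

Calling nearest_block(target, references) with a target block and a list of reference blocks returns the position of the best coarse match, which a refined per-nucleotide alignment could then search within.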
In addition, most genomic datasets are not publicly accessible because of privacy
concerns. Patients' genomic data contain identifiable markers and can be used to
determine the presence of an individual in a dataset. Prior research shows that
re-identification can be possible even when only a very small set of genomic data is released.
To protect patients, the data owners impose an application and evaluation procedure
which often takes months to complete and limits researchers. One solution to
this problem is to let each data owner publish a set of pilot data that helps data users
choose the right datasets for their needs. The data owners release these pilot
data together with the noise parameters and the mechanism that they used. A data user can
run any kind of association test and compare the outcomes with those from other
datasets to get an idea of which datasets may be useful. I present a privacy-preserving
genomic data dissemination algorithm based on compressed sensing. In my
proposed method, I add noise to the sparse representation of the input
vector to make it differentially private: I find the sparse representation
using Subspace Pursuit and then perturb it with sufficient Laplacian noise.
I compare my method with a state-of-the-art compressed sensing privacy protection
method.
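The perturbation step described above can be sketched as follows, assuming the sparse coefficients have already been computed (for example by Subspace Pursuit, which is not implemented here). The function names, the sensitivity and epsilon parameters, and the choice to add noise to every coefficient are assumptions for illustration only; the sketch shows the standard Laplace mechanism with scale sensitivity / epsilon and does not by itself establish a privacy guarantee for any particular dataset.

```python
import random


def laplace_noise(scale, rng):
    """One Laplace(0, scale) sample.

    Uses the fact that the difference of two independent exponential
    variables with mean `scale` is Laplace-distributed with that scale.
    """
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def perturb_sparse_code(coeffs, sensitivity, epsilon, rng=None):
    """Add Laplace noise with scale sensitivity / epsilon to each coefficient.

    `sensitivity` and `epsilon` are assumed inputs; how they are derived
    for genomic data is not specified in the abstract.
    """
    if rng is None:
        rng = random.Random()
    scale = sensitivity / epsilon
    return [c + laplace_noise(scale, rng) for c in coeffs]


def reconstruct(dictionary, coeffs):
    """Synthesize a signal as dictionary @ coeffs.

    `dictionary` is a list of rows, each with one entry per coefficient.
    """
    return [sum(d * c for d, c in zip(row, coeffs)) for row in dictionary]
```

A data owner following this sketch would call perturb_sparse_code on the sparse coefficients and release reconstruct(dictionary, noisy_coeffs) along with the noise scale; a smaller epsilon gives a larger scale and therefore noisier pilot data.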