Privacy attacks reported in the literature have alerted the research community to serious privacy issues in current biomedical processing workflows. Since sharing biomedical data is vital for the advancement of research and the improvement of healthcare, reconciling sharing with privacy is of paramount importance. In this thesis, we state the need for effective privacy-preserving measures for biomedical data processing, and study solutions to the problem in one of its harder contexts, genomics. The thesis focuses on the specific properties of the human genome that make critical parts of it privacy-sensitive, and seeks to prevent the leakage of such critical information throughout the steps of the analysis and processing workflow for sequenced genomic data. To achieve this goal, it introduces efficient and effective privacy-preserving mechanisms at two levels: the filtering of reads immediately after sequencing, and alignment.
Human individuals share the majority of their genome (99.5%); the remaining 0.5% is what distinguishes one individual from all others. However, that information is only revealed after two costly processing steps, alignment and variant calling, which today are typically run in clouds for performance efficiency, with the corresponding privacy risks. To reap the benefits of cloud processing, we set out to neutralize its privacy risks by identifying the sensitive (i.e., discriminating) nucleotides in raw genomic data and acting upon them.
The first contribution is DNA-SeAl, a systematic classification of genomic data into different levels of privacy sensitivity, which leverages the output of a state-of-the-art automatic filter (SRF) that isolates the critical sequences. The second contribution is a novel filtering approach, LRF, which protects sensitive information in the raw reads right after sequencing, for sequences of arbitrary length (long reads), improving on SRF, which only handles short reads. The last contribution proposed in this thesis is MaskAl, an SGX-based privacy-preserving alignment approach built on the filtering method developed.
These contributions led to several findings. The first finding of this thesis is the improvement in the performance × privacy product achieved by implementing multiple sensitivity levels. The proposed example of three sensitivity levels shows the benefits of mapping progressively more sensitive levels to classes of alignment algorithms with progressively higher privacy protection (at the cost of lower performance). In this thesis, we demonstrate the effectiveness of the proposed sensitivity-level classification, DNA-SeAl. Just by considering three levels of sensitivity and taking advantage of three existing classes of alignment algorithms, the performance of privacy-preserving alignment improves significantly compared with state-of-the-art approaches. For reads of 100 nucleotides, 72% have low sensitivity, 23% have intermediate sensitivity, and the remaining 5% are highly sensitive. With this distribution, DNA-SeAl is 5.85× faster and requires 5.85× fewer data transfers than the binary classification (two sensitivity levels).
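The reasoning behind this first finding can be illustrated with a small cost model. The sketch below is only an illustration under stated assumptions: the read fractions are the ones quoted above, but the per-level relative costs are hypothetical placeholders, not measurements from the thesis, and the binary scheme is assumed to send every non-low read to the most protective class. It is not meant to reproduce the 5.85× figure.

```python
# Fractions of 100-nucleotide reads per sensitivity level (quoted in the text).
READ_FRACTIONS = {"low": 0.72, "intermediate": 0.23, "high": 0.05}

# HYPOTHETICAL relative alignment cost per read for each class of algorithm.
# These numbers are placeholders chosen for illustration only.
HYPOTHETICAL_COSTS = {"low": 1.0, "intermediate": 5.0, "high": 100.0}


def expected_cost(fractions, cost_per_level):
    """Average cost per read when each level uses its own algorithm class."""
    return sum(fractions[level] * cost_per_level[level] for level in fractions)


def speedup_over_binary(fractions, cost_per_level):
    """Ratio between the binary scheme's cost and the three-level cost.

    Assumption: with only two levels, intermediate reads must be handled by
    the most protective (most expensive) algorithm class.
    """
    binary_costs = dict(cost_per_level, intermediate=cost_per_level["high"])
    binary = expected_cost(fractions, binary_costs)
    three_level = expected_cost(fractions, cost_per_level)
    return binary / three_level
```

Under this model, any gain comes entirely from the intermediate reads, which no longer pay the price of the strongest protection; how large the gain is depends on the real cost ratios between the algorithm classes.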
The second finding is the improvement in the filtering of sensitive genomic information obtained by replacing the per-read classification with a per-nucleotide classification. With this change, the filtering approach proposed in this thesis (LRF) can filter sequences of arbitrary length (long reads), whereas the state-of-the-art filtering approach (SRF) is limited to classifying short reads. This thesis shows that around 10% of an individual's genome is classified as sensitive by the developed LRF approach, an improvement on the 60% obtained with the previous state of the art, the SRF approach.
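The difference between per-read and per-nucleotide classification can be sketched as follows. This is a minimal illustration, not the LRF or SRF implementation: it assumes that sensitivity is detected by looking up fixed-length subsequences (k-mers) in a dictionary of known sensitive sequences, and a plain Python set stands in for whatever data structure the actual filters use. All function names are hypothetical.

```python
def sensitive_mask(read, sensitive_kmers, k):
    """Per-nucleotide classification: flag each base covered by a sensitive k-mer."""
    mask = [False] * len(read)
    for start in range(len(read) - k + 1):
        if read[start:start + k] in sensitive_kmers:
            for pos in range(start, start + k):
                mask[pos] = True
    return mask


def per_read_label(read, sensitive_kmers, k):
    """Per-read classification: the whole read is sensitive if any base is."""
    return any(sensitive_mask(read, sensitive_kmers, k))


def mask_read(read, mask, placeholder="N"):
    """Hide only the flagged nucleotides, leaving the rest of the read usable."""
    return "".join(placeholder if flagged else base
                   for base, flagged in zip(read, mask))


def sensitive_fraction(mask):
    """Share of nucleotides flagged as sensitive."""
    return sum(mask) / len(mask) if mask else 0.0
```

With a per-read label, one sensitive k-mer makes the entire read sensitive, which becomes increasingly wasteful as reads grow longer. The per-nucleotide mask works for any read length and flags only the covered positions, which is consistent with a much smaller share of the genome being classified as sensitive.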
The third finding is the possibility of building a privacy-preserving alignment approach based on read filtering. Sensitivity-adapted alignment relying on hybrid cloud environments, composed of common (e.g., public cloud) and trustworthy (e.g., SGX enclave cloud) execution environments, gets the best of both worlds: it enjoys the resource and performance optimization of cloud environments, while providing a high degree of protection to genomic data. We demonstrate that MaskAl is 87% faster than existing privacy-preserving alignment algorithms (Balaur), with similar privacy guarantees. On the other hand, MaskAl is 58% slower than BWA, a highly efficient non-privacy-preserving alignment algorithm. In addition, MaskAl requires 95% less RAM than Balaur and transfers between 5.7 GB and 15 GB less data.
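One way to picture the hybrid setting is as a split of each read into a part that may leave the trusted boundary and a part that may not. The sketch below is a hypothetical illustration of that idea, not MaskAl's actual protocol: it assumes a per-nucleotide sensitivity mask is already available, and the function name and return format are invented for this example.

```python
from itertools import groupby


def split_for_hybrid_alignment(read, mask, placeholder="N"):
    """Split a read into a public part and a trusted part.

    Returns (public_read, trusted_segments), where public_read has every
    sensitive base replaced by a placeholder and could be processed in a
    common environment, while trusted_segments lists (offset, bases) pairs
    that would only be handled inside a trusted execution environment.
    """
    public_read = "".join(placeholder if sensitive else base
                          for base, sensitive in zip(read, mask))
    trusted_segments = []
    position = 0
    # Walk over runs of consecutive equal mask values.
    for is_sensitive, run in groupby(mask):
        length = len(list(run))
        if is_sensitive:
            trusted_segments.append((position, read[position:position + length]))
        position += length
    return public_read, trusted_segments
```

The design point is that the bulk of the work runs on the masked data in the fast, untrusted environment, and only the small sensitive remainder needs the more constrained trusted environment.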
This thesis breaks new ground on the simultaneous achievement of two important goals of genomic data processing: availability of data for sharing, and privacy preservation. We hope to have shown that our work, being generalisable, takes a significant step towards, and opens new avenues for, wider-scale, secure, and cooperative efforts and projects within the biomedical information processing life cycle.