Tianyi Wei


Generalizing the Correlation between Genotypes and Geography of Europeans with Robust and Differentially Private Methods

Our research seeks a systematic and rigorous approach to identifying outliers in genotype data from 197,461 loci in 1,887 European individuals by studying the geometric structure of the sample. Specifically, we apply Fast Median Subspace (FMS) and Geodesic Gradient Descent (GGD) to separate inliers, which lie on or near a low-dimensional subspace, from outliers, which are distributed throughout the ambient space. Both FMS and GGD are robust subspace recovery (RSR) methods: they seek an underlying low-dimensional subspace in a corrupted, high-dimensional dataset. Instead of solving the least squares problem as in classical Principal Component Analysis (PCA), FMS and GGD solve a least absolute deviations problem over the non-convex Grassmannian. Our results show that FMS and GGD reduce dimension as effectively as PCA, producing a 2-dimensional visualization of the genetic variation that resembles the map of Europe. In addition, the low-dimensional subspaces recovered by FMS and GGD allow us to locate inliers in the presence of corruption, which helps filter outliers. These results illustrate the use of RSR methods on real datasets for dimension reduction and data preprocessing. In the presentation, I will describe the FMS and GGD algorithms and the results of applying them to the European genotype data. I will also discuss a differentially private version of GGD, although its performance on our European genotype data has yet to be tested.
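
To make the contrast with PCA concrete, the following is a minimal sketch of an iteratively reweighted least squares scheme in the spirit of FMS. It is not the implementation used in this work. The function names (`top_subspace`, `distances_to_subspace`, `fms_sketch`, `make_synthetic`), the parameter values, and the synthetic data are all assumptions made for illustration; the genotype data are not used. Each iteration reweights every point by the inverse of its distance to the current subspace and then recomputes a weighted PCA, so that points far from the subspace have less influence than they would under least squares.

```python
import numpy as np

def top_subspace(X, d):
    """Orthonormal basis (D x d) of the top-d right singular vectors of X (n x D)."""
    _, _, vt = np.linalg.svd(X, full_matrices=False)
    return vt[:d].T

def distances_to_subspace(X, basis):
    """Euclidean distance from each row of X to the span of the basis columns."""
    residual = X - (X @ basis) @ basis.T
    return np.linalg.norm(residual, axis=1)

def fms_sketch(X, d, n_iter=100, eps=1e-10, tol=1e-9):
    """Illustrative reweighting loop (hypothetical helper, not the authors' code).

    Assumes the rows of X are already centered. Starts from the PCA subspace,
    then repeatedly downweights points that lie far from the current subspace.
    """
    basis = top_subspace(X, d)  # PCA initialization (least squares solution)
    for _ in range(n_iter):
        dist = distances_to_subspace(X, basis)
        # Inverse-distance weights; eps guards against division by zero.
        weights = 1.0 / np.maximum(dist, eps)
        # Weighted PCA: scaling rows by sqrt(w) weights the covariance by w.
        new_basis = top_subspace(X * np.sqrt(weights)[:, None], d)
        # Compare subspaces through their projection matrices.
        change = np.linalg.norm(new_basis @ new_basis.T - basis @ basis.T)
        basis = new_basis
        if change < tol:
            break
    return basis

def make_synthetic(n_in=200, n_out=50, D=20, d=2, seed=0):
    """Synthetic stand-in data: inliers on a d-dim subspace, Gaussian outliers."""
    rng = np.random.default_rng(seed)
    true_basis, _ = np.linalg.qr(rng.standard_normal((D, d)))
    inliers = rng.standard_normal((n_in, d)) @ true_basis.T
    outliers = rng.standard_normal((n_out, D))
    return np.vstack([inliers, outliers]), true_basis, n_in

if __name__ == "__main__":
    X, true_basis, n_in = make_synthetic()
    basis = fms_sketch(X, d=2)
    dist = distances_to_subspace(X, basis)
    print("largest inlier distance:  ", dist[:n_in].max())
    print("smallest outlier distance:", dist[n_in:].min())
```

In this sketch, the distances returned by `distances_to_subspace` are what one would threshold to flag candidate outliers, and `X @ basis` gives the 2-dimensional coordinates for visualization. The choice of threshold is left open here, since the abstract does not specify one.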