Nicholas Simafranca

Session: Session 2
Board Number: 58

Fair Principal Component Analysis on High Dimensional Genetic Data

Principal Component Analysis (PCA) is a widely adopted dimensionality reduction technique across scientific disciplines. However, traditional PCA may not represent the different populations within a dataset equally well. Fair PCA is a linear dimensionality reduction method that aims to minimize reconstruction error while representing different populations with similar fidelity. In this study, we applied Fair PCA to the Population Reference Sample (POPRES) dataset, which consists of genotype data from approximately 8,000 individuals of diverse geographical origins. We focused on a subset of 1,385 individuals of European origin, dividing them into four subgroups based on their geographic clusters: (1) Spain and Portugal, (2) Western Europe, (3) Italy, and (4) Eastern Europe. Our objective was to compare the performance of Fair PCA against Classical PCA and to assess its ability to represent the subgroups equitably. Our numerical experiments did not yield conclusive results: the Fair PCA plot differed substantially from the Classical PCA plot. We suspect that partitioning the dataset into four distinct subgroups distorted the original geometric structure of the data, which may have affected the performance of Fair PCA. Although our research could not establish the effectiveness of Fair PCA in representing the POPRES dataset, we recommend further investigation into alternative data partitioning methods or modifications to the Fair PCA algorithm to better suit high-dimensional genetic data. This work contributes to ongoing efforts to develop more equitable dimensionality reduction techniques for diverse populations in genomic research.
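
To make the fairness concern concrete, the sketch below fits classical PCA to pooled data and reports the mean reconstruction error separately for each subgroup, which is the kind of per-group fidelity that Fair PCA tries to balance. It is only an illustration under stated assumptions: the data are synthetic stand-ins (not POPRES genotypes), the group labels "A" to "D", the group sizes, and the function name `per_group_reconstruction_error` are hypothetical, and the code implements classical PCA only, not the Fair PCA algorithm used in the study.

```python
import numpy as np


def per_group_reconstruction_error(X, groups, n_components):
    """Fit classical PCA on all rows of X, then return the mean squared
    reconstruction error of each group (hypothetical helper)."""
    # Center the pooled data, as classical PCA does.
    Xc = X - X.mean(axis=0)
    # Rows of Vt are the principal axes, ordered by explained variance.
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    V = Vt[:n_components].T
    # Project onto the top components and map back to the original space.
    X_hat = Xc @ V @ V.T
    sq_err = np.sum((Xc - X_hat) ** 2, axis=1)
    return {str(g): float(sq_err[groups == g].mean()) for g in np.unique(groups)}


# Synthetic stand-in data: four groups of unequal size, 50 features each.
# Sizes and labels are assumptions chosen for illustration only.
rng = np.random.default_rng(0)
sizes = {"A": 400, "B": 300, "C": 200, "D": 100}
n_features = 50

blocks, labels = [], []
for name, n in sizes.items():
    # Each group gets its own random linear mixing, so the groups
    # vary along different directions.
    mixing = rng.normal(size=(n_features, n_features)) / np.sqrt(n_features)
    blocks.append(rng.normal(size=(n, n_features)) @ mixing)
    labels.extend([name] * n)

X = np.vstack(blocks)
groups = np.array(labels)

errors = per_group_reconstruction_error(X, groups, n_components=2)
for name, err in errors.items():
    print(f"group {name}: mean squared reconstruction error = {err:.3f}")
```

If the per-group errors differ widely, the pooled projection serves some groups better than others. A fairness-aware method would trade a little overall accuracy for a smaller gap between groups.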