Revealing Old and New Population Structures in Large Cohorts

Alex Diaz-Papkovich, Simon Gravel, Luke Anderson-Trocmé

McGill University and Génome Québec Innovation Centre

Population structures in genomic datasets are formed by factors including geographical isolation, ancestral migration patterns, and admixture. These structures confound medical research using genome-wide association studies (GWAS) by introducing spurious associations between single nucleotide polymorphisms (SNPs) and phenotypes, so understanding them is critical to proper GWAS research. Affordable high-throughput sequencing techniques have created massive datasets containing the SNPs of hundreds of thousands of individuals, but the size and high dimensionality of the data make it challenging to understand how population structures form and relate to each other. Dimension reduction methods exist, but they have difficulty balancing local and global relationships between data points and struggle computationally as datasets grow.

Here we demonstrate a new approach to visualizing structures in genomic datasets: using unsupervised dimension reduction methods, we condense the data so that we can illustrate population structures and how they relate to each other on local and global scales. By combining principal component analysis (PCA) with the newly developed Uniform Manifold Approximation and Projection (UMAP), we create visualizations that comprehensively illustrate population structures in very large datasets.

We create visualizations using three genotype datasets: the 1000 Genomes Project (1000G), the Health and Retirement Study (HRS), and the UK Biobank (UKBB). With each dataset, we use PCA and UMAP to generate and plot 2D projections of every individual’s genotype and colour the points by provided ethnicity, admixture estimates, or geographic coordinates. Using the 1000G, we demonstrate that this creates clusters from balanced populations and groups ethnicities that are connected by geography and shared ancestries. Using the HRS, we show how the method connects populations with admixture and reflects genetic diversity within groups, and we also find sub-populations within the HRS. Finally, using the UKBB, we visualize a massive dataset made up of many sub-populations and examine how they relate to each other and reflect fine-scale geographic differences.
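
The two-stage projection described above (PCA followed by UMAP) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the pipeline used in the study: the toy genotype matrix, the two simulated groups, the choice of 10 principal components, and the helper names `top_pcs` and `embed_2d` are all hypothetical. PCA is computed here with a plain NumPy SVD, and the UMAP step assumes the `umap-learn` package is installed.

```python
import numpy as np


def top_pcs(genotypes, n_pcs):
    """Project individuals onto the leading principal components via SVD."""
    centered = genotypes - genotypes.mean(axis=0)  # centre each SNP column
    u, s, _ = np.linalg.svd(centered, full_matrices=False)
    return u[:, :n_pcs] * s[:n_pcs]  # PC scores, one row per individual


def embed_2d(pcs, **umap_kwargs):
    """Reduce PC scores to 2D with UMAP (assumes umap-learn is installed)."""
    import umap  # imported lazily so the PCA step runs without umap-learn
    return umap.UMAP(n_components=2, **umap_kwargs).fit_transform(pcs)


# Toy data (hypothetical): 60 individuals x 200 SNPs coded 0/1/2, drawn from
# two simulated groups with different allele frequencies.
rng = np.random.default_rng(0)
freqs = rng.uniform(0.1, 0.9, size=(2, 200))
labels = np.repeat([0, 1], 30)
genotypes = rng.binomial(2, freqs[labels]).astype(float)

# Stage 1: keep the leading PCs (10 is an arbitrary choice for this sketch).
pcs = top_pcs(genotypes, n_pcs=10)
```

Calling `embed_2d(pcs, random_state=0)` would then return one 2D coordinate pair per individual, which can be plotted and coloured by `labels` (or, with real data, by ethnicity, admixture estimates, or geographic coordinates). Running PCA first reduces the dimensionality that UMAP has to handle, which is one plausible reason to combine the two methods on very large datasets.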