Yahya, Padillah
(2021)
Ancestry informative markers single nucleotide polymorphisms panel for ancestry estimation in the Malay population.
PhD thesis, Universiti Sains Malaysia.
Abstract
Ancestry-informative markers (AIMs) can be used to infer an individual’s ancestry, reducing the inaccuracy of self-reported ethnicity in biomedical research. In this study, AIM-SNP panels for the Malay population were developed using three methods: iterative pruning principal component analysis combined with pairwise FST (ipPCA-FST), informativeness for assignment (In), and PCA-correlated SNPs (also called PCA-informative markers; PCAIMs). The panels were designed from two sets of Malay SNP array genotype datasets hosted by the Malaysian Node of the Human Variome Project (MyHVP). The first dataset contained 135 Malay SNP genotypes generated on the Affymetrix GeneChip Mapping Xba 50K array platform. The second contained 76 Malay SNP genotypes generated on the Affymetrix SNP-6 and Illumina OMNI2.5 SNP array platforms, together with 89 Malay SNP genotypes from the Singapore Genome Variation Project (SGVP) generated on the Affymetrix SNP-6 array platform. The Pan-Asian SNP dataset served as the reference population for the first dataset, whereas the International HapMap Phase 3 project and SGVP datasets served as the reference populations for the second. The accuracy of each resulting panel was evaluated using a machine learning “ancestry-predictive model” constructed in WEKA, a comprehensive machine learning platform written in Java. The ADMIXTURE program was used to explore the genetic pattern of Malays based on the selected AIM-SNP panels. Models using panels of 144, 299, and 433 AIM-SNPs selected from the Affymetrix 50K dataset by the ipPCA-FST method correctly classified Malay individuals with accuracies of 76.5, 70.6, and 82.4%, respectively. Accuracy increased further to 88.2% with panels of 1772 and 3145 AIM-SNPs.
Models with 250 and 2000 SNPs ranked by In correctly classified Malay individuals with accuracies of 89.8 and 96.1%, respectively. Accuracy was lower, 78 and 92.1%, respectively, for the same numbers of SNPs selected by the PCAIM method. ADMIXTURE analysis showed that the genetic structure of the Malay population can be distinctly differentiated from the reference populations using the panels of 1772 and 3145 AIM-SNPs selected by ipPCA-FST and the 2000 SNPs selected by In and by PCAIM. Models using panels of 101, 157, and 294 AIM-SNPs selected from the Affymetrix SNP-6 dataset by the ipPCA-FST method achieved classification accuracies of 88.8, 94.4, and 96.9%, respectively. With panels of 1250 and 2240 AIM-SNPs, accuracy reached 100%. Models with 100, 200, and 2000 SNPs ranked by In correctly classified Malay individuals with accuracies of 67.5, 80, and 100%, respectively. For the same numbers of SNPs, the PCAIM method gave accuracies of 68.8, 81.9, and 99.4%, respectively. The genetic structure of the Malay population can be differentiated from the other world populations used in this study using the panels of 1250 and 2240 AIM-SNPs selected by ipPCA-FST and the 2000 AIM-SNPs selected by In and by PCAIM.
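To illustrate what ranking SNPs by informativeness for assignment (In) involves, the sketch below computes In for biallelic SNPs from per-population allele frequencies and keeps the top-ranked markers. It uses the commonly cited definition, In = Σ_j ( −p̄_j ln p̄_j + Σ_i (p_ij / K) ln p_ij ), where p_ij is the frequency of allele j in population i, p̄_j is its mean over the K populations, and 0 ln 0 is taken as 0. The function names, the toy frequencies, and the SNP labels are hypothetical; this is not the thesis's implementation, which the abstract does not describe.

```python
import math


def informativeness(freqs):
    """In for one biallelic SNP.

    freqs: frequency of one allele in each of K populations (assumed input
    layout; the second allele's frequency is taken as 1 - p).
    """
    k = len(freqs)
    total = 0.0
    # Sum the contribution of both alleles of the SNP.
    for allele_freqs in (freqs, [1.0 - p for p in freqs]):
        mean = sum(allele_freqs) / k
        if mean > 0:
            total -= mean * math.log(mean)
        for p in allele_freqs:
            if p > 0:  # treat 0 * ln(0) as 0
                total += (p / k) * math.log(p)
    return total


def rank_snps(freq_table, top_n):
    """Return the top_n SNP ids with the highest In (hypothetical helper)."""
    scores = {snp: informativeness(f) for snp, f in freq_table.items()}
    return sorted(scores, key=scores.get, reverse=True)[:top_n]


# Toy allele frequencies in three populations (made-up values).
toy = {
    "snp_a": [0.9, 0.1, 0.5],
    "snp_b": [0.5, 0.5, 0.5],  # identical everywhere, so In is 0
    "snp_c": [1.0, 0.0, 0.0],  # fixed difference in one population
}
top = rank_snps(toy, 2)
print(top)
```

A SNP with identical frequencies in every population scores 0, and for two populations a fixed difference scores the maximum of ln 2, so sorting by In favours markers whose alleles are most population-specific.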