Abstract: | Empowered by new computing technology and low genotyping cost, large biobank projects like UK
Biobank (UKB) have yielded fruitful results in advancing the biomedical sciences. However, several
smaller biobanks sample from different ethnic groups, and the statistical power to detect
associations in these datasets is lower. Data augmentation by synthesizing unobserved samples
has shown promising results in machine learning applications. Here, we hypothesized that
augmenting small biobank data can increase statistical power and detect reliable association
signals.
A two-step strategy was adopted. First, control samples were filtered using the Partitioning Around
Medoids (PAM) algorithm, which used the entire phenome to divide controls into clusters according to
comorbidity. To reduce heterogeneity, only samples not in the same cluster for the phenotype of
interest were used as controls. Second, cases and controls were stratified by age and gender. By
applying the Synthetic Minority Oversampling Technique (SMOTE) to each stratum, artificial cases and
controls were generated. In
this study, asthma was chosen as the phenotype. A dataset of Caucasians in UKB (UKB-C, NC-total=204,893, NC-case=31,303) and a random sample of it (UKB-CS, NCS-total=24,000, NCS-case=3,612) were selected. Fourteen linkage disequilibrium peaks (p≤10⁻⁸) from the UKB-C GWAS were used as targets
for comparison. Only the HLA region was replicated using UKB-CS. Our strategy was then applied to UKB-CS. The real-to-artificial sample ratio (RAR) ranged from 4 (four real samples per artificial sample) to 1.
Compared with the targets from UKB-C data, 4 peaks were replicated when RAR=4, 5 when RAR=3, 6 when
RAR=2, and 11 when RAR=1. The HLA region was prominent for every RAR. When RAR=2, false positive
peaks seemed modest; almost half of the signals (6 of 14) could be replicated when roughly 1/9 of the
UKB-C samples were used.
The above procedure was applied to data from Taiwan Biobank (TWB, NT-total=23,942, NT-case=2,069).
Without augmentation, only the HLA region was significant. When RAR=2, GWAS results for TWB and
UKB-CS showed a similar trend. In addition to the HLA region, only two other regions were replicated
for TWB. Population heterogeneity may contribute to this discrepancy. Our results showed that data
augmentation is promising; however, caution is needed with respect to input data quality and
possible stratification. More testing of augmentation algorithms is needed to further
evaluate their performance. |
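
The second step of the strategy (stratify, then oversample each stratum until a chosen RAR is reached) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names (`n_synthetic`, `smote_stratum`, `augment`), the record layout, the neighbour count `k=5`, and the floor rounding of the per-stratum sample count are all assumptions made for illustration, and the abstract does not say how genotype features were encoded or whether the RAR was applied per stratum. The interpolation step follows the general SMOTE idea of placing a new point on the segment between a sample and one of its nearest neighbours.

```python
import math
import random
from collections import defaultdict


def n_synthetic(n_real, rar):
    """Artificial samples needed for a real-to-artificial ratio (RAR).

    RAR=4 means four real samples per artificial one; RAR=1 doubles the data.
    Floor rounding is an assumption of this sketch.
    """
    return n_real // rar


def smote_stratum(samples, n_new, k=5, rng=None):
    """SMOTE-style interpolation within one stratum of numeric feature vectors."""
    rng = rng or random.Random(0)
    if len(samples) < 2 or n_new <= 0:
        return []  # interpolation needs at least two real samples
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(samples)
        # Rank the other samples by Euclidean distance to the base sample.
        others = [s for s in samples if s is not base]
        others.sort(key=lambda s: math.dist(base, s))
        neighbour = rng.choice(others[:k])
        gap = rng.random()  # position along the segment base -> neighbour
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbour)])
    return synthetic


def augment(records, rar, k=5, seed=0):
    """Generate artificial records per (status, age band, sex) stratum.

    `records` is a list of (status, age_band, sex, features) tuples;
    this layout is hypothetical.
    """
    rng = random.Random(seed)
    strata = defaultdict(list)
    for status, age_band, sex, features in records:
        strata[(status, age_band, sex)].append(features)
    artificial = []
    for (status, age_band, sex), feats in sorted(strata.items()):
        n_new = n_synthetic(len(feats), rar)
        for f in smote_stratum(feats, n_new, k, rng):
            artificial.append((status, age_band, sex, f))
    return artificial
```

Under the rounding assumed here, `n_synthetic(24000, 2)` gives 12,000 artificial samples for a dataset the size of UKB-CS. Because each synthetic vector is a convex combination of two real vectors from the same stratum, it stays within the range of the observed values for that stratum, which is also why input data quality matters: any artefact in the real samples is propagated into the artificial ones.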