    Please use this identifier to cite or link to this item: http://tkuir.lib.tku.edu.tw:8080/dspace/handle/987654321/119097


    Title: Exploring the feasibility of data augmentation while using smaller biobank data sets
    Authors: Lee, Chia Jung;Hsieh(謝璦如), Ai Ru;Kwok, Pui Yan;Fann, Cathy SJ
    Keywords: Computational tools;Bioinformatics;Genetic epidemiology;Genotype-phenotype correlations;Phenome-wide association
    Date: 2019/10/16
    Issue Date: 2020-09-17 12:12:53 (UTC+8)
    Abstract: Empowered by new computing technology and low genotyping costs, large biobank projects such as UK
    Biobank (UKB) have produced fruitful results for the biomedical sciences. However, several smaller
    biobanks sample from different ethnic groups, and the statistical power to detect associations
    in these datasets is lower. Data augmentation by synthesizing unobserved samples has shown
    promising results in applications of machine learning algorithms. Here, we hypothesized that
    augmenting small biobank data can increase statistical power and help detect reliable association
    signals.
    A two-step strategy was adopted. First, control samples were filtered using the Partitioning Around
    Medoids (PAM) algorithm, using the entire phenome to divide controls into clusters according to
    comorbidity. To reduce heterogeneity, only samples not in the same cluster for the phenotype of
    interest were used as controls. Second, cases and controls were stratified by age and gender. By
    applying the Synthetic Minority Oversampling Technique (SMOTE) to each stratum, artificial cases and
    controls were generated. In this study, we chose asthma as the phenotype. Data from Caucasians in
    UKB (UKB-C, N_{C-total}=204,893, N_{C-case}=31,303) and a random sample (UKB-CS, N_{CS-total}=24,000,
    N_{CS-case}=3,612) were selected. Fourteen linkage disequilibrium peaks (p ≤ 10^-8) from the UKB-C GWAS
    were used as targets for comparison. Only the HLA region was replicated using UKB-CS. Our strategy
    was then applied to UKB-CS. The real-to-artificial sample ratio (RAR) ranged from 4 (four real
    samples per artificial sample) to 1. Compared with the targets from the UKB-C data, 4 peaks were
    replicated when RAR=4, 5 when RAR=3, 6 when RAR=2, and 11 when RAR=1. The HLA region was prominent
    for every RAR. When RAR=2, false-positive peaks seemed modest; almost half of the signals could be
    replicated when roughly 1/9 of the UKB-C samples were used.
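    The second step (per-stratum oversampling at a given RAR) can be illustrated with a minimal sketch. This is not the authors' code: the function names (`n_synthetic`, `smote_like`, `augment_by_stratum`), the neighbour count `k=5`, and the treatment of every feature as a plain numeric value are assumptions made for illustration. The interpolation follows the general SMOTE idea (a new point on the segment between a real sample and one of its nearest neighbours); the abstract does not say how genotype values were encoded or post-processed.

```python
import numpy as np

def n_synthetic(n_real: int, rar: int) -> int:
    """Number of artificial samples so that real:artificial is rar:1 (assumed reading of RAR)."""
    return n_real // rar

def smote_like(X, n_new: int, k: int = 5, rng=None) -> np.ndarray:
    """Create n_new points by interpolating between a real sample and one of its k nearest neighbours."""
    rng = np.random.default_rng(rng)
    X = np.asarray(X, dtype=float)
    n = X.shape[0]
    if n < 2 or n_new == 0:
        return np.empty((0, X.shape[1]))
    k = min(k, n - 1)
    # Pairwise squared Euclidean distances; O(n^2) memory, fine only for a small sketch.
    d = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)
    np.fill_diagonal(d, np.inf)                 # a sample is never its own neighbour
    neighbours = np.argsort(d, axis=1)[:, :k]   # indices of the k nearest neighbours per sample
    seeds = rng.integers(0, n, size=n_new)      # real samples to start from
    picks = neighbours[seeds, rng.integers(0, k, size=n_new)]
    gap = rng.random((n_new, 1))                # position along the segment, in [0, 1)
    return X[seeds] + gap * (X[picks] - X[seeds])

def augment_by_stratum(X, strata, rar: int, k: int = 5, seed: int = 0) -> np.ndarray:
    """Apply smote_like separately within each stratum (e.g. case status x age band x gender)."""
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    strata = np.asarray(strata)
    parts = []
    for s in np.unique(strata):
        Xs = X[strata == s]
        parts.append(smote_like(Xs, n_synthetic(len(Xs), rar), k, rng))
    return np.vstack(parts)
```

    Under this reading of RAR, a call such as `augment_by_stratum(X, strata, rar=2)` adds one artificial sample for every two real samples in each stratum, so 24,000 real samples would be accompanied by about 12,000 artificial ones. Interpolated values are continuous, so any rounding back to discrete genotype codes would be a further design decision not covered here.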
    The above procedure was applied to data from Taiwan Biobank (TWB, N_{T-total}=23,942, N_{T-case}=2,069).
    Without augmentation, only the HLA region was significant. When RAR=2, GWAS results for TWB and
    UKB-CS showed a similar trend. In addition to the HLA region, only two other regions were replicated
    for TWB. Population heterogeneity may contribute to this discrepancy. Our results show that data
    augmentation is promising; however, caution is needed with respect to input data quality and
    possible stratification. More testing of augmentation algorithms should be done to further
    evaluate their performance.
    Appears in Collections: [Graduate Institute & Department of Statistics] Proceeding
