Tamkang University Institutional Repository: Item 987654321/119097
    Please use this permanent URL to cite or link to this item: https://tkuir.lib.tku.edu.tw/dspace/handle/987654321/119097


    Title: Exploring the feasibility of data augmentation while using smaller biobank data sets
    Authors: Lee, Chia Jung; Hsieh, Ai Ru (謝璦如); Kwok, Pui Yan; Fann, Cathy SJ
    Keywords: Computational tools; Bioinformatics; Genetic epidemiology; Genotype-phenotype correlations; Phenome-wide association
    Date: 2019-10-16
    Uploaded: 2020-09-17 12:12:53 (UTC+8)
    Abstract: Empowered by new computing technology and low genotyping costs, large biobank projects such as
    UK Biobank (UKB) have produced fruitful results for the biomedical sciences. However, several smaller
    biobanks sample from different ethnic groups, and the statistical power to detect associations in
    these data sets is lower. Data augmentation by synthesizing unobserved samples has shown promising
    results in applications of machine learning algorithms. Here, we hypothesized that augmenting small
    biobank data can increase statistical power and help detect reliable association signals.
    A two-step strategy was adopted. First, control samples were filtered using the Partitioning Around
    Medoids (PAM) algorithm, which used the entire phenome to divide controls into clusters according to
    comorbidity. To reduce heterogeneity, only samples that were not in the same cluster for the phenotype
    of interest were used as controls. Second, cases and controls were stratified by age and gender.
    Artificial cases and controls were then generated by applying the Synthetic Minority Oversampling
    Technique (SMOTE) to each stratum. In this study, asthma was chosen as the phenotype. A data set of
    Caucasians in UKB (UKB-C, NC-total=204,893, NC-case=31,303) and a random sample drawn from it (UKB-CS,
    NCS-total=24,000, NCS-case=3,612) were selected. Fourteen linkage disequilibrium peaks (p ≤ 10^-8) from
    the UKB-C GWAS were used as targets for comparison. Using UKB-CS alone, only the HLA region was
    replicated. Our strategy was then applied to UKB-CS. The real-to-artificial sample ratio (RAR) ranged
    from 4 (four real samples per artificial sample) to 1. Compared with the targets from the UKB-C data,
    4 peaks were replicated when RAR=4, 5 when RAR=3, 6 when RAR=2, and 11 when RAR=1. The HLA region was
    prominent for every RAR. When RAR=2, false-positive peaks seemed modest; almost half of the signals
    (6 of 14) could be replicated when roughly 1/9 of the UKB-C samples were used.
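    The second step (SMOTE within each age/gender stratum at a given RAR) can be sketched roughly as
    follows. This is a minimal illustration, not the authors' implementation: the abstract does not say
    which features SMOTE was applied to or which software was used, so the feature matrix, the stratum
    labels, the neighbour count k, and the helper names smote_like and augment_by_stratum are all
    assumptions made for this example. The PAM filtering of controls is omitted. Note that interpolating
    genotype codes (0/1/2) yields fractional, dosage-like values.

```python
import numpy as np

def smote_like(X, n_new, k=5, rng=None):
    """SMOTE-style interpolation: each synthetic row lies on the segment
    between a randomly chosen real row and one of its k nearest neighbours.
    Builds a dense pairwise distance matrix, so only suitable for small strata."""
    rng = np.random.default_rng(rng)
    X = np.asarray(X, dtype=float)
    n = len(X)
    if n < 2 or n_new <= 0:
        return np.empty((0, X.shape[1]))
    k = min(k, n - 1)
    # Pairwise squared Euclidean distances; exclude self-matches.
    d = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)
    np.fill_diagonal(d, np.inf)
    neighbours = np.argsort(d, axis=1)[:, :k]
    base = rng.integers(0, n, size=n_new)                    # seed rows
    pick = neighbours[base, rng.integers(0, k, size=n_new)]  # one neighbour each
    gap = rng.random((n_new, 1))                             # position on the segment
    return X[base] + gap * (X[pick] - X[base])

def augment_by_stratum(X, strata, rar, k=5, seed=0):
    """Within each stratum, add round(n_real / rar) synthetic rows, so that
    RAR = 4 means four real samples per artificial one (assumed reading)."""
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    strata = np.asarray(strata)
    new_rows, new_strata = [], []
    for s in np.unique(strata):
        real = X[strata == s]
        synth = smote_like(real, int(round(len(real) / rar)), k=k, rng=rng)
        new_rows.append(synth)
        new_strata.extend([s] * len(synth))
    return np.vstack(new_rows), np.array(new_strata)

# Toy data (hypothetical): 40 cases, 6 SNPs coded 0/1/2, four strata of 10.
demo_rng = np.random.default_rng(1)
X_cases = demo_rng.integers(0, 3, size=(40, 6)).astype(float)
strata = np.array(["F_young", "F_old", "M_young", "M_old"]).repeat(10)
X_new, s_new = augment_by_stratum(X_cases, strata, rar=2)
print(X_new.shape)  # (20, 6): 5 synthetic rows per stratum at RAR = 2
```

    In a case-control setting the same function would be called once for cases and once for controls, so
    that synthetic samples never mix the two groups or cross stratum boundaries.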
    The above procedure was then applied to data from Taiwan Biobank (TWB, NT-total=23,942, NT-case=2,069).
    Without augmentation, only the HLA region was significant. When RAR=2, GWAS results for TWB and UKB-CS
    showed a similar trend. However, in addition to the HLA region, only two other regions were replicated
    for TWB. Population heterogeneity may contribute to this discrepancy. Our results show that data
    augmentation is promising; however, caution is needed with respect to input data quality, possible
    stratification, and related issues. Augmentation algorithms should be tested further to evaluate
    their performance.
    Appears in Collections: [Department and Graduate Institute of Statistics] Conference Papers

