    Please use this permanent URL to cite or link to this item: https://tkuir.lib.tku.edu.tw/dspace/handle/987654321/114267


    Title: 高維度不平衡資料演算法之變數篩選
    Alternative title: Variable selection for imbalanced learning from high-dimensional datasets
    Author: 俞允晨;Yu, Yun Chen
    Contributors: Master's Program, Department of Mathematics, Tamkang University
    王彥雯
    Keywords: Binary classification; High-dimensional imbalanced data; Variable selection
    Date: 2017
    Uploaded: 2018-08-03 14:47:04 (UTC+8)
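    Abstract: Binary classification problems often involve imbalanced data. The challenge in building a classification model for such problems is that one class has far more samples than the other, that is, the class distribution is skewed. As a result, traditional classifiers tend during training to classify the class with the most samples (the major class) correctly while neglecting the class with the fewest samples (the minor class). This raises the classifier's overall accuracy but lowers its sensitivity for the minority class. Moreover, advances in information technology have greatly reduced the difficulty of acquiring and storing data, so practical applications often face very high data dimensionality, which complicates analysis. In high-dimensional class-imbalanced classification in particular, the large set of variables contains many that have no discriminating power, that is, noise. Combined with the class imbalance, this tends to bias the classifier during training and leads to very low prediction accuracy for the minority class. To address the classification of high-dimensional class-imbalanced data, this study first uses the Kolmogorov–Smirnov statistic to select the variables that have discriminating power. Then, building on the Meta Imbalance Classification Ensemble (MICE) algorithm that Lin et al. (2009) proposed for imbalanced data, it adds a 1-norm constraint to select the better-performing sub-classifiers and integrates them into the final classification model for class prediction. The experimental results show that the proposed method achieves better sensitivity for the minority class, and that when the dimension is high, the noise variables must be removed first in order to build the classification model and obtain better classification performance.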
    The class imbalance problem in binary classification makes it difficult to build a good learning algorithm. When the class distribution is skewed, that is, when one class has far more samples than the other, traditional learning algorithms tend to label the majority class correctly and neglect the minority class. This raises the overall accuracy of the classifier but lowers its sensitivity for the minority class. In addition, advances in information technology have made it easy to collect and store large-scale data. In practice, however, such data are difficult to analyze because of their high dimensionality. In high-dimensional imbalanced classification problems, classifiers trained on many non-discriminating variables (noise variables) are biased and yield low prediction accuracy for the minority class. We therefore propose two algorithms that combine a variable selection process based on the Kolmogorov–Smirnov statistic with a modification of the MICE algorithm (Lin et al., 2009) to analyze high-dimensional imbalanced data. The simulation results show that the proposed methods have higher sensitivity for the minority class. When the dimension is high, the noise variables must be removed before the classification model is built in order to obtain better classifier performance. Finally, a lung cancer dataset is used to evaluate the performance of the proposed methods in real applications.
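
The variable screening step mentioned in the abstract can be illustrated with a minimal sketch. It is not the thesis's implementation: the abstract does not state how the Kolmogorov–Smirnov statistic is thresholded, so the fixed `top_k` cutoff, the function names `ks_statistic`, `screen_variables`, and `make_toy_data`, the coding of the minority class as label 1, and the simulated data are all assumptions made for illustration. The MICE ensemble and the 1-norm constraint are not shown.

```python
import random
from bisect import bisect_right


def ks_statistic(sample_a, sample_b):
    """Two-sample Kolmogorov-Smirnov statistic: max |F_a(x) - F_b(x)|."""
    a_sorted, b_sorted = sorted(sample_a), sorted(sample_b)
    d = 0.0
    # The empirical CDFs only change at observed values, so checking the
    # pooled sample points is enough to find the largest gap.
    for x in a_sorted + b_sorted:
        f_a = bisect_right(a_sorted, x) / len(a_sorted)
        f_b = bisect_right(b_sorted, x) / len(b_sorted)
        d = max(d, abs(f_a - f_b))
    return d


def screen_variables(X, y, top_k):
    """Score each column by the KS statistic between the two classes and
    return the indices of the top_k columns (hypothetical cutoff rule)."""
    n_features = len(X[0])
    scores = []
    for j in range(n_features):
        minority = [row[j] for row, label in zip(X, y) if label == 1]
        majority = [row[j] for row, label in zip(X, y) if label == 0]
        scores.append(ks_statistic(minority, majority))
    ranked = sorted(range(n_features), key=lambda j: scores[j], reverse=True)
    return ranked[:top_k], scores


def make_toy_data(n_major=200, n_minor=20, n_features=50,
                  n_informative=3, shift=3.0, seed=0):
    """Simulated imbalanced data: only the first n_informative columns
    differ between classes; the rest are pure noise."""
    rng = random.Random(seed)
    X, y = [], []
    for label, n in ((0, n_major), (1, n_minor)):
        for _ in range(n):
            row = [rng.gauss(0.0, 1.0) for _ in range(n_features)]
            if label == 1:
                for j in range(n_informative):
                    row[j] += shift
            X.append(row)
            y.append(label)
    return X, y


X, y = make_toy_data()
selected, scores = screen_variables(X, y, top_k=3)
print("selected variable indices:", sorted(selected))
```

Because the KS statistic compares the two class-conditional distributions of each variable separately, it does not depend on the class proportions, which is one plausible reason it suits screening on imbalanced data. In practice a library routine such as `scipy.stats.ks_2samp` would normally replace the hand-written `ks_statistic`.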
    Appears in collections: [Department of Mathematics and Graduate Institute] Theses and dissertations
