Variable Selection for Imbalanced Learning from High-Dimensional Datasets (高維度不平衡資料演算法之變數篩選)


Please use this permanent URL to cite or link to this item: https://tkuir.lib.tku.edu.tw/dspace/handle/987654321/114267

Title:	高維度不平衡資料演算法之變數篩選
Other title:	Variable selection for imbalanced learning from high-dimensional datasets
Author:	Yu, Yun Chen (俞允晨)
Contributor:	Master's Program, Department of Mathematics, Tamkang University; 王彥雯
Keywords:	Binary classification; High-dimensional imbalanced data; Variable selection
Date:	2017
Upload time:	2018-08-03 14:47:04 (UTC+8)
Abstract:	Binary classification problems often involve imbalanced data, in which one class has far more samples than the other, that is, the class distribution is skewed. Traditional classifiers trained on such data tend to classify the majority class correctly and neglect the minority class; this raises the overall accuracy of the classifier but lowers its sensitivity for the minority class. In addition, advances in information technology have made data much easier to collect and store, so practical applications often face very high dimensionality, which makes analysis difficult. In high-dimensional imbalanced classification in particular, the many variables include a large number with no discriminative power (noise variables); combined with the class imbalance, this biases classifier training and leads to low prediction accuracy for the minority class. To address this problem, this study first screens for discriminative variables using the Kolmogorov–Smirnov statistic. It then builds on the Meta Imbalance Classification Ensemble (MICE) algorithm proposed by Lin et al. (2009) for imbalanced data, adding a 1-norm constraint to select the better-performing sub-classifiers, which are combined into the final classification model used for class prediction. Two algorithms are proposed on this basis. Simulation results show that the proposed method achieves higher sensitivity for the minority class, and that when the dimension is high, noise variables must be removed before the classification model is built in order to obtain better classification performance. Finally, a lung cancer dataset is used to evaluate the performance of the proposed methods in a real application.
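The screening step described in the abstract can be illustrated with a minimal sketch. It is not the thesis's implementation: the abstract does not give the screening threshold, the MICE steps, or the form of the 1-norm constraint, so none of those are reproduced here. The function names (`ks_statistic`, `screen_variables`, `sensitivity`, `accuracy`), the synthetic data, the mean shift of 3.0, and the choice to keep the top three variables are all assumptions made for illustration. The sketch ranks variables by the two-sample Kolmogorov–Smirnov statistic between the minority and majority classes, and also shows why overall accuracy can look good while minority-class sensitivity is zero.

```python
import random
from bisect import bisect_right


def ks_statistic(a, b):
    """Two-sample KS statistic: largest gap between the two empirical CDFs."""
    a_sorted, b_sorted = sorted(a), sorted(b)
    gap = 0.0
    # The supremum of the gap is attained at one of the observed data points.
    for x in a_sorted + b_sorted:
        cdf_a = bisect_right(a_sorted, x) / len(a_sorted)
        cdf_b = bisect_right(b_sorted, x) / len(b_sorted)
        gap = max(gap, abs(cdf_a - cdf_b))
    return gap


def screen_variables(X, y, keep):
    """Score each column by its between-class KS statistic; return the top `keep`."""
    minority = [row for row, label in zip(X, y) if label == 1]
    majority = [row for row, label in zip(X, y) if label == 0]
    scores = [
        ks_statistic([r[j] for r in minority], [r[j] for r in majority])
        for j in range(len(X[0]))
    ]
    ranked = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
    return ranked[:keep], scores


def sensitivity(y_true, y_pred):
    """Fraction of minority-class (label 1) samples that are predicted as 1."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp / (tp + fn)


def accuracy(y_true, y_pred):
    """Fraction of all samples whose predicted label matches the true label."""
    return sum(1 for t, p in zip(y_true, y_pred) if t == p) / len(y_true)


# Synthetic, illustrative data (assumed sizes): 200 majority and 20 minority
# samples, 50 variables, of which only the first 3 differ between classes.
random.seed(0)
N_MAJOR, N_MINOR, N_VARS, N_INFORMATIVE, SHIFT = 200, 20, 50, 3, 3.0
X, y = [], []
for label, n in ((0, N_MAJOR), (1, N_MINOR)):
    for _ in range(n):
        X.append([
            random.gauss(SHIFT if label == 1 and j < N_INFORMATIVE else 0.0, 1.0)
            for j in range(N_VARS)
        ])
        y.append(label)

selected, scores = screen_variables(X, y, keep=N_INFORMATIVE)
print("selected variables:", sorted(selected))

# A classifier that always predicts the majority class has high accuracy
# but zero sensitivity for the minority class.
always_major = [0] * len(y)
print("accuracy:", accuracy(y, always_major))
print("sensitivity:", sensitivity(y, always_major))
```

In this setup the informative variables have a much larger between-class KS statistic than the noise variables, so ranking by the statistic separates them. In practice a cutoff would have to be chosen, for example from a significance level or by cross-validation; that choice is outside what the abstract specifies.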
Appears in collections:	[Department of Applied Mathematics and Data Science] Theses

Files in this item:

File	Description	Size	Format	Views
index.html		0Kb	HTML	223

All items in the institutional repository are protected by their original copyright.
