Tamkang University Institutional Repository: Item 987654321/114267


    Please use this permanent URL to cite or link to this item: https://tkuir.lib.tku.edu.tw/dspace/handle/987654321/114267


    Title: 高維度不平衡資料演算法之變數篩選
    Other title: Variable selection for imbalanced learning from high-dimensional datasets
    Author: 俞允晨; Yu, Yun Chen
    Contributors: Master's Program, Department of Mathematics, Tamkang University (淡江大學數學學系碩士班)
    王彥雯
    Keywords: Binary classification; High-dimensional imbalanced data; Variable selection; 二元分類; 高維度不平衡資料; 變數篩選
    Date: 2017
    Uploaded: 2018-08-03 14:47:04 (UTC+8)
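    Abstract (translated from the Chinese): Binary classification problems often involve imbalanced data. The challenge in building a classification model for such problems is that one class has far more samples than the other, that is, the class distribution is skewed. As a result, traditional classifiers tend during training to classify the majority class correctly and neglect the minority class. This raises the overall accuracy of the classifier but lowers its sensitivity for the minority class. In addition, advances in information technology have made data much easier to acquire and store, so practical applications often face data of very high dimension, which makes analysis difficult. In particular, in high-dimensional class-imbalanced classification problems, the large set of variables contains many that have no discriminating power, that is, noise. Combined with the class imbalance, this tends to bias the classifier during training and leads to very low prediction accuracy for the minority class. To address the classification of high-dimensional class-imbalanced data, this study first uses the Kolmogorov–Smirnov statistic to select variables with discriminating power. Then, building on the Meta Imbalance Classification Ensemble (MICE) algorithm proposed by Lin et al. (2009) for imbalanced data, it adds a 1-norm constraint to select the better-performing sub-classifiers and combines them into the final classification model for class prediction. Experimental results show that the proposed method achieves better sensitivity for the minority class, and that when the dimension is high, noise variables must be removed first in order to build the classification model effectively and obtain better classification performance.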
    The class imbalance problem in binary classification is a challenge for establishing a good learning algorithm. When the data have a skewed class distribution, that is, when the sample size of one class is much larger than that of the other, traditional learning algorithms tend to assign correct labels to the majority group and neglect the minority group in order to gain higher overall accuracy. Such algorithms, however, have reduced sensitivity for the minority group. In addition, with advances in information technology, researchers are able to collect and store large-scale data. In practice, however, large-scale data are difficult to analyze because of their high dimensionality. In high-dimensional imbalanced classification problems, classifiers trained with a large number of non-discriminating variables (noise variables) will be biased, which results in lower prediction accuracy for the minority group. Hence, we propose two algorithms that combine a variable selection process based on the Kolmogorov–Smirnov statistic with a modification of the MICE algorithm (Lin et al., 2009) to analyze high-dimensional imbalanced data. The simulation results show that the proposed method has higher sensitivity for the minority group. When the dimension is high, the noise variables need to be removed before the classification model is constructed in order to obtain better classifier performance. Finally, a lung cancer dataset is used to evaluate the performance of the proposed methods in real applications.
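    The variable screening step named in the abstract can be illustrated with a minimal sketch. It computes the two-sample Kolmogorov–Smirnov statistic (the largest gap between the empirical distribution functions of the two classes) for each variable and keeps the variables with the largest values. The record does not say how the thesis thresholds the statistic, so keeping a fixed `top_k` is an assumption, as are the function names, the label coding (1 = minority, 0 = majority), and the synthetic data. The MICE ensemble and the 1-norm constraint are not covered here.

```python
import numpy as np


def ks_statistic(a, b):
    """Two-sample KS statistic: max absolute gap between the empirical CDFs."""
    a = np.sort(np.asarray(a, dtype=float))
    b = np.sort(np.asarray(b, dtype=float))
    # The maximum gap is always attained at one of the observed values.
    grid = np.concatenate([a, b])
    cdf_a = np.searchsorted(a, grid, side="right") / a.size
    cdf_b = np.searchsorted(b, grid, side="right") / b.size
    return float(np.max(np.abs(cdf_a - cdf_b)))


def ks_screen(X, y, top_k):
    """Score each column of X by its between-class KS statistic.

    Assumes y is coded 1 for the minority class and 0 for the majority
    class (an illustrative convention, not taken from the thesis).
    Returns the indices of the top_k highest-scoring columns and all scores.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    minority, majority = X[y == 1], X[y == 0]
    scores = np.array(
        [ks_statistic(minority[:, j], majority[:, j]) for j in range(X.shape[1])]
    )
    order = np.argsort(-scores, kind="stable")
    return order[:top_k], scores


def make_demo_data(seed=0, n_major=500, n_minor=25, p=200, n_signal=5, shift=3.0):
    """Synthetic imbalanced data: only the first n_signal columns carry signal."""
    rng = np.random.default_rng(seed)
    X = rng.normal(size=(n_major + n_minor, p))
    y = np.concatenate([np.zeros(n_major, dtype=int), np.ones(n_minor, dtype=int)])
    X[y == 1, :n_signal] += shift  # shift the minority class on signal columns
    return X, y


if __name__ == "__main__":
    X, y = make_demo_data()
    selected, scores = ks_screen(X, y, top_k=5)
    print("selected columns:", sorted(selected.tolist()))
```

    The KS statistic depends only on the shape of each class's empirical distribution, not on the class sizes directly, which is one plausible reason it suits screening under imbalance. With a very small minority class the statistic becomes noisy, so the cutoff would need care in practice.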
    Appears in collections: [Department of Applied Mathematics and Data Science] Theses

    Files in this item:

    File         Description   Size   Format   Views
    index.html                 0Kb    HTML     225

    All items in the institutional repository are protected by the original copyright.
