Feature selection in bioinformatics

Please use this permanent URL to cite or link to this document: https://tkuir.lib.tku.edu.tw/dspace/handle/987654321/52408

Title:	Feature selection in bioinformatics
Other title:	生物資訊中之特徵選取
Author:	謝正葦; Hsieh, Cheng-wei
Contributors:	Tamkang University, Department of Computer Science and Information Engineering, Doctoral Program; 許輝煌
Keywords:	Feature Selection; Bioinformatics; Clustering; Machine learning
Date:	2010
Uploaded:	2010-09-23 17:36:32 (UTC+8)
Abstract:	This dissertation addresses feature selection problems, particularly in bioinformatics. In most prediction, classification, and gene selection applications, the number of features ranges from hundreds to thousands and sometimes exceeds ten thousand. Too many features make classifiers and predictors expensive to train and run, and can also substantially reduce prediction accuracy. Feature selection is an effective remedy: it keeps only the most important features for the learning model to train on, and it removes noisy features from the original feature set. In contrast to feature extraction, feature selection preserves the important information in the original features, so it also reveals which features are closely related to the problem.
However, no existing feature selection model is both fast and highly accurate. Feature selection models fall into two main categories, filters and wrappers. A filter model uses information theory to compute the information content of features or the dependency between them; it runs faster than a wrapper but does not guarantee prediction accuracy. A wrapper embeds a learning model in the feature selection procedure and uses the learning results to decide which features to remove; its results are more accurate than a filter's, but it requires more computation time. In this dissertation, we propose two mechanisms that aim to improve computational performance while maintaining prediction accuracy. The first is a hybrid feature selection model. In the first stage, a filter quickly removes most of the noisy features; in the second stage, an accuracy-oriented wrapper with a learning model analyzes only the features that remain. Because the slow wrapper sees only a fraction of the original features, the total computation drops substantially: the filter provides the efficiency and the wrapper the accuracy. The second is a clustering-based feature selection model. Clustering analyzes distances among data points in a high-dimensional space, so it can be used to measure similarities among features. Features that fall into the same cluster are regarded as similar or redundant; one feature is retained as the representative of each cluster and the rest are removed. We also replace the distance computation at the core of the clustering mechanism, originally Euclidean distance, with a similarity measure, judging whether two features are similar by the value of their correlation coefficient. This procedure guarantees that the selected features have quite different characteristics from one another.
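The correlation-based clustering idea described above can be illustrated with a minimal sketch. The abstract does not say which clustering algorithm the dissertation uses, so this sketch assumes a simple leader-style grouping: a feature joins the first cluster whose representative it correlates with at or above a threshold, and otherwise starts a new cluster. The function names `pearson` and `cluster_select`, the threshold of 0.9, and the toy data are all hypothetical.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return 0.0  # constant feature: treat it as uncorrelated
    return cov / sqrt(vx * vy)


def cluster_select(columns, threshold=0.9):
    """Group similar features and keep one representative per group.

    Assumption: leader-style clustering with |r| as the similarity.
    `columns` is a list of feature columns (one list of values each).
    Returns (representatives, clusters) as lists of column indices.
    """
    clusters = []  # first index in each cluster is its representative
    for j, col in enumerate(columns):
        for members in clusters:
            rep = columns[members[0]]
            if abs(pearson(rep, col)) >= threshold:
                members.append(j)  # redundant with this representative
                break
        else:
            clusters.append([j])  # dissimilar to every representative
    return [members[0] for members in clusters], clusters


# Toy data: features 1 and 3 are (near-)linear copies of feature 0.
columns = [
    [1, 2, 3, 4, 5],
    [2, 4, 6, 8, 10.1],
    [5, 3, 4, 1, 2],
    [-1, -2, -3, -4, -5],
    [1, 0, 1, 0, 1],
]
representatives, clusters = cluster_select(columns)
# representatives == [0, 2, 4]; clusters == [[0, 1, 3], [2], [4]]
```

In this sketch a feature becomes a representative only if its absolute correlation with every earlier representative is below the threshold, so the retained features are pairwise dissimilar under that measure.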
This kind of feature selection can be regarded as a redundant-feature-removal model. In the dissertation, we evaluate the proposed methods on three bioinformatics applications: protein disordered region prediction, protein crystallization prediction, and gene selection from microarray data. In these experiments, the proposed models were applied to assist the prediction procedures and improved their performance. We also compare our methods with other well-known feature selection models; the results show that the proposed methods improve performance more effectively.
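The two-stage hybrid model described in the abstract (a fast filter followed by an accuracy-driven wrapper) can be sketched as below. The abstract names neither the filter criterion nor the learning model, so this sketch substitutes stand-ins that are assumptions, not the dissertation's choices: the filter ranks features by absolute Pearson correlation with the class label, and the wrapper is a greedy forward search scored by the leave-one-out accuracy of a 1-nearest-neighbour classifier. All function names and the toy data are hypothetical.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient; 0.0 if either input is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / sqrt(vx * vy) if vx > 0 and vy > 0 else 0.0


def filter_stage(X, y, keep):
    """Stage 1 (stand-in filter): keep the `keep` features with highest |r| to the label."""
    n_features = len(X[0])
    scores = [abs(pearson([row[j] for row in X], y)) for j in range(n_features)]
    ranked = sorted(range(n_features), key=lambda j: scores[j], reverse=True)
    return ranked[:keep]


def loo_accuracy(X, y, features):
    """Leave-one-out accuracy of a 1-nearest-neighbour classifier on `features`."""
    correct = 0
    for i, row in enumerate(X):
        best_dist, best_label = None, None
        for k, other in enumerate(X):
            if k == i:
                continue  # hold out the sample being classified
            d = sum((row[j] - other[j]) ** 2 for j in features)
            if best_dist is None or d < best_dist:
                best_dist, best_label = d, y[k]
        correct += best_label == y[i]
    return correct / len(X)


def wrapper_stage(X, y, candidates):
    """Stage 2 (stand-in wrapper): greedy forward selection over the candidates."""
    selected, best = [], 0.0
    while True:
        trials = [(loo_accuracy(X, y, selected + [j]), j)
                  for j in candidates if j not in selected]
        if not trials:
            break
        acc, j = max(trials, key=lambda t: t[0])
        if acc <= best:
            break  # no remaining feature improves accuracy
        selected.append(j)
        best = acc
    return selected, best


def hybrid_select(X, y, keep):
    """Run the filter first so the slower wrapper sees only `keep` features."""
    return wrapper_stage(X, y, filter_stage(X, y, keep))


# Toy data: feature 0 separates the classes, feature 3 is weakly informative.
y = [0, 0, 0, 0, 1, 1, 1, 1]
f0 = [0.1, 0.2, 0.0, 0.3, 0.9, 1.0, 0.8, 1.1]
f1 = [5, 1, 4, 2, 5, 1, 4, 2]
f2 = [0.0, 1.0, 0.0, 1.0, 0.0, 1.0, 0.0, 1.0]
f3 = [1, 2, 3, 4, 2, 3, 4, 5]
X = [list(row) for row in zip(f0, f1, f2, f3)]
candidates = filter_stage(X, y, keep=2)
selected, accuracy = hybrid_select(X, y, keep=2)
```

On this toy data the filter passes features 0 and 3 to the wrapper, which then keeps only feature 0. The wrapper evaluates the classifier on two candidates instead of four, which is the source of the speed-up the abstract describes.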
Appears in collections:	[Department of Computer Science and Information Engineering] Theses and Dissertations

All items in the institutional repository are protected by their original copyright.