    Please use this permanent URL to cite or link to this item: https://tkuir.lib.tku.edu.tw/dspace/handle/987654321/35205


    Title: 使用混合特徵選取於蛋白質結晶預測
    Other title: Hybrid feature selection in protein crystallization prediction
    Author: Yeh, Jr-yuan (葉致遠)
    Contributors: Master's Program, Department of Computer Science and Information Engineering, Tamkang University
    Hsu, Hui-huang (許輝煌)
    Keywords: Support Vector Machine; Adaboost; Feature Selection; Machine Learning; Imbalanced Data; Protein Crystallization
    Date: 2009
    Uploaded: 2010-01-11 06:09:06 (UTC+8)
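    Abstract:   Proteins are the main substances that make up living things. A protein's function varies with its structure, so determining the three-dimensional structure of protein molecules is a goal that scientists work toward. At present, apart from predicting structure with statistical learning theory, protein three-dimensional structures are in practice usually determined from the results of X-ray diffraction or Nuclear Magnetic Resonance (NMR) experiments. NMR is time-consuming and costly, and it does not always succeed in resolving the protein structure. If a protein solution yields crystals, however, the crystals can be analyzed by X-ray diffraction. Not all proteins can be crystallized, so predicting whether a protein can crystallize becomes an important problem.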
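      We encode the various kinds of information provided by amino acid sequences (that is, the primary structure of proteins) obtained from the TargetDB protein database, and use two feature selection methods, F-score and Information Gain, to pick the features that contribute most to predicting protein crystallization. We then train on the selected data with a support vector machine and with the Adaboost algorithm, respectively. A support vector machine uses a hyperplane to separate data of different classes in the feature space, thereby achieving classification. Adaboost repeatedly adjusts the weight of each training example over a number of weak-learner training rounds in order to reduce the weak learner's error rate, and finally combines the results of these rounds into a strong learner that performs the classification.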
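The F-score filter named above can be illustrated with a minimal sketch. The abstract does not give the exact formula used in the thesis, so this assumes the commonly used two-class definition (squared distance of each class mean from the overall mean, divided by the sum of the per-class sample variances). The function names and the toy data are hypothetical and are not taken from the thesis.

```python
from statistics import mean, variance


def f_score(pos, neg):
    """F-score of one feature, given its values in the positive and negative class.

    Assumed definition: between-class separation divided by within-class spread.
    """
    overall = mean(pos + neg)
    between = (mean(pos) - overall) ** 2 + (mean(neg) - overall) ** 2
    within = variance(pos) + variance(neg)  # sample variances (n - 1 denominator)
    if within == 0:
        # Constant within both classes: perfectly separating or useless.
        return float("inf") if between > 0 else 0.0
    return between / within


def rank_features(X, y):
    """Return (feature indices sorted by descending F-score, list of scores)."""
    n_features = len(X[0])
    scores = []
    for j in range(n_features):
        pos = [row[j] for row, label in zip(X, y) if label == 1]
        neg = [row[j] for row, label in zip(X, y) if label == 0]
        scores.append(f_score(pos, neg))
    order = sorted(range(n_features), key=lambda j: scores[j], reverse=True)
    return order, scores


# Toy data (invented): feature 0 separates the classes, feature 1 does not.
X = [[1.0, 5.0], [1.2, 3.0], [0.9, 4.0], [3.0, 4.5], [3.2, 3.5], [2.9, 4.0]]
y = [1, 1, 1, 0, 0, 0]
order, scores = rank_features(X, y)
```

A filter of this kind scores each feature independently, so it is cheap to compute but cannot detect features that are only useful in combination.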
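      In our experiments, the prediction accuracy on the TargetDB data reaches 93.02%, with a sensitivity (crystallizable samples correctly classified as crystallizable) of 95.49% and a specificity (non-crystallizable samples correctly classified as non-crystallizable) of 86.08%. The purpose of these experiments is to identify the factors that prevent proteins from crystallizing and, further, to mitigate those factors so that the proteins can be crystallized and their structural information obtained by X-ray diffraction.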
    Proteins are the major components of organisms. The structure of a protein gives information about its functions, so determining protein structures is important. Scientists usually use X-ray diffraction or Nuclear Magnetic Resonance (NMR) to determine protein structures. However, NMR is time-consuming and expensive, so X-ray diffraction is usually preferred. X-ray diffraction requires that the target protein can be crystallized. Predicting whether a target protein can be crystallized is therefore very important.
    In this thesis, we use the data in TargetDB to generate a data set of features that are significantly related to protein crystallization. We then apply two feature selection methods to the data set to remove irrelevant or redundant features. After feature selection, we use the support vector machine (SVM) and Adaboost separately to predict whether the proteins can be crystallized. Furthermore, we compare and discuss the results generated by these two methods.
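As an illustration of how Adaboost builds a strong learner from weak ones, here is a minimal sketch of discrete Adaboost. The abstract does not say which weak learner the thesis uses, so single-feature threshold rules (decision stumps) are assumed here. All function names and the toy data are hypothetical.

```python
import math


def train_stump(X, y, w):
    """Return (weighted error, feature, threshold, polarity) of the best stump."""
    best = None
    for j in range(len(X[0])):
        for thr in sorted({row[j] for row in X}):
            for polarity in (1, -1):
                preds = [polarity if row[j] <= thr else -polarity for row in X]
                err = sum(wi for wi, p, yi in zip(w, preds, y) if p != yi)
                if best is None or err < best[0]:
                    best = (err, j, thr, polarity)
    return best


def stump_predict(stump, row):
    _, j, thr, polarity = stump
    return polarity if row[j] <= thr else -polarity


def adaboost(X, y, rounds=10):
    """Train on labels in {-1, +1}; return a list of (alpha, stump) pairs."""
    n = len(X)
    w = [1.0 / n] * n  # start with uniform example weights
    ensemble = []
    for _ in range(rounds):
        stump = train_stump(X, y, w)
        err = max(stump[0], 1e-10)  # guard against log of infinity
        if err >= 0.5:  # weak learner no better than chance: stop
            break
        alpha = 0.5 * math.log((1 - err) / err)
        ensemble.append((alpha, stump))
        # Raise the weight of misclassified examples, lower the rest.
        w = [wi * math.exp(-alpha * yi * stump_predict(stump, row))
             for wi, row, yi in zip(w, X, y)]
        total = sum(w)
        w = [wi / total for wi in w]
    return ensemble


def predict(ensemble, row):
    score = sum(alpha * stump_predict(s, row) for alpha, s in ensemble)
    return 1 if score >= 0 else -1


# Toy 1-D data (invented) that no single threshold can separate.
X = [[v] for v in range(10)]
y = [1, 1, 1, -1, -1, -1, 1, 1, 1, -1]
one_round = adaboost(X, y, rounds=1)
three_rounds = adaboost(X, y, rounds=3)
```

On this toy set a single stump misclassifies three points, while the weighted vote of three stumps fits all ten, which is the effect the reweighting is meant to produce.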
    According to our experimental results, Adaboost achieves higher accuracy than SVM on the same data set. The prediction accuracy of Adaboost is 93.02%, and its sensitivity (on crystallized data) and specificity (on non-crystallized data) are 95.49% and 86.08%, respectively. The purpose of our experiments is to identify the factors that may prevent proteins from crystallizing, so that scientists can improve protein crystallization. As a result, X-ray diffraction can then be applied to determine the structures of these proteins.
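The three reported measures follow from confusion-matrix counts, as the minimal sketch below shows. The counts in the example are invented for illustration and are not the thesis data. The second function rests on the assumption that all three reported percentages were computed on the same test set, in which case accuracy is the class-weighted average of sensitivity and specificity.

```python
def classification_metrics(tp, fn, tn, fp):
    """Accuracy, sensitivity, and specificity from confusion-matrix counts.

    tp/fn: crystallizable samples predicted correctly/incorrectly.
    tn/fp: non-crystallizable samples predicted correctly/incorrectly.
    """
    total = tp + fn + tn + fp
    return {
        "accuracy": (tp + tn) / total,
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
    }


def implied_positive_fraction(accuracy, sensitivity, specificity):
    """Solve accuracy = p * sensitivity + (1 - p) * specificity for p."""
    return (accuracy - specificity) / (sensitivity - specificity)


# Hypothetical counts, not taken from the thesis.
m = classification_metrics(tp=90, fn=10, tn=40, fp=10)

# Under the same-test-set assumption, the reported figures would correspond
# to a data set in which roughly 74% of the samples are crystallizable.
p = implied_positive_fraction(93.02, 95.49, 86.08)
```

With a class imbalance of that kind, accuracy alone can look high even when the minority class is poorly predicted, which is why sensitivity and specificity are reported separately.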
    Appears in collections: [Department of Computer Science and Information Engineering] Theses
