    Please use this identifier to cite or link to this item: http://tkuir.lib.tku.edu.tw:8080/dspace/handle/987654321/35211


    Title: 使用支援向量機於蛋白質結晶預測
    Other Titles: Protein crystallization prediction using support vector machine
    Authors: 王祥銘;Wang, Shiang-ming
    Contributors: 淡江大學資訊工程學系碩士班 (Master's Program, Department of Computer Science and Information Engineering, Tamkang University)
    許輝煌;Hsu, Hui-huang
    Keywords: Structural Genomics; Protein Crystallization; Support Vector Machine; Protein Structure; Machine Learning
    Date: 2008
    Issue Date: 2010-01-11 06:10:20 (UTC+8)
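    Abstract: Proteins are the main constituents of life and the principal agents of biological activity. Studying the three-dimensional structure and function of protein molecules greatly aids our understanding of disease and the process of biopharmaceutical development. Apart from predicting structure with statistical learning theory from information science, most protein structures are in practice determined experimentally by X-ray diffraction or nuclear magnetic resonance (NMR). NMR may take several weeks to several months to solve the three-dimensional structure of a single protein; it is time-consuming and costly, and it is not guaranteed to resolve the structure. If, however, a solution of the protein yields crystals, scientists can analyze the crystal by X-ray diffraction and solve the three-dimensional structure in only a few hours. Many proteins cannot be crystallized, so predicting whether a protein will crystallize is an important problem in protein structure determination.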
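    Starting from a protein's primary structure, that is, its amino acid sequence, we apply a support vector machine (SVM), which relies on the idea of transforming the feature space, to find a plane that separates the two classes of proteins, crystallizable and non-crystallizable, and thereby classifies them. This predicts whether a protein can be crystallized. If it can, there is no need to go to the trouble of solving its structure with NMR, and the three-dimensional structure can be obtained more quickly.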
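    To make the idea of feeding a primary structure to a classifier concrete, here is a minimal sketch of one possible encoding. The abstract does not list the features actually used, so everything here is an assumption for illustration: the function name `encode_sequence`, the choice of amino acid composition, and the use of the Kyte-Doolittle hydropathy scale as an example physicochemical property.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues, one-letter codes

# Kyte-Doolittle hydropathy values, used here only as an example property.
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def encode_sequence(sequence):
    """Return a 22-dimensional feature vector for one protein sequence.

    Hypothetical encoding: 20 residue frequencies, the mean hydropathy,
    and the sequence length. Non-standard characters are ignored.
    """
    residues = [r for r in sequence.upper() if r in KYTE_DOOLITTLE]
    if not residues:
        raise ValueError("sequence contains no standard amino acids")
    counts = Counter(residues)
    n = len(residues)
    composition = [counts[aa] / n for aa in AMINO_ACIDS]
    mean_hydropathy = sum(KYTE_DOOLITTLE[r] for r in residues) / n
    return composition + [mean_hydropathy, float(n)]


features = encode_sequence("MKTAYIAKQRQISFVKSHFSRQ")
print(len(features))
```

    Fixed-length vectors like this are what allow sequences of different lengths to be placed in a common feature space, where a separating plane can then be sought.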
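    Finally, we aim to identify more of the intrinsic properties of proteins that affect crystallization, whether chemical or physical, and to encode them from the information the amino acid sequence provides, so as to further improve the accuracy of crystallization prediction. We then apply feature selection: based on the prediction accuracy obtained after selection, we pick out the features that most strongly affect crystallization and use them to help screen the external conditions for protein crystallization. The support vector machine achieved an accuracy of 79.5% under 5-fold cross-validation, with a prediction rate of 80.8% for crystallizable proteins and 78.3% for non-crystallizable proteins. The ultimate goal of this work is to identify the factors that prevent proteins from crystallizing and then to find ways to mitigate them, helping scientists crystallize these proteins and obtain structural information more quickly by X-ray diffraction.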
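    Selecting features according to the prediction accuracy they yield can be sketched as a greedy forward search. This is only one common way to do accuracy-driven selection and is not confirmed as the thesis's procedure. The names `forward_select` and `toy_score` are hypothetical, and `toy_score` merely stands in for the cross-validated accuracy of a classifier trained on the chosen feature indices.

```python
def forward_select(n_features, score_fn, min_gain=1e-9):
    """Greedy forward selection driven by a scoring function.

    score_fn(subset) must return a number to maximise, for example the
    cross-validated accuracy obtained with that list of feature indices.
    """
    selected, best_score = [], float("-inf")
    remaining = set(range(n_features))
    while remaining:
        # Try adding each unused feature and keep the best candidate.
        candidate, candidate_score = max(
            ((f, score_fn(selected + [f])) for f in sorted(remaining)),
            key=lambda pair: pair[1],
        )
        if candidate_score - best_score < min_gain:
            break  # no remaining feature improves the score enough
        selected.append(candidate)
        remaining.remove(candidate)
        best_score = candidate_score
    return selected, best_score


def toy_score(subset):
    """Stand-in for a real accuracy estimate (an assumption for the demo).

    Features 1 and 3 are 'informative'; every added feature costs a little.
    """
    useful = {1: 0.2, 3: 0.1}
    return 0.5 + sum(useful.get(f, 0.0) for f in subset) - 0.01 * len(subset)


chosen, score = forward_select(5, toy_score)
print(chosen, round(score, 2))
```

    In a real run, `score_fn` would retrain and re-evaluate the classifier for every candidate subset, which is why this style of selection is much more expensive than ranking features by a statistic computed once.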
    Proteins are essential building blocks of life, and a protein's function is strongly related to its structure. The ultimate goal of structural genomics research is to determine the three-dimensional structure of proteins. However, structure determination is often time-consuming and expensive, and the experimental process has a high failure rate at several stages.
    There are two prevalent methods for protein structure determination: nuclear magnetic resonance (NMR) spectroscopy and X-ray crystallography. NMR structure determination requires weeks of data acquisition, expensive stable-isotope labeling, and extensive manual data analysis. X-ray crystallography, on the other hand, is by far the most successful approach to structure determination, but it has an important precondition: the target protein must first be crystallized, and the resulting crystal must diffract to sufficient resolution. Predicting protein crystallization is therefore an essential problem in structural research.
    The Protein Data Bank (PDB) provides detailed protein sequence information. We use information from a protein's primary structure, i.e., the amino acid sequence, as input to a support vector machine to predict the protein's crystallizability. Several protein features that correlate with crystallization are identified first. The support vector machine then constructs a hyperplane in the feature space to predict the crystallizability of a protein sequence. We also investigated two feature selection methods, the wrapper method and the filter method, to remove irrelevant and redundant features and thus reduce the dimensionality of the input data. The resulting feature subset can give the support vector machine higher prediction accuracy. Feature selection also helps identify which protein features matter most for crystallization, which can help chemists understand its key factors. An overall prediction accuracy of 79% was achieved on a screened PDB data set with 5-fold cross-validation. The true-positive rate (crystallizable) is 80.8% and the true-negative rate (non-crystallizable) is 78.4%.
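    The evaluation protocol described above (a separating hyperplane, 5-fold cross-validation, and separate true-positive and true-negative rates) can be sketched in plain Python. This is an illustration under stated assumptions, not the thesis's implementation: the abstract names neither the SVM solver nor the kernel, so a Pegasos-style linear SVM trained by stochastic subgradient descent is swapped in; the folds are stratified by class; and the two-dimensional data are synthetic. All function names are hypothetical.

```python
import random


def augment(x):
    return list(x) + [1.0]  # constant feature so the bias is learned as a weight


def train_linear_svm(X, y, lam=0.01, epochs=20, seed=0):
    """Pegasos-style subgradient descent on the regularised hinge loss.

    X: list of feature lists, y: labels in {+1, -1}. Returns the weights.
    """
    rng = random.Random(seed)
    rows = [augment(x) for x in X]
    w = [0.0] * len(rows[0])
    order = list(range(len(rows)))
    t = 0
    for _ in range(epochs):
        rng.shuffle(order)
        for i in order:
            t += 1
            eta = 1.0 / (lam * t)  # decaying step size
            margin = y[i] * sum(wj * xj for wj, xj in zip(w, rows[i]))
            w = [(1.0 - eta * lam) * wj for wj in w]  # regularisation shrink
            if margin < 1.0:  # hinge loss is active for this sample
                w = [wj + eta * y[i] * xj for wj, xj in zip(w, rows[i])]
    return w


def decision_value(w, x):
    return sum(wj * xj for wj, xj in zip(w, augment(x)))


def stratified_folds(y, k=5, seed=0):
    """Deal the indices of each class round-robin into k folds."""
    rng = random.Random(seed)
    folds = [[] for _ in range(k)]
    for label in sorted(set(y)):
        idx = [i for i, yi in enumerate(y) if yi == label]
        rng.shuffle(idx)
        for pos, i in enumerate(idx):
            folds[pos % k].append(i)
    return folds


def cross_validate(X, y, k=5):
    """Pool predictions over k held-out folds and report three rates."""
    tp = tn = fp = fn = 0
    for held_out in stratified_folds(y, k):
        test_set = set(held_out)
        train = [i for i in range(len(y)) if i not in test_set]
        w = train_linear_svm([X[i] for i in train], [y[i] for i in train])
        for i in held_out:
            pred = 1 if decision_value(w, X[i]) >= 0 else -1
            if y[i] == 1:
                tp, fn = tp + (pred == 1), fn + (pred == -1)
            else:
                tn, fp = tn + (pred == -1), fp + (pred == 1)
    return {
        "accuracy": (tp + tn) / len(y),
        "true_positive_rate": tp / (tp + fn),
        "true_negative_rate": tn / (tn + fp),
    }


def make_toy_data(n_per_class=100, seed=1):
    """Two well-separated Gaussian clusters; purely synthetic."""
    rng = random.Random(seed)
    X, y = [], []
    for label, centre in ((1, 2.0), (-1, -2.0)):
        for _ in range(n_per_class):
            X.append([rng.gauss(centre, 0.5), rng.gauss(centre, 0.5)])
            y.append(label)
    return X, y


X, y = make_toy_data()
results = cross_validate(X, y, k=5)
print(results)
```

    Reporting the true-positive and true-negative rates alongside overall accuracy, as the abstract does, shows whether a classifier favours one class; the numbers printed here reflect only the toy data and say nothing about the PDB results.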
    Appears in Collections:[Graduate Institute & Department of Computer Science and Information Engineering] Thesis


    All items in the Institutional Repository (機構典藏) are protected by copyright, with all rights reserved.

