Protein crystallization prediction using support vector machine


Please use this permanent URL to cite or link to this document: https://tkuir.lib.tku.edu.tw/dspace/handle/987654321/35211

Title:	使用支援向量機於蛋白質結晶預測
Alternative title:	Protein crystallization prediction using support vector machine
Author:	王祥銘; Wang, Shiang-ming
Contributor:	Tamkang University, Master's Program, Department of Computer Science and Information Engineering; 許輝煌; Hsu, Hui-huang
Keywords:	Protein Crystallization; Support Vector Machine; Structural Genomics; Protein Structure; Machine Learning
Date:	2008
Uploaded:	2010-01-11 06:10:20 (UTC+8)
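Abstract:	Proteins are the main constituents of living organisms and the main carriers of life's activities. Studying the three-dimensional structure and function of protein molecules greatly helps our understanding of disease and of biopharmaceutical development. At present, apart from predicting structure with statistical learning theory, most protein three-dimensional structures are in practice determined experimentally by X-ray diffraction or nuclear magnetic resonance (NMR). NMR may take several weeks to several months to solve the three-dimensional structure of a single protein; it is time-consuming and costly, and it does not always succeed in resolving the structure. If a solution of the protein can yield crystals, however, scientists can analyze the crystal by X-ray diffraction and solve the protein's three-dimensional structure in only a few hours. Many proteins cannot be crystallized, so predicting whether a protein will crystallize is an important problem in protein structure determination. Starting from the protein's primary structure, that is, its amino acid sequence, we use a support vector machine (SVM), which applies a transformation of the feature space and uses a separating plane to divide proteins into two classes, crystallizable and non-crystallizable. If a protein is predicted to be crystallizable, there is no need to go to the trouble of solving its structure by NMR, and the three-dimensional structure can be obtained more quickly. We also hope to find more properties of the protein itself, chemical or physical, that affect crystallization, encode them from the information the amino acid sequence provides, and thereby further improve prediction accuracy. We then apply feature selection and, based on the prediction accuracy obtained after selection, pick out the features that most strongly affect crystallization; these features can help in screening the external conditions for protein crystallization. With the SVM, our 5-fold cross-validation accuracy is 79.5%; the prediction rate is 80.8% for crystallizable proteins and 78.3% for non-crystallizable proteins. The ultimate goal of this work is to identify the factors that keep a protein from crystallizing and then to find ways to mitigate them, helping scientists crystallize the protein so that its structure can be obtained more quickly by X-ray diffraction.
In structural genomics, proteins are essential materials that define life. A protein's function is strongly related to its structure. The ultimate goal of structural research is to determine the three-dimensional structure of a protein. However, structure determination is often a time-consuming and expensive process. In addition, experimental determination of protein structure has a high failure rate at different stages. There are two prevalent methods for protein structure determination: nuclear magnetic resonance (NMR) spectroscopy and X-ray crystallography. NMR protein structure determination requires weeks of data acquisition, expensive stable isotope labeling, and extensive manual analysis of data. X-ray crystallography, on the other hand, is by far the most successful approach to structure determination, but it has an important condition: the protein target must first be crystallized, and the resulting crystal must then diffract to sufficient resolution. Therefore, prediction of protein crystallization is an essential problem for structural research.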
The Protein Data Bank (PDB) provides detailed protein sequence information. We use information from a protein's primary structure, i.e. the amino acid sequence, as the input to the support vector machine to predict the protein's crystallizability. Several protein features that correlate with protein crystallization are identified first. The support vector machine then generates a hyperplane in the feature space to predict the protein sequence's crystallizability. We also investigated two feature selection methods: the wrapper method and the filter method. The purpose is to remove irrelevant and redundant features and thus reduce the dimensionality of the input data. The resulting feature subset can give the support vector machine higher prediction accuracy. Feature selection can also help us recognize which protein features are more important for protein crystallization, which can help chemists understand the key factors of protein crystallization. An overall prediction accuracy of 79% was achieved on a screened PDB data set with 5-fold cross-validation. The true-positive (crystallizable) rate is 80.8% and the true-negative (non-crystallizable) rate is 78.4%.
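As an illustration only, the following Python sketch shows one way a sequence could be encoded as a feature vector and how the three figures reported in the abstract (overall accuracy, true-positive rate, true-negative rate) are computed from predictions. The record does not list the features the thesis actually used; amino acid composition is an assumption made here, and the names `composition`, `rates`, and `k_fold_indices` are hypothetical helpers, not taken from the thesis.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def composition(sequence):
    """Return the 20-dimensional amino acid composition of a sequence.

    Assumed feature encoding for illustration; the thesis may use others.
    """
    counts = Counter(ch for ch in sequence.upper() if ch in AMINO_ACIDS)
    total = sum(counts.values())
    if total == 0:
        return [0.0] * len(AMINO_ACIDS)
    return [counts[aa] / total for aa in AMINO_ACIDS]


def k_fold_indices(n, k=5):
    """Split indices 0..n-1 into k contiguous folds of near-equal size."""
    folds, start = [], 0
    for i in range(k):
        size = n // k + (1 if i < n % k else 0)
        folds.append(list(range(start, start + size)))
        start += size
    return folds


def rates(y_true, y_pred):
    """Overall accuracy, true-positive rate, and true-negative rate.

    Label 1 = crystallizable, label 0 = non-crystallizable.
    """
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    pos = sum(1 for t in y_true if t == 1)
    neg = len(y_true) - pos
    return {
        "accuracy": (tp + tn) / len(y_true),
        "tpr": tp / pos if pos else float("nan"),
        "tnr": tn / neg if neg else float("nan"),
    }
```

In a full pipeline, the vectors from `composition` would be passed to an SVM implementation (for example scikit-learn's `SVC`, not shown here), trained on four folds and evaluated on the fifth, with `rates` applied to the pooled held-out predictions.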
Appears in collections:	[Department of Computer Science and Information Engineering] Theses and Dissertations


