The Effects of Long Sentence Segmentation and Genetic Algorithm on News Classification (長句斷詞法和遺傳演算法對新聞分類的影響)

Please use this identifier to cite or link to this item: https://tkuir.lib.tku.edu.tw/dspace/handle/987654321/87938

Title:	長句斷詞法和遺傳演算法對新聞分類的影響
Other Titles:	The effects of long sentence segmentation and genetic algorithm on news classification
Authors:	許桓瑜;Hsu, Huang-Yu
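Contributors:	Master's Program, Department of Computer Science and Information Engineering, Tamkang University (淡江大學資訊工程學系碩士班); 蔡憶佳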
Keywords:	中文斷詞;長詞優先法;遺傳演算法;單純貝氏分類法;Chinese Word Segmentation;maximum matching;Genetic Algorithm;Naive Bayesian classification
Date:	2012
Issue Date:	2013-04-13 11:53:11 (UTC+8)
Abstract:	The rapid growth of the Internet has made it an important part of many people's lives. Its convenience, immediacy, and global reach let users receive news quickly, and many news websites have been established as a result, but these sites share no consistent classification scheme. Helping users find the news they want efficiently is therefore an open problem. Good news classification in turn depends on Chinese word segmentation, so accurate segmentation is an important issue in its own right.
This thesis addresses two systems. The first is a Chinese word segmentation system: we build a lexicon from training articles, then combine the lexicon with maximum matching (longest word first) and a genetic algorithm, splitting the long sentences of test news articles into short sentences and comparing candidate results to find the best segmentation of the full text. The second is an article classification system: for each segmented news article, we compare its words against the lexicon to find the categories in which each word frequently appears, then use Naive Bayesian classification to determine the article's most likely category. We also examine whether detecting new words helps segmentation and classification: the A-priori and adjacent characters algorithm identifies unknown or new words, these are added to the lexicon, and the expanded lexicon is used to segment and classify subsequent news articles.
The experiments show that splitting long sentences into short sentences before applying the genetic algorithm raises segmentation precision and recall by 1-2% over the group without sentence splitting. With the genetic algorithm, precision and recall both reach about 80% whether or not sentences are split first. Given segmentation of this accuracy, Naive Bayesian classification achieves a news classification accuracy as high as 95%. Finally, we propose as future work that adding new words to the lexicon may further improve the accuracy of both segmentation and classification.
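The maximum matching step and the precision/recall measurements mentioned in the abstract can be illustrated with a minimal sketch. This is not the thesis's implementation: the function names, the toy lexicon, and the span-based scoring are assumptions made for illustration, and the genetic algorithm and sentence-splitting steps are not reproduced here.

```python
def forward_maximum_matching(sentence, lexicon, max_word_len=None):
    """Greedy longest-word-first segmentation against a lexicon (illustrative)."""
    if max_word_len is None:
        max_word_len = max((len(w) for w in lexicon), default=1)
    words = []
    i = 0
    while i < len(sentence):
        # Try the longest candidate first and shrink until the lexicon matches.
        # A single character is always accepted so the scan cannot stall.
        for size in range(min(max_word_len, len(sentence) - i), 0, -1):
            candidate = sentence[i:i + size]
            if size == 1 or candidate in lexicon:
                words.append(candidate)
                i += size
                break
    return words


def word_spans(words):
    """Convert a word list into a set of (start, end) character spans."""
    spans, start = set(), 0
    for w in words:
        spans.add((start, start + len(w)))
        start += len(w)
    return spans


def precision_recall(predicted, reference):
    """Word-level precision and recall, counting exact span matches as correct."""
    pred, ref = word_spans(predicted), word_spans(reference)
    correct = len(pred & ref)
    precision = correct / len(pred) if pred else 0.0
    recall = correct / len(ref) if ref else 0.0
    return precision, recall


# Hypothetical toy lexicon and hand-made reference segmentation.
lexicon = {"新聞", "分類", "新聞分類", "系統", "中文", "斷詞"}
segmented = forward_maximum_matching("中文新聞分類系統", lexicon)
reference = ["中文", "新聞", "分類", "系統"]
precision, recall = precision_recall(segmented, reference)
```

In this toy case the greedy matcher prefers the longer entry 新聞分類 over 新聞 followed by 分類. Ambiguities of this kind are why a purely greedy pass may be combined with a search procedure such as the genetic algorithm named in the abstract.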
Appears in Collections:	[Department and Graduate Institute of Computer Science and Information Engineering] Theses and Dissertations
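As an illustration of the classification step summarized in the abstract, the following is a minimal sketch of a multinomial Naive Bayes classifier over segmented words. The add-one smoothing, the function names, and the tiny training set are assumptions for illustration only; the record does not say which variant or smoothing the thesis uses.

```python
import math
from collections import Counter, defaultdict


def train_naive_bayes(labeled_docs):
    """Count documents and words per category from (category, words) pairs."""
    doc_counts = Counter()
    word_counts = defaultdict(Counter)
    vocabulary = set()
    for category, words in labeled_docs:
        doc_counts[category] += 1
        word_counts[category].update(words)
        vocabulary.update(words)
    return doc_counts, word_counts, vocabulary


def classify(words, model):
    """Return the category with the highest log posterior score."""
    doc_counts, word_counts, vocabulary = model
    total_docs = sum(doc_counts.values())
    best_category, best_score = None, -math.inf
    for category in doc_counts:
        # Log prior: share of training documents in this category.
        score = math.log(doc_counts[category] / total_docs)
        total_words = sum(word_counts[category].values())
        for w in words:
            # Add-one (Laplace) smoothing, an assumption of this sketch.
            likelihood = (word_counts[category][w] + 1) / (total_words + len(vocabulary))
            score += math.log(likelihood)
        if score > best_score:
            best_category, best_score = category, score
    return best_category


# Hypothetical segmented training articles.
training = [
    ("sports", ["球隊", "比賽", "冠軍"]),
    ("sports", ["比賽", "球員"]),
    ("finance", ["股市", "銀行", "利率"]),
    ("finance", ["銀行", "股市"]),
]
model = train_naive_bayes(training)
sports_label = classify(["比賽", "球隊"], model)
finance_label = classify(["銀行", "利率"], model)
```

Because the classifier only sees the words produced by segmentation, a wrong word boundary changes the counts it works with, which is consistent with the abstract's point that classification quality depends on segmentation quality.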