Design of a Chinese Opinion Mining System (中文意見探勘系統設計)


Please use this identifier to cite or link to this item: https://tkuir.lib.tku.edu.tw/dspace/handle/987654321/87989

Title:	中文意見探勘系統設計
Other Titles:	Design of a Chinese opinion mining system
Authors:	簡立;Chien, Lee
Contributors:	Master's Program, Department of Computer Science and Information Engineering, Tamkang University;蔣璿東;Chiang, Rui-Dong
Keywords:	意見探勘;冷門詞;熱門詞;排除字;詞庫穩定;Opinion Mining;unpopular words;popular words;exclude words;lexicon stable
Date:	2012
Issue Date:	2013-04-13 11:55:34 (UTC+8)
Abstract:	Chinese differs from English in grammatical structure, and Chinese text has no spaces separating words, so finding opinion words with a part-of-speech (POS) tagger or a parser is error-prone. This thesis therefore extracts opinion words with a lexicon and pairs that lexicon with a proposed exclusion-word method to improve extraction precision. Each domain has its own habitual expressions (opinion words and exclusion words), so a general-purpose dictionary can hardly cover every opinion word in a specific domain. We argue, however, that within a specific domain, given enough training data, most out-of-dictionary opinion words and exclusion words can be captured. Those that never appear in the training data are few, their number remains stable, and they are usually unpopular, rarely used terms. We support this claim with experimental data from two different but similar Mobile01 domains: telecommunications and broadband Internet.
Because extraction is lexicon-based, every opinion word and exclusion word already in the dictionary can be extracted, whereas out-of-dictionary ones can be found only through manual annotation, which takes a great deal of time and labor. Building on the stability of newly added out-of-dictionary opinion words and exclusion words, we design a two-stage lexicon training method to reduce that cost. In the first stage, opinion words and exclusion words are extracted from the training data by semi-automatic annotation with human assistance. In the second stage, once the system is online, the dictionary is used directly to extract opinion words and exclusion words from articles, and the extracted words are then checked manually for correctness. The experimental data show that, compared with the first month of training, the second-stage training procedure saves a large amount of manual annotation and checking time while sacrificing very little precision and recall.
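The abstract does not define exclusion words precisely or give the matching procedure, so the following Python sketch is only an illustration under stated assumptions. It assumes an exclusion word is a longer string that contains an opinion word without expressing an opinion (for example, the hypothetical pair "快" and "快遞"). The function names, the example lexicons, and the per-batch count used to watch whether the lexicon stabilizes are all hypothetical and are not taken from the thesis.

```python
def find_spans(text, words):
    """Return sorted (start, end, word) for every occurrence of each word."""
    spans = []
    for word in words:
        if not word:
            continue  # skip empty entries so they never match
        start = text.find(word)
        while start != -1:
            spans.append((start, start + len(word), word))
            start = text.find(word, start + 1)
    return sorted(spans)


def extract_opinion_words(text, opinion_lexicon, exclusion_lexicon):
    """Lexicon matching on unsegmented text.

    Assumption: a match is discarded when it lies entirely inside a
    match of an exclusion word.
    """
    blocked = find_spans(text, exclusion_lexicon)
    kept = []
    for start, end, word in find_spans(text, opinion_lexicon):
        inside_exclusion = any(
            b_start <= start and end <= b_end for b_start, b_end, _ in blocked
        )
        if not inside_exclusion:
            kept.append((start, word))
    return kept


def new_words_per_batch(batches, lexicon):
    """Count manually confirmed words per batch that the lexicon lacked.

    Assumption: a count that falls toward zero over successive batches
    is read as the lexicon becoming stable.
    """
    known = set(lexicon)
    counts = []
    for batch in batches:
        fresh = set(batch) - known
        counts.append(len(fresh))
        known |= fresh
    return counts


if __name__ == "__main__":
    text = "網路很快也很穩定，快遞還沒到"
    # The second "快" is inside "快遞", so only two matches are kept.
    print(extract_opinion_words(text, {"快", "穩定"}, {"快遞"}))
    print(new_words_per_batch([["快", "穩定"], ["快", "慢"], ["慢"]], set()))
```

In this sketch, the first stage would correspond to building the two lexicons from manually confirmed batches, and the second stage to calling `extract_opinion_words` with the finished lexicons and sending the kept matches for manual checking.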
Appears in Collections:	[Department of Computer Science and Information Engineering] Theses and Dissertations

