Elimination of duplicate sentences in automatic summarization of Chinese documents (運用重複句排除技術於中文文件自動摘要之研究)


Please use this identifier to cite or link to this item: https://tkuir.lib.tku.edu.tw/dspace/handle/987654321/34098

Title:	運用重複句排除技術於中文文件自動摘要之研究
Other Titles:	Elimination of duplicate sentences in automatic summarization of Chinese documents
Authors:	陳姿妤;Chen, Tzu-yu
Contributors:	Master's Program, Department of Information Management, Tamkang University;魏世杰;Wei, Shih-chieh
Keywords:	Automatic Summarization;TFIDF;Similarity Measure;Hownet;Duplicate Sentence Elimination
Date:	2007
Issue Date:	2010-01-11 04:55:08 (UTC+8)
Abstract:	This study addresses extractive summarization of Chinese documents, in which the summary is a set of important sentences taken directly from the source text. Important sentences are usually identified through feature selection that captures a document's central concepts, for example by computing term and sentence weights with TFIDF, or by scoring sentences on indicators such as special keywords, cue words, and sentence position. We assume that authors tend to restate their main topic several times while writing, so sentences with similar meanings easily end up together among the highest-scoring sentences. Under a given summary compression rate, this redundancy crowds out other important sentences. To address this, we compare the similarity between pairs of sentences and filter out sentences in the summary that carry duplicate information. The similarity measure uses Boolean matching of words that occur in both sentences and also takes synonyms into account. For the synonym part we introduce Hownet (知網), a Chinese semantic knowledge base comparable to the English lexical database WordNet, and use its semantic definitions of words to compute synonym similarity. Experiments show that, for extracting important sentences, combining TFIDF-based term weighting with a feature that measures each sentence's similarity to the title sentence improves the average precision of the summary by about 7%. Eliminating duplicate sentences from the summary with Jaccard similarity combined with Hownet synonym matching also improves summary precision.
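The duplicate-sentence filter described in the abstract can be illustrated with a minimal Python sketch. It is not the thesis's implementation. It assumes sentences are already segmented into words and already scored (for example by a TFIDF-based feature), and it replaces Hownet with a small hand-written synonym table. The function names (`make_synonym_matcher`, `jaccard`, `select_summary`), the threshold value, and the rule for counting synonym matches are all hypothetical choices made for this example.

```python
from typing import Callable, List, Sequence

Matcher = Callable[[str, str], bool]


def exact(x: str, y: str) -> bool:
    """Boolean match: two words match only if they are identical."""
    return x == y


def make_synonym_matcher(synonym_groups: Sequence[Sequence[str]]) -> Matcher:
    """Build a matcher from a toy synonym table (a stand-in for Hownet)."""
    group_of = {}
    for gid, group in enumerate(synonym_groups):
        for word in group:
            group_of[word] = gid

    def same(x: str, y: str) -> bool:
        if x == y:
            return True
        gx = group_of.get(x)
        return gx is not None and gx == group_of.get(y)

    return same


def jaccard(a: Sequence[str], b: Sequence[str], same: Matcher = exact) -> float:
    """Jaccard similarity of two word lists; `same` decides what counts as shared."""
    set_a, set_b = set(a), set(b)
    if not set_a or not set_b:
        return 0.0
    # Count matched words on each side and keep the smaller count,
    # so the result stays in [0, 1] even with many-to-one synonyms.
    hit_a = sum(1 for x in set_a if any(same(x, y) for y in set_b))
    hit_b = sum(1 for y in set_b if any(same(x, y) for x in set_a))
    matched = min(hit_a, hit_b)
    return matched / (len(set_a) + len(set_b) - matched)


def select_summary(sentences: Sequence[Sequence[str]], scores: Sequence[float],
                   max_sentences: int, threshold: float,
                   same: Matcher = exact) -> List[int]:
    """Pick top-scoring sentences, skipping any too similar to one already picked."""
    ranked = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)
    chosen: List[int] = []
    for i in ranked:
        if len(chosen) >= max_sentences:
            break
        if all(jaccard(sentences[i], sentences[j], same) < threshold for j in chosen):
            chosen.append(i)
    return sorted(chosen)  # restore document order


# Illustrative data only: three pre-segmented sentences with made-up scores.
sentences = [
    ["summary", "filters", "duplicate", "sentences"],
    ["summary", "removes", "duplicate", "sentences"],
    ["hownet", "defines", "word", "meanings"],
]
scores = [0.9, 0.8, 0.5]
same = make_synonym_matcher([["filters", "removes"]])

print(select_summary(sentences, scores, 2, 0.8))        # → [0, 1]
print(select_summary(sentences, scores, 2, 0.8, same))  # → [0, 2]
```

With exact matching only, the first two sentences share 3 of 5 distinct words (similarity 0.6), so both fit under the 0.8 threshold and fill the two-sentence summary. Once "filters" and "removes" are treated as synonyms, their similarity rises to 1.0, the second sentence is dropped as a duplicate, and the third sentence takes the freed slot. This is the effect the abstract attributes to redundancy under a fixed compression rate.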
Appears in Collections:	[Graduate Institute & Department of Information Management] Thesis
