Implementation of an Analysis Tool for Class Consistency Check on Document Datasets


Please use this identifier to cite or link to this item: https://tkuir.lib.tku.edu.tw/dspace/handle/987654321/52125

Title:	文件資料集類別一致性分析工具之實作
Other Titles:	Implementation of an analysis tool for class consistency check on document datasets
Authors:	蕭凱元;Hsiao, Kai-yuan
Contributors:	Tamkang University, Master's Program, Department of Information Management;魏世杰
Keywords:	nMRD;FastMap;SOM;Weka;CIRB030;cluster hypothesis;class consistency
Date:	2010
Issue Date:	2010-09-23 16:51:24 (UTC+8)
Abstract:	As information overload grows, finding the desired information in a large body of data has become an important problem. Information retrieval and automatic document classification are the techniques commonly used to help users find it. Evaluating retrieval or classification results usually relies on an answer set, so the class consistency of the answer set itself affects the quality of the evaluation. In addition, when the correctness of the answer set is not in doubt, examining the class inconsistency between the human-assigned answers and the machine results after retrieval or classification often helps diagnose why the machine erred. A tool for analyzing the class consistency of datasets is therefore needed. This work proposes such a tool. It uses two indices to evaluate the class consistency of an answer set automatically. The first is the similarity gap index, which computes the gap between the peaks of the relevant-relevant and relevant-nonrelevant similarity distributions; a larger gap denotes better separation between relevant and nonrelevant documents. The second is the normalized mean reciprocal distance (nMRD) index, which measures the compactness of the relevant documents; a larger nMRD denotes a tighter group of relevant documents. Answer sets with low class consistency identified by these two indices can then be examined manually, either through a 3D FastMap projection or directly by word filtering, to find the culprit documents causing the inconsistency. Finally, the top 10 common words or a self-organizing map (SOM) can be used to summarize the characteristics of the culprit documents. To support a variety of classifiers, the tool is integrated with Weka, a well-known open-source machine learning package, so that the user can explore the class inconsistency between a classification result and the answer set and diagnose the error patterns in the result.
For demonstration, the human-judged answer sets of CIRB030, a standard Chinese news dataset, are used. The documents are first segmented into words and represented as document vectors. Before any machine learning, the answer sets are evaluated by the similarity gap and nMRD indices to identify one with low class consistency, which is then examined with 3D FastMap, viewed from different angles, to locate the outlier documents causing the inconsistency. A high class consistency answer set is also used for a classification test: after training a classifier in Weka, the user can compare the human and machine judgments, analyze the characteristics of the misclassified documents, and look for causes and remedies.
Appears in Collections:	[Department and Graduate Institute of Information Management] Theses and Dissertations
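
Illustrative sketch (not taken from the thesis): the two indices named in the abstract can be approximated in a few lines of Python. Everything below is an assumption made for illustration: cosine similarity as the similarity measure, a 10-bin histogram over [0, 1] to locate distribution peaks, 1 - cosine as the distance, and the function names themselves. The abstract does not state how nMRD is normalized, so the normalization step is left out and only a plain mean reciprocal distance is computed.

```python
import math
from itertools import combinations


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def histogram_peak(values, bins=10):
    """Centre of the most populated bin of a histogram over [0, 1].

    Ties are resolved in favour of the lowest bin (an arbitrary choice).
    """
    counts = [0] * bins
    for x in values:
        index = max(0, min(int(x * bins), bins - 1))
        counts[index] += 1
    return (counts.index(max(counts)) + 0.5) / bins


def similarity_gap(relevant, nonrelevant, bins=10):
    """Peak of relevant-relevant similarities minus peak of relevant-nonrelevant ones.

    Assumed reading of the abstract's "similarity gap"; needs at least two
    relevant documents and one nonrelevant document to be meaningful.
    """
    rel_rel = [cosine(a, b) for a, b in combinations(relevant, 2)]
    rel_non = [cosine(a, b) for a in relevant for b in nonrelevant]
    return histogram_peak(rel_rel, bins) - histogram_peak(rel_non, bins)


def mean_reciprocal_distance(relevant, eps=1e-9):
    """Mean of 1/distance over all pairs of relevant documents.

    Distance is assumed to be 1 - cosine similarity, floored at eps to avoid
    division by zero. The normalization that turns this into nMRD is not
    given in the abstract and is deliberately omitted. Requires at least two
    documents.
    """
    distances = [1.0 - cosine(a, b) for a, b in combinations(relevant, 2)]
    return sum(1.0 / max(d, eps) for d in distances) / len(distances)


# Toy term-weight vectors (hypothetical data, not from CIRB030).
relevant = [[1.0, 0.0, 0.0], [0.9, 0.1, 0.0], [1.0, 0.1, 0.0]]
nonrelevant = [[0.0, 1.0, 0.0], [0.0, 0.9, 0.2]]

gap = similarity_gap(relevant, nonrelevant)
tightness = mean_reciprocal_distance(relevant)
print(f"similarity gap: {gap:.2f}, mean reciprocal distance: {tightness:.1f}")
```

On this toy data the relevant documents are nearly parallel and almost orthogonal to the nonrelevant ones, so both values come out large. An answer set whose relevant documents are scattered would score lower on both, which is the situation the abstract describes as low class consistency.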
