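Using the notion of keywords, we can derive appropriate classification rules, namely class keywords, from a set of labeled documents, and then use those rules to classify unlabeled documents. Training for document classification starts from sample documents: we compute how the feature terms occur and how they are distributed across the samples, and decide statistically whether each feature term is representative of a class; if it is, the term is adopted as a classification rule. However, the feature terms of a document are often correlated with one another, and a document may also carry a large amount of noise. To handle this correlation effectively and filter out unnecessary noise, this thesis proposes a decision tree method to address the correlation between terms, combined with local features that weaken unimportant keywords so that the important ones stand out. The results of this thesis show that, with a small number of samples, combining a decision tree with two-phase features also performs well in terms of precision and recall for document classification.
Using feature keywords, we can derive appropriate classification rules from a set of labeled documents and then use those rules to classify documents that have not been labeled. In this paper, we discuss how to choose training data as a basis, how to compute the weight of each keyword, how to judge the importance of keywords from their distribution, and how to handle correlation among keywords.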
We aim to handle keyword correlation efficiently and to filter out noise. First, we use a decision tree to address the correlation problem, because it can disregard the relations between words. Second, we use two-phase local features to reduce the amount of noise. The results in Chapter 4 show that this approach is more effective than before.
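The keyword-rule idea described above can be illustrated with a minimal Python sketch. It is not the decision tree method of this thesis. It only shows how class keywords might be picked from labeled documents by their distribution and then used to classify an unlabeled document. The function names, the thresholds `min_share` and `min_docs`, and the toy data are all assumptions made for illustration.

```python
from collections import Counter, defaultdict


def learn_class_keywords(labeled_docs, min_share=0.8, min_docs=2):
    """Return {term: label} for terms concentrated in a single class.

    labeled_docs is a list of (tokens, label) pairs. A term is kept as a
    class keyword when it occurs in at least `min_docs` documents and at
    least `min_share` of those documents carry the same label. Both
    thresholds are hypothetical values, not taken from the thesis.
    """
    doc_freq = defaultdict(Counter)  # term -> number of documents per label
    for tokens, label in labeled_docs:
        for term in set(tokens):  # count each term once per document
            doc_freq[term][label] += 1

    keywords = {}
    for term, by_label in doc_freq.items():
        total = sum(by_label.values())
        label, count = by_label.most_common(1)[0]
        if total >= min_docs and count / total >= min_share:
            keywords[term] = label
    return keywords


def classify(tokens, keywords):
    """Label a document by majority vote of the class keywords it contains."""
    votes = Counter(keywords[t] for t in set(tokens) if t in keywords)
    return votes.most_common(1)[0][0] if votes else None


# Toy training set (invented for this example).
train = [
    (["stock", "market", "price"], "finance"),
    (["market", "bank", "price"], "finance"),
    (["match", "team", "score"], "sports"),
    (["team", "score", "price"], "sports"),
]

keywords = learn_class_keywords(train)
# "price" appears in both classes, so it is treated as noise and dropped.
print(keywords)
print(classify(["price", "market", "today"], keywords))  # finance
```

Each keyword votes independently here, so correlated keywords are counted more than once. That is the kind of problem the abstract says the decision tree is meant to address.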