Misclassification-Cost Sensitive Gradual Forgetting Bayesian Spam E-Mail Classifier

Please use this permanent URL to cite or link to this item: https://tkuir.lib.tku.edu.tw/dspace/handle/987654321/76358

Title:	Misclassification-Cost Sensitive Gradual Forgetting Bayesian Spam E-Mail Classifier
Original title (Chinese):	考量分類錯誤成本之漸進遺忘貝氏垃圾電子郵件分類器
Author:	周清江
Contributor:	Department of Information Management, Tamkang University
Keywords:	spam classification; concept drift; data skew; machine learning; gradual forgetting Bayesian
Date:	2011-08
Uploaded:	2012-05-07 15:04:43 (UTC+8)
Abstract:	The flood of spam e-mail has never been fully solved and continues to trouble Internet users and enterprises. Academia and the information security industry have proposed a wide variety of anti-spam mechanisms, the most popular of which classify e-mail content with machine learning techniques and then filter it. However, these methods generally assume a static collection of e-mail: they extract a set of feature terms from the training set and then apply document classification techniques to decide whether an incoming message is spam. In practice, spammers guess which feature terms a filter relies on and change their content to evade detection. They also follow currently popular topics to increase the chance that their messages are read. As a result, the concept underlying spam content drifts over time. A classifier that performs well when its model is first built therefore loses accuracy gradually and must be retrained, which costs considerable labor and time. An automatic learning mechanism is needed that adjusts to the concept drift between new and old e-mail.
A second problem in e-mail classification is data skew. Because spam far outnumbers legitimate e-mail, the spam class typically achieves good recall (the proportion of its messages that are classified correctly), while the legitimate class achieves comparatively poor recall. Yet misclassifying a legitimate message costs more than misclassifying a spam message, so when the data are highly skewed a mechanism is needed to maintain the recall of legitimate e-mail.
This study therefore proposes the MCGFB (Misclassification-Cost sensitive Gradual Forgetting Bayesian) algorithm. MCGFB is based on Naïve Bayesian classification and uses DFICF (Document Frequency and Inverse Category Frequency) to extract feature terms, combined with a gradual forgetting mechanism and a misclassification cost assignment framework, to address concept drift and data skew in e-mail classification.
Appears in collections:	[Department and Graduate Institute of Information Management] Research Reports
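
The abstract names the ingredients of MCGFB (a Naïve Bayesian base, gradual forgetting, and misclassification costs) but does not give its formulas. The following is only a rough sketch of how those ingredients could fit together, not the author's algorithm. It assumes exponential decay of all counts as the forgetting rule and a cost-weighted comparison of class scores as the decision rule. The class name, the `decay` parameter, and the two cost parameters are hypothetical, and DFICF feature extraction is left out because the abstract does not define it.

```python
import math
from collections import defaultdict


class ForgettingCostSensitiveNB:
    """Illustrative sketch only; NOT the MCGFB algorithm from the report.

    Assumptions (not from the source): exponential decay as the forgetting
    rule, Laplace smoothing, and a cost-weighted decision rule.
    """

    CLASSES = ("spam", "ham")

    def __init__(self, decay=0.99, cost_false_positive=10.0, cost_false_negative=1.0):
        self.decay = decay                   # 1.0 disables forgetting
        self.cost_fp = cost_false_positive   # cost of flagging legitimate mail as spam
        self.cost_fn = cost_false_negative   # cost of letting spam through
        self.class_weight = {c: 0.0 for c in self.CLASSES}
        self.word_weight = {c: defaultdict(float) for c in self.CLASSES}
        self.total_words = {c: 0.0 for c in self.CLASSES}
        self.vocab = set()

    def update(self, words, label):
        """Fade every stored count, then add the new labeled message."""
        for c in self.CLASSES:
            self.class_weight[c] *= self.decay
            self.total_words[c] *= self.decay
            for w in self.word_weight[c]:
                self.word_weight[c][w] *= self.decay
        self.class_weight[label] += 1.0
        for w in words:
            self.word_weight[label][w] += 1.0
            self.total_words[label] += 1.0
            self.vocab.add(w)

    def log_score(self, words, label):
        """Log of prior times word likelihoods, with Laplace smoothing."""
        total = sum(self.class_weight.values())
        score = math.log((self.class_weight[label] + 1.0) / (total + len(self.CLASSES)))
        denom = self.total_words[label] + len(self.vocab) + 1
        for w in words:
            score += math.log((self.word_weight[label].get(w, 0.0) + 1.0) / denom)
        return score

    def classify(self, words):
        """Return 'spam' only if doing so has the lower expected cost."""
        log_spam = self.log_score(words, "spam")
        log_ham = self.log_score(words, "ham")
        # Predicting ham risks cost_fn * P(spam | x);
        # predicting spam risks cost_fp * P(ham | x).
        if log_spam + math.log(self.cost_fn) > log_ham + math.log(self.cost_fp):
            return "spam"
        return "ham"
```

In this sketch, a larger `cost_false_positive` makes the classifier more reluctant to label a message as spam, which is one way to protect the recall of legitimate e-mail under skewed data. A `decay` below 1.0 lets recent messages outweigh older ones, so the model can follow concept drift without full retraining.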
