An anti-spam algorithm for handling concept drift (一個處理概念漂移的垃圾郵件分類演算法)

Please use this identifier to cite or link to this item: https://tkuir.lib.tku.edu.tw/dspace/handle/987654321/34086

Title:	一個處理概念漂移的垃圾郵件分類演算法
Other Titles:	An anti-spam algorithm for handling concept drift
Authors:	陳昱辰;Chen, Yu-chen
Contributors:	Tamkang University, Department of Information Management, Master's Program;周清江;Jou, Chichang
Keywords:	郵件分類;概念漂移;資料偏斜;e-mail categorization;concept drift;data skewedness
Date:	2009
Issue Date:	2010-01-11 04:54:15 (UTC+8)
Abstract:	Spam remains an unsolved problem despite the many filtering mechanisms that have been proposed. Content-based filtering built on machine learning is the most widely used of these, but most such methods assume that the data come from a static environment. In practice, email content changes over time as the underlying concepts drift: a classifier performs well when its model is first built, but its accuracy gradually declines as time passes and the concepts drift. A mechanism is therefore needed that learns from, and adjusts to, both newly arrived and older emails in the dataset. A second problem in email classification is data skew. Because spam is so prevalent, spam messages usually far outnumber legitimate ones; as a result, the spam (majority) class achieves high recall while the legitimate (minority) class has comparatively poor recall. To address both problems, this study proposes the IFWB (Incremental Forgetting Weighted Bayesian) algorithm. IFWB builds on naive Bayes classification, uses IGICF (Information Gain and Inverse Class Frequency) to extract keywords, and combines a gradual forgetting mechanism with a classification-cost framework to handle concept drift and data skew. The effectiveness of IFWB is demonstrated through a series of experiments.
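The abstract names the ingredients of IFWB but not its formulas, so the following is only a minimal sketch of the general idea and not the thesis's algorithm. It assumes a multinomial naive Bayes model, an exponential decay of old counts as the forgetting mechanism, and a single cost ratio for the cost-sensitive decision. The class name `ForgettingNaiveBayes` and the parameters `decay`, `smoothing`, and `fp_cost` are hypothetical, and IGICF keyword extraction is not modelled at all.

```python
import math
from collections import defaultdict


class ForgettingNaiveBayes:
    """Multinomial naive Bayes whose evidence decays as new mail arrives.

    Illustrative only: exponential decay and a single cost ratio are
    assumptions, not the formulas used in the thesis.
    """

    def __init__(self, decay=0.99, smoothing=1.0):
        self.decay = decay          # assumed forgetting factor in (0, 1]
        self.smoothing = smoothing  # Laplace smoothing constant
        self.class_weight = defaultdict(float)  # decayed message count per class
        self.token_weight = defaultdict(lambda: defaultdict(float))
        self.total_tokens = defaultdict(float)  # decayed token count per class
        self.vocab = set()

    def learn(self, tokens, label):
        # Gradual forgetting: shrink all old evidence before adding the new message.
        for c in list(self.class_weight):
            self.class_weight[c] *= self.decay
            self.total_tokens[c] *= self.decay
            for t in self.token_weight[c]:
                self.token_weight[c][t] *= self.decay
        self.class_weight[label] += 1.0
        for t in tokens:
            self.token_weight[label][t] += 1.0
            self.total_tokens[label] += 1.0
            self.vocab.add(t)

    def log_scores(self, tokens):
        # Unnormalised log posterior for every class seen so far.
        total = sum(self.class_weight.values())
        v = max(len(self.vocab), 1)
        scores = {}
        for c, w in self.class_weight.items():
            s = math.log(w / total)
            for t in tokens:
                num = self.token_weight[c].get(t, 0.0) + self.smoothing
                s += math.log(num / (self.total_tokens[c] + self.smoothing * v))
            scores[c] = s
        return scores

    def classify(self, tokens, fp_cost=1.0):
        # Cost-sensitive rule: predict "spam" only when spam is more than
        # fp_cost times as likely as "ham". Both labels must have been learned.
        s = self.log_scores(tokens)
        return "spam" if s["spam"] - s["ham"] > math.log(fp_cost) else "ham"
```

Setting `decay=1.0` recovers an ordinary incremental naive Bayes classifier, which makes it easy to compare behaviour with and without forgetting on the same message stream. Raising `fp_cost` above 1 makes the classifier more reluctant to label a message as spam, which is one simple way to protect recall on the minority (legitimate) class.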
Appears in Collections:	[Department and Graduate Institute of Information Management] Theses and Dissertations

