    Please use this identifier to cite or link to this item: https://tkuir.lib.tku.edu.tw/dspace/handle/987654321/34086


    Title: 一個處理概念漂移的垃圾郵件分類演算法
    Other Titles: An anti-spam algorithm for handling concept drift
    Authors: 陳昱辰;Chen, Yu-chen
    Contributors: Tamkang University, Master's Program, Department of Information Management
    周清江;Jou, Chichang
    Keywords: e-mail categorization;concept drift;data skewedness
    Date: 2009
    Issue Date: 2010-01-11 04:54:15 (UTC+8)
    Abstract: Spam remains an unsolved problem despite the many anti-spam techniques that have been proposed. Content-based filtering built on machine learning is the most popular of these, but most such methods assume a static environment. In practice, email content changes as the underlying concepts drift: a classifier usually performs well when its model is first built, but its accuracy gradually declines over time. A mechanism is therefore needed to keep learning from, and adjusting to, both the newly arriving and the older emails in the dataset. A second problem in email categorization is data skew. Because spam is so prevalent, spam messages usually far outnumber legitimate ones, so the majority (spam) class achieves high recall while the minority (legitimate) class has comparatively poor recall. To address both problems, we propose IFWB (Incremental Forgetting Weighted Bayesian), an algorithm based on naive Bayesian classification that uses IGICF (Information Gain and Inverse Class Frequency) to extract keywords and combines a gradual forgetting mechanism with a cost-sensitive classification framework to handle concept drift and data skew. Finally, we demonstrate the effectiveness of the IFWB algorithm through a series of experiments.
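    The abstract names the ingredients of IFWB but this record does not give its formulas, so the following is only a minimal sketch of the general idea, not the thesis's algorithm. It assumes that gradual forgetting can be approximated by exponentially decaying all stored counts before each update, and that cost sensitivity can be approximated by weighting each class posterior by a misclassification cost. The class name `ForgettingNaiveBayes` and the parameters `decay`, `alpha`, `cost_false_spam`, and `cost_missed_spam` are hypothetical, and IGICF keyword selection is left out.

```python
import math
from collections import defaultdict


class ForgettingNaiveBayes:
    """Multinomial naive Bayes with decaying counts (illustrative sketch only)."""

    def __init__(self, decay=0.99, alpha=1.0,
                 cost_false_spam=5.0, cost_missed_spam=1.0):
        self.decay = decay    # assumed forgetting factor, 1.0 means never forget
        self.alpha = alpha    # Laplace smoothing constant
        # cost[c] is the assumed cost of misclassifying a message of true class c
        self.cost = {"ham": cost_false_spam, "spam": cost_missed_spam}
        self.class_weight = {"spam": 0.0, "ham": 0.0}
        self.word_weight = {"spam": defaultdict(float), "ham": defaultdict(float)}
        self.total_words = {"spam": 0.0, "ham": 0.0}
        self.vocab = set()

    def update(self, tokens, label):
        """Learn one labelled message, fading older evidence first."""
        for c in self.class_weight:
            self.class_weight[c] *= self.decay
            self.total_words[c] *= self.decay
            for w in self.word_weight[c]:
                self.word_weight[c][w] *= self.decay
        self.class_weight[label] += 1.0
        for w in tokens:
            self.word_weight[label][w] += 1.0
            self.total_words[label] += 1.0
            self.vocab.add(w)

    def log_joint(self, tokens):
        """Return the smoothed log P(class) + sum of log P(word | class)."""
        total = sum(self.class_weight.values())
        vocab_size = max(len(self.vocab), 1)
        scores = {}
        for c in self.class_weight:
            score = math.log((self.class_weight[c] + self.alpha)
                             / (total + 2 * self.alpha))
            denom = self.total_words[c] + self.alpha * vocab_size
            for w in tokens:
                count = self.word_weight[c].get(w, 0.0)
                score += math.log((count + self.alpha) / denom)
            scores[c] = score
        return scores

    def classify(self, tokens):
        """Pick the class whose cost-weighted posterior is larger."""
        scores = self.log_joint(tokens)
        spam = scores["spam"] + math.log(self.cost["spam"])
        ham = scores["ham"] + math.log(self.cost["ham"])
        return "spam" if spam > ham else "ham"
```

    In this sketch, setting `cost_false_spam` above `cost_missed_spam` biases decisions toward the legitimate class, which is one simple way to protect minority-class recall, and a `decay` below 1.0 lets recent messages outweigh older ones when the meaning of a word shifts.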
    Appears in Collections:[Graduate Institute & Department of Information Management] Thesis

    All items in the institutional repository (機構典藏) are protected by copyright, with all rights reserved.