    Please use this identifier to cite or link to this item: http://tkuir.lib.tku.edu.tw:8080/dspace/handle/987654321/76358


    Title: 考量分類錯誤成本之漸進遺忘貝氏垃圾電子郵件分類器
    Other Titles: Misclassification-Cost Sensitive Gradual Forgetting Bayesian Spam E-Mail Classifier
    Authors: 周清江
    Contributors: Department of Information Management, Tamkang University (淡江大學資訊管理學系)
    Keywords: spam classification;concept drift;data skew;machine learning;gradual forgetting Bayesian
    Date: 2011-08
    Issue Date: 2012-05-07 15:04:43 (UTC+8)
    Abstract: Spam email remains an unsolved problem that continues to trouble internet users and enterprises. Academia and the information security industry have proposed a wide range of anti-spam mechanisms, of which the most popular classify email content with machine learning techniques and then filter it. Most of these methods, however, assume a static collection of email: they extract a set of feature terms from the training data and then apply document classification to decide whether an incoming message is spam. In practice, spammers guess which feature terms a filter relies on and change their content to evade detection. They also follow currently popular topics to increase the chance that their messages are read. As a result, the concept underlying spam content drifts over time. A classifier that performs well when its model is first built gradually loses accuracy and must be retrained, which costs considerable labor and time. An automatic learning mechanism is therefore needed that adjusts for concept drift between new and old email. A second problem in email classification is data skew. Because spam far outnumbers legitimate email, the spam class typically achieves good recall, while the legitimate class, as the minority, achieves relatively poor recall. Yet misclassifying a legitimate email costs more than misclassifying a spam email, so a mechanism is needed to maintain legitimate-email recall when the data are highly skewed. This study proposes the MCGFB (Misclassification-Cost sensitive Gradual Forgetting Bayesian) algorithm to address both problems. MCGFB is based on naïve Bayesian classification and combines DFICF (Document Frequency and Inverse Class Frequency, also given as Inverse Category Frequency) feature extraction, a gradual forgetting mechanism, and a misclassification cost assignment framework.
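The abstract names the ingredients of MCGFB but gives no formulas, so the following is only a minimal sketch of how two of them could fit together, not the author's algorithm. It assumes gradual forgetting is an exponential decay applied to all stored counts before each new email is learned, and that misclassification cost is handled by a minimum-expected-cost decision rule. The class name `ForgettingCostSensitiveNB` and the parameters `decay`, `cost_fp`, `cost_fn`, and `alpha` are hypothetical. DFICF feature extraction is left out because the abstract does not define it.

```python
import math
from collections import defaultdict


class ForgettingCostSensitiveNB:
    """Illustrative naive Bayes with decayed counts and a cost-based decision.

    Everything here is an assumption made for illustration; the record
    above does not specify the actual MCGFB update or decision rules.
    """

    def __init__(self, decay=0.99, cost_fp=9.0, cost_fn=1.0, alpha=1.0):
        self.decay = decay      # hypothetical forgetting factor in (0, 1]
        self.cost_fp = cost_fp  # assumed cost of flagging legitimate mail as spam
        self.cost_fn = cost_fn  # assumed cost of letting spam through
        self.alpha = alpha      # Laplace smoothing constant
        self.class_weight = {"spam": 0.0, "ham": 0.0}
        self.word_weight = {"spam": defaultdict(float), "ham": defaultdict(float)}
        self.total_words = {"spam": 0.0, "ham": 0.0}
        self.vocab = set()

    def update(self, tokens, label):
        # Gradual forgetting: fade all old evidence before adding the new email.
        for c in self.class_weight:
            self.class_weight[c] *= self.decay
            self.total_words[c] *= self.decay
            for w in self.word_weight[c]:
                self.word_weight[c][w] *= self.decay
        self.class_weight[label] += 1.0
        for w in tokens:
            self.word_weight[label][w] += 1.0
            self.total_words[label] += 1.0
            self.vocab.add(w)

    def _log_joint(self, tokens, c):
        # log P(c) + sum of log P(w | c), both smoothed.
        prior = (self.class_weight[c] + self.alpha) / (
            sum(self.class_weight.values()) + 2 * self.alpha
        )
        denom = self.total_words[c] + self.alpha * max(len(self.vocab), 1)
        score = math.log(prior)
        for w in tokens:
            score += math.log((self.word_weight[c].get(w, 0.0) + self.alpha) / denom)
        return score

    def spam_probability(self, tokens):
        ls = self._log_joint(tokens, "spam")
        lh = self._log_joint(tokens, "ham")
        m = max(ls, lh)  # subtract the max for numerical stability
        es, eh = math.exp(ls - m), math.exp(lh - m)
        return es / (es + eh)

    def predict(self, tokens):
        p = self.spam_probability(tokens)
        # Flag as spam only when the expected cost of missing spam exceeds
        # the expected cost of blocking a legitimate email.
        return "spam" if self.cost_fn * p > self.cost_fp * (1.0 - p) else "ham"
```

In this sketch, a `decay` below 1 lets recent emails outweigh older ones, so a term that migrates from spam to legitimate mail is eventually reclassified. Raising `cost_fp` relative to `cost_fn` raises the probability a message must reach before it is flagged, which trades spam recall for legitimate-email recall.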
    Appears in Collections: [Department and Graduate Institute of Information Management (資訊管理學系暨研究所)] Research Reports (研究報告)

    Files in This Item:

    There are no files associated with this item.
