    Please use this identifier to cite or link to this item: https://tkuir.lib.tku.edu.tw/dspace/handle/987654321/33880


    Title: 垃圾郵件過濾 : 資料採礦與中文斷詞技術之應用
    Other Titles: Spam filtering : application of data mining and Chinese word segmentation technique
    Authors: 葉采羚;Yeh, Tsai-ling
    Contributors: Tamkang University, Department of Statistics, Master's Program
    陳景祥;Chen, Ching-hsiang
    Keywords: Data mining; Spam; Chinese word segmentation; Probabilistic neural network; C4.5; Gray region
    Date: 2006
    Issue Date: 2010-01-11 04:37:38 (UTC+8)
    Abstract: In countries that have not yet enacted clear legislation on spam e-mail, technical measures are the main way most Internet users protect themselves. Many spam-blocking methods exist and the techniques have improved steadily in recent years, but few achieve 100% blocking. This study proposes a spam-filtering method. A PHP program extracts e-mail features, and a Chinese word segmentation system supplies the frequency, rank, and part of speech of Chinese words. These features are fed to two data mining classifiers, the C4.5 decision tree and the probabilistic neural network (PNN), and a "gray region" mail class is added as a new output category. The classifiers are then compared on Chinese e-mail classification performance and total risk cost.
    The results show that, with the C4.5 decision tree, adding word frequency and word rank percentages as input variables raises the rate at which spam is correctly identified. With the PNN, adding part-of-speech features as input variables raises the rate at which legitimate mail is correctly identified. Adding the "gray region" class as an output category clearly improves spam precision and recall, most of which exceed 98.5%, and clearly lowers the total risk cost.
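The role of the "gray region" output category can be illustrated with a minimal Python sketch. It is not the thesis's implementation: the record does not say how the gray region is defined or how the risk cost is weighted. The sketch assumes that a classifier emits a spam score between 0 and 1, that scores between two hypothetical thresholds (0.3 and 0.7) fall into the gray region, that gray mail is set aside and excluded from spam precision and recall, and that the cost weights are arbitrary illustrative values.

```python
def assign_class(p_spam, lower=0.3, upper=0.7):
    """Map a spam score in [0, 1] to 'spam', 'legitimate', or 'gray'.

    The thresholds are hypothetical; the abstract does not give them.
    """
    if p_spam >= upper:
        return "spam"
    if p_spam <= lower:
        return "legitimate"
    return "gray"


def evaluate(truth, predicted, cost_fp=10.0, cost_fn=1.0, cost_gray=0.5):
    """Spam precision, spam recall, and a weighted misclassification cost.

    Assumption: gray-region mail is excluded from precision and recall
    and charged a small handling cost. All cost weights are illustrative.
    """
    pairs = list(zip(truth, predicted))
    tp = sum(t == "spam" and p == "spam" for t, p in pairs)
    fp = sum(t == "legitimate" and p == "spam" for t, p in pairs)
    fn = sum(t == "spam" and p == "legitimate" for t, p in pairs)
    gray = sum(p == "gray" for _, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    cost = cost_fp * fp + cost_fn * fn + cost_gray * gray
    return {"precision": precision, "recall": recall, "gray": gray, "cost": cost}


# Toy data, invented for illustration only.
scores = [0.95, 0.80, 0.55, 0.10, 0.75, 0.05]
truth = ["spam", "spam", "spam", "legitimate", "legitimate", "legitimate"]
predicted = [assign_class(s) for s in scores]
result = evaluate(truth, predicted)
```

In this toy run the borderline message with score 0.55 goes to the gray region instead of being forced into one of the two classes, so it counts as neither a false positive nor a false negative. That is one plausible reading of why a third output category could raise precision and recall while lowering total cost.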
    Appears in Collections: [Department of Statistics and Graduate Institute] Theses and Dissertations


    All items in the institutional repository are protected by copyright, with all rights reserved.

