    Please use this permanent URL to cite or link to this document: https://tkuir.lib.tku.edu.tw/dspace/handle/987654321/33883


    Title: 中文垃圾郵件客製化過濾系統之研究
    Other title: A study of customizable Chinese spam E-mails filtering system
    Author: Wu, Yung-ching (吳泳慶)
    Contributors: Master's Program, Department of Statistics, Tamkang University
    Chen, Ching-hsiang (陳景祥)
    Keywords: Customizable e-mail; C4.5; Decision tree; Probabilistic Neural Network (PNN); TF-IDF; CKIP
    Date: 2007
    Upload time: 2010-01-11 04:37:47 (UTC+8)
    Abstract: E-mail has become one of the main means of communication in modern life. With the rapid growth of e-mail advertising, however, inboxes fill up with unsolicited bulk commercial messages almost without the recipient noticing. In the past, all unsolicited commercial e-mail was classified as spam. Yet a survey conducted by Taiwan ALS from June 28 to July 28, 2006 found that 27.4% of respondents had completed a purchase after receiving a commercial e-mail. Some commercial e-mails therefore do provide recipients with useful information and assistance, while others are merely annoying and waste time. Customizable e-mail classification is accordingly the main topic of this research.
    This thesis builds an e-mail classification system around two machine learning methods: the C4.5 decision tree and the Probabilistic Neural Network (PNN). E-mail classifiers usually select keywords by raw frequency, but many keywords chosen that way do not really represent the category of mail in which they appear. This research therefore applies CKIP Chinese word segmentation and computes TF-IDF weights to extract keywords that genuinely characterize each e-mail category, and combines them with 14 sending characteristics as the criteria for classifying e-mails.
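The contrast between frequency-based and TF-IDF-based keyword selection can be illustrated with a minimal sketch. It is not the thesis's implementation: the function name `tfidf_keywords`, the toy tokens, and the choice to treat each category as a single document are all assumptions made for illustration, and the English tokens stand in for words that CKIP segmentation would produce.

```python
import math
from collections import Counter


def tfidf_keywords(docs_by_category, top_k=2):
    """Return the top_k TF-IDF terms per category.

    docs_by_category maps a category name to a list of already segmented
    tokens (hypothetical input; each category is treated as one document).
    """
    n_docs = len(docs_by_category)
    # Document frequency: number of categories in which each term occurs.
    df = Counter()
    for tokens in docs_by_category.values():
        df.update(set(tokens))

    keywords = {}
    for category, tokens in docs_by_category.items():
        tf = Counter(tokens)
        total = len(tokens)
        # TF-IDF = relative term frequency * log(N / document frequency).
        scores = {t: (c / total) * math.log(n_docs / df[t]) for t, c in tf.items()}
        # Sort by descending score, breaking ties alphabetically.
        ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
        keywords[category] = [t for t, _ in ranked[:top_k]]
    return keywords


# Toy corpus (illustrative only, not data from the thesis).
corpus = {
    "travel": ["free", "free", "free", "flight", "flight", "hotel"],
    "finance": ["free", "free", "free", "loan", "loan", "rate"],
    "software": ["free", "free", "free", "software", "software", "download"],
}

most_frequent = {c: Counter(t).most_common(1)[0][0] for c, t in corpus.items()}
keywords = tfidf_keywords(corpus, top_k=2)
print(most_frequent)
print(keywords)
```

In this toy corpus the most frequent token in every category is "free", which says nothing about the category. Because it occurs in all three categories its IDF is zero, so TF-IDF ranks category-specific terms such as "flight" or "loan" first.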
    This research divides commercial e-mails into nine major customizable e-mail categories and evaluates the system on five indicators: overall accuracy, precision for normal e-mail, recall (detection rate) for normal e-mail, precision for customizable e-mail, and recall (detection rate) for customizable e-mail. The results show that the approach also performs reasonably well in tests on personal everyday e-mail.
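As a rough sketch of how five such indicators can be computed, the example below collapses the nine customizable categories into a single "customizable" label, which is a simplifying assumption. The function name `evaluate`, the label strings, and the sample labels are hypothetical and are not results from the thesis.

```python
def evaluate(y_true, y_pred, labels=("normal", "customizable")):
    """Compute overall accuracy plus per-label precision and recall."""
    pairs = list(zip(y_true, y_pred))
    report = {"accuracy": sum(t == p for t, p in pairs) / len(pairs)}
    for label in labels:
        true_pos = sum(t == label and p == label for t, p in pairs)
        predicted = sum(p == label for _, p in pairs)  # messages labelled `label` by the classifier
        actual = sum(t == label for t, _ in pairs)     # messages that truly are `label`
        # Precision: share of predicted `label` messages that are correct.
        report[f"{label}_precision"] = true_pos / predicted if predicted else 0.0
        # Recall (detection rate): share of true `label` messages that were found.
        report[f"{label}_recall"] = true_pos / actual if actual else 0.0
    return report


# Illustrative labels only: five normal and three customizable messages.
y_true = ["normal"] * 5 + ["customizable"] * 3
y_pred = ["normal"] * 5 + ["customizable", "normal", "normal"]

report = evaluate(y_true, y_pred)
for name, value in report.items():
    print(f"{name}: {value:.3f}")
```

Reporting precision and recall separately for each class matters here because the two kinds of error have different costs: a wanted commercial e-mail filed as normal mail is a mild nuisance, while the per-class figures show whether the classifier is systematically missing one class.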
    Appears in collections: [Department of Statistics and Graduate Institute] Theses and dissertations


    All items in the institutional repository are protected by their original copyright.
