    Please use this identifier to cite or link to this item: https://tkuir.lib.tku.edu.tw/dspace/handle/987654321/33883


    Title: 中文垃圾郵件客製化過濾系統之研究
    Other Titles: A study of customizable Chinese spam E-mails filtering system
    Authors: 吳泳慶;Wu, Yung-ching
    Contributors: Tamkang University, Master's Program, Department of Statistics (淡江大學統計學系碩士班)
    陳景祥;Chen, Ching-hsiang
    Keywords: Customizable e-mail; C4.5; Decision tree; Probabilistic Neural Network (PNN); TF-IDF; CKIP
    Date: 2007
    Issue Date: 2010-01-11 04:37:47 (UTC+8)
    Abstract: E-mail has become one of the main means of communication in the modern world. With the rapid growth of e-mail advertising, however, mailboxes fill up with unsolicited bulk commercial messages, often without the recipient noticing. In the past, all unsolicited commercial e-mail was simply classified as spam. Yet in a survey conducted by Taiwan ALS from June 28 to July 28, 2006, 27.4% of respondents reported that they had completed a purchase after receiving a commercial e-mail. Some commercial e-mails therefore do give recipients information and help they need, while others are a nuisance and a waste of time. Customizable e-mail classification is accordingly the main topic of this research.
    This thesis uses two machine learning methods, the C4.5 decision tree and the Probabilistic Neural Network (PNN), as the core of an e-mail classification system. Keywords for e-mail classification are usually selected by how frequently they appear, but many keywords chosen this way do not really represent the e-mails of their category. This research therefore applies CKIP Chinese word segmentation and computes TF-IDF weights to extract keywords that actually characterize each e-mail category, and combines them with 14 sending characteristics as the criteria for classifying e-mails.
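The contrast between frequency-based and TF-IDF-based keyword selection can be illustrated with a minimal sketch. It is not the thesis's implementation: the function name `top_keywords`, the toy tokens, and the choice to treat each category as a single document are all assumptions made for illustration, and the tokens are taken as already segmented (no CKIP call is shown).

```python
import math
from collections import Counter


def top_keywords(category_tokens, k=2):
    """Rank tokens per category by TF-IDF.

    Assumption: each category is treated as one document, so the IDF
    term penalizes tokens that occur in many categories.
    """
    n_docs = len(category_tokens)
    # Document frequency: number of categories in which each token occurs.
    df = Counter()
    for tokens in category_tokens.values():
        df.update(set(tokens))

    result = {}
    for name, tokens in category_tokens.items():
        counts = Counter(tokens)
        total = len(tokens)
        scores = {
            tok: (cnt / total) * math.log(n_docs / df[tok])
            for tok, cnt in counts.items()
        }
        # Highest score first; ties broken alphabetically for a stable order.
        ranked = sorted(scores, key=lambda t: (-scores[t], t))
        result[name] = ranked[:k]
    return result


# Hypothetical, already-segmented toy data (a word segmenter would
# normally produce these tokens from the raw e-mail text).
toy = {
    "travel": ["free", "free", "free", "flight", "flight", "hotel"],
    "finance": ["free", "free", "free", "loan", "loan", "fund"],
    "shopping": ["free", "free", "free", "sale", "sale", "coupon"],
}

print(top_keywords(toy))
```

In this toy data "free" is the most frequent token in every category, yet its TF-IDF score is zero because it occurs in all three categories, so it is never selected as a keyword.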
    This research divides commercial e-mails into nine major customizable e-mail categories and comprehensively evaluates five indicators: overall accuracy, normal e-mail precision, normal e-mail detection rate, customizable e-mail precision, and customizable e-mail detection rate. The results show that the approach also performs well in tests on personal everyday e-mail.
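The five indicators can be sketched from predicted and true labels as follows. The abstract does not give their formulas, so this is an illustration under stated assumptions: "detection rate" is read as recall, all customizable categories are collapsed into one class opposed to normal e-mail, and the function name `five_indicators` and the sample labels are hypothetical.

```python
def five_indicators(y_true, y_pred, normal="normal"):
    """Compute overall accuracy plus precision and detection rate per class.

    Assumptions: any label other than `normal` counts as customizable
    e-mail, and "detection rate" is interpreted as recall.
    """
    pairs = [(t == normal, p == normal) for t, p in zip(y_true, y_pred)]
    nn = sum(1 for t, p in pairs if t and p)          # normal, predicted normal
    nc = sum(1 for t, p in pairs if t and not p)      # normal, predicted customizable
    cn = sum(1 for t, p in pairs if not t and p)      # customizable, predicted normal
    cc = sum(1 for t, p in pairs if not t and not p)  # customizable, predicted customizable

    def ratio(num, den):
        # Return NaN instead of raising when a class never occurs.
        return num / den if den else float("nan")

    return {
        "overall_accuracy": ratio(nn + cc, len(pairs)),
        "normal_precision": ratio(nn, nn + cn),
        "normal_detection_rate": ratio(nn, nn + nc),
        "customizable_precision": ratio(cc, cc + nc),
        "customizable_detection_rate": ratio(cc, cc + cn),
    }


# Hypothetical labels for ten e-mails.
y_true = ["normal"] * 4 + ["travel"] * 3 + ["finance"] * 3
y_pred = ["normal", "normal", "normal", "travel",
          "travel", "travel", "normal",
          "finance", "travel", "finance"]

for name, value in five_indicators(y_true, y_pred).items():
    print(f"{name}: {value:.3f}")
```

Because the categories are collapsed, a finance e-mail predicted as travel still counts as a correctly detected customizable e-mail here; a per-category evaluation would need a full confusion matrix.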
    Appears in Collections: [Graduate Institute & Department of Statistics] Thesis


    All items in the institutional repository (機構典藏) are protected by copyright, with all rights reserved.

