Tamkang University Institutional Repository: Item 987654321/114703
    Please use this identifier to cite or link to this item: https://tkuir.lib.tku.edu.tw/dspace/handle/987654321/114703


    Title: 使用文字探勘實作新聞事件追蹤
    Other Titles: News event tracking using text mining
    Authors: 郝志揚;Hau, Jr-Yang
    Contributors: Master's Program, Department of Computer Science and Information Engineering, Tamkang University
    蔡憶佳
    Keywords: text mining; web crawler; Jieba segmentation; cluster analysis
    Date: 2017
    Issue Date: 2018-08-03 15:01:17 (UTC+8)
    Abstract: A large volume of text is now available online, for example on news websites, PTT, and Facebook. Because this data is voluminous and unorganized, text mining can be used to extract the useful information so that readers can efficiently grasp what the text conveys.
    This thesis uses the R language to build a news event tracking system. A web crawler collects news articles, and the collected articles are then cleaned. The text is segmented into Chinese words with jiebaR, a term-frequency matrix is built from the segmentation results, and keywords are identified by computing TF-IDF. Finally, the similarity between articles is analyzed using the keywords extracted from each article, which yields a system for tracking similar articles.
    The thesis collected 1,500 news articles and, following the steps above, computed the cosine distance between articles to analyze their similarity. Ward's method, which keeps the total variation within each cluster small and the total variation between clusters large, was applied to determine the best number of clusters. The experimental results show that, after the 1,500 articles pass through these text-mining steps, similar news can be retrieved through an article query function, which achieves news event tracking.
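The keyword-weighting and similarity steps described in the abstract can be illustrated with a minimal sketch. This is not the thesis code: the thesis used R and jiebaR, while the sketch below is plain Python and assumes the articles have already been segmented into tokens. The function names (`tf_idf_vectors`, `cosine_similarity`, `most_similar`), the toy token lists, and the specific TF-IDF variant (term frequency normalized by article length, IDF as the natural log of N/df) are all assumptions made for illustration, since the abstract does not specify them.

```python
import math
from collections import Counter


def tf_idf_vectors(docs):
    """Build one sparse TF-IDF vector (dict: term -> weight) per tokenized article.

    Assumed variant: tf = count / article length, idf = ln(N / document frequency).
    """
    n_docs = len(docs)
    doc_freq = Counter()
    for tokens in docs:
        doc_freq.update(set(tokens))  # count each term once per article

    vectors = []
    for tokens in docs:
        counts = Counter(tokens)
        total = len(tokens)
        vectors.append({
            term: (count / total) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return vectors


def cosine_similarity(a, b):
    """Cosine similarity of two sparse vectors; cosine distance is 1 minus this value."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def most_similar(query_index, vectors, top_n=3):
    """Return (article index, similarity) pairs for the articles closest to the query."""
    scored = [
        (i, cosine_similarity(vectors[query_index], vec))
        for i, vec in enumerate(vectors)
        if i != query_index
    ]
    scored.sort(key=lambda pair: (-pair[1], pair[0]))
    return scored[:top_n]


# Hypothetical, already-segmented articles (stand-ins for jiebaR output).
docs = [
    ["typhoon", "landfall", "rain", "taipei"],
    ["typhoon", "rain", "flood", "warning"],
    ["election", "candidate", "poll", "taipei"],
    ["election", "poll", "vote", "candidate"],
]

vectors = tf_idf_vectors(docs)
for index, score in most_similar(0, vectors):
    print(index, round(score, 3))
```

In this toy data the two typhoon articles rank closest to each other, as do the two election articles. The clustering step with Ward's method is not shown; in R it would typically operate on the resulting distance matrix.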
    Appears in Collections: [Graduate Institute & Department of Computer Science and Information Engineering] Thesis

    Files in This Item:

    File          Description   Size   Format
    index.html                  0Kb    HTML     155

    All items in the Institutional Repository are protected by copyright, with all rights reserved.