Tamkang University Institutional Repository: Item 987654321/114439
RC Version 7.0 © Powered By DSPACE, MIT. Enhanced by NTU Library & TKU Library IR team.
    Please use this identifier to cite or link to this item: https://tkuir.lib.tku.edu.tw/dspace/handle/987654321/114439


    Title: R軟體新詞產生套件開發 : 應用於PTT文章
    Other Titles: Development of new-word extraction package in R with application in PTT articles
    Authors: 劉庭宇;Liu, Ting-Yu
    Contributors: Master's Program, Department of Statistics, Tamkang University (淡江大學統計學系碩士班)
    陳景祥
    Keywords: new words; R package; Rcpp; text mining
    Date: 2017
    Issue Date: 2018-08-03 14:52:38 (UTC+8)
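    Abstract: Existing open-source programs for extracting new words from text are written in C++, Java, or Python, because R lacks some of the data structures needed to store strings. The Rcpp package lets R interface with programs written in other languages. This thesis uses Rcpp to connect R to the open-source Wordmaker C++ program and names the resulting package wordmakerR. Because Branch Entropy extracts a large number of candidates, some of which are not valid words, this thesis adds a word-frequency threshold mechanism to the Wordmaker program architecture to address this weakness. The package also includes two functions for handling junk strings.
    Rcpp is an intuitive, easy-to-use API package that makes it very easy to interface R with C++ programs or libraries. The package developed here extracts new words from text for direct analysis in R and streamlines the removal of meaningless strings: the user only needs to supply the related terms to be removed. Finally, the thesis applies wordmakerR to a case study, collecting 450,000 articles from the Gossiping, Boy-Girl, and Women boards of PTT (批踢踢實業坊) and using wordmakerR for the follow-up analysis, which found 1,853 new words in total. These new words all emerged between 2016 and April 2017. We also compared the growth rate and number of new words across the three boards: the Gossiping board averaged 243.57 new words per month, the Boy-Girl board 4.28, and the Women board 26.42. The Gossiping board clearly produced new words the fastest and in the greatest number.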
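    The junk-string removal step described above, where the user supplies terms and matching strings are dropped, could look roughly like the sketch below. It is written in Python purely for illustration; the function name `remove_junk` and the substring-matching rule are assumptions, not the actual wordmakerR functions, whose names and signatures this record does not give.

```python
def remove_junk(candidates, junk_terms):
    """Drop every candidate word that contains any user-supplied junk term.

    Hypothetical helper: the matching rule (plain substring test) is an
    assumption made for this sketch, not the behaviour of wordmakerR.
    """
    # Ignore empty terms, since "" is a substring of every string.
    terms = [t for t in junk_terms if t]
    return [w for w in candidates if not any(t in w for t in terms)]


# Made-up candidate list; two entries contain the unwanted terms.
candidates = ["記者快訊", "魯蛇", "新聞網", "肥宅"]
cleaned = remove_junk(candidates, ["記者", "新聞"])
print(cleaned)
```

    A substring test keeps the interface to a single list of terms, which matches the "just enter the terms to remove" description; a regular-expression variant would be a natural extension.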
    At present, open-source programs that extract new Chinese words are mostly written in C++, Java, and Python, because R lacks some of the data structures needed to store strings. Fortunately, the Rcpp package allows functionality from external programs written in other programming languages to be ported into R.

    The main goal of our study is to develop a new R package, wordmakerR, which uses Rcpp to port the C++-based open-source project Wordmaker into R.

    Because Wordmaker's use of the Branch Entropy algorithm often generates many meaningless terms, we also develop a word-frequency threshold mechanism and two junk-word filtering functions to address this problem. As a result, the wordmakerR package simplifies the new-term extraction process in R and eases the steps needed to remove meaningless terms.
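    To make the idea of combining Branch Entropy with a word-frequency threshold concrete, here is a minimal Python sketch. It is not Wordmaker's or wordmakerR's code: the function names, the parameters `max_len`, `min_freq`, and `min_entropy`, and their default values are all assumptions chosen for illustration. A candidate is kept only if it occurs at least `min_freq` times and the entropy of the characters adjacent to it, on the less varied side, reaches `min_entropy`.

```python
import math
from collections import Counter, defaultdict


def entropy(counter):
    """Shannon entropy (natural log) of a Counter of neighbour characters."""
    total = sum(counter.values())
    return -sum((c / total) * math.log(c / total) for c in counter.values())


def extract_candidates(text, max_len=4, min_freq=2, min_entropy=0.5):
    """Return {candidate: (frequency, branch_entropy)} for substrings of text.

    Hypothetical sketch: thresholds and n-gram lengths are illustrative only.
    """
    freq = Counter()
    left = defaultdict(Counter)   # characters seen just before each candidate
    right = defaultdict(Counter)  # characters seen just after each candidate
    n = len(text)
    for i in range(n):
        for k in range(2, max_len + 1):
            if i + k > n:
                break
            w = text[i:i + k]
            freq[w] += 1
            if i > 0:
                left[w][text[i - 1]] += 1
            if i + k < n:
                right[w][text[i + k]] += 1

    result = {}
    for w, f in freq.items():
        if f < min_freq:                 # word-frequency threshold
            continue
        if not left[w] or not right[w]:  # no neighbours observed on one side
            continue
        h = min(entropy(left[w]), entropy(right[w]))
        if h >= min_entropy:             # branch-entropy threshold
            result[w] = (f, h)
    return result


# Toy input: "ab" recurs with a different neighbour on each side every time.
found = extract_candidates("1ab2ab3ab4")
print(found)
```

    Raising `min_freq` trades recall for precision: rare strings that happen to have varied neighbours are discarded before the entropy test is applied, which is the kind of pruning the frequency threshold is meant to provide.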

    At the end of this study, we apply wordmakerR to real-world data: 450,000 articles from the Gossiping, Boy-Girl, and Women boards of the PTT bulletin board site.
    Appears in Collections:[Graduate Institute & Department of Statistics] Thesis


    All items in the institutional repository are protected by copyright, with all rights reserved.