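Most open-source programs that extract new words from text are currently written in C++, Java, or Python, because some of the data structures needed to store strings cannot be used in R. The Rcpp package lets R interface with programs written in other languages. This thesis uses Rcpp to connect R to the open-source C++ program Wordmaker and names the resulting package wordmakerR. Because branch entropy extracts a large number of candidates, some of which are not valid words, we add a word-frequency threshold to Wordmaker's program structure to address this weakness. The package also includes two functions for handling junk strings. Rcpp is a convenient and intuitive API that makes interfacing R with C++ programs or libraries very easy. The package developed here extracts new words from text for direct analysis in R and streamlines the removal of meaningless strings: the user only needs to supply the terms to be removed. Finally, we apply wordmakerR to a case study of 450,000 articles collected from the Gossiping, Boy-Girl, and Women boards of PTT (批踢踢實業坊) and find 1,853 new words in total, all of which emerged between 2016 and April 2017. We also compare the rate and number of new words across the three boards: Gossiping averages 243.57 new words per month, Boy-Girl 4.28, and Women 26.42, so Gossiping clearly produces new words fastest and in the greatest number.
At present, open-source programs that extract new Chinese words are mostly written in C++, Java, and Python, because some of the data structures needed to store strings cannot be used in R. Fortunately, the Rcpp package allows functionality from external programs written in other programming languages to be ported into R.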
The main goal of our study is to develop a new R package, wordmakerR, which uses Rcpp to port the C++-based open-source project Wordmaker into R.
Because Wordmaker's use of the branch entropy algorithm often generates many meaningless terms, we also add a word-frequency threshold and two junk-word filtering functions to address this problem. As a result, the wordmakerR package simplifies new-term extraction in R and eases the removal of meaningless terms.
At the end of this study, we apply wordmakerR to real-world data: 450,000 articles from the Gossiping, Boy-Girl, and Women forums of the PTT discussion board website.
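The combination of branch entropy and a word-frequency threshold described above can be illustrated with a minimal sketch. It is written in Python purely for illustration and is not the Wordmaker or wordmakerR code; the function names `branch_entropy` and `extract_candidates`, and the parameters `max_len`, `min_freq`, and `min_entropy`, are hypothetical, and the actual implementation may score candidates differently.

```python
import math
from collections import Counter, defaultdict


def branch_entropy(neighbors):
    """Shannon entropy (natural log) of a Counter of neighboring characters."""
    total = sum(neighbors.values())
    return -sum((c / total) * math.log(c / total) for c in neighbors.values())


def extract_candidates(text, max_len=4, min_freq=2, min_entropy=0.5):
    """Illustrative sketch only; names and defaults are assumptions.

    Counts every substring of length 2..max_len, records the characters
    seen immediately to its left and right, and keeps a substring when
    (a) its frequency reaches min_freq and
    (b) the smaller of its left/right branch entropies reaches min_entropy.
    """
    freq = Counter()
    left = defaultdict(Counter)
    right = defaultdict(Counter)
    n = len(text)
    for i in range(n):
        for k in range(2, max_len + 1):
            if i + k > n:
                break
            w = text[i:i + k]
            freq[w] += 1
            if i > 0:
                left[w][text[i - 1]] += 1
            if i + k < n:
                right[w][text[i + k]] += 1

    result = {}
    for w, f in freq.items():
        if f < min_freq:  # frequency threshold: drop rare candidates
            continue
        h = min(branch_entropy(left[w]), branch_entropy(right[w]))
        if h >= min_entropy:
            result[w] = (f, h)
    return result


if __name__ == "__main__":
    # "ab" occurs three times, each time with different neighbors.
    print(extract_candidates("xabyzabwqabp"))
```

In this sketch, raising `min_freq` discards candidates that appear too rarely to be trusted, which is the role the frequency threshold plays alongside branch entropy in the abstract.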