R軟體新詞產生套件開發 : 應用於PTT文章

Please use this identifier to cite or link to this item: https://tkuir.lib.tku.edu.tw/dspace/handle/987654321/114439

Title:	R軟體新詞產生套件開發 : 應用於PTT文章
Other Titles:	Development of new-word extraction package in R with application in PTT articles
Authors:	劉庭宇;Liu, Ting-Yu
Contributors:	Master's Program, Department of Statistics, Tamkang University; 陳景祥
Keywords:	New words; R package; Rcpp; text mining
Date:	2017
Issue Date:	2018-08-03 14:52:38 (UTC+8)
Abstract:	Most open-source programs that extract new words from text are written in C++, Java, or Python, because R lacks some of the data structures needed to store strings. The Rcpp package, however, lets R interface with external code written in C++. This thesis uses Rcpp to port the open-source C++ program Wordmaker into R as a package named wordmakerR. Because extraction based on branch entropy returns many candidates, some of which are not valid words, we add a word-frequency threshold to Wordmaker's program structure to address this weakness. The package also includes two functions for filtering junk strings. Rcpp proved easy to use and intuitive, and it makes connecting R to C++ programs or libraries straightforward. The resulting package extracts new words from text for direct analysis in R and streamlines the removal of meaningless strings: the user only has to supply the terms to be removed. Finally, we apply wordmakerR to a real-world analysis of 450,000 articles collected from the Gossiping, Boy-Girl, and Women boards of PTT (批踢踢實業坊), which yields 1,853 new words in total, all of which emerged between 2016 and April 2017. We also compare the rate and number of new words across the three boards: the Gossiping board averages 243.57 new words per month, the Boy-Girl board 4.28, and the Women board 26.42, so the Gossiping board clearly produces new words fastest and in the greatest number.
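
The idea named in the abstract, scoring candidate strings by branch entropy and then discarding those below a word-frequency threshold, can be illustrated with a minimal sketch. This is not the Wordmaker or wordmakerR code: it is written in Python for brevity, and every function name, parameter, and default value below (`extract_candidates`, `min_freq`, `min_entropy`, `drop_junk`) is a hypothetical chosen for illustration only.

```python
import math
from collections import Counter, defaultdict


def branch_entropy(neighbor_counts):
    """Shannon entropy (in nats) of the characters adjacent to a candidate."""
    total = sum(neighbor_counts.values())
    return -sum((c / total) * math.log(c / total)
                for c in neighbor_counts.values())


def extract_candidates(text, max_len=4, min_freq=2, min_entropy=0.5):
    """Return {candidate: frequency} for substrings passing both filters.

    Assumed, illustrative thresholds; the thesis does not state its values.
    """
    freq = Counter()
    left = defaultdict(Counter)   # candidate -> counts of preceding character
    right = defaultdict(Counter)  # candidate -> counts of following character

    # Count every substring of length 2..max_len and its neighbors.
    for n in range(2, max_len + 1):
        for i in range(len(text) - n + 1):
            gram = text[i:i + n]
            freq[gram] += 1
            if i > 0:
                left[gram][text[i - 1]] += 1
            if i + n < len(text):
                right[gram][text[i + n]] += 1

    result = {}
    for gram, count in freq.items():
        if count < min_freq:  # word-frequency threshold
            continue
        if not left[gram] or not right[gram]:
            continue  # no neighbor on one side, entropy undefined
        # Require varied neighbors on both sides of the candidate.
        entropy = min(branch_entropy(left[gram]), branch_entropy(right[gram]))
        if entropy >= min_entropy:
            result[gram] = count
    return result


def drop_junk(candidates, junk_terms):
    """Remove candidates containing any user-supplied junk term (hypothetical)."""
    return {w: c for w, c in candidates.items()
            if not any(term in w for term in junk_terms)}
```

For example, `extract_candidates("xabyzabwqabk")` keeps only `"ab"`, which occurs three times with a different character on each side every time, while every other substring occurs once and falls below the frequency threshold.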
Appears in Collections:	[Graduate Institute & Department of Statistics] Thesis