A Study of a Distributed Sequential Pattern Mining Method Based on the MapReduce Programming Model


Please use this permanent URL to cite or link to this document: https://tkuir.lib.tku.edu.tw/dspace/handle/987654321/111175

Title:	基於MapReduce程式架構下的分散式循序樣式探勘方法之研究
Alternative title:	A study of distributed sequential pattern mining method based on MapReduce programming model
Author:	Chen, Jhih-Siang (陳智翔)
Contributors:	Tamkang University, Department of Information Management, Master's Program; 徐煥智
Keywords:	Hadoop; MapReduce; sequential pattern; data mining
Date:	2016
Uploaded:	2017-08-24 23:45:53 (UTC+8)
Abstract:	Sequential pattern mining is a data mining method for extracting frequent sequential patterns from large sequence databases. Conventional methods fall into two categories: candidate-generation (Apriori-like) methods and pattern-growth methods. These algorithms are designed mainly for a single machine, which brings several drawbacks on massive datasets: long database scanning times, limited scalability, and low efficiency. To improve the performance and scalability of sequential pattern mining, this study proposes a distributed sequential pattern mining method based on the Hadoop platform and the MapReduce programming model. The mining task is decomposed into many distributed tasks: the Map function mines the sequential patterns in a subset of the database, and the Reduce function then merges all the patterns found. This reduces the search space and yields higher mining efficiency. The study also examines how the user-specified minimum support threshold affects the distributed mining process. The experiments show that the threshold should be set differently in the Map and Reduce phases; otherwise some frequent patterns may be lost.
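
The Map/Reduce division and the threshold issue described in the abstract can be illustrated with a small single-process simulation. The sketch below is not the thesis's algorithm. It assumes sequences of single items, brute-force enumeration of subsequences up to length 2, and hypothetical helper names (`local_support`, `map_phase`, `reduce_phase`, `mine`). It only shows how a globally frequent pattern can be lost when the Map phase prunes with the same relative threshold as the Reduce phase.

```python
from collections import Counter
from itertools import combinations


def local_support(sequences, max_len=2):
    """Count how many sequences contain each pattern as a subsequence."""
    support = Counter()
    for seq in sequences:
        found = set()
        for k in range(1, max_len + 1):
            # combinations() keeps positional order, so each tuple is a subsequence
            found.update(combinations(seq, k))
        support.update(found)  # each pattern counted once per sequence
    return support


def map_phase(partition, local_min_count):
    """Mine one partition and keep only locally frequent patterns."""
    counts = local_support(partition)
    return {p: c for p, c in counts.items() if c >= local_min_count}


def reduce_phase(mapped_outputs, global_min_count):
    """Merge per-partition counts and apply the global threshold."""
    total = Counter()
    for partial in mapped_outputs:
        total.update(partial)
    return {p: c for p, c in total.items() if c >= global_min_count}


def mine(partitions, local_min_count, global_min_count):
    mapped = [map_phase(part, local_min_count) for part in partitions]
    return reduce_phase(mapped, global_min_count)


# Illustrative data (assumed): 8 sequences split into 2 partitions of 4.
partitions = [
    ["ab", "ab", "acb", "c"],
    ["ab", "c", "d", "d"],
]

# Global minimum support: 4 of 8 sequences (50%).
# Same relative threshold in Map (2 of 4): ('a', 'b') is pruned in partition 2.
strict = mine(partitions, local_min_count=2, global_min_count=4)

# Relaxed Map threshold (1 of 4): every local count reaches the Reduce phase.
relaxed = mine(partitions, local_min_count=1, global_min_count=4)

print(sorted(strict))   # []
print(sorted(relaxed))  # [('a',), ('a', 'b'), ('b',)]
```

In this toy data the pattern `('a', 'b')` occurs in 4 of 8 sequences, yet the strict run drops it because the second partition prunes its single local occurrence before merging. Lowering the Map threshold recovers it, at the cost of more intermediate patterns passed to the Reduce phase.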
Appears in collections:	[Department and Graduate Institute of Information Management] Theses


All items in the institutional repository are protected by their original copyright.
