尋求有效率之資料串流頻繁樣式探勘

淡江大學機構典藏 > 工學院 > 電機工程學系暨研究所 > 學位論文 > Item 987654321/111450

Please use this identifier to cite or link to this item: https://tkuir.lib.tku.edu.tw/dspace/handle/987654321/111450

Title:	尋求有效率之資料串流頻繁樣式探勘
Other Titles:	In pursuit of efficient frequent pattern mining in data streams
Authors:	凃耘昇;Tu, Yun-Sheng
Contributors:	淡江大學電機工程學系碩士班莊博任;Chuang, Po-Jen
Keywords:	資料串流;頻繁樣式探勘;Data stream;Frequent pattern mining
Date:	2016
Issue Date:	2017-08-24 23:53:09 (UTC+8)
Abstract:	資料串流探勘(Data stream mining)是現今大數據(Big data)領域的研究主題之一。資料串流指的是日常生活中隨著時間不斷產生的大量資料。例如：每個人上網的點擊串流、感測器產生的資料、購物資料、網路流量資料…等。這些資料串流中隱含著許多有價值的資訊，利用資料探勘(Data mining)的技術，可以分析出針對不同應用之下所需要的結果，最後做出更佳的決策。在資料探勘的領域中，其中有一個主題是頻繁樣式探勘(Frequent pattern mining)。目標是要從資料當中找出哪些樣式出現頻率較高，利用找出來的樣式可以進一步運用。目前在資料串流中找出頻繁樣式的研究領域，針對即時分析的需求以及爆量資料的問題，近年來有LCSS演算法。它能夠用有限的時間與空間處理Data stream。但是，由於沒有使用較佳的資料結構，使得運算的時間很長。針對這個問題，我們使用較佳的資料結構Trie存取資料。Trie是一個字典排序(Lexicographic)的Tree。因此，在搜尋pattern時，能夠用更少的步驟處理，使得速度能夠有效的提升。此外，LCSS演算法有另外一個問題。當最小支持度(Minimum support threshold)較低時，無法正確的找出頻繁樣式，導致準確率趨近於零。因為LCSS是用有限的空間儲存pattern，當資料較龐大且複雜時，沒有足夠緩衝的空間能夠存取pattern。為了改善準確率較低的問題，我們提出了Length Skip。犧牲Long frequent pattern，將運算資源集中在Short frequent pattern，使其準確率提高。Length Skip是在處理Data stream的過程中，使用滑動視窗(Sliding Window)保持最新的10K筆資料。每到達Check point時，利用Sliding window內的data計算準確率。若準確率很低，計算出恰當的X限制pattern長度。使得長度1~X的pattern有更充裕的空間可以存取Summary，有效提高準確率。實驗結果證實，使用Trie存取pattern能夠用較少的步驟處理完畢，大幅減少執行時間。在改善準確率方面，有效的使用Sliding window分析準確率，並計算出恰當的X限制長度。避免過多的Long pattern充斥在Summary之中，進而大大地提升了準確率。 Data stream mining is a research topic in the big data field. Data streams contain a lot of data which are generated continuously in our daily life. They are click data, sensor data, transaction data, network traffic data and so on. There are a lot of valuable information in data streams. Data mining techniques have been used to analyze data in different applications so that making better decisions. In data mining fields, frequent pattern mining is an important topic. It is used to find patterns which appeared frequently. Frequent patterns are used in several applications and other analysis. In recent years, there are a lot of researches about “frequent pattern mining in data streams”. There is a method called LCSS algorithm which has ability to do real-time response and handle bursty data. The LCSS algorithm uses finite run time and memory space to process data streams, but, inappropriate data structures cost a lot of time. In order to solve the problem, a better data structure called “Trie” is used, which is a lexicographic tree. As a result, run time decreases because fewer steps are used to access patterns in Trie. In addition, there is a problem in the LCSS algorithm. Frequent patterns can’t be found accurately when minimum support threshold is low. Therefore, accuracy closes to zero. Accuracy is low because the LCSS algorithm uses finite memory space to store patterns. There are not enough memory space store patterns when data streams are complex and large. In order to increase accuracy, we propose “Length Skip” which sacrifices long frequent patterns and focuses on short frequent patterns. When it processes data streams, uses a fixed size sliding window to hold the last ten thousand data. In each check point, data in sliding window are used to calculate accuracy. If accuracy is low, appropriate X would be calculated to limit the length of patterns. Experimental results show that using Trie to access patterns could reduce the number of process steps so that it spends shorter time to process data streams. On the other hand, sliding window is used to calculate accuracy and appropriate limited length X so that there are not many long frequent patterns are stored in a summary. Therefore, accuracy increases efficiently.
Appears in Collections:	[電機工程學系暨研究所] 學位論文

Files in This Item:

File	Description	Size	Format
index.html		0Kb	HTML	245	View/Open

Loading...