Efficient mining and maintaining algorithms for sequential patterns

淡江大學機構典藏 > 工學院 > 資訊工程學系暨研究所 > 學位論文 > Item 987654321/35102

Please use this identifier to cite or link to this item: https://tkuir.lib.tku.edu.tw/dspace/handle/987654321/35102

Title:	Efficient mining and maintaining algorithms for sequential patterns
Other Titles:	高效率挖掘及維護序列型樣演算法之研究
Authors:	郝維華;Hao, Wei-hua
Contributors:	淡江大學資訊工程學系博士班林丕靜;Lin, Nancy Pei-ching
Keywords:	資料採礦;資料挖掘;封閉序列型樣;最大序列型樣;維護演算法;遞增採礦;data mining;sequential pattern;Closed Sequence;Maximal Sequence;Maintain algorithm;Incremental Mining
Date:	2008
Issue Date:	2010-01-11 06:01:21 (UTC+8)
Abstract:	自有文字記載以來，各種文字與數字的紀錄已經充滿於各個領域。科學界詳實的實驗記錄、商業界的交易資料、學生的學習資料、瀏覽網頁序列、大賣場客戶交易資料，隨著電腦的廣泛應用而成指數型成長。尤其是近代資訊自動化，許多領域的資料，鉅細靡遺的紀錄於資料庫當中。這些電子資料可說是真實世界的電子版，我們嘗試透過查詢、統計、繪圖等方法，了解這些雜散的巨量資料背後所隱含的真實意涵，進而更貼近真實的了解真實世界。在眾多的方法當中，近年資料探勘的研究與應用層面迅速成長並於許多領域獲致驚人成果，而頻繁樣式的探勘屬於較新的研究焦點。本論文主旨在於研究高效率的探勘與維護序列樣式的演算法。在研讀相關研究當中，發現這領域先前研究提出之演算法集中於解決巨量資料的問題，而著重在加快搜尋速度。主要的量測指標有二:掃瞄資料庫次數以及搜尋空間大小。多次掃描資料庫將巨幅拖慢處理資料的速度，因此，掃描資料庫的次數愈少愈好。另外演算法所需要蒐尋的資料空間也是愈小愈有可能完全於記憶體當中運算，加速處理速度。就此，本論文提出之演算法皆僅需掃描資料庫一次，依據序列出現的頻率以及序列長度這二個我們感興趣的兩個屬性建立序列資料庫之資料模型。以此為題，首先提出FAL演算法，掃描資料庫一次，應用向下封閉特性(Downward Closure Property)建立序列晶格，再依據向上封閉特性(Upward Closure Property)刪除晶格中不頻繁序列，建立完全最大序列(Maximal Sequence)晶格，有效縮小晶格所佔用之空間。其次，提出FMCSP演算法，以等效類別(Equivalent Class)的觀念，建立封閉序列(Closed Sequence)晶格。然而，實際運轉的資料庫都會是動態的、時變的，無法避免繼續新增資料的產生。為此，針對時變遞增的資料庫如何維護序列型樣資料模型?!本論文提出MMSP演算法，探討挖掘序列型樣演算法之遞增特性，充分運用已經挖掘之資訊，減少重複挖掘工作。 Since the invention of characters, all kinds of records with numbers and words, increased dramatically in many domains. The invention of computer had trigger rapid data accumulation in science, education, e-Learning, business and supermarket, exponentially. Automation of information system has urged this situation and result in huge amount of data stored in databases worldwide. These electronic data is considered to be the mirror of the real world, which we can make full use it, properly. We try to discover interesting information or knowledge that conceived in databases via various methods, such as statistic, query, graphics and data mining, to further understand the world we lived in. A modern day information system is accountable for this purpose. In these diverse approaches, data mining has caught the eyes of many domain experts, and gain achievements. In the last decade, mining sequential patterns became one promising topic and arouse our interest. The essence of data mining is to dealing with huge amount of data, many previous researches focus on propose efficient algorithm with less search time and run time. Hence, this dissertation has focused on develop algorithms that mining sequential pattern efficiently, and to be maintainable. In our point of view there are two criteria to evaluate mining algorithm: scan times of database, volume of working space and searching space. As we already known the speed of accessing data from hard drive is slower than from main memory, usually by factor of two. This implies that the less scan times the better. Working space is required to host data during mining process. Searching space is the space to store the result, frequent sequence set or data model. The less space required by algorithm the more chance to fit all processed data into main memory. Consequently, both the efficiency and performance will be improved. With these in mind, all three algorithms been proposed in this dissertation have these three characters: scan database once, mining without candidates and mining full set of frequent sequences. First algorithm, FAL, is designed to fully utilize both the downward closure property and upward closure property to construct a lattice data model with maximal sequence representation. Second algorithm is FMCSP that inherit the legacy of FAL, but applied closed sequence concept instead of maximal sequence. Note that, closed sequence is the longest sequence in its equivalent class, it can shrink the size of searching space, and furthermore, with adjustable ability for user to set, or tune, the threshold of minimum support after the mining of data model had been constructed. Finally, algorithm MMSP has inherited the legacy of previous algorithms to deal with incremental sequence database. MMSP is capable to handle incremental data added into data model one by one and batch data without rerun whole database from scratch.
Appears in Collections:	[資訊工程學系暨研究所] 學位論文

Files in This Item:

File	Size	Format
	0Kb	Unknown	526	View/Open

Loading...