|摘要: ||本研究以提昇循序樣本探勘(sequential pattern mining)之效率為目標，發展以多部電腦為基礎的分散式探勘系統。分散式探勘首重節點間的負載平衡，同時需降低節點間通訊負擔。負載平衡與通訊負擔的降低均有賴於準確的任務工作量預測，此外，也應避免透過繁複的任務重分配來解決負載失衡的狀況。|
針對上述重點，本研究首先針對相關文獻中常利用之工作量預測方法進行分析，發現以序列(sequence)支持度(support)為基礎之預測方式與實際狀況差異甚大，將嚴重影響負載平衡。此外，我們也發現不同特性之資料庫可能適用於不同之探勘演算法，此發現可導致全新的分散式演算法設計。根據上述結果，我們發展了一套新的分散式探勘演算法，Segmented Dynamic Load Balance algorithm (SDLB)。演算法中對於任務工作分配採取分段實施，並依狀況進行靜態與動態分配，期能在負載平衡與降低通訊負擔間取得協調。此外，本研究亦導入全新的演算法切換概念，能讓各節點依照分配任務之特性，選用合適之探勘演算法，以進一步提升總體效率。由實驗結果顯示，與前人研究相較，本系統能有效縮短執行時間，並取得更良好之加速比，此結果也顯示本系統具有處理超大型資料庫之潛力。
In this thesis, an effective distributed mining system is developed to increase the efficiency of sequential pattern mining. Two important concerns for distributed mining, load balance and communication overhead reduction, are carefully examined in our works. To achieve a good load balance, accurate predictions on work load of subtasks are indispensable. In addition, the independency of subtask should be effectively controlled to decrease the communication overhead among nodes.
According to above objectives, the work load prediction methods used in the previous works are analyzed at first. We finds that the commonly-used prediction method does not show a well metrics for work load. Eventually, a skew presented in task partition occurs frequently due to the inaccurate prediction. Besides, we also found that, the efficiency of performing a sequence mining algorithm on a transaction database is strongly related to the distribution of sequential patterns in this database. To increase the mining efficiency, we should choose a proper mining algorithm for each database but not apply a single algorithm for all databases with different features. Base on the above observations, a novel distributed data mining algorithm, Segmented Dynamic Load Balance algorithm (SDLB), is developed. Different previous works, SDLB divides subtask dispatches into several stages. According to different situations, the static and dynamic load balance method are applied adeptly to prevent the task partition from skew and reduce the communication overhead simultaneously. Furthermore, SDLB inducts a brand-new concept that the nodes may adopt different basic mining algorithm for subtasks with different characteristics to promote the mining efficiency. In comparison with the previous works, the experimental results shows SDLB can effectively reduce the runtime and obtain a better speed-up ratio. This result demonstrates the potentials of SDLB for mining sequential pattern in Very Large Data Bases (VLDBs).