Unsupervised clustering techniques based on genetic algorithms

淡江大學機構典藏 > 工學院 > 資訊工程學系暨研究所 > 學位論文 > Item 987654321/35137

Please use this identifier to cite or link to this item: https://tkuir.lib.tku.edu.tw/dspace/handle/987654321/35137

Title:	Unsupervised clustering techniques based on genetic algorithms
Other Titles:	以基因演算法為基礎之非監督式分群技術
Authors:	楊富文;Yang, Fu-wen
Contributors:	淡江大學資訊工程學系博士班林慧珍;Lin, Hwei-jen
Keywords:	非監督式分群;基因演算法;交配;突變;群聚效度;馬可夫鏈;unsupervised clustering;Genetic Algorithm;crossover;mutation;cluster validity;Markov Chain
Date:	2005
Issue Date:	2010-01-11 06:03:46 (UTC+8)
Abstract:	聚類是一種非監督式地將輸入資料自動分類分群，而聚類方法之好壞，一般須具備下列幾點特性：（１）同類者具同質性，即同一類之資料間差異性要越小越好；（２）不同類者具異質性，即不同類之類別間差異性要越大越好。在許多文獻中，對於監督式聚類之問題，已有許多演算法被提出，如常見的有：K-means, Fuzzy-c-means, EM, …,等方法。然而，現實生活中卻存在著許多非監督式聚類之問題，要解決此類問題，常見的方法是利用基因演算法演化過程來得到一個不錯的解。在這篇論文中，我們提出了兩種不同策略之演化聚類演算法，來解決此非監督式聚類之問題；此兩種不同策略之演化聚類演算法都是利用資料本身來當作各類群中心之候選者來編碼染色體，也就是所謂的二進位編碼方式，故我們可事先建好各資料間之距離表，以減少評估適合度函數之計算時間。第一種策略利用較有效率基因運算子（複製、交配和突變）之基因演算法演化過程來設計，且因採二進位編碼方式，故在作基因運算子運算時，不需大量較費時之浮點運算，因此本方法節省了大量的運算時間。另一種提出之策略，則是改良Yong Gao所提出方法─在傳統基因演算法各染色體演化過程中產生早熟之收斂現象，將其轉換成染色體各基因間之馬可夫鏈 (Markov chain) 機率模式來模擬演化過程。此法因不須作任何傳統基因運算子之運算，只須利用染色體各基因之Markov鏈機率模式來模擬演化過程。因此本方法又節省了大量的運算時間。至於適合度函數之設計方面，都採用了Davies-Bouldin index來作適合度函數－其具備前述之聚類所需之特性。我們也實做了文獻中所提出較具效率之其它基因演算法式聚類方法來做比較，實驗結果展示出我們提出此兩種不同策略之基因聚類演算法皆比其它方法快速且有效。此結果也驗證了此篇論文的貢獻。 The number of clusters of a data set is not known in most real life situations, and no clustering system is capable of efficiently automated forming nature groups of the input patterns in these situations. Such difficult problems are called unsupervised clustering or non-parametric clustering, for which the evolutionary approaches are often employed to provide appropriate clustering results. The best-known evolutionary techniques are genetic algorithms (GAs). In this thesis, we propose two evolutionary strategies for unsupervised clustering. One is GA-based, and the other is based on population Markov chain modeling, which improves the Yong Gao’s algorithm to transform the evolutionary process of the population in the canonical genetic algorithms into the probability of Markov chain modeling for each gene. In both strategies, our algorithms select data from the data set as the candidates of cluster centers, and adopt binary representation to encode the number of cluster centers. In order to speed up the evaluation of fitness functions, a look up table of distances between all pairs of data points is developed in advance. In the first strategy, more effective operators of crossover and mutation are introduced. Because we use a binary representation to encode the number of clusters, unlike string representation (real-number encoding), we save a great deal of time for float-point computation during GAs operations (e.g. reproduction, crossover and mutation). In the second strategy, the algorithms produce a new chromosome according to the probability of Markov chain modeling for each gene without any conventional GAs operators. Hence, these proposed algorithms save a lot of computational costs than other proposed GA-based clustering algorithms. Finally, the Davies-Bouldin index is employed to measure the validity of the clusters. The superiority of the proposed algorithms over others is demonstrated in the experimental results, which show that the proposed algorithms achieve better performance in less computation time in comparison with other proposed genetic clustering algorithms.
Appears in Collections:	[資訊工程學系暨研究所] 學位論文

Files in This Item:

File	Size	Format
	0Kb	Unknown	468	View/Open

Loading...