A Study of Schema Extraction for Deep Web Search Interfaces (深層網路查詢介面之綱要擷取研究)

Please use this identifier to cite or link to this item: https://tkuir.lib.tku.edu.tw/dspace/handle/987654321/77414

Title:	深層網路查詢介面之綱要擷取研究
Other Titles:	A study of schema extraction for deep web search interfaces
Authors:	鄭又誠;Cheng, Yu-Cheng
Contributors:	Tamkang University, Department of Information Management, Master's Program; 周清江
Keywords:	Deep Web;Schema Extraction;Heuristic Rules
Date:	2012
Issue Date:	2012-06-21 06:41:26 (UTC+8)
Abstract:	As the internet has spread rapidly, the content held in web databases has grown just as quickly. This content is hidden behind query interfaces and is commonly called the Deep Web. Because users must enter suitable parameters into a query interface before they can browse the dynamic content that matches those parameters, search engines do not index this content, and users often miss important information. Before a system can extract Deep Web content automatically, it needs a component that extracts the schema of each query interface: the mapping between input elements and labels, the data types of the values each element accepts, and the range constraints on those values. Only then can the system fill in the elements with appropriate values and retrieve the content. This study builds such a schema extraction system for Deep Web query interfaces. Following the layout-expression-based form extraction method proposed by He et al., we identify the elements, labels, and row delimiters (line breaks) in a query interface to produce its Interface Expression (IEXP). We then combine the user's view and the designer's view and, using the ICQ dataset as a basis, parse the IEXP with heuristic rules to extract the schema. Our approach addresses the case where an element and a label are visually close but do not correspond to each other, while preserving the principle that an element and its corresponding label should not be far apart. The layered schema representation we propose helps with extracting Deep Web content and is also expected to benefit subsequent schema matching and schema merging. Finally, we evaluate whether the extracted element-to-label mappings are correct, using the TEL-8 dataset and query interfaces collected in previous research. The experimental results show that the system performs well.
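The idea of turning a query interface into an IEXP and then pairing elements with labels can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the names `Token`, `build_iexp`, and `pair_labels` are hypothetical, the symbols `t` (label text), `e` (input element), and `|` (row delimiter) are an assumed notation, and the pairing rule is a toy stand-in rather than the heuristic rules developed in the thesis.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Token:
    kind: str  # "t" for a text label, "e" for an input element (assumed notation)
    name: str  # label text or element name


def build_iexp(rows: List[List[Token]]) -> str:
    """Concatenate token kinds row by row, closing each row with '|'."""
    return "".join("".join(tok.kind for tok in row) + "|" for row in rows)


def pair_labels(rows: List[List[Token]]) -> List[Tuple[str, Optional[str]]]:
    """Toy rule: an element takes the closest label to its left in the same
    row; if there is none, it falls back to the last label of the previous row."""
    pairs: List[Tuple[str, Optional[str]]] = []
    prev_label: Optional[str] = None  # last label seen in the previous row
    for row in rows:
        current: Optional[str] = None  # closest label to the left in this row
        for tok in row:
            if tok.kind == "t":
                current = tok.name
            else:
                label = current if current is not None else prev_label
                pairs.append((tok.name, label))
        labels = [tok.name for tok in row if tok.kind == "t"]
        prev_label = labels[-1] if labels else None
    return pairs


# Hypothetical form: "Title [title]" on one row, then a "Price" label
# sitting above two input boxes on the next row.
rows = [
    [Token("t", "Title"), Token("e", "title")],
    [Token("t", "Price")],
    [Token("e", "min_price"), Token("e", "max_price")],
]

print(build_iexp(rows))   # te|t|ee|
print(pair_labels(rows))
```

In this sketch the row structure is what lets both price boxes attach to the "Price" label above them. It is only a rough analogue of the abstract's point that an element and its label should stay close without relying on visual distance alone.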
Appears in Collections:	[Graduate Institute & Department of Information Management] Thesis
