A Domain-Specific Deep Web Query Interface Classifier

Please use this permanent URL to cite or link to this document: https://tkuir.lib.tku.edu.tw/dspace/handle/987654321/54120

Title:	A domain-specific deep web query interface classifier
Original title:	一個識別特定主題深網查詢介面的分類器
Author:	Chang, Pei-Tzu (張珮慈)
Contributors:	Master's Program, Department of Information Management, Tamkang University; Jou, Chichang (周清江)
Keywords:	Deep Web; Query Interface; Search Engine
Date:	2011
Upload time:	2011-06-16 22:02:12 (UTC+8)
Abstract:	Previous research estimates that the deep web is about 400 to 550 times larger than the surface web. To retrieve deep web content residing in databases, we must first find the entrances to those databases, namely the deep web query forms. Moreover, because deep web content usually belongs to a specific domain, we need to pick out the deep web query forms from the many web forms found in that domain. To do so, we propose a two-phase analysis method that combines pre-query form analysis with post-query form analysis, and we develop an automatic technique for identifying deep web query interfaces. Unlike other studies, our approach not only identifies query forms but also filters out search engine forms and site search forms, which index only static web pages and are therefore not deep web query forms. In a preparation stage, we build a set of feature words for non-query form fields and crawl a large number of domain-specific query forms to extract the field semantics common in that domain. In the pre-query analysis phase, the classifier uses the non-query feature words to filter out common non-query forms first, which reduces the time spent submitting queries in the next phase. In the post-query analysis phase, it uses the common field semantics to fill in form values, submits the queries automatically, and analyzes the returned results to decide whether a form is a deep web query interface for the domain. Experimental results show that the proposed two-phase method achieves high precision: it filters out search engine forms and site search forms, and it also automatically detects and filters out query forms that no longer work, such as forms linked to disabled databases.
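The pre-query phase described in the abstract can be illustrated with a minimal sketch. This is not the thesis implementation: the feature-word list, the class `FormFieldCollector`, and the functions `is_candidate_query_form` and `pre_query_filter` are all hypothetical names chosen for illustration, and the rule used here (reject a form if any field descriptor contains a feature word) is an assumed simplification. The post-query phase is left out because it requires submitting live queries and the abstract does not spell out how the returned results are analyzed.

```python
from html.parser import HTMLParser

# Hypothetical feature words suggesting a non-query form (login, sign-up, ...).
# The thesis builds its own list; these values are assumptions for illustration.
NON_QUERY_FEATURE_WORDS = {"password", "login", "username", "email", "subscribe", "register"}


class FormFieldCollector(HTMLParser):
    """Collect name/id/type values of input-like fields, grouped per <form>."""

    def __init__(self):
        super().__init__()
        self.forms = []       # one list of field descriptors per form
        self._current = None  # fields of the form being parsed, or None

    def handle_starttag(self, tag, attrs):
        if tag == "form":
            self._current = []
        elif tag in ("input", "select", "textarea") and self._current is not None:
            attr = dict(attrs)
            # Attribute values may be None (valueless attributes), hence "or ''".
            words = [attr.get(key) or "" for key in ("name", "id", "type")]
            self._current.append(" ".join(words).lower())

    def handle_endtag(self, tag):
        if tag == "form" and self._current is not None:
            self.forms.append(self._current)
            self._current = None


def is_candidate_query_form(fields):
    """Pre-query check: keep a form only if no field contains a feature word."""
    if not fields:
        return False
    return not any(word in field for field in fields for word in NON_QUERY_FEATURE_WORDS)


def pre_query_filter(html):
    """Return one boolean per form: True if it should go on to the post-query phase."""
    collector = FormFieldCollector()
    collector.feed(html)
    return [is_candidate_query_form(fields) for fields in collector.forms]


if __name__ == "__main__":
    page = (
        '<form><input type="text" name="user_login">'
        '<input type="password" name="pw"></form>'
        '<form><input type="text" name="title">'
        '<select name="author"></select></form>'
    )
    print(pre_query_filter(page))
```

In this sketch the login form is rejected before any query is submitted, while the title/author form is kept as a candidate. Filtering cheaply on field names first is what lets the more expensive submission step run on fewer forms.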
Appears in collections:	[Department of Information Management and Graduate Institute] Theses

Files in this item:

File	Size	Format	Views
index.html	0Kb	HTML	578

All items in the institutional repository are protected by their original copyright.
