A classifier for identifying domain-specific deep web query interfaces

Please use this permanent URL to cite or link to this item: https://tkuir.lib.tku.edu.tw/dspace/handle/987654321/54120

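Title:	A classifier for identifying domain-specific deep web query interfaces
Other title:	A domain-specific deep web query interface classifier
Author:	Chang, Pei-Tzu (張珮慈)
Contributors:	Master's Program, Department of Information Management, Tamkang University; Jou, Chichang (周清江)
Keywords:	Deep Web; Query Interface; Search Engine
Date:	2011
Upload time:	2011-06-16 22:02:12 (UTC+8)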
Abstract:	Previous research estimates that the deep web is about 400 to 550 times larger than the surface web. To retrieve deep web content stored in databases, one must first find the entrances to those databases, namely the deep web query forms. Moreover, because deep web content is usually domain-specific, the task is to pick out the deep web query forms from the many web forms in a given domain. We propose a two-phase analysis method that combines pre-query form analysis (before a query is submitted) with post-query form analysis (after a query is submitted), and use it to develop an automatic technique for identifying deep web query interfaces. Unlike other studies, our approach not only identifies query forms but also filters out non-deep-web query forms, such as search engine and site search forms, which only index static web pages. In a preparation stage, we build a list of feature words for the fields of non-query forms and crawl a large number of domain-specific query forms to extract the common field semantics of that domain. In the pre-query analysis phase, the classifier uses the feature words to filter out common non-query forms first, which reduces the time cost of submitting queries. In the post-query analysis phase, it uses the common field semantics to fill in form values, submits the query automatically, and analyzes the returned results to decide whether the form is a deep web query interface for the domain. Experimental results show that the proposed method achieves high precision: it filters out non-deep-web query forms such as search engine forms, and it also automatically detects and filters out query forms whose links are broken, such as those leading to disabled databases.
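The pre-query phase described in the abstract can be illustrated with a minimal sketch. This is not the thesis's implementation: the feature-word list, the class `FormFieldCollector`, and the functions `is_candidate_query_form` and `prefilter_forms` are hypothetical names introduced here, and matching feature words as substrings of field attributes is an assumed simplification. The post-query phase (automatic form filling, submission, and result analysis) is not covered.

```python
from html.parser import HTMLParser

# Hypothetical feature words for non-query forms (login, registration,
# subscription). The thesis builds its own list; these are placeholders.
NON_QUERY_FEATURE_WORDS = {
    "password", "login", "username", "email", "subscribe", "register",
}


class FormFieldCollector(HTMLParser):
    """Collect lower-cased attribute text for the fields of each <form>."""

    def __init__(self):
        super().__init__()
        self.forms = []       # one list of field descriptors per form
        self._current = None  # fields of the form currently being parsed

    def handle_starttag(self, tag, attrs):
        if tag == "form":
            self._current = []
        elif tag in ("input", "select", "textarea", "button") and self._current is not None:
            a = dict(attrs)
            keys = ("type", "name", "id", "placeholder", "value")
            text = " ".join(str(a.get(k) or "") for k in keys)
            self._current.append(text.lower())

    def handle_endtag(self, tag):
        if tag == "form" and self._current is not None:
            self.forms.append(self._current)
            self._current = None


def is_candidate_query_form(fields):
    """Pre-query check: keep a form only if it has fields and none of them
    mentions a non-query feature word (assumed substring match)."""
    if not fields:
        return False
    return not any(
        word in field for field in fields for word in NON_QUERY_FEATURE_WORDS
    )


def prefilter_forms(html):
    """Return the field lists of the forms that survive the pre-query filter."""
    collector = FormFieldCollector()
    collector.feed(html)
    return [f for f in collector.forms if is_candidate_query_form(f)]
```

Only the forms returned by `prefilter_forms` would go on to the more expensive post-query phase, which is how the pre-query filter reduces the time spent submitting queries.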
Appears in collections:	[Department and Graduate Institute of Information Management] Theses and Dissertations

Files in this item:

File	Size	Format	Views
index.html	0Kb	HTML	377

All items in the institutional repository are protected by the original copyright.
