    Please use this identifier to cite or link to this item: https://tkuir.lib.tku.edu.tw/dspace/handle/987654321/54120


    Title: 一個識別特定主題深網查詢介面的分類器
    Other Titles: A domain-specific deep web query interface classifier
    Authors: 張珮慈;Chang, Pei-Tzu
    Contributors: Tamkang University, Department of Information Management, Master's Program (淡江大學資訊管理學系碩士班)
    周清江;Jou, Chichang
    Keywords: Deep Web;Query Interface;Search Engine
    Date: 2011
    Issue Date: 2011-06-16 22:02:12 (UTC+8)
    Abstract: Previous research estimates that the deep web holds about 400 to 550 times as much data as the surface web. To retrieve deep web content residing in databases, we must first find the entrances to those databases, namely the deep web query interfaces (query forms). Moreover, because deep web content is usually domain-specific, we need to identify the deep web query interfaces among the many web forms of a given domain. We propose a two-phase analysis methodology that combines pre-query and post-query form analyses, and we develop an automatic deep web query interface classification technique. Unlike other studies, our approach not only identifies query forms but also filters out non-deep-web query forms, such as search engine forms and site search forms, which only index static web pages.
    Before classification, we build a set of feature words for non-query forms and crawl a large number of domain-specific query forms to extract the semantics of the fields common in that domain. In the pre-query analysis phase, our classification system uses the non-query feature words to filter out common non-query forms first, which reduces the time cost of submitting queries in the next phase. In the post-query analysis phase, we use the common field semantics to fill in values and submit the forms automatically, and then analyze the returned results to decide whether each form is a deep web query interface for the domain. The experimental results show that our two-phase methodology achieves high precision. It not only filters out search engine forms and site search forms but also automatically detects and filters out query forms with broken links, that is, forms that point to disabled databases.
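
    The two-phase flow described in the abstract can be sketched roughly as follows. This is only an illustrative sketch under stated assumptions, not the thesis's implementation: the `Form` type, the feature-word set, the `field_semantics` mapping, the `submit` callback, the returned labels, and the post-query decision rule (accept a form if any returned result mentions a submitted value) are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional

# Hypothetical feature words for non-query forms (login, registration, ...).
# The thesis builds its own list; these entries are only placeholders.
NON_QUERY_FEATURE_WORDS = frozenset(
    {"password", "login", "username", "email", "subscribe", "register"}
)


@dataclass
class Form:
    """A minimal stand-in for a parsed HTML form (hypothetical type)."""
    action: str
    field_labels: List[str]


# submit(form, values) returns the result snippets, or None when the
# request fails (for example, a broken link or a disabled database).
Submit = Callable[[Form, Dict[str, str]], Optional[List[str]]]


def passes_pre_query(form: Form) -> bool:
    """Phase 1: reject forms whose labels contain a non-query feature word."""
    for label in form.field_labels:
        tokens = label.lower().split()
        if any(token in NON_QUERY_FEATURE_WORDS for token in tokens):
            return False
    return True


def fill_values(form: Form, field_semantics: Dict[str, str]) -> Dict[str, str]:
    """Give each recognized field a sample value from the domain semantics."""
    return {
        label: field_semantics[label.lower()]
        for label in form.field_labels
        if label.lower() in field_semantics
    }


def classify(form: Form, field_semantics: Dict[str, str], submit: Submit) -> str:
    """Run both phases and return a label for the form."""
    if not passes_pre_query(form):
        return "non-query form"
    values = fill_values(form, field_semantics)
    if not values:
        return "not in domain"
    results = submit(form, values)  # Phase 2: actually submit the query.
    if results is None:
        return "broken link"
    # Assumed decision rule: some result must mention a submitted value.
    hit = any(v.lower() in r.lower() for r in results for v in values.values())
    return "deep web query interface" if hit else "no matching results"
```

    Running the cheap label check first means the costly submission step is only reached by forms that survive it, which is the time saving the abstract attributes to the pre-query phase. In practice `submit` would wrap an HTTP client; here it is left as a callback so the sketch stays self-contained.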
    Appears in Collections: [Department and Graduate Institute of Information Management] Theses and Dissertations

    Files in This Item:

    File          Size    Format
    index.html    0Kb     HTML      374

    All items in the Institutional Repository are protected by copyright, with all rights reserved.

