A Semantics-Based Two-Phase Specific Domain Deep Web Query Interface Classifier

Please use this permanent URL to cite or link to this item: https://tkuir.lib.tku.edu.tw/dspace/handle/987654321/103123

Title:	A Semantics-Based Two-Phase Specific Domain Deep Web Query Interface Classifier
Original title:	一個基於語意之二階段特定主題深網查詢介面分類器
Author:	周清江
Contributor:	Department of Information Management, Tamkang University
Keywords:	semantic web; web database; deep web; query interface; classifier
Date:	2012-08
Uploaded:	2015-05-19 13:53:45 (UTC+8)
Abstract:	As the web has developed, its content has grown rapidly. At the same time, large volumes of structured data sit in databases behind web query form interfaces. This content is generated dynamically only after suitable query conditions are entered into the form, so traditional search engines cannot index it. Static pages that search engines index are generally called the Surface Web, while data sources hidden in web databases are called the Invisible Web or Deep Web. According to Gil's estimate, as of spring 2011 Google had indexed 26.5 billion static web pages, while at least 300 billion dynamically generated deep web database pages remained unindexed by Google. Deep web content is usually curated for a specific domain and can therefore offer higher-quality information than the surface web. Making web users aware of these deep web query forms, so that they can use the content, has become an important research problem.

Previous research on identifying deep web query forms has mostly extracted and parsed form features, such as the labels, names, and attribute values of form controls, and then applied machine learning classification to decide whether a form on a web page is a deep web query form. Because form design practices vary widely, these classifiers do not perform well on forms from random web pages.

After an initial collection of deep web and non-deep web query interfaces, we made three observations: (1) certain specific words often appear in the labels of non-deep web query forms but not in those of deep web query forms; (2) deep web query forms and their query results contain a great deal of domain-specific semantic information; (3) many deep web query interfaces have a keyword text input control that accepts an arbitrary value, and filling in this field alone is enough to submit the form and obtain deep web content. We therefore propose a two-phase classification method. We will build in advance a form semantics database containing feature words of non-query form fields, the semantics of fields common to a specific domain, and example values for those fields. By combining pre-query analysis of form features with post-query analysis of result content, we aim to develop an automatic deep web query form classification technique that improves the identification of deep web query forms among forms from random web pages.
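The two-phase idea in the abstract can be illustrated with a minimal Python sketch. It is not the author's implementation: the word lists, the "books" domain, the thresholds, the names `Form`, `phase_one`, `phase_two`, and `classify`, and the rule that phase two runs only when phase one is inconclusive are all assumptions made for illustration. Form submission is passed in as a function so the sketch runs without network access.

```python
import re
from dataclasses import dataclass
from typing import Callable, Iterable

# Hypothetical vocabularies: the abstract does not list the actual words.
NON_QUERY_WORDS = {"login", "password", "subscribe", "register", "email"}
DOMAIN_TERMS = {"title", "author", "isbn", "publisher"}  # assumed "books" domain


@dataclass
class Form:
    labels: list[str]       # visible labels or names of the form controls
    text_inputs: list[str]  # names of the free-text input controls


def tokens(texts: Iterable[str]) -> set[str]:
    """Lower-case the texts and return the set of alphabetic words."""
    return {w for t in texts for w in re.findall(r"[a-z]+", t.lower())}


def phase_one(form: Form) -> str:
    """Pre-query analysis of form features: 'reject', 'accept', or 'uncertain'."""
    words = tokens(form.labels)
    if words & NON_QUERY_WORDS:          # observation (1): non-query feature words
        return "reject"
    if len(words & DOMAIN_TERMS) >= 2:   # observation (2): domain field semantics
        return "accept"
    # Observation (3): a lone keyword box may still front a database.
    return "uncertain" if form.text_inputs else "reject"


def phase_two(form: Form, submit: Callable[[str, str], str],
              probe: str = "python", min_hits: int = 2) -> bool:
    """Post-query analysis: fill one text field, submit, inspect the result text."""
    result_text = submit(form.text_inputs[0], probe)
    return len(tokens([result_text]) & DOMAIN_TERMS) >= min_hits


def classify(form: Form, submit: Callable[[str, str], str]) -> bool:
    """Return True if the form is judged to be a deep web query form."""
    verdict = phase_one(form)
    if verdict == "uncertain":
        return phase_two(form, submit)
    return verdict == "accept"
```

In use, `submit` would wrap whatever HTTP client fills the named field and returns the result page text. A stub that returns canned text is enough to exercise the logic, for example `classify(Form(["Keyword"], ["q"]), lambda field, value: "Title: ... Author: ...")`.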
Appears in collections:	[Department and Graduate Institute of Information Management] Research Reports

Files in this item:

There are no files associated with this item.

All items in the institutional repository are protected by their original copyright.
