Tamkang University Institutional Repository: Item 987654321/34963
RC Version 7.0 © Powered By DSPACE, MIT. Enhanced by NTU Library & TKU Library IR team.
    Please use this permanent URL to cite or link to this item: https://tkuir.lib.tku.edu.tw/dspace/handle/987654321/34963


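    Title: Progressive analysis scheme for web document classification
    Other title: 漸進式網頁文件分類技術
    Author: 宋立群; Sung, Li-chun
    Contributors: Doctoral Program, Department of Computer Science and Information Engineering, Tamkang University (淡江大學資訊工程學系博士班)
    郭經華; Kuo, Chin-hwa
    Keywords: 網頁探勘; 網頁分類; 漸進式分析; Web Mining; Web Document Classification; Progressive Analysis
    Date: 2007
    Uploaded: 2010-01-11 05:49:47 (UTC+8)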
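    Abstract (translated from the Chinese): In this thesis, we propose a progressive web document classification technique, abbreviated PAS. With this technique, the classifier needs to analyze only the content of a few key blocks of a document to confirm the category the document belongs to, which improves the efficiency of web page classification.
    In general, a web document can be divided into many small tag-regions according to its DOM structure. Each tag-region is usually presented in the browser window in a specific visual type, which is formed by the paired HTML tags attached to that tag-region. According to our observations, because of web authoring habits, how profitable the content of a tag-region is for classification tends to differ with its visual type. Moreover, tag-regions that share the same visual type within a document can also differ in profitability because of writing technique.
    In this thesis, by analyzing a large number of web documents with the aid of pattern recognition techniques such as EM and HMM, we identify the profitability characteristics of each visual type, namely its profitability tendency and its profitability variation pattern. We integrate these two characteristics into a mechanism for forecasting tag-region profitability. During classification, this mechanism dynamically forecasts the profitability of each tag-region not yet analyzed, and the classifier progressively extracts the most profitable tag-regions for classification until the category of the page is confirmed.
    To reduce the chance of wrong forecasts, the forecasting mechanism tunes itself according to the actual profitability of the tag-regions already analyzed. In addition, when forecasting the profitability of a rare visual type, the mechanism also consults the profitability characteristics of similar visual types in order to obtain a more accurate forecast.
    Through experiments, we show how parameter settings affect classifier performance and verify the superiority of the proposed web classification technique.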
    In this thesis, we propose a web document classification scheme, called the Progressive Analysis Scheme (PAS), which improves classification performance by analyzing only the few key parts of a document that suffice to confirm its category.
    Based on its DOM tag-tree structure, a web document can be segmented into small tag-regions. Each tag-region is rendered with a visual type, which corresponds to a specific nested combination of tag-pairs. We observe that, because of web authoring conventions, the profitability of a tag-region for classification varies with its visual type. In addition, within a single document, the profitability of tag-regions of the same visual type may also vary because of the author's writing technique.
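    The segmentation step described above can be illustrated with a minimal sketch. The abstract does not give the thesis's actual segmentation rules, so everything here is an assumption made for illustration: a tag-region is taken to be one run of text, its visual type is taken to be the slash-joined chain of currently open tags, and the names `TagRegionParser` and `segment` are hypothetical.

```python
from html.parser import HTMLParser

# Void elements never get a closing tag, so they must not be pushed on the stack.
VOID_TAGS = {"br", "hr", "img", "input", "meta", "link", "area",
             "base", "col", "embed", "source", "track", "wbr"}


class TagRegionParser(HTMLParser):
    """Collects (visual_type, text) pairs; an illustrative stand-in, not the thesis's parser."""

    def __init__(self):
        super().__init__()
        self.stack = []    # currently open tag-pairs, outermost first
        self.regions = []  # list of (visual_type, text)

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag; stray closing tags are ignored.
        if tag in self.stack:
            while self.stack.pop() != tag:
                pass

    def handle_data(self, data):
        text = data.strip()
        if text:
            # Assumption: the visual type is the nested chain of enclosing tags.
            self.regions.append(("/".join(self.stack), text))


def segment(html):
    """Return the tag-regions of an HTML string in document order."""
    parser = TagRegionParser()
    parser.feed(html)
    parser.close()
    return parser.regions


if __name__ == "__main__":
    page = "<html><body><h1>News</h1><p>Some <b>bold</b> text</p></body></html>"
    for visual_type, text in segment(page):
        print(visual_type, "->", text)
```

    In this sketch, text inside `<h1>` and text inside `<p><b>` end up with different visual types, which is the property the forecasting step relies on. A real implementation would probably also need to take tag attributes and styling into account.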
    For each visual type, we model these two kinds of profitability variation as profitability tendencies and tendency transition patterns, using the Expectation Maximization (EM) scheme and the Hidden Markov Model (HMM) scheme. For classification, we integrate the two into a profitability forecasting strategy. Using these strategies, we forecast the potential profitability of unanalyzed tag-regions and repeatedly extract the most profitable unanalyzed tag-region for classification until the category is confirmed.
    The forecasting strategies are optimized dynamically for each document by feeding back the actual profitabilities of the tag-regions already analyzed, so the profitabilities of subsequent tag-regions can be forecast more accurately. In addition, for each unreliable model, that is, one generated from a sparse set of training samples, we support its forecasting process with the strategies of similar visual types.
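    The progressive loop with feedback might look roughly like the sketch below. It is not the thesis's method: the EM- and HMM-based forecasting models are replaced by a plain per-type running average, the classifier is a toy keyword scorer, and category confirmation is approximated by a score margin. The function name `classify_progressively` and the parameters `prior`, `keyword_weights`, `margin`, and `rate` are all hypothetical.

```python
def classify_progressively(regions, prior, keyword_weights, margin=2.0, rate=0.5):
    """Analyze tag-regions in order of forecast profitability until one category leads.

    regions: list of (visual_type, text) pairs.
    prior: visual_type -> initial forecast profitability (assumed to come from training).
    keyword_weights: category -> {word: weight}; a toy stand-in for a real classifier.
    Returns (best_category, number_of_regions_analyzed).
    """
    forecast = dict(prior)  # per-document copy, adjusted by feedback
    scores = {category: 0.0 for category in keyword_weights}
    remaining = list(regions)
    analyzed = 0

    while remaining:
        # Pick the unanalyzed region whose visual type has the highest forecast.
        best = max(remaining, key=lambda region: forecast.get(region[0], 0.0))
        remaining.remove(best)
        visual_type, text = best
        analyzed += 1

        # Score the region and measure how much evidence it actually contributed.
        gained = 0.0
        for word in text.lower().split():
            for category, weights in keyword_weights.items():
                weight = weights.get(word, 0.0)
                scores[category] += weight
                gained += abs(weight)

        # Feedback: blend the observed profitability into this type's forecast
        # (a running average, swapped in for the thesis's EM/HMM models).
        forecast[visual_type] = (1 - rate) * forecast.get(visual_type, 0.0) + rate * gained

        # Stop as soon as the leading category is ahead by the required margin.
        ranked = sorted(scores.values(), reverse=True)
        if len(ranked) > 1 and ranked[0] - ranked[1] >= margin:
            break

    return max(scores, key=scores.get), analyzed


if __name__ == "__main__":
    regions = [
        ("html/body/p", "the weather was nice"),
        ("html/body/h1", "football match report"),
        ("html/body/p", "goal scored late"),
        ("html/body/div", "copyright footer"),
    ]
    prior = {"html/body/h1": 0.9, "html/body/p": 0.4, "html/body/div": 0.1}
    keyword_weights = {
        "sports": {"football": 2.0, "match": 1.0, "goal": 1.5},
        "politics": {"election": 2.0, "vote": 1.5},
    }
    print(classify_progressively(regions, prior, keyword_weights))
```

    With these made-up inputs the heading is analyzed first and already separates the two categories, so the loop stops after a single region. Raising `margin` forces more regions to be read, which is the cost-versus-confidence trade-off the scheme is built around. Support for sparse visual types through similar types is not modeled here.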
    Simulation results show that PAS achieves better classification performance than previous approaches, such as full-text classifiers (e.g., SVM) and sequential classifiers.
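    Appears in collections: [Department and Graduate Institute of Computer Science and Information Engineering] Theses and Dissertations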

    Files in this item:

    File             Size  Format   Views
    (no name shown)  0Kb   Unknown  378

    All items in the institutional repository are protected by their original copyright.
