    Please use this identifier to cite or link to this item: http://tkuir.lib.tku.edu.tw:8080/dspace/handle/987654321/34963

    Title: Progressive analysis scheme for web document classification
    Other Titles: 漸進式網頁文件分類技術
    Authors: 宋立群;Sung, Li-chun
    Contributors: Tamkang University, Department of Computer Science and Information Engineering, Doctoral Program
    郭經華;Kuo, Chin-hwa
    Keywords: Web Mining; Web Document Classification; Progressive Analysis
    Date: 2007
    Issue Date: 2010-01-11 05:49:47 (UTC+8)
    Abstract: In this thesis, we propose a web document classification scheme called the Progressive Analysis Scheme (PAS). Because the classifier needs to analyze only a few key regions of a document to confirm the document's category, PAS improves the efficiency of web document classification.
    Based on the DOM tag-tree structure, a web document can be segmented into small tag-regions. Each tag-region is rendered with a visual type, which corresponds to a specific nested combination of tag pairs. We observe that the profitability of a tag-region for classification varies across visual types because of web authoring conventions. Within a single document, the profitability of tag-regions of the same visual type may also vary because of the author's writing habits.
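    As an illustration of the segmentation idea only, the following Python sketch splits an HTML string into text regions and labels each one with the chain of tags that encloses it. Treating that tag chain as the "visual type" is an assumption made here for illustration; the thesis may define tag-regions and visual types differently, and the names TagRegionParser, VOID_TAGS, and segment are hypothetical.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag; they are kept off the stack.
VOID_TAGS = {"br", "hr", "img", "input", "meta", "link"}


class TagRegionParser(HTMLParser):
    """Collects text runs together with the nested tag path enclosing them."""

    def __init__(self):
        super().__init__()
        self.stack = []    # currently open tags, outermost first
        self.regions = []  # (visual_type, text) pairs in document order

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the innermost matching open tag, discarding any
        # inner tags that were left unclosed.
        if tag in self.stack:
            while self.stack.pop() != tag:
                pass

    def handle_data(self, data):
        text = data.strip()
        if text:
            # Assumption: the nested tag path stands in for the visual type.
            self.regions.append((">".join(self.stack), text))


def segment(html):
    """Return the (visual_type, text) regions of an HTML string."""
    parser = TagRegionParser()
    parser.feed(html)
    parser.close()
    return parser.regions
```

    For example, segment("<html><body><h1>Title</h1><p>Some <b>bold</b> text</p></body></html>") yields four regions: "Title" under html>body>h1, "Some" and "text" under html>body>p, and "bold" under html>body>p>b.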
    For each visual type, we model these two kinds of profitability variation as profitability tendencies and tendency transition patterns, using the Expectation-Maximization (EM) scheme and the Hidden Markov Model (HMM) scheme, and integrate them into a profitability forecasting strategy. During classification, the forecasting strategies predict the potential profitability of each unanalyzed tag-region, and the most profitable unanalyzed tag-regions are extracted and analyzed one after another until the category is confirmed.
    The forecasting strategies are optimized dynamically for each document by feeding back the actual profitabilities of the analyzed tag-regions, so that the profitabilities of subsequent tag-regions can be forecast more accurately. In addition, when a model is unreliable because it was trained on a sparse set of samples, its forecasting process is supported by the strategies of other, similar visual types.
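    The forecast, select, and feed-back loop described above can be sketched roughly as follows. This is a simplified illustration, not the thesis's method: a per-type running mean stands in for the EM and HMM models, category confirmation is approximated by a fixed evidence margin, and every name here (classify_progressively, keyword_score, prior, margin) is hypothetical.

```python
from collections import defaultdict


def keyword_score(text):
    """Toy scorer (assumption): counts category keywords in a region."""
    words = text.split()
    return {
        "sports": float(sum(w in {"football", "match"} for w in words)),
        "finance": float(sum(w in {"stock", "market"} for w in words)),
    }


def classify_progressively(regions, prior, score_region, margin=3.0):
    """Analyze regions in order of forecast profitability until one
    category leads the runner-up by at least `margin`.

    regions: list of (visual_type, text) pairs
    prior:   dict mapping visual_type to an initial forecast
    """
    forecast = dict(prior)       # forecast profitability per visual type
    seen = defaultdict(int)      # regions analyzed so far per visual type
    totals = defaultdict(float)  # accumulated evidence per category
    remaining = list(range(len(regions)))
    analyzed = []

    while remaining:
        # Select the unanalyzed region with the highest forecast.
        best = max(remaining, key=lambda i: forecast.get(regions[i][0], 0.0))
        remaining.remove(best)
        analyzed.append(best)

        vtype, text = regions[best]
        evidence = score_region(text)
        for category, value in evidence.items():
            totals[category] += value

        # Feedback: move the forecast toward the observed profitability.
        # The prior counts as one pseudo-observation in the running mean.
        actual = max(evidence.values(), default=0.0)
        seen[vtype] += 1
        old = forecast.get(vtype, 0.0)
        forecast[vtype] = old + (actual - old) / (seen[vtype] + 1)

        # Stop once the leading category is far enough ahead.
        ranked = sorted(totals.values(), reverse=True)
        runner_up = ranked[1] if len(ranked) > 1 else 0.0
        if ranked and ranked[0] - runner_up >= margin:
            break

    label = max(totals, key=totals.get) if totals else None
    return label, analyzed
```

    With four regions (a navigation div, an h1 heading, and two paragraphs) and a prior that favors headings, the loop analyzes the heading first and then the first paragraph, and stops without reading the remaining two regions once the evidence margin is reached.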
    Simulation results show that PAS achieves better classification performance than previous approaches, such as full-text classifiers (e.g., SVM) and sequential classifiers.
    Appears in Collections: [Graduate Institute & Department of Computer Science and Information Engineering] Thesis

    All items in the institutional repository are protected by copyright, with all rights reserved.