Tamkang University Institutional Repository: Item 987654321/74422
    Please use this identifier to cite or link to this item: https://tkuir.lib.tku.edu.tw/dspace/handle/987654321/74422


    Title: 一個自動化網頁資料表格結構辨識系統
    Other Titles: An automatic web data table structure recognition system
    Authors: 陳雅伶;Chen, Ya-Ling
    Contributors: Tamkang University, Master's Program, Department of Information Management
    周清江;Jou, Chi-chang
    Keywords: table structure; information extraction; table mining; web mining
    Date: 2011
    Issue Date: 2011-12-28 18:36:30 (UTC+8)
    Abstract: Many techniques have been proposed for extracting important information from web tables. They work well on simple tables, but on complex tables their accuracy often drops, because the similarity comparison among table cells is inadequate or too little related table information is collected. We design and implement an automatic web data table structure recognition system to address this problem. The system first uses heuristics to analyze each table's TSF (Table Structure Feature) and CT (Cell Type), and classifies the table into one of nine table categories. After classification, the cell type values are used within each category to label every cell as an attribute name or an attribute value. For complex tables, additional heuristics and a set of attribute names that commonly occur in 2x2 tables assist recognition, so that tables from a variety of domains can be analyzed. The recognized attribute names and values are then converted into relational tables, which saves memory and makes each record easy to locate. Comparison against manually built validation data indicates that the system improves the accuracy of web table structure recognition. We also analyze the tables whose structures were recognized incorrectly, identify the causes, and suggest future work to handle these cases.
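    The pipeline described in the abstract (type the cells, decide where the attribute names sit, then emit relational records) can be illustrated with a minimal Python sketch. This is not the thesis's method: the record does not list the nine table categories or the actual TSF and CT rules, so the three cell types, the type-uniformity heuristic, and the names `cell_type`, `orientation`, and `to_records` are all assumptions made for illustration. The sketch only distinguishes tables whose attribute names are in the first row from tables whose attribute names are in the first column.

```python
import re

# Hypothetical cell types; the thesis's actual CT values are not given in this record.
NUMERIC, TEXT, EMPTY = "numeric", "text", "empty"

# Matches plain numbers such as "25", "-3.5", "2,600,000" or "12%".
_NUMBER = re.compile(r"^[+-]?\d[\d,]*(\.\d+)?%?$")


def cell_type(cell):
    """Assign a coarse content type to one cell string."""
    s = cell.strip()
    if not s:
        return EMPTY
    return NUMERIC if _NUMBER.match(s) else TEXT


def _uniform(cells):
    """True when all cells share one content type."""
    return len({cell_type(c) for c in cells}) <= 1


def orientation(table):
    """Guess where the attribute names are (assumed heuristic).

    If every column below the first row is type-uniform, treat the first
    row as attribute names. Otherwise, if every row to the right of the
    first column is type-uniform, treat the first column as attribute names.
    """
    if all(_uniform(col) for col in zip(*table[1:])):
        return "header_row"
    if all(_uniform(row[1:]) for row in table):
        return "header_column"
    return "unknown"


def to_records(table):
    """Convert a recognized table into relational-style records."""
    kind = orientation(table)
    if kind == "header_row":
        names, rows = table[0], table[1:]
    elif kind == "header_column":
        names = [row[0] for row in table]
        rows = zip(*(row[1:] for row in table))  # transpose the value cells
    else:
        raise ValueError("table structure not recognized by this sketch")
    return [dict(zip(names, row)) for row in rows]


horizontal = [["City", "Population"],
              ["Taipei", "2,600,000"],
              ["Tainan", "1,870,000"]]
vertical = [["Name", "Alice", "Bob"],
            ["Age", "30", "25"],
            ["City", "Taipei", "Tainan"]]
print(to_records(horizontal))
print(to_records(vertical))
```

    Both example tables yield one dictionary per record, keyed by attribute name, which mirrors the relational presentation mentioned in the abstract. A table in which neither test holds raises an error here; the abstract indicates that the thesis handles such complex cases with further heuristics that this sketch does not attempt to reproduce.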
    Appears in Collections: [Graduate Institute & Department of Information Management] Thesis

    Files in This Item:

    File        Size  Format
    index.html  0Kb   HTML    336

    All items in the institutional repository are protected by copyright, with all rights reserved.

