RC Version 7.0 © Powered By DSPACE, MIT. Enhanced by NTU Library & TKU Library IR team.
    Please use this permanent URL to cite or link to this document: https://tkuir.lib.tku.edu.tw/dspace/handle/987654321/74422


    Title: 一個自動化網頁資料表格結構辨識系統
    Alternative title: An automatic web data table structure recognition system
    Author: 陳雅伶; Chen, Ya-Ling
    Contributors: Master's Program, Department of Information Management, Tamkang University (淡江大學資訊管理學系碩士班)
    周清江; Jou, Chi-chang
    Keywords: table structure; information extraction; table mining; web mining
    Date: 2011
    Uploaded: 2011-12-28 18:36:30 (UTC+8)
    Abstract: Many techniques have been proposed to extract important information from web tables. They work well on simple tables, but on complex tables their accuracy is often unsatisfactory, because similarity comparison among table cells is inadequate or too little related table information is collected. We design and implement an automatic web data table structure recognition system to tackle this problem. The system first uses heuristics to analyze the TSF (Table Structure Feature) and CT (Cell Type) of each web data table and classifies the table into one of nine table categories. After classification, the cell type values are used within each category to identify every cell as an attribute name or an attribute value. For complex tables, additional heuristics and the attribute names commonly found in 2x2 tables assist recognition, so that tables from a wide range of domains can be analyzed correctly. The recognized attribute names and attribute values are then converted into relational tables, which saves memory and makes each record easy to locate.
    We evaluate the system against manually built validation data, and the results indicate that it improves the accuracy of web table structure recognition. We also analyze the tables whose structures were recognized incorrectly, identify the causes, and suggest future developments to handle these cases.
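    The pipeline described in the abstract (assign a type to each cell, decide where the attribute names sit, then emit relational records) can be illustrated with a minimal Python sketch. It is not the thesis's implementation: the three cell types, the two layouts, and the names `cell_type`, `detect_orientation`, and `to_records` are hypothetical, and the abstract does not specify the nine table categories or the actual heuristics. The sketch also assumes a rectangular table with no merged cells.

```python
import re

# Hypothetical cell types; the thesis's actual CT values are not given in the abstract.
TEXT, NUMERIC, EMPTY = "text", "numeric", "empty"

_NUMERIC_RE = re.compile(r"^[+-]?\d[\d,]*(\.\d+)?%?$")


def cell_type(cell):
    """Assign a coarse content type to one cell string."""
    s = cell.strip()
    if not s:
        return EMPTY
    return NUMERIC if _NUMERIC_RE.match(s) else TEXT


def _consistent(groups):
    """True when every group of cell types holds at most one distinct type."""
    return all(len(set(g)) <= 1 for g in groups)


def detect_orientation(table):
    """Guess whether attribute names sit in the first row or the first column.

    Assumed heuristic: the header line is all text, and the value cells are
    type-consistent along the other axis.
    """
    if not table or not table[0]:
        return "unknown"
    types = [[cell_type(c) for c in row] for row in table]
    body_cols = [[row[j] for row in types[1:]] for j in range(len(types[0]))]
    body_rows = [row[1:] for row in types]
    if all(t == TEXT for t in types[0]) and _consistent(body_cols):
        return "attributes_in_first_row"
    if all(row[0] == TEXT for row in types) and _consistent(body_rows):
        return "attributes_in_first_column"
    return "unknown"


def to_records(table):
    """Convert a recognized table into relational records (one dict per row)."""
    orientation = detect_orientation(table)
    if orientation == "unknown":
        raise ValueError("table layout not recognized by this sketch")
    if orientation == "attributes_in_first_column":
        table = [list(col) for col in zip(*table)]  # transpose so names come first
    header, *rows = table
    return [dict(zip(header, row)) for row in rows]


by_row = [["Name", "Price", "Origin"],
          ["Apple", "1.20", "Chile"],
          ["Pear", "0.80", "Spain"]]
by_col = [["Name", "Apple", "Pear"],
          ["Price", "1.20", "0.80"],
          ["Origin", "Chile", "Spain"]]
print(to_records(by_row) == to_records(by_col))
```

    Both sample tables carry the same data in different layouts, so the sketch yields the same list of records for each. Tables with merged cells, nested headers, or ambiguous types would need the richer category-specific rules the abstract alludes to.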
    Appears in collections: [Department of Information Management and Graduate Institute] Theses and dissertations

    Files in this item:

    File          Size    Format    Views
    index.html    0Kb     HTML      336

    All items in the institutional repository are protected by their original copyright.
