An automatic web data table structure recognition system

Please use this permanent URL to cite or link to this document: https://tkuir.lib.tku.edu.tw/dspace/handle/987654321/74422

Title:	一個自動化網頁資料表格結構辨識系統
Other title:	An automatic web data table structure recognition system
Author:	陳雅伶; Chen, Ya-Ling
Contributors:	Master's Program, Department of Information Management, Tamkang University; 周清江; Jou, Chi-chang
Keywords:	table structure; information extraction; table mining; web mining
Date:	2011
Uploaded:	2011-12-28 18:36:30 (UTC+8)
Abstract:	Many methods have been proposed for recognizing table structure in order to extract important information from web tables. They work well on simple tables, but on complex tables recognition accuracy often drops because cell-similarity comparison is inadequate or too little related table data is available. This thesis designs and implements an automatic web data table structure recognition system to address this problem. The system first uses heuristics to analyze table structure features (TSF) and cell content types (CT), and classifies each table into one of nine table types. Within each type, the cell content type values are then used to separate attribute names from attribute values. For complex tables, additional heuristics and a set of attribute names common in 2x2 tables assist recognition, so that tables from a variety of domains can be analyzed correctly. To save memory and to make each data record simple and clear to locate, the recognized attribute names and attribute values are converted into relational table form. Comparison against manually built validation data shows that the system improves the accuracy of web table structure recognition. Finally, the wrongly recognized tables are analyzed to identify the causes of error and to propose follow-up measures.
Appears in collections:	[Department and Graduate Institute of Information Management] Theses and Dissertations
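
The pipeline outlined in the abstract (assign a content type to each cell, separate attribute names from attribute values, then emit relational records) can be illustrated with a minimal Python sketch. This is not the thesis's implementation: this record does not list the nine table types or the actual heuristics, so the type labels, the `cell_type`, `looks_like_header`, and `to_records` helpers, and the "header row is all text" rule are all assumptions made for illustration, and only the simplest case (one header row above the data rows) is handled.

```python
import re

# Hypothetical cell-type labels; the thesis's actual CT values are not given here.
NUMERIC = "numeric"
TEXT = "text"
EMPTY = "empty"

# Matches plain numbers such as "12", "-3.5", "1,200" or "45%".
_NUMBER = re.compile(r"^[+-]?\d[\d,]*(\.\d+)?%?$")


def cell_type(cell):
    """Assign a coarse content type to one cell string."""
    s = cell.strip()
    if not s:
        return EMPTY
    if _NUMBER.match(s):
        return NUMERIC
    return TEXT


def looks_like_header(row):
    """Assumed heuristic: a row of attribute names holds only text cells."""
    return all(cell_type(c) == TEXT for c in row)


def to_records(table):
    """Turn a table with one header row into relational-style records.

    Returns None when the first row does not look like attribute names,
    or when no later row contains a non-text cell.
    """
    if len(table) < 2 or not looks_like_header(table[0]):
        return None
    header, body = table[0], table[1:]
    if all(looks_like_header(row) for row in body):
        return None
    # Rows whose length differs from the header are skipped.
    return [dict(zip(header, row)) for row in body if len(row) == len(header)]


if __name__ == "__main__":
    # Made-up sample data, for illustration only.
    sample = [["Item", "Price"], ["Pen", "12"], ["Book", "350"]]
    print(to_records(sample))
```

Each record is keyed by attribute name, which mirrors the idea of presenting attribute names and values in relational form. A table with a header column, nested headers, or merged cells would need its own rules, which is presumably why the thesis classifies tables into types first.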

Files in this item:

File	Size	Format	Views
index.html	0Kb	HTML	536

All items in the institutional repository are protected by their original copyright.
