Data hidden in the deep web are of much higher quality than data in the surface web. To obtain deep web data, Internet users must fill in query conditions in an HTML query interface and click the submit button. Unfortunately, the data from a single deep web site are usually insufficient, so users often need to integrate information from several sites. Filling in forms on many web sites by hand and collecting their query results is time-consuming. An integrated deep web query interface can alleviate this burden. One of the key technologies in building such an integrated query interface is schema matching and merging. Previous solutions usually perform schema matching and merging as separate steps in a holistic approach that relies on statistical information about the attributes of the schemas involved. That approach does not take users’ preferences among web sites into account. We propose a new deep web query interface integration (DWQII) methodology based on incremental schema matching and merging. Our matching method is based on the string similarity and synonyms of labels. Besides schema matching and merging, our system also automatically transforms query conditions from the integrated query interface into conditions suitable for the individual web sites. A benefit of our methodology is that new deep web query interfaces can easily be added to a previously established integrated query interface. We design and implement DWQII using an object-oriented approach. To test DWQII, we integrate nine search interfaces in the books domain. These web sites, which include Amazon, eBay, and other popular sites, are collected from the open directory dmoz.org. We also conduct query experiments using our integrated query interface to verify the feasibility and measure the performance of the methodology.
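As a rough illustration of label matching by string similarity and synonyms, combined with incremental merging, the following Python sketch may help. It is not the DWQII implementation: the synonym groups, the 0.8 similarity threshold, the site names, and the functions `normalize`, `labels_match`, and `merge_interface` are all assumptions made for this example, and the standard-library `difflib.SequenceMatcher` stands in for whatever similarity measure the system actually uses.

```python
from difflib import SequenceMatcher

# Hypothetical synonym groups for the books domain (assumed for illustration).
SYNONYMS = [
    {"author", "writer"},
    {"title", "book title"},
    {"isbn", "isbn number"},
]


def normalize(label):
    """Lowercase a label, drop colons, and collapse whitespace."""
    return " ".join(label.lower().replace(":", " ").split())


def labels_match(a, b, threshold=0.8):
    """Return True if two labels are equal, synonyms, or similar strings.

    The threshold value is an assumption, not taken from the paper.
    """
    a, b = normalize(a), normalize(b)
    if a == b:
        return True
    if any(a in group and b in group for group in SYNONYMS):
        return True
    return SequenceMatcher(None, a, b).ratio() >= threshold


def merge_interface(integrated, site, labels):
    """Incrementally merge one site's labels into the integrated schema.

    `integrated` maps each integrated label to {site: original label}, so a
    query condition on an integrated label can later be rewritten per site.
    """
    for label in labels:
        for canonical in integrated:
            if labels_match(canonical, label):
                integrated[canonical][site] = label
                break
        else:
            # No existing attribute matched: add a new integrated attribute.
            integrated[normalize(label)] = {site: label}
    return integrated


integrated = {}
merge_interface(integrated, "site_a", ["Title", "Author", "ISBN"])
merge_interface(integrated, "site_b", ["Book Title:", "Writer", "Publisher"])
```

In this sketch, adding a further site only requires another `merge_interface` call against the existing dictionary, which mirrors the incremental idea in the abstract; the per-site mapping stored under each integrated label is one simple way to translate a query condition back to each site's own field.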
Proceedings of The 3rd Multidisciplinary International Social Networks Conference on SocialInformatics 2016, Data Science 2016