mantask/thesis-wrapper

Summary

The web contains semi-structured information in HTML. Because web pages receive constant user interface updates, extracting structured data from them requires a method that does not break when the layout changes. This project addresses building a robust and fast record-level wrapper from a single annotated web page. Current state-of-the-art methods include mining data regions to recognize template-generated areas on a page, probabilistic wrapper induction to extract data robustly from a single data region, and partial tree alignment to repeatedly extract data from multiple regions. In this thesis, we combine these three ideas into a new method and design a system for robust data extraction. Experimental results on a large number of web pages from multiple domains show that the proposed approach achieves high precision within reasonable execution time on commodity hardware.
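
To make the idea of a data region concrete, the following is a minimal sketch and not the method implemented in this project. It assumes well-formed XHTML input and treats a parent node as a data region when at least two of its children share the same structural signature. The names `signature`, `find_data_regions`, `extract_records`, and the sample `PAGE` are hypothetical and introduced only for illustration. Real data-region mining tolerates variation between records, which this exact-match simplification does not.

```python
import xml.etree.ElementTree as ET
from collections import Counter


def signature(node):
    """Structural signature: the node's tag plus its children's signatures."""
    return "(" + node.tag + "".join(signature(child) for child in node) + ")"


def find_data_regions(root, min_records=2):
    """Return (parent, records) pairs where at least `min_records`
    children of `parent` share the same structural signature."""
    regions = []
    for parent in root.iter():
        counts = Counter(signature(child) for child in parent)
        if not counts:
            continue  # leaf node, nothing to compare
        sig, n = counts.most_common(1)[0]
        if n >= min_records:
            records = [child for child in parent if signature(child) == sig]
            regions.append((parent, records))
    return regions


def extract_records(page):
    """Extract the text of each field of each record in every detected region."""
    root = ET.fromstring(page.strip())
    return [
        ["".join(field.itertext()).strip() for field in record]
        for _, records in find_data_regions(root)
        for record in records
    ]


# Hypothetical sample page: one template-generated list of three records.
PAGE = """
<html><body>
  <h1>Books</h1>
  <ul>
    <li><b>Title A</b><i>10 EUR</i></li>
    <li><b>Title B</b><i>12 EUR</i></li>
    <li><b>Title C</b><i>9 EUR</i></li>
  </ul>
  <p>Footer</p>
</body></html>
"""

if __name__ == "__main__":
    for record in extract_records(PAGE):
        print(record)
```

Because the signature ignores text, class names, and absolute position, cosmetic changes to the page leave the detected region unchanged in this toy setting, which loosely illustrates the kind of robustness the summary describes.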

Refer to status updates for more information about the project.

About

An experimental setup for my master's thesis "Optimizing web extraction queries for robustness"