✂ Screen scraping utility for Java using annotations

umjammer/vavi-util-screenscraping

Screen Scraping Library for Java

🌏 Scrape the world!

Introduction

This library scrapes data from HTML pages and injects it into POJO fields using annotations.

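    // The page to fetch and scrape
    @WebScraper(url = "http://foo.com/bar.html")
    public class Baz {
        // Each @Target holds the XPath expression whose matches fill the field
        @Target(value = "//TABLE//TR/TD[2]/DIV/text()")
        String artist;
        @Target(value = "//TABLE//TR/TD[4]/A/text()")
        String title;
        @Target(value = "//TABLE//TR/TD[4]/A/@href")
        String url;
    }
    
    :
    
    // Returns the scraped results as a list of Baz
    List<Baz> bazs = WebScraper.Util.scrape(Baz.class);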
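
To make the annotation-driven idea concrete, here is a minimal sketch of how XPath results could be injected into annotated fields using only the JDK. It is not this library's implementation: the `XPathField` annotation, the `Song` class, and the `scrape` helper are hypothetical names invented for this illustration, and the input is assumed to be well-formed XML held in a string rather than a fetched HTML page.

```java
import java.io.StringReader;
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.reflect.Field;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class XPathInjectionSketch {

    /** Hypothetical stand-in for the library's @Target; not part of the library. */
    @Retention(RetentionPolicy.RUNTIME)
    @java.lang.annotation.Target(ElementType.FIELD)
    @interface XPathField {
        String value();
    }

    /** Hypothetical POJO using the same XPath expressions as Baz. */
    public static class Song {
        @XPathField("//TABLE//TR/TD[2]/DIV/text()")
        public String artist;
        @XPathField("//TABLE//TR/TD[4]/A/text()")
        public String title;
        @XPathField("//TABLE//TR/TD[4]/A/@href")
        public String url;
    }

    /** Evaluates each field's XPath; the i-th match is stored in the i-th result object. */
    public static <T> List<T> scrape(Class<T> type, String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            XPath xpath = XPathFactory.newInstance().newXPath();
            List<T> results = new ArrayList<>();
            for (Field field : type.getDeclaredFields()) {
                XPathField target = field.getAnnotation(XPathField.class);
                if (target == null) {
                    continue; // skip fields without the annotation
                }
                field.setAccessible(true);
                NodeList nodes = (NodeList) xpath.evaluate(
                        target.value(), doc, XPathConstants.NODESET);
                for (int i = 0; i < nodes.getLength(); i++) {
                    if (results.size() <= i) {
                        results.add(type.getDeclaredConstructor().newInstance());
                    }
                    // Text and attribute nodes both expose their content as the node value
                    field.set(results.get(i), nodes.item(i).getNodeValue());
                }
            }
            return results;
        } catch (Exception e) {
            throw new IllegalStateException("scraping failed", e);
        }
    }

    public static void main(String[] args) {
        String xml = "<TABLE>"
                + "<TR><TD>1</TD><TD><DIV>Artist A</DIV></TD><TD>-</TD>"
                + "<TD><A href=\"http://foo.com/a.html\">Title A</A></TD></TR>"
                + "<TR><TD>2</TD><TD><DIV>Artist B</DIV></TD><TD>-</TD>"
                + "<TD><A href=\"http://foo.com/b.html\">Title B</A></TD></TR>"
                + "</TABLE>";
        for (Song s : scrape(Song.class, xml)) {
            System.out.println(s.artist + " / " + s.title + " / " + s.url);
        }
    }
}
```

Pairing matches by position keeps the sketch short, but it assumes every row yields a match for every field; a row with a missing cell would shift later values into the wrong object.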

Install

Details

  • InputHandler ... applies any preprocessing to the input before parsing

  • Parser

    • XPathParser ... the default
    • HtmlXPathParser ... for HTML (the library's original purpose)
    • SaxonXPathParser ... for huge XML files
    • JsonPathParser ... for JSON responses
  • Parser#foreach() ... processes results one at a time, like a Java collection stream

Sample

TODO

  • Tidy version
  • deleted garbled text
  • InputHandler w/o cache
  • argument injection into WebScraper#url
        @WebScraper(url = "http://foo.com?bar={bar}")
        public static class Result {
            :
    
        List<Result> data = WebScraper.Util.scrape(Result.class, @UrlParam(bar) args[0]);
    
  • JSON parser
  • CSS selector
  • integrate serdes
  • @WebScraper#encoding()
  • @Target: add an exception handler, or second and third fallback options
  • xml2xpath
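
For the "argument injection into WebScraper#url" item above, a rough sketch of how `{name}` placeholders in a URL template might be expanded is shown here. This is only an illustration of the idea, not a planned or existing API: `UrlTemplateSketch` and `expand` are hypothetical names, and the sketch assumes Java 10 or later.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class UrlTemplateSketch {

    // Matches placeholders such as {bar}
    private static final Pattern PLACEHOLDER = Pattern.compile("\\{(\\w+)\\}");

    /** Replaces each {name} in the template with the URL-encoded value from params. */
    public static String expand(String template, Map<String, String> params) {
        Matcher m = PLACEHOLDER.matcher(template);
        StringBuilder sb = new StringBuilder();
        while (m.find()) {
            String name = m.group(1);
            String value = params.get(name);
            if (value == null) {
                throw new IllegalArgumentException("no value for {" + name + "}");
            }
            // quoteReplacement keeps '$' and '\' in the value from being treated specially
            m.appendReplacement(sb, Matcher.quoteReplacement(
                    URLEncoder.encode(value, StandardCharsets.UTF_8)));
        }
        m.appendTail(sb);
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(expand("http://foo.com?bar={bar}", Map.of("bar", "a b")));
    }
}
```

Failing fast on a missing parameter is one possible choice; leaving the placeholder untouched would be another, depending on how strict the annotation should be.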