Web Crawler

Prerequisites

Java 8
Maven

Libraries Used

Jsoup-1.11.3
Gson-2.1.1

Overview

The web crawler crawls a given URL asynchronously. It requires the following input parameters:

URL : the URL at which crawling starts
domain : a page can link to many other sites, so only URLs within the given domain are followed
depth : the number of levels of links followed recursively from the starting URL
breadth : the limit on the number of URLs taken from each page
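As an illustration of how the four parameters fit together, here is a minimal sketch of a value object that validates them up front. The class name `CrawlRequest`, its field names, and the validation rules are assumptions made for this example; they are not taken from the project.

```java
// Hypothetical parameter holder; the class and field names are illustrative, not from the project.
final class CrawlRequest {
    final String url;     // starting URL
    final String domain;  // only links within this domain are followed
    final int depth;      // how many levels of links to follow
    final int breadth;    // maximum number of links taken from each page

    CrawlRequest(String url, String domain, int depth, int breadth) {
        if (url == null || url.isEmpty()) {
            throw new IllegalArgumentException("url must not be empty");
        }
        if (domain == null || domain.isEmpty()) {
            throw new IllegalArgumentException("domain must not be empty");
        }
        if (depth < 0 || breadth < 1) {
            throw new IllegalArgumentException("depth must be >= 0 and breadth >= 1");
        }
        this.url = url;
        this.domain = domain;
        this.depth = depth;
        this.breadth = breadth;
    }
}
```

Rejecting bad values at construction time keeps the recursive crawl itself free of argument checks.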

How to build & test

mvn clean compile : builds the project
mvn test : runs the test cases for AyncWebCrawler

Code Walkthrough

AyncWebCrawler is the singleton class responsible for crawling:

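// Note: the instance field must be declared volatile for double-checked locking to be safe.
public static AyncWebCrawler instance(String domain){
    if(instance == null) {
        synchronized (AyncWebCrawler.class) {
            // Re-check inside the lock so two racing threads cannot both create an instance.
            if(instance == null) {
                instance = new AyncWebCrawler(domain);
            }
        }
    }
    return instance;
}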
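The same double-checked locking pattern can be sketched as a self-contained class. `DomainCrawler` and `getDomain()` are hypothetical stand-ins for the real crawler, and the `volatile` field is an assumption about how the instance is declared.

```java
// Stand-in for the crawler singleton; DomainCrawler and getDomain() are hypothetical names.
final class DomainCrawler {
    // volatile ensures a thread that sees a non-null reference also sees a fully constructed object
    private static volatile DomainCrawler instance;
    private final String domain;

    private DomainCrawler(String domain) {
        this.domain = domain;
    }

    static DomainCrawler instance(String domain) {
        DomainCrawler local = instance;       // single volatile read on the fast path
        if (local == null) {
            synchronized (DomainCrawler.class) {
                local = instance;             // re-check while holding the lock
                if (local == null) {
                    local = new DomainCrawler(domain);
                    instance = local;
                }
            }
        }
        return local;
    }

    String getDomain() {
        return domain;
    }
}
```

One consequence of a singleton that takes a parameter: the domain passed on the first call wins, and later calls with a different domain return the same instance.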

The crawl method is a recursive function that uses CompletableFuture and the common ForkJoinPool:

supplyAsync(content(startingUrl))
    .thenApply(fetchUrls(domain))
    .thenApply(doForEach(depth))
    .thenApply(futures -> futures.toArray(CompletableFuture[]::new))
    // thenCompose (rather than thenAccept) makes join() wait for the combined future from allOf
    .thenCompose(CompletableFuture::allOf).join();
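To make the recursion in the chain above easier to follow, here is a minimal sketch that crawls an in-memory link graph instead of real pages. `LinkGraphCrawler`, the `pages` map, and the use of a concurrent set for visited URLs are assumptions for illustration only; the project fetches real pages and may track visited URLs differently.

```java
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;

// Illustrative only: the in-memory "pages" map stands in for real page fetches.
final class LinkGraphCrawler {
    private final Map<String, List<String>> pages;                      // url -> outgoing links
    private final Set<String> visited = ConcurrentHashMap.newKeySet();  // thread-safe visited set

    LinkGraphCrawler(Map<String, List<String>> pages) {
        this.pages = pages;
    }

    // Returns a future that completes once url and everything reachable within depth is visited.
    CompletableFuture<Void> crawl(String url, int depth) {
        // add() returns false when another task has already claimed this URL
        if (depth < 0 || !visited.add(url)) {
            return CompletableFuture.completedFuture(null);
        }
        return CompletableFuture
                .supplyAsync(() -> pages.getOrDefault(url, Collections.emptyList()))
                .thenCompose(links -> CompletableFuture.allOf(
                        links.stream()
                                .map(link -> crawl(link, depth - 1))
                                .toArray(CompletableFuture[]::new)));
    }

    Set<String> visited() {
        return visited;
    }
}
```

Calling `crawl(start, depth).join()` once at the top level blocks until the whole tree of futures has completed. No task inside the pool ever blocks on another, which avoids starving the common ForkJoinPool.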

It uses a visited queue to track the URLs that have already been visited at that depth.
URL fetching uses the Jsoup library to get the page content and the URLs on the page:

private Function<Document, Set<String>> fetchUrls(String domain) {
  	return doc -> {
  		return doc != null ? doc.select("a[href]").stream()
  				.map(link -> link.attr("abs:href"))
  				.filter(url -> url != null && url.contains(domain))
  				.peek(System.out::println)
  				.collect(toSet()) : new HashSet<>();
  	};
  }
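Note that `url.contains(domain)` also matches URLs where the domain only appears in the path or query string. As one possible alternative, the sketch below compares the parsed host instead and applies the breadth limit in the same pass. `LinkFilter`, `sameDomain`, and `select` are hypothetical names, not part of the project.

```java
import java.net.URI;
import java.net.URISyntaxException;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;
import java.util.stream.Collectors;

// Hypothetical helper: filters by parsed host instead of a substring match.
final class LinkFilter {
    // True if the URL's host equals domain or is a subdomain of it.
    static boolean sameDomain(String url, String domain) {
        try {
            String host = new URI(url).getHost();
            if (host == null) {
                return false; // e.g. mailto: links have no host
            }
            host = host.toLowerCase(Locale.ROOT);
            String d = domain.toLowerCase(Locale.ROOT);
            return host.equals(d) || host.endsWith("." + d);
        } catch (URISyntaxException e) {
            return false; // malformed links are skipped
        }
    }

    // Keeps at most breadth distinct in-domain links, in page order.
    static Set<String> select(List<String> links, String domain, int breadth) {
        return links.stream()
                .filter(url -> sameDomain(url, domain))
                .distinct()
                .limit(breadth)
                .collect(Collectors.toCollection(LinkedHashSet::new));
    }
}
```

With this check, `https://blog.example.com/x` is accepted for the domain `example.com`, while `https://evil.org/?ref=example.com` is rejected.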

Code snippet to crawl and get the response for the given URL:

ResponseVo responseVo = crawler.respone(domain);

ResponseVo contains the domain name, the count of unique URLs found, and the list of those URLs:

public class ResponseVo {
  
  private String domain;
  private int count;
  private List<String> urls;
}
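One way such a response could be assembled from a set of visited URLs is sketched below. `CrawlResponse`, its `from` factory, and the getters are assumptions for illustration; the project's `ResponseVo` may be populated differently.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Set;

// Hypothetical variant of the response object with a factory method.
final class CrawlResponse {
    private final String domain;
    private final int count;
    private final List<String> urls;

    private CrawlResponse(String domain, List<String> urls) {
        this.domain = domain;
        this.urls = urls;
        this.count = urls.size(); // derived, so count can never disagree with the list
    }

    static CrawlResponse from(String domain, Set<String> visited) {
        List<String> sorted = new ArrayList<>(visited);
        Collections.sort(sorted); // a fixed order makes responses easy to compare in tests
        return new CrawlResponse(domain, Collections.unmodifiableList(sorted));
    }

    String getDomain() { return domain; }
    int getCount() { return count; }
    List<String> getUrls() { return urls; }
}
```

Deriving `count` from the list avoids having to keep two fields in sync.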
