wsdookadr/femtocrawl

Intro

This is a minimalistic crawler for dynamic pages, accompanied by tools for producing sitemaps and for working with WARC files.

Quick start

Make sure there is a group "docker" on your machine and that the user you’ll run femtocrawl under is added to that group. Also see steps 1-2 in the docker docs:

sudo groupadd docker
sudo usermod -aG docker $USER

Clone femtocrawl and pull the latest femtocrawl image:

git clone https://github.com/wsdookadr/femtocrawl
cd femtocrawl
docker pull wsdookadr/femtocrawl

Create a top-level directory for the crawl. It will hold the entire directory hierarchy for this crawl.

mkdir ~/crawl1

Create the symlinks:

./bin/op.py --symcreate ~/crawl1

Add the urls to be crawled to input/list_urls.txt:

https://lobste.rs
https://news.ycombinator.com
http://google.com
http://mozilla.org

Run the crawl:

./bin/op.py --crawl

After the previous step, you’ll see the following files:

user@garage3:~/zim-bench/femtocrawl$ ls -l warc/
total 1088
-rw-r--r-- 1 user user      84 Aug 22 01:53 1.urls
-rw-r--r-- 1 user user 1109534 Aug 22 01:53 1.warc

At this point, you can check the contents of the WARC using replayweb.page.

Now it’s time for validation:

./bin/op.py --validate

If invalid WARCs are reported, you can investigate further or exclude them by deleting them.
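
For example, if batch 3 were reported as invalid (the batch number is only illustrative), you could drop it together with its URL list:

rm warc/3.warc warc/3.urls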

When all WARCs in the warc/ directory are valid, you can join them all into a single combined WARC:

./bin/op.py --join

At this point you can convert the joined WARC to ZIM:

./bin/op.py --zim

And you can serve the archive locally at http://localhost:8083 like this:

./bin/op.py --kiwix

If you want to do offline searches via bin/warc_query.py, you should also index the data:

./bin/op.py --index

You can also pass several of these switches in a single invocation.
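
For example, a run that crawls, validates, joins, and converts in one invocation might look like this (whether op.py executes combined switches in exactly this order is an assumption on my part):

./bin/op.py --crawl --validate --join --zim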

Roadmap

  • ✓ basic crawling of web pages

  • ✓ warc joining

  • ✓ zim conversion

  • ✓ kiwix integration

  • ✓ warc indexing and search (for html and pdf records)

  • ✓ improve docker image build times. a lot of steps are re-run every time now.

  • ✓ rewrite bin/femtocrawl.sh in python but retain logic, add args including har output switch

  • [50%] add chromium support

  • ✓ the bash version of femtocrawl knew how to pick up where it left off, do the same for the py version

  • ✓ merge validation into crawling, right after the warc has been produced, as a background process independent of the rest.

  • ✓ pdf files don’t render properly. comparing the original with the archived one shows differences in size. the issue is most likely in har_dump.py or har2warc. take the smallest pdf possible and analyze

  • ❏ make it easy to check and handle redownload of broken resources in the already downloaded warc (resources that didn’t have time to complete within the given timeout)

  • ❏ pdf autodownload (when the pdf does not load, but a dialog shows up instead)

  • ❏ archive.org support for downloading an entire website. one of the problems is fixing the old links, which may be invalid, and that may require patching the warc (target issues found here)

  • ❏ build capability to compare har files for the same web page loaded in different browsers. (request completion times, uris of the requests made, response status codes)

  • ❏ find ways to force-terminate firefox without corrupting its profile (currently the cache is completely disabled on purpose). termination should work, the cache should be preserved and reused between batches, but no other data should be.

  • ❏ experiment with and add lossy compression for multimedia openzim/warc2zim#72

  • ❏ design recrawl using a combination of: sitemaps, feeds, common-crawl, etag header & HEAD request

  • ❏ find a way to take a dom snapshot post-rendering, export it as html and use that for indexing. this will help with comment sections, which are often loaded by 3rd-party js and show up as separate warc records. see python-cdp, pdso, dom-snapshot

  • ❏ find ways to derive website templates, use that to locate redundant content and filter it out of the indexing process in order to improve offline search results

  • ❏ integrate blocking lists at proxy level; see also webrecorder/browsertrix-crawler#154

  • ❏ write tests for this entire project by finding some relevant url set, recording the traffic and playing it back with mitmproxy.

  • ❏ investigate usage of latest browser binaries instead of distro packages

  • ❏ the sitemap generation process is too ad-hoc, needs to be generalized and made easy to use

  • ❏ improve docs with as many examples as possible

  • ❏ find a way to predict an optimal timeout for a single page based on previous pages on the same domain or with a sufficiently long common prefix (under the assumption that a long common prefix is a good indicator of similar load time). this will require setting up some actual telemetry between the browser and a separate data store, and might require a browser extension. it will also help in determining an optimal batch timeout.

  • ❏ look more into ff source docs to see if there are possible improvements

Contributing

The focus is on the roadmap; pull requests are welcome.

FAQ

How does it work?

More details about the way it works are in this blog post.

My host user UID/GID don’t match the container UID/GID. What can I do?

For now, just change them in the Dockerfile and rebuild the docker image.
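
A minimal sketch of that workflow, assuming the Dockerfile sits at the repository root and that you rebuild under the same image tag (both assumptions on my part):

id -u    # host UID
id -g    # host GID
# edit the UID/GID values in the Dockerfile to match, then rebuild:
docker build -t wsdookadr/femtocrawl .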

I want to change the browser profile, add extensions or userscripts, how do I do that?

Run the following on the host to get the Firefox profile:

id=$(docker create wsdookadr/femtocrawl:latest)
docker cp $id:/home/user/ff ~/.mozilla/firefox/p1
docker rm -v $id

Start Firefox on the host with firefox --profile ~/.mozilla/firefox/p1. Make any changes you want, close Firefox, zip the profile, place it at data/ff.zip, and rebuild the Docker image.
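
For example, one way to repackage the profile (this assumes ff.zip is expected to contain the profile files at the top level of the archive and that the repository was cloned to ~/femtocrawl; both are assumptions on my part):

cd ~/.mozilla/firefox/p1
zip -r ~/femtocrawl/data/ff.zip .
cd ~/femtocrawl
docker build -t wsdookadr/femtocrawl .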

Note
The default ff profile comes with violentmonkey and uBlock.

I want to crawl a site that requires me to log in

See the previous item.

I have some sites I’d like to crawl, what do I do?

On the host, do the following: place the urls you want crawled in a file, one per line, and run bin/triage_new_links.sh on that file; it will produce two files, with_sitemap.txt and without_sitemap.txt. Add the contents of those to bin/gen_sitemap.sh and run it. This will produce list_urls.txt, which you can use as input for femtocrawl.
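
A sketch of that workflow, assuming triage_new_links.sh takes the file as its argument and that my_sites.txt is a file you created yourself (both assumptions on my part):

printf '%s\n' 'https://example.org' 'https://example.com' > my_sites.txt
./bin/triage_new_links.sh my_sites.txt
# add the contents of with_sitemap.txt / without_sitemap.txt to bin/gen_sitemap.sh, then:
./bin/gen_sitemap.sh
# use the generated list_urls.txt as femtocrawl's input (its output location is assumed to be the current directory):
cp list_urls.txt input/list_urls.txt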

I want to crawl some parts of reddit and read them offline, how do I do that?

Have a look at sitemap_reddit.py

What kind of performance can I expect?

On a 56 Mbps connection, with 10 urls per batch and 29 seconds per batch, you can crawl about 29k urls per day. The CPU usage is minimal.
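
That figure is just arithmetic: 86400 s/day ÷ 29 s/batch ≈ 2980 batches/day, and 2980 batches × 10 urls ≈ 29,800 urls/day.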

I want to read offline a website archived by archive.org. What do I do?

Coming soon.

Some links will be added to the input list. Delete the last batch to make sure no links are missed:

rm warc/$(ls -tr warc/ | tail -1)

Suppose you’ve crawled a forum, but urls containing /attachment were not fetched and you want those too. Run the following to extract those links from the archives and re-run the crawl:

find warc/ -name "*.warc" | xargs -I{} ./bin/warc_resources.py --infile {} --links | grep "/attachment" | sort | uniq >> input/list_urls.txt
./bin/op.py --crawl

What do I use this for?

Use-cases:

  • building offline web archives

  • website testing

  • cross-testing different web archiving tools

  • long-term news archiving

  • building web corpora