This is a minimalistic dynamic page crawler accompanied by a variety of tools used to produce sitemaps and some tools to deal with WARC files.
Make sure there is a group "docker" on your machine and that the user you’ll run femtocrawl under is added to that group. Also see steps 1-2 in the docker docs:
sudo groupadd docker
sudo usermod -aG docker $USER
Clone femtocrawl and pull the latest femtocrawl image:
git clone https://github.com/wsdookadr/femtocrawl
cd femtocrawl
docker pull wsdookadr/femtocrawl
Create a top-level directory for the crawl.
It will hold the entire directory hierarchy for the crawl.
mkdir ~/crawl1
Create the symlinks:
./bin/op.py --symcreate ~/crawl1
Add the URLs to be crawled to input/list_urls.txt:
https://lobste.rs
https://news.ycombinator.com
http://google.com
http://mozilla.org
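If the URL list is assembled programmatically, a small helper can keep it clean. The following is a minimal sketch, not part of femtocrawl: `write_url_list` is a hypothetical name, and it assumes the crawler expects one URL per line in input/list_urls.txt.

```python
from pathlib import Path


def write_url_list(urls, path):
    """Write unique, non-empty URLs to `path`, one per line, in first-seen order.

    Hypothetical helper; assumes a one-URL-per-line input format.
    """
    seen = set()
    cleaned = []
    for url in urls:
        url = url.strip()
        if url and url not in seen:
            seen.add(url)
            cleaned.append(url)
    path = Path(path)
    # Create the parent directory (e.g. input/) if it does not exist yet.
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text("\n".join(cleaned) + "\n", encoding="utf-8")
    return cleaned
```

For example, `write_url_list(["https://lobste.rs", "http://mozilla.org"], "input/list_urls.txt")` writes both URLs and returns the cleaned list.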
Run the crawl:
./bin/op.py --crawl
After the previous step, you’ll see the following files:
user@garage3:~/zim-bench/femtocrawl$ ls -l warc/
total 1088
-rw-r--r-- 1 user user      84 Aug 22 01:53 1.urls
-rw-r--r-- 1 user user 1109534 Aug 22 01:53 1.warc
At this point, you can check the contents of the WARC using replayweb.page
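For a rough look at a WARC without a replay tool, the records can be counted by type. This is a minimal sketch using only the standard library, under the assumption that the file is an uncompressed WARC (as the `.warc` extension suggests). The names `iter_warc_records` and `summarize_warc` are hypothetical, and this is not what `--validate` does.

```python
import io
from collections import Counter


def iter_warc_records(stream):
    """Yield (headers, payload) for each record in an uncompressed WARC stream."""
    while True:
        line = stream.readline()
        if not line:
            return  # end of file
        if not line.strip():
            continue  # blank separator lines between records
        if not line.startswith(b"WARC/"):
            raise ValueError(f"unexpected line: {line[:40]!r}")
        headers = {}
        while True:
            line = stream.readline()
            if line in (b"\r\n", b"\n", b""):
                break  # a blank line ends the header block
            name, _, value = line.decode("utf-8", "replace").partition(":")
            headers[name.strip().lower()] = value.strip()
        # The payload length is given by the Content-Length header.
        length = int(headers.get("content-length", "0"))
        payload = stream.read(length)
        yield headers, payload


def summarize_warc(stream):
    """Count records by their WARC-Type header."""
    return Counter(h.get("warc-type", "unknown") for h, _ in iter_warc_records(stream))
```

Usage, assuming the file from the listing: `with open("warc/1.warc", "rb") as f: print(summarize_warc(f))`.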
Next, validate the WARCs:
./bin/op.py --validate
If invalid WARCs are reported, you can investigate further or exclude them by deleting them.
When all WARCs in the warc/ directory are valid, you can join them into warc/big.zim:
./bin/op.py --join
At this point you can convert to ZIM
./bin/op.py --zim
And you can serve the archive locally at http://localhost:8083 like this:
./bin/op.py --kiwix
If you want to do offline searches via bin/warc_query.py, then you should also index the data:
./bin/op.py --index
You can also combine several of these switches in a single invocation.
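When driving femtocrawl from another script, a combined invocation can be built as an argument list. The sketch below is an assumption-laden illustration: `build_op_command` and `run_stages` are hypothetical names, only the argument-free switches shown in this README are listed, and the order in which `op.py` processes combined switches is not specified here.

```python
import subprocess

# Argument-free switches that appear in this README.
KNOWN_SWITCHES = ("crawl", "validate", "join", "zim", "index", "kiwix")


def build_op_command(stages, script="./bin/op.py"):
    """Return the argv list for one op.py call with several switches."""
    unknown = [s for s in stages if s not in KNOWN_SWITCHES]
    if unknown:
        raise ValueError(f"unknown switches: {unknown}")
    return [script] + [f"--{s}" for s in stages]


def run_stages(stages):
    """Run op.py once with all requested switches; raises if it exits non-zero."""
    subprocess.run(build_op_command(stages), check=True)
```

For example, `run_stages(["crawl", "validate"])` corresponds to `./bin/op.py --crawl --validate`.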
- ✓ basic crawling of web pages
- ✓ warc joining
- ✓ zim conversion
- ✓ kiwix integration
- ✓ warc indexing and search (for html and pdf records)
- ✓ improve docker image build times. a lot of steps are re-run every time now.
- ✓ rewrite bin/femtocrawl.sh in python but retain logic, add args including har output switch
- [50%] add chromium support
- ✓ the bash version of femtocrawl knew where to pick up where it left off, do the same for the py version
- ✓ merge validation into crawling, right after the warc has been produced, as a background process independent of the rest.
- ✓ pdf files don’t render properly. comparing the original with the archived one shows differences in size. the issue is most likely in har_dump.py or har2warc. take the smallest pdf possible and analyze
- ❏ make it easy to check and handle redownload of broken resources in the already downloaded warc (resources that didn’t have time to complete within the given timeout)
- ❏ pdf autodownload (when the pdf does not load, but a dialog shows up instead)
- ❏ archive.org support for an entire website to be downloaded. one of the problems is fixing the old links which may be invalid and that may require patching the warc (target issues found here)
- ❏ build capability to compare har files for the same web page loaded in different browsers. (request completion times, uris of the requests made, response status codes)
- ❏ find ways to force terminate firefox without corrupting its profile (currently cache is completely disabled on purpose). so termination should work, but the cache should be preserved and reused between batches but no other data should.
- ❏ experiment with and add lossy multimedia compression (see openzim/warc2zim#72)
- ❏ design recrawl using a combination of: sitemaps, feeds, common-crawl, etag header & HEAD request
- ❏ find a way to take a dom snapshot post-rendering, export it as html and use that for indexing. this will help with comment sections which are many times loaded by 3rd party js and show up as separate warc records. see python-cdp, pdso, dom-snapshot
- ❏ find ways to derive website templates, use that to locate redundant content and filter it out of the indexing process in order to improve offline search results
- ❏ integrate blocking lists at proxy-level see also webrecorder/browsertrix-crawler#154
- ❏ write tests for this entire project by finding some relevant url set, recording the traffic and playing it back with mitmproxy.
- ❏ investigate usage of latest browser binaries instead of distro packages
- ❏ the sitemap generation process is too ad-hoc, needs to be generalized and made easy to use
- ❏ improve docs with as many examples as possible
- ❏ find a way to predict an optimal timeout for a single page based on previous pages on the same domain or same common prefix that is long enough (under the assumption that a long common prefix is a good indicator of similar load time). will require setting up some actual telemetry between the browser and some separate data store, might require some browser extension. this will also help in determining an optimal batch timeout.
- ❏ look more into ff source docs to see if there are possible improvements
More details about the way it works are in this blog post.
For now, just change them in the Dockerfile and rebuild the docker image.
Run the following on the host to get the Firefox profile:
id=$(docker create wsdookadr/femtocrawl:latest)
docker cp $id:/home/user/ff ~/.mozilla/firefox/p1
docker rm -v $id
Start Firefox on the host with firefox --profile ~/.mozilla/firefox/p1.
Make any changes you want to the profile, close Firefox, zip the profile, place it in data/ff.zip, and rebuild the Docker image.
Note: The default Firefox profile comes with Violentmonkey and uBlock.
On the host, do the following: place the URLs you want crawled in a file, one per line, and run bin/triage_new_links.sh on that file. This produces two files, with_sitemap.txt and without_sitemap.txt. Now add the contents of those to bin/gen_sitemap.sh and run it. This will produce list_urls.txt, which you can use as input for femtocrawl.
Have a look at sitemap_reddit.py
On a 56 Mbps connection, with 10 URLs per batch and 29 seconds per batch, you can crawl about 29k URLs per day (86,400 s / 29 s ≈ 2,979 batches of 10 URLs each). CPU usage is minimal.
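The daily figure can be re-derived for other batch settings. A small sketch follows, assuming batches run back to back with no overhead between them and that the "10 urls" figure is the batch size. `urls_per_day` is a hypothetical helper, not part of femtocrawl.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86400


def urls_per_day(urls_per_batch, seconds_per_batch):
    """Upper bound on URLs crawled per day when batches run back to back."""
    batches_per_day = SECONDS_PER_DAY // seconds_per_batch
    return batches_per_day * urls_per_batch


print(urls_per_day(10, 29))  # 29790
```

Doubling the batch timeout to 58 seconds with the same batch size halves the bound to 14,890 URLs per day.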
Some links will be added to the input list. Delete the last batch to make sure no links will be missed.
rm warc/$(ls -tr warc/ | tail -1)
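To inspect what that command would remove before running it, the newest entry can be looked up from Python. This is a sketch with hypothetical helper names. It mirrors `ls -tr | tail -1` by picking the entry with the latest modification time, and it also lists files sharing the same batch number (such as `1.urls` and `1.warc` in the earlier listing), assuming that naming scheme holds.

```python
from pathlib import Path


def newest_entry(directory):
    """Return the most recently modified entry in `directory`, or None if empty."""
    entries = list(Path(directory).iterdir())
    if not entries:
        return None
    return max(entries, key=lambda p: p.stat().st_mtime)


def same_batch(path):
    """List files next to `path` with the same stem (assumed batch number)."""
    path = Path(path)
    return sorted(path.parent.glob(path.stem + ".*"))
```

The sketch only reports files; deleting them is left to the `rm` command.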
Suppose you’ve crawled a forum, but URLs containing /attachment were not fetched and you want those too.
Run the following to extract the links from the archives, and re-run the crawl.
find warc/ -name "*.warc" | xargs -I{} ./bin/warc_resources.py --infile {} --links | grep "/attachment" | sort | uniq >> input/list_urls.txt
./bin/op.py --crawl
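The filtering half of that pipeline (`grep | sort | uniq`) can also be expressed in Python, which is handy when the selection rule becomes more complex than a substring match. This is a minimal sketch: `select_links` is a hypothetical helper and assumes the extracted links arrive one per line.

```python
def select_links(lines, needle="/attachment"):
    """Keep lines containing `needle`, de-duplicated and sorted."""
    return sorted({line.strip() for line in lines if needle in line})
```

For example, `select_links(open("links.txt"))` returns the matching links from a hypothetical `links.txt` holding the output of `bin/warc_resources.py --links`.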