Trawler

A job scheduler and analysis tool for web scraping (and other) tasks.

Datasources

Currently the following datasources are implemented:

"

  • facebook posts and reactions: scrapes Facebook posts, comments and reactions (like, heart, etc.)
  • gab (nazi-twitter): crawls posts for a user
  • google dorking: finds interesting files and downloads them
  • json to csv: converts a JSON array into CSV
  • mail: sends mails and files, mostly useful in pipelines
  • masscan: UDP-based port scanner (requires Docker)
  • motiondetection: runs motion analysis on a directory of video files
  • onionlist: downloads the Tor catalogue from onionlist.org
  • onions.danwin1210.de: downloads the Tor catalogue from danwin1210.de and creates a screenshot of each website in the result
  • tiktok: gets video metadata per hashtag, downloads the videos and analyses the text using easyOCR
  • url: generic HTTP scraper
  • urlscreenshotter: scrapes a comma-separated list of URLs and creates a screenshot of each

Create your own datasource

- copy the template dir in ./jobs
- define the fields needed to start the job in fields.js (see the sketch after this list)
- a job can output one or multiple files
- do not write directories; put multiple outputs into an archive instead
- use job_id.ext (e.g. job_id.json) as the filename
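
A minimal fields.js sketch for a hypothetical datasource. The attribute names used here (name, label, type, required) are assumptions about what the job form might need, not the verified schema; check the existing jobs in ./jobs for the exact format.

// jobs/my-datasource/fields.js (hypothetical example, attribute names assumed)
module.exports = [
  { name: 'url',   label: 'Start URL',   type: 'text',   required: true },
  { name: 'depth', label: 'Crawl depth', type: 'number', required: false }
];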

Features

  • simple configuration of actions/datasources, also from third-party modules/repos
  • job monitoring and scheduling
  • SQLite, CSV and JSON browser
  • separation of datasets/artifacts (one archive per crawl)
  • scalable number of workers (also on other machines)

Architecture

Frontend and API

  • GUI to create and schedule jobs
  • Displays pending, running and done jobs
  • Displays CSV and SQLite datasets

Worker(s)

  • Can be distributed (workers and C&C in different locations/on different servers)
  • Jobs are managed through JSON files (and can be distributed with an adapter like PouchDB), as sketched below
  • Multithreaded
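
Illustrative only: the job file schema is not spelled out in this README, but a descriptor along these lines (job id, datasource name, field values and a status the workers update) is the kind of file being passed around. All key names below are assumptions.

{
  "job_id": "1651234567890",
  "datasource": "url",
  "fields": { "url": "https://example.org" },
  "status": "pending"
}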

Install & run

Using NPM

npm i
npm run all