Rotakka

Rotakka is a distributed cluster application designed for scalable Twitter crawling. Its main advantage is that it avoids IP-based blocking by exploiting publicly available web proxies. In contrast to API-based approaches, Rotakka uses browser emulation enabled by Selenium to visit and download Twitter user profiles. It is built on the Akka framework and consists of

a proxy-collecting module,
a proxy-checking module,
a Twitter-crawling module,
and a graph-storing module.

Requirements

Java 8
Maven
a working Selenium driver, in our case:
- an installed Google Chrome or Chromium browser
- a downloaded chromedriver binary of the same version as the Chrome browser
the following environment variables must be set:
- CHROME_DRIVER_PATH
  - our value: /usr/bin/chromedriver
- CHROME_BINARY_PATH
  - our value: /usr/bin/google-chrome-stable
- CHROME_HEADLESS_MODE
  - on servers: true
  - for visual development: false

For further instructions, have a look at the scripts in the "deployment" directory.
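
As an illustration of how the three environment variables above could be validated before a browser is started, here is a minimal sketch. The class `ChromeEnvironment` and its methods are hypothetical and not part of Rotakka. Treating every value other than "true" as non-headless is an assumption as well.

```java
import java.util.Map;

/** Hypothetical helper (not part of Rotakka): reads and validates the Chrome settings. */
final class ChromeEnvironment {
    final String driverPath;
    final String binaryPath;
    final boolean headless;

    private ChromeEnvironment(String driverPath, String binaryPath, boolean headless) {
        this.driverPath = driverPath;
        this.binaryPath = binaryPath;
        this.headless = headless;
    }

    /** Builds the settings from a variable map; pass System.getenv() in a real run. */
    static ChromeEnvironment fromMap(Map<String, String> env) {
        String driver = require(env, "CHROME_DRIVER_PATH");
        String binary = require(env, "CHROME_BINARY_PATH");
        // Assumption: any value other than "true" (case-insensitive) means a visible browser.
        boolean headless = Boolean.parseBoolean(require(env, "CHROME_HEADLESS_MODE"));
        return new ChromeEnvironment(driver, binary, headless);
    }

    /** Fails early with a readable message if a variable is missing or blank. */
    private static String require(Map<String, String> env, String name) {
        String value = env.get(name);
        if (value == null || value.trim().isEmpty()) {
            throw new IllegalStateException("Environment variable " + name + " must be set");
        }
        return value.trim();
    }

    public static void main(String[] args) {
        ChromeEnvironment env = fromMap(System.getenv());
        System.out.println("driver=" + env.driverPath
                + ", binary=" + env.binaryPath
                + ", headless=" + env.headless);
    }
}
```

Running such a check at startup turns a missing variable into an immediate, readable error instead of a later browser failure.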

Usage

Building a Fat-JAR

mvn package

The JAR is created in the "target" directory.

Running the Fat-JAR

java [-Drotakka.config.parameter="whatever"] -jar rotakka-1.0.jar

Multiple config parameters can be passed, each prefixed with "-D".

After the JAR name, either "master" or "slave" must follow; otherwise the help text is printed.
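
The following sketch shows how such a command line reaches a Java program in general: "-D" flags become JVM system properties, and everything after the JAR name arrives as program arguments. The class `LaunchSketch`, the `Role` enum, and both methods are hypothetical and do not reflect Rotakka's actual MainApp. The key rotakka.config.parameter is the placeholder from the command above.

```java
import java.util.Optional;

/** Hypothetical sketch (not Rotakka's MainApp): role argument and -D parameters. */
final class LaunchSketch {
    enum Role { MASTER, SLAVE }

    /** Returns the role named by the first program argument, or empty if help should be printed. */
    static Optional<Role> parseRole(String[] args) {
        if (args.length == 0) {
            return Optional.empty();
        }
        switch (args[0]) {
            case "master": return Optional.of(Role.MASTER);
            case "slave":  return Optional.of(Role.SLAVE);
            default:       return Optional.empty();
        }
    }

    /** Reads a value passed as -Dkey=value, falling back to a default if it was not given. */
    static String property(String key, String fallback) {
        return System.getProperty(key, fallback);
    }

    public static void main(String[] args) {
        Optional<Role> role = parseRole(args);
        if (!role.isPresent()) {
            System.out.println("usage: java [-Dkey=value ...] -jar <jar> master|slave [options]");
            return;
        }
        System.out.println("role=" + role.get());
        System.out.println("rotakka.config.parameter="
                + property("rotakka.config.parameter", "<not set>"));
    }
}
```

Because the JVM consumes "-D" flags itself, they must appear before "-jar", while the role and its options come after the JAR name.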

Developing with IntelliJ

Just import the project as a Maven project.

Master configuration

ProgramArguments="master"
EnvironmentVariables=... (set them as mentioned above)

Slave configuration

ProgramArguments="slave -mh 127.0.0.1"
EnvironmentVariables=... (set them as mentioned above)

Useful Config Parameters

All other parameters regarding the system can be found in the rotakka.conf file within the resource folder. Each parameter is described in the config file itself.

Project Structure

As mentioned above, Rotakka is split into several parts. In this section, we will examine each package and explain the most important facts.

Top Level Files

On the top level there are several files associated with starting Rotakka. Most importantly, we see the MainApp class which is responsible for starting the system.

Cluster

Within this package we have the ClusterListener and the MetricsListener. Both actors are mostly used for logging, so that results such as Total Tweets can be extracted from the logs. It is important to note that these are not cluster singletons, but exist on each node. This means that the outputs have to be aggregated manually across the different nodes to get a complete picture.
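
The manual aggregation could look roughly like the sketch below. The real log format is not documented here, so the pattern "Total Tweets: <number>" is purely an assumption and has to be adapted to the actual output. The sketch also assumes that each node logs a running total, so only the last value per node is counted. The class `TweetCountAggregator` is hypothetical.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper (not part of Rotakka): sums per-node totals found in log lines. */
final class TweetCountAggregator {
    // Assumed line shape, e.g. "... Total Tweets: 1234"; adjust to the real log output.
    private static final Pattern TOTAL = Pattern.compile("Total Tweets:\\s*(\\d+)");

    /** Returns the last total reported in one node's log, or 0 if none was found. */
    static long lastTotal(List<String> logLines) {
        long last = 0;
        for (String line : logLines) {
            Matcher m = TOTAL.matcher(line);
            if (m.find()) {
                last = Long.parseLong(m.group(1));
            }
        }
        return last;
    }

    /** Sums the last reported total of every node. */
    static long clusterTotal(List<List<String>> logsPerNode) {
        long sum = 0;
        for (List<String> log : logsPerNode) {
            sum += lastTotal(log);
        }
        return sum;
    }

    public static void main(String[] args) {
        List<String> nodeA = Arrays.asList("INFO Total Tweets: 10", "INFO Total Tweets: 25");
        List<String> nodeB = Arrays.asList("INFO Total Tweets: 7");
        System.out.println(clusterTotal(Arrays.asList(nodeA, nodeB))); // prints 32
    }
}
```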

Graph

This package implements the Graph Building and Storing. It will not be further explained here because it has a separate and very detailed README.

Proxy

This part of the codebase is responsible for the crawling of public proxies as well as for checking whether these proxies fulfil the quality requirements which we impose on them. This package includes both the checking-package and the crawling-package and some data classes. While both packages contain the actors already known from the paper, the crawling package also includes the code specific to the public proxy websites.

Twitter

This package is responsible for crawling Twitter. It includes both the scheduler and the worker.

Utils

This package includes several utility classes which are used throughout the project. Most importantly, it also includes the code to start the Selenium WebDriver.

Disclaimer

Rotakka is a powerful system and can be used to scrape huge amounts of data within a short time frame. We encourage any potential user to comply with the limitations set by the service which they intend to crawl. Most of these limitations can be found in the Terms of Service. We are not responsible for any damage caused by the misuse of our system.
