Web Crawler Detection

This work is the final project of the Rahnema College Machine Learning internship program. The aim of the project is to train an unsupervised model that recognizes whether the web requests received by the Sanjagh website come from crawlers or not. Finally, to put this work into production, we developed a web application. You can reach the webpage via this link: Demo

We highly recommend going through the presentation slides here to get a better intuition of what we have done.

Dataset

The dataset has been obtained from the Sanjagh server logs. Although it cannot be publicly released, a tiny sample of it can be found in output.log. If you are interested in a complete dataset, you can use logs from any other publicly available nginx server.

A sample record structure is as follows:

207.213.193.143 [2021-5-12T5:6:0.0+0430] [Get /cdn/profiles/1026106239] 304 0 [[Googlebot-Image/1.0]] 32
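
Below is a minimal parsing sketch (not the project's actual parser): the field names, and the assumption that the last two numeric columns are the response length and response time, are our own reading of the sample record above.

```python
# Hypothetical parser for the sample record format shown above; field names are
# assumptions, not an official schema of the Sanjagh logs.
import re

LOG_PATTERN = re.compile(
    r"(?P<ip>\S+) "
    r"\[(?P<timestamp>[^\]]+)\] "
    r"\[(?P<method>\S+) (?P<path>[^\]]+)\] "
    r"(?P<status>\d+) (?P<response_length>\d+) "
    r"\[\[(?P<user_agent>[^\]]*)\]\] "
    r"(?P<response_time>\d+)"
)

line = "207.213.193.143 [2021-5-12T5:6:0.0+0430] [Get /cdn/profiles/1026106239] 304 0 [[Googlebot-Image/1.0]] 32"
record = LOG_PATTERN.match(line).groupdict()
print(record["user_agent"])  # Googlebot-Image/1.0
```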

Project phases

| Phase | Description |
| --- | --- |
| EDA | Exploratory Data Analysis and Feature Engineering. |
| Baseline Models | Train some common baseline clustering models for anomaly detection. |
| Advanced Models | Autoencoders are used! |
| Demo | A simple demo webpage is developed. |

EDA

In this phase we simply got to know the data better. We explored the dataset for useful information and tried to extract appropriate clues from it. We highly recommend running 01_sanjaghDatasetEDA.ipynb in notebooks/ to see exactly what we did in this part.

We then generated a number of features per session. These features can be modified in my_utils.py in utils/. Here is the list of the features we used; a short sketch of computing a few of them follows the table:

| Feature (per session) | Description |
| --- | --- |
| Click rate | A very high click rate can usually only be achieved by an automated script. |
| STD of path depth | Deeper requests usually indicate a human user. |
| Percentage of 4xx status codes | Usually higher for crawlers, as they have a higher chance of hitting outdated or deleted pages. |
| Percentage of 3xx status codes | Indicates redirected requests. |
| Percentage of HTTP HEAD requests | Usually higher for crawlers. |
| Percentage of image requests | Web crawlers usually ignore images. |
| Average & sum of response_length & response_time | A browser automatically requests additional resources on behalf of a human user, which shapes these values. |
| User agent attributes | Browser, OS, is_bot, is_pc. |
| Average time between requests | Usually higher for human users. |
| Number of robots.txt requests | Crawlers want to know the limitations! |
| Percentage of consecutive repeated requests | Usually higher for automated scripts. |
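
As a rough illustration of how such per-session features could be computed (a sketch, not the project's my_utils.py), the snippet below approximates a session by grouping requests on IP address and derives three of the features listed above; the column names ip, timestamp, status and path are assumptions.

```python
# Sketch of per-session feature engineering; sessionization by IP and the column
# names are assumptions, not the project's actual implementation in utils/my_utils.py.
import pandas as pd

def session_features(df: pd.DataFrame) -> pd.DataFrame:
    df = df.copy()
    df["timestamp"] = pd.to_datetime(df["timestamp"], errors="coerce")
    df["status"] = df["status"].astype(int)
    grouped = df.groupby("ip")

    # Session duration in seconds (at least 1 s to avoid division by zero).
    duration = (grouped["timestamp"].max() - grouped["timestamp"].min()).dt.total_seconds().clip(lower=1)

    return pd.DataFrame({
        # Click rate: number of requests per second within the session.
        "click_rate": grouped.size() / duration,
        # Percentage of 4xx status codes in the session.
        "pct_4xx": grouped["status"].apply(lambda s: (s // 100 == 4).mean()),
        # Number of robots.txt requests in the session.
        "n_robots_txt": grouped["path"].apply(lambda p: p.str.contains("robots.txt", regex=False).sum()),
    })
```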

Baseline models

The baseline models we decided to use are IsolationForest and LocalOutlierFactor.

IsolationForest

The Isolation Forest is a technique for detecting outlier samples. Since outliers have features that differ significantly from most of the samples, they are isolated earlier in the hierarchy of a decision tree. Outliers are detected by setting a threshold on the mean path length (number of splits) from the top of the tree downwards. The scikit-learn implementation provides a score for each sample that increases from -1 to +1 with the number of splits; samples with lower scores are more likely to be outliers. The outlier threshold on the score must be set by the user.

Then, for better visualization, we applied PCA with 3 components to see how outliers and inliers are separated.
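
A minimal sketch of this step with scikit-learn (the contamination value and the feature matrix X are placeholders, not the project's actual settings):

```python
# Fit IsolationForest on the per-session feature matrix and project it with PCA
# for a 3D look at how inliers and outliers separate. X is a placeholder here.
import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.decomposition import PCA

X = np.random.rand(1000, 15)  # stand-in for the engineered session features

iso = IsolationForest(contamination=0.05, random_state=42)
labels = iso.fit_predict(X)    # +1 = inlier, -1 = outlier (suspected crawler)
scores = iso.score_samples(X)  # lower scores are more anomalous

X_3d = PCA(n_components=3).fit_transform(X)  # 3 components for visualization
```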

LocalOutlierFactor

The anomaly score of each sample is called Local Outlier Factor. It measures the local deviation of density of a given sample with respect to its neighbors. It is local in that the anomaly score depends on how isolated the object is with respect to the surrounding neighborhood. More precisely, locality is given by k-nearest neighbors, whose distance is used to estimate the local density. By comparing the local density of a sample to the local densities of its neighbors, one can identify samples that have a substantially lower density than their neighbors. These are considered outliers.
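
For comparison, a minimal LocalOutlierFactor sketch under the same assumptions (placeholder feature matrix, illustrative hyperparameters):

```python
# LocalOutlierFactor flags sessions whose local density is much lower than that
# of their k nearest neighbors. X is a placeholder feature matrix.
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

X = np.random.rand(1000, 15)  # stand-in for the engineered session features

lof = LocalOutlierFactor(n_neighbors=20, contamination=0.05)
labels = lof.fit_predict(X)                 # +1 = inlier, -1 = outlier
lof_scores = lof.negative_outlier_factor_   # more negative = more anomalous
```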

In our experiments, however, IsolationForest gave more accurate results.

Advanced Models

One of the most promising methods for this unsupervised task is the autoencoder, which gave us our best results.

Autoencoders

An autoencoder is a neural network architecture that discovers structure within the data in order to build a compressed representation of the input. Typical applications include anomaly detection and data denoising, among others.

The training configuration we used is as follows. It can be modified in configs.py, which can be found in configs/.

| Setting | Value |
| --- | --- |
| Optimizer | Adam |
| Loss | MSE |
| Activations | ReLU |
| # of epochs | 20 |
| Batch size | 64 |

Additionally, multiple architectures were tested; a comparison is shown below:

| # of neurons | Train loss | Test loss |
| --- | --- | --- |
| [15, 7, 15] | 0.42 | 0.48 |
| [15, 3, 15] | 0.28 | 0.39 |
| [15, 7, 3, 7, 15] | 0.29 | 0.43 |
| [15, 7, 7, 7, 15] | 0.31 | 0.42 |

As with all unsupervised algorithms, an anomaly score threshold has to be selected. After many experiments, the MSE threshold that fits this dataset best is 0.26, but you can adjust it for your own dataset in configs.py, which can be found in configs/. We also evaluated our models; the results can be reviewed in the presentation slides in presentation/.
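
Putting the pieces above together, here is a minimal sketch of the autoencoder setup, assuming a Keras implementation and the [15, 3, 15] layout (15 input features, a 3-unit bottleneck); the feature matrix is a placeholder and the 0.26 threshold is the value reported above:

```python
# Sketch only: a [15, 3, 15] autoencoder trained with the configuration listed above
# (Adam, MSE, ReLU, 20 epochs, batch size 64). X_train is a placeholder for the
# scaled per-session feature matrix.
import numpy as np
import tensorflow as tf

X_train = np.random.rand(1000, 15).astype("float32")

autoencoder = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(15,)),
    tf.keras.layers.Dense(3, activation="relu"),   # bottleneck
    tf.keras.layers.Dense(15, activation="relu"),  # reconstruction
])
autoencoder.compile(optimizer="adam", loss="mse")
autoencoder.fit(X_train, X_train, epochs=20, batch_size=64, verbose=0)

# Sessions whose reconstruction error exceeds the threshold are flagged as crawlers.
reconstruction = autoencoder.predict(X_train, verbose=0)
mse = np.mean((X_train - reconstruction) ** 2, axis=1)
is_crawler = mse > 0.26  # threshold reported above; configurable in configs.py
```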

Demo

We used Flask and React to develop a webpage for this project. First, check it out here. If you are interested in running the webpage locally, follow the steps below:

How to run

The whole source code of the webpage can be found in App/. Clone the repository and run the commands below:

  1. Clone the repo:
$ git clone https://github.com/mohammadhashemii/Web-Crawler-Detection
$ cd Web-Crawler-Detection
  2. Run the backend:
$ cd backend
$ pip install flask flask-cors
$ python app.py
  3. Run the frontend:
$ cd frontend
$ npm install
$ npm start