Url2vec

Abstract

This thesis presents a new methodology for clustering web pages. It uses random walks between pages, together with the pages' textual content, to learn vector representations for nodes in the web graph. Url2vec is implemented to extract clusters of pages of the same semantic type. Unlike the clustering algorithms proposed in the literature, Url2vec does not treat a website as a collection of independent text documents; instead, it combines information about the content of the pages with the structure of the website.

The experimental results were reasonably good and encourage further study in this direction to find new ways of improving the quality of the clusters.
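
To make the random-walk idea from the abstract concrete, here is a minimal sketch of how walks over a site's link graph might be generated. It is not the code from this repository: the function name `random_walks`, its parameters, and the toy `site` graph are all hypothetical, and the sketch covers only the structural part, not the textual content. Each walk is a sequence of URLs that could then be treated like a sentence and passed to a sequence embedding model to obtain one vector per page; the abstract does not name the model, so that step is left out.

```python
import random


def random_walks(graph, walks_per_node=2, walk_length=5, seed=0):
    """Generate fixed-length random walks over a directed link graph.

    `graph` maps each URL to the list of URLs it links to.
    All names and defaults here are illustrative assumptions.
    """
    rng = random.Random(seed)  # seeded so that runs are repeatable
    walks = []
    for _ in range(walks_per_node):
        for start in sorted(graph):
            walk = [start]
            while len(walk) < walk_length:
                neighbours = graph.get(walk[-1], [])
                if not neighbours:
                    break  # dead end: the page has no outgoing links
                walk.append(rng.choice(neighbours))
            walks.append(walk)
    return walks


# Hypothetical toy website: page -> pages it links to.
site = {
    "/": ["/staff", "/courses"],
    "/staff": ["/staff/alice", "/staff/bob", "/"],
    "/staff/alice": ["/staff"],
    "/staff/bob": ["/staff"],
    "/courses": ["/courses/ml", "/"],
    "/courses/ml": ["/courses"],
}

walks = random_walks(site)
for walk in walks[:3]:
    print(" -> ".join(walk))
```

Pages that play the same role in the site, such as the two staff profile pages in this toy graph, tend to appear in similar walk contexts, which is the intuition behind clustering pages by vectors learned from the walks.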

Setup

I suggest setting up a virtual environment with Miniconda.

Create an environment with Python 2.7:

conda create --name url2vec python=2.7

Install the requirements (with the environment activated):

pip install -r ./requirements.txt

To run the example notebooks:

jupyter-notebook ./notebooks

chrisPiemonte/url2vec
