Football Data Platform

WIP: This project is still under development; more updates to come.

The Football Data Platform is a data aggregation tool for football enthusiasts, analysts, and researchers. It collects football-related data from three platforms: FBref, Sofascore, and Transfermarkt. Once fetched, the webpage data is saved as JSON files and then loaded into a PostgreSQL database for structured queries and analytics.

Features

Data Scraping: Pulls data from Transfermarkt, Sofascore, and FBref.
Data Storage: Stores raw webpage data as JSON files.
Database Loading: Inserts and structures the scraped data into a PostgreSQL database.
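
The store-then-load flow listed above can be sketched roughly as follows. This is a minimal illustration and not the project's actual code: the function names, the file layout, and the `matches` table with its columns are all hypothetical, and the sketch only prepares rows for a parameterized INSERT rather than connecting to PostgreSQL.

```python
import json
from pathlib import Path

# Hypothetical statement; the real schema is defined by the Alembic migrations.
INSERT_SQL = "INSERT INTO matches (id, home, away) VALUES (%s, %s, %s)"


def save_raw(payload: dict, directory: Path, name: str) -> Path:
    """Write one scraped payload to <directory>/<name>.json and return the path."""
    directory.mkdir(parents=True, exist_ok=True)
    path = directory / f"{name}.json"
    path.write_text(json.dumps(payload, ensure_ascii=False), encoding="utf-8")
    return path


def rows_from_file(path: Path) -> list:
    """Read a saved JSON file back and flatten it into rows for INSERT_SQL."""
    payload = json.loads(path.read_text(encoding="utf-8"))
    return [(m["id"], m["home"], m["away"]) for m in payload.get("matches", [])]
```

Keeping the raw JSON on disk means the load step can be re-run without scraping the source again.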

Prerequisites

Python 3.9+
Poetry
Docker

Installation

Clone this repository:

git clone https://github.com/your-github-username/football-data-platform.git
cd football-data-platform

Create a Poetry virtual environment and install the dependencies:

poetry shell
poetry install --no-root

Create a .env file in the root directory:

cp .env.example .env
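
As a rough sketch of how the values in .env might be turned into a PostgreSQL connection URL, the snippet below parses KEY=VALUE lines using only the standard library. The variable names (POSTGRES_USER, POSTGRES_PASSWORD, POSTGRES_HOST, POSTGRES_PORT, POSTGRES_DB) are assumptions for illustration; check .env.example for the names the project actually uses.

```python
from pathlib import Path


def read_env(path: str = ".env") -> dict:
    """Parse KEY=VALUE lines, skipping blank lines and # comments."""
    values = {}
    for line in Path(path).read_text(encoding="utf-8").splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def postgres_url(env: dict) -> str:
    # All five variable names below are assumptions, not taken from .env.example.
    return (
        f"postgresql://{env['POSTGRES_USER']}:{env['POSTGRES_PASSWORD']}"
        f"@{env['POSTGRES_HOST']}:{env['POSTGRES_PORT']}/{env['POSTGRES_DB']}"
    )
```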

Build the Docker image and spin up the containers:

docker compose up -d --build

Run the database migrations:

alembic upgrade head

Scraping

Sofascore

scrapy crawl sofascore -a TOURNAMENT_ID=<tournament_id> -a SEASON_ID=<season_id>

Where <tournament_id> and <season_id> are the tournament and season identifiers, respectively. Both can be found in the URL of the tournament page on Sofascore. If SEASON_ID is omitted, the crawler scrapes every season with available data.

Example: LaLiga 23/24

scrapy crawl sofascore_season -a TOURNAMENT_ID=8 -a SEASON_ID=52376
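
When crawls are launched from a script, the optional SEASON_ID can be handled by building the command line programmatically. The helper below is a hypothetical convenience and not part of the project; it only assembles the arguments shown above and leaves out SEASON_ID when none is given.

```python
import shlex
from typing import Optional


def crawl_command(spider: str, tournament_id: str, season_id: Optional[str] = None) -> list:
    """Build the argv for `scrapy crawl`, omitting SEASON_ID when it is not given."""
    argv = ["scrapy", "crawl", spider, "-a", f"TOURNAMENT_ID={tournament_id}"]
    if season_id is not None:
        argv += ["-a", f"SEASON_ID={season_id}"]
    return argv


# Print the command for a single season and for all seasons of tournament 8.
print(shlex.join(crawl_command("sofascore", "8", "52376")))
print(shlex.join(crawl_command("sofascore", "8")))
```

The returned list can be passed directly to subprocess.run from the repository root.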

Transfermarkt

scrapy crawl transfermarkt -a TOURNAMENT_ID=<tournament_id> -a SEASON_ID=<season_id>

Where <tournament_id> and <season_id> are the tournament and season identifiers, respectively. They can be found in the URL of the tournament page on Transfermarkt.

Example: LaLiga 23/24

scrapy crawl transfermarkt -a TOURNAMENT_ID=ES1 -a SEASON_ID=2023
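
To illustrate where the two identifiers sit in a Transfermarkt URL, here is a small sketch that pulls them out. It assumes the competition URL has the shape .../wettbewerb/<tournament_id>/... with the season in a saison_id query parameter; that shape is an assumption about the site, and the function name is hypothetical.

```python
import re
from typing import Optional, Tuple
from urllib.parse import parse_qs, urlparse


def transfermarkt_ids(url: str) -> Tuple[Optional[str], Optional[str]]:
    """Return (tournament_id, season_id) from a competition URL, or None where absent."""
    parsed = urlparse(url)
    # Assumed URL shape: .../wettbewerb/<tournament_id>/... with ?saison_id=<season_id>
    match = re.search(r"/wettbewerb/([^/]+)", parsed.path)
    tournament_id = match.group(1) if match else None
    season_id = parse_qs(parsed.query).get("saison_id", [None])[0]
    return tournament_id, season_id
```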

FBref

scrapy crawl <spider_name>

Where <spider_name> is the name of the spider to be executed. The available spiders are:

FBrefBRA1
FBrefEPL
FBrefUCL
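
To run all three FBref spiders one after another, a small wrapper along these lines could be used. It is a sketch rather than part of the project: the `run_spiders` name and the injectable `runner` parameter are hypothetical, and it assumes `scrapy` is on the PATH and the script is started from the repository root.

```python
import subprocess

FBREF_SPIDERS = ["FBrefBRA1", "FBrefEPL", "FBrefUCL"]


def run_spiders(names, runner=subprocess.run):
    """Run `scrapy crawl <name>` for each spider in order, stopping on the first failure."""
    for name in names:
        # check=True makes subprocess.run raise CalledProcessError on a non-zero exit.
        runner(["scrapy", "crawl", name], check=True)
```

Calling run_spiders(FBREF_SPIDERS) starts the crawls sequentially; passing a different runner makes the wrapper easy to dry-run.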

Processing

usage: processing [-h] [--full-load] [--debug] [{sofascore,transfermarkt}]

positional arguments:
  {sofascore,transfermarkt}  Source to process data from.

optional arguments:
  -h, --help   show this help message and exit
  --full-load  Process and load all data from the source
  --debug      Enable debug mode.

Example:

python app/processing sofascore --full-load
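
For readers who want to see how such a command-line interface is typically declared, the following argparse sketch reproduces the usage text above. It is reconstructed from the help output only and may differ from the parser in app/processing; the `build_parser` name is hypothetical.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog="processing")
    # nargs="?" makes the source optional, matching the square brackets in the usage line.
    parser.add_argument(
        "source",
        nargs="?",
        choices=["sofascore", "transfermarkt"],
        help="Source to process data from.",
    )
    parser.add_argument("--full-load", action="store_true", help="Process and load all data from the source")
    parser.add_argument("--debug", action="store_true", help="Enable debug mode.")
    return parser


args = build_parser().parse_args(["sofascore", "--full-load"])
print(args.source, args.full_load, args.debug)  # sofascore True False
```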
