Uber Speeds Data Analysis

Ride-hailing has become an increasingly common mode of transportation in cities around the world. With the convenience of smartphones, 4G data, and secure payment systems, ride-hailing has turned out to be a natural and much-needed service that challenges the traditional transport mechanisms of the past.

Uber launched in 2009 as the first company of its kind, offering ride-hailing services in select places of the world. Today Uber is available in 65 countries and over 600 cities worldwide, and completes 14 million trips each day. Another way of looking at Uber is through the lens of its technical capabilities: today, it is considered among the top tech companies in the world. As more and more people use Uber, it gathers a massive amount of vehicle movement and speed data, some of which is publicly available for scientific research and urban planning purposes. One such dataset, released on May 14th, 2019, is the Movement Speeds of Uber vehicles recorded around the world. An explanation of how Uber calculates speeds is found here.

In this project, Movement Speeds data for Uber rides in London is analyzed. The hourly time series of speeds for the 1st quarter of 2018 is decomposed and used for forecasting. Furthermore, road accident (collision) data from 2018 is read in the hope of correlating it with the speeds dataset.

The objectives of this project are:

Analyze the hourly speed time series to find patterns and relationships.
Predict future speeds per road based on the current time series.
Analyze collision data and correlate it with the speeds dataset, to answer questions such as: when did most collisions occur?

To encourage further development, the complete source code and analysis are available in the src directory. This project is open to contributions.

Implementation

The implementation of the project is divided into three sections. The first is data collection, which includes deciding what other available information can be related to the datasets. The second is data preparation, which includes cleaning, sorting, merging, and filtering the data down to meaningful information on which analysis can be performed. The third is time-series decomposition, forecasting, and modeling to find patterns, relationships, and predictions in the dataset.
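As a rough illustration of the decomposition step, the sketch below splits an hourly series into trend, daily seasonal, and residual parts using a classical additive decomposition written by hand with NumPy and pandas. It is not the notebook's actual code, which may rely on Statsmodels instead. The function name `decompose_hourly` and the synthetic series are assumptions made for illustration only.

```python
import numpy as np
import pandas as pd


def decompose_hourly(series: pd.Series, period: int = 24) -> pd.DataFrame:
    """Additive decomposition of an hourly series (assumes an even period)."""
    values = series.to_numpy(dtype=float)
    n = len(values)
    half = period // 2

    # Trend: centered 2 x period moving average (half weight on both end points).
    weights = np.r_[0.5, np.ones(period - 1), 0.5] / period
    trend = np.full(n, np.nan)
    trend[half:n - half] = np.convolve(values, weights, mode="valid")

    # Seasonal: average detrended value for each position in the daily cycle,
    # shifted so the seasonal component sums to zero over one period.
    detrended = values - trend
    phase = np.arange(n) % period
    by_phase = np.array([np.nanmean(detrended[phase == k]) for k in range(period)])
    by_phase -= by_phase.mean()
    seasonal = by_phase[phase]

    return pd.DataFrame(
        {
            "observed": values,
            "trend": trend,
            "seasonal": seasonal,
            "resid": values - trend - seasonal,
        },
        index=series.index,
    )


# Synthetic example (made-up data): one week of hourly speeds with a daily cycle.
index = pd.date_range("2018-01-01", periods=7 * 24, freq=pd.Timedelta(hours=1))
hours = np.arange(len(index))
speeds = pd.Series(20 + 5 * np.sin(2 * np.pi * hours / 24), index=index)
parts = decompose_hourly(speeds)
```

The first and last 12 hours of the trend are left as NaN because the centered window does not fit there. A real segment would also need gaps in its hourly index handled before decomposing.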

Data Collection

To perform meaningful analysis, London was selected because it has a higher percentage of Uber rides than other cities in the region. Uber offers two different kinds of speed datasets:

Hourly Time Series – Provides hourly mean of speeds and their standard deviations across the city for historic data
Quarterly Statistics by Hour of Day – This dataset provides the average, standard deviation, 50th percentile, and 85th percentile speeds aggregated by hour of day across all days in the specified quarter.

Selecting Duration of Data

In this project, only the Hourly Time Series dataset is used for time series analysis of the speeds data. Because the series is hourly, a single month of data could have 730,001 mean speed observations. For this reason, only three months of hourly time series data are collected and analyzed.
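To give a sense of scale, the following minimal sketch counts the hourly slots in the first quarter of 2018. Assuming at most one row per road segment per hour, the number of rows grows with both the number of hours and the number of segments observed. The helper name `hourly_timestamps` is hypothetical.

```python
from datetime import datetime, timedelta


def hourly_timestamps(start: datetime, end: datetime) -> list:
    """Return every hourly timestamp from start (inclusive) to end (exclusive)."""
    hours = int((end - start) / timedelta(hours=1))
    return [start + timedelta(hours=h) for h in range(hours)]


# January to March 2018: 31 + 28 + 31 = 90 days.
q1_2018 = hourly_timestamps(datetime(2018, 1, 1), datetime(2018, 4, 1))
print(len(q1_2018))  # 2160 hourly slots (90 days x 24 hours)
```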

Selecting Interval of Data

The next question is which time period the data should cover, since it is available for 2018, 2019, and so on. The choice depended on the availability of the collision dataset for London during that period. Both the collision and speed datasets have to cover the same time interval to form a meaningful correlation between the two.

A search of the UK government data website showed that the Road Safety Accidents dataset is available for 2018, so 2018 data was selected for both datasets.

Downloading Datasets

You will need three datasets:

Uber movement speeds data .csv files
Road safety collision data for London 2018
Open Street Map Geometries .geojson file of London 2018

Uber provides a handy Movement Data Toolkit for downloading the .csv files of the speeds dataset. It is available as an npm package for Node.js.

Using the speeds-transform command of the toolkit, three months of speed data for London can be downloaded as a .csv file.

mdt speeds-transform historical London 2018-01-01 2018-03-31 --output=speeds-data.csv

Since this is a lot of data, you can raise the memory limit of Node.js by prepending Node options to the above command.

export NODE_OPTIONS="--max_old_space_size=8192" && mdt speeds-transform historical London 2018-01-01 2018-03-31 --output=speeds-data.csv

The road safety accidents (collision) data is a single file that can be downloaded here.

Download the London 2018 geometries using the following Movement Data Toolkit command:

mdt create-geometry-file London 2018 > london2018.geojson

The script assumes all data files to be inside data/london2018/ folder at the root of the project. If you have a different folder, you can specify a different path in the Jupyter notebook.
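A small check like the one below can confirm that the expected files are in place before running the notebook. It is only a sketch: `speeds-data.csv` and `london2018.geojson` are the names used in the commands above, while `collisions-2018.csv` is a placeholder name for the collision file and `missing_files` is a hypothetical helper.

```python
from pathlib import Path

# Default location assumed by the script, relative to the project root.
DATA_DIR = Path("data") / "london2018"

# The collision file name is a placeholder; use whatever name you saved it under.
EXPECTED_FILES = ["speeds-data.csv", "london2018.geojson", "collisions-2018.csv"]


def missing_files(data_dir, expected=EXPECTED_FILES):
    """Return the expected file names that are not present in data_dir."""
    return [name for name in expected if not (Path(data_dir) / name).is_file()]


if __name__ == "__main__":
    for name in missing_files(DATA_DIR):
        print(f"Missing: {DATA_DIR / name}")
```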

Anatomy of Uber Speeds Data

The hourly speeds dataset of Uber contains the following major columns:

utc_timestamp - Date & time of observations in UTC format
segment_id - Special Ids assigned to road segments by Uber
start_junction_id - Junction where the segment starts, e.g., a roundabout
end_junction_id - Junction where the segment ends
speed_mph_mean - Mean speed of vehicles within an hour
speed_mph_stddev - Standard deviation of speed of vehicles in an hour
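The columns above can be loaded into pandas and turned into a per-segment hourly series roughly as follows. This is a minimal sketch: the sample rows, the identifier values, and the helper names `load_speeds` and `segment_series` are made up for illustration, and the real file may use a different timestamp format.

```python
import io

import pandas as pd

# Made-up sample rows using the column names listed above.
SAMPLE_CSV = """utc_timestamp,segment_id,start_junction_id,end_junction_id,speed_mph_mean,speed_mph_stddev
2018-01-01T01:00:00.000Z,seg-a,junc-1,junc-2,23.0,2.8
2018-01-01T00:00:00.000Z,seg-a,junc-1,junc-2,21.5,3.1
2018-01-01T00:00:00.000Z,seg-b,junc-2,junc-3,17.2,4.0
"""


def load_speeds(source) -> pd.DataFrame:
    """Read the speeds CSV and parse utc_timestamp as timezone-aware UTC."""
    df = pd.read_csv(source)
    df["utc_timestamp"] = pd.to_datetime(df["utc_timestamp"], utc=True)
    return df.sort_values(["segment_id", "utc_timestamp"]).reset_index(drop=True)


def segment_series(df: pd.DataFrame, segment_id: str) -> pd.Series:
    """Return the hourly mean speed of one segment, indexed by timestamp."""
    rows = df[df["segment_id"] == segment_id]
    return rows.set_index("utc_timestamp")["speed_mph_mean"]


speeds = load_speeds(io.StringIO(SAMPLE_CSV))
series_a = segment_series(speeds, "seg-a")
```

For the real dataset, pass the path to the downloaded .csv file to `load_speeds` instead of the in-memory sample.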

The above columns can uniquely identify a street segment and its speed. However, the segments and junctions correspond to Uber's internal identification models. The dataset also includes Open Street Map (OSM) Ids, which correspond to the OSM Way Id and OSM Node Id respectively.

OSM Way and Node Ids define roads and the nodes connecting them. Uber has its own implementation of road structures because speeds can vary a lot within a single OSM Way Id. That is why multiple Segment Ids correspond to a single OSM Way Id. Fortunately, OSM Ids are also provided in the same table:

osm_way_id - OSM Way Id with One to Many relationship with segment_id
osm_start_node_id - Start node Id of OSM corresponding to start_junction_id
osm_end_node_id - End node Id of OSM corresponding to end_junction_id
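The one-to-many relationship between osm_way_id and segment_id can be inspected with a simple group-by. The sketch below uses made-up rows and identifier values; only the column names come from the dataset description.

```python
import pandas as pd

# Made-up rows: way 101 is covered by two Uber segments, way 202 by one.
speeds = pd.DataFrame(
    {
        "osm_way_id": [101, 101, 101, 202],
        "segment_id": ["seg-a", "seg-b", "seg-a", "seg-c"],
        "speed_mph_mean": [20.0, 30.0, 22.0, 15.0],
    }
)

# How many distinct Uber segments map to each OSM way.
segments_per_way = speeds.groupby("osm_way_id")["segment_id"].nunique()

# Average of the hourly mean speeds per OSM way.
way_speed = speeds.groupby("osm_way_id")["speed_mph_mean"].mean()

print(segments_per_way.to_dict())  # {101: 2, 202: 1}
print(way_speed.to_dict())         # {101: 24.0, 202: 15.0}
```

Averaging the hourly means gives every row equal weight regardless of how many vehicles it represents, so treat the per-way figure as a rough summary.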

For more information about Uber's speed data, check out this article.

The analysis is available as a Python Jupyter Notebook here, along with detailed documentation and a discussion of the results.

Anatomy of Collision Data

The road safety accident (collision) data downloaded above contains statistics about road accidents in London. The dataset includes the following major columns:

Date - Date and time of accident
Longitude - Longitude of accident location
Latitude - Latitude of accident location
Accident_Severity - Severity value of accident
Number_of_casualties - Number of casualties ... among others
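With the Date column parsed, a question such as "when did most collisions occur?" reduces to counting accidents per hour of day. The sketch below uses made-up values and assumes a day/month/year timestamp format, which may differ from the real file. The helper name `collisions_by_hour` is hypothetical.

```python
from collections import Counter
from datetime import datetime


def collisions_by_hour(date_strings, fmt="%d/%m/%Y %H:%M"):
    """Count collisions per hour of day (0-23) from timestamp strings."""
    return Counter(datetime.strptime(s, fmt).hour for s in date_strings)


# Made-up sample values for the Date column.
sample_dates = ["05/01/2018 08:15", "05/01/2018 08:40", "06/01/2018 17:05"]
counts = collisions_by_hour(sample_dates)
peak_hour, peak_count = counts.most_common(1)[0]
print(peak_hour, peak_count)  # 8 2
```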

Since this dataset does not contain direct osm_way_ids, a manual mapping has to be defined.
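One possible way to build such a mapping, shown here only as a sketch and not necessarily the approach taken in the notebook, is to assign each collision to the OSM way whose reference point is closest by great-circle distance. The way Ids and coordinates below are made up, and `nearest_way` is a hypothetical helper.

```python
import math


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two points given in degrees."""
    radius_km = 6371.0  # mean Earth radius
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    d_phi = phi2 - phi1
    d_lambda = math.radians(lon2 - lon1)
    a = math.sin(d_phi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(d_lambda / 2) ** 2
    return 2 * radius_km * math.asin(math.sqrt(a))


def nearest_way(lat, lon, way_points):
    """Return the osm_way_id of the closest (osm_way_id, lat, lon) reference point."""
    return min(way_points, key=lambda p: haversine_km(lat, lon, p[1], p[2]))[0]


# Made-up reference points, one per OSM way, in central London.
way_points = [(101, 51.5074, -0.1278), (202, 51.5155, -0.0922)]

# Made-up collision location taken from the Latitude and Longitude columns.
matched_way = nearest_way(51.5080, -0.1281, way_points)
print(matched_way)  # 101
```

A linear scan like this is fine for a handful of points. For the full dataset, a spatial index or a Geopandas spatial join against the road geometries would be the more practical choice.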

Environment Setup and Usage

This project uses Python 3 with Jupyter Notebook. The easiest way to install both is the Anaconda distribution. Once Anaconda is installed, make sure the following packages are installed, as the script depends on them:

Numpy
Pandas
Geopandas
Statsmodels
pmdarima

After that, you can execute the .ipynb notebook in the browser, or install an editor such as VS Code with the Python extension to execute it.
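Before opening the notebook, a quick check like the following can report which of the listed packages are not importable in the current environment. It is a minimal sketch using only the standard library. The helper name `missing_packages` is hypothetical, and the import names are the lowercase module names of the packages listed above.

```python
from importlib.util import find_spec

# Import names of the packages the notebook depends on.
REQUIRED = ["numpy", "pandas", "geopandas", "statsmodels", "pmdarima"]


def missing_packages(names=REQUIRED):
    """Return the package names that cannot be found in this environment."""
    return [name for name in names if find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_packages()
    if missing:
        print("Missing packages:", ", ".join(missing))
    else:
        print("All required packages are available.")
```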
