
An inferential approach to Urban Scaling Laws

This repository contains data and the Python implementation of probabilistic models to investigate urban scaling laws [1][2]. These statistical laws state that many observables $y_i$ (e.g., GDP) of $i=1, 2, \ldots, N$ urban areas in a country (or region) scale with the population $x_i$ as

$$ y_i \sim x_i^{\beta},$$

with $0<\beta<2$. The primary interest is to compare different models, test the validity of the urban scaling law, and estimate the scaling parameter $\beta$.

Models

For the application of models based on cities (C), see Ref. [1] and the Jupyter Notebook (or Open Notebook in Colab).

For the application of models based on the attribution of tokens to individuals (I), which also account for the spatial interaction between urban areas, see Ref. [2] and the Jupyter Notebook (or Open Notebook in Colab).

| Model | Parameters | Spatial interaction (Y/N)? | Cities (C) or Individuals (I) | Formula | Description/Reference |
| --- | --- | --- | --- | --- | --- |
| Per-capita | $\emptyset$ | N | C, I | $y_i = x_i \frac{\sum y_i}{\sum x_i}$ | Fixed per-capita rate, $\beta=1$ [2] |
| Least-squares | $\beta, A$ | N | C | $\log(y) = A + \beta \log(x)$ | Least-squares fitting of log-transformed variables [1] |
| Gaussian | $\beta, \alpha, \gamma, \delta$ | N | C | $\mathbb{E}(y\mid x) = \alpha x^{\beta}$, $\mathbb{V}(y\mid x) = \gamma \mathbb{E}(y\mid x)^{\delta}$ | Gaussian $P(y\mid x)$ [1] |
| Log-normal | $\beta, \alpha, \gamma, \delta$ | N | C | $\mathbb{E}(y\mid x) = \alpha x^{\beta}$, $\mathbb{V}(y\mid x) = \gamma \mathbb{E}(y\mid x)^{\delta}$ | Log-normal $P(y\mid x)$ [1] |
| Persons | $\beta$ | N | I | $p(j) \sim x_{c(j)}^{\beta-1}$ | Tokens are attributed to individuals with probability $p(j)$ [1][2] |
| Gravitational | $\beta, \alpha_G$ | Y | I | $a_G = \frac{1}{1 + \left(\frac{d}{\alpha_G}\right)^2}$ | Tokens attributed to individuals with probability $p(j)$, who interact according to $a_G$ depending on distance $d$ [1][2] |
| Exponential | $\beta, \alpha_E$ | Y | I | $a_E = e^{-d \ln(2) / \alpha_E}$ | Tokens attributed to individuals with probability $p(j)$, who interact according to $a_E$ depending on distance $d$ [2] |
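For orientation, here is a minimal sketch (not the repository's code in src; the function name least_squares_beta is only illustrative) of the simplest model in the table, the least-squares fit of log-transformed variables:

import numpy as np

def least_squares_beta(x, y):
    # Fit log(y) = A + beta * log(x) by ordinary least squares;
    # numpy.polyfit returns the slope first, then the intercept.
    beta, A = np.polyfit(np.log(x), np.log(y), deg=1)
    return beta, A

Applied to any of the (x, y) datasets listed below, this returns the fitted exponent $\beta$ and intercept $A$.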

Data

The datasets listed below are available for investigation. The column "Tag" indicates the key used to call the data in our code (e.g., in the notebook). The column "Location?" indicates whether the latitude and longitude are available (Y/N). An example of the analysis of COVID19 cases can be found here.

| Region | Tag | N | Location? | Year | Description | Source |
| --- | --- | --- | --- | --- | --- | --- |
| Australia | covid19_NSW | 144 | N | 2021 | COVID19 cases in the state of NSW | NSW |
| Australia | australia_area | 102 | Y | 2021 | Area | Australian Bureau of Statistics |
| Australia | australia_education | 102 | Y | 2021 | Top bracket in Education | Census, Australian Bureau of Statistics |
| Australia | australia_income | 102 | Y | 2021 | Top bracket in Income | Census, Australian Bureau of Statistics |
| Brazil | brazil_aids_2010 | 1812 | Y | 2010 | AIDS cases | Brazilian Health Ministry |
| Brazil | brazil_externalCauses_2010 | 5286 | Y | 2010 | Deaths by external causes | Brazilian Health Ministry |
| Brazil | brazil_gdp_2010 | 5565 | Y | 2010 | GDP | Brazilian Health Ministry |
| Brazil | covid19_brazil | 5570 | N | 2021 | COVID19 cases | Brasil.io and wcota |
| Chile | covid19_chile | 346 | N | 2021 | COVID19 cases | MinCiencia |
| Europe | eurostat_cinema_seats | 418 | N | 2011 | Cinema seats | Eurostat |
| Europe | eurostat_cinema_attendance | 221 | N | 2011 | Attendance at cinemas | Eurostat |
| Europe | eurostat_museum_visitors | 443 | N | 2011 | Visitors to museums | Eurostat |
| Europe | eurostat_theaters | 398 | N | 2011 | Theaters | Eurostat |
| Europe | eurostat_libraries | 597 | N | 2011 | Libraries | Eurostat |
| Germany | germany_gdp | 108 | N | 2012 | GDP | German Statistical Office |
| OECD | oecd_gdp | 275 | N | 2010 | GDP | OECD |
| OECD | oecd_patents | 218 | N | 2008 | Patents filed | OECD |
| UK | uk_income | 100 | N | 2000 to 2011 | Weekly income | Arcaute et al. |
| UK | uk_patents | 93 | N | 2000 to 2011 | Patents filed | Arcaute et al. |
| UK | uk_train | 97 | N | 2000 to 2011 | Train stations | Arcaute et al. |
| USA | usa_gdp | 381 | Y | 2013 | GDP | BEA |
| USA | usa_miles | 459 | Y | 2013 | Length of roads in miles | FHWA |
| USA | covid19_USA | 3131 | N | 2021 | COVID19 cases | Kaggle |

The data is stored in the folder data, where more information about its sources and filtering can be found. It is organised as Python packages (e.g., brazil). Each package defines, in its __init__.py, functions that return the data. The data is always a tuple (x, y) of numpy arrays of the same size, where x is always the population.

For example, to get the population and GDP of Brazilian cities in 2010, use:

import brazil
x, y = brazil.gdp(2010)

For the spatial data, an additional array (l) indicates the location (latitude and longitude) of the urban area.
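For illustration only (the values below are made-up placeholders, and the assumption that each row of l holds a (latitude, longitude) pair follows the description above), the three arrays could look like this:

import numpy as np

x = np.array([5_312_000, 5_078_000])    # populations
y = np.array([1.2e9, 1.1e9])            # observable (e.g., GDP)
l = np.array([[-33.87, 151.21],         # latitude and longitude
              [-37.81, 144.96]])        # of each urban area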

Import your own data:

New data can be added as .csv file to

new_dataset/generic_dataset.txt (for three columns: city name, $x,y$)

or

new_dataset2/generic_dataset.txt (for two columns: $x,y$)

For the spatial analysis, import your results as $x$ (population), $y$ (observable), $\ell$ (latitude and longitude) directly in the notebook.
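For example, a three-column file for new_dataset/generic_dataset.txt could look like this (illustrative placeholder values):

CityA,120000,4.5e9
CityB,45000,1.1e9
CityC,300000,9.8e9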

Code

The easiest way to interact with and run the code is through the Notebooks in the folder notebooks. Follow the link in the "Notebook-*-Colab.ipynb" files to run them in Colab, or download this repository and run them using Jupyter. The source Python code is in the folder src.

Likelihood and minimisation

All inference is performed based on the likelihood of the different models. The module best_parameters.py contains the definitions of the likelihood functions of the models, the minimisation algorithm, and the parameters we use in it. The bootstrap used to estimate error bars is also defined in this module, in minimize_with_errors. The bootstrap for the Persons model is implemented in pvalue_population.py. The likelihood and minimisation of the spatial models are in spatial.py.
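To illustrate the general idea (a hedged sketch, not the code in best_parameters.py; the function names here are made up), the log-normal model of the table above can be fitted by minimising its negative log-likelihood with scipy:

import numpy as np
from scipy.optimize import minimize

def lognormal_neg_loglik(params, x, y):
    # E(y|x) = alpha * x**beta and V(y|x) = gamma * E(y|x)**delta,
    # translated into the mean and variance of log(y).
    log_alpha, beta, log_gamma, delta = params
    mean = np.exp(log_alpha) * x**beta
    var = np.exp(log_gamma) * mean**delta
    sigma2 = np.log(1.0 + var / mean**2)
    mu = np.log(mean) - sigma2 / 2.0
    return np.sum(0.5 * np.log(2 * np.pi * sigma2) + np.log(y)
                  + (np.log(y) - mu)**2 / (2 * sigma2))

def fit_lognormal(x, y):
    # Rough starting point: per-capita rate, beta = 1, gamma = 1, delta = 1.
    start = np.array([np.log(np.mean(y) / np.mean(x)), 1.0, 0.0, 1.0])
    return minimize(lognormal_neg_loglik, start, args=(x, y), method="Nelder-Mead")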

Analysis

The different analyses we perform, as well as the list of datasets we use, are defined in analysis.py. The general setting is defined in LikelihoodAnalysis and its respective methods.

For example, to get $\beta$ estimated by the log-normal model with free $\delta$, together with other statistical information, use

>>> from analysis import LogNormalAnalysis
>>> analysis = LogNormalAnalysis('brazil_aids_2010', required_successes=512)
>>> analysis.beta[0]
>>> analysis.p_value
>>> analysis.bic

You can run the Jupyter Notebook (or Open Notebook in Colab), or run python -m analyze. For example,

MODEL=LogNormalAnalysis ERROR_SAMPLES=10 python -m analyze

runs the LogNormal model with 10 bootstrap samples on the new dataset. It prints the best $\beta$, the bootstrap error for $\beta$, the p-value, and the BIC for the specific model (the script explains how to select the model).
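The error bar comes from a resample-and-refit bootstrap; a minimal sketch of the idea (not the repository's minimize_with_errors, and with a made-up helper fit_beta that returns the estimated exponent for a given sample) is:

import numpy as np

def bootstrap_beta(x, y, fit_beta, n_samples=10, seed=0):
    # Resample cities with replacement, refit the model on each sample,
    # and report the mean and spread of the estimated exponent.
    rng = np.random.default_rng(seed)
    betas = []
    for _ in range(n_samples):
        idx = rng.integers(0, len(x), size=len(x))
        betas.append(fit_beta(x[idx], y[idx]))
    return np.mean(betas), np.std(betas)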

Pre-computed results are stored in _results. To reproduce some of the results stored in _results, delete the respective analysis in that directory and run (this may take some time)

python -m analysis_run

This requires some environment variables, which are documented when you run it.

References

This repository contains both data and code from the papers:

[1] Is this scaling non-linear? by Jorge C. Leitão, José M. Miotto, Martin Gerlach, and Eduardo G. Altmann, Royal Society Open Science 3, 150649 (2016). | See Notebook | Open Notebook in Colab

[2] Spatial interactions in urban scaling laws, by Eduardo G. Altmann, PLOS ONE 15, e0243390 (2020). | See Notebook | Open Notebook in Colab

and also results for COVID-19 data obtained by Jimena Espinoza in Semester 2, 2021. | See Notebook | Open Notebook in Colab.

Contributions with data and models are welcome. If results of this repository are used, please cite the corresponding publications as well.