Develop (#132)
* deprecate py2.7
* Multiprocess (#130)
bstabler committed Apr 23, 2021
1 parent e990822 commit b664d22
Showing 52 changed files with 622 additions and 289 deletions.
1 change: 1 addition & 0 deletions .gitignore
@@ -1,6 +1,7 @@
sandbox/
regress/
example_test_no_integerizing/
example_mtc/
.idea
.ipynb_checkpoints

15 changes: 9 additions & 6 deletions .travis.yml
@@ -3,24 +3,27 @@ language: python
sudo: false

python:
- '2.7'
- '3.6'
- '3.7'
- '3.8'

install:
- wget http://repo.continuum.io/miniconda/Miniconda-3.7.0-Linux-x86_64.sh -O miniconda.sh
- wget https://repo.continuum.io/miniconda/Miniconda3-latest-Linux-x86_64.sh -O miniconda.sh
- bash miniconda.sh -b -p $HOME/miniconda
- export PATH="$HOME/miniconda/bin:$PATH"
- source "$HOME/miniconda/etc/profile.d/conda.sh"
- hash -r
- conda config --set always_yes yes --set changeps1 no
- conda update -q conda
- conda create -q -n test-environment python=$TRAVIS_PYTHON_VERSION future
- source activate test-environment
- conda info -a
- conda create -q -n test-environment python=$TRAVIS_PYTHON_VERSION
- conda activate test-environment
- conda install pytest pytest-cov coveralls pycodestyle
- pip install .
- pip freeze

script:
- pycodestyle populationsim
- py.test --cov populationsim --cov-report term-missing

after_success:
- coveralls
# Build docs
2 changes: 2 additions & 0 deletions LICENSE.txt
@@ -1,3 +1,5 @@
BSD 3-Clause License

PopulationSim
Contributions Copyright (C) by the contributing authors

1 change: 0 additions & 1 deletion MANIFEST.in
@@ -1,5 +1,4 @@
include ez_setup.py
include README.rst
graft example_calm
graft example_calm_repop
graft example_survey_weighting
189 changes: 127 additions & 62 deletions docs/application_configuration.rst

Large diffs are not rendered by default.

6 changes: 3 additions & 3 deletions docs/conf.py
@@ -20,9 +20,9 @@
# -- Get Package Version --------------------------------------------------
with open("../setup.py") as file:
lines = file.readlines()
for l in lines:
if "version" in l:
VERSION = l.replace("version='", "").replace("',", "").replace(" ", "")
for line in lines:
if "version" in line:
VERSION = line.replace("version='", "").replace("',", "").replace(" ", "")
print("package version: " + VERSION)

# If extensions (or modules to document with autodoc) are in another directory,
51 changes: 39 additions & 12 deletions docs/getting_started.rst
@@ -27,34 +27,35 @@ Installation

::

conda create -n popsim python=3.7
conda create -n popsim python=3.8

#Windows
# Windows
activate popsim

#Mac
# Mac
conda activate popsim

4. Get and install the PopulationSim package on the activated conda Python environment:

::

# best to use the conda version of pytables for consistency with activitysim
conda install pytables

pip install populationsim


.. _anaconda_notes :
.. _activitysim :

Python 2 or 3?
~~~~~~~~~~~~~~~
ActivitySim
~~~~~~~~~~~

.. note::

PopulationSim is a 64bit Python 2 or 3 library that uses a number of packages from the
PopulationSim is a 64bit Python 3 library that uses a number of packages from the
scientific Python ecosystem, most notably `pandas <http://pandas.pydata.org>`__
and `numpy <http://numpy.org>`__. It relies heavily on the
`ActivitySim <https://activitysim.github.io>`__ package. Both ActivitySim and PopulationSim
currently support Python 2, but Python 2 will be `retired <https://pythonclock.org/>`__ at the
end of 2019 so Python 3 is recommended.
and `numpy <http://numpy.org>`__. It also relies heavily on the
`ActivitySim <https://activitysim.github.io>`__ package.

The recommended way to get your own scientific Python installation is to
install 64 bit Anaconda, which contains many of the libraries upon which
@@ -67,7 +68,17 @@ Python 2 or 3?
Run Examples
------------

There are three examples for running PopulationSim, two created using data from the Corvallis-Albany-Lebanon Modeling (CALM) region in Oregon and the other using data from the Metro Vancouver region in British Columbia. The `example_calm`_ set-up runs PopulationSim in base mode, where a synthetic population is created for the entire modeling region. This takes approximately 12 minutes on a laptop with an Intel i7-4800MQ CPU @ 2.70GHz and 16 GB of RAM. The `example_calm_repop`_ set-up runs PopulationSim in the *repop* mode, which updates the synthetic population for a small part of the region. The `example_survey_weighting`_ set-up runs PopulationSim for the case of developing final weights for a household travel survey. More information on the configuration of PopulationSim can be found in the **Application & Configuration** section.
There are four examples for running PopulationSim, three created using data from the
Corvallis-Albany-Lebanon Modeling (CALM) region in Oregon and the other using data from
the Metro Vancouver region in British Columbia.

1. The `example_calm`_ set-up runs PopulationSim single-processed, creating a synthetic population for the entire modeling region.

2. The `example_calm_mp`_ set-up runs PopulationSim `multi-processed <http://docs.python.org/3/library/multiprocessing.html>`_, where a synthetic population is created for the entire modeling region by simultaneously balancing results using multiple processors on your computer, thereby reducing runtime.

3. The `example_calm_repop`_ set-up runs PopulationSim in the *repop* mode, which updates the synthetic population for a small part of the region.

4. The `example_survey_weighting`_ set-up runs PopulationSim for the case of developing final weights for a household travel survey. More information on the configuration of PopulationSim can be found in the **Application & Configuration** section.

Example_calm
~~~~~~~~~~~~
@@ -84,6 +95,22 @@ Follow the steps below to run **example_calm** set up:

* Review the outputs in the *output* folder

Example_calm_mp
~~~~~~~~~~~~~~~

Follow the steps below to run **example_calm_mp** multiprocessed set up:

* Open a command prompt in the example_calm folder
* In ``configs_mp\settings.yaml``, change ``num_processes: 2`` to a reasonable number of processes for your machine
* Run the following commands:

::

activate popsim
python run_populationsim.py -c configs_mp -c configs

* Review the outputs in the *output* folder
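One way to pick a value for ``num_processes`` is to ask Python how many logical CPUs the machine reports and leave one free for the operating system. The sketch below is only an illustration: ``suggest_num_processes`` is a hypothetical helper, not part of PopulationSim or ActivitySim, and the right setting also depends on available memory.

```python
import multiprocessing


def suggest_num_processes(reserve=1, upper_limit=None):
    """Suggest a num_processes value, leaving `reserve` logical CPUs free.

    Hypothetical helper for illustration; not a PopulationSim function.
    """
    available = multiprocessing.cpu_count()  # logical CPUs reported by the OS
    suggestion = max(1, available - reserve)  # never suggest fewer than one
    if upper_limit is not None:
        suggestion = min(suggestion, upper_limit)
    return suggestion


if __name__ == "__main__":
    # Print a line that can be copied into the settings file by hand.
    print("num_processes:", suggest_num_processes())
```

The printed value is a starting point; it may need to be lowered if each process uses a lot of memory.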

Example_calm_repop
~~~~~~~~~~~~~~~~~~

75 changes: 43 additions & 32 deletions docs/index.rst
@@ -5,64 +5,75 @@
Introduction
=============

PopulationSim is an open platform for population synthesis and survey weighting. It emerged from Oregon DOT's desire to
build a shared, open, platform that could be easily adapted for statewide, regional, and urban
transportation planning needs.
PopulationSim is an open platform for population synthesis and survey weighting. It emerged from
`Oregon DOT <https://www.oregon.gov/odot>`_'s desire to build a shared, open platform that could
be easily adapted for statewide, regional, and urban transportation planning needs.

What is population synthesis?
-----------------------------
Activity Based Models (ABMs) operate in a micro-simulation framework , wherein the travel choices of person and household decision-making agents are predicted by applying Monte Carlo methods to behavioral models. This requires a data set of households and persons representing the entire population in the modeling region. Population synthesis refers to the process used to create this data.

The required inputs to population synthesis are a population sample and marginal distributions. The population
sample is commonly referred to as the *seed or reference sample* and the marginal distributions are referred to
as *controls or targets*. **The process of expanding the seed sample to match the marginal distribution
is termed population synthesis.** The software tool which implements this population synthesis process
Activity based travel demand models such as `ActivitySim <http://www.activitysim.org>`_ operate at an individual
level, wherein the travel choices of person and household decision-making agents are predicted by applying
Monte Carlo methods to behavioral models. This requires a data set of households and persons representing
the entire population in the modeling region. Population synthesis refers to the process used to create this data.

The required inputs to population synthesis are a population sample and marginal distributions (or control totals).
The population sample is commonly referred to as the *seed or reference sample* and the marginal distributions are
commonly referred to as *controls or targets*. **The process of expanding the seed sample to match the marginal
distribution is termed population synthesis.** The software tool which implements this population synthesis process
is termed as a **Population Synthesizer**.

What does a Population Synthesizer produce?
-------------------------------------------
The objective of a population synthesizer is to generate a synthetic population for
a modeling region. The main outputs from a population synthesizer include lists of persons and households
representing the entire population of the modeling region. These databases include household and person-level
attributes of interest. Examples of attributes at the household level include household income, household size, housing type, and number of vehicles. Examples of person attributes include
a modeling region. The main outputs from a population synthesizer include tables of persons and households
representing the entire population of the modeling region. These tables also include household and person-level
attributes of interest. Examples of attributes at the household level include household income, household size, housing
type, and number of vehicles. Examples of person attributes include
age, gender, work/school status, and occupation. Depending on the use case, a population synthesizer may also
produce multi-way distribution of demographic variables at different geographies to be used as an input
to aggregate travel models. In the case of PopulationSim specifically, an additional option is also included to
modify an existing regional synthetic population for a smaller geographical area. In this case, the outputs are a modified list of persons and households.
to aggregate (four-step) travel models. In the case of PopulationSim specifically, an additional option is also included to
modify an existing regional synthetic population for a smaller geographical area. In this case, the outputs are a modified
set of persons and households.

How does a population synthesizer work?
---------------------------------------
The main inputs to a population synthesizer are disaggregate population samples and marginal control
distributions. In the United States, the disaggregate population sample is typically obtained from the Census Public Use Microdata Sample (PUMS), but other sources, such as a household travel survey, can also be used. The seed sample should
include demographic variables corresponding to each marginal control termed as *controlled variables* (e.g.,
household size, household income, etc.). The seed sample could also include other variables of interest but not
necessarily controlled via marginal controls. These are termed as *uncontrolled variables*. The seed sample should also include an initial weight on each household record.

Base-year marginal distributions of person and household-level attributes of interest are available from Census. For future years, marginal distributions are either held constant, or forecasted. Marginal distributions can be for both household or person level variables and are specified at a specific geography (e.g., Block Groups, Traffic Analysis Zone or County). PopulationSim allows controls to be specified at multiple geographic levels.

The objective of a population synthesizer is to
generate household weights which satisfies the marginal control distributions. This is achieved by use of
a data fitting technique. The most common fitting technique used by various population synthesizers is the
Iterative Proportional Fitting (IPF) procedure. Generally, the IPF procedure is used to obtain joint distributions of demographic
variables. Then, random sampling from PUMS generates the baseline synthetic population.
distributions. In the United States, the disaggregate population sample is typically obtained from the `Census Public Use
Microdata Sample (PUMS) <https://www.census.gov/programs-surveys/acs/microdata.html>`_, but other sources, such as a household
travel survey, can also be used. The seed sample should include demographic variables corresponding to each marginal control
termed as *controlled variables* (e.g., household size, household income, etc.). The seed sample could also include other
variables of interest but not necessarily controlled via marginal controls. These are termed as *uncontrolled variables*.
The seed sample should also include an initial weight on each household record.

Current year marginal distributions of person and household-level attributes of interest are available from Census. For
future years, marginal distributions are either held constant, or forecasted. Marginal distributions can be for both
household or person level variables and are specified at a specific geography (e.g., Block Groups, Traffic Analysis Zone
or County). PopulationSim allows controls to be specified at multiple geographic levels.

The objective of a population synthesizer is to generate household weights that satisfy the marginal control
distributions. This is achieved by use of a data fitting technique. The most common fitting technique used by various
population synthesizers is the Iterative Proportional Fitting (IPF) procedure. Generally, the IPF procedure is used
to obtain joint distributions of demographic variables. Then, random sampling from PUMS generates the baseline synthetic
population.
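The IPF procedure described above can be sketched for the simplest case, a two-way table with one set of row targets and one set of column targets. This is a generic textbook illustration, not PopulationSim code; the function name ``ipf_2d``, the example numbers, and the assumption that every seed cell is strictly positive are all made up for the sketch.

```python
import numpy as np


def ipf_2d(seed, row_targets, col_targets, max_iter=100, tol=1e-9):
    """Scale a 2-D seed table until its row and column sums match the targets.

    Assumes all seed cells are strictly positive and that both target
    vectors add up to the same total.
    """
    table = np.asarray(seed, dtype=float).copy()
    row_targets = np.asarray(row_targets, dtype=float)
    col_targets = np.asarray(col_targets, dtype=float)
    for _ in range(max_iter):
        # Row step: scale each row so that it sums to its target.
        table *= row_targets[:, None] / table.sum(axis=1, keepdims=True)
        # Column step: scale each column so that it sums to its target.
        table *= col_targets[None, :] / table.sum(axis=0, keepdims=True)
        # Columns now match exactly; stop once the rows also match.
        if np.allclose(table.sum(axis=1), row_targets, atol=tol):
            break
    return table


# Illustrative seed: household size (rows) by income group (columns).
seed = [[10.0, 20.0], [30.0, 40.0]]
fitted = ipf_2d(seed, row_targets=[40.0, 60.0], col_targets=[45.0, 55.0])
print(fitted.sum(axis=1), fitted.sum(axis=0))
```

The fitted table keeps the association pattern of the seed cells while matching both sets of marginal totals.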

One of the limitations of the simple IPF method is that it does not incorporate both household and person
level attributes simultaneously. Some population synthesizers use a heuristic algorithm called the
Iterative Proportional Updating Algorithm (IPU) to incorporate both person and household-level variables in the fitting procedure.

Besides IPF, entropy
maximization algorithms have been used as a fitting technique. In most of the entropy based methods,
Besides IPF, entropy maximization algorithms have been used as a fitting technique. In most of the entropy based methods,
the relative entropy is used as the objective function. The relative entropy based optimization ensures
that the least amount of new information is introduced in finding a feasible solution. The base entropy
is defined by the initial weights in the seed sample. The weights generated by the entropy maximization
algorithm preserve the distribution of initial weights while matching the marginal controls. This is an
advantage of the entropy maximization based procedures over the IPF based procedures. PopulationSim uses the entropy maximization based list balancing to match controls specified at various geographic levels.
advantage of the entropy maximization based procedures over the IPF based procedures. PopulationSim uses the entropy maximization
based list balancing to match controls specified at various geographic levels.
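As a sketch of the idea, relative-entropy balancing is commonly written as the optimization problem below. The notation is assumed for illustration and is not taken from the PopulationSim source: ``w_i`` is the final weight of seed household ``i``, ``w_i^0`` is its initial weight, ``a_ij`` is the contribution of household ``i`` to control ``j``, and ``c_j`` is the target for control ``j``.

```latex
\min_{w \ge 0} \; \sum_{i} w_i \ln\!\left(\frac{w_i}{w_i^{0}}\right)
\quad \text{subject to} \quad
\sum_{i} a_{ij}\, w_i = c_j \quad \text{for every control } j
```

Among all weight sets that meet the controls, this formulation selects the one closest to the initial weights in the relative-entropy sense.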

Once the final weights
have been assigned, seed sample is expanded using these weights to generate a synthetic population. Most
Once the final weights have been assigned, the seed sample is expanded using these weights to generate a synthetic population. Most
population synthesizers create distributions using final weights and employ random sampling to expand the
seed sample. PopulationSim uses Linear Programming to convert the final weights to integer values and expands
the seed sample using these integer weights. For detailed description of PopulationSim algorithm, please refer to the TRB paper link in the :ref:`docs` section. For information on software implementation refer to :ref:`core_components` and :ref:`model_steps`. To learn more about PopulationSim application and configuration, please follow the content index below.
the seed sample using these integer weights. For a detailed description of the PopulationSim algorithm, please refer to the TRB paper
link in the :ref:`docs` section. For information on software implementation refer to :ref:`core_components` and :ref:`model_steps`. To
learn more about PopulationSim application and configuration, please follow the content index below.
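The expansion step can be illustrated with a few lines of pandas: once each seed household has an integer weight, the household record is repeated that many times. This sketch covers only the expansion, not the Linear Programming integerization, and the names ``expand_households``, ``integer_weight``, ``hh_id`` and ``hh_size`` are hypothetical rather than PopulationSim's own.

```python
import pandas as pd


def expand_households(seed, weight_col="integer_weight"):
    """Repeat each seed household row according to its integer weight."""
    counts = seed[weight_col].astype(int).to_numpy()
    # Index.repeat duplicates each index label counts[i] times; .loc then
    # pulls the matching rows, so a weight of 0 drops the household.
    expanded = seed.loc[seed.index.repeat(counts)].drop(columns=weight_col)
    return expanded.reset_index(drop=True)


# Illustrative seed sample with made-up integer weights.
seed_households = pd.DataFrame(
    {
        "hh_id": [1, 2, 3],
        "hh_size": [1, 3, 4],
        "integer_weight": [2, 0, 3],
    }
)
synthetic_households = expand_households(seed_households)
print(synthetic_households)
```

The resulting table has one row per synthetic household, so its length equals the sum of the integer weights.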

How does population synthesis work for survey weighting?
--------------------------------------------------------
10 changes: 6 additions & 4 deletions docs/software.rst
@@ -9,8 +9,8 @@ This page describes the PopulationSim software implementation and how to contrib

The implementation starts with
the ActivitySim framework, which serves as the foundation for the software. The framework, as briefly described
below, includes features for data pipeline management, expression handling, testing, etc. Built upon the
framework are additional core components for population synthesis such as balancers and integerizers.
below, includes features for data pipeline management, expression handling, multiprocessing, testing, etc. Built upon
the framework are additional core components for population synthesis such as balancers and integerizers.
Built upon the population synthesis core components are the model steps that make up a PopulationSim run,
such as the inputs pre-processor, setting up the data structures, doing the initial seed balancing, etc.

@@ -42,7 +42,8 @@ being implemented in the ActivitySim framework means:
* Model Orchestrator

* `ORCA <https://github.com/UDST/orca>`__ is used for running the overall model system and for defining dynamic data tables, columns, and injectables (functions). ActivitySim wraps ORCA functionality to make a Data Pipeline tool, which allows for re-starting at any model step.

* Support for `multiprocessing <http://docs.python.org/3/library/multiprocessing.html>`_ to reduce runtime

* Expressions

* Model expressions are in CSV files and contain Python expressions, mainly pandas/numpy expressions that operate on the input data tables. This helps to avoid modifying Python code when making changes to the model calculations.
@@ -236,4 +237,5 @@ Release Notes
* v0.4 - transfer to ActivitySim.org
* v0.4.1 - package updates
* v0.4.2 - validation script in Python
* v0.4.3 - allow non-binary incidence
* v0.4.3 - allow non-binary incidence
* v0.5 - support for multiprocessing
