publish v0.5.1 (#141)

* update package version number as well
* Allow non-binary incidence (#123)
* Allow non-binary incidence
* style
* update tests to pass
* add some progress indication
* tidy up validation script, use histogram for a histogram
* fix render and some typos
* increment version
* deprecate py2.7
* Multiprocess (#130)
* [Bugfix] Allow seed and meta geography to be the same (#139)
* Fixes bug where if the seed geography is the same as the meta_geography, pandas has a small panic attack and the run will fail.
* add cytoolz to the "requirements"
* fix another activitysim change
* Absolute bounds (#136)
* adding upper/lower bounds to weighting use case
* #137, #134, #133, #131

Co-authored-by: Jamie Cook <jamie.cook@veitchlister.com.au>
Co-authored-by: Blake Rosenthal <blake.rosenthal@rsginc.com>
Co-authored-by: Ben Stabler <bstabler@users.noreply.github.com>
Co-authored-by: Leah Flake <leah.flake@rsginc.com>
ActivitySim · Aug 26, 2021 · 47ece66
1 parent b664d22

Showing 12 changed files with 70 additions and 48 deletions.
diff --git a/.travis.yml b/.travis.yml
@@ -16,7 +16,7 @@ install:
 - conda info -a
 - conda create -q -n test-environment python=$TRAVIS_PYTHON_VERSION
 - conda activate test-environment
-- conda install pytest pytest-cov coveralls pycodestyle
+- conda install pytest pytest-cov coveralls pycodestyle cytoolz
 - pip install .
 - pip freeze
 

diff --git a/docs/application_configuration.rst b/docs/application_configuration.rst
@@ -320,7 +320,7 @@ These settings control the functionality of the PopulationSim algorithm. The set
 |                                      |            | The maximum expansion factor may have to be adjusted upwards if the target |br| |
 |                                      |            | is much greater than the seed number of households.                        |br| |
 +--------------------------------------+------------+---------------------------------------------------------------------------------+
-| MAX_BALANCE_ITERATIONS_SIMULTANEOUS  | Integer    | Number of simultaneous list balancer iterations                                 |
+| MAX_BALANCE_ITERATIONS_SIMULTANEOUS  | Integer    | Number of list balancer iterations.  The default may be more than is needed.    |
 +--------------------------------------+------------+---------------------------------------------------------------------------------+
 
 
@@ -693,7 +693,7 @@ This sections describes the settings that are configured differently for the *re
 
 **Input Data Tables for repop mode**
 
-The repop mode runs over an existing synthetic population and uses the data pipeline (HDF5 file) from the regular run as an input. User should copy the HDF5 file from the regular outputs to the *output* folder of the repop set up. The data input which needs to be specified in this setting is the control data for the subset of geographies to be modified. Input tables for the repop mode can be specified in the same manner as regular mode. However, only one geography can be controlled. In the example below, TAZ controls are specified. The controls specified in TAZ_control_data do not have to be consistent with the controls specified in the data used to control the initial population. Only those geographic units to be repopulated should be specified in the control data (for example, TAZs 314 through 317).
+The repop mode runs over an existing synthetic population and uses the data pipeline (HDF5 file) from the regular run as an input. Users should copy the HDF5 file from the regular outputs to the *output* folder of the repop setup. The data input which needs to be specified in this setting is the control data for the subset of geographies to be modified. Input tables for the repop mode can be specified in the same manner as regular mode. However, only one geography can be controlled, and it must be the lowest geography in the "geographies" setting. In the example below, TAZ controls are specified. The controls specified in TAZ_control_data do not have to be consistent with the controls specified in the data used to control the initial population. Only those geographic units to be repopulated should be specified in the control data (for example, TAZs 314 through 317).
 
 ::
 
@@ -713,6 +713,7 @@ The repop mode runs over an existing synthetic population and uses the data pipe
 | Attribute                 | Description                                                 |
 +===========================+=============================================================+
 | repop_control_file_name   | Name of the CSV control specification file for repop mode   |
+|                           | Must include total_hh_control field                         |
 +---------------------------+-------------------------------------------------------------+
 
 

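The two repop constraints documented above (a single controlled geography that is the lowest entry in "geographies", and a control file that includes the total household control) could be checked before a run along these lines. This is only a sketch: `check_repop_controls` is a hypothetical helper, not part of PopulationSim, and it assumes the control specification has `target` and `geography` columns, that `geographies` is ordered from highest to lowest, and that the total household control appears as a `target` value.

```python
import pandas as pd


def check_repop_controls(control_spec, geographies, total_hh_control):
    """Hypothetical pre-flight check; returns a list of problems (empty if none).

    Assumptions (not confirmed by the docs above): control_spec has 'target'
    and 'geography' columns, and geographies is ordered highest to lowest.
    """
    problems = []

    # Only one geography may be controlled in repop mode.
    controlled = control_spec["geography"].unique().tolist()
    if len(controlled) != 1:
        problems.append(f"expected one controlled geography, found {controlled}")
    # That geography must be the lowest one in the 'geographies' setting.
    elif controlled[0] != geographies[-1]:
        problems.append(
            f"{controlled[0]} is not the lowest geography ({geographies[-1]})"
        )

    # The control file must include the total household control.
    if total_hh_control not in control_spec["target"].tolist():
        problems.append(f"missing total household control '{total_hh_control}'")

    return problems


geographies = ["REGION", "PUMA", "TRACT", "TAZ"]
good_spec = pd.DataFrame(
    {"target": ["num_hh", "hh_size_1"], "geography": ["TAZ", "TAZ"]}
)
bad_spec = pd.DataFrame({"target": ["hh_size_1"], "geography": ["TRACT"]})

print(check_repop_controls(good_spec, geographies, "num_hh"))
print(check_repop_controls(bad_spec, geographies, "num_hh"))
```

Returning a list of problems rather than raising on the first one lets a user fix every issue in the control file in a single pass.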
diff --git a/docs/getting_started.rst b/docs/getting_started.rst
@@ -12,7 +12,13 @@ This page describes how to install and run PopulationSim with the provided examp
 Installation
 ------------
 
-1. Install `Anaconda 64bit Python 3 <https://www.anaconda.com/distribution/>`__. Anaconda Python is required for PopulationSim.
+1. It is recommended that you install and use a *conda* package manager
+for your system. One easy way to do so is by using `Anaconda 64bit Python 3 <https://www.anaconda.com/distribution/>`__,
+although you should consult the `terms of service <https://www.anaconda.com/terms-of-service>`__
+for this product and ensure you qualify (as of summer 2021, businesses and
+governments with over 200 employees do not qualify for free usage).  If you prefer
+a completely free open source *conda* tool, you can download and install the
+appropriate version of `Miniforge <https://github.com/conda-forge/miniforge#miniforge3>`__.
 
 2. If you access the internet from behind a firewall, then you will need to configure your proxy server. To do so, create a .condarc file in your Anaconda installation folder (i.e. ``C:\ProgramData\Anaconda3``), such as:
 
@@ -62,7 +68,7 @@ ActivitySim
   ActivitySim depends + some handy Python installation management tools.
 
   For more information on Anaconda and ActivitySim, see ActivitySim's `getting started
-  <https://activitysim.github.io/activitysim/gettingstarted.html#anaconda>`__ guide.
+  <https://activitysim.github.io/activitysim/gettingstarted.html>`__ guide.
 
 
 Run Examples

diff --git a/docs/software.rst b/docs/software.rst
@@ -224,18 +224,3 @@ Contribution Guidelines
 
 PopulationSim development follows the same `development guidelines <https://activitysim.github.io/activitysim/development.html>`__ as ActivitySim.
 
-
-Release Notes
--------------
-
-  * v0.3 - first release
-  * v0.3.1 - allow zones with zero households
-  * v0.3.2 - fix bug in mult-integerizer with total_hh_parent_control_index
-  * v0.3.3 - add disgnostic printouts on assert fail in mult_integerizer
-  * v0.3.4 - add survey weighting use case
-  * v0.3.5 - add Python 3.5+ support
-  * v0.4 - transfer to ActivitySim.org
-  * v0.4.1 - package updates
-  * v0.4.2 - validation script in Python
-  * v0.4.3 - allow non-binary incidence 
-  * v0.5 - support for multiprocessing
diff --git a/example_survey_weighting/configs/settings.yaml b/example_survey_weighting/configs/settings.yaml
@@ -18,7 +18,8 @@ USE_SIMUL_INTEGERIZER: True
 USE_CVXPY: False
 max_expansion_factor: 4 # Default is 30
 min_expansion_factor: 0.5
-
+absolute_upper_bound: 20000
+absolute_lower_bound: 1
 
 # Geographic Settings
 # ------------------------------------------------------------------

diff --git a/populationsim/balancer.py b/populationsim/balancer.py
@@ -242,6 +242,7 @@ def np_balancer(
 def do_balancing(control_spec,
                  total_hh_control_col,
                  max_expansion_factor, min_expansion_factor,
+                 absolute_upper_bound, absolute_lower_bound,
                  incidence_df, control_totals, initial_weights):
 
     # incidence table should only have control columns
@@ -262,14 +263,21 @@ def do_balancing(control_spec,
 
     if min_expansion_factor:
 
-        # number_of_households in this seed geograpy as specified in seed_controlss
+        # number_of_households in this seed geography as specified in seed_controls
         number_of_households = control_totals[total_hh_control_index]
 
         total_weights = initial_weights.sum()
         lb_ratio = min_expansion_factor * float(number_of_households) / float(total_weights)
 
         lb_weights = initial_weights * lb_ratio
-        lb_weights = lb_weights.clip(lower=0)
+
+        if absolute_lower_bound:
+            lb_weights = lb_weights.clip(lower=absolute_lower_bound)
+        else:
+            lb_weights = lb_weights.clip(lower=0)
+
+    elif absolute_lower_bound:
+        lb_weights = initial_weights.clip(lower=absolute_lower_bound)
 
     else:
         lb_weights = None
@@ -283,7 +291,14 @@ def do_balancing(control_spec,
         ub_ratio = max_expansion_factor * float(number_of_households) / float(total_weights)
 
         ub_weights = initial_weights * ub_ratio
-        ub_weights = ub_weights.round().clip(lower=1).astype(int)
+
+        if absolute_upper_bound:
+            ub_weights = ub_weights.round().clip(upper=absolute_upper_bound, lower=1).astype(int)
+        else:
+            ub_weights = ub_weights.round().clip(lower=1).astype(int)
+
+    elif absolute_upper_bound:
+        ub_weights = initial_weights.round().clip(upper=absolute_upper_bound, lower=1).astype(int)
 
     else:
         ub_weights = None

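To illustrate how the absolute bounds interact with the expansion factors in the hunk above, here is a minimal standalone sketch. `weight_bounds` is a hypothetical helper written for this note, not a function in `balancer.py`. It assumes that the branch taken when only an absolute upper bound is set starts from `initial_weights`, mirroring the lower-bound branch.

```python
import pandas as pd


def weight_bounds(initial_weights, number_of_households,
                  max_expansion_factor=None, min_expansion_factor=None,
                  absolute_upper_bound=None, absolute_lower_bound=None):
    """Hypothetical helper: return (lb_weights, ub_weights); either may be None."""
    total_weights = initial_weights.sum()

    if min_expansion_factor:
        lb_ratio = min_expansion_factor * float(number_of_households) / float(total_weights)
        # The absolute lower bound, when set, replaces the default floor of 0.
        lb_weights = (initial_weights * lb_ratio).clip(lower=absolute_lower_bound or 0)
    elif absolute_lower_bound:
        lb_weights = initial_weights.clip(lower=absolute_lower_bound)
    else:
        lb_weights = None

    if max_expansion_factor:
        ub_ratio = max_expansion_factor * float(number_of_households) / float(total_weights)
        # clip(upper=None) leaves the upper side unbounded.
        ub_weights = (initial_weights * ub_ratio).round() \
            .clip(lower=1, upper=absolute_upper_bound).astype(int)
    elif absolute_upper_bound:
        # Assumption: start from the initial weights, as the lower-bound branch does.
        ub_weights = initial_weights.round() \
            .clip(lower=1, upper=absolute_upper_bound).astype(int)
    else:
        ub_weights = None

    return lb_weights, ub_weights


weights = pd.Series([10.0, 20.0, 70.0])
lb, ub = weight_bounds(weights, number_of_households=200,
                       max_expansion_factor=4, min_expansion_factor=0.5,
                       absolute_upper_bound=300, absolute_lower_bound=15)
print(lb.tolist())  # [15.0, 20.0, 70.0]
print(ub.tolist())  # [80, 160, 300]
```

In this example the expansion factors alone would give upper bounds of 80, 160 and 560. The absolute upper bound caps the last one at 300, and the absolute lower bound lifts the first lower bound from 10 to 15.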
diff --git a/populationsim/steps/final_seed_balancing.py b/populationsim/steps/final_seed_balancing.py
@@ -68,6 +68,8 @@ def final_seed_balancing(settings, crosswalk, control_spec, incidence_table):
 
     max_expansion_factor = settings.get('max_expansion_factor', None)
     min_expansion_factor = settings.get('min_expansion_factor', None)
+    absolute_upper_bound = settings.get('absolute_upper_bound', None)
+    absolute_lower_bound = settings.get('absolute_lower_bound', None)
 
     relaxation_factors = pd.DataFrame(index=seed_controls_df.columns.tolist())
 
@@ -86,6 +88,8 @@ def final_seed_balancing(settings, crosswalk, control_spec, incidence_table):
             total_hh_control_col=total_hh_control_col,
             max_expansion_factor=max_expansion_factor,
             min_expansion_factor=min_expansion_factor,
+            absolute_lower_bound=absolute_lower_bound,
+            absolute_upper_bound=absolute_upper_bound,
             incidence_df=seed_incidence_df,
             control_totals=seed_controls_df.loc[seed_id],
             initial_weights=seed_incidence_df['sample_weight'])

diff --git a/populationsim/steps/initial_seed_balancing.py b/populationsim/steps/initial_seed_balancing.py
@@ -65,6 +65,8 @@ def initial_seed_balancing(settings, crosswalk, control_spec, incidence_table):
 
     max_expansion_factor = settings.get('max_expansion_factor', None)
     min_expansion_factor = settings.get('min_expansion_factor', None)
+    absolute_upper_bound = settings.get('absolute_upper_bound', None)
+    absolute_lower_bound = settings.get('absolute_lower_bound', None)
 
     # run balancer for each seed geography
     weight_list = []
@@ -82,6 +84,8 @@ def initial_seed_balancing(settings, crosswalk, control_spec, incidence_table):
             total_hh_control_col=total_hh_control_col,
             max_expansion_factor=max_expansion_factor,
             min_expansion_factor=min_expansion_factor,
+            absolute_upper_bound=absolute_upper_bound,
+            absolute_lower_bound=absolute_lower_bound,
             incidence_df=seed_incidence_df,
             control_totals=seed_controls_df.loc[seed_id],
             initial_weights=seed_incidence_df['sample_weight'])

diff --git a/populationsim/steps/repop_balancing.py b/populationsim/steps/repop_balancing.py
@@ -60,6 +60,8 @@ def repop_balancing(settings, crosswalk, control_spec, incidence_table):
 
     max_expansion_factor = settings.get('max_expansion_factor', None)
     min_expansion_factor = settings.get('min_expansion_factor', None)
+    absolute_upper_bound = settings.get('absolute_upper_bound', None)
+    absolute_lower_bound = settings.get('absolute_lower_bound', None)
 
     # run balancer for each low geography
     low_weight_list = []
@@ -101,6 +103,8 @@ def repop_balancing(settings, crosswalk, control_spec, incidence_table):
                 total_hh_control_col=total_hh_control_col,
                 max_expansion_factor=max_expansion_factor,
                 min_expansion_factor=min_expansion_factor,
+                absolute_upper_bound=absolute_upper_bound,
+                absolute_lower_bound=absolute_lower_bound,
                 incidence_df=seed_incidence_df,
                 control_totals=low_controls_df.loc[low_id],
                 initial_weights=initial_weights)

diff --git a/populationsim/steps/setup_data_structures.py b/populationsim/steps/setup_data_structures.py
@@ -111,11 +111,11 @@ def add_geography_columns(incidence_table, households_df, crosswalk_df):
     # add seed_geography col to incidence table
     incidence_table[seed_geography] = households_df[seed_geography]
 
-    # add meta column to incidence table
-    seed_to_meta = \
-        crosswalk_df[[seed_geography, meta_geography]] \
-        .groupby(seed_geography, as_index=True).min()[meta_geography]
-    incidence_table[meta_geography] = incidence_table[seed_geography].map(seed_to_meta)
+    # add meta column to incidence table (unless it's already there)
+    if seed_geography != meta_geography:
+        tmp = crosswalk_df[list({seed_geography, meta_geography})]
+        seed_to_meta = tmp.groupby(seed_geography, as_index=True).min()[meta_geography]
+        incidence_table[meta_geography] = incidence_table[seed_geography].map(seed_to_meta)
 
     return incidence_table
 

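The guard added above can be exercised in isolation with a small sketch. `add_meta_column` is a hypothetical stand-in for the relevant part of `add_geography_columns`, and the PUMA/REGION/TAZ crosswalk below is made-up illustration data. When the seed and meta geographies share a name, the seed column already serves as the meta column, so the mapping step is skipped.

```python
import pandas as pd


def add_meta_column(incidence_table, crosswalk_df, seed_geography, meta_geography):
    """Hypothetical stand-in: map each seed zone to its meta zone, if they differ."""
    if seed_geography != meta_geography:
        # De-duplicate the column list, as the patched code does.
        tmp = crosswalk_df[list({seed_geography, meta_geography})]
        seed_to_meta = tmp.groupby(seed_geography, as_index=True).min()[meta_geography]
        incidence_table[meta_geography] = incidence_table[seed_geography].map(seed_to_meta)
    return incidence_table


# Made-up crosswalk: two PUMAs, each inside one REGION.
crosswalk = pd.DataFrame({
    "TAZ": [100, 101, 102],
    "PUMA": [1, 1, 2],
    "REGION": [10, 10, 20],
})
incidence = pd.DataFrame({"PUMA": [2, 1, 1]})

distinct = add_meta_column(incidence.copy(), crosswalk, "PUMA", "REGION")
same = add_meta_column(incidence.copy(), crosswalk, "PUMA", "PUMA")

print(distinct["REGION"].tolist())  # [20, 10, 10]
print(same.columns.tolist())        # ['PUMA']
```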
diff --git a/populationsim/tests/run_mp.py b/populationsim/tests/run_mp.py
@@ -17,56 +17,58 @@
 
 def setup_dirs():
 
-    configs_dir = os.path.join(os.path.dirname(__file__), 'configs')
-    mp_configs_dir = os.path.join(os.path.dirname(__file__), 'configs_mp')
+    configs_dir = os.path.join(os.path.dirname(__file__), "configs")
+    mp_configs_dir = os.path.join(os.path.dirname(__file__), "configs_mp")
     inject.add_injectable("configs_dir", [mp_configs_dir, configs_dir])
 
-    output_dir = os.path.join(os.path.dirname(__file__), 'output')
+    output_dir = os.path.join(os.path.dirname(__file__), "output")
     inject.add_injectable("output_dir", output_dir)
 
-    data_dir = os.path.join(os.path.dirname(__file__), 'data')
+    data_dir = os.path.join(os.path.dirname(__file__), "data")
     inject.add_injectable("data_dir", data_dir)
 
     tracing.config_logger()
 
-    tracing.delete_output_files('csv')
-    tracing.delete_output_files('txt')
-    tracing.delete_output_files('yaml')
+    tracing.delete_output_files("csv")
+    tracing.delete_output_files("txt")
+    tracing.delete_output_files("yaml")
 
 
 def regress():
 
-    expanded_household_ids = pipeline.get_table('expanded_household_ids')
+    expanded_household_ids = pipeline.get_table("expanded_household_ids")
     assert isinstance(expanded_household_ids, pd.DataFrame)
-    taz_hh_counts = expanded_household_ids.groupby('TAZ').size()
+    taz_hh_counts = expanded_household_ids.groupby("TAZ").size()
     assert len(taz_hh_counts) == TAZ_COUNT
     assert taz_hh_counts.loc[100] == TAZ_100_HH_COUNT
 
     # output_tables action: skip
-    output_dir = inject.get_injectable('output_dir')
-    assert not os.path.exists(os.path.join(output_dir, 'households.csv'))
-    assert os.path.exists(os.path.join(output_dir, 'summary_DISTRICT_1.csv'))
+    output_dir = inject.get_injectable("output_dir")
+    assert not os.path.exists(os.path.join(output_dir, "households.csv"))
+    assert os.path.exists(os.path.join(output_dir, "summary_DISTRICT_1.csv"))
 
 
 def test_mp_run():
 
     setup_dirs()
 
+    # Debugging ----------------------
     run_list = mp_tasks.get_run_list()
     mp_tasks.print_run_list(run_list)
+    # --------------------------------
 
-    # do this after config.handle_standard_args, as command line args may override injectables
-    injectables = ['data_dir', 'configs_dir', 'output_dir']
+    # do this after config.handle_standard_args, as command line args
+    # may override injectables
+    injectables = ["data_dir", "configs_dir", "output_dir"]
     injectables = {k: inject.get_injectable(k) for k in injectables}
 
-    # pipeline.run(models=run_list['models'], resume_after=run_list['resume_after'])
+    mp_tasks.run_multiprocess(injectables)
 
-    mp_tasks.run_multiprocess(run_list, injectables)
-    pipeline.open_pipeline('_')
+    pipeline.open_pipeline("_")
     regress()
     pipeline.close_pipeline()
 
 
-if __name__ == '__main__':
+if __name__ == "__main__":
 
     test_mp_run()
diff --git a/setup.py b/setup.py
@@ -5,7 +5,7 @@
 
 setup(
     name='populationsim',
-    version='0.5',
+    version='0.5.1',
     description='Population Synthesis',
     author='contributing authors',
     author_email='ben.stabler@rsginc.com',