eli knaap edited this page Jun 14, 2019 · 1 revision

Geosnap Architecture

The Community object

Objectives

The Community object is the central data structure of the library. It needs to:

  • provide a relatively efficient space-time data structure
  • be relatively transparent, meaning that it's easy to decompose and pull out the constituent parts (like the geodataframe of neighborhood boundaries for each time period)
  • a community is a concept
    • it is defined by the particular boundaries that make up the Community at any given time

This would probably be an ideal use for the new dataclass, but that would mean we target only Python >=3.7 (though there's a 3.6 backport)
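
To make the dataclass idea concrete, here is a minimal sketch of what such a container might look like. The field names (`frames`, `harmonized`) and the `time_periods` helper are assumptions for illustration, not the library's actual design, and plain strings stand in for geodataframes so the sketch runs without geopandas.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List


@dataclass
class Community:
    """Hypothetical container: one table of boundaries and attributes per time period."""

    # Maps a time period (e.g. a census year) to that period's table.
    # In practice these would be GeoDataFrames; Any keeps the sketch dependency-free.
    frames: Dict[int, Any] = field(default_factory=dict)
    # True when every period shares a single set of boundaries.
    harmonized: bool = False

    @property
    def time_periods(self) -> List[int]:
        # Sorted list of the periods this Community covers.
        return sorted(self.frames)

    def __getitem__(self, period: int) -> Any:
        # Decomposition: pull out the constituent table for one period.
        return self.frames[period]


community = Community(frames={2000: "tracts_2000", 1990: "tracts_1990"})
```

Indexing with `community[1990]` returns the table for that period, which is one way to satisfy the transparency objective above.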

Structure

For a given Community we need:

  • dataframe(s) of neighborhood-level attributes for >=1 time period
  • geodataframe(s) of neighborhood boundary(s) for >=1 time period
    • For harmonized data, we need only a single set of neighborhood boundaries
    • For unharmonized data we need boundaries for each time period
      • we may also need new, original boundaries so we need a system to register new ones and keep track of which is "active" using a property
      • another property that stores the method of geometry harmonization
        • e.g. a gdf with index as rows and different geoms as cols
      • maybe store the full intersection gdf somewhere if it gets calculated, because it's expensive? - this is a can of worms. not doing this yet.
    • we need to support both, conversion between the two, and an attribute that knows which condition is true
  • (is this all that's absolutely essential?)
    • if so, maybe the solution is just a clever multi index?
  • right now we also store state and county boundaries because why not? we use them for plotting
    • but if that's the only reason we could add a utility inside the plot() function that gets extra geoms as needed
  • do we want to provide a standard location for known aux data we might want to pass to harmonize functions (maybe not attached at init, but can be added with a method)? Don't want to bloat the object...
    • osm
    • nlcd clip
    • no... this is overloading the class. we can attach this stuff during the harmonize process
  • do we repeat geometries in the harmonized case?
    • drawback: storage/mem inefficient
    • bonus: single API for both harmonized and unharmonized case
    • bonus: maintains 1:1 relationship between attributes and geoms (less chance for error on join)
    • bonus: geometry/geospatial queries are much easier
    • bonus: easy to slice/decompose
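
The idea of registering new boundaries and tracking which set is "active" through a property could be sketched as follows. The class name `BoundaryRegistry` and its methods are hypothetical, and strings stand in for geodataframes.

```python
from typing import Any, Dict, Optional


class BoundaryRegistry:
    """Hypothetical registry of named boundary sets, one of which is active."""

    def __init__(self) -> None:
        self._boundaries: Dict[str, Any] = {}
        self._active: Optional[str] = None

    def register(self, name: str, geoms: Any, activate: bool = False) -> None:
        self._boundaries[name] = geoms
        # The first registered set becomes active by default.
        if activate or self._active is None:
            self._active = name

    def activate(self, name: str) -> None:
        # Switch the active set; unknown names are rejected.
        if name not in self._boundaries:
            raise KeyError(name)
        self._active = name

    @property
    def active_name(self) -> Optional[str]:
        return self._active

    @property
    def active(self) -> Any:
        if self._active is None:
            raise ValueError("no boundaries registered")
        return self._boundaries[self._active]


registry = BoundaryRegistry()
registry.register("tracts_2010", "tract geometries")
registry.register("hexgrid", "hexagon geometries")
registry.activate("hexgrid")
```

Keeping the switch behind a property means callers always read `registry.active` and never need to know how many boundary sets exist.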

Building Internal Database

geosnap.data.import_ltdb(full_path, sample_path) - this method reads ltdb data and stores it in a local quilt database

geosnap.data.geolytics_to_quilt(full_path, sample_path)

function that needs to be called first if the data is not present in the local quilt db (hopefully this will remove some of the current ltdb confusion)

Constructor Methods

If we instantiate a Community from certain datasets, the schema is implied (e.g. similar to the way we use source='ltdb' now, but we can simplify and abstract the Community signature while using a @classmethod to encapsulate the ltdb-specific logic)

  • from_ltdb(filter) - instantiate from ltdb prompts for a separate `

  • from_geolytics(filter) - same as above

  • from_census(years, filter) - assumes unharmonized. user specifies time periods, data come from quilt

  • from_lehd(dataset, years, filter) - harmonized or unharmonized depending on selected time periods

  • from_gdfs(gdfs, harmonized=False) - a dict of geodataframes (unharmonized), or a gdf and a dict of {time_period: df} (harmonized)?

  • from_parquet(path) - a previously saved Community?
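
The @classmethod pattern described above might look roughly like this. It is a sketch only: `_load_ltdb_stub` is a made-up placeholder for reading from local storage, the constructor signature is an assumption, and the `filter` parameter name is taken from the list above (it shadows the Python builtin, which a real implementation may want to avoid).

```python
from typing import Any, Dict


def _load_ltdb_stub(filter: Any = None) -> Dict[int, Any]:
    # Hypothetical stand-in for reading LTDB tables from local storage.
    return {1990: "ltdb_1990", 2000: "ltdb_2000"}


class Community:
    """Hypothetical: a generic constructor plus source-specific classmethods."""

    def __init__(self, frames: Dict[int, Any], harmonized: bool = False) -> None:
        self.frames = frames
        self.harmonized = harmonized

    @classmethod
    def from_gdfs(cls, gdfs: Dict[int, Any], harmonized: bool = False) -> "Community":
        # The caller supplies one table per time period directly.
        return cls(frames=dict(gdfs), harmonized=harmonized)

    @classmethod
    def from_ltdb(cls, filter: Any = None) -> "Community":
        # Source-specific logic lives here, so the schema is implied by the source.
        # This sketch assumes LTDB data arrive already harmonized.
        return cls(frames=_load_ltdb_stub(filter), harmonized=True)


ltdb = Community.from_ltdb()
custom = Community.from_gdfs({2010: "tracts_2010"})
```

With this layout the generic `__init__` stays small, and each data source contributes only its own loading logic.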

Attributes

sandiego = GeoDataFrame()
sandiego_community = Community()

sandiego_community.geodataframe

  • harmonized @property (bool)
  • harmonization @property (str), defaults to None (note that harmonization and harmonized are connected)

sandiego_community.geodataframe[1990]

  • census - will be called geodataframe

    • currently this is a single long-form attribute table
    • should be maybe a multiindex or dict like {time_periods: DataFrame}?
    • do all possible community data come from a census of some kind?
  • tracts, counties, states currently hardcoded, should be abstracted and may need one for each time period

    • states and counties go, and only source/target geoms get stored
  • function/method to attach MSA name/ID to Community.geodataframe
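
One possible reading of how the `harmonized` and `harmonization` properties are connected is that the boolean is derived from the string. The sketch below assumes exactly that; the method label `"area_weighted"` is only an illustrative value.

```python
from typing import Optional


class Community:
    """Hypothetical: harmonized is derived from harmonization."""

    def __init__(self, harmonization: Optional[str] = None) -> None:
        self._harmonization = harmonization

    @property
    def harmonization(self) -> Optional[str]:
        # Name of the geometry harmonization method, or None if never harmonized.
        return self._harmonization

    @property
    def harmonized(self) -> bool:
        # The two properties cannot disagree because one is computed from the other.
        return self._harmonization is not None


raw = Community()
harmonized = Community(harmonization="area_weighted")
```

Deriving one from the other avoids storing two flags that could drift out of sync.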

Methods

  • plot()
  • plot_interactive() ?
  • to_parquet(path) store for later use
  • to_crs() convenience for reprojecting all geoms at once
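
The to_crs() convenience could amount to applying each table's own `to_crs` across all time periods. The sketch below assumes the Community stores a dict of {time_period: geodataframe}; `reproject_all` is a hypothetical helper name, and `FakeFrame` is a stand-in so the example runs without geopandas.

```python
from typing import Any, Dict


def reproject_all(frames: Dict[int, Any], crs: Any) -> Dict[int, Any]:
    """Return a new mapping with every period's table reprojected to crs."""
    return {period: gdf.to_crs(crs) for period, gdf in frames.items()}


class FakeFrame:
    """Stand-in exposing a to_crs method that returns a new object."""

    def __init__(self, crs: str) -> None:
        self.crs = crs

    def to_crs(self, crs: str) -> "FakeFrame":
        return FakeFrame(crs)


frames = {1990: FakeFrame("EPSG:4326"), 2000: FakeFrame("EPSG:4326")}
projected = reproject_all(frames, "EPSG:3857")
```

Returning a new mapping leaves the original tables untouched, so a failed reprojection cannot leave the Community half converted.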