mozilla-sec-eia: Developing a linkage between SEC and EIA

This repo contains exploratory development for an SEC-EIA linkage.

Usage

Utility functions for accessing and working with 10-K filings and their Exhibit 21 attachments can be found in 'src/mozilla_sec_eia/utils.py'. The base class is GCSArchive, which provides an interface to filings archived on Google Cloud Storage (GCS). To instantiate this class, the following environment variables must be set, either in the environment or in a .env file:

  • GCS_BUCKET_NAME
  • GCS_METADATA_DB_INSTANCE_CONNECTION
  • GCS_IAM_USER
  • GCS_METADATA_DB_NAME
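Before instantiating the class, it can help to confirm that all four variables are present. The following is a minimal sketch, not part of this repo: the helper name missing_gcs_settings is hypothetical, and it only inspects the process environment (or a mapping you pass in), so values that live only in a .env file will be reported as missing unless they have already been loaded.

```python
import os

# Variable names taken from the list above.
REQUIRED_VARS = (
    "GCS_BUCKET_NAME",
    "GCS_METADATA_DB_INSTANCE_CONNECTION",
    "GCS_IAM_USER",
    "GCS_METADATA_DB_NAME",
)


def missing_gcs_settings(environ=None):
    """Return the required variable names that are unset or empty.

    Hypothetical helper for illustration; `environ` defaults to os.environ.
    """
    env = os.environ if environ is None else environ
    return [name for name in REQUIRED_VARS if not env.get(name)]


if __name__ == "__main__":
    missing = missing_gcs_settings()
    if missing:
        print("Missing settings:", ", ".join(missing))
    else:
        print("All GCS settings are present.")
```

An empty list means every required name has a non-empty value; it says nothing about whether the values are valid.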

This code sample shows how to use the class to fetch filings from the archive:

from mozilla_sec_eia.utils import GCSArchive
archive = GCSArchive()

# Get metadata from postgres instance
metadata_df = archive.get_metadata()

# Do some filtering to get filings of interest
filings = metadata_df.loc[...]  # Select the rows of interest from metadata_df

# This will download and cache filings locally for later use
# Successive calls to get_filings will not re-download filings which are already cached
downloaded_filings = archive.get_filings(filings)

# Get exhibit 21's and extract subsidiary data
for filing in downloaded_filings:
    cool_extraction_model(filing.get_ex_21().as_image())

Labeling

We are using Label Studio to create training data for fine-tuning the Ex. 21 extraction model. The preliminary workflow for labeling data is as follows:

  • For each filing that you want to label, follow notebook 7 to create the inputs for Label Studio. This notebook first creates a PDF of the filing. Then it extracts the bounding boxes around each word and creates a "task" JSON and an image for each Ex. 21 table that will be used in Label Studio.
  • Upload these JSONs and images to the same bucket in GCS (the "unlabeled" bucket by default).
  • Install Label Studio
  • Start Label Studio locally and create a project.
  • Under Settings, set the template/config for the project with the config found in labeling-configs/labeling-config.xml. This should create the correct entity labels and UI setup.
  • Connect GCS to Label Studio by following these directions
  • Specific Label Studio settings: Filter files for only JSONs (these are your tasks). Leave "Treat every bucket object as a source file" disabled. Add the service account authentication JSON for your bucket.
  • Additionally, add a Target Storage bucket (the "labeled" bucket by default).
  • Import data and label Ex. 21 tables.
  • Sync with target storage.
  • Update the labeled_data_tracking.csv with the new filings you've labeled.
  • Run the rename_labeled_filings.py script to update labeled file names in the GCS bucket with their SEC filename.
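
To make the first step more concrete: Label Studio stores rectangle regions as percentages of the image size rather than as pixels, so word bounding boxes extracted in pixel coordinates need converting before they go into a task JSON. The sketch below shows that conversion under stated assumptions. The function name to_label_studio_rect is hypothetical, the from_name and to_name defaults are placeholders that must match whatever the labeling config defines, and the task JSON produced by notebook 7 may be structured differently.

```python
def to_label_studio_rect(box, image_width, image_height,
                         from_name="bbox", to_name="image"):
    """Convert a pixel box (x0, y0, x1, y1) into a Label Studio rectangle.

    Hypothetical helper for illustration. `from_name` and `to_name` are
    placeholder tag names; use the ones from your labeling config.
    """
    x0, y0, x1, y1 = box
    return {
        "original_width": image_width,
        "original_height": image_height,
        "from_name": from_name,
        "to_name": to_name,
        "type": "rectangle",
        "value": {
            # Label Studio expects percentages of the image dimensions.
            "x": 100 * x0 / image_width,
            "y": 100 * y0 / image_height,
            "width": 100 * (x1 - x0) / image_width,
            "height": 100 * (y1 - y0) / image_height,
            "rotation": 0,
        },
    }
```

For example, a box spanning pixels (100, 50) to (300, 150) on a 1000 by 500 image maps to x=10, y=10, width=20, height=20 in percent units.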

About Catalyst Cooperative

Catalyst Cooperative is a small group of data wranglers and policy wonks organized as a worker-owned cooperative consultancy. Our goal is a more just, livable, and sustainable world. We integrate public data and perform custom analyses to inform public policy (Hire us!). Our focus is primarily on mitigating climate change and improving electric utility regulation in the United States.

Contact Us

  • For general support, questions, or other conversations around the project that might be of interest to others, check out the GitHub Discussions
  • If you'd like to get occasional updates about our projects, sign up for our email list.
  • Want to schedule a time to chat with us one-on-one? Join us for Office Hours
  • Follow us on Twitter: @CatalystCoop
  • More info on our website: https://catalyst.coop
  • For private communication about the project or to hire us to provide customized data extraction and analysis, you can email the maintainers: pudl@catalyst.coop