mozilla-sec-eia: Developing a linkage between SEC and EIA

This repo contains exploratory development for an SEC-EIA linkage.

Usage

Utility functions for accessing and working with 10-K filings and their Exhibit 21 attachments can be found in 'src/mozilla_sec_eia/utils.py'. The base class is GCSArchive, which provides an interface to archived filings on GCS. To instantiate this class, the following environment variables must be set, or defined in a .env file:

GCS_BUCKET_NAME
GCS_METADATA_DB_INSTANCE_CONNECTION
GCS_IAM_USER
GCS_METADATA_DB_NAME
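
Before instantiating the class, it can help to confirm that all four variables are present. The following is a minimal sketch, not part of the package: the helper name missing_gcs_settings is hypothetical, and it inspects only the process environment, so values defined only in a .env file will be reported as missing unless they have been loaded first.

```python
import os

# The four variables listed above.
REQUIRED_VARS = (
    "GCS_BUCKET_NAME",
    "GCS_METADATA_DB_INSTANCE_CONNECTION",
    "GCS_IAM_USER",
    "GCS_METADATA_DB_NAME",
)


def missing_gcs_settings(environ=None):
    """Return the required variable names that are unset or empty.

    Hypothetical helper for illustration; pass a dict to check
    something other than the current process environment.
    """
    env = os.environ if environ is None else environ
    return [name for name in REQUIRED_VARS if not env.get(name)]


if __name__ == "__main__":
    missing = missing_gcs_settings()
    if missing:
        print("Missing settings: " + ", ".join(missing))
    else:
        print("All GCS settings are present.")
```

An empty list means every variable has a non-empty value; it does not verify that the bucket or database can actually be reached.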

This code sample shows how to use the class to fetch filings from the archive:

from mozilla_sec_eia.utils import GCSArchive
archive = GCSArchive()

# Get metadata from postgres instance
metadata_df = archive.get_metadata()

# Do some filtering to get filings of interest
filings = metadata_df.loc[...]  # Get rows from original df

# This will download and cache filings locally for later use
# Successive calls to get_filings will not re-download filings that are already cached
downloaded_filings = archive.get_filings(filings)

# Get each filing's Exhibit 21 and extract subsidiary data
# (cool_extraction_model is a placeholder for an extraction model)
for filing in downloaded_filings:
    cool_extraction_model(filing.get_ex_21().as_image())

Labeling

We are using Label Studio to create training data for fine-tuning the Ex. 21 extraction model. The very preliminary workflow for labeling data is as follows:

For each filing that you want to label, follow notebook 7 to create the inputs for Label Studio. This notebook first creates a PDF of the filing. Then, it extracts the bounding boxes around each word and creates a "task" JSON and image for each Ex. 21 table that will be used in Label Studio.
Upload these JSONs and images to the same bucket in GCS (the "unlabeled" bucket by default).
Install Label Studio.
Start Label Studio locally and create a project.
Under Settings, set the template/config for the project with the config found in labeling-configs/labeling-config.xml. This should create the correct entity labels and UI setup.
Connect GCS to Label Studio by following these directions.
Specific Label Studio settings: Filter files for only JSONs (these are your tasks). Leave "Treat every bucket object as a source file" disabled. Add the service account authentication JSON for your bucket.
Additionally, add a Target Storage bucket (the "labeled" bucket by default).
Import data and label Ex. 21 tables.
Sync with target storage.
Update the labeled_data_tracking.csv with the new filings you've labeled.
Run the rename_labeled_filings.py script to update labeled file names in the GCS bucket with their SEC filename.
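
To make the first step more concrete, here is a rough sketch of turning word bounding boxes into a task JSON. It is not the code from notebook 7. Label Studio tasks carry their inputs under a top-level "data" key and express image region coordinates as percentages of the image size, but the inner keys "image" and "words", the function names, and the example bucket URI are all assumptions made for illustration. The real keys must match labeling-configs/labeling-config.xml.

```python
import json


def to_percent_box(box, image_width, image_height):
    """Convert a pixel box (x0, y0, x1, y1) to percent-based x/y/width/height."""
    x0, y0, x1, y1 = box
    return {
        "x": 100.0 * x0 / image_width,
        "y": 100.0 * y0 / image_height,
        "width": 100.0 * (x1 - x0) / image_width,
        "height": 100.0 * (y1 - y0) / image_height,
    }


def build_task(image_uri, words, image_width, image_height):
    """Build a task dict holding the image URI and one entry per word.

    `words` is a list of (text, (x0, y0, x1, y1)) pairs in pixels.
    The "image" and "words" keys are hypothetical placeholders.
    """
    return {
        "data": {
            "image": image_uri,
            "words": [
                {"text": text, **to_percent_box(box, image_width, image_height)}
                for text, box in words
            ],
        }
    }


if __name__ == "__main__":
    # Hypothetical bucket and file name, for illustration only.
    task = build_task(
        "gs://example-bucket/example-filing.png",
        [("Subsidiary", (100, 50, 300, 100))],
        image_width=1000,
        image_height=500,
    )
    print(json.dumps(task, indent=2))
```

Storing percentages rather than pixels keeps the boxes valid if the image is displayed at a different resolution in the labeling UI.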

About Catalyst Cooperative

Catalyst Cooperative is a small group of data wranglers and policy wonks organized as a worker-owned cooperative consultancy. Our goal is a more just, livable, and sustainable world. We integrate public data and perform custom analyses to inform public policy (Hire us!). Our focus is primarily on mitigating climate change and improving electric utility regulation in the United States.

Contact Us

For general support, questions, or other conversations around the project that might be of interest to others, check out the GitHub Discussions.
If you'd like to get occasional updates about our projects, sign up for our email list.
Want to schedule a time to chat with us one-on-one? Join us for Office Hours.
Follow us on Twitter: @CatalystCoop
More info on our website: https://catalyst.coop
For private communication about the project or to hire us to provide customized data extraction and analysis, you can email the maintainers: pudl@catalyst.coop
