Plaba/US-Congress-Corpora-Builder

US-Congress-Corpora-Builder

A set of Python tools to download the Senate and House transcripts and convert them to usable text.

Usage

sh setup.sh
sh build-corpera.sh

The text transcripts are written to transcripts-txt/, with each file named by chamber of Congress and date.
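Once the build has finished, the output can be read back into Python for further processing. The following is a minimal sketch, not part of this repository: the function name `load_transcripts` is hypothetical, and it assumes the files carry a `.txt` extension, which the README does not state explicitly.

```python
from pathlib import Path


def load_transcripts(directory="transcripts-txt"):
    """Map each transcript's file stem to its text, in filename order.

    Assumption: transcripts are stored as *.txt files directly inside
    `directory`; adjust the glob pattern if the real layout differs.
    """
    root = Path(directory)
    return {
        path.stem: path.read_text(encoding="utf-8", errors="replace")
        for path in sorted(root.glob("*.txt"))
    }
```

Keying on the file stem keeps the chamber and date from the filename available without guessing at the exact naming scheme.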

Roadmap

  • Downloading PDFs by date range
  • Converting them into usable text
  • Separating the text by speaker and eliminating non-spoken text (see SeperateSpeeches.py)
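The speaker-separation step could be approached roughly as follows. This is an illustrative sketch only and does not reflect what SeperateSpeeches.py actually does: the function name `split_by_speaker` and the heading pattern are assumptions, namely that a new speech starts on a line beginning with a title, an upper-case surname, an optional "of State", and a period (for example "Mr. SMITH." or "Ms. JONES of New York.").

```python
import re

# Assumed heading shape: title, upper-case surname, optional "of <State>", period.
# Real transcripts contain other forms (e.g. "The PRESIDING OFFICER.") that this
# pattern does not cover.
SPEAKER_RE = re.compile(
    r"^\s*((?:Mr|Mrs|Ms|Miss|Dr)\. [A-Z][A-Z'\-]+"
    r"(?: of [A-Z][a-z]+(?: [A-Z][a-z]+)*)?)\.\s+",
    re.MULTILINE,
)


def split_by_speaker(text):
    """Return a list of (speaker, speech) tuples in document order."""
    matches = list(SPEAKER_RE.finditer(text))
    speeches = []
    for i, match in enumerate(matches):
        # A speech runs from the end of its heading to the start of the next one.
        end = matches[i + 1].start() if i + 1 < len(matches) else len(text)
        body = " ".join(text[match.end():end].split())  # collapse line breaks
        speeches.append((match.group(1), body))
    return speeches


sample = (
    "Mr. SMITH. I rise today.\n"
    "This bill matters.\n"
    "Ms. JONES of New York. I yield back.\n"
)
for speaker, speech in split_by_speaker(sample):
    print(f"{speaker}: {speech}")
```

Any text before the first recognized heading is dropped, which is one crude way of discarding non-spoken front matter; a real implementation would need more careful rules for procedural text that appears between speeches.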
