ansonl/lobbyist-lookup

Unified Congress Lobbyist Disclosure Scraper and Lookup

  • Record Retrieval

    - Latest current year House lobby disclosure filings available on [House.gov](http://disclosures.house.gov/).
    • Using the web browser based search may result in the error:

      Cannot download more than 2000 records. Please refine search.

    • The past filings download link ultimately leads here, where filings can be downloaded in XML format.

      • The house.gov site serves the archive files from a form that sends a POST request to an ASP page. The site runs on ASP, which enforces ViewState and EventValidation to prevent CSRF. This makes programmatic POST requests more complicated, because a request is only accepted if it carries valid ViewState and EventValidation values.
        • This Go program first retrieves the page from the ASP server with a GET request. After parsing the hidden ViewState and EventValidation input values, it constructs a valid POST request, to which the ASP server replies with a file stream. The file stream is written to a defined file.
          • houseRetrieve.go uses the code.google.com/p/go.net/html package to parse the HTML into tokens.
          • houseRetrieve.go contains the archive downloading portion of the code and can be repurposed to send and receive requests with other ASP sites that use this CSRF protection.
    • XXXX Registration archives contain new registrations for year XXXX. XXXX N Quarter archives contain filings due for quarter N of that year.

      • This program will download all archives for the current year.
    • Uses the predictable file naming convention for Senate filings on Senate.gov.

      • The Senate provides XML files with up to 1000 filings per file.
        • The XML files are in UTF-16 and Go expects UTF-8.
          • Used code.google.com/p/go-charset/charset to convert UTF-16 to UTF-8.
    • Interesting Info

      • House has ~90k filings versus Senate's ~130k filings.
      • Each House filing is in its own XML file, whereas Senate filings are batched up to 1000 per file.
      • Funnily enough, Senate filings therefore parse faster.
    • Retrieves lobbyist filings every day.

      • Heroku cycles dynos every 24 hrs, so that refreshes the list as well ;)
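
The UTF-16 to UTF-8 conversion needed for the Senate XML files can be sketched with the Go standard library alone. This is not the go-charset code the project uses; it swaps in `unicode/utf16` for illustration. The helper name `decodeUTF16` and the little-endian fallback when no byte order mark is present are assumptions of this sketch.

```go
package main

import (
	"encoding/binary"
	"errors"
	"fmt"
	"unicode/utf16"
)

// decodeUTF16 (hypothetical helper) converts UTF-16 bytes to a UTF-8 Go string.
// A leading byte order mark (BOM) selects the byte order and is stripped.
// Without a BOM, little-endian is assumed; that default is an assumption
// of this sketch, not something the Senate files are known to require.
func decodeUTF16(b []byte) (string, error) {
	if len(b)%2 != 0 {
		return "", errors.New("utf16: odd number of bytes")
	}
	var order binary.ByteOrder = binary.LittleEndian
	if len(b) >= 2 {
		switch {
		case b[0] == 0xFE && b[1] == 0xFF:
			order = binary.BigEndian
			b = b[2:]
		case b[0] == 0xFF && b[1] == 0xFE:
			b = b[2:]
		}
	}
	// Reassemble 16-bit code units, then let utf16.Decode handle surrogates.
	units := make([]uint16, len(b)/2)
	for i := range units {
		units[i] = order.Uint16(b[2*i:])
	}
	return string(utf16.Decode(units)), nil
}

func main() {
	// "<a/>" encoded as UTF-16LE with a BOM.
	raw := []byte{0xFF, 0xFE, '<', 0, 'a', 0, '/', 0, '>', 0}
	s, err := decodeUTF16(raw)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(s)
}
```

If the converted text still carries an XML declaration naming UTF-16, `encoding/xml` will ask for a `CharsetReader` on the decoder, so the declaration may need to be handled separately before parsing.
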

| Parameter | Comment |
| --- | --- |
| `__VIEWSTATE` | extracted token |
| `__EVENTVALIDATION` | extracted token |
| `selFilesXML` | requested archive filename from page HTML input element |
| `btnDownloadXML` | needed to tell ASP to serve file? |
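
The parameters above can be assembled into a POST body roughly as follows. This is a minimal sketch, not the code in houseRetrieve.go: it uses regular expressions from the standard library instead of the HTML tokenizer the project relies on, and the helper names `inputFields` and `buildDownloadForm`, the sample page, and the archive name are all hypothetical. Reusing the button's `value` from the page for `btnDownloadXML` is also an assumption.

```go
package main

import (
	"fmt"
	"html"
	"net/url"
	"regexp"
)

var (
	inputTag  = regexp.MustCompile(`(?i)<input\b[^>]*>`)
	nameAttr  = regexp.MustCompile(`(?i)\bname="([^"]*)"`)
	valueAttr = regexp.MustCompile(`(?i)\bvalue="([^"]*)"`)
)

// inputFields (hypothetical helper) maps each <input> name to its value.
// It assumes double-quoted attributes, which is enough for a sketch but
// less robust than a real HTML tokenizer.
func inputFields(page string) map[string]string {
	fields := make(map[string]string)
	for _, tag := range inputTag.FindAllString(page, -1) {
		name := nameAttr.FindStringSubmatch(tag)
		if name == nil {
			continue
		}
		value := ""
		if m := valueAttr.FindStringSubmatch(tag); m != nil {
			value = html.UnescapeString(m[1])
		}
		fields[name[1]] = value
	}
	return fields
}

// buildDownloadForm (hypothetical helper) copies the two tokens from the
// page fetched with GET and adds the archive selection and button field.
func buildDownloadForm(page, archive string) (url.Values, error) {
	fields := inputFields(page)
	form := url.Values{}
	for _, key := range []string{"__VIEWSTATE", "__EVENTVALIDATION"} {
		v, ok := fields[key]
		if !ok {
			return nil, fmt.Errorf("missing hidden input %s", key)
		}
		form.Set(key, v)
	}
	form.Set("selFilesXML", archive)
	// Assumption: send whatever value the button carries on the page.
	form.Set("btnDownloadXML", fields["btnDownloadXML"])
	return form, nil
}

func main() {
	// Made-up page fragment standing in for the GET response.
	page := `<input type="hidden" name="__VIEWSTATE" value="dDwtMTA4" />
<input type="hidden" name="__EVENTVALIDATION" value="/wEWAgL" />
<input type="submit" name="btnDownloadXML" value="Download" />`
	form, err := buildDownloadForm(page, "EXAMPLE_ARCHIVE.zip")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(form.Encode())
}
```

The resulting `url.Values` could then be sent with `http.PostForm` to the same page URL that served the GET response, and the response body streamed to a file.
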