Receive "HTTP Error 401: Unauthorized" after 100 scrapes #140

Open
jiapenghe1996 opened this issue Dec 12, 2023 · 1 comment

@jiapenghe1996

I am using https://github.com/rbshaffer/gpo_tools to scrape Congressional hearing transcripts via the GovInfo API. However, for each session of Congress, I am only able to scrape the first 100 hearings. After 100 scrapes, I get the error "HTTPError: HTTP Error 401: Unauthorized."

Do you know how to resolve this issue? Thank you!

@jonquandt jonquandt self-assigned this Feb 9, 2024
@jonquandt
Member

Unfortunately, I do not have any knowledge of how that tool works. Do you know if it is properly sending the API key for each request?

I would suggest reaching out to the maintainer for the tool.

From a quick glance, it looks like it uses the API to gather a list of hearings and then, as part of extract_nav, tries to brute-force scrape content from the GovInfo website via an older link pattern (potentially from the days of FDsys, when the site was hosted under https://www.gpo.gov/fdsys).

My recommendation would be to use the GovInfo API fully: get the list of hearings, then follow the package links to download content and metadata.

Note that for the CHRG collection, package results will not contain content. You will need to follow the granulesLink to get to the individual parts of a hearing.
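
For illustration, here is a minimal Python sketch of that API-only flow. It is not taken from gpo_tools, and the details are assumptions to check against the GovInfo API documentation: the `/collections/CHRG/{date}` URL shape, the `packages`, `packageLink`, `nextPage`, and `granules` response fields, and passing the key as an `api_key` query parameter. The helper names (`with_api_key`, `iter_pages`, `iter_hearing_granules`) are hypothetical. The sketch re-attaches the key to every URL it requests, including followed `nextPage` and `granulesLink` URLs, which is one thing worth checking when a 401 appears partway through a run.

```python
import json
import urllib.parse
import urllib.request

API_ROOT = "https://api.govinfo.gov"  # assumed API base URL


def with_api_key(url, api_key):
    """Return url with api_key set as a query parameter, replacing any existing one."""
    parts = urllib.parse.urlsplit(url)
    query = [
        (k, v)
        for k, v in urllib.parse.parse_qsl(parts.query, keep_blank_values=True)
        if k != "api_key"
    ]
    query.append(("api_key", api_key))
    return urllib.parse.urlunsplit(parts._replace(query=urllib.parse.urlencode(query)))


def fetch_json(url, api_key):
    """GET a URL with the key attached and decode the JSON body."""
    with urllib.request.urlopen(with_api_key(url, api_key)) as resp:
        return json.load(resp)


def iter_pages(first_url, list_key, api_key, fetch=fetch_json):
    """Yield items from a paginated listing by following nextPage until it is absent."""
    url = first_url
    while url:
        page = fetch(url, api_key)
        yield from page.get(list_key, [])
        url = page.get("nextPage")  # assumed field name for the next page URL


def iter_hearing_granules(start_date, api_key, fetch=fetch_json):
    """Yield (packageId, granule) pairs for CHRG packages modified since start_date."""
    date = urllib.parse.quote(start_date, safe="")
    first = f"{API_ROOT}/collections/CHRG/{date}?pageSize=100&offsetMark=*"
    for package in iter_pages(first, "packages", api_key, fetch):
        # The package summary carries metadata; content lives under the granules.
        summary = fetch(package["packageLink"], api_key)
        granules_link = summary.get("granulesLink")
        if not granules_link:
            continue
        for granule in iter_pages(granules_link, "granules", api_key, fetch):
            yield package["packageId"], granule
```

A call such as `iter_hearing_granules("2023-01-01T00:00:00Z", "YOUR_KEY")` would then walk every page of results rather than stopping after the first 100. The `fetch` parameter exists only so the pagination logic can be exercised without network access.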
