Add public function for retrieving filing URLs without downloading #32

Open
mksamelson opened this issue Mar 7, 2020 · 7 comments
Labels
enhancement New feature or request

Comments

@mksamelson

It would be nice to be able to access the files online for scraping rather than downloading them all. A feature that just returns the filing URLs would be handy.

@jadchaar
Owner

jadchaar commented Mar 7, 2020

Hey @mksamelson, thanks for reaching out and using the tool!

I actually have an internal utility function that does exactly what you are requesting:

env ❯ python3
Python 3.7.3 (v3.7.3:ef4ec6ed12, Mar 25 2019, 16:52:21)
[Clang 6.0 (clang-600.0.57)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> from sec_edgar_downloader._utils import get_filing_urls_to_download
>>> get_filing_urls_to_download("10-K", "AAPL", 20, "2010-12-31", "2019-12-31", False)
[FilingMetadata(filename='0000320193-19-000119.txt', url='https://www.sec.gov/Archives/edgar/data/320193/000032019319000119/0000320193-19-000119.txt'),
 FilingMetadata(filename='0000320193-18-000145.txt', url='https://www.sec.gov/Archives/edgar/data/320193/000032019318000145/0000320193-18-000145.txt'),
 FilingMetadata(filename='0000320193-17-000070.txt', url='https://www.sec.gov/Archives/edgar/data/320193/000032019317000070/0000320193-17-000070.txt'),
 FilingMetadata(filename='0001628280-16-020309.txt', url='https://www.sec.gov/Archives/edgar/data/320193/000162828016020309/0001628280-16-020309.txt'),
 FilingMetadata(filename='0001193125-15-356351.txt', url='https://www.sec.gov/Archives/edgar/data/320193/000119312515356351/0001193125-15-356351.txt'),
 FilingMetadata(filename='0001193125-14-383437.txt', url='https://www.sec.gov/Archives/edgar/data/320193/000119312514383437/0001193125-14-383437.txt'),
 FilingMetadata(filename='0001193125-13-416534.txt', url='https://www.sec.gov/Archives/edgar/data/320193/000119312513416534/0001193125-13-416534.txt'),
 FilingMetadata(filename='0001193125-12-444068.txt', url='https://www.sec.gov/Archives/edgar/data/320193/000119312512444068/0001193125-12-444068.txt'),
 FilingMetadata(filename='0001193125-11-282113.txt', url='https://www.sec.gov/Archives/edgar/data/320193/000119312511282113/0001193125-11-282113.txt'),
 FilingMetadata(filename='0001193125-10-238044.txt', url='https://www.sec.gov/Archives/edgar/data/320193/000119312510238044/0001193125-10-238044.txt')]

The function sec_edgar_downloader._utils.get_filing_urls_to_download returns a list of FilingMetadata objects, each of which carries the URL you are looking for. It takes the same parameters as the get method, except that all of them are required. Since it is an internal function, I have not gotten around to writing a docstring for it.
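
As a rough illustration of working with that return value, the sketch below pulls just the URLs out of a list of FilingMetadata objects. To stay self-contained it defines a stand-in FilingMetadata namedtuple with the two fields visible in the output above (filename and url) instead of importing the internal helper. Both the stand-in and the extract_urls name are assumptions for illustration; the real type is private and may change between releases.

```python
from collections import namedtuple

# Stand-in for the library's internal type. Assumption: it exposes
# `filename` and `url`, as the repr in the session above suggests.
FilingMetadata = namedtuple("FilingMetadata", ["filename", "url"])


def extract_urls(filings):
    """Return only the URL of each filing, preserving the input order."""
    return [filing.url for filing in filings]


# Two entries copied from the output above.
sample_filings = [
    FilingMetadata(
        filename="0000320193-19-000119.txt",
        url="https://www.sec.gov/Archives/edgar/data/320193/000032019319000119/0000320193-19-000119.txt",
    ),
    FilingMetadata(
        filename="0000320193-18-000145.txt",
        url="https://www.sec.gov/Archives/edgar/data/320193/000032019318000145/0000320193-18-000145.txt",
    ),
]

urls = extract_urls(sample_filings)
for url in urls:
    print(url)
```

With the real helper, the list returned by get_filing_urls_to_download could presumably be passed to extract_urls in place of sample_filings.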

Let me know if this helps, or if you would like to see something different implemented in a future release!

@mksamelson
Author

Thanks, this is helpful. It would be great if a future release had a utility that provided URLs for other file formats. Your utility returns the *.txt document (the full filing). If there were a way to (1) list the URLs and (2) download the HTML and XML files, that would be great.

The image below shows the file you reference (circled in red). The file types highlighted in yellow are also very useful.

image
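
Until such a utility exists, one possible workaround is to derive the filing's folder URL from the full-submission URL, since the other documents of a filing sit in the same directory on EDGAR. The sketch below does only string manipulation and makes no network requests. The function names are hypothetical, and the `-index.htm` naming of the filing index page is an assumption based on how EDGAR index pages are commonly addressed, so verify it against a real filing before relying on it.

```python
import posixpath
from urllib.parse import urlsplit, urlunsplit


def filing_folder_url(full_submission_url):
    """Return the URL of the directory that contains the full-submission file."""
    parts = urlsplit(full_submission_url)
    folder_path = posixpath.dirname(parts.path) + "/"
    return urlunsplit((parts.scheme, parts.netloc, folder_path, "", ""))


def filing_index_url(full_submission_url):
    """Return the presumed filing index page URL.

    Assumption: the index page is named <accession-number>-index.htm
    and lives next to the full-submission .txt file.
    """
    parts = urlsplit(full_submission_url)
    accession_number = posixpath.splitext(posixpath.basename(parts.path))[0]
    return filing_folder_url(full_submission_url) + accession_number + "-index.htm"


# URL taken from the output earlier in this thread.
txt_url = "https://www.sec.gov/Archives/edgar/data/320193/000032019319000119/0000320193-19-000119.txt"
print(filing_folder_url(txt_url))
print(filing_index_url(txt_url))
```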

@jadchaar jadchaar added the enhancement New feature or request label Mar 7, 2020
@jadchaar
Owner

jadchaar commented Mar 7, 2020

Your request has been noted! This is actually quite related to #31. When I get a free moment, I will work toward adding this feature!

Originally I created this tool for text parsing purposes, but I have seen a nice influx of users requesting the ability to download XML and HTML versions as well, so this will hopefully be the next feature I work on!

@jadchaar jadchaar changed the title from "Ability to Just Pull Filing URLs" to "Ability to download XML and HTML filing data and retrieve corresponding URLs" Mar 7, 2020
@mksamelson
Author

Thanks.

Just for additional clarity, the txt files contain HTML tags but often also a lot of other junk that causes problems for an HTML/XML parser, so you usually have to resort to regular expressions to parse them. The raw HTML and XML files don't have this issue.
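
A minimal sketch of that regex approach, assuming the full-submission text wraps each document in <DOCUMENT>...</DOCUMENT> with a <TYPE> header line and a <TEXT>...</TEXT> body. That layout is an assumption about the file format, the sample string is synthetic rather than real filing content, and split_documents is a hypothetical helper name.

```python
import re

# Assumed layout: each document is wrapped in <DOCUMENT>...</DOCUMENT>,
# has a one-line <TYPE> header, and carries its payload in <TEXT>...</TEXT>.
DOCUMENT_RE = re.compile(r"<DOCUMENT>(.*?)</DOCUMENT>", re.DOTALL)
TYPE_RE = re.compile(r"<TYPE>([^\n<]+)")
TEXT_RE = re.compile(r"<TEXT>(.*?)</TEXT>", re.DOTALL)


def split_documents(submission_text):
    """Return a list of (document_type, body) tuples found in the text."""
    documents = []
    for block in DOCUMENT_RE.findall(submission_text):
        type_match = TYPE_RE.search(block)
        text_match = TEXT_RE.search(block)
        doc_type = type_match.group(1).strip() if type_match else None
        body = text_match.group(1).strip() if text_match else ""
        documents.append((doc_type, body))
    return documents


# Synthetic sample, not taken from a real filing.
sample_submission = """<SEC-DOCUMENT>0000320193-19-000119.txt
<DOCUMENT>
<TYPE>10-K
<SEQUENCE>1
<TEXT>
<html><body><p>Annual report</p></body></html>
</TEXT>
</DOCUMENT>
<DOCUMENT>
<TYPE>EX-21.1
<SEQUENCE>2
<TEXT>
<html><body><p>Subsidiaries</p></body></html>
</TEXT>
</DOCUMENT>
</SEC-DOCUMENT>
"""

documents = split_documents(sample_submission)
for doc_type, body in documents:
    print(doc_type, len(body))
```

Once a document body has been isolated this way, it can usually be handed to an ordinary HTML parser, because the surrounding wrapper lines are gone.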

@jadchaar
Owner

jadchaar commented Mar 7, 2020

Thanks for letting me know and thanks for finding a regex workaround in the meantime :).

@jadchaar jadchaar changed the title from "Ability to download XML and HTML filing data and retrieve corresponding URLs" to "Add public function for retrieving filing URLs without downloading" Jan 18, 2021
@jadchaar
Owner

v4 of this package will add the ability to download XML and HTML filing details in addition to the full submission TXT: #52. I still need to make a public-facing function for obtaining the URLs without downloading, but the utility function can serve this purpose until a public function is added to the Downloader class.

@jadchaar
Owner

jadchaar commented May 9, 2021

Another user requested this functionality in an email to me:

I don't use it to download files. Instead, I use it to generate the full_submission_url, and save the urls. i.e., I modified the Downloader() function so that it returns the filings_to_fetch FilingMetadata object.

As such, I'm wondering whether future versions of sec-edgar-downloader could add an option to return the FilingMetadata object filings_to_fetch?
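
The workflow described in that email (collect the metadata, then save the URLs instead of downloading) could look roughly like the sketch below. It does not call the library: FilingMetadata here is a stand-in namedtuple, the full_submission_url field name is taken from the email, and the accession_number field and the write_urls helper are hypothetical names added for illustration.

```python
import csv
import io
from collections import namedtuple

# Stand-in type. `full_submission_url` is the field named in the email;
# `accession_number` is a hypothetical extra field for illustration.
FilingMetadata = namedtuple(
    "FilingMetadata", ["accession_number", "full_submission_url"]
)


def write_urls(filings, out_file):
    """Write one CSV row per filing to an open text file object."""
    writer = csv.writer(out_file, lineterminator="\n")
    writer.writerow(["accession_number", "full_submission_url"])
    for filing in filings:
        writer.writerow([filing.accession_number, filing.full_submission_url])


filings_to_fetch = [
    FilingMetadata(
        accession_number="0000320193-19-000119",
        full_submission_url="https://www.sec.gov/Archives/edgar/data/320193/000032019319000119/0000320193-19-000119.txt",
    ),
]

# An in-memory buffer keeps the example free of side effects; a file opened
# with open("urls.csv", "w", newline="") would work the same way.
buffer = io.StringIO()
write_urls(filings_to_fetch, buffer)
csv_text = buffer.getvalue()
print(csv_text)
```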
