
[Data] Automate schedule downloads #18

Open
lauriemerrell opened this issue Sep 13, 2022 · 4 comments
lauriemerrell commented Sep 13, 2022

In addition to scraping realtime data every 5 minutes, we should scrape the GTFS schedule (static) data on a daily basis so we don't have to get historical versions after the fact.

We should write a Lambda function that will scrape the CTA schedule GTFS data from https://www.transitchicago.com/downloads/sch_data/google_transit.zip every day.

Acceptance criteria for this should just be a Python script that will scrape the zipfile as bytes and write it to S3.

Once that's ready, we should make a follow-up ticket to deploy to AWS (has to be done by me, @lauriemerrell) and another follow-up ticket describing the desired follow-up processing.
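A minimal sketch of the acceptance criteria above (fetch the zip as bytes, write it to S3). The bucket name, key layout, and helper names here are illustrative assumptions, not settled choices; `boto3` is a third-party dependency and is imported lazily so the module loads without it:

```python
from datetime import date
from urllib.request import urlopen

SCHEDULE_URL = "https://www.transitchicago.com/downloads/sch_data/google_transit.zip"
BUCKET = "cta-gtfs-schedule"  # hypothetical bucket name


def schedule_key(run_date: date) -> str:
    """Date-stamped S3 key so daily downloads never overwrite each other."""
    return f"gtfs_schedule/google_transit_{run_date.isoformat()}.zip"


def download_and_store(run_date: date) -> str:
    """Fetch the schedule zip as bytes and upload it to S3; returns the key written."""
    with urlopen(SCHEDULE_URL, timeout=60) as resp:
        data = resp.read()
    import boto3  # third-party dependency, assumed installed in the Lambda runtime
    boto3.client("s3").put_object(
        Bucket=BUCKET, Key=schedule_key(run_date), Body=data
    )
    return schedule_key(run_date)
```

In a Lambda deployment, the handler would just call `download_and_store(date.today())` on a daily schedule (e.g. an EventBridge rule).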

@mrscraps13

Wanted feedback on this, please let me know!
@lauriemerrell @KyleDolezal
"""
with open("infile", "rb") as in_file, open("out-file", "wb") as out_file:
chunk = in_file.read(chunk_size)

if chunk == b"":
    break

out_file.write(chunk)

"""

@KyleDolezal
Collaborator

@mrscraps13 It looks good to me. I can see similar working examples, such as here. Is this code part of a branch? I'm wondering if I could see it in context.

@lauriemerrell
Member Author

lauriemerrell commented Oct 23, 2022

Agree with @KyleDolezal, looks good, but wondering about context: in my day job where we download feeds, we just use requests, basically `requests.get(<SCHEDULE_URL>)`, and then save the response content. Here's an example: https://github.com/cal-itp/data-infra/blob/main/airflow/dags/gtfs_downloader/download_data.py#L35-L78. It's a bit hard to follow because there's some other config stuff going on, but maybe helpful?
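For reference, a minimal sketch of the requests-based approach described above. The function name and the injectable `get` parameter are illustrative additions (the latter just makes the sketch exercisable without hitting the CTA site); `requests` is a third-party dependency:

```python
SCHEDULE_URL = "https://www.transitchicago.com/downloads/sch_data/google_transit.zip"


def download_schedule(url: str = SCHEDULE_URL, get=None) -> bytes:
    """Fetch the schedule zip and return its raw bytes.

    `get` defaults to requests.get; it is injectable so the function
    can be tested without a network call.
    """
    if get is None:
        import requests  # third-party dependency, assumed installed
        get = requests.get
    response = get(url, timeout=30)
    response.raise_for_status()  # fail loudly on 4xx/5xx
    return response.content
```

The returned bytes can then be written straight to S3 (or a local file) without any chunked-read loop, since the feed zip comfortably fits in memory.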

@mrscraps13

mrscraps13 commented Dec 7, 2022

I'm a bit lost about the "context", i.e., which other pieces. The way I thought about this was reading the file by chunks. Could someone provide a bit more guidance? :)

@dcjohnson24 dcjohnson24 self-assigned this Jul 29, 2023
Status: Ready for Review

No branches or pull requests

4 participants