Create torrents for bulk data #226

Open
ndawg opened this issue Mar 29, 2018 · 4 comments
Comments


ndawg commented Mar 29, 2018

Right now, I'm using the fdsys script to scrape all bill texts for every session of Congress that has data. This takes a very long time, so having the data hosted somewhere makes sense. After all, bills from previous sessions aren't going to be modified. However, it's about a gigabyte of data per session, so conventional hosting probably wouldn't make sense. On the other hand, this is a great use case for torrents. The main issue is that you would most likely end up with every available format bundled into one torrent, but that's okay for me. Thoughts on this?
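
For anyone weighing the torrent idea, here is a minimal sketch of what building a single-file `.torrent` for one session's archive might involve, using only the Python standard library. It is not part of this project's tooling: the archive name, the tracker URL, and the `make_torrent` helper are all hypothetical, and a real setup would more likely use an existing torrent client or library.

```python
import hashlib
import os


def bencode(value):
    """Encode ints, strings/bytes, lists and dicts using BitTorrent bencoding."""
    if isinstance(value, bool):
        raise TypeError("bencoding has no boolean type")
    if isinstance(value, int):
        return b"i%de" % value
    if isinstance(value, str):
        value = value.encode("utf-8")
    if isinstance(value, bytes):
        return b"%d:%s" % (len(value), value)
    if isinstance(value, (list, tuple)):
        return b"l" + b"".join(bencode(v) for v in value) + b"e"
    if isinstance(value, dict):
        # Dictionary keys are byte strings and must appear in sorted order.
        items = [(k.encode("utf-8") if isinstance(k, str) else k, v)
                 for k, v in value.items()]
        items.sort(key=lambda kv: kv[0])
        return b"d" + b"".join(bencode(k) + bencode(v) for k, v in items) + b"e"
    raise TypeError("cannot bencode %r" % type(value))


def make_torrent(path, announce, piece_length=256 * 1024):
    """Return the bytes of a single-file .torrent for `path`.

    `announce` is a tracker URL supplied by the caller (hypothetical here).
    """
    piece_hashes = []
    with open(path, "rb") as f:
        while True:
            chunk = f.read(piece_length)
            if not chunk:
                break
            # Each piece is identified by its 20-byte SHA-1 digest.
            piece_hashes.append(hashlib.sha1(chunk).digest())
    info = {
        "name": os.path.basename(path),
        "length": os.path.getsize(path),
        "piece length": piece_length,
        "pieces": b"".join(piece_hashes),
    }
    return bencode({"announce": announce, "info": info})
```

Usage would look something like `open("bills.torrent", "wb").write(make_torrent("bills.zip", "http://tracker.example/announce"))`, where both file names and the tracker are placeholders. Because each past session's data does not change, such a file would only need to be generated once per session.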

Member

konklone commented Apr 2, 2018

Sunlight used to host these on S3, but doesn't do that anymore.

It is a pretty decent use case for torrents, though I don't know if any of the organizers here have (or are familiar with) torrent management software, or want to take on the maintenance.

Member

dwillis commented Apr 2, 2018

@konklone, do you happen to know what hosting these on S3 cost Sunlight?

Member

konklone commented Apr 5, 2018

No, I don't remember anymore...not even the order of magnitude. If it was hugely expensive I'd probably remember, but we also didn't promote them very well -- they are just linked to on the wiki.

And actually, they still are:
https://github.com/unitedstates/congress/wiki

And the Sunlight downloads...still work. They're just not updated anymore. And are delivered over plain HTTP (gross).


sbma44 commented Apr 5, 2018

I don't recall the S3 costs associated with these, but I'd be shocked if they were significant. Speaking as a former crazed BitTorrent evangelist, I kind of doubt you'll wind up with enough use to keep a healthy swarm going. Still, if you want to go this route, S3 offers torrent capability. In practice that will probably leave AWS as the single seed, with no real difference in costs (it actually might be a bit higher, since I think you end up paying for more API ops for individual chunks even though the bandwidth costs are the same; still, we're probably talking about pocket change).
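
To make the S3 torrent option concrete: as far as I understand it, S3 exposed a `.torrent` file for a publicly readable object by appending the `?torrent` subresource to the object's URL. The sketch below only builds that URL; the bucket and key are hypothetical, and whether AWS still offers the feature should be checked before relying on it.

```python
from urllib.parse import quote


def s3_torrent_url(bucket, key):
    """Build the URL that requests a .torrent for a public S3 object.

    `bucket` and `key` are caller-supplied; the values used in examples
    are placeholders, not real locations of the bulk data.
    """
    # quote() percent-encodes spaces and similar characters but keeps "/".
    return "https://{}.s3.amazonaws.com/{}?torrent".format(bucket, quote(key))
```

For example, `s3_torrent_url("example-bucket", "bills/114.zip")` yields a link that a torrent client could open, with S3 acting as the initial seed.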

What might make more sense is just configuring a requester-pays bucket. This will introduce some hassle for devs who aren't in the AWS ecosystem, but it's a pretty clean solution and protects against unexpected bills coming from devs who pull this data on an hourly cron. Unfortunately, requester-pays buckets do not support BitTorrent.
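
The hourly-cron worry is easy to put rough numbers on. The sketch below is back-of-the-envelope arithmetic only: the one-gigabyte archive size comes from the estimate earlier in this thread, while the per-gigabyte price is a placeholder the caller must supply, not an actual AWS rate.

```python
def monthly_transfer_gb(archive_gb, pulls_per_day, days=30):
    """Total data transferred in a month by one client re-downloading an archive."""
    return archive_gb * pulls_per_day * days


def monthly_cost(archive_gb, pulls_per_day, price_per_gb, days=30):
    """Transfer cost for that client; `price_per_gb` is an assumed rate."""
    return monthly_transfer_gb(archive_gb, pulls_per_day, days) * price_per_gb


# One client pulling a 1 GB session archive every hour moves 720 GB a month.
hourly_cron_gb = monthly_transfer_gb(archive_gb=1, pulls_per_day=24)
```

With a requester-pays bucket that transfer would be billed to the client running the cron job rather than to whoever hosts the bucket.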

4 participants