Create torrents for bulk data #226

ndawg · 2018-03-29T21:35:52Z

Right now, I'm using the fdsys script to scrape all bill texts for every Congress session that has data. This takes a long, long time, so having the data hosted somewhere makes sense. After all, bills from previous congressional sessions aren't going to be modified. However, it is about a gigabyte of data per session, so traditional hosting doesn't really make sense. On the other hand, this is a great use case for torrents. The main issue is that you would most likely end up stuck with every available format bundled into one torrent, but that's okay for me. Thoughts on this?
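For a rough idea of what creating a torrent involves, here is a minimal sketch using only the Python standard library. It assumes each session's data has already been packed into a single archive; the file name `115-bills.zip` and the tracker URL are made up for illustration, and a real setup would more likely use an existing torrent client or library.

```python
import hashlib
import io


def bencode(value):
    """Encode ints, strings, lists, and dicts using BitTorrent's bencoding."""
    if isinstance(value, bool):
        raise TypeError("bencoding has no boolean type")
    if isinstance(value, int):
        return b"i%de" % value
    if isinstance(value, str):
        value = value.encode("utf-8")
    if isinstance(value, bytes):
        return b"%d:%s" % (len(value), value)
    if isinstance(value, list):
        return b"l" + b"".join(bencode(item) for item in value) + b"e"
    if isinstance(value, dict):
        # Dictionary keys are byte strings and must appear in sorted order.
        items = sorted(
            ((k.encode("utf-8") if isinstance(k, str) else k, v)
             for k, v in value.items()),
            key=lambda pair: pair[0],
        )
        return b"d" + b"".join(bencode(k) + bencode(v) for k, v in items) + b"e"
    raise TypeError("cannot bencode %r" % type(value))


def build_metainfo(name, fileobj, announce, piece_length=256 * 1024):
    """Return (torrent_bytes, info_hash_hex) for a single-file torrent.

    fileobj is a binary file object (a regular file or BytesIO); it is read
    one piece at a time so a gigabyte archive never sits in memory at once.
    """
    piece_hashes = []
    length = 0
    while True:
        piece = fileobj.read(piece_length)
        if not piece:
            break
        length += len(piece)
        piece_hashes.append(hashlib.sha1(piece).digest())
    info = {
        "name": name,
        "length": length,
        "piece length": piece_length,
        "pieces": b"".join(piece_hashes),
    }
    torrent = bencode({"announce": announce, "info": info})
    # The info hash identifies the torrent to trackers and peers.
    return torrent, hashlib.sha1(bencode(info)).hexdigest()


# Stand-in data; in practice this would be open("115-bills.zip", "rb").
demo_data = io.BytesIO(b"x" * 700_000)
torrent_bytes, info_hash = build_metainfo(
    "115-bills.zip",                    # hypothetical archive name
    demo_data,
    "http://tracker.example/announce",  # hypothetical tracker URL
)
```

Writing `torrent_bytes` to a `.torrent` file is the easy part; somebody still has to seed the archive for the torrent to be useful.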

konklone · 2018-04-02T04:45:43Z

Sunlight used to host these on S3, but doesn't do that anymore.

It is a pretty decent use case for torrents, though I don't know if any of the organizers here have (or are familiar with) torrent management software, or want to take on the maintenance.

dwillis · 2018-04-02T21:47:40Z

@konklone, do you happen to know what hosting these on S3 cost Sunlight?

konklone · 2018-04-05T03:45:41Z

No, I don't remember anymore...not even the order of magnitude. If it was hugely expensive I'd probably remember, but we also didn't promote them very well -- they are just linked to on the wiki.

And actually, they still are:
https://github.com/unitedstates/congress/wiki

And the Sunlight downloads...still work. They're just not updated anymore. And are delivered over plain HTTP (gross).

sbma44 · 2018-04-05T14:21:02Z

I don't recall the S3 costs associated with these, but I'd be shocked if they were significant. Speaking as a former crazed BitTorrent evangelist, I kind of doubt you'll wind up with enough use to keep a healthy swarm going. Still, if you want to go this route, S3 offers torrent capability. In practice that will probably wind up with AWS as the single seed and no real difference in costs (it actually might be a bit higher since I think you wind up paying for more API ops for individual chunks, even as the bandwidth costs are the same -- still, we're probably talking about pocket change).

What might make more sense is just configuring a requester-pays bucket. This will introduce some hassle for devs who aren't in the AWS ecosystem, but it is a pretty clean solution and protects against unexpected bills coming from devs who pull this data on an hourly cron. Unfortunately, requester-pays buckets do not support BitTorrent.
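To give a sense of the hassle on the consumer side, here is a rough sketch of a download from a requester-pays bucket. It assumes a boto3-style S3 client is passed in; the bucket name and key in the usage comment are placeholders, not real locations. The caller needs their own AWS credentials, since requester-pays buckets do not allow anonymous access.

```python
def download_requester_pays(s3_client, bucket, key, out, chunk_size=1024 * 1024):
    """Stream one object from a requester-pays bucket into a writable binary file.

    s3_client is expected to behave like boto3.client("s3"). Passing
    RequestPayer="requester" acknowledges that the caller's AWS account
    is billed for the request and the data transfer.
    """
    response = s3_client.get_object(
        Bucket=bucket,
        Key=key,
        RequestPayer="requester",
    )
    total = 0
    # Write the body in chunks so a large archive is never held in memory.
    for chunk in response["Body"].iter_chunks(chunk_size=chunk_size):
        out.write(chunk)
        total += len(chunk)
    return total


# Usage (requires boto3 and configured AWS credentials; names are placeholders):
#
#   import boto3
#   with open("115-bills.zip", "wb") as out:
#       download_requester_pays(
#           boto3.client("s3"), "example-congress-bulk", "bills/115.zip", out
#       )
```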
