
Add support for additional web archives using Memento TimeMaps. #1

Open
phonedude opened this issue May 26, 2021 · 4 comments

@phonedude

There are many additional web archives that could be supported, esp. if this service used Memento TimeMaps, either aggregated through TimeTravel or directly via their TimeMap URIs.

Some lists of archives:

Memento Quick Intro
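For reference, fetching a TimeMap is just a plain GET; e.g., via the TimeTravel aggregator's TimeMap endpoint, as I understand it (the target URI here is illustrative):

$ # Returns an application/link-format list of mementos for the URI.
$ curl -s 'http://timetravel.mementoweb.org/timemap/link/http://example.com/'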


ticky (Owner) commented May 26, 2021

I'd love to support more archives; however, a few things make the Wayback Machine attractive for this:

  1. It's got pretty good coverage
  2. Though basic, the site search is an absolute killer feature, especially for browsers with rudimentary text input like game consoles
  3. It's quite fast to load

Time Travel is lovely, but it takes a really long time to load (30-50 seconds for any given lookup, and many simply time out), and as far as I can tell it doesn't support any form of site search or URL prefix matching. Given the connection speed of some of the devices I'm targeting, I'd much prefer to reduce waiting as much as possible, since a lot of it will already be spent in the network.

The Internet Archive's CDX API also allows significant optimisation on my end, because it supports complex server-side filtering of snapshots. As far as I can tell, Memento APIs don't permit this, which in turn makes their responses slower, since they have to return all their data at once.
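For context, here's a sketch of the kind of server-side filtering the CDX API supports (the URL and parameter values are illustrative, not taken from this project's code):

$ # Ask the archive itself to filter: only successful HTML captures,
$ # only the fields we need, capped at 10 rows.
$ curl -s 'https://web.archive.org/cdx/search/cdx?url=example.com&filter=statuscode:200&filter=mimetype:text/html&fl=timestamp,original&limit=10'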

Do you have any suggestions for mitigating these performance issues?

@phonedude (Author)

Yeah, there's no doubt the IA Wayback Machine was the first, is the biggest, etc., and if you can only support one, that's the one. And for the 90s, IA is pretty much the only game in town; if the other archives have pages from then, they're typically just copies of IA's WARCs (not always, but mostly).

Most other archives don't support prefix search, etc. yet, so there will be a trade-off between breadth and features. One solution would be to offer different branches: IA in one, and various non-IA archives in another. You don't have to go through TimeTravel; you could contact some of the other archives directly. Or you could run your own instance of MemGator and specify the non-IA archives you'd like to poll (e.g., just arquivo.pt, archive.today, perma.cc, and wayback.vefsafn.is); there's a self-hosting sketch after the timings below. The non-IA archives are likely to be sparse for many URLs, so the responses should be small and relatively quick. Try MemGator; it doesn't do any processing or pagination and is thus pretty fast:

$ time curl -isL memgator.cs.odu.edu/timemap/link/www.nasa.gov
[...]

real 0m18.784s
user 0m0.118s
sys 0m0.555s
$ time curl -isL memgator.cs.odu.edu/timemap/link/www.nasa.gov | wc -l
63969

real 0m6.465s
user 0m0.112s
sys 0m0.216s

The second call responded quickly (6s) because IA had cached its response. But the first call at 18s isn't too bad given the size of the response.

Other formats are similar:

$ time curl -isL memgator.cs.odu.edu/timemap/json/www.nasa.gov | wc -l
255847

real 0m10.538s
user 0m0.164s
sys 0m0.256s
$ time curl -isL memgator.cs.odu.edu/timemap/cdxj/www.nasa.gov | wc -l
63969

real 0m13.450s
user 0m0.129s
sys 0m0.251s
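Here's a rough sketch of the self-hosting setup mentioned above, limited to a couple of non-IA archives. The --arcs flag is MemGator's option for a custom archive list; the JSON field names and archive endpoint URLs below are from memory, so verify them against MemGator's bundled archives.json before relying on them.

$ # Hypothetical archives.json restricted to two non-IA archives.
$ cat > archives.json <<'EOF'
[
  {"id": "arquivo", "name": "Arquivo.pt",
   "timemap": "https://arquivo.pt/wayback/timemap/link/",
   "timegate": "https://arquivo.pt/wayback/"},
  {"id": "perma", "name": "Perma.cc",
   "timemap": "https://perma-archives.org/warc/timemap/link/",
   "timegate": "https://perma-archives.org/warc/"}
]
EOF
$ # Run the aggregator locally, polling only those archives.
$ memgator --arcs=archives.json server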

Finally, and I know you've already mentioned it in your repo, but regardless of the endpoint, some kind of caching would be a huge win for your application. It might even be worth it to go custom, since you're focused on data prior to a certain year (2000? 2005? 2010?). Most of the updates from all archives are going to come from the recent past. But even a standard reverse proxy would be super speedy. If you can keep robots out of your service, you'll probably get a lot of cache hits.
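As an illustration, a minimal caching reverse proxy along those lines, assuming nginx in front of a local MemGator on port 1208 (paths, zone names, and TTLs are all illustrative):

$ cat > /etc/nginx/conf.d/timemap-cache.conf <<'EOF'
proxy_cache_path /var/cache/nginx/timemaps levels=1:2
                 keys_zone=timemaps:10m max_size=1g inactive=30d;

server {
    listen 8080;
    location /timemap/ {
        proxy_pass http://127.0.0.1:1208;  # local MemGator
        proxy_cache timemaps;
        # Captures from the distant past rarely change, so a long TTL is safe.
        proxy_cache_valid 200 30d;
        add_header X-Cache-Status $upstream_cache_status;
    }
}
EOF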


ticky (Owner) commented May 29, 2021

I'm guessing you happened to prime some cache on their end before your testing, because timing a request to memgator.cs.odu.edu/timemap/json/www.nasa.gov takes more than a minute for me, though subsequent runs are closer to ten seconds. Neither is a particularly good time, and I intend to optimise around the cold-cache state.

mementoweb's API times out after two minutes, which isn't enough time to fetch the history for, say, apple.com, resulting in a 504 Gateway Timeout and no usable data. On the third attempt it actually did respond, but that was about twenty minutes later, presumably taking advantage of a cache primed by the first two requests. A twenty-minute request-retry cycle doesn't feel acceptable.

I just don't believe the Memento protocol as designed is fit for this purpose, given that it's missing fundamental features like filtering, date-range specifiers, or even rudimentary pagination. Those are the only ways I can see for something like this to consume the API efficiently and performantly; the CDX API lets me reduce the complexity on both ends by requesting only a small subset of the data for the initial query and then, once the user has drilled down, a month's worth of less-filtered data.
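To make that concrete, a sketch of the two-stage pattern against the CDX API (parameter values are illustrative; collapse=timestamp:6 keeps roughly one capture per year-month):

$ # Stage 1: coarse overview, about one successful capture per month.
$ curl -s 'https://web.archive.org/cdx/search/cdx?url=apple.com&filter=statuscode:200&collapse=timestamp:6&fl=timestamp,original'
$ # Stage 2: once the user picks a month, fetch that month in full.
$ curl -s 'https://web.archive.org/cdx/search/cdx?url=apple.com&from=200106&to=200106&fl=timestamp,original,statuscode'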


ticky (Owner) commented May 29, 2021

I've realised overnight that the site search thing isn't really a blocker to this; there's no reason I couldn't leverage the Wayback Machine for site search and an aggregator for the actual history. But the performance problems stemming from the Memento APIs' limitations remain.

I missed this bit in your prior response:

Try MemGator; it doesn't do any processing or pagination and is thus pretty fast.

The lack of processing or pagination is, IMO, exactly the cause of the trouble! It means each archive used by the aggregator has its full history queried, so a potentially huge number of items must be fetched.
