Prioritise successful captures in replay #813

Open
michaeltobintna opened this issue Jan 8, 2021 · 2 comments

Comments


michaeltobintna commented Jan 8, 2021

If a WARC contains two captures of the same URL with different response codes (e.g. 403 and 200), the 200 response is not prioritised in replay. A 200 capture may be added to a collection as a result of patching a 403. If replay then displays the 403 capture, this is misleading, because it makes the capture look unsuccessful.
A status code filter on replay might solve this issue.

Contributor

despens commented Jan 11, 2021

Would it be possible for you to share a WARC file or Conifer collection URL where this is happening?

Author

michaeltobintna commented Jan 25, 2021

Thanks for getting back to me.

Here is an example.

In this collection, at this URL:
https://conifer.rhizome.org/ukgwa/20210125-/20210125042604/https://coronavirus.data.gov.uk/details/cases

If you click the circled toggle, to change the chart to a nation view:
[Image: conifer1]

It will serve a 403 error and fail to load the chart:
[Image: conifer2]

If you take the URL which returned the 403 error and search for it in the archive, you can see that the capture is in fact a 403 error.
[Image: conifer3]

However, if you change the timestamp in the URL to a later hour, you'll see that there was a successful capture of the resource:
[Image: conifer4]

This happens because the crawler was throttled, and the resulting 403 and 429 errors were then patched.

My suggestion is that Conifer's replay system should prioritise successful (i.e. 200) captures to ensure more accurate replay.
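
To make the suggestion concrete, here is a minimal sketch of the selection rule I have in mind. It is not Conifer's or pywb's actual code: the `Capture` record, the `pick_capture` function, the example URL and the timestamp of the later capture are all hypothetical, and it assumes the replay index can list every capture of a URL together with its status code. Among the captures of a URL it prefers 2xx responses, and only then picks the one closest to the requested timestamp.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class Capture:
    """Hypothetical index record for one capture of a URL."""
    url: str
    timestamp: str  # 14-digit timestamp, e.g. "20210125042604"
    status: int     # HTTP status code of the recorded response


def _parse(ts: str) -> datetime:
    return datetime.strptime(ts, "%Y%m%d%H%M%S")


def pick_capture(captures, url, requested_ts):
    """Return the capture to replay for `url`, or None if there is none.

    Successful (2xx) captures are preferred; ties within the same group
    are broken by distance in time from `requested_ts`.
    """
    target = _parse(requested_ts)
    candidates = [c for c in captures if c.url == url]
    if not candidates:
        return None
    return min(
        candidates,
        key=lambda c: (
            not 200 <= c.status < 300,  # False sorts first: 2xx wins
            abs((_parse(c.timestamp) - target).total_seconds()),
        ),
    )


# Hypothetical data: a 403 at the requested time, a patched 200 two hours later.
EXAMPLE_URL = "https://example.org/data"
captures = [
    Capture(EXAMPLE_URL, "20210125042604", 403),
    Capture(EXAMPLE_URL, "20210125062604", 200),
]
chosen = pick_capture(captures, EXAMPLE_URL, "20210125042604")
print(chosen.timestamp, chosen.status)
```

With this rule the patched 200 would be served even though the 403 is the exact timestamp match. If a URL only has unsuccessful captures, the closest one is still returned, so nothing that replays today would disappear. Whether redirects (3xx) should also count as successful is left open here.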

Let me know if anything is unclear!
