primary_netloc and primary_url use the parent instead of root URL #538

Open
JustAnotherArchivist opened this issue Jul 19, 2022 · 0 comments

The effect is that if such an ignore is added later in the job, it won't have the expected effect. For example, a job for `https://example.org/` comes across a link to `https://example.net/`, which in turn contains a frame `https://example.net/foo`. If an ignore `^https?://(?!{primary_netloc}/)` is added at the beginning, only the first URL is retrieved. If it's added after the retrieval of `https://example.net/`, all three are retrieved even though the frame should be ignored: due to this bug, `primary_netloc` is already `example.net` at that point, so the ignore doesn't match.
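To make the mismatch concrete, here is a minimal standalone sketch of the substitution described above. It is not ArchiveBot's actual code: `is_ignored` is a hypothetical helper, and escaping the netloc with `re.escape` before substitution is an assumption made for the illustration.

```python
import re
from urllib.parse import urlsplit

# Ignore pattern from the example above, with the placeholder left unexpanded.
IGNORE_TEMPLATE = r"^https?://(?!{primary_netloc}/)"


def is_ignored(url, primary_url):
    """Hypothetical helper: expand {primary_netloc} from primary_url and
    test the resulting pattern against url."""
    primary_netloc = urlsplit(primary_url).netloc
    # Assumption: the netloc is regex-escaped before being substituted.
    pattern = IGNORE_TEMPLATE.format(primary_netloc=re.escape(primary_netloc))
    return re.search(pattern, url) is not None


root = "https://example.org/"     # URL the job was started for
parent = "https://example.net/"   # page linked from the root
frame = "https://example.net/foo" # frame embedded in the parent page

# Expected behaviour: the placeholder is derived from the root URL,
# so the frame on the other host matches the ignore.
frame_ignored_with_root = is_ignored(frame, primary_url=root)

# Behaviour described in this issue: the placeholder is derived from the
# parent URL, so the frame's host equals primary_netloc and nothing matches.
frame_ignored_with_parent = is_ignored(frame, primary_url=parent)

print(frame_ignored_with_root, frame_ignored_with_parent)
```

With the root URL as the source of `primary_netloc`, the frame is ignored; with the parent URL, the negative lookahead fails and the frame slips through.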

This was introduced by 967d5aa while porting to wpull 2. The ignoracle tests are currently broken and disabled (4d3e4fc) and need to be fixed first.
