Follow only internal redirects #81

Open
spekulatius opened this issue Jul 24, 2021 · 2 comments
Comments

@spekulatius
Contributor

Hello @mvdbos

I haven't found time to look into the robots.txt filter discussed in the other issue. Sorry! I stumbled on a new question you might be able to shed some light on:

I'm trying to filter out URLs that have been redirected externally. I'm keen to implement a PostFetchFilter to keep it all within the spider. Is it possible to get the final URL (after redirects) in a PostFetchFilter? It seems like only the original URL is part of the Resource.

Appreciate any ideas on how you would approach this.

Cheers,
Peter

@mvdbos
Owner

mvdbos commented Oct 30, 2021

Hi @spekulatius, my apologies for the very late reply.
One way (not tested by me) could be this:

  • Set the allow_redirects option on the Guzzle request handler when you construct it, with its track_redirects sub-option set to true. Guzzle then stores info about redirects in the X-Guzzle-Redirect-History and X-Guzzle-Redirect-Status-History response headers.
  • If I am not mistaken, Resource contains the entire response (ResponseInterface), which you can use to inspect the headers.
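
A minimal, untested-against-the-spider sketch of the header check described above. It assumes the Guzzle client was created with 'allow_redirects' => ['track_redirects' => true], so that the last value of X-Guzzle-Redirect-History is the final URL. The function names finalUrlFromHistory and isExternalRedirect are hypothetical helpers, not part of php-spider or Guzzle, and "internal" is assumed here to mean "same host as the original URL".

```php
<?php

declare(strict_types=1);

/**
 * Hypothetical helper: returns the last URL in the redirect chain,
 * or the original URL if no redirect was recorded.
 *
 * $redirectHistory is assumed to be the list of values of the
 * X-Guzzle-Redirect-History header, e.g. obtained from a PSR-7 response
 * with $response->getHeader('X-Guzzle-Redirect-History').
 */
function finalUrlFromHistory(string $originalUrl, array $redirectHistory): string
{
    if ($redirectHistory === []) {
        return $originalUrl;
    }

    return (string) end($redirectHistory);
}

/**
 * Hypothetical helper: true when the final URL is on a different host
 * than the original URL. URLs whose host cannot be parsed are treated
 * as external, which is a conservative assumption.
 */
function isExternalRedirect(string $originalUrl, array $redirectHistory): bool
{
    $originalHost = parse_url($originalUrl, PHP_URL_HOST);
    $finalHost = parse_url(finalUrlFromHistory($originalUrl, $redirectHistory), PHP_URL_HOST);

    if (!is_string($originalHost) || !is_string($finalHost)) {
        return true;
    }

    // Host names are case-insensitive, so compare without regard to case.
    return strcasecmp($originalHost, $finalHost) !== 0;
}
```

A PostFetchFilter could then call isExternalRedirect() with the resource's original URL and the header values read from its response, and filter the resource out when it returns true. Whether subdomains should count as internal is a design choice this sketch leaves open.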

@spekulatius
Contributor Author

Hello @mvdbos,

no problem. We've all got plenty of issues to take care of :) My open robots.txt issue is a sign of this...

I'll try to get to a solution using allow_redirects and let you know how it goes 👍

Cheers,
Peter
