Scrapy "session" extension #3258

Open

dmsolow opened this issue May 11, 2018 · 15 comments

Comments

@dmsolow

dmsolow commented May 11, 2018

I'm interested in modifying Scrapy spider behavior slightly to add some custom functionality and avoid messing around with the meta dictionary so much. Basically, the implementation I'm thinking of will be an abstract subclass of scrapy.Spider which I will call SessionSpider. The primary differences will be:

  • Instead of the normal spider parse callback signature (self, response), SessionSpider will have (self, session, response) callbacks. The session argument will be some kind of Session object that at least keeps track of cookies (and possibly proxies and certain headers).

  • This will require a change in how the cookie middleware works. Instead of passing a cookie jar ID, the session will keep track of cookies directly. As a side note: does the default cookie middleware ever drop cookiejars? I could be missing something, but it looks to me like they stay around forever. This would be a problem for my spiders because I want them to run "forever" on an unbounded list of URLs.

  • A SessionSpider callback that wants to create requests with the same session will generate requests using a session.Request factory method that returns a scrapy.Request. This method will take care of merging session variables with the new request.

  • I'm hoping to implement most of the features I want by having the Session object do the meta manipulation behind the scenes so that SessionSpider subclasses don't have to touch meta as much. However, I will also have to modify/add middleware, since I want to change how cookiejars are passed around.

I thought I would post this here just to see what thoughts people have. Is this a bad idea? Has it been tried before? Any issues I might run into? I see that this kind of thing has been discussed before: #1878
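
A rough sketch of the shape I'm imagining (everything here, including the session_id meta key and the custom middleware it implies, is hypothetical, not an existing Scrapy API):

import scrapy

class Session:
    # Hypothetical container for per-session state; session_id, proxy and
    # headers are illustrative, not existing Scrapy concepts.
    def __init__(self, session_id, proxy=None, headers=None):
        self.session_id = session_id
        self.proxy = proxy
        self.headers = headers or {}

    def request(self, url, callback=None, **kwargs):
        # Factory method: merge the session state into a plain scrapy.Request
        # so spider callbacks never touch meta directly.
        meta = kwargs.pop("meta", {})
        meta["session_id"] = self.session_id  # consumed by a custom cookie middleware
        if self.proxy:
            meta["proxy"] = self.proxy
        headers = {**self.headers, **kwargs.pop("headers", {})}
        return scrapy.Request(url, callback=callback, meta=meta, headers=headers, **kwargs)

class SessionSpider(scrapy.Spider):
    # Hypothetical abstract base: rebuilds the Session from response.meta and
    # gives callbacks the (self, session, response) signature described above.
    def parse(self, response):
        session = Session(response.meta.get("session_id"))
        return self.parse_with_session(session, response)

    def parse_with_session(self, session, response):
        raise NotImplementedError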

@IAlwaysBeCoding
Contributor

This is a great idea, although adding another argument to the parse callback signature should not be done, because you would have to edit the inner workings of Scrapy to allow what you are suggesting. It's not as simple as creating a new Spider class.

I've thought about building a set of middlewares to do what you are trying to do. You must use meta to implement this and I don't think there are any other ways to do it. In fact, meta is so important to Scrapy that a lot of the default middlewares touch the Request/Response meta to implement their logic.

I think the best approach would be to make a SessionSpider with a few extra helping methods that can create Session instances that you can later pass on to simple Request instances.

Something like calling self.create_new_session() inside the spider where you can create Session instances on the spot.
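
Roughly like this (a sketch; create_new_session, the session dict layout, and the example URL are assumptions, not existing APIs):

import itertools
import scrapy

class SessionSpider(scrapy.Spider):
    name = "session_spider"  # hypothetical spider
    # Each session maps to a distinct cookiejar key, which the stock
    # CookiesMiddleware already understands, so no core changes are needed.
    _session_counter = itertools.count()

    def create_new_session(self):
        return {"cookiejar": next(self._session_counter)}

    def start_requests(self):
        session = self.create_new_session()
        # Pass the session on to a plain Request via meta.
        yield scrapy.Request("https://example.com", meta=dict(session))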

@lucywang000
Member

@dmsolow thanks for the clear description. But I'm still wondering: what's the main advantage of the new SessionSpider and Session concepts? In my understanding, a session is backed by a cookiejar.

a session.Request factory method that returns a scrapy.Request. This method will take care of merging session variables with the new request.

A session variable is just a cookiejar index, right? In our project we have a spider middleware that populates the request cookiejar index based on the response cookiejar index, which ensures the new request uses the same session as the response.
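
For reference, a minimal sketch of such a spider middleware (the class name is ours; "cookiejar" is the meta key Scrapy's stock CookiesMiddleware already reads):

import scrapy

class CookiejarPropagationMiddleware:
    # Spider middleware: copy the response's cookiejar index onto outgoing
    # requests that don't set their own, so follow-ups stay in the same session.
    def process_spider_output(self, response, result, spider):
        jar = response.meta.get("cookiejar")
        for obj in result:
            if jar is not None and isinstance(obj, scrapy.Request):
                obj.meta.setdefault("cookiejar", jar)
            yield obj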

@kmike
Member

kmike commented Jun 25, 2019

#3563 (comment) is an idea in a similar direction.

@lycanthropes

What's the progress on this now? I've also run into the session problem.

@Gallaecio
Member

@lycanthropes There is currently no one working on this feature.

@lycanthropes

There is actually an official solution to this; I found it yesterday.

@Gallaecio
Member

@lycanthropes Which solution?

@Gallaecio
Member

That is not exactly what the original suggestion is about. If you read the original suggestion above carefully, you’ll see it mentions that solution already (“Instead of passing a cookie jar ID”).

@lycanthropes

Do you mean the clarification from @lucywang000?

@Gallaecio
Member

I mean the issue description.

@andrewbaxter

andrewbaxter commented Jul 16, 2019

I'd like to add some notes from internal discussion with @raphapassini about sessions here too:

  • Some "session setup" requests need to be made when the cookie jar is empty or a new IP is used, to establish a session, get some standard cookies, etc.

  • Requests might need to be sent 1. explicitly in the current session (if cookies are needed for the request to pass, or the request depends on state set up in previous requests), 2. in any session, or 3. explicitly in a new session if it needs clear cookies, history, etc.

  • A session may get into a bad state where requests on the session no longer work. Requests may need to be paused and retried after executing a "refresh" process, or else retried in a new session (sketched below).

  • Perhaps sessions should be refreshed automatically on a periodic schedule.

  • Sessions may need to separate and persist cookies, but perhaps also headers (to interface with external session APIs, e.g. a Crawlera session ID).

  • Inspect session stats (how many sessions created/destroyed, requests per session, etc.) to debug crawl issues.

I don't know what I'm talking about particularly, but maybe a Scheduler could be a good place to start implementing this? I've worked on solutions that wrapped callbacks to juggle queues of requests per session, but there were significant difficulties due to callbacks never running (because of dupe filtering, unexpected errors, etc.) and sessions getting into an indeterminate state.
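
For illustration, the "retry in a new session" case from the list above might look roughly like this at the spider level (session_expired, the jar counter, and the status heuristic are assumptions, not existing APIs):

import itertools
import scrapy

class SessionAwareSpider(scrapy.Spider):
    name = "session_aware"  # hypothetical spider
    handle_httpstatus_list = [401, 403]  # let these statuses reach the callback
    _fresh_jar = itertools.count(1)

    def parse(self, response):
        if self.session_expired(response):
            # Re-issue the same request in a brand-new session: a fresh
            # cookiejar, with dont_filter=True so the dupe filter lets it through.
            yield response.request.replace(
                meta={**response.request.meta, "cookiejar": next(self._fresh_jar)},
                dont_filter=True,
            )
            return
        ...  # normal parsing would go here

    def session_expired(self, response):
        # Placeholder heuristic for a session in a bad state.
        return response.status in (401, 403)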

@ThomasAitken

Thoughts on this?
https://github.com/ThomasAitken/scrapy-sessions

@GeorgeA92
Contributor

@ThomasAitken
From the https://github.com/ThomasAitken/scrapy-sessions README:

Scrapy's sessions are a black box

That is not true. CookiesMiddleware is basically a wrapper around a dictionary of CookieJar objects from Python's built-in http.cookiejar module.

...They can't be exposed within a scrape and they can't be directly altered.
2. Scrapy makes it very difficult to easily replace a session (and/or general 'profile') unilaterally across all requests that are scheduled or enqueued. This is important for engaging with websites that have session-expiry logic.

It is possible to reach the CookiesMiddleware object and its contents directly from the spider's start_requests and parse methods (via the crawler object, and the same goes for most other middlewares/modules).

Unfortunately the Crawler object doesn't have a method to look up a middleware object by name, so it takes this... trick:

import scrapy

class MySpider(scrapy.Spider):
    def start_requests(self):
        # Locate the CookiesMiddleware instance among the enabled downloader middlewares.
        middlewares = self.crawler.engine.downloader.middleware.middlewares
        self.cookies_mw = next(mw for mw in middlewares if "CookiesMiddleware" in str(type(mw)))

With direct access to the CookieJar objects held by the CookiesMiddleware, plus the cookiejar ID from response.meta, you are already able to make any manipulation of sessions you want.
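
For example, continuing the spider above (a sketch; the logging and the decision to clear the jar are illustrative):

    def parse(self, response):
        # Jars are keyed by the "cookiejar" meta value; the key defaults to
        # None, which is the single shared jar.
        jar = self.cookies_mw.jars[response.meta.get("cookiejar")]
        for cookie in jar.jar:  # the wrapped http.cookiejar.CookieJar is iterable
            self.logger.info("session cookie: %s=%s", cookie.name, cookie.value)
        jar.clear()  # e.g. drop the session entirely after detecting expiry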

  1. Scrapy provides no native capability for maintaining distinct profiles (client identities) within a single scrape.

Unfortunately, this is true. By default Scrapy uses a single CookieJar object for all requests and a single user agent from settings, and a lot of additional issues arise when multiple proxies are used. Most of the publicly available proxy rotation modules for Scrapy don't create a CookieJar per proxy, so they are not session safe.

The idea of this tool is to manage distinct client identities within a scrape. The identity consists of two or more of the following attributes: session + user agent + proxy.

Some proxy providers already include session handling as a service in addition to scraping proxies. In that case, only proxy handling is required on the Scrapy user's side.

For the rest of the cases, I agree the idea is relevant.

from w3lib.http import basic_auth_header

PROFILES = [
    {"proxy": ["proxy_url", basic_auth_header("username", "password")], "user-agent": "MY USER AGENT"},
    {"proxy": ["proxy_url", basic_auth_header("username", "password")], "user-agent": "MY USER AGENT"},
]

In order to bind a proxy address to a cookiejar, it is enough to use the same key value for the proxy and cookiejar request meta keys (no extra middleware required), as I did in this gist code sample.
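
For example (a sketch assuming the PROFILES list above; the spider name and start URL are placeholders):

import scrapy

class ProfileSpider(scrapy.Spider):
    name = "profiles"

    def start_requests(self):
        for i, profile in enumerate(PROFILES):
            proxy_url, proxy_auth = profile["proxy"]
            yield scrapy.Request(
                "https://example.com",
                # Using the same index as both the cookiejar key and the proxy
                # choice binds this jar to this proxy for the whole crawl.
                meta={"cookiejar": i, "proxy": proxy_url},
                headers={"User-Agent": profile["user-agent"],
                         "Proxy-Authorization": proxy_auth},
            )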

@ThomasAitken

ThomasAitken commented May 1, 2021

@GeorgeA92

Thanks for your feedback.

Yes, I understand that the Scrapy CookieJar is a wrapper around the http.cookiejar one, and yes, you are right that you can use that trick to access the cookie middleware. But my API offers much nicer syntax, convenient methods for inspecting specific cookies, and functionality to 'refresh' sessions.

As for the 'profiles' aspect, yes you are right that you can bind proxy addresses to a cookiejar as you do in that code sample, but there are other benefits to the way I have set things up. It is just a convenient way of setting something up that is normally very difficult in Scrapy.
