Scrapy "session" extension #3258

Open

dmsolow opened this issue May 11, 2018 · 15 comments

Comments

@dmsolow

dmsolow commented May 11, 2018

I'm interested in modifying Scrapy spider behavior slightly to add some custom functionality and avoid messing around with the meta dictionary so much. Basically, the implementation I'm thinking of will be an abstract subclass of scrapy.Spider which I will call SessionSpider. The primary differences will be:

  • Instead of the normal spider parse callback signature (self, response), SessionSpider will have (self, session, response) callbacks. The session argument will be some kind of Session object that at least keeps track of cookies (and possibly proxies and certain headers).

  • This will require a change in how the cookie middleware works. Instead of passing a cookie jar ID, the session will keep track of cookies directly. As a side note: does the default cookie middleware ever drop cookiejars? I could be missing something, but it looks to me like they stay around forever. This would be a problem for my spiders because I want them to run "forever" on an unbounded list of URLs.

  • A SessionSpider callback that wants to create requests with the same session will generate requests using a session.Request factory method that returns a scrapy.Request. This method will take care of merging session variables with the new request.

  • I'm hoping to implement most of the features I want by having the Session object do the meta manipulation behind the scenes so that SessionSpider subclasses don't have to touch meta as much. However, I will also have to modify/add middleware, since I want to change how cookiejars are passed around.

I thought I would post this here just to see what thoughts people have. Is this a bad idea? Has it been tried before? Any issues I might run into? I see that this kind of thing has been discussed before: #1878
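
A rough sketch of the shape I'm imagining (everything here, including the session_id meta key and the custom middleware it implies, is hypothetical, not an existing Scrapy API):

import scrapy

class Session:
    # Hypothetical container for per-session state; session_id, proxy and
    # headers are illustrative, not existing Scrapy concepts.
    def __init__(self, session_id, proxy=None, headers=None):
        self.session_id = session_id
        self.proxy = proxy
        self.headers = headers or {}

    def request(self, url, callback=None, **kwargs):
        # Factory method: merge the session state into a plain scrapy.Request
        # so spider callbacks never touch meta directly.
        meta = kwargs.pop("meta", {})
        meta["session_id"] = self.session_id  # consumed by a custom cookie middleware
        if self.proxy:
            meta["proxy"] = self.proxy
        headers = {**self.headers, **kwargs.pop("headers", {})}
        return scrapy.Request(url, callback=callback, meta=meta, headers=headers, **kwargs)

class SessionSpider(scrapy.Spider):
    # Hypothetical abstract base: rebuilds the Session from response.meta and
    # gives callbacks the (self, session, response) signature described above.
    def parse(self, response):
        session = Session(response.meta.get("session_id"))
        return self.parse_with_session(session, response)

    def parse_with_session(self, session, response):
        raise NotImplementedError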

@IAlwaysBeCoding
Contributor

This is a great idea, although adding another argument to the parse callback signature should not be done, because you would have to edit the inner workings of Scrapy to allow what you are suggesting. It's not as simple as creating a new Spider class.

I've thought about building a set of middlewares to do what you are trying to do. You must use meta to implement this and I don't think there are any other ways to do it. In fact, meta is so important to Scrapy that a lot of the default middlewares touch the Request/Response meta to implement their logic.

I think the best approach would be to make a SessionSpider with a few extra helping methods that can create Session instances that you can later pass on to simple Request instances.

Something like calling self.create_new_session() inside the spider where you can create Session instances on the spot.
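
Roughly like this (a sketch; create_new_session, the session dict layout, and the example URL are assumptions, not existing APIs):

import itertools
import scrapy

class SessionSpider(scrapy.Spider):
    name = "session_spider"  # hypothetical spider
    # Each session maps to a distinct cookiejar key, which the stock
    # CookiesMiddleware already understands, so no core changes are needed.
    _session_counter = itertools.count()

    def create_new_session(self):
        return {"cookiejar": next(self._session_counter)}

    def start_requests(self):
        session = self.create_new_session()
        # Pass the session on to a plain Request via meta.
        yield scrapy.Request("https://example.com", meta=dict(session))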

@lucywang000
Member

@dmsolow thanks for the clear description. But I'm still wondering: what's the main advantage of the new SessionSpider and Session concepts? In my understanding, a session is backed by a cookiejar.

a session.Request factory method that returns a scrapy.Request. This method will take care of merging session variables with the new request.

A session variable is just a cookiejar index, right? In our project we have a spider middleware that populates the request cookiejar index based on the response cookiejar index, which ensures the new request uses the same session as the response.
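
For reference, a minimal sketch of such a spider middleware (the class name is ours; "cookiejar" is the meta key Scrapy's stock CookiesMiddleware already reads):

import scrapy

class CookiejarPropagationMiddleware:
    # Spider middleware: copy the response's cookiejar index onto outgoing
    # requests that don't set their own, so follow-ups stay in the same session.
    def process_spider_output(self, response, result, spider):
        jar = response.meta.get("cookiejar")
        for obj in result:
            if jar is not None and isinstance(obj, scrapy.Request):
                obj.meta.setdefault("cookiejar", jar)
            yield obj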

@kmike
Member

kmike commented Jun 25, 2019

#3563 (comment) is an idea in a similar direction.

@lycanthropes

What's the progress on this now? I've also run into the session problem.

@Gallaecio
Member

@lycanthropes There is currently no one working on this feature.

@lycanthropes

There is actually an official solution to this; I found it yesterday.

@Gallaecio
Member

@lycanthropes Which solution?

@Gallaecio
Member

That is not exactly what the original suggestion is about. If you read the original suggestion above carefully, you’ll see it mentions that solution already (“Instead of passing a cookie jar ID”).

@lycanthropes

Do you mean the clarification from @lucywang000?

@Gallaecio
Member

I mean the issue description.

@andrewbaxter

andrewbaxter commented Jul 16, 2019

I'd like to add some notes from internal discussion with @raphapassini about sessions here too:

  • Some "session setup" requests need to be made when the cookie jar is empty or a new IP is used, to establish a session, get some standard cookies, etc.

  • Requests might need to be sent 1. explicitly in the current session (if cookies are needed for the request to pass, or the request depends on state set up in previous requests), 2. in any session, or 3. explicitly in a new session if it needs clear cookies, history, etc.

  • A session may get into a bad state where requests on the session no longer work. Requests may need to be paused and retried after executing a "refresh" process, or else retried in a new session (sketched below).

  • Perhaps sessions should be refreshed automatically on a periodic schedule.

  • Sessions may need to separate and persist cookies, but perhaps also headers (to interface with external session APIs, e.g. a Crawlera session ID).

  • Inspect session stats (how many sessions created/destroyed, requests per session, etc.) to debug crawl issues.

I don't know what I'm talking about particularly, but maybe a Scheduler could be a good place to start implementing this? I've worked on solutions that wrapped callbacks to juggle queues of requests per session, but there were significant difficulties due to callbacks never running (because of dupe filtering, unexpected errors, etc.) and sessions getting into an indeterminate state.
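
For illustration, the "retry in a new session" case from the list above might look roughly like this at the spider level (session_expired, the jar counter, and the status heuristic are assumptions, not existing APIs):

import itertools
import scrapy

class SessionAwareSpider(scrapy.Spider):
    name = "session_aware"  # hypothetical spider
    handle_httpstatus_list = [401, 403]  # let these statuses reach the callback
    _fresh_jar = itertools.count(1)

    def parse(self, response):
        if self.session_expired(response):
            # Re-issue the same request in a brand-new session: a fresh
            # cookiejar, with dont_filter=True so the dupe filter lets it through.
            yield response.request.replace(
                meta={**response.request.meta, "cookiejar": next(self._fresh_jar)},
                dont_filter=True,
            )
            return
        ...  # normal parsing would go here

    def session_expired(self, response):
        # Placeholder heuristic for a session in a bad state.
        return response.status in (401, 403)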

@ThomasAitken

Thoughts on this?
https://github.com/ThomasAitken/scrapy-sessions

@GeorgeA92
Contributor

@ThomasAitken
From the https://github.com/ThomasAitken/scrapy-sessions README:

Scrapy's sessions are a black box

That is not true. CookiesMiddleware is basically a wrapper around a dictionary of CookieJar objects from Python's built-in http.cookiejar module.

...They can't be exposed within a scrape and they can't be directly altered.
2. Scrapy makes it very difficult to easily replace a session (and/or general 'profile') unilaterally across all requests that are scheduled or enqueued. This is important for engaging with websites that have session-expiry logic.

It is possible to reach the CookiesMiddleware object and its contents directly from the spider's start_requests and parse methods (via the crawler object, and the same goes for most other middlewares/modules).

Unfortunately the Crawler object doesn't have a method to look up a middleware object by name, so it takes this... trick:

import scrapy

class MySpider(scrapy.Spider):
    def start_requests(self):
        # Locate the CookiesMiddleware instance among the enabled downloader middlewares.
        middlewares = self.crawler.engine.downloader.middleware.middlewares
        self.cookies_mw = next(mw for mw in middlewares if "CookiesMiddleware" in str(type(mw)))

With direct access to the CookieJar objects held by the CookiesMiddleware, plus the cookiejar ID from response.meta, you are already able to make any manipulation of sessions you want.
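
For example, continuing the spider above (a sketch; the logging and the decision to clear the jar are illustrative):

    def parse(self, response):
        # Jars are keyed by the "cookiejar" meta value; the key defaults to
        # None, which is the single shared jar.
        jar = self.cookies_mw.jars[response.meta.get("cookiejar")]
        for cookie in jar.jar:  # the wrapped http.cookiejar.CookieJar is iterable
            self.logger.info("session cookie: %s=%s", cookie.name, cookie.value)
        jar.clear()  # e.g. drop the session entirely after detecting expiry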

  1. Scrapy provides no native capability for maintaining distinct profiles (client identities) within a single scrape.

Unfortunately, this is true. By default Scrapy uses a single CookieJar object for all requests and a single user agent from settings, and a lot of additional issues arise when multiple proxies are used. Most of the publicly available proxy rotation modules for Scrapy don't create a CookieJar per proxy, so they are not session safe.

The idea of this tool is to manage distinct client identities within a scrape. The identity consists of two or more of the following attributes: session + user agent + proxy.

Some proxy providers already include session handling as a service in addition to scraping proxies. In that case, only proxy handling is required on the Scrapy user's side.

For the rest of the cases, I agree the idea is relevant.

from w3lib.http import basic_auth_header

PROFILES = [
    {"proxy": ["proxy_url", basic_auth_header("username", "password")], "user-agent": "MY USER AGENT"},
    {"proxy": ["proxy_url", basic_auth_header("username", "password")], "user-agent": "MY USER AGENT"},
]

In order to bind a proxy address to a cookiejar, it is enough to use the same key value for the proxy and cookiejar request meta keys (no extra middleware required), as I did in this gist code sample.
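
For example (a sketch assuming the PROFILES list above; the spider name and start URL are placeholders):

import scrapy

class ProfileSpider(scrapy.Spider):
    name = "profiles"

    def start_requests(self):
        for i, profile in enumerate(PROFILES):
            proxy_url, proxy_auth = profile["proxy"]
            yield scrapy.Request(
                "https://example.com",
                # Using the same index as both the cookiejar key and the proxy
                # choice binds this jar to this proxy for the whole crawl.
                meta={"cookiejar": i, "proxy": proxy_url},
                headers={"User-Agent": profile["user-agent"],
                         "Proxy-Authorization": proxy_auth},
            )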

@ThomasAitken

ThomasAitken commented May 1, 2021

@GeorgeA92

Thanks for your feedback.

Yes, I understand that the Scrapy CookieJar is a wrapper around the http.cookiejar one, and yes, you are right that you can use that trick to access the cookie middleware. But my API offers much nicer syntax, convenient methods for inspecting specific cookies, and functionality to 'refresh' sessions.

As for the 'profiles' aspect, yes you are right that you can bind proxy addresses to a cookiejar as you do in that code sample, but there are other benefits to the way I have set things up. It is just a convenient way of setting something up that is normally very difficult in Scrapy.
