Expose cookiejars #1878

Open
kmike opened this issue Mar 24, 2016 · 9 comments · May be fixed by #6218
@kmike
Member

kmike commented Mar 24, 2016

The Scrapy cookiejar API is limited.

I think we should provide a better API for 'sessions'. It should make it possible to:

  1. access the current session cookies;
  2. 'fork' a session - start separate sessions from the current session.

Currently I'm using an ugly hack to access cookies:

from scrapy.downloadermiddlewares.cookies import CookiesMiddleware
from scrapy.http.cookies import CookieJar


class ExposeCookiesMiddleware(CookiesMiddleware):
    """
    This middleware appends the CookieJar with current cookies to response flags.

    To use it, disable the default CookiesMiddleware and enable
    this middleware instead::

        DOWNLOADER_MIDDLEWARES = {
            'scrapy.downloadermiddlewares.cookies.CookiesMiddleware': None,
            'autologin.middleware.ExposeCookiesMiddleware': 700,
        }

    """
    def process_response(self, request, response, spider):
        response = super(ExposeCookiesMiddleware, self).process_response(
            request, response, spider)
        cookiejarkey = request.meta.get("cookiejar")
        response.flags.append(self.jars[cookiejarkey])
        return response


def get_cookiejar(response):
    for obj in response.flags:
        if isinstance(obj, CookieJar):
            return obj

I don't have a concrete API proposal, but it should likely use the word 'session' :)
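
For context, here's a minimal usage sketch (the spider is hypothetical; it assumes get_cookiejar lives next to the middleware and that the middleware is enabled as in the docstring above):

import scrapy

from autologin.middleware import get_cookiejar  # the helper defined above


class SessionSpider(scrapy.Spider):
    name = 'session-example'
    start_urls = ['https://example.com/']

    def parse(self, response):
        # With ExposeCookiesMiddleware enabled, the CookieJar travels on
        # response.flags, so any callback can read the current session cookies.
        jar = get_cookiejar(response)
        for cookie in jar:
            self.logger.info('cookie: %s=%s', cookie.name, cookie.value)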

@kmike kmike added the discuss label Mar 24, 2016
@pawelmhm
Contributor

Getting and setting cookies in Scrapy is a really huge pain, so a big 👍 from me.

In a project I'm working on now we use the following solution, which sets the "jars" from the cookies middleware on the spider and then lets you use them.

from scrapy import Spider, signals
from scrapy.downloadermiddlewares import cookies


class CustomCookiesMiddleware(cookies.CookiesMiddleware):
    @classmethod
    def from_crawler(cls, crawler):
        o = super(CustomCookiesMiddleware, cls).from_crawler(crawler)
        crawler.signals.connect(o.spider_opened, signal=signals.spider_opened)
        return o

    def spider_opened(self, spider):
        self.enabled = getattr(spider, 'cookies_enabled', self.enabled)
        spider._cookiejars = self.jars


class BaseSpider(Spider):
    def get_cookie(self, name, cookiejar=None):
        if cookiejar not in self._cookiejars:
            raise KeyError(u'cookiejar {} does not exist'.format(cookiejar))
        _dict = {c.name: c.value for c in self._cookiejars[cookiejar]}
        return _dict.get(name)

But this is just for getting cookies; we don't have anything for setting cookies, and we should definitely add something. The last time I had to replace a cookie value I had to write ugly code like this:

locale_cookie = self._cookiejars[None]._cookies[".xbox.com"]["/"].get("defCulture")
locale_cookie.value = self.locale
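
A slightly less fragile way to overwrite an existing value is to iterate the jar instead of reaching into its internal _cookies dict. This is only a sketch; set_cookie_value is a made-up helper, not anything Scrapy provides:

def set_cookie_value(jar, name, value):
    """Overwrite the value of an existing cookie in a cookielib-style jar."""
    for cookie in jar:
        if cookie.name == name:
            cookie.value = value
            return
    raise KeyError('cookie {!r} not found in jar'.format(name))

# e.g. instead of the code above:
# set_cookie_value(self._cookiejars[None], 'defCulture', self.locale)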

@pawelmhm
Contributor

pawelmhm commented Apr 1, 2016

One other difficulty that arises when you work with Scrapy cookies is that we use the Cookie object from cookielib, and this is incompatible with the Cookie object from the Cookie module. So if you want to create and add a cookie you CANNOT use the nice and easy SimpleCookie object; you have to use the Cookie object from cookielib.

Using SimpleCookie ends up like this:

# coding: utf-8
from cookielib import CookieJar # this is what we use in Scrapy
from Cookie import SimpleCookie

jar = CookieJar()
c = SimpleCookie()
c["name"] = "foo"
c["name"]["domain"] = ".github.com"
c["name"]["path"] = "/"
c.output() # 'Set-Cookie: name=foo; Domain=.github.com; Path=/'
jar.set_cookie(c)

---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-11-ad88adc0c10c> in <module>()
----> 1 jar.set_cookie(c)

/opt/python2.7/lib/python2.7/cookielib.pyc in set_cookie(self, cookie)
   1641         self._cookies_lock.acquire()
   1642         try:
-> 1643             if cookie.domain not in c: c[cookie.domain] = {}
   1644             c2 = c[cookie.domain]
   1645             if cookie.path not in c2: c2[cookie.path] = {}

AttributeError: 'SimpleCookie' object has no attribute 'domain'

This means that if you want to set a cookie on a Scrapy cookiejar you have to use cookielib.Cookie, and that object is definitely not made for humans. For example, here is how you create a Cookie from cookielib: every kwarg is required, and the constructor will fail if an appropriate value is not provided. There are no defaults, even though some values are clearly static and don't change much (e.g. comment_url=None).

from cookielib import Cookie  # this is what we use in Scrapy

c = Cookie(version=0, name='name', value='value', port=None, port_specified=False,
           domain='.github.com', domain_specified=True, domain_initial_dot=True,
           path='/', path_specified=True, secure=False, expires=1511172829,
           discard=False, comment=None, comment_url=None, rest={}, rfc2109=False)

IMO it would be nice to be able to use SimpleCookie in Scrapy; it would simplify things.
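
For what it's worth, a small conversion helper can hide most of that boilerplate. This is only a sketch (morsel_to_cookie is not something Scrapy provides), and it only carries over the handful of attributes used above:

from cookielib import Cookie, CookieJar  # http.cookiejar on Python 3
from Cookie import SimpleCookie          # http.cookies on Python 3


def morsel_to_cookie(morsel):
    """Build a cookielib.Cookie from a SimpleCookie morsel."""
    return Cookie(
        version=0, name=morsel.key, value=morsel.value,
        port=None, port_specified=False,
        domain=morsel['domain'],
        domain_specified=bool(morsel['domain']),
        domain_initial_dot=morsel['domain'].startswith('.'),
        path=morsel['path'] or '/', path_specified=bool(morsel['path']),
        secure=bool(morsel['secure']), expires=None, discard=True,
        comment=None, comment_url=None, rest={}, rfc2109=False)


c = SimpleCookie()
c['name'] = 'foo'
c['name']['domain'] = '.github.com'
c['name']['path'] = '/'

jar = CookieJar()
jar.set_cookie(morsel_to_cookie(c['name']))  # works, unlike jar.set_cookie(c)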

@pawelmhm
Contributor

pawelmhm commented Apr 20, 2016

How should this cookiejar API be designed, @kmike? Should it be part of Scrapy or perhaps some external library? I imagine that some external library could simply subclass the cookies middleware and add some useful functions and utilities, e.g. for setting/getting cookies, or maybe even for persisting cookies across spider runs (something that is currently not supported but could be very useful; a rough sketch follows below). Reading about some bot detection systems, e.g. here, they seem to appreciate clients that have long-living cookies, so perhaps persisting some cookies could be useful in dealing with them.
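
To illustrate the persistence idea, this is roughly what such a subclass could look like (the class name and the COOKIES_PERSISTENCE_PATH setting are made up, and only the cookies are pickled because the jars themselves hold locks):

import os
import pickle

from scrapy import signals
from scrapy.downloadermiddlewares.cookies import CookiesMiddleware


class PersistentCookiesMiddleware(CookiesMiddleware):
    """Save cookies on spider_closed and restore them on spider_opened."""

    def __init__(self, debug=False, path='cookiejars.pkl'):
        super(PersistentCookiesMiddleware, self).__init__(debug)
        self.path = path

    @classmethod
    def from_crawler(cls, crawler):
        o = cls(debug=crawler.settings.getbool('COOKIES_DEBUG'),
                path=crawler.settings.get('COOKIES_PERSISTENCE_PATH',
                                          'cookiejars.pkl'))
        crawler.signals.connect(o.spider_opened, signal=signals.spider_opened)
        crawler.signals.connect(o.spider_closed, signal=signals.spider_closed)
        return o

    def spider_opened(self, spider):
        if not os.path.exists(self.path):
            return
        with open(self.path, 'rb') as f:
            state = pickle.load(f)
        for key, cookies in state.items():
            for cookie in cookies:
                self.jars[key].set_cookie(cookie)

    def spider_closed(self, spider):
        # Persist plain Cookie objects rather than the jars themselves.
        state = {key: list(jar) for key, jar in self.jars.items()}
        with open(self.path, 'wb') as f:
            pickle.dump(state, f)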

One problem here is communication between the cookies middleware and the spider. Cookiejars are stored as an attribute of the middleware, so if we want to expose cookiejars they would probably have to be an attribute of the spider. Are there any problems with linking the middleware "jars" to the spider, e.g. adding a spider_opened handler to the middleware and setting the middleware "jars" on the spider instance, then adding some methods for getting and setting cookies in the middleware and making them available from the spider?

@kmike
Member Author

kmike commented Apr 20, 2016

Having cookie management built in makes more sense to me. Of course, nothing prevents creating a separate library for that (well, maybe #1877 can be a problem), but I'd prefer having good cookie management in Scrapy itself. This is a basic task that everyone needs to solve.

In scrapy-splash I implemented another cookie middleware; it exposes the current cookiejar as response.cookiejar and allows 'forking' sessions by using a new_session_id meta key. I'm not sure this is the best solution though.
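
For illustration only, the idea could look roughly like this with the stock middleware (this is not the scrapy-splash code; the SessionCookiesMiddleware name and the plain attribute on the response are just how I'd sketch it):

import copy

from scrapy.downloadermiddlewares.cookies import CookiesMiddleware


class SessionCookiesMiddleware(CookiesMiddleware):
    """Expose the current jar on responses; fork a session when asked to."""

    def process_request(self, request, spider):
        new_key = request.meta.get('new_session_id')
        if new_key is not None:
            # Fork: seed the new jar with copies of the current cookies.
            old_jar = self.jars[request.meta.get('cookiejar')]
            new_jar = self.jars[new_key]
            for cookie in old_jar:
                new_jar.set_cookie(copy.copy(cookie))
            request.meta['cookiejar'] = new_key
        return super(SessionCookiesMiddleware, self).process_request(
            request, spider)

    def process_response(self, request, response, spider):
        response = super(SessionCookiesMiddleware, self).process_response(
            request, response, spider)
        # Plain attribute; note it will not survive response.replace().
        response.cookiejar = self.jars[request.meta.get('cookiejar')]
        return response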

Having cookiejars on the spider makes sense; it also makes sense to store the current cookiejar in the spider state so that it is preserved when the on-disk request queue is restarted (how is that handled now?).

Or maybe there is another clever API trick which could make working with cookies even more convenient, I don't know :)

@gdomod

gdomod commented May 19, 2016

Please, can anybody help me:

I create a subclass of Request to log in, because I have more than one login to parse the page.
In the subclass I handle the cookie:

def request(self, *args, **kwargs):
    cb = kwargs.pop('callback')
    return self.req(meta={'cookiejar': self.cookie, 'cb': cb},
                    callback=self.callback, *args, **kwargs)

def callback(self, response):
    return response.meta['cb'](response)

My problem is the async behaviour: I need to wait for the zero.dologin() call.

self.zero = Request("login","pw")
yield self.zero.dologin()
yield self.zero.request(self.start_urls[0],callback=self.parse_forum)

In the unit tests I found https://github.com/scrapy/scrapy/blob/master/scrapy/utils/defer.py (the chain / success-process helpers), but I am too much of a novice to use that for my problem.

@jmaynier

Also, it would be great to be able to set settings like CONCURRENT_REQUESTS and DOWNLOAD_DELAY to be enforced per cookiejar.

@eliasdorneles
Member

@gdomod we use GitHub issues to discuss the development of Scrapy; please use community channels like Stack Overflow or the mailing list to ask for help on how to use it.

@eliasdorneles
Member

+1 to exposing cookiejars.

I need it now for a new project and intend to do the same as @pawelmhm mentioned (a custom middleware adding a spider attribute referencing the jars object).

@kmike
Member Author

kmike commented Jun 25, 2019

#3563 (comment) has yet another syntax proposal (haven't thought about it in depth though).
