Expose cookiejars #1878

Open
kmike opened this issue Mar 24, 2016 · 9 comments · May be fixed by #6218
@kmike
Member

kmike commented Mar 24, 2016

The Scrapy cookiejar API is limited.

I think we should provide a better API for 'sessions'. It should make it possible to:

  1. access the current session cookies;
  2. 'fork' a session - start separate sessions from the current session.

Currently I'm using an ugly hack to access cookies:

from scrapy.downloadermiddlewares.cookies import CookiesMiddleware
from scrapy.http.cookies import CookieJar


class ExposeCookiesMiddleware(CookiesMiddleware):
    """
    This middleware appends the CookieJar with current cookies to response flags.

    To use it, disable the default CookiesMiddleware and enable
    this middleware instead::

        DOWNLOADER_MIDDLEWARES = {
            'scrapy.downloadermiddlewares.cookies.CookiesMiddleware': None,
            'autologin.middleware.ExposeCookiesMiddleware': 700,
        }

    """
    def process_response(self, request, response, spider):
        response = super(ExposeCookiesMiddleware, self).process_response(
            request, response, spider)
        cookiejarkey = request.meta.get("cookiejar")
        response.flags.append(self.jars[cookiejarkey])
        return response


def get_cookiejar(response):
    for obj in response.flags:
        if isinstance(obj, CookieJar):
            return obj

I don't have a concrete API proposal, but it should likely use the word 'session' :)
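
For context, here's a minimal usage sketch (the spider is hypothetical; it assumes get_cookiejar lives next to the middleware and that the middleware is enabled as in the docstring above):

import scrapy

from autologin.middleware import get_cookiejar  # the helper defined above


class SessionSpider(scrapy.Spider):
    name = 'session-example'
    start_urls = ['https://example.com/']

    def parse(self, response):
        # With ExposeCookiesMiddleware enabled, the CookieJar travels on
        # response.flags, so any callback can read the current session cookies.
        jar = get_cookiejar(response)
        for cookie in jar:
            self.logger.info('cookie: %s=%s', cookie.name, cookie.value)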

@kmike kmike added the discuss label Mar 24, 2016
@pawelmhm
Contributor

Getting and setting cookies in Scrapy is a really huge pain, so a big 👍 from me.

In a project I'm working on now we use the following solution, which sets the "jars" from the cookies middleware on the spider and then lets you use them.

from scrapy import Spider, signals
from scrapy.downloadermiddlewares import cookies


class CustomCookiesMiddleware(cookies.CookiesMiddleware):
    @classmethod
    def from_crawler(cls, crawler):
        o = super(CustomCookiesMiddleware, cls).from_crawler(crawler)
        crawler.signals.connect(o.spider_opened, signal=signals.spider_opened)
        return o

    def spider_opened(self, spider):
        self.enabled = getattr(spider, 'cookies_enabled', self.enabled)
        spider._cookiejars = self.jars


class BaseSpider(Spider):
    def get_cookie(self, name, cookiejar=None):
        if cookiejar not in self._cookiejars:
            raise KeyError(u'cookiejar {} does not exist'.format(cookiejar))
        _dict = {c.name: c.value for c in self._cookiejars[cookiejar]}
        return _dict.get(name)

But this is just for getting cookies; we don't have anything for setting cookies, and we should definitely add something. The last time I had to replace a cookie value I had to write ugly code like this:

locale_cookie = self._cookiejars[None]._cookies[".xbox.com"]["/"].get("defCulture")
locale_cookie.value = self.locale
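
A slightly less fragile way to overwrite an existing value is to iterate the jar instead of reaching into its internal _cookies dict. This is only a sketch; set_cookie_value is a made-up helper, not anything Scrapy provides:

def set_cookie_value(jar, name, value):
    """Overwrite the value of an existing cookie in a cookielib-style jar."""
    for cookie in jar:
        if cookie.name == name:
            cookie.value = value
            return
    raise KeyError('cookie {!r} not found in jar'.format(name))

# e.g. instead of the code above:
# set_cookie_value(self._cookiejars[None], 'defCulture', self.locale)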

@pawelmhm
Contributor

pawelmhm commented Apr 1, 2016

One other difficulty that arises when you work with Scrapy cookies is that we use the Cookie object from cookielib, and this is incompatible with the Cookie object from the Cookie module. So if you want to create and add a cookie you CANNOT use the nice and easy SimpleCookie object; you have to use the Cookie object from cookielib.

Using SimpleCookie ends up like this:

# coding: utf-8
from cookielib import CookieJar # this is what we use in Scrapy
from Cookie import SimpleCookie

jar = CookieJar()
c = SimpleCookie()
c["name"] = "foo"
c["name"]["domain"] = ".github.com"
c["name"]["path"] = "/"
c.output() # 'Set-Cookie: name=foo; Domain=.github.com; Path=/'
jar.set_cookie(c)

---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-11-ad88adc0c10c> in <module>()
----> 1 jar.set_cookie(c)

/opt/python2.7/lib/python2.7/cookielib.pyc in set_cookie(self, cookie)
   1641         self._cookies_lock.acquire()
   1642         try:
-> 1643             if cookie.domain not in c: c[cookie.domain] = {}
   1644             c2 = c[cookie.domain]
   1645             if cookie.path not in c2: c2[cookie.path] = {}

AttributeError: 'SimpleCookie' object has no attribute 'domain'

This means that if you want to set a cookie on a Scrapy cookiejar you have to use cookielib.Cookie, and that object is definitely not made for humans. For example, here is how you create a Cookie from cookielib: every kwarg is required, and the constructor will fail if an appropriate value is not provided. There are no defaults, even though some values are clearly static and don't change much (e.g. comment_url=None).

from cookielib import Cookie  # this is what we use in Scrapy

c = Cookie(version=0, name='name', value='value', port=None, port_specified=False,
           domain='.github.com', domain_specified=True, domain_initial_dot=True,
           path='/', path_specified=True, secure=False, expires=1511172829,
           discard=False, comment=None, comment_url=None, rest={}, rfc2109=False)

IMO it would be nice to be able to use SimpleCookie in Scrapy; it would simplify things.
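
For what it's worth, a small conversion helper can hide most of that boilerplate. This is only a sketch (morsel_to_cookie is not something Scrapy provides), and it only carries over the handful of attributes used above:

from cookielib import Cookie, CookieJar  # http.cookiejar on Python 3
from Cookie import SimpleCookie          # http.cookies on Python 3


def morsel_to_cookie(morsel):
    """Build a cookielib.Cookie from a SimpleCookie morsel."""
    return Cookie(
        version=0, name=morsel.key, value=morsel.value,
        port=None, port_specified=False,
        domain=morsel['domain'],
        domain_specified=bool(morsel['domain']),
        domain_initial_dot=morsel['domain'].startswith('.'),
        path=morsel['path'] or '/', path_specified=bool(morsel['path']),
        secure=bool(morsel['secure']), expires=None, discard=True,
        comment=None, comment_url=None, rest={}, rfc2109=False)


c = SimpleCookie()
c['name'] = 'foo'
c['name']['domain'] = '.github.com'
c['name']['path'] = '/'

jar = CookieJar()
jar.set_cookie(morsel_to_cookie(c['name']))  # works, unlike jar.set_cookie(c)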

@pawelmhm
Contributor

pawelmhm commented Apr 20, 2016

How should this cookiejar API be designed, @kmike? Should it be part of Scrapy or perhaps some external library? I imagine that some external library could simply subclass the cookies middleware and add some useful functions and utilities, e.g. for setting/getting cookies, or maybe even for persisting cookies across spider runs (something that is currently not supported but could be very useful; a rough sketch follows below). Reading about some bot detection systems, e.g. here, they seem to appreciate clients that have long-living cookies, so perhaps persisting some cookies could be useful in dealing with them.
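
To illustrate the persistence idea, this is roughly what such a subclass could look like (the class name and the COOKIES_PERSISTENCE_PATH setting are made up, and only the cookies are pickled because the jars themselves hold locks):

import os
import pickle

from scrapy import signals
from scrapy.downloadermiddlewares.cookies import CookiesMiddleware


class PersistentCookiesMiddleware(CookiesMiddleware):
    """Save cookies on spider_closed and restore them on spider_opened."""

    def __init__(self, debug=False, path='cookiejars.pkl'):
        super(PersistentCookiesMiddleware, self).__init__(debug)
        self.path = path

    @classmethod
    def from_crawler(cls, crawler):
        o = cls(debug=crawler.settings.getbool('COOKIES_DEBUG'),
                path=crawler.settings.get('COOKIES_PERSISTENCE_PATH',
                                          'cookiejars.pkl'))
        crawler.signals.connect(o.spider_opened, signal=signals.spider_opened)
        crawler.signals.connect(o.spider_closed, signal=signals.spider_closed)
        return o

    def spider_opened(self, spider):
        if not os.path.exists(self.path):
            return
        with open(self.path, 'rb') as f:
            state = pickle.load(f)
        for key, cookies in state.items():
            for cookie in cookies:
                self.jars[key].set_cookie(cookie)

    def spider_closed(self, spider):
        # Persist plain Cookie objects rather than the jars themselves.
        state = {key: list(jar) for key, jar in self.jars.items()}
        with open(self.path, 'wb') as f:
            pickle.dump(state, f)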

One problem here is communication between the cookies middleware and the spider. Cookiejars are stored as an attribute of the middleware, so if we want to expose cookiejars they would probably have to be an attribute of the spider. Are there any problems with linking the middleware "jars" to the spider, e.g. adding a spider_opened handler to the middleware and setting the middleware "jars" on the spider instance, then adding some methods for getting and setting cookies in the middleware and making them available from the spider?

@kmike
Member Author

kmike commented Apr 20, 2016

Having cookie management built in makes more sense to me. Of course, nothing prevents creating a separate library for that (well, maybe #1877 can be a problem), but I'd prefer having good cookie management in Scrapy itself. This is a basic task that everyone needs to solve.

In scrapy-splash I implemented another cookie middleware; it exposes the current cookiejar as response.cookiejar and allows 'forking' sessions by using a new_session_id meta key. I'm not sure this is the best solution though.
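
For illustration only, the idea could look roughly like this with the stock middleware (this is not the scrapy-splash code; the SessionCookiesMiddleware name and the plain attribute on the response are just how I'd sketch it):

import copy

from scrapy.downloadermiddlewares.cookies import CookiesMiddleware


class SessionCookiesMiddleware(CookiesMiddleware):
    """Expose the current jar on responses; fork a session when asked to."""

    def process_request(self, request, spider):
        new_key = request.meta.get('new_session_id')
        if new_key is not None:
            # Fork: seed the new jar with copies of the current cookies.
            old_jar = self.jars[request.meta.get('cookiejar')]
            new_jar = self.jars[new_key]
            for cookie in old_jar:
                new_jar.set_cookie(copy.copy(cookie))
            request.meta['cookiejar'] = new_key
        return super(SessionCookiesMiddleware, self).process_request(
            request, spider)

    def process_response(self, request, response, spider):
        response = super(SessionCookiesMiddleware, self).process_response(
            request, response, spider)
        # Plain attribute; note it will not survive response.replace().
        response.cookiejar = self.jars[request.meta.get('cookiejar')]
        return response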

Having cookiejars on the spider makes sense; it also makes sense to store the current cookiejar in the spider state so that it is preserved when the on-disk request queue is restarted (how is that handled now?).

Or maybe there is another clever API trick which could make working with cookies even more convenient, I don't know :)

@gdomod

gdomod commented May 19, 2016

Please, can anybody help me:

I create a subclass of Request to log in, because I have more than one login to parse the page.
In the subclass I handle the cookie:

def request(self, *args, **kwargs):
    cb = kwargs.pop('callback')
    return self.req(meta={'cookiejar': self.cookie, 'cb': cb},
                    callback=self.callback, *args, **kwargs)

def callback(self, response):
    return response.meta['cb'](response)

My problem is the async behaviour: I need to wait for the zero.dologin() call.

self.zero = Request("login","pw")
yield self.zero.dologin()
yield self.zero.request(self.start_urls[0],callback=self.parse_forum)

In the unit tests I found https://github.com/scrapy/scrapy/blob/master/scrapy/utils/defer.py (the chain / success-process helpers), but I am too much of a novice to use that for my problem.

@jmaynier

Also, it would be great to be able to set settings like CONCURRENT_REQUESTS and DOWNLOAD_DELAY to be enforced per cookiejar.

@eliasdorneles
Member

@gdomod we use GitHub issues to discuss the development of Scrapy; please use community channels like Stack Overflow or the mailing list to ask for help on how to use it.

@eliasdorneles
Member

+1 to exposing cookiejars.

I need it now for a new project and intend to do the same as @pawelmhm mentioned (a custom middleware adding a spider attribute referencing the jars object).

@kmike
Member Author

kmike commented Jun 25, 2019

#3563 (comment) has yet another syntax proposal (haven't thought about it in depth though).
