Process cookies from header only once #4812

Draft: elacuesta wants to merge 2 commits into master
Conversation

elacuesta (Member) commented Sep 29, 2020

🚧 Work in progress 🚧

A first approach to solving #4717. The idea is to process cookies from the header only once; the actual mechanism can vary (meta key, request attribute, etc.). I know it's not much, but I'd like your opinions @Gallaecio @kmike @wRAR.
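
For illustration only, here is one possible shape of the meta-key variant. This is not the diff in this PR; the middleware name, the "header_cookies_seen" flag and the plain dict standing in for a real cookiejar are all made up, and domain/path handling is omitted:

# Purely illustrative sketch of the "meta key" idea, not the actual patch.
# The Cookie header is merged into shared state only on the first pass;
# redirected/retried copies of the request inherit request.meta, skip the
# merge, and get the (possibly fresher) stored cookies instead.
from http.cookies import SimpleCookie


class ProcessHeaderCookiesOnce:
    def __init__(self):
        self.cookies = {}  # simplified stand-in for a real CookieJar

    def process_request(self, request, spider):
        header = request.headers.get("Cookie")
        if header and not request.meta.get("header_cookies_seen"):
            # First time only: take the user-supplied Cookie header into account.
            parsed = SimpleCookie()
            parsed.load(header.decode("latin-1"))
            self.cookies.update({key: morsel.value for key, morsel in parsed.items()})
            request.meta["header_cookies_seen"] = True
        if self.cookies:
            request.headers["Cookie"] = "; ".join(
                f"{key}={value}" for key, value in self.cookies.items()
            )

    def process_response(self, request, response, spider):
        # Keep the stored cookies up to date with Set-Cookie response headers.
        for value in response.headers.getlist("Set-Cookie"):
            parsed = SimpleCookie()
            parsed.load(value.decode("latin-1"))
            self.cookies.update({key: morsel.value for key, morsel in parsed.items()})
        return response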

The following spider works with this patch, while it breaks with the current master (5a38639); I'm working on some tests to reproduce this behaviour.

import scrapy

class LoginSpider(scrapy.Spider):
    name = "login"
    start_urls = ["http://quotes.toscrape.com/login"]
    custom_settings = {"COOKIES_DEBUG": True}

    def parse(self, response):
        csrf_token = response.xpath("//input[@name='csrf_token']/@value").get()
        return scrapy.FormRequest.from_response(
            response=response,
            formxpath="//form",
            formdata={
                "csrf_token": csrf_token,
                "username": "admin",
                "password": "admin"
            },
            callback=self.after_login,
        )

    def after_login(self, response):
        logout = response.xpath("//a[text()='Logout']/text()").get()
        assert logout is not None

Fixes #1992

Gallaecio (Member)

What if a retry (instead of a redirect) needs to happen because the server sent e.g. a 503 response? Wouldn't the retry miss some required cookies?

codecov bot commented Sep 30, 2020

Codecov Report

Merging #4812 into master will increase coverage by 0.02%.
The diff coverage is 100.00%.

@@            Coverage Diff             @@
##           master    #4812      +/-   ##
==========================================
+ Coverage   87.52%   87.54%   +0.02%     
==========================================
  Files         160      160              
  Lines        9761     9762       +1     
  Branches     1440     1440              
==========================================
+ Hits         8543     8546       +3     
+ Misses        955      954       -1     
+ Partials      263      262       -1     
Impacted Files Coverage Δ
scrapy/downloadermiddlewares/cookies.py 95.78% <100.00%> (+0.04%) ⬆️
scrapy/core/downloader/__init__.py 92.48% <0.00%> (+1.50%) ⬆️

elacuesta (Member, Author)

What if a retry (instead of a redirect) needs to happen because the server sent e.g. a 503 response? Wouldn't the retry miss some required cookies?

I don't think it would; once the header cookies are processed, they are stored in the corresponding cookiejar and added to the appropriate requests. It would be an interesting test case though.
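
To illustrate the claim (this is not code from this PR, just a minimal sketch that drives scrapy.http.cookies.CookieJar directly; the URL and cookie values are made up):

# Once a response's cookies are extracted into the jar, any later request to
# the same site, such as a retried copy, gets them back from the jar.
from scrapy.http import Request, Response
from scrapy.http.cookies import CookieJar

jar = CookieJar()
original = Request("http://quotes.toscrape.com/login")
response = Response(
    "http://quotes.toscrape.com/login",
    headers={"Set-Cookie": "session=abc123; Path=/"},
    request=original,
)
jar.extract_cookies(response, original)  # store the cookie in the jar

retry = Request("http://quotes.toscrape.com/login")  # e.g. a retried request
jar.add_cookie_header(retry)  # the jar re-attaches the stored cookie
print(retry.headers.get("Cookie"))  # b'session=abc123'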

GeorgeA92 (Contributor)

What if a retry (instead of a redirect) needs to happen because the server sent e.g. a 503 response? Wouldn't the retry miss some required cookies?

I think that this and other similar cases (like the previously mentioned 302s) need to be covered by tests first (maybe implemented in a separate pull request).

I think it should be a regular test case with a mockserver with defined endpoints that return (a rough sketch follows this list):

  1. 302s with a Set-Cookie header in the response that trigger the redirect middleware
  2. 503s with the same header that trigger the retry middleware
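
Something along these lines could be a starting point for the 503 case. This is only a rough standalone sketch, not Scrapy's tests.mockserver: the port, the spider, the "retry=1" cookie and the assertion are all made up for illustration:

# A threaded HTTP server answers the first request with 503 + Set-Cookie and
# records whether the retried request sends the cookie back.
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer

import scrapy
from scrapy.crawler import CrawlerProcess

received_cookies = []

class Handler(BaseHTTPRequestHandler):
    hits = 0

    def do_GET(self):
        received_cookies.append(self.headers.get("Cookie"))
        if Handler.hits == 0:
            Handler.hits += 1
            self.send_response(503)
            self.send_header("Set-Cookie", "retry=1; Path=/")
            self.end_headers()
        else:
            self.send_response(200)
            self.end_headers()
            self.wfile.write(b"ok")

    def log_message(self, *args):
        pass  # keep the test output quiet

class RetrySpider(scrapy.Spider):
    name = "retry-cookies"
    custom_settings = {"RETRY_TIMES": 1, "COOKIES_DEBUG": True}
    start_urls = ["http://127.0.0.1:8899/"]

    def parse(self, response):
        pass

if __name__ == "__main__":
    server = HTTPServer(("127.0.0.1", 8899), Handler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    process = CrawlerProcess()
    process.crawl(RetrySpider)
    process.start()
    server.shutdown()
    # If response cookies end up in the jar before the retry happens, the
    # retried request should carry the cookie set by the 503 response.
    assert received_cookies[-1] == "retry=1", received_cookies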

GeorgeA92 mentioned this pull request on Nov 10, 2023
GeorgeA92 (Contributor)

I am not able to reproduce this issue on recent master.

script.py
from scrapy import FormRequest
import scrapy
from scrapy.crawler import CrawlerProcess

class LoginSpider(scrapy.Spider):
    name = 'login'
    # allowed_domains = ['quotes.toscrape.com']

    start_urls = ['http://quotes.toscrape.com/login']

    def parse(self, response):
        csrf_token = response.xpath('//input[@name="csrf_token"]/@value').get()
        return FormRequest.from_response(
            response,
            # formname='',
            # formcss='',
            formxpath='//form',
            formdata={
                'csrf_token': csrf_token,
                'username': 'admin',
                'password': 'admin'
            },
            callback=self.after_login
        )

    def after_login(self, response):
        logout = response.xpath("//a[text()='Logout']/text()").get()
        self.logger.info(logout)



if __name__ == "__main__":
    process = CrawlerProcess()
    process.crawl(LoginSpider)
    process.start()
log_output
2024-01-19 12:38:02 [scrapy.utils.log] INFO: Scrapy 2.10.1 started (bot: scrapybot)
2024-01-19 12:38:02 [scrapy.utils.log] INFO: Versions: lxml 4.9.1.0, libxml2 2.9.14, cssselect 1.2.0, parsel 1.6.0, w3lib 1.21.0, Twisted 22.10.0, Python 3.10.8 | packaged by conda-forge | (main, Nov 24 2022, 14:07:00) [MSC v.1916 64 bit (AMD64)], pyOpenSSL 23.0.0 (OpenSSL 1.1.1w  11 Sep 2023), cryptography 39.0.1, Platform Windows-10-10.0.22621-SP0
2024-01-19 12:38:02 [scrapy.addons] INFO: Enabled addons:
[]
2024-01-19 12:38:02 [scrapy.utils.log] DEBUG: Using reactor: twisted.internet.selectreactor.SelectReactor

2024-01-19 12:38:02 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
 'scrapy.extensions.telnet.TelnetConsole',
 'scrapy.extensions.logstats.LogStats']
2024-01-19 12:38:02 [scrapy.crawler] INFO: Overridden settings:
{}
2024-01-19 12:38:02 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
 'scrapy.downloadermiddlewares.retry.RetryMiddleware',
 'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
 'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
 'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
 'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
 'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
 'scrapy.downloadermiddlewares.stats.DownloaderStats']
2024-01-19 12:38:02 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
 'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
 'scrapy.spidermiddlewares.referer.RefererMiddleware',
 'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
 'scrapy.spidermiddlewares.depth.DepthMiddleware']
2024-01-19 12:38:02 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2024-01-19 12:38:02 [scrapy.core.engine] INFO: Spider opened
2024-01-19 12:38:02 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2024-01-19 12:38:02 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2024-01-19 12:38:03 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://quotes.toscrape.com/login> (referer: None)
2024-01-19 12:38:03 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (302) to <GET http://quotes.toscrape.com/> from <POST http://quotes.toscrape.com/login>
2024-01-19 12:38:03 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://quotes.toscrape.com/> (referer: http://quotes.toscrape.com/login)
2024-01-19 12:38:03 [login] INFO: Logout
2024-01-19 12:38:03 [scrapy.core.engine] INFO: Closing spider (finished)
2024-01-19 12:38:03 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 1227,
 'downloader/request_count': 3,
 'downloader/request_method_count/GET': 2,
 'downloader/request_method_count/POST': 1,
 'downloader/response_bytes': 3638,
 'downloader/response_count': 3,
 'downloader/response_status_count/200': 2,
 'downloader/response_status_count/302': 1,
 'elapsed_time_seconds': 0.815075,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2024, 1, 19, 10, 38, 3, 562096, tzinfo=datetime.timezone.utc),
 'httpcompression/response_bytes': 13788,
 'httpcompression/response_count': 2,
 'log_count/DEBUG': 4,
 'log_count/INFO': 11,
 'log_count/WARNING': 1,
 'request_depth_max': 1,
 'response_received_count': 2,
 'scheduler/dequeued': 3,
 'scheduler/dequeued/memory': 3,
 'scheduler/enqueued': 3,
 'scheduler/enqueued/memory': 3,
 'start_time': datetime.datetime(2024, 1, 19, 10, 38, 2, 747021, tzinfo=datetime.timezone.utc)}
2024-01-19 12:38:03 [scrapy.core.engine] INFO: Spider closed (finished)

From this test, and from the test proposed in [WIP] #6141, I don't see the issue with 302s mentioned in #4717 (comment). The proposed tests rely on being able to look into the CookieJars stored in the CookiesMiddleware instance, which is not trivial since we don't have an implementation of #1878.
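
Until #1878 provides a public way to reach the cookie jars, a test could go through internal attributes. A hedged sketch (the attribute chain below is private API and may change between Scrapy versions):

# Fetch the CookiesMiddleware instance from a running crawler and return one
# of its jars; cookiejar_key matches the "cookiejar" request.meta key.
from scrapy.downloadermiddlewares.cookies import CookiesMiddleware

def get_cookiejar(crawler, cookiejar_key=None):
    for middleware in crawler.engine.downloader.middleware.middlewares:
        if isinstance(middleware, CookiesMiddleware):
            return middleware.jars[cookiejar_key]
    raise AssertionError("CookiesMiddleware is not enabled")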

GeorgeA92 (Contributor)

@Gallaecio
As far as I can see, it was initially caused by this change in /pull/2400, and it got fixed later as a result of e865c44#diff-dd2ebb59f519dd24355f1249d4f00a3ebc30f7dfaf39828a7bd461a144ca817cR58-R59 (related to GHSA-mfjm-vh54-3f96).

Gallaecio (Member) commented Jan 19, 2024

I believe the code example in the PR description is not what needs to be fixed at the moment.

I believe this was the timeline:

  1. Cookies from the Cookie request header are not processed #1992 was reported.
  2. [MRG+1] CookiesMiddleware: keep cookies from 'Cookie' request header, fix encoding #2400 fixed it.
  3. Request cookies not updated with values from previous response #4717 reported that [MRG+1] CookiesMiddleware: keep cookies from 'Cookie' request header, fix encoding #2400 introduced a regression, which the code in the description of this PR reproduces.
  4. This PR was created to work on a proper fix.
  5. Do not process cookies from headers #4823 reverted [MRG+1] CookiesMiddleware: keep cookies from 'Cookie' request header, fix encoding #2400 to remove the regression (and unfix Cookies from the Cookie request header are not processed #1992) until a proper fix could be done as part of this PR.

Due to #4823, the actual issue in the main branch now is not #4717, but #1992 (and fixing it in a way that does not cause #4717 again).
