Process cookies from header only once #4812

Draft: elacuesta wants to merge 2 commits into master
Conversation

elacuesta (Member) commented Sep 29, 2020

🚧 Work in progress 🚧

A first approach to solving #4717. The idea is to process cookies from the header only once; the actual mechanism can vary (meta key, request attribute, etc.). I know it's not much, but I'd like your opinions @Gallaecio @kmike @wRAR.
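
For illustration only, here is one possible shape of the meta-key variant. This is not the diff in this PR; the middleware name, the "header_cookies_seen" flag and the plain dict standing in for a real cookiejar are all made up, and domain/path handling is omitted:

# Purely illustrative sketch of the "meta key" idea, not the actual patch.
# The Cookie header is merged into shared state only on the first pass;
# redirected/retried copies of the request inherit request.meta, skip the
# merge, and get the (possibly fresher) stored cookies instead.
from http.cookies import SimpleCookie


class ProcessHeaderCookiesOnce:
    def __init__(self):
        self.cookies = {}  # simplified stand-in for a real CookieJar

    def process_request(self, request, spider):
        header = request.headers.get("Cookie")
        if header and not request.meta.get("header_cookies_seen"):
            # First time only: take the user-supplied Cookie header into account.
            parsed = SimpleCookie()
            parsed.load(header.decode("latin-1"))
            self.cookies.update({key: morsel.value for key, morsel in parsed.items()})
            request.meta["header_cookies_seen"] = True
        if self.cookies:
            request.headers["Cookie"] = "; ".join(
                f"{key}={value}" for key, value in self.cookies.items()
            )

    def process_response(self, request, response, spider):
        # Keep the stored cookies up to date with Set-Cookie response headers.
        for value in response.headers.getlist("Set-Cookie"):
            parsed = SimpleCookie()
            parsed.load(value.decode("latin-1"))
            self.cookies.update({key: morsel.value for key, morsel in parsed.items()})
        return response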

The following spider works with this patch, while it breaks with the current master (5a38639); I'm working on some tests to reproduce this behaviour.

import scrapy

class LoginSpider(scrapy.Spider):
    name = "login"
    start_urls = ["http://quotes.toscrape.com/login"]
    custom_settings = {"COOKIES_DEBUG": True}

    def parse(self, response):
        csrf_token = response.xpath("//input[@name='csrf_token']/@value").get()
        return scrapy.FormRequest.from_response(
            response=response,
            formxpath="//form",
            formdata={
                "csrf_token": csrf_token,
                "username": "admin",
                "password": "admin"
            },
            callback=self.after_login,
        )

    def after_login(self, response):
        logout = response.xpath("//a[text()='Logout']/text()").get()
        assert logout is not None

Fixes #1992

Gallaecio (Member)

What if a retry (instead of a redirect) needs to happen because the server sent e.g. a 503 response? Wouldn't the retry miss some required cookies?

codecov bot commented Sep 30, 2020

Codecov Report

Merging #4812 into master will increase coverage by 0.02%.
The diff coverage is 100.00%.

@@            Coverage Diff             @@
##           master    #4812      +/-   ##
==========================================
+ Coverage   87.52%   87.54%   +0.02%     
==========================================
  Files         160      160              
  Lines        9761     9762       +1     
  Branches     1440     1440              
==========================================
+ Hits         8543     8546       +3     
+ Misses        955      954       -1     
+ Partials      263      262       -1     
Impacted Files Coverage Δ
scrapy/downloadermiddlewares/cookies.py 95.78% <100.00%> (+0.04%) ⬆️
scrapy/core/downloader/__init__.py 92.48% <0.00%> (+1.50%) ⬆️

elacuesta (Member, Author)

What if a retry (instead of a redirect) needs to happen because the server sent e.g. a 503 response? Wouldn't the retry miss some required cookies?

I don't think it would; once the header cookies are processed, they are stored in the corresponding cookiejar and added to the appropriate requests. It would be an interesting test case though.
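
To illustrate the claim (this is not code from this PR, just a minimal sketch that drives scrapy.http.cookies.CookieJar directly; the URL and cookie values are made up):

# Once a response's cookies are extracted into the jar, any later request to
# the same site, such as a retried copy, gets them back from the jar.
from scrapy.http import Request, Response
from scrapy.http.cookies import CookieJar

jar = CookieJar()
original = Request("http://quotes.toscrape.com/login")
response = Response(
    "http://quotes.toscrape.com/login",
    headers={"Set-Cookie": "session=abc123; Path=/"},
    request=original,
)
jar.extract_cookies(response, original)  # store the cookie in the jar

retry = Request("http://quotes.toscrape.com/login")  # e.g. a retried request
jar.add_cookie_header(retry)  # the jar re-attaches the stored cookie
print(retry.headers.get("Cookie"))  # b'session=abc123'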

GeorgeA92 (Contributor)

What if a retry (instead of a redirect) needs to happen because the server sent e.g. a 503 response? Wouldn't the retry miss some required cookies?

I think that this and other similar cases (like the previously mentioned 302s) need to be covered by tests first (maybe implemented in a separate pull request).

I think it should be a regular test case with a mockserver with defined endpoints that return (a rough sketch follows this list):

  1. 302s with a Set-Cookie header in the response that trigger the redirect middleware
  2. 503s with the same header that trigger the retry middleware
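
Something along these lines could be a starting point for the 503 case. This is only a rough standalone sketch, not Scrapy's tests.mockserver: the port, the spider, the "retry=1" cookie and the assertion are all made up for illustration:

# A threaded HTTP server answers the first request with 503 + Set-Cookie and
# records whether the retried request sends the cookie back.
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer

import scrapy
from scrapy.crawler import CrawlerProcess

received_cookies = []

class Handler(BaseHTTPRequestHandler):
    hits = 0

    def do_GET(self):
        received_cookies.append(self.headers.get("Cookie"))
        if Handler.hits == 0:
            Handler.hits += 1
            self.send_response(503)
            self.send_header("Set-Cookie", "retry=1; Path=/")
            self.end_headers()
        else:
            self.send_response(200)
            self.end_headers()
            self.wfile.write(b"ok")

    def log_message(self, *args):
        pass  # keep the test output quiet

class RetrySpider(scrapy.Spider):
    name = "retry-cookies"
    custom_settings = {"RETRY_TIMES": 1, "COOKIES_DEBUG": True}
    start_urls = ["http://127.0.0.1:8899/"]

    def parse(self, response):
        pass

if __name__ == "__main__":
    server = HTTPServer(("127.0.0.1", 8899), Handler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    process = CrawlerProcess()
    process.crawl(RetrySpider)
    process.start()
    server.shutdown()
    # If response cookies end up in the jar before the retry happens, the
    # retried request should carry the cookie set by the 503 response.
    assert received_cookies[-1] == "retry=1", received_cookies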

GeorgeA92 mentioned this pull request on Nov 10, 2023
GeorgeA92 (Contributor)

I am not able to reproduce this issue on recent master.

script.py
from scrapy import FormRequest
import scrapy
from scrapy.crawler import CrawlerProcess

class LoginSpider(scrapy.Spider):
    name = 'login'
    # allowed_domains = ['quotes.toscrape.com']

    start_urls = ['http://quotes.toscrape.com/login']

    def parse(self, response):
        csrf_token = response.xpath('//input[@name="csrf_token"]/@value').get()
        return FormRequest.from_response(
            response,
            # formname='',
            # formcss='',
            formxpath='//form',
            formdata={
                'csrf_token': csrf_token,
                'username': 'admin',
                'password': 'admin'
            },
            callback=self.after_login
        )

    def after_login(self, response):
        logout = response.xpath("//a[text()='Logout']/text()").get()
        self.logger.info(logout)



if __name__ == "__main__":
    process = CrawlerProcess()
    process.crawl(LoginSpider)
    process.start()
log_output
2024-01-19 12:38:02 [scrapy.utils.log] INFO: Scrapy 2.10.1 started (bot: scrapybot)
2024-01-19 12:38:02 [scrapy.utils.log] INFO: Versions: lxml 4.9.1.0, libxml2 2.9.14, cssselect 1.2.0, parsel 1.6.0, w3lib 1.21.0, Twisted 22.10.0, Python 3.10.8 | packaged by conda-forge | (main, Nov 24 2022, 14:07:00) [MSC v.1916 64 bit (AMD64)], pyOpenSSL 23.0.0 (OpenSSL 1.1.1w  11 Sep 2023), cryptography 39.0.1, Platform Windows-10-10.0.22621-SP0
2024-01-19 12:38:02 [scrapy.addons] INFO: Enabled addons:
[]
2024-01-19 12:38:02 [scrapy.utils.log] DEBUG: Using reactor: twisted.internet.selectreactor.SelectReactor

2024-01-19 12:38:02 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
 'scrapy.extensions.telnet.TelnetConsole',
 'scrapy.extensions.logstats.LogStats']
2024-01-19 12:38:02 [scrapy.crawler] INFO: Overridden settings:
{}
2024-01-19 12:38:02 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
 'scrapy.downloadermiddlewares.retry.RetryMiddleware',
 'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
 'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
 'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
 'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
 'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
 'scrapy.downloadermiddlewares.stats.DownloaderStats']
2024-01-19 12:38:02 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
 'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
 'scrapy.spidermiddlewares.referer.RefererMiddleware',
 'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
 'scrapy.spidermiddlewares.depth.DepthMiddleware']
2024-01-19 12:38:02 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2024-01-19 12:38:02 [scrapy.core.engine] INFO: Spider opened
2024-01-19 12:38:02 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2024-01-19 12:38:02 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2024-01-19 12:38:03 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://quotes.toscrape.com/login> (referer: None)
2024-01-19 12:38:03 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (302) to <GET http://quotes.toscrape.com/> from <POST http://quotes.toscrape.com/login>
2024-01-19 12:38:03 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://quotes.toscrape.com/> (referer: http://quotes.toscrape.com/login)
2024-01-19 12:38:03 [login] INFO: Logout
2024-01-19 12:38:03 [scrapy.core.engine] INFO: Closing spider (finished)
2024-01-19 12:38:03 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 1227,
 'downloader/request_count': 3,
 'downloader/request_method_count/GET': 2,
 'downloader/request_method_count/POST': 1,
 'downloader/response_bytes': 3638,
 'downloader/response_count': 3,
 'downloader/response_status_count/200': 2,
 'downloader/response_status_count/302': 1,
 'elapsed_time_seconds': 0.815075,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2024, 1, 19, 10, 38, 3, 562096, tzinfo=datetime.timezone.utc),
 'httpcompression/response_bytes': 13788,
 'httpcompression/response_count': 2,
 'log_count/DEBUG': 4,
 'log_count/INFO': 11,
 'log_count/WARNING': 1,
 'request_depth_max': 1,
 'response_received_count': 2,
 'scheduler/dequeued': 3,
 'scheduler/dequeued/memory': 3,
 'scheduler/enqueued': 3,
 'scheduler/enqueued/memory': 3,
 'start_time': datetime.datetime(2024, 1, 19, 10, 38, 2, 747021, tzinfo=datetime.timezone.utc)}
2024-01-19 12:38:03 [scrapy.core.engine] INFO: Spider closed (finished)

From this test, and from the test proposed in [WIP] #6141, I don't see the issue with 302s mentioned in #4717 (comment). The proposed tests rely on being able to look into the CookieJars stored in the CookiesMiddleware instance, which is not trivial since we don't have an implementation of #1878.
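
Until #1878 provides a public way to reach the cookie jars, a test could go through internal attributes. A hedged sketch (the attribute chain below is private API and may change between Scrapy versions):

# Fetch the CookiesMiddleware instance from a running crawler and return one
# of its jars; cookiejar_key matches the "cookiejar" request.meta key.
from scrapy.downloadermiddlewares.cookies import CookiesMiddleware

def get_cookiejar(crawler, cookiejar_key=None):
    for middleware in crawler.engine.downloader.middleware.middlewares:
        if isinstance(middleware, CookiesMiddleware):
            return middleware.jars[cookiejar_key]
    raise AssertionError("CookiesMiddleware is not enabled")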

GeorgeA92 (Contributor)

@Gallaecio
As far as I can see, it was initially caused by this change in /pull/2400, and it got fixed later as a result of e865c44#diff-dd2ebb59f519dd24355f1249d4f00a3ebc30f7dfaf39828a7bd461a144ca817cR58-R59 (related to GHSA-mfjm-vh54-3f96).

Gallaecio (Member) commented Jan 19, 2024

I believe the code example in the PR description is not what needs to be fixed at the moment.

I believe this was the timeline:

  1. Cookies from the Cookie request header are not processed #1992 was reported.
  2. [MRG+1] CookiesMiddleware: keep cookies from 'Cookie' request header, fix encoding #2400 fixed it.
  3. Request cookies not updated with values from previous response #4717 reported that [MRG+1] CookiesMiddleware: keep cookies from 'Cookie' request header, fix encoding #2400 introduced a regression, which the code in the description of this PR reproduces.
  4. This PR was created to work on a proper fix.
  5. Do not process cookies from headers #4823 reverted [MRG+1] CookiesMiddleware: keep cookies from 'Cookie' request header, fix encoding #2400 to remove the regression (and unfix Cookies from the Cookie request header are not processed #1992) until a proper fix could be done as part of this PR.

Due to #4823, the actual issue in the main branch now is not #4717, but #1992 (and fixing it in a way that does not cause #4717 again).
