Cookiejars exposed #6218

GeorgeA92 · 2024-02-09T12:16:39Z

Aims to fix #1878, based on the suggestion in #1878 (comment).

codecov · 2024-02-09T12:18:31Z

Codecov Report

Merging #6218 (b8f8960) into master (6f73dc0) will increase coverage by 0.18%.
Report is 88 commits behind head on master.
The diff coverage is 100.00%.

Additional details and impacted files

@@            Coverage Diff             @@
##           master    #6218      +/-   ##
==========================================
+ Coverage   88.48%   88.67%   +0.18%     
==========================================
  Files         160      161       +1     
  Lines       11607    11792     +185     
  Branches     1883     1912      +29     
==========================================
+ Hits        10271    10457     +186     
+ Misses       1009     1007       -2     
- Partials      327      328       +1

Files	Coverage Δ
scrapy/downloadermiddlewares/cookies.py	`96.33% <100.00%> (+0.13%)`	⬆️

... and 14 files with indirect coverage changes

wRAR · 2024-02-09T12:36:27Z

I haven't read the original ticket recently, but why is this feature optional?

scrapy/downloadermiddlewares/cookies.py

GeorgeA92 · 2024-02-23T10:28:20Z

This is how it works in the current version of the PR:

script_sample.py

import scrapy
from scrapy.crawler import CrawlerProcess


class Quotes(scrapy.Spider):
    name = "quotes"
    custom_settings = {"DOWNLOAD_DELAY": 1}

    def start_requests(self):
        yield scrapy.Request(url="https://quotes.toscrape.com/login", callback=self.login)

    def login(self, response):
        # cookie_jars is the spider attribute added by this PR; None is the default jar key
        jar = self.cookie_jars[None]
        self.logger.info(jar)  # scrapy.http.cookies.CookieJar object
        self.logger.info(jar.jar)  # http.cookiejar.CookieJar object

        # Upper-case the session cookie in place (goes through the private _cookies mapping)
        session_cookie = jar._cookies["quotes.toscrape.com"]["/"].get("session")
        session_cookie.value = session_cookie.value.upper()
        self.logger.info(jar.jar)


if __name__ == "__main__":
    process = CrawlerProcess()
    process.crawl(Quotes)
    process.start()

log_output (fragment)

2024-02-23 10:51:27 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://quotes.toscrape.com/login> (referer: None)
2024-02-23 10:51:27 [quotes] INFO: <scrapy.http.cookies.CookieJar object at 0x00000217DB719B40>
2024-02-23 10:51:27 [quotes] INFO: <CookieJar[<Cookie session=eyJjc3JmX3Rva2VuIjoiSnFQQU9GTGt1amZzZ3J3UVdHeGV6WFR2UnBpY0Job1NWS3liWmxhblVISXROREVtQ2RZTSJ9.Zdhqng.8uQzjuvDfOcNJHV7luY5Na6C1N0 for quotes.toscrape.com/>]>
2024-02-23 10:51:27 [quotes] INFO: <CookieJar[<Cookie session=EYJJC3JMX3RVA2VUIJOISNFQQU9GTGT1AMZZZ3J3UVDHEGV6WFR2UNBPY0JOB1NWS3LIWMXHBLVISXROREVTQ2RZTSJ9.ZDHQNG.8UQZJUVDFOCNJHV7LUY5NA6C1N0 for quotes.toscrape.com/>]>
2024-02-23 10:51:27 [scrapy.core.engine] INFO: Closing spider (finished)

Gallaecio · 2024-02-23T17:17:01Z

I'm slightly hesitant about setting a spider attribute from a middleware, and I wonder whether it should be set from a different place or on a different object (e.g. the crawler), but in general I'm fine with the approach.

@kmike Any thoughts on the general approach? Should @GeorgeA92 go on with tests and docs?

kmike · 2024-03-17T17:22:24Z

Hey! My main worry is the obscure API, which we'd need to document & support in the future. It'd require good documentation to explain a line like

self.cookie_jars[None]._cookies["quotes.toscrape.com"]["/"].get("session")

It also needs access to a private property (._cookies).
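For comparison, a minimal sketch of reading and mutating a cookie without touching `._cookies`, relying only on the fact that the standard library's `http.cookiejar.CookieJar` is iterable. It uses a plain stdlib jar rather than a running Scrapy crawl; `make_cookie` and `find_cookie` are hypothetical helper names introduced for illustration, and the cookie value is made up.

```python
from http.cookiejar import Cookie, CookieJar


def make_cookie(name, value, domain, path="/"):
    """Build a minimal session cookie for the demo (hypothetical helper)."""
    return Cookie(
        version=0, name=name, value=value,
        port=None, port_specified=False,
        domain=domain, domain_specified=True, domain_initial_dot=False,
        path=path, path_specified=True,
        secure=False, expires=None, discard=True,
        comment=None, comment_url=None, rest={},
    )


def find_cookie(jar, domain, path, name):
    """Return the first matching cookie via public iteration, or None."""
    for cookie in jar:  # iterating a CookieJar yields its Cookie objects
        if (cookie.domain, cookie.path, cookie.name) == (domain, path, name):
            return cookie
    return None


jar = CookieJar()
jar.set_cookie(make_cookie("session", "abc123", "quotes.toscrape.com"))

session = find_cookie(jar, "quotes.toscrape.com", "/", "session")
session.value = session.value.upper()  # the jar holds this same object
```

With the API proposed in this PR, the same loop could presumably run over `self.cookie_jars[None].jar`, since that attribute is an `http.cookiejar.CookieJar`.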

GeorgeA92 · 2024-04-12T16:36:21Z

> Hey! My main worry is the obscure API, which we'd need to document & support in the future. It'd require good documentation to explain a line like
>
> self.cookie_jars[None]._cookies["quotes.toscrape.com"]["/"].get("session")
>
> It also needs access to a private property (._cookies).

Another option is to update the scrapy.http.cookies.CookieJar class to add a more convenient way to interact with the cookie jar:

scrapy/scrapy/http/cookies.py

Lines 18 to 30 in 1d11ea3

    
class CookieJar:
    def __init__(self, policy=None, check_expired_frequency=10000):
        self.policy = policy or DefaultCookiePolicy()
        self.jar = _CookieJar(self.policy)
        self.jar._cookies_lock = _DummyLock()
        self.check_expired_frequency = check_expired_frequency
        self.processed = 0

    def extract_cookies(self, response, request):
        wreq = WrappedRequest(request)
        wrsp = WrappedResponse(response)
        return self.jar.extract_cookies(wrsp, wreq)
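One way such convenience methods might look, as a rough sketch only. `CookieJarSketch` is an illustrative stand-in and not Scrapy's class; only its `jar` attribute mirrors the excerpt above. The method names `get_cookie` and `delete_cookie`, the `_demo_cookie` helper, and the cookie value are all assumptions made for this example. Everything is built on public `http.cookiejar` calls (iteration and `clear(domain, path, name)`).

```python
from http.cookiejar import Cookie, CookieJar as _CookieJar


class CookieJarSketch:
    """Illustrative stand-in: wraps a stdlib jar in `self.jar`, as in the excerpt."""

    def __init__(self, policy=None):
        self.jar = _CookieJar(policy)

    def get_cookie(self, domain, path, name):  # hypothetical method name
        """Return the matching Cookie, or None, using public iteration."""
        for cookie in self.jar:
            if (cookie.domain, cookie.path, cookie.name) == (domain, path, name):
                return cookie
        return None

    def delete_cookie(self, domain, path, name):  # hypothetical method name
        """Remove one cookie; return False instead of raising if it is absent."""
        try:
            self.jar.clear(domain, path, name)  # public stdlib call
        except KeyError:
            return False
        return True


def _demo_cookie(name, value, domain, path="/"):
    """Build a minimal session cookie for the demo (hypothetical helper)."""
    return Cookie(
        version=0, name=name, value=value,
        port=None, port_specified=False,
        domain=domain, domain_specified=True, domain_initial_dot=False,
        path=path, path_specified=True,
        secure=False, expires=None, discard=True,
        comment=None, comment_url=None, rest={},
    )


wrapper = CookieJarSketch()
wrapper.jar.set_cookie(_demo_cookie("session", "abc123", "quotes.toscrape.com"))
found = wrapper.get_cookie("quotes.toscrape.com", "/", "session")
```

Methods along these lines would let a spider work with named cookies without reaching into `._cookies`, which is the concern raised earlier in the thread.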

cookiejars exposed

fe2979e

Gallaecio reviewed Feb 9, 2024

cookiejars exposed

b8f8960

Gallaecio requested a review from kmike February 23, 2024 17:14

Cj-Malone mentioned this pull request Mar 21, 2024

Add Anti-Bot Detection middleware alltheplaces/alltheplaces#7349

Open
