Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

selenium script written in _init_ function not executing #120

Open
MADDY312 opened this issue Jul 1, 2022 · 1 comment
Open

selenium script written in _init_ function not executing #120

MADDY312 opened this issue Jul 1, 2022 · 1 comment

Comments

@MADDY312
Copy link

MADDY312 commented Jul 1, 2022

code :

import scrapy
from scrapy.selector import Selector
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from shutil import which

class test_2(scrapy.Spider):
    name='test_2'
    #allowed_domains=[]
    start_urls=[
        'https://www.jackjones.in/st-search?q=shoes'
        ]
    
    def _init_(self):
        print("test-1")
        chrome_options=Options()
        chrome_options.add_argument("--headless")
        
        driver=webdriver.Chrome("C:/chromedriver")
        driver.set_window(1920,1080)
        driver.get("https://www.jackjones.in/st-search?q=shoes")
        tab=driver.find_elements_by_class_name("st-single-product")
        tab[4].click()

        self.html=driver.page_source
        print("test-2")
        driver.close()
    
    def parse(self, response):
        print("test-3")
        resp=Selector(text=self.html)
        yield{
            'title':resp.xpath("//h1/text()").get()
        }


It appears that compiler does not execute the init function before going to parse function, because neither of the print statements are getting executed but the print statement in parse function is present in the output
How to fix this?

Output:

PS C:\Users\Vasu\summer\scrapy_selenium> scrapy crawl test_2
2022-07-01 13:18:30 [scrapy.utils.log] INFO: Scrapy 2.6.1 started (bot: scrapy_selenium)
2022-07-01 13:18:30 [scrapy.utils.log] INFO: Versions: lxml 4.9.0.0, libxml2 2.9.14, cssselect 1.1.0, parsel 1.6.0, 
w3lib 1.22.0, Twisted 22.4.0, Python 3.8.13 (default, Mar 28 2022, 06:59:08) [MSC v.1916 64 bit (AMD64)], pyOpenSSL 
22.0.0 (OpenSSL 1.1.1p  21 Jun 2022), cryptography 37.0.1, Platform Windows-10-10.0.19044-SP0
2022-07-01 13:18:30 [scrapy.crawler] INFO: Overridden settings:
{'BOT_NAME': 'scrapy_selenium',
 'NEWSPIDER_MODULE': 'scrapy_selenium.spiders',
 'SPIDER_MODULES': ['scrapy_selenium.spiders']}
2022-07-01 13:18:30 [scrapy.utils.log] DEBUG: Using reactor: twisted.internet.selectreactor.SelectReactor
2022-07-01 13:18:30 [scrapy.extensions.telnet] INFO: Telnet Password: 168b57499cd07735
2022-07-01 13:18:30 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
 'scrapy.extensions.telnet.TelnetConsole',
 'scrapy.extensions.logstats.LogStats']
2022-07-01 13:18:31 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
 'scrapy.downloadermiddlewares.retry.RetryMiddleware',
 'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
 'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
 'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
 'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
 'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
 'scrapy.downloadermiddlewares.stats.DownloaderStats']
2022-07-01 13:18:31 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
 'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
 'scrapy.spidermiddlewares.referer.RefererMiddleware',
 'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
 'scrapy.spidermiddlewares.depth.DepthMiddleware']
2022-07-01 13:18:31 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2022-07-01 13:18:31 [scrapy.core.engine] INFO: Spider opened
2022-07-01 13:18:31 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2022-07-01 13:18:31 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2022-07-01 13:18:31 [filelock] DEBUG: Attempting to acquire lock 1385261511056 on C:\Users\Vasu\anaconda3\envs\sca_sel\lib\site-packages\tldextract\.suffix_cache/publicsuffix.org-tlds\de84b5ca2167d4c83e38fb162f2e8738.tldextract.json.lock
2022-07-01 13:18:31 [filelock] DEBUG: Lock 1385261511056 acquired on C:\Users\Vasu\anaconda3\envs\sca_sel\lib\site-packages\tldextract\.suffix_cache/publicsuffix.org-tlds\de84b5ca2167d4c83e38fb162f2e8738.tldextract.json.lock        
2022-07-01 13:18:32 [filelock] DEBUG: Attempting to release lock 1385261511056 on C:\Users\Vasu\anaconda3\envs\sca_sel\lib\site-packages\tldextract\.suffix_cache/publicsuffix.org-tlds\de84b5ca2167d4c83e38fb162f2e8738.tldextract.json.lock
2022-07-01 13:18:32 [filelock] DEBUG: Lock 1385261511056 released on C:\Users\Vasu\anaconda3\envs\sca_sel\lib\site-packages\tldextract\.suffix_cache/publicsuffix.org-tlds\de84b5ca2167d4c83e38fb162f2e8738.tldextract.json.lock        
2022-07-01 13:18:32 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.jackjones.in/st-search?q=shoes> (referer: None)
test-3
2022-07-01 13:18:32 [scrapy.core.scraper] ERROR: Spider error processing <GET https://www.jackjones.in/st-search?q=shoes> (referer: None)
Traceback (most recent call last):
  File "C:\Users\Vasu\anaconda3\envs\sca_sel\lib\site-packages\scrapy\utils\defer.py", line 132, in iter_errback    
    yield next(it)
  File "C:\Users\Vasu\anaconda3\envs\sca_sel\lib\site-packages\scrapy\utils\python.py", line 354, in __next__       
    return next(self.data)
  File "C:\Users\Vasu\anaconda3\envs\sca_sel\lib\site-packages\scrapy\utils\python.py", line 354, in __next__       
    return next(self.data)
  File "C:\Users\Vasu\anaconda3\envs\sca_sel\lib\site-packages\scrapy\core\spidermw.py", line 66, in _evaluate_iterable
    for r in iterable:
  File "C:\Users\Vasu\anaconda3\envs\sca_sel\lib\site-packages\scrapy\spidermiddlewares\offsite.py", line 29, in process_spider_output
    for x in result:
  File "C:\Users\Vasu\anaconda3\envs\sca_sel\lib\site-packages\scrapy\core\spidermw.py", line 66, in _evaluate_iterable
    for r in iterable:
  File "C:\Users\Vasu\anaconda3\envs\sca_sel\lib\site-packages\scrapy\spidermiddlewares\referer.py", line 342, in <genexpr>
    return (_set_referer(r) for r in result or ())
  File "C:\Users\Vasu\anaconda3\envs\sca_sel\lib\site-packages\scrapy\core\spidermw.py", line 66, in _evaluate_iterable
    for r in iterable:
  File "C:\Users\Vasu\anaconda3\envs\sca_sel\lib\site-packages\scrapy\spidermiddlewares\urllength.py", line 40, in <genexpr>
    return (r for r in result or () if _filter(r))
  File "C:\Users\Vasu\anaconda3\envs\sca_sel\lib\site-packages\scrapy\core\spidermw.py", line 66, in _evaluate_iterable
    for r in iterable:
  File "C:\Users\Vasu\anaconda3\envs\sca_sel\lib\site-packages\scrapy\spidermiddlewares\depth.py", line 58, in <genexpr>
    return (r for r in result or () if _filter(r))
  File "C:\Users\Vasu\anaconda3\envs\sca_sel\lib\site-packages\scrapy\core\spidermw.py", line 66, in _evaluate_iterable
    for r in iterable:
  File "C:\Users\Vasu\summer\scrapy_selenium\scrapy_selenium\spiders\test_2.py", line 31, in parse
    resp=Selector(text=self.html)
AttributeError: 'test_2' object has no attribute 'html'
2022-07-01 13:18:32 [scrapy.core.engine] INFO: Closing spider (finished)
2022-07-01 13:18:32 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 237,
 'downloader/request_count': 1,
 'downloader/request_method_count/GET': 1,
 'downloader/response_bytes': 20430,
 'downloader/response_count': 1,
 'downloader/response_status_count/200': 1,
 'elapsed_time_seconds': 0.613799,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2022, 7, 1, 7, 48, 32, 202155),
 'httpcompression/response_bytes': 87151,
 'httpcompression/response_count': 1,
 'log_count/DEBUG': 6,
 'log_count/ERROR': 1,
 'log_count/INFO': 10,
 'response_received_count': 1,
 'scheduler/dequeued': 1,
 'scheduler/dequeued/memory': 1,
 'scheduler/enqueued': 1,
 'scheduler/enqueued/memory': 1,
 'spider_exceptions/AttributeError': 1,
 'start_time': datetime.datetime(2022, 7, 1, 7, 48, 31, 588356)}
2022-07-01 13:18:32 [scrapy.core.engine] INFO: Spider closed (finished)
@DResthal
Copy link

Hey bud, I know this is pretty old in the ways of development but I thought I'd answer your question.

To start things off, you have an issue with your init method using only a single underscore (init) instead of two (init).

Next, you're not utilizing Scrapy-Selenium here. You're directly calling Selenium manually without the help of this package. This package let's you setup and use Selenium through a DownloaderMiddleware , which is where we want this code to be, not directly in our Spider class.

Third, I'm fairly certain the maintainer of this project either isn't around or has dropped the project. I've forked it to bring it up to date and essentially become the new maintainer, as I found a use for this package myself.

p.s. Owner, if you're reading this, please let us know if this project is still being maintained or if you've moved on :)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants