
[BUG] asyncio.exceptions.InvalidStateError: invalid state thrown by exit in async context manager #2238

Open
pjsg opened this issue Jan 8, 2024 · 7 comments


@pjsg

pjsg commented Jan 8, 2024

System info

  • Playwright Version: v1.40
  • Operating System: macOS 14.2.1
  • Browser: Chromium
  • Other info:

Source code

from playwright.async_api import async_playwright
import asyncio

async def doit(url):
    print(f"Processing {url}")
    try:
        async with async_playwright() as p:

                browser_type = p.chromium

                browser = await browser_type.launch(
                    headless=True,
                )

                page = await browser.new_page(
                    bypass_csp=True,
                    ignore_https_errors=True,
                )

                res = await page.goto(url, wait_until="load", timeout=30 * 1000)

                await page.wait_for_load_state(state="networkidle")
                await browser.close()

    except Exception as e:
        print(f"Got exception {e}")
        raise e

asyncio.run(doit("https://www.streetinsider.com/Press+Releases/Radius+Recycling+Reports+First+Quarter+Fiscal+2024+Financial+Results/22593061.html"))

Steps

  • Save the code above and run it. I'm using Python 3.10.7.

Expected

It should complete without error.

Actual

  • It throws an InvalidStateError. If it happens to succeed, run it a couple more times; it nearly always fails for me.
Processing https://www.streetinsider.com/Press+Releases/Radius+Recycling+Reports+First+Quarter+Fiscal+2024+Financial+Results/22593061.html
Got exception invalid state
Traceback (most recent call last):
  File "/Users/philip/play-dir/playtest.py", line 22, in doit
    await page.wait_for_load_state(state="networkidle")
  File "/Users/philip/.pyenv/versions/play-dir/lib/python3.10/site-packages/playwright/async_api/_generated.py", line 9367, in wait_for_load_state
    await self._impl_obj.wait_for_load_state(state=state, timeout=timeout)
  File "/Users/philip/.pyenv/versions/play-dir/lib/python3.10/site-packages/playwright/_impl/_page.py", line 491, in wait_for_load_state
    return await self._main_frame.wait_for_load_state(**locals_to_params(locals()))
  File "/Users/philip/.pyenv/versions/play-dir/lib/python3.10/site-packages/playwright/_impl/_frame.py", line 237, in wait_for_load_state
    return await self._wait_for_load_state_impl(state, timeout)
  File "/Users/philip/.pyenv/versions/play-dir/lib/python3.10/site-packages/playwright/_impl/_frame.py", line 265, in _wait_for_load_state_impl
    await waiter.result()
playwright._impl._errors.TimeoutError: Timeout 30000ms exceeded.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/Users/philip/play-dir/playtest.py", line 29, in <module>
    asyncio.run(doit("https://www.streetinsider.com/Press+Releases/Radius+Recycling+Reports+First+Quarter+Fiscal+2024+Financial+Results/22593061.html"))
  File "/Users/philip/.pyenv/versions/3.10.7/lib/python3.10/asyncio/runners.py", line 44, in run
    return loop.run_until_complete(main)
  File "/Users/philip/.pyenv/versions/3.10.7/lib/python3.10/asyncio/base_events.py", line 646, in run_until_complete
    return future.result()
  File "/Users/philip/play-dir/playtest.py", line 27, in doit
    raise e
  File "/Users/philip/play-dir/playtest.py", line 7, in doit
    async with async_playwright() as p:
  File "/Users/philip/.pyenv/versions/play-dir/lib/python3.10/site-packages/playwright/async_api/_context_manager.py", line 58, in __aexit__ 
    await self._connection.stop_async()
  File "/Users/philip/.pyenv/versions/play-dir/lib/python3.10/site-packages/playwright/_impl/_connection.py", line 288, in stop_async
    self.cleanup()
  File "/Users/philip/.pyenv/versions/play-dir/lib/python3.10/site-packages/playwright/_impl/_connection.py", line 299, in cleanup
    callback.future.set_exception(self._closed_error)
asyncio.exceptions.InvalidStateError: invalid state
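
(For reference: this last frame is reproducible with plain asyncio, independent of Playwright. set_exception() raises InvalidStateError whenever the target future is already done or cancelled, which appears to be the state the pending callback future is in after the TimeoutError above. A minimal sketch:)

import asyncio

async def main() -> None:
    fut = asyncio.get_running_loop().create_future()
    fut.cancel()  # stand-in for a protocol callback cancelled after a timeout
    # The future is already done, so this raises
    # asyncio.exceptions.InvalidStateError -- the same failure mode as
    # callback.future.set_exception(self._closed_error) in Connection.cleanup().
    fut.set_exception(RuntimeError("connection closed"))

asyncio.run(main())
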
@dgozman
Contributor

dgozman commented Jan 12, 2024

I was able to repro in 1 out of 5 runs. However, I was not able to repro with the following snippet. Not yet sure what's going on.

from playwright.async_api import async_playwright
import asyncio

async def doit(url):
    print(f"Processing {url}")

    async with async_playwright() as p:
        browser_type = p.chromium
        browser = await browser_type.launch(
            headless=True,
        )

        try:
            page = await browser.new_page(
                bypass_csp=True,
                ignore_https_errors=True,
            )
            res = await page.goto(url, wait_until="load", timeout=30 * 1000)
            await page.wait_for_load_state(state="networkidle")
        except Exception as e:
            print(f"Got exception {e}")
            raise e
        finally:
            await browser.close()

asyncio.run(doit("https://www.streetinsider.com/Press+Releases/Radius+Recycling+Reports+First+Quarter+Fiscal+2024+Financial+Results/22593061.html"))

@dgozman dgozman transferred this issue from microsoft/playwright Jan 12, 2024
@pjsg
Author

pjsg commented Jan 12, 2024

It appears that browser.close() is the key difference. In @dgozman's example it is executed, whereas in my example it is not (the exception has already been thrown by then). That said, if you don't call close(), other URLs such as https://cnn.com/ throw a different exception.
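
A condensed sketch of that difference (essentially the snippet above, with the except narrowed to Playwright's TimeoutError, which playwright.async_api exports, so close() always runs while unrelated failures still propagate):

from playwright.async_api import async_playwright, TimeoutError as PlaywrightTimeoutError
import asyncio

async def doit(url):
    async with async_playwright() as p:
        browser = await p.chromium.launch(headless=True)
        try:
            page = await browser.new_page(bypass_csp=True, ignore_https_errors=True)
            await page.goto(url, wait_until="load", timeout=30 * 1000)
            await page.wait_for_load_state(state="networkidle")
        except PlaywrightTimeoutError as e:
            print(f"Timed out: {e}")
        finally:
            # Runs even when goto()/wait_for_load_state() time out, so the
            # connection shuts down cleanly before async_playwright's __aexit__.
            await browser.close()

asyncio.run(doit("https://example.com"))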

@mxschmitt
Member

I'm unfortunately not able to reproduce it. I tried running it 10 times on macOS with Python 3.10 and Python 3.12.

@mxschmitt
Member

Closing for now since we can't reproduce it.

@danphenderson
Contributor

danphenderson commented Feb 26, 2024

I don't think this should be closed. I can reproduce the error. Whenever there is a timeout error, it appears that the event loop is closing, resulting in an InvalidStateError.

In [3]: from playwright.async_api import async_playwright
   ...: import asyncio
   ...:
   ...: async def doit(url):
   ...:     print(f"Processing {url}")
   ...:     try:
   ...:         async with async_playwright() as p:
   ...:
   ...:                 browser_type = p.chromium
   ...:
   ...:                 browser = await browser_type.launch(
   ...:                     headless=True,
   ...:                 )
   ...:
   ...:                 page = await browser.new_page(
   ...:                     bypass_csp=True,
   ...:                     ignore_https_errors=True,
   ...:                 )
   ...:
   ...:                 res = await page.goto(url, wait_until="load", timeout=30 * 1000)
   ...:
   ...:                 await page.wait_for_load_state(state="networkidle")
   ...:                 await browser.close()
   ...:
   ...:     except Exception as e:
   ...:         print(f"Got exception {e}")
   ...:         raise e
   ...:
   ...: asyncio.run(doit("https://www.streetinsider.com/Press+Releases/Radius+Recycling+Reports+First+Quarter+Fiscal+2024+Financial+Results/22593061.html"))
Processing https://www.streetinsider.com/Press+Releases/Radius+Recycling+Reports+First+Quarter+Fiscal+2024+Financial+Results/22593061.html
Got exception Timeout 30000ms exceeded.
---------------------------------------------------------------------------
TimeoutError                              Traceback (most recent call last)
Cell In[3], line 29
     26         print(f"Got exception {e}")
     27         raise e
---> 29 asyncio.run(doit("https://www.streetinsider.com/Press+Releases/Radius+Recycling+Reports+First+Quarter+Fiscal+2024+Financial+Results/22593061.html"))

File ~/.pyenv/versions/3.10.6/lib/python3.10/asyncio/runners.py:44, in run(main, debug)
     42     if debug is not None:
     43         loop.set_debug(debug)
---> 44     return loop.run_until_complete(main)
     45 finally:
     46     try:

File ~/.pyenv/versions/3.10.6/lib/python3.10/asyncio/base_events.py:646, in BaseEventLoop.run_until_complete(self, future)
    643 if not future.done():
    644     raise RuntimeError('Event loop stopped before Future completed.')
--> 646 return future.result()

Cell In[3], line 27, in doit(url)
     25 except Exception as e:
     26     print(f"Got exception {e}")
---> 27     raise e

Cell In[3], line 20, in doit(url)
     11 browser = await browser_type.launch(
     12     headless=True,
     13 )
     15 page = await browser.new_page(
     16     bypass_csp=True,
     17     ignore_https_errors=True,
     18 )
---> 20 res = await page.goto(url, wait_until="load", timeout=30 * 1000)
     22 await page.wait_for_load_state(state="networkidle")
     23 await browser.close()

File ~/Desktop/open-source/playwright-python/playwright/async_api/_generated.py:8612, in Page.goto(self, url, timeout, wait_until, referer)
   8551 async def goto(
   8552     self,
   8553     url: str,
   (...)
   8559     referer: typing.Optional[str] = None
   8560 ) -> typing.Optional["Response"]:
   8561     """Page.goto
   8562
   8563     Returns the main resource response. In case of multiple redirects, the navigation will resolve with the first
   (...)
   8608     Union[Response, None]
   8609     """
   8611     return mapping.from_impl_nullable(
-> 8612         await self._impl_obj.goto(
   8613             url=url, timeout=timeout, waitUntil=wait_until, referer=referer
   8614         )
   8615     )

File ~/Desktop/open-source/playwright-python/playwright/_impl/_page.py:500, in Page.goto(self, url, timeout, waitUntil, referer)
    493 async def goto(
    494     self,
    495     url: str,
   (...)
    498     referer: str = None,
    499 ) -> Optional[Response]:
--> 500     return await self._main_frame.goto(**locals_to_params(locals()))

File ~/Desktop/open-source/playwright-python/playwright/_impl/_frame.py:145, in Frame.goto(self, url, timeout, waitUntil, referer)
    135 async def goto(
    136     self,
    137     url: str,
   (...)
    140     referer: str = None,
    141 ) -> Optional[Response]:
    142     return cast(
    143         Optional[Response],
    144         from_nullable_channel(
--> 145             await self._channel.send("goto", locals_to_params(locals()))
    146         ),
    147     )

File ~/Desktop/open-source/playwright-python/playwright/_impl/_connection.py:59, in Channel.send(self, method, params)
     58 async def send(self, method: str, params: Dict = None) -> Any:
---> 59     return await self._connection.wrap_api_call(
     60         lambda: self.inner_send(method, params, False)
     61     )

File ~/Desktop/open-source/playwright-python/playwright/_impl/_connection.py:509, in Connection.wrap_api_call(self, cb, is_internal)
    507 self._api_zone.set(_extract_stack_trace_information_from_stack(st, is_internal))
    508 try:
--> 509     return await cb()
    510 finally:
    511     self._api_zone.set(None)

File ~/Desktop/open-source/playwright-python/playwright/_impl/_connection.py:97, in Channel.inner_send(self, method, params, return_as_dict)
     95 if not callback.future.done():
     96     callback.future.cancel()
---> 97 result = next(iter(done)).result()
     98 # Protocol now has named return values, assume result is one level deeper unless
     99 # there is explicit ambiguity.
    100 if not result:

TimeoutError: Timeout 30000ms exceeded.

@yijiyap

yijiyap commented Apr 4, 2024

I am facing a similar problem with my scraper. The code base is quite large, so I can't post it here. The scraper visits about 1400+ pages, each with a timeout of about 10 seconds, and a full run takes 12+ hours when nothing goes wrong.

Where the error happens isn't exactly consistent, but it seems to occur after about 3 hours of scraping, at around 350 links. The error only shows up when I stop the Python program; it does not terminate the script on its own the way a normal exception would.

Some workarounds I've used:

  • created a CSV that records the last link scraped before the error occurred, so the next run resumes from where it left off (see the sketch below);
  • automatically restart the scraper after 2 hours, before it hits the error.

Edit: Happens on Python 3.10 on macOS and Python 3.11 on Windows.
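
A minimal sketch of the first workaround (the file name and helpers are hypothetical, not part of the scraper above):

import csv
import os

CHECKPOINT = "scraped.csv"  # hypothetical checkpoint file

def load_done() -> set:
    # URLs that were already scraped in a previous run
    if not os.path.exists(CHECKPOINT):
        return set()
    with open(CHECKPOINT, newline="") as f:
        return {row[0] for row in csv.reader(f) if row}

def mark_done(url: str) -> None:
    # Append each finished URL so a restart resumes where it left off
    with open(CHECKPOINT, "a", newline="") as f:
        csv.writer(f).writerow([url])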

@haf

haf commented Apr 22, 2024

Another stacktrace:

 .venv/lib/python3.11/site-packages/playwright/_impl/_connection.py:296, in Connection.cleanup(self, cause)
     294     ws_connection._transport.dispose()
     295 for callback in self._callbacks.values():
 --> 296     callback.future.set_exception(self._closed_error)
     297 self._callbacks.clear()
     298 self.emit("close")
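
(Same frame as in the original report. A hypothetical mitigation sketch -- not the actual upstream change -- would be to guard set_exception() against futures that are already done, e.g.:)

import asyncio

def safe_set_exception(fut: asyncio.Future, exc: BaseException) -> None:
    # A future that was already cancelled or resolved (for example by a
    # timed-out call) rejects set_exception() with InvalidStateError,
    # so only set the exception while it is still pending.
    if not fut.done():
        fut.set_exception(exc)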

With anyio:

async with (
    async_playwright() as p,
    create_task_group() as tg
):
    browser = await p.chromium.launch()
    list_spider = await SpiderAPI[ListingLink, ListPageLink].create(browser)
    tg.start_soon(list_spider.run, spider_list(config)) # curried
    await sleep(5)
    tg.cancel_scope.cancel()
