
[Question]: I found a way to use a single playwright instance in a multithreaded context #2001

Open
Yomguithereal opened this issue Jul 6, 2023 · 2 comments

@Yomguithereal

Your question

Hello playwright team,

I know the consensus about multithreading and playwright is that you should create one playwright instance per thread because playwright is not thread-safe, which is right. But one instance per thread seems quite costly, and it is a shame to lose the ability of the asyncio implementation to do multiple things at once on multiple tabs (it can do that, no?).

So I dug into the problem and found a way to use an asyncio playwright instance safely from a multithreaded context. This means you can still do multiple things concurrently with a single playwright instance while interacting with the browser from multiple threads, which is a legitimate use case for legacy and usability reasons. In my case, I have a webmining project named minet in which I sometimes need to interleave multithreaded urllib3 or pycurl calls with playwright tasks for complex web crawling, and I orchestrate the threaded work using the quenouille library. In this context, mixing threads and asyncio for orchestration is a full-fledged nightmare, so I wanted a way to pilot an asynchronous playwright instance from multiple threads.

The solution is therefore the following:

  1. You need a class that spawns a thread in which a new asyncio loop will run
  2. Then you need to start the playwright instance in said thread and make the loop run forever
  3. Then you can send "jobs" as coroutine functions scheduled through asyncio.run_coroutine_threadsafe

There is some threading glue code involved of course for synchronization but the rest is pretty straightforward.
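
For readers who want to see the shape of these three steps, here is a minimal sketch. It is not the minet class linked below: `LoopThread` and `fake_job` are hypothetical names, and a stand-in coroutine replaces the real playwright work so the snippet runs without a browser. With playwright, the instance would presumably be started on the loop inside the thread, as noted in the comment.

```python
import asyncio
import threading
from concurrent.futures import ThreadPoolExecutor


class LoopThread:
    """Hypothetical helper: owns a thread in which an asyncio loop runs forever."""

    def __init__(self):
        self.loop = asyncio.new_event_loop()
        self._ready = threading.Event()
        self._thread = threading.Thread(target=self._main, daemon=True)
        self._thread.start()
        self._ready.wait()  # glue: block until the loop is actually running

    def _main(self):
        asyncio.set_event_loop(self.loop)
        # Assumption: with playwright, the instance would be started here,
        # on this loop, e.g. with
        # self.loop.run_until_complete(async_playwright().start())
        self.loop.call_soon(self._ready.set)
        self.loop.run_forever()

    def run(self, coro_fn, *args):
        """Callable from any thread: schedule the job on the loop and wait."""
        future = asyncio.run_coroutine_threadsafe(coro_fn(*args), self.loop)
        return future.result()

    def stop(self):
        self.loop.call_soon_threadsafe(self.loop.stop)
        self._thread.join()
        self.loop.close()


async def fake_job(n):
    # Stand-in for a playwright task (open a page, read something, close it).
    await asyncio.sleep(0.01)
    return n * 2


runner = LoopThread()
with ThreadPoolExecutor(max_workers=4) as pool:
    # Four worker threads all submit jobs to the single loop.
    results = list(pool.map(lambda n: runner.run(fake_job, n), range(4)))
runner.stop()
```

Every job executes on the loop's thread, so the objects living on that loop are only ever touched from one thread, while the calling threads simply block on `future.result()`.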

Here is an example of such a class: https://github.com/medialab/minet/blob/master/minet/browser/threadsafe_browser.py
Here is an example of it being used: https://github.com/medialab/minet/blob/master/ftest/playwright-threading.py

But now I have some questions:

  • Is it really safe or is there some hidden footgun I have not yet seen?
  • Is this useful to anyone else? (I am not advocating for API additions nor code modification, here, but maybe documentation)
  • Is playwright actually able to run concurrent actions in multiple tabs? (It certainly does look like it.)
  • Is it actually worth avoiding multiple playwright instances, given that the underlying playwright controlling process might be some weird singleton I don't know about?

Some other related notes:

@ukenmisneru

ukenmisneru commented Jul 14, 2023

Hi Yomguithereal,
I have a similar question.
Is it possible to have pages with different contexts all open in one playwright instance (one browser window)?
I need to send a lot of different queries to the same website, using different proxies to avoid being detected.
It would consume a lot of system resources if the different contexts were opened in different browser windows.
Thanks

@dgtlmoon

dgtlmoon commented Nov 8, 2023

Always consider that playwright is primarily a testing framework, not a web scraping framework.
