Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Session related suggestions #2

Open
jairoxyz opened this issue Jul 14, 2020 · 8 comments
Open

Session related suggestions #2

jairoxyz opened this issue Jul 14, 2020 · 8 comments
Labels
help wanted Extra attention is needed

Comments

@jairoxyz
Copy link
Contributor

jairoxyz commented Jul 14, 2020

Hi mate, I see you converted to TS. Looks really great now and works nicely for me, especially using custom session profile, which seems to help avoiding trigger of Captcha after a couple of retries of the request.get. I have a few suggestions:

  • when using session param without custom User-Agent, a random browser profile folder is created instead of the custom named folder. Is there a reason for this? Why not create the named folder always when the session param is used? I believe that way the session and cookies would be reused even after a server or docker app restart.

  • I also noticed that named profile folders get deleted after server restart, I guess due to the non persistent nature of the Docker image data storage. Maybe we could add a conf param for a custom temp folder so that a persistent volume could be mounted to it?

  • lastly, when using request.get with a session param and the session does not exist it returns a notice to create a session first. Why not call sesssion.create like you call it if no session param is provided?

Anyway, thanks for the great app!

Jx-

@NoahCardoza
Copy link
Owner

NoahCardoza commented Jul 15, 2020

  1. Well, the idea of the sessions is to speed up the requests after passing the challenge for the first time. When you want to reset your cookies you can just create a new session. I might be able to add something like "destroy": false" to the sessions.destroy command, but to be effective, we'd have to place the profiles somewhere other than a temp directory. Maybe add an ENV variable PATH_TO_PROFILES=<path-to-profiles> || os.tmpdir()?

    NOTE: The original author edited the profile to change the UA, but I'm not sure if that approach is working properly right now. I need to do some testing.

  2. If we implement a PATH_TO_PROFILES ENV variable this would be really easy just by editing the Dockerfile. We could add instructions in the README.md.

  3. I added this only to make it "stupid-proof." If people accidentally reference the wrong session then they get a message telling them to create one. This keeps people from misusing the sessions parameter and thinking something is broken when they have to keep waiting for 5 seconds to solving a captcha by accident. Plus, it allows you to get the heavy lifting of starting up the browser before you start sending requests so the request.get speeds will always be top-notch. However, if you think it would better to revert to the original functionality I'll consider it.

Any thoughts?

@jairoxyz
Copy link
Contributor Author

jairoxyz commented Jul 15, 2020

Hey again. Thanks for going through my suggestions.

  1. Yes, I did it that way in my dev version:
    dockerfile: ENV TMPDIR=/home/node/flaresolverr/tmp
    sessions.js: const tmpDir = process.env.TMPDIR || os.tmpdir()

This works fine and profiles are created in the new folder. I am just an absolute newbie to docker, so I still have to sort out how to mount a persistent volume. Also, if for ex. running on some free hoster space, most of the time volume mounting is not allowed, so not usable there.

  1. Yep

  2. For the named session creation in request.get I tried this:

const forceCreate = sessionId === params.session && !sessions.get(sessionId)
const browser = oneTimeSession || forceCreate ? await sessions.create(sessionId, params) : sessions.get(sessionId)

which works but caused some timeouts. Maybe its better to call sessions.create explicitly before request.get. It worked though in your original version, so I have to test that a bit more. Maybe a forceCreate param would be better?

Cheers,

Jx-

@jairoxyz
Copy link
Contributor Author

One more thing: I would like to add a headers parameter, so that we can pass these to https://github.com/puppeteer/puppeteer/blob/v3.3.0/docs/api.md#pagesetextrahttpheadersheaders . Some pages need correct headers to return proper responses.

@NoahCardoza
Copy link
Owner

In regards to explicit session creation, I think I'm going to keep it that way since it wouldn't work any other way with the enhancements I've added via #10.

Next on my list of things to do is a variable userDataDir parent directory.

NoahCardoza added a commit that referenced this issue Jul 19, 2020
@jairoxyz
Copy link
Contributor Author

Yeah I think this is great like it is now.

@NoahCardoza
Copy link
Owner

In regards to saving sessions, I've been playing around with setting a custom parent to the userDataDir (0f528b4). However, it seems that creating a session and then loading the session again doesn't preserve cookies like I'd have expected...

I used this to test it:

const P = require('puppeteer')

const showCookies = async page => {
  const pre = await page.$('pre')
  const cookies = await pre.evaluate(e => e.innerText)
  console.log(JSON.parse(cookies))
}

const setCookies = async path => {
  const browser = await P.launch({
    headless: true,
    userDataDir: path
  })
  const page = await browser.newPage()
  await page.goto('https://httpbin.org/cookies/set/test123/123')
  await showCookies(page)
  await browser.close()
}

const checkCookies = async path => {
  const browser = await P.launch({
    headless: true,
    userDataDir: path
  })
  const page = await browser.newPage()
  await page.goto('https://httpbin.org/cookies')
  await showCookies(page)
  await browser.close()
}

(async () => {
  const path = 'sessions/test'
  await setCookies(path)
  await checkCookies(path)
})().catch(console.error)

While I'd expect the output to be:

{ cookies: { test123: '123' } }
{ cookies: { test123: '123' } }

The output is actually:

{ cookies: { test123: '123' } }
{ cookies: {} }

I've done a little Googling and it doesn't seem like you can reload profiles? Maybe I'm doing something wrong? Any ideas?

@jairoxyz
Copy link
Contributor Author

jairoxyz commented Jul 23, 2020

Hmmm, I haven't done any concrete tests, I only notice that the CF cookies stored in the session/profile folder SQLite DB are being used, because wile the first solving can take several retries and quite a bit of time, subsequent calls to the site opens very fast, obviously without needing to solve again. So as long as the container is running all seems to be OK with the cookies but since many container hosts use volatile storage, once they go idle, the profile folder is lost. My workaround now is to store the cookies locally and send them back to the proxy if not expired yet. This seems to work more or less, though. I found one issue with the documentation, that says mandatory fields for cookies are name and value but I had to send domain, too.

I can play a bit with your code. Wasn't aware of that httpbin until now :)

PS: what I see already is that using above code, for reasons I don't understand, the Default profile folder is not created at all. This contains the cookies DB. But maybe that's a Windows issue?

PPS: using Windows temp dir as parent folder works then with Default and cookies being created. Problem is that the call to httpbin doesn't seem to store the cookies correctly, the DB file stays empty. If you call another page, such as github, cookies are stored correctly. Interestingly, in Fiddler you see the the requests from Puppeteer are working and cookies returned as expected:

GET https://httpbin.org/cookies/set/test123/123 HTTP/1.1
HTTP/1.1 302 FOUND
Date: Thu, 23 Jul 2020 11:38:40 GMT
Content-Type: text/html; charset=utf-8
Content-Length: 223
Connection: keep-alive
Server: gunicorn/19.9.0
Location: /cookies
Set-Cookie: test123=123; Path=/

which redirects to

GET https://httpbin.org/cookies HTTP/1.1
HTTP/1.1 200 OK
Date: Thu, 23 Jul 2020 11:38:41 GMT
Content-Type: application/json
Content-Length: 44
Connection: keep-alive
Server: gunicorn/19.9.0
Access-Control-Allow-Origin: *
Access-Control-Allow-Credentials: true

{
  "cookies": {
    "test123": "123"
  }
}

PPPS: I fiddled the proxy while calling a page after the challenge is solved and it definitely sends back the cookies stored in the session.

Cheers,

Jx-

@NoahCardoza
Copy link
Owner

Very interesting. Sorry it took forever to get back to you. Are you in the Discord server? https://discord.gg/pVT8rNy

I'll continue to look into this but I've been busier lately.

@NoahCardoza NoahCardoza added the help wanted Extra attention is needed label Aug 3, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
help wanted Extra attention is needed
Projects
None yet
Development

No branches or pull requests

2 participants