curl-auth-csrf

Python tool that mimics cURL, but performs a login and handles any Cross-Site Request Forgery (CSRF) tokens.

Useful for scraping HTML normally only accessible when logged in.

Features

  • Runs on any OS supported by Python
  • Runs in Python2 and Python3
  • Allows specifying arbitrary GET/POST data to be included with the login form submission (e.g. the username)
  • Reads password from stdin (to avoid the plain-text password showing up in shell history)
  • Parses login form and dynamically replicates all form inputs (including hidden ones such as csrfmiddlewaretoken)
  • Automatically populates HTTP Referer header consistent with expected login sequence
  • Supports pages with multiple login forms by allowing the HTML id of the target form to be specified
  • Supports (rare) login forms with multiple password fields by allowing the HTML name of the password field to be specified
  • Handles HTTPS and HTTP 302 redirects
  • Allows validating login success by testing resultant URL and/or content on resultant page
  • Uses Python Requests HTTP library for session (cookie) management during every script run
  • Allows an arbitrary number of pages to be fetched while logged in
  • Optionally performs logout (to avoid leaving a session open from the server's perspective)
  • Allows User-Agent string spoofing (chooses a "safe" default if not otherwise specified)
  • Defaults to output via stdout, but can alternatively output to file
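Several of the features above (parsing the login form, replicating hidden inputs such as csrfmiddlewaretoken) boil down to harvesting every input element in the form before submitting it. Here is a minimal, hypothetical sketch of that harvesting step using only the Python standard library; it is an illustration of the technique, not the script's actual internals, and the real script uses the Requests library for the HTTP side:

```python
# Collect the name/value pairs of all <input> elements inside a login form,
# including hidden CSRF tokens, so they can be echoed back on submission.
from html.parser import HTMLParser


class FormInputCollector(HTMLParser):
    """Gathers input fields (and the action URL) of one form on a page."""

    def __init__(self, form_id=None):
        super().__init__()
        self.form_id = form_id   # None means "use the first form found"
        self.in_form = False
        self.action = None
        self.inputs = {}

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "form":
            # Enter the form if no id was requested, or if the id matches.
            if self.form_id is None or attrs.get("id") == self.form_id:
                self.in_form = True
                self.action = attrs.get("action")
        elif tag == "input" and self.in_form:
            name = attrs.get("name")
            if name:
                # Inputs without a value attribute (e.g. empty text fields)
                # default to the empty string.
                self.inputs[name] = attrs.get("value") or ""

    def handle_endtag(self, tag):
        if tag == "form":
            self.in_form = False


def extract_form_inputs(html, form_id=None):
    """Return (action_url, {field_name: value}) for the chosen form."""
    collector = FormInputCollector(form_id)
    collector.feed(html)
    return collector.action, dict(collector.inputs)
```

With Requests, one would then merge the user-supplied fields (username, password) into the harvested dictionary and POST it to the form's action URL via a requests.Session, which keeps cookies across the login sequence. The names used here are illustrative only.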

Usage

usage: curl-auth-csrf.py [-h] [-a USER_AGENT_STR] -i LOGIN_URL [-f FORM_ID]
                         [-p PASSWORD_FIELD_NAME] [-d DATA] [-u SUCCESS_URL]
                         [-t SUCCESS_TEXT] [-j LOGOUT_URL] [-o FILE]
                         [--version]
                         url_after_login [url_after_login ...]

Python tool that mimics curl, but performs a login and handles any Cross-Site
Request Forgery (CSRF) tokens.  Useful for scraping HTML normally only
accessible when logged in.

positional arguments:
  url_after_login

optional arguments:
  -a USER_AGENT_STR, --user-agent-str USER_AGENT_STR
                        User-Agent string to use
  -i LOGIN_URL, --login-url LOGIN_URL
                        URL that contains the login form
  -f FORM_ID, --form-id FORM_ID
                        HTML id attribute of login form
  -p PASSWORD_FIELD_NAME, --password-field-name PASSWORD_FIELD_NAME
                        name of input field containing password
  -d DATA, --data DATA  adds the specified data to the form submission
                        (usually just the username)
  -u SUCCESS_URL, --success-url SUCCESS_URL
                        URL substring constituting successful login
  -t SUCCESS_TEXT, --success-text SUCCESS_TEXT
                        HTML snippet constituting successful login
  -j LOGOUT_URL, --logout-url LOGOUT_URL
                        URL to be visited to perform the logout
  -o FILE, --output FILE
                        write output to <file> instead of stdout
  --version             show program's version number and exit
  -h, --help            show this help message and exit

If actual password is not passed in via stdin, the user will be prompted.
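The -u/--success-url and -t/--success-text options describe substring checks against the post-login URL and page body. A hedged sketch of what such a validation amounts to (the function name is illustrative, not taken from the script):

```python
def login_succeeded(final_url, page_html, success_url=None, success_text=None):
    """Substring checks mirroring -u/--success-url and -t/--success-text.

    A check is skipped when its pattern is not supplied; when both are
    given, both must pass.
    """
    if success_url is not None and success_url not in final_url:
        return False
    if success_text is not None and success_text not in page_html:
        return False
    return True
```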

Password Entry

The script expects the password to be passed in via stdin, to avoid the plain-text password showing up in shell history. A simple way to do this is as follows:

echo ThisIsMyPassword | ./curl-auth-csrf.py -i http://foobar.com/login -d username=bob http://foobar.com/secure_page

(Trailing newlines in the password are ignored.)

However, this defeats the purpose, as the password still shows up in the shell history. (Exception: In Bash, start the line with an initial space, which will prevent the line from showing up in the history. Refer to Bash documentation on HISTCONTROL and HISTIGNORE.)

A better way to handle this is with a CLI password management tool, such as pass. This is the recommended approach. For example, assuming that your password is managed by pass and already encrypted under the handle foobar.com:

pass foobar.com | ./curl-auth-csrf.py -i http://foobar.com/login -d username=bob http://foobar.com/secure_page

If nothing is passed in via stdin, then the user will be prompted for the password (interactively):

./curl-auth-csrf.py -i http://foobar.com/login -d username=bob http://foobar.com/secure_page
Password: 
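The behavior above can be summarized as: read the password from stdin when input is piped, and fall back to an interactive prompt when stdin is a terminal. A hypothetical sketch of that logic (the script's actual internals may differ):

```python
import getpass
import sys


def read_password(stream=None):
    """Return the password from a piped stream, or prompt interactively.

    Trailing newlines are stripped, matching the behavior noted above.
    """
    stream = sys.stdin if stream is None else stream
    if stream.isatty():
        # Interactive terminal: prompt without echoing the password.
        return getpass.getpass("Password: ")
    return stream.readline().rstrip("\r\n")
```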

Examples

If your username for pbs.org is bob@email.com, here is how you might normally scrape the zip code from your user profile:

curl -sL https://account.pbs.org/accounts/profile/ | grep Zip

However, since doing so requires being logged in, here's one way to do it using curl-auth-csrf:

pass pbs.org | ./curl-auth-csrf.py -i https://account.pbs.org/accounts/login/ -d email=bob@email.com -u https://account.pbs.org/accounts/profile/ -j https://account.pbs.org/accounts/logout/ https://account.pbs.org/accounts/profile/ | grep Zip

Notes:

  • The URL of the login page is https://account.pbs.org/accounts/login/
  • The HTML input field of the username is email
  • The URL we're taken to upon successful login is https://account.pbs.org/accounts/profile/
  • The URL of the logout page is https://account.pbs.org/accounts/logout/
  • The URL we want to scrape the zip code from is https://account.pbs.org/accounts/profile/
  • The information scraped is the only data written to stdout, so we can grep over it to pull what we're looking for

Another example, with a logout page and multiple pages fetched while logged in:

pass thefastpark.com | ./curl-auth-csrf.py -i https://www.thefastpark.com/ -d username=bob@email.com -u https://www.thefastpark.com/myrewards/history/ -j https://www.thefastpark.com/myrewards/logout/ https://www.thefastpark.com/myrewards/history/ https://www.thefastpark.com/myrewards/redeempoints/ | egrep -i '(Total Points|points available)'

Limitations

This script only handles standard logins involving a single form submission with a username, password, and hidden fields for CSRF. It will not handle the following scenarios:

  • Logins involving CAPTCHA
  • Logins involving re-authentications (i.e. multiple successive password prompts)
  • Logins involving two-factor authentication
  • Logins involving any client-side password transformations (e.g. hashing the password before sending it to the server)

If all you need is basic HTTP authentication, this script is overkill: cURL and Wget can do that out of the box.

Gotchas

  • If you're redirecting (or piping) output to a file (or another utility) and receiving a UnicodeEncodeError exception, try setting PYTHONIOENCODING=UTF-8 in your terminal.

Disclaimer

Please don't abuse this tool. Only use it with accounts that rightfully belong to you. If you use this tool with someone else's login, you are solely responsible and may face legal consequences.

This script isn't perfect. See the Limitations section above; there may also be defects. Beware that some Internet services won't take kindly to logins performed outside a normal browser. By using this tool, you accept full responsibility for anything that might happen.

Debugging

If you're having trouble finding the right parameters, you can change the default debugging level from "WARNING" to "DEBUG" at the top of the Python script. See discussion at #2.
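The switch being described is the standard Python logging idiom; as a generic illustration (not the script's exact code):

```python
import logging

# Change WARNING to DEBUG to surface per-request detail while troubleshooting.
logging.basicConfig(level=logging.WARNING)

log = logging.getLogger("curl-auth-csrf")
log.debug("suppressed while the level is WARNING")
log.warning("shown at WARNING level and below")
```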
