Lambda Scraper

(See also lambda-selenium)

Use AWS Lambda functions as an HTTPS proxy. This is a cost-effective way to get access to a large pool of IP addresses. Run the commands below to create as many Lambda functions as you need (one per IP address). The number of functions, as well as the region, can be specified in variables.tf. Each Lambda function changes IP address after approximately 6 minutes of inactivity, so, for example, you could create 360 Lambda functions and cycle through them at one per second, making as many requests as possible via each corresponding IP address. Note that, in practice, AWS will sometimes assign the same IP address to more than one Lambda function.

I have rewritten this using Node.js to take advantage of streaming Lambda function URLs, so that you can make (asynchronous) proxy requests simply by pre-pending the proxy URL to the target URL. If you are looking for the original Python version, it is available in the old directory.

Pre-requisites

You will need to have Terraform and Docker installed.

Usage

git clone https://github.com/teticio/lambda-scraper.git
cd lambda-scraper
terraform init
terraform apply -auto-approve
# run "terraform apply -destroy -auto-approve" in the same directory to tear all this down again

You can specify the AWS region and profile as well as the number of proxies in a terraform.tfvars file:

num_proxies = 10
region      = "eu-west-2"
profile     = "default"

The proxy Lambda function forwards requests to a random proxy-<i> Lambda function. To obtain the URL of the proxy function, run

echo $(terraform output -json | jq -r '.lambda_proxy_url.value')

Then you can make requests via the proxy by pre-pending this URL to the target URL.

curl https://<hash>.lambda-url.<region>.on.aws/ipinfo.io/ip
# or
curl https://<hash>.lambda-url.<region>.on.aws/http://ipinfo.io/ip

If you make a number of cURL requests to this URL, you should see several different IP addresses. A script that does exactly this is provided in test.sh. You will notice that there is a cold start latency the first time each Lambda function is invoked.
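
If you prefer Python to cURL, here is a minimal sketch along the same lines as test.sh, assuming you have the requests library installed (the proxy URL placeholder is, of course, your own terraform output value):

import requests

# Replace with your proxy URL (the terraform output value above)
PROXY = "https://<hash>.lambda-url.<region>.on.aws/"

for _ in range(5):
    # Each call is forwarded to a randomly chosen proxy-<i> function,
    # so the reported IP address should vary between requests
    print(requests.get(PROXY + "ipinfo.io/ip", timeout=30).text.strip())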

Headers

Certain headers (host and those starting with x-amz or x-forwarded-) are stripped out because they interfere with the mechanism AWS uses to invoke the endpoint via HTTP. If you need these headers to be set in your request, you can do so by prefixing them with lambda-scraper- (e.g. lambda-scraper-host: example.com). A special header lambda-scraper-raw-query-params is used to ensure the query parameters are passed straight through without being altered by encoding and decoding. Similarly, some response headers (those starting with x-amz) are mapped to lambda-scraper- so that they can be returned without affecting the response itself.
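
For example, a minimal (purely illustrative) sketch in Python, using the same host value as the example above:

import requests

# Replace with your proxy URL
PROXY = "https://<hash>.lambda-url.<region>.on.aws/"

# A plain `host` header would be stripped by the proxy, so pass it
# with the lambda-scraper- prefix instead
response = requests.get(
    PROXY + "example.com/",
    headers={"lambda-scraper-host": "example.com"},
    timeout=30,
)
print(response.status_code)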

Authentication

Currently, the proxy Lambda function URL is configured to be publicly accessible, although the hash in the URL serves as a "key". The underlying proxy-<i> Lambda function URLs can only be accessed directly by signing the request with the appropriate AWS credentials. If you prefer to cycle through the underlying proxy URLs explicitly and avoid going through two Lambda functions per request, examples of how to sign the request are provided in proxy.js and test_with_iam.py. The list of underlying proxy URLs created by Terraform can be found in lambda/proxy-urls.json.

pip install -r requirements.txt
python test_with_iam.py
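
If you want to roll your own, the gist of cycling through the underlying URLs with signed requests looks something like the sketch below. This is only a sketch, not what test_with_iam.py does: it uses the third-party requests-aws4auth package, and it assumes lambda/proxy-urls.json is a JSON array of URL strings.

import itertools
import json

import boto3
import requests
from requests_aws4auth import AWS4Auth  # pip install requests-aws4auth

# Proxy URLs written out by Terraform (assumed here to be a JSON array of strings)
with open("lambda/proxy-urls.json") as f:
    proxy_urls = json.load(f)

credentials = boto3.Session().get_credentials()
auth = AWS4Auth(
    credentials.access_key,
    credentials.secret_key,
    "eu-west-2",  # use your region
    "lambda",     # Lambda function URLs with IAM auth are signed for the lambda service
    session_token=credentials.token,
)

# Cycle through the underlying proxy-<i> functions, signing each request with SigV4
url_cycle = itertools.cycle(proxy_urls)
for _ in range(3):
    url = next(url_cycle).rstrip("/") + "/ipinfo.io/ip"
    print(requests.get(url, auth=auth, timeout=30).text.strip())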

If you decide to also enforce IAM authentication for the proxy Lambda function URL, it is a simple matter of changing the authorization_type to AWS_IAM in main.tf.

Concurrency

The ability to call the Lambda functions asynchronously makes numerous parallel requests possible without resorting to multi-threading, while the pool of rotating IP addresses helps you avoid being rate limited. In Python, you can use the aiohttp library to make asynchronous HTTP requests as follows:

import asyncio

import aiohttp

# Replace with your proxy URL
PROXY = "https://<hash>.lambda-url.<region>.on.aws/"


async def fetch(session, url):
    async with session.get(url) as response:
        return await response.text()


async def fetch_all(urls):
    async with aiohttp.ClientSession() as session:
        tasks = [fetch(session, url.replace("https://", PROXY)) for url in urls]
        htmls = await asyncio.gather(*tasks)
    return htmls


urls = [
    "https://www.bbc.co.uk/news",
    "https://www.bbc.co.uk/news/uk",
]
print(asyncio.run(fetch_all(urls)))

"Serverless VPN" (well, almost)

It is possible to set up a proxy server that forwards all HTTP requests (but not websockets) to the Lambda proxy. To do this, first create a Certificate Authority with

openssl req -x509 -new -nodes -keyout testCA.key -sha256 -days 365 -out testCA.pem -subj '/CN=Mockttp Testing CA - DO NOT TRUST'

Then add and trust the testCA.pem certificate in a browser and set the proxy host to localhost and port to 8080. Add a .env file with the following contents:

PROXY_HOST=<hash>.lambda-url.<region>.on.aws

install the Node.js packages

cd proxy_server
npm install
cd -

and run the server with

node proxy_server/app.js

You should now be able to navigate to a webpage with your browser and all the HTTP requests will be proxied via the Lambda function. Note that some sensitive endpoints may not work (for example if they use a pre-signed URL). You can toggle the proxy on and off by pressing X.
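
The local proxy server can also be used from code rather than a browser. Here is a minimal sketch with Python requests, assuming the server is running on localhost:8080, that testCA.pem is in the current directory, and that the proxy server handles non-browser clients in the same way:

import requests

# Route traffic through the local proxy server, which forwards it to the Lambda proxy
proxies = {
    "http": "http://localhost:8080",
    "https": "http://localhost:8080",
}

# Trust the self-signed Certificate Authority created above
response = requests.get(
    "https://ipinfo.io/ip",
    proxies=proxies,
    verify="testCA.pem",
    timeout=30,
)
print(response.text.strip())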