How To Scrape Google Images


A real-time Google Images scraper. The process is automated by sending HTTP requests to retrieve image data, which is then parsed and saved.

Scrape Google Images using Oxylabs’ Google Images Scraper API

For this tutorial, we will use the Google Images Scraper API to retrieve images related to the one given in the query. This API returns all the related images along with the URLs of the pages where those images are hosted.

To use this API, you must create an Oxylabs account and get your API credentials. These credentials will be used in the later steps.

Step 1 - Setting up the environment

To get started, you need Python 3.6+ installed and running on your system. You also need the following packages to put the code into action:

  • requests - for sending HTTP requests to the Oxylabs API.

  • pandas - for storing the output data in dataframes and exporting it to CSV files.

To install these packages, we can use the following command:

pip install requests pandas

Running this command will install all the required packages.
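As a quick sanity check, you can confirm that both packages import cleanly by printing their versions:

```python
# Optional check: confirm the installed packages are importable
import requests
import pandas as pd

print("requests:", requests.__version__)
print("pandas:", pd.__version__)
```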

Step 2 - Import the required libraries

After the installation of packages, start by creating a new Python file and import the required libraries using the following code:

import requests
import pandas as pd

Step 3 - Structure the payload

The Oxylabs Image Scraper API has some parameters that can be set to structure the payload and make the request accordingly. The details of these parameters can be found in the official documentation by Oxylabs.

The payload is structured as follows:

payload = {
    "source": "google_images",
    "domain": "com",
    "query": "<search_image_URL>",
    "context": [
        {
            "key": "search_operators",
            "value": [
                {"key": "site", "value": "example.com"},
                {"key": "filetype", "value": "html"},
                {"key": "inurl", "value": "image"},
            ],
        }
    ],
    "parse": True,
    "geo_location": "United States",
}

Make sure to replace the query parameter value with the required search image URL.

The context parameter is used to apply search filters. For example, our search operators force the API to scrape only the links from Google image search results that belong to example.com. If you remove the site key from search_operators, the Image Scraper API may return related results from all websites.

The search operators filetype:html and inurl:image restrict the results to pages with an HTML file type whose URL contains "image".
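Conceptually, these operators mirror what you would type into the Google search bar yourself. A small sketch showing the equivalent query string (the operators list is copied from the payload above):

```python
# The search operators from the payload, expressed as a plain Google query
operators = [
    {"key": "site", "value": "example.com"},
    {"key": "filetype", "value": "html"},
    {"key": "inurl", "value": "image"},
]

# Joining key:value pairs reproduces the familiar search-bar syntax
query_string = " ".join(f"{op['key']}:{op['value']}" for op in operators)
print(query_string)  # site:example.com filetype:html inurl:image
```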

The parse parameter is set to true to get the results parsed in JSON format. Additionally, you can add the pages and start_page parameters to the payload to scrape multiple result pages starting from start_page. The default value for both parameters is 1.
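For instance, a payload that scrapes the first three result pages could look like this (a sketch, with the context filters omitted for brevity):

```python
# Minimal payload sketch with pagination (context filters omitted)
payload = {
    "source": "google_images",
    "domain": "com",
    "query": "<search_image_URL>",
    "parse": True,
    "start_page": 1,  # default starting page
    "pages": 3,       # scrape three result pages
}
```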

Step 4 - Make the request

After creating the payload structure, you can initiate a POST request to Oxylabs’ API using the following code segment.

response = requests.post(
    "https://realtime.oxylabs.io/v1/queries",
    auth=(USERNAME, PASSWORD),
    json=payload,
)

Make sure to replace USERNAME and PASSWORD with your API credentials. The response received can be viewed in JSON format.
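Before parsing, it is worth guarding against failed requests. A minimal sketch, assuming the response layout used later in this tutorial (results -> [0] -> content) and demonstrated here with a stubbed response body; extract_content is a hypothetical helper, not part of the Oxylabs API:

```python
# extract_content is a hypothetical helper, not part of the Oxylabs API
def extract_content(status_code, response_json):
    if status_code != 200:
        raise RuntimeError(f"Request failed with status {status_code}")
    # The parsed content sits under results -> [0] -> content
    return response_json["results"][0]["content"]

# Stubbed response body mimicking the API's JSON layout
stub = {"results": [{"content": {"results": {"organic": []}}}]}
content = extract_content(200, stub)
print(content["results"]["organic"])  # []
```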

Step 5 - Extract the data and save it in a CSV file

We can extract the required image data from the response object. The response object has a results key that contains all the related image data. We will extract the image data into a dataframe and then save it to CSV and JSON files using the following code.

result = response.json()["results"][0]["content"]
image_results = result["results"]["organic"]

# Collect the rows, then build the DataFrame in one go
rows = []
for i in image_results:
    rows.append([i["title"], i["desc"], i["url"]])

df = pd.DataFrame(rows, columns=["Image Title", "Image Description", "Image URL"])

# Save the data to CSV and JSON files
df.to_csv("google_image_results.csv", index=False)
df.to_json("google_image_results.json", orient="split", index=False)
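To confirm the export round-trips cleanly, the CSV can be read back with pandas. A quick sanity check using a stand-in row, since the real data depends on your query:

```python
import pandas as pd

# Stand-in dataframe with the same columns as the tutorial's output
df = pd.DataFrame(
    [["Odd-eyed cat", "A white cat with odd eyes", "https://example.com/cat.html"]],
    columns=["Image Title", "Image Description", "Image URL"],
)
df.to_csv("google_image_results.csv", index=False)

# Read the file back and confirm the columns survived the round trip
df_check = pd.read_csv("google_image_results.csv")
print(list(df_check.columns))
```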

Now, let's take an example URL of a cat image as the query and put all the code together. Assume that we want to scrape the first page of Google Images results, restricted to wikipedia.org only. Here is what the code looks like:

# Import the required libraries
import requests
import pandas as pd

# Set your Oxylabs API credentials
USERNAME = "<your_username>"
PASSWORD = "<your_password>"

# Structure payload.
payload = {
    "source": "google_images",
    "domain": "com",
    "query": "https://upload.wikimedia.org/wikipedia/commons/a/a3/June_odd-eyed-cat.jpg",
    "context": [
        {
            "key": "search_operators",
            "value": [
                {"key": "site", "value": "wikipedia.org"},
                {"key": "filetype", "value": "html"},
                {"key": "inurl", "value": "image"},
            ],
        }
    ],
    "parse": True,
    "geo_location": "United States",
}

# Get response.
response = requests.post(
    "https://realtime.oxylabs.io/v1/queries",
    auth=(USERNAME, PASSWORD),
    json=payload,
)

# Extract data from the response
result = response.json()["results"][0]["content"]
image_results = result["results"]["organic"]

# Collect the data rows
rows = []
for i in image_results:
    title = i["title"]
    description = i["desc"]
    url = i["url"]
    rows.append([title, description, url])

    # Print the data on the screen
    print("Image Name: " + title)
    print("Image Description: " + description)
    print("Image URL: " + url)

# Create a DataFrame
df = pd.DataFrame(rows, columns=["Image Title", "Image Description", "Image URL"])

# Copy the data to CSV and JSON files
df.to_csv("google_image_results.csv", index=False)
df.to_json("google_image_results.json", orient="split", index=False)

Here is what our output looks like:

[Screenshot of the printed output]

The complete API response for this API request can be found here.

Conclusion

Scraping Google Images without a dedicated tool is a complex task. Since Google Images offers a vast and diverse collection of images that is invaluable for many applications and analyses, a solution like Oxylabs' Google Images Scraper API can be key.

Looking to scrape data from other Google sources? See our in-depth guides for scraping Jobs, Search, Scholar, Trends, News, Flights, Shopping, and Maps.
