18520339/finding-similar-images

Finding similar images

Introduction

  • For a Data Science project, my team had to collect images from several search engines to build a dataset. We chose Google Sheets to assign labeling tasks to each member because it is convenient.

  • Crawling the Internet returns many similar images, which biases the dataset. This is my solution for filtering similar images during the Data Preparation step.

Implementation

  1. Get image URLs from search engines. I have a repo for that here

  2. Copy and paste these URLs into Google Sheets. After sorting, similar images are arranged next to each other

  3. Connect to Google Sheets using Python

  4. With only 1 hash value, some images are reported as the same even though they are different. Therefore, we calculate 3 hash values for each image and compare them for every pair of images:

    • Average hashing (ahash)
    • Perceptual hashing (phash)
    • Difference hashing (dhash)

  5. If at least 2 of these 3 hash distances say that 2 images are similar (distance under the "different points" threshold set by --diff), arrange these images next to each other

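    # With ImageHash-style hash objects, subtracting two hashes gives their Hamming distance
    distances = [ahash0 - ahash1, phash0 - phash1, dhash0 - dhash1]
    # Count how many of the 3 distances fall below the --diff threshold
    diff_results = sum(dist < args['diff'] for dist in distances)
    
    # At least 2 of the 3 hashes must agree before the pair is reported as similar
    if diff_results >= 2:
        print(f'|--Similar with url {idx1 + 1}: {url1}')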
  6. Decide which images to keep and begin labeling
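
The voting rule in steps 4 and 5 can be sketched in plain Python. This is only an illustration, not the repository's code: it assumes the image has already been shrunk to a tiny grayscale grid (a list of rows of 0-255 values), shows simplified average and difference hashes, and treats the perceptual hash as an integer computed elsewhere. The names `average_hash`, `difference_hash`, `hamming`, and `is_similar` are hypothetical helpers introduced for this example.

```python
from typing import Sequence

Gray = Sequence[Sequence[int]]  # rows of grayscale pixel values (0-255)


def average_hash(pixels: Gray) -> int:
    """Simplified ahash: one bit per pixel, set when the pixel is brighter than the mean."""
    flat = [p for row in pixels for p in row]
    mean = sum(flat) / len(flat)
    bits = 0
    for p in flat:
        bits = (bits << 1) | int(p > mean)
    return bits


def difference_hash(pixels: Gray) -> int:
    """Simplified dhash: one bit per horizontal neighbor pair, set when the right pixel is brighter."""
    bits = 0
    for row in pixels:
        for left, right in zip(row, row[1:]):
            bits = (bits << 1) | int(right > left)
    return bits


def hamming(h0: int, h1: int) -> int:
    """Number of bit positions in which two integer hashes differ."""
    return bin(h0 ^ h1).count("1")


def is_similar(hashes0: Sequence[int], hashes1: Sequence[int],
               diff: int, votes: int = 2) -> bool:
    """Return True when at least `votes` hash distances fall below `diff`.

    `hashes0` and `hashes1` hold the (ahash, phash, dhash) values of two images.
    """
    distances = [hamming(a, b) for a, b in zip(hashes0, hashes1)]
    return sum(d < diff for d in distances) >= votes


# Hypothetical hash triples: two of the three distances (1 and 2) are below 3
print(is_similar((0, 0, 0), (1, 3, 255), diff=3))
```

Requiring 2 of 3 hashes to agree means a single hash that happens to collide on two different images is not enough to flag them as similar.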

Usage

  1. Install libraries: pip install -r requirements.txt

  2. Sort similar images in Google Sheets:

  • Example: python sort_similar.py -s "example" -w "Sheet1" -r "B2:C" -a credentials.json
    usage: sort_similar.py [-h] -s SPREADSHEET -w WORKSHEET -r RANGE -a AUTH [-d DIFF]

    optional arguments:
    -h, --help                                    show this help message and exit
    -s SPREADSHEET, --spreadsheet SPREADSHEET     spreadsheet name
    -w WORKSHEET, --worksheet WORKSHEET           worksheet name
    -r RANGE, --range RANGE                       updated range
    -a AUTH, --auth AUTH                          credentials file
    -d DIFF, --diff DIFF                          different points
  3. Download images from URLs in Google Sheets:
  • Example: python download_images.py -s "example" -w "Sheet1" -r "B2:C" -a credentials.json -o images/
    usage: download_images.py [-h] -s SPREADSHEET -w WORKSHEET -r RANGE -a AUTH -o OUT

    optional arguments:
    -h, --help                                    show this help message and exit
    -s SPREADSHEET, --spreadsheet SPREADSHEET     spreadsheet name
    -w WORKSHEET, --worksheet WORKSHEET           worksheet name
    -r RANGE, --range RANGE                       updated range
    -a AUTH, --auth AUTH                          credentials file
    -o OUT, --out OUT                             path to images directory
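
For readers who want to see how a command-line interface like the one for sort_similar.py might be wired up, here is a minimal sketch. It is a guess, not the actual script: the flags and help strings are taken from the usage text above, but the integer type and the default of 10 for `--diff` are assumptions. Reading the range through the third-party `gspread` library with a service-account credentials file is likewise an assumption. `build_parser` and `read_rows` are hypothetical names.

```python
import argparse
from typing import Dict, List


def build_parser() -> argparse.ArgumentParser:
    """Recreate the sort_similar.py flags shown in the usage text."""
    parser = argparse.ArgumentParser(prog="sort_similar.py")
    parser.add_argument("-s", "--spreadsheet", required=True, help="spreadsheet name")
    parser.add_argument("-w", "--worksheet", required=True, help="worksheet name")
    parser.add_argument("-r", "--range", required=True, help="updated range")
    parser.add_argument("-a", "--auth", required=True, help="credentials file")
    # Assumption: the real script's type and default for --diff are not documented
    parser.add_argument("-d", "--diff", type=int, default=10, help="different points")
    return parser


def read_rows(args: Dict[str, object]) -> List[List[str]]:
    """Fetch the cell range as a list of rows (assumes gspread and a service account)."""
    import gspread  # third-party; imported lazily so the parser works without it

    client = gspread.service_account(filename=args["auth"])
    worksheet = client.open(args["spreadsheet"]).worksheet(args["worksheet"])
    return worksheet.get(args["range"])


# Parse the example command line from the Usage section into a dict
args = vars(build_parser().parse_args(
    ["-s", "example", "-w", "Sheet1", "-r", "B2:C", "-a", "credentials.json"]
))
print(args["spreadsheet"], args["worksheet"], args["range"], args["diff"])
```

Converting the parsed namespace with `vars()` gives a dictionary, which matches the `args['diff']` lookup in the comparison snippet.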

Reference