Fix installation by replacing tensorflow with pytorch for CNN embeddings (#175)

* Bump version to 0.2.4.

* Update builtin.h (#123)

* Update readme to reflect changes in TF 2.1.

* Bump version to 0.2.4. (#122)

* Update builtin.h

Respelled the misspelled URL keyword (gitub.com), which was taking readers to a malicious site.

Co-authored-by: Dat Tran <datitran@gmail.com>
Co-authored-by: Tanuj Jain <tanujjain@users.noreply.github.com>

* Add recursive option in encode_image() (#104)

* Update tests for new recursive option

* Add recursive option

to the following functions:
encode_image()
find_duplicates()
find_duplicates_to_remove()

Recursive is off by default.

* Add tests for recursive option

* Modify tests to ignore hidden '.DS_Store' files that are automatically created on macOS by the Finder application. (#131)

* Port CNN to pytorch, other major changes (#173)

* Add feature generation with mobilenet v3.

* Integrate multi image encoding generation in CNN class.

* Update tests to match the new CNN embeddings as well as the newly generated hashes.

* Change antialias resampling to lanczos, as antialias is deprecated and maps to lanczos in the latest Pillow versions. Also fix the tests to adapt to the new CNN scores and hashes.

* Clean up code to consolidate preprocessing in one place.

* Add tests for data generator.

* Update documentation to reflect the changes resulting from the use of pytorch MobileNet v3 instead of tensorflow MobileNet v2.

* Update readme to remove tensorflow specifics.

* Update the minimum Pillow version to 9.0 (released at the beginning of January 2022). Also update the package description to reflect Python version support.

* Update requirements.txt and do some cleanup.

* Update travis and azure-pipelines os versions.

Co-authored-by: danidavid <33230485+danidavid@users.noreply.github.com>
Co-authored-by: Dat Tran <datitran@gmail.com>
Co-authored-by: Emilv2 <emil@vanherp.me>
4 people committed Oct 14, 2022
1 parent 3465540 commit 28d4bd6
Showing 22 changed files with 524 additions and 334 deletions.
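
For quick orientation before the per-file diffs, here is a minimal usage sketch of the reworked API. The method signatures and the `recursive` default come from the cnn.py diff below; the `from imagededup.methods import CNN` import path and the directory paths are assumptions for illustration.

```python
from imagededup.methods import CNN  # assumed public import path

cnn = CNN()  # now builds a pytorch MobileNet v3 sliced at the GAP layer

# recursive is off by default and only applies to directory inputs
encodings = cnn.encode_images(image_dir='path/to/images', recursive=True)

duplicates = cnn.find_duplicates(
    image_dir='path/to/images',    # placeholder path
    min_similarity_threshold=0.9,  # default per the signature in the diff
    scores=True,
    recursive=True,
)
```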
6 changes: 5 additions & 1 deletion .travis.yml
@@ -1,6 +1,10 @@
 language: python
+os:
+  - linux
+  - osx
+  - windows
 python:
-  - 3.6
+  - 3.10
 install:
   - pip install "cython>=0.29"
   - pip install -e ".[tests, docs]"
8 changes: 2 additions & 6 deletions README.md
@@ -52,12 +52,6 @@ There are two ways to install imagededup:
 ```bash
 pip install imagededup
 ```
 
-> ⚠️ **Note**: The TensorFlow >=2.1 and TensorFlow 1.15 release now include GPU support by default.
-> Before that CPU and GPU packages are separate. If you have GPUs, you should rather
-> install the TensorFlow version with GPU support especially when you use CNN to find duplicates.
-> It's way faster. See the [TensorFlow guide](https://www.tensorflow.org/install/gpu) for more
-> details on how to install it for older versions of TensorFlow.
-
 * Install imagededup from the GitHub source:
 
 ```bash
@@ -128,6 +122,8 @@ repository.
 For more detailed usage of the package functionality, refer: [https://idealo.github.io/imagededup/](https://idealo.github.io/imagededup/)
 
 ## ⏳ Benchmarks
+**Update**: Provided benchmarks are only valid up to `imagededup v0.2.2`. The next releases have significant changes to all methods, so the current benchmarks may not hold.
+
 Detailed benchmarks on speed and classification metrics for different methods have been provided in the [documentation](https://idealo.github.io/imagededup/user_guide/benchmarks/).
 Generally speaking, following conclusions can be made:
42 changes: 21 additions & 21 deletions azure-pipelines.yml
@@ -16,33 +16,33 @@ jobs:
 - job: 'Test'
   strategy:
     matrix:
-      Python36Linux:
-        imageName: 'ubuntu-16.04'
-        python.version: '3.6'
-      Python36Windows:
-        imageName: 'vs2017-win2016'
-        python.version: '3.6'
-      Python36Mac:
-        imageName: 'macos-10.14'
-        python.version: '3.6'
-      Python37Linux:
-        imageName: 'ubuntu-16.04'
-        python.version: '3.7'
-      Python37Windows:
-        imageName: 'vs2017-win2016'
-        python.version: '3.7'
-      Python37Mac:
-        imageName: 'macos-10.14'
-        python.version: '3.7'
       Python38Linux:
-        imageName: 'ubuntu-16.04'
+        imageName: 'ubuntu-latest'
         python.version: '3.8'
       Python38Windows:
-        imageName: 'vs2017-win2016'
+        imageName: 'windows-latest'
         python.version: '3.8'
       Python38Mac:
-        imageName: 'macos-10.14'
+        imageName: 'macOS-latest'
         python.version: '3.8'
+      Python39Linux:
+        imageName: 'ubuntu-latest'
+        python.version: '3.9'
+      Python39Windows:
+        imageName: 'windows-latest'
+        python.version: '3.9'
+      Python39Mac:
+        imageName: 'macOS-latest'
+        python.version: '3.9'
+      Python310Linux:
+        imageName: 'ubuntu-latest'
+        python.version: '3.10'
+      Python310Windows:
+        imageName: 'windows-latest'
+        python.version: '3.10'
+      Python310Mac:
+        imageName: 'macOS-latest'
+        python.version: '3.10'
       maxParallel: 2
   pool:
     vmImage: $(imageName)
2 changes: 1 addition & 1 deletion imagededup/handlers/search/builtin/builtin.h
@@ -1,5 +1,5 @@
 /* Builtins and Intrinsics
- * Portable Snippets - https://gitub.com/nemequ/portable-snippets
+ * Portable Snippets - https://github.com/nemequ/portable-snippets
  * Created by Evan Nemerson <evan@nemerson.com>
  *
  * To the extent possible under law, the authors have waived all
126 changes: 87 additions & 39 deletions imagededup/methods/cnn.py
@@ -1,14 +1,23 @@
 from pathlib import Path, PurePath
 from typing import Dict, List, Optional, Union
+import warnings
 
 import numpy as np
+from PIL import Image
+import torch
+from torchvision.transforms import transforms
 
 from imagededup.handlers.search.retrieval import get_cosine_similarity
-from imagededup.utils.general_utils import save_json, get_files_to_remove
+from imagededup.utils.data_generator import img_dataloader, MobilenetV3
+from imagededup.utils.general_utils import (
+    generate_relative_names,
+    get_files_to_remove,
+    save_json,
+)
 from imagededup.utils.image_utils import (
+    expand_image_array_cnn,
     load_image,
     preprocess_image,
-    expand_image_array_cnn,
 )
 from imagededup.utils.logger import return_logger
 
@@ -32,21 +41,13 @@ class CNN:
 
     def __init__(self, verbose: bool = True) -> None:
         """
-        Initialize a keras MobileNet model that is sliced at the last convolutional layer.
-        Set the batch size for keras generators to be 64 samples. Set the input image size to (224, 224) for providing
-        as input to MobileNet model.
+        Initialize a pytorch MobileNet model v3 that is sliced at the last convolutional layer.
+        Set the batch size for pytorch dataloader to be 64 samples.
 
         Args:
             verbose: Display progress bar if True else disable it. Default value is True.
         """
-        from tensorflow.keras.applications.mobilenet import MobileNet, preprocess_input
-        from imagededup.utils.data_generator import DataGenerator
-
-        self.MobileNet = MobileNet
-        self.preprocess_input = preprocess_input
-        self.DataGenerator = DataGenerator
-
-        self.target_size = (224, 224)
+        self.target_size = (256, 256)
         self.batch_size = 64
         self.logger = return_logger(
             __name__
@@ -57,17 +58,28 @@ def __init__(self, verbose: bool = True) -> None:
 
     def _build_model(self):
         """
-        Build MobileNet model sliced at the last convolutional layer with global average pooling added.
+        Build MobileNet v3 model sliced at the last convolutional layer with global average pooling added. Also initialize the corresponding preprocessing transform.
         """
-        self.model = self.MobileNet(
-            input_shape=(224, 224, 3), include_top=False, pooling='avg'
-        )
-
+        self.model = MobilenetV3()
         self.logger.info(
-            'Initialized: MobileNet pretrained on ImageNet dataset sliced at last conv layer and added '
-            'GlobalAveragePooling'
+            'Initialized: MobileNet v3 pretrained on ImageNet dataset sliced at GAP layer'
         )
+        self.transform = transforms.Compose(
+            [
+                transforms.Resize(self.target_size),
+                transforms.CenterCrop(224),
+                transforms.ToTensor(),
+                transforms.Normalize(
+                    mean=[0.485, 0.456, 0.406],
+                    std=[0.229, 0.224, 0.225],
+                ),
+            ]
+        )
+
+    def apply_mobilenet_preprocess(self, im_arr: np.array) -> torch.tensor:
+        image_pil = Image.fromarray(im_arr)
+        return self.transform(image_pil)
 
     def _get_cnn_features_single(self, image_array: np.ndarray) -> np.ndarray:
         """
         Generate CNN encodings for a single image.
@@ -78,35 +90,55 @@ def _get_cnn_features_single(self, image_array: np.ndarray) -> np.ndarray:
 
         Returns:
             Encodings for the image in the form of numpy array.
         """
-        image_pp = self.preprocess_input(image_array)
-        image_pp = np.array(image_pp)[np.newaxis, :]
-        return self.model.predict(image_pp)
-
-    def _get_cnn_features_batch(self, image_dir: PurePath) -> Dict[str, np.ndarray]:
+        image_pp = self.apply_mobilenet_preprocess(image_array)
+        image_pp = image_pp.unsqueeze(0)
+        img_features_tensor = self.model(image_pp)
+        return img_features_tensor.detach().numpy()[..., 0, 0]
+
+    def _get_cnn_features_batch(
+        self, image_dir: PurePath, recursive: Optional[bool] = False
+    ) -> Dict[str, np.ndarray]:
         """
         Generate CNN encodings for all images in a given directory of images.
 
         Args:
             image_dir: Path to the image directory.
+            recursive: Optional, find images recursively in the image directory.
 
         Returns:
             A dictionary that contains a mapping of filenames and corresponding numpy array of CNN encodings.
         """
         self.logger.info('Start: Image encoding generation')
-        self.data_generator = self.DataGenerator(
+        self.dataloader = img_dataloader(
             image_dir=image_dir,
             batch_size=self.batch_size,
             target_size=self.target_size,
-            basenet_preprocess=self.preprocess_input,
+            basenet_preprocess=self.apply_mobilenet_preprocess,
+            recursive=recursive,
         )
 
-        feat_vec = self.model.predict_generator(
-            self.data_generator, len(self.data_generator), verbose=self.verbose
-        )
-        self.logger.info('End: Image encoding generation')
+        feat_arr, all_filenames = [], []
+        bad_im_count = 0
 
-        filenames = [i.name for i in self.data_generator.valid_image_files]
+        for ims, filenames, bad_images in self.dataloader:
+            arr = self.model(ims)
+            feat_arr.extend(arr)
+            all_filenames.extend(filenames)
+            if bad_images:
+                bad_im_count += 1
 
-        self.encoding_map = {j: feat_vec[i, :] for i, j in enumerate(filenames)}
+        if bad_im_count:
+            self.logger.info(
+                f'Found {bad_im_count} bad images, ignoring for encoding generation ..'
+            )
+
+        feat_vec = torch.stack(feat_arr).squeeze().detach().numpy()
+        valid_image_files = [filename for filename in all_filenames if filename]
+        self.logger.info('End: Image encoding generation')
+
+        filenames = generate_relative_names(image_dir, valid_image_files)
+        if len(feat_vec.shape) == 1:  # can happen when encode_images is called on a directory containing a single image
+            self.encoding_map = {filenames[0]: feat_vec}
+        else:
+            self.encoding_map = {j: feat_vec[i, :] for i, j in enumerate(filenames)}
         return self.encoding_map
 
     def encode_image(
@@ -143,15 +175,15 @@ def encode_image(
             )
 
             image_pp = load_image(
-                image_file=image_file, target_size=self.target_size, grayscale=False
+                image_file=image_file, target_size=None, grayscale=False
             )
 
         elif isinstance(image_array, np.ndarray):
             image_array = expand_image_array_cnn(
                 image_array
             )  # Add 3rd dimension if array is grayscale, do sanity checks
             image_pp = preprocess_image(
-                image=image_array, target_size=self.target_size, grayscale=False
+                image=image_array, target_size=None, grayscale=False
             )
         else:
             raise ValueError('Please provide either image file path or image array!')
@@ -162,11 +194,14 @@ def encode_image(
             else None
         )
 
-    def encode_images(self, image_dir: Union[PurePath, str]) -> Dict:
+    def encode_images(
+        self, image_dir: Union[PurePath, str], recursive: Optional[bool] = False
+    ) -> Dict:
         """Generate CNN encodings for all images in a given directory of images.
 
         Args:
             image_dir: Path to the image directory.
+            recursive: Optional, find images recursively in the image directory.
 
         Returns:
             dictionary: Contains a mapping of filenames and corresponding numpy array of CNN encodings.
 
         Example:
@@ -182,7 +217,7 @@ def encode_images(self, image_dir: Union[PurePath, str]) -> Dict:
         if not image_dir.is_dir():
             raise ValueError('Please provide a valid directory path!')
 
-        return self._get_cnn_features_batch(image_dir)
+        return self._get_cnn_features_batch(image_dir, recursive)
 
     @staticmethod
     def _check_threshold_bounds(thresh: float) -> None:
@@ -268,6 +303,7 @@ def _find_duplicates_dir(
         min_similarity_threshold: float,
         scores: bool,
         outfile: Optional[str] = None,
+        recursive: Optional[bool] = False,
     ) -> Dict:
         """
         Take in path of the directory in which duplicates are to be detected above the given threshold.
@@ -280,14 +316,15 @@ def _find_duplicates_dir(
             scores: Optional, boolean indicating whether Hamming distances are to be returned along with retrieved
             duplicates.
             outfile: Optional, name of the file the results should be written to.
+            recursive: Optional, find images recursively in the image directory.
 
         Returns:
             if scores is True, then a dictionary of the form {'image1.jpg': [('image1_duplicate1.jpg',
             score), ('image1_duplicate2.jpg', score)], 'image2.jpg': [] ..}
             if scores is False, then a dictionary of the form {'image1.jpg': ['image1_duplicate1.jpg',
             'image1_duplicate2.jpg'], 'image2.jpg':['image1_duplicate1.jpg',..], ..}
         """
-        self.encode_images(image_dir=image_dir)
+        self.encode_images(image_dir=image_dir, recursive=recursive)
 
         return self._find_duplicates_dict(
             encoding_map=self.encoding_map,
@@ -303,6 +340,7 @@ def find_duplicates(
         min_similarity_threshold: float = 0.9,
         scores: bool = False,
         outfile: Optional[str] = None,
+        recursive: Optional[bool] = False,
     ) -> Dict:
         """
         Find duplicates for each file. Take in path of the directory or encoding dictionary in which duplicates are to
@@ -319,6 +357,7 @@ def find_duplicates(
             scores: Optional, boolean indicating whether similarity scores are to be returned along with retrieved
             duplicates.
             outfile: Optional, name of the file to save the results, must be a json. Default is None.
+            recursive: Optional, find images recursively in the image directory.
 
         Returns:
             dictionary: if scores is True, then a dictionary of the form {'image1.jpg': [('image1_duplicate1.jpg',
@@ -349,8 +388,14 @@ def find_duplicates(
                 min_similarity_threshold=min_similarity_threshold,
                 scores=scores,
                 outfile=outfile,
+                recursive=recursive,
             )
         elif encoding_map:
+            if recursive:
+                warnings.warn(
+                    'recursive parameter is irrelevant when using encodings.',
+                    SyntaxWarning,
+                )
             result = self._find_duplicates_dict(
                 encoding_map=encoding_map,
                 min_similarity_threshold=min_similarity_threshold,
@@ -369,6 +414,7 @@ def find_duplicates_to_remove(
         encoding_map: Dict[str, np.ndarray] = None,
         min_similarity_threshold: float = 0.9,
         outfile: Optional[str] = None,
+        recursive: Optional[bool] = False,
     ) -> List:
         """
         Give out a list of image file names to remove based on the similarity threshold. Does not remove the mentioned
@@ -381,6 +427,7 @@ def find_duplicates_to_remove(
                 corresponding CNN encodings.
             min_similarity_threshold: Optional, threshold value (must be float between -1.0 and 1.0). Default is 0.9
             outfile: Optional, name of the file to save the results, must be a json. Default is None.
+            recursive: Optional, find images recursively in the image directory.
 
         Returns:
             duplicates: List of image file names that should be removed.
@@ -406,6 +453,7 @@ def find_duplicates_to_remove(
             encoding_map=encoding_map,
             min_similarity_threshold=min_similarity_threshold,
             scores=False,
+            recursive=recursive,
        )
 
         files_to_remove = get_files_to_remove(duplicates)
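
To make the new preprocessing concrete, here is a standalone sketch (not part of the commit) of the pipeline that `_build_model` installs. It uses only the torchvision transforms imported in the diff above; the stand-in image array is an assumption for illustration.

```python
import numpy as np
from PIL import Image
from torchvision.transforms import transforms

# Mirrors self.transform from _build_model: resize, center-crop, scale, normalize
transform = transforms.Compose(
    [
        transforms.Resize((256, 256)),  # self.target_size in the diff
        transforms.CenterCrop(224),     # crop to the MobileNet input resolution
        transforms.ToTensor(),          # HWC uint8 -> CHW float32 in [0, 1]
        transforms.Normalize(
            mean=[0.485, 0.456, 0.406],
            std=[0.229, 0.224, 0.225],  # ImageNet statistics
        ),
    ]
)

im_arr = np.random.randint(0, 255, (300, 400, 3), dtype=np.uint8)  # stand-in image
batch = transform(Image.fromarray(im_arr)).unsqueeze(0)  # as in _get_cnn_features_single
print(batch.shape)  # torch.Size([1, 3, 224, 224])

# The sliced model returns a (batch, channels, 1, 1) feature map, which is why
# _get_cnn_features_single indexes with [..., 0, 0] to drop the spatial dims.
```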
