Fix installation by replacing tensorflow with pytorch for CNN embeddings (#175)

* Bump version to 0.2.4.

* Update builtin.h (#123)

* Update readme to reflect changes in TF 2.1.

* Bump version to 0.2.4. (#122)

* Update builtin.h

Respelled the misspelled URL keyword (gitub.com), which was taking readers to a malicious site.

Co-authored-by: Dat Tran <datitran@gmail.com>
Co-authored-by: Tanuj Jain <tanujjain@users.noreply.github.com>

* Add recursive option in encode_image() (#104)

* Update tests for new recursive option

* Add recursive option

to the following functions:
encode_image()
find_duplicates()
find_duplicates_to_remove()

Recursive is off by default.

* Add tests for recursive option

* Modify tests to ignore hidden '.DS_Store' files that are automatically created on macOS by the Finder application. (#131)

* Port CNN to pytorch, other major changes (#173)

* Add feature generation with mobilenet v3.

* Integrate multi image encoding generation in CNN class.

* Update tests to match the new CNN embeddings as well as the newly generated hashes.

* Change antialias resampling to lanczos, as antialias is deprecated and maps to lanczos in the latest Pillow versions. Also fix the tests to adapt to the new CNN scores and hashes.

* Clean up code to consolidate preprocessing in one place.

* Add tests for data generator.

* Update documentation to reflect the changes resulting from the use of pytorch MobileNet v3 instead of tensorflow MobileNet v2.

* Update readme to remove tensorflow specifics.

* Update the minimum Pillow version to 9.0 (released at the beginning of January 2022). Also update the package description to reflect Python version support.

* Update requirements.txt and do some cleanup.

* Update travis and azure-pipelines os versions.

Co-authored-by: danidavid <33230485+danidavid@users.noreply.github.com>
Co-authored-by: Dat Tran <datitran@gmail.com>
Co-authored-by: Emilv2 <emil@vanherp.me>
4 people committed Oct 14, 2022
1 parent 3465540 commit 28d4bd6
Showing 22 changed files with 524 additions and 334 deletions.
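
For quick orientation before the per-file diffs, here is a minimal usage sketch of the reworked API. The method signatures and the `recursive` default come from the cnn.py diff below; the `from imagededup.methods import CNN` import path and the directory paths are assumptions for illustration.

```python
from imagededup.methods import CNN  # assumed public import path

cnn = CNN()  # now builds a pytorch MobileNet v3 sliced at the GAP layer

# recursive is off by default and only applies to directory inputs
encodings = cnn.encode_images(image_dir='path/to/images', recursive=True)

duplicates = cnn.find_duplicates(
    image_dir='path/to/images',    # placeholder path
    min_similarity_threshold=0.9,  # default per the signature in the diff
    scores=True,
    recursive=True,
)
```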
6 changes: 5 additions & 1 deletion .travis.yml
@@ -1,6 +1,10 @@
 language: python
+os:
+  - linux
+  - osx
+  - windows
 python:
-  - 3.6
+  - 3.10
 install:
   - pip install "cython>=0.29"
   - pip install -e ".[tests, docs]"
8 changes: 2 additions & 6 deletions README.md
@@ -52,12 +52,6 @@ There are two ways to install imagededup:
 ```bash
 pip install imagededup
 ```
 
-> ⚠️ **Note**: The TensorFlow >=2.1 and TensorFlow 1.15 release now include GPU support by default.
-> Before that CPU and GPU packages are separate. If you have GPUs, you should rather
-> install the TensorFlow version with GPU support especially when you use CNN to find duplicates.
-> It's way faster. See the [TensorFlow guide](https://www.tensorflow.org/install/gpu) for more
-> details on how to install it for older versions of TensorFlow.
-
 * Install imagededup from the GitHub source:
 
 ```bash
@@ -128,6 +122,8 @@ repository.
 For more detailed usage of the package functionality, refer: [https://idealo.github.io/imagededup/](https://idealo.github.io/imagededup/)
 
 ## ⏳ Benchmarks
+**Update**: Provided benchmarks are only valid up to `imagededup v0.2.2`. The next releases have significant changes to all methods, so the current benchmarks may not hold.
+
 Detailed benchmarks on speed and classification metrics for different methods have been provided in the [documentation](https://idealo.github.io/imagededup/user_guide/benchmarks/).
 Generally speaking, following conclusions can be made:
42 changes: 21 additions & 21 deletions azure-pipelines.yml
@@ -16,33 +16,33 @@ jobs:
 - job: 'Test'
   strategy:
     matrix:
-      Python36Linux:
-        imageName: 'ubuntu-16.04'
-        python.version: '3.6'
-      Python36Windows:
-        imageName: 'vs2017-win2016'
-        python.version: '3.6'
-      Python36Mac:
-        imageName: 'macos-10.14'
-        python.version: '3.6'
-      Python37Linux:
-        imageName: 'ubuntu-16.04'
-        python.version: '3.7'
-      Python37Windows:
-        imageName: 'vs2017-win2016'
-        python.version: '3.7'
-      Python37Mac:
-        imageName: 'macos-10.14'
-        python.version: '3.7'
       Python38Linux:
-        imageName: 'ubuntu-16.04'
+        imageName: 'ubuntu-latest'
         python.version: '3.8'
       Python38Windows:
-        imageName: 'vs2017-win2016'
+        imageName: 'windows-latest'
         python.version: '3.8'
       Python38Mac:
-        imageName: 'macos-10.14'
+        imageName: 'macOS-latest'
         python.version: '3.8'
+      Python39Linux:
+        imageName: 'ubuntu-latest'
+        python.version: '3.9'
+      Python39Windows:
+        imageName: 'windows-latest'
+        python.version: '3.9'
+      Python39Mac:
+        imageName: 'macOS-latest'
+        python.version: '3.9'
+      Python310Linux:
+        imageName: 'ubuntu-latest'
+        python.version: '3.10'
+      Python310Windows:
+        imageName: 'windows-latest'
+        python.version: '3.10'
+      Python310Mac:
+        imageName: 'macOS-latest'
+        python.version: '3.10'
       maxParallel: 2
   pool:
     vmImage: $(imageName)
2 changes: 1 addition & 1 deletion imagededup/handlers/search/builtin/builtin.h
@@ -1,5 +1,5 @@
 /* Builtins and Intrinsics
- * Portable Snippets - https://gitub.com/nemequ/portable-snippets
+ * Portable Snippets - https://github.com/nemequ/portable-snippets
  * Created by Evan Nemerson <evan@nemerson.com>
  *
  * To the extent possible under law, the authors have waived all
126 changes: 87 additions & 39 deletions imagededup/methods/cnn.py
@@ -1,14 +1,23 @@
 from pathlib import Path, PurePath
 from typing import Dict, List, Optional, Union
+import warnings
 
 import numpy as np
+from PIL import Image
+import torch
+from torchvision.transforms import transforms
 
 from imagededup.handlers.search.retrieval import get_cosine_similarity
-from imagededup.utils.general_utils import save_json, get_files_to_remove
+from imagededup.utils.data_generator import img_dataloader, MobilenetV3
+from imagededup.utils.general_utils import (
+    generate_relative_names,
+    get_files_to_remove,
+    save_json,
+)
 from imagededup.utils.image_utils import (
+    expand_image_array_cnn,
     load_image,
     preprocess_image,
-    expand_image_array_cnn,
 )
 from imagededup.utils.logger import return_logger
 
@@ -32,21 +41,13 @@ class CNN:
 
     def __init__(self, verbose: bool = True) -> None:
         """
-        Initialize a keras MobileNet model that is sliced at the last convolutional layer.
-        Set the batch size for keras generators to be 64 samples. Set the input image size to (224, 224) for providing
-        as input to MobileNet model.
+        Initialize a pytorch MobileNet model v3 that is sliced at the last convolutional layer.
+        Set the batch size for pytorch dataloader to be 64 samples.
 
         Args:
             verbose: Display progress bar if True else disable it. Default value is True.
         """
-        from tensorflow.keras.applications.mobilenet import MobileNet, preprocess_input
-        from imagededup.utils.data_generator import DataGenerator
-
-        self.MobileNet = MobileNet
-        self.preprocess_input = preprocess_input
-        self.DataGenerator = DataGenerator
-
-        self.target_size = (224, 224)
+        self.target_size = (256, 256)
         self.batch_size = 64
         self.logger = return_logger(
             __name__
@@ -57,17 +58,28 @@ def __init__(self, verbose: bool = True) -> None:
 
     def _build_model(self):
         """
-        Build MobileNet model sliced at the last convolutional layer with global average pooling added.
+        Build MobileNet v3 model sliced at the last convolutional layer with global average pooling added. Also initialize the corresponding preprocessing transform.
         """
-        self.model = self.MobileNet(
-            input_shape=(224, 224, 3), include_top=False, pooling='avg'
-        )
-
+        self.model = MobilenetV3()
         self.logger.info(
-            'Initialized: MobileNet pretrained on ImageNet dataset sliced at last conv layer and added '
-            'GlobalAveragePooling'
+            'Initialized: MobileNet v3 pretrained on ImageNet dataset sliced at GAP layer'
         )
+        self.transform = transforms.Compose(
+            [
+                transforms.Resize(self.target_size),
+                transforms.CenterCrop(224),
+                transforms.ToTensor(),
+                transforms.Normalize(
+                    mean=[0.485, 0.456, 0.406],
+                    std=[0.229, 0.224, 0.225],
+                ),
+            ]
+        )
+
+    def apply_mobilenet_preprocess(self, im_arr: np.array) -> torch.tensor:
+        image_pil = Image.fromarray(im_arr)
+        return self.transform(image_pil)
 
     def _get_cnn_features_single(self, image_array: np.ndarray) -> np.ndarray:
         """
         Generate CNN encodings for a single image.
@@ -78,35 +90,55 @@ def _get_cnn_features_single(self, image_array: np.ndarray) -> np.ndarray:
 
         Returns:
             Encodings for the image in the form of numpy array.
         """
-        image_pp = self.preprocess_input(image_array)
-        image_pp = np.array(image_pp)[np.newaxis, :]
-        return self.model.predict(image_pp)
-
-    def _get_cnn_features_batch(self, image_dir: PurePath) -> Dict[str, np.ndarray]:
+        image_pp = self.apply_mobilenet_preprocess(image_array)
+        image_pp = image_pp.unsqueeze(0)
+        img_features_tensor = self.model(image_pp)
+        return img_features_tensor.detach().numpy()[..., 0, 0]
+
+    def _get_cnn_features_batch(
+        self, image_dir: PurePath, recursive: Optional[bool] = False
+    ) -> Dict[str, np.ndarray]:
         """
         Generate CNN encodings for all images in a given directory of images.
 
         Args:
             image_dir: Path to the image directory.
+            recursive: Optional, find images recursively in the image directory.
 
         Returns:
             A dictionary that contains a mapping of filenames and corresponding numpy array of CNN encodings.
         """
         self.logger.info('Start: Image encoding generation')
-        self.data_generator = self.DataGenerator(
+        self.dataloader = img_dataloader(
             image_dir=image_dir,
             batch_size=self.batch_size,
             target_size=self.target_size,
-            basenet_preprocess=self.preprocess_input,
+            basenet_preprocess=self.apply_mobilenet_preprocess,
+            recursive=recursive,
         )
 
-        feat_vec = self.model.predict_generator(
-            self.data_generator, len(self.data_generator), verbose=self.verbose
-        )
-        self.logger.info('End: Image encoding generation')
+        feat_arr, all_filenames = [], []
+        bad_im_count = 0
 
-        filenames = [i.name for i in self.data_generator.valid_image_files]
+        for ims, filenames, bad_images in self.dataloader:
+            arr = self.model(ims)
+            feat_arr.extend(arr)
+            all_filenames.extend(filenames)
+            if bad_images:
+                bad_im_count += 1
 
-        self.encoding_map = {j: feat_vec[i, :] for i, j in enumerate(filenames)}
+        if bad_im_count:
+            self.logger.info(
+                f'Found {bad_im_count} bad images, ignoring for encoding generation ..'
+            )
+
+        feat_vec = torch.stack(feat_arr).squeeze().detach().numpy()
+        valid_image_files = [filename for filename in all_filenames if filename]
+        self.logger.info('End: Image encoding generation')
+
+        filenames = generate_relative_names(image_dir, valid_image_files)
+        if len(feat_vec.shape) == 1:  # can happen when encode_images is called on a directory containing a single image
+            self.encoding_map = {filenames[0]: feat_vec}
+        else:
+            self.encoding_map = {j: feat_vec[i, :] for i, j in enumerate(filenames)}
         return self.encoding_map
 
     def encode_image(
@@ -143,15 +175,15 @@ def encode_image(
             )
 
             image_pp = load_image(
-                image_file=image_file, target_size=self.target_size, grayscale=False
+                image_file=image_file, target_size=None, grayscale=False
             )
 
         elif isinstance(image_array, np.ndarray):
             image_array = expand_image_array_cnn(
                 image_array
             )  # Add 3rd dimension if array is grayscale, do sanity checks
             image_pp = preprocess_image(
-                image=image_array, target_size=self.target_size, grayscale=False
+                image=image_array, target_size=None, grayscale=False
             )
         else:
             raise ValueError('Please provide either image file path or image array!')
@@ -162,11 +194,14 @@ def encode_image(
             else None
         )
 
-    def encode_images(self, image_dir: Union[PurePath, str]) -> Dict:
+    def encode_images(
+        self, image_dir: Union[PurePath, str], recursive: Optional[bool] = False
+    ) -> Dict:
         """Generate CNN encodings for all images in a given directory of images.
 
         Args:
             image_dir: Path to the image directory.
+            recursive: Optional, find images recursively in the image directory.
 
         Returns:
             dictionary: Contains a mapping of filenames and corresponding numpy array of CNN encodings.
 
         Example:
@@ -182,7 +217,7 @@ def encode_images(self, image_dir: Union[PurePath, str]) -> Dict:
         if not image_dir.is_dir():
             raise ValueError('Please provide a valid directory path!')
 
-        return self._get_cnn_features_batch(image_dir)
+        return self._get_cnn_features_batch(image_dir, recursive)
 
     @staticmethod
     def _check_threshold_bounds(thresh: float) -> None:
@@ -268,6 +303,7 @@ def _find_duplicates_dir(
         min_similarity_threshold: float,
         scores: bool,
         outfile: Optional[str] = None,
+        recursive: Optional[bool] = False,
     ) -> Dict:
         """
         Take in path of the directory in which duplicates are to be detected above the given threshold.
@@ -280,14 +316,15 @@ def _find_duplicates_dir(
             scores: Optional, boolean indicating whether Hamming distances are to be returned along with retrieved
             duplicates.
             outfile: Optional, name of the file the results should be written to.
+            recursive: Optional, find images recursively in the image directory.
 
         Returns:
             if scores is True, then a dictionary of the form {'image1.jpg': [('image1_duplicate1.jpg',
             score), ('image1_duplicate2.jpg', score)], 'image2.jpg': [] ..}
             if scores is False, then a dictionary of the form {'image1.jpg': ['image1_duplicate1.jpg',
             'image1_duplicate2.jpg'], 'image2.jpg':['image1_duplicate1.jpg',..], ..}
         """
-        self.encode_images(image_dir=image_dir)
+        self.encode_images(image_dir=image_dir, recursive=recursive)
 
         return self._find_duplicates_dict(
             encoding_map=self.encoding_map,
@@ -303,6 +340,7 @@ def find_duplicates(
         min_similarity_threshold: float = 0.9,
         scores: bool = False,
         outfile: Optional[str] = None,
+        recursive: Optional[bool] = False,
     ) -> Dict:
         """
         Find duplicates for each file. Take in path of the directory or encoding dictionary in which duplicates are to
@@ -319,6 +357,7 @@ def find_duplicates(
             scores: Optional, boolean indicating whether similarity scores are to be returned along with retrieved
             duplicates.
             outfile: Optional, name of the file to save the results, must be a json. Default is None.
+            recursive: Optional, find images recursively in the image directory.
 
         Returns:
             dictionary: if scores is True, then a dictionary of the form {'image1.jpg': [('image1_duplicate1.jpg',
@@ -349,8 +388,14 @@ def find_duplicates(
                 min_similarity_threshold=min_similarity_threshold,
                 scores=scores,
                 outfile=outfile,
+                recursive=recursive,
             )
         elif encoding_map:
+            if recursive:
+                warnings.warn(
+                    'recursive parameter is irrelevant when using encodings.',
+                    SyntaxWarning,
+                )
             result = self._find_duplicates_dict(
                 encoding_map=encoding_map,
                 min_similarity_threshold=min_similarity_threshold,
@@ -369,6 +414,7 @@ def find_duplicates_to_remove(
         encoding_map: Dict[str, np.ndarray] = None,
         min_similarity_threshold: float = 0.9,
         outfile: Optional[str] = None,
+        recursive: Optional[bool] = False,
     ) -> List:
         """
         Give out a list of image file names to remove based on the similarity threshold. Does not remove the mentioned
@@ -381,6 +427,7 @@ def find_duplicates_to_remove(
                 corresponding CNN encodings.
             min_similarity_threshold: Optional, threshold value (must be float between -1.0 and 1.0). Default is 0.9
             outfile: Optional, name of the file to save the results, must be a json. Default is None.
+            recursive: Optional, find images recursively in the image directory.
 
         Returns:
             duplicates: List of image file names that should be removed.
@@ -406,6 +453,7 @@ def find_duplicates_to_remove(
             encoding_map=encoding_map,
             min_similarity_threshold=min_similarity_threshold,
             scores=False,
+            recursive=recursive,
        )
 
         files_to_remove = get_files_to_remove(duplicates)
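
To make the new preprocessing concrete, here is a standalone sketch (not part of the commit) of the pipeline that `_build_model` installs. It uses only the torchvision transforms imported in the diff above; the stand-in image array is an assumption for illustration.

```python
import numpy as np
from PIL import Image
from torchvision.transforms import transforms

# Mirrors self.transform from _build_model: resize, center-crop, scale, normalize
transform = transforms.Compose(
    [
        transforms.Resize((256, 256)),  # self.target_size in the diff
        transforms.CenterCrop(224),     # crop to the MobileNet input resolution
        transforms.ToTensor(),          # HWC uint8 -> CHW float32 in [0, 1]
        transforms.Normalize(
            mean=[0.485, 0.456, 0.406],
            std=[0.229, 0.224, 0.225],  # ImageNet statistics
        ),
    ]
)

im_arr = np.random.randint(0, 255, (300, 400, 3), dtype=np.uint8)  # stand-in image
batch = transform(Image.fromarray(im_arr)).unsqueeze(0)  # as in _get_cnn_features_single
print(batch.shape)  # torch.Size([1, 3, 224, 224])

# The sliced model returns a (batch, channels, 1, 1) feature map, which is why
# _get_cnn_features_single indexes with [..., 0, 0] to drop the spatial dims.
```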
