[Feature Request]: enhance with EXIF data, specifically geodata and datetimeoriginal #253

Open
ohade opened this issue Aug 19, 2023 · 3 comments

Comments

@ohade

ohade commented Aug 19, 2023

Feature Name

[Feature Request]: Enhance fastdup with EXIF Data Integration, Including Geodata and DateTimeOriginal

Feature Description

  1. What does the feature do?
    Integrates EXIF data (geodata and DateTimeOriginal) into fastdup, allowing for more nuanced sorting, filtering, and deduplication by recognizing the original images.

  2. Why do you think it's important?
    EXIF data provides essential context and can detect original images among duplicates, thereby preserving crucial metadata. It's vital for industries that require location and time-specific insights.

  3. How will it benefit users?
    Users will gain richer insights, more accurate deduplication, and the preservation of important metadata. This will increase dataset quality, streamline data operations, and potentially reduce costs.

Contact Information [Optional]

No response

@dbickson
Collaborator

dbickson commented Aug 19, 2023

Hi @ohade, this sounds like a good feature request to add.

  • Can you please point us to a few example images containing EXIF data that we can use to test the support?
  • In addition, what functionality would you like to have once we read the EXIF information? For example, assume two images are duplicates but their EXIF data differs. What would you like to do in this case? Would it be useful to show EXIF data in the galleries, for example for outliers?
    Thanks

@ohade
Author

ohade commented Aug 19, 2023

Hi,
Regarding the EXIF format: https://en.wikipedia.org/wiki/Exif
Regarding the data that can be extracted and how:
take any picture taken on a mobile phone and run the following Python code:

from PIL import Image
from PIL.ExifTags import TAGS, GPSTAGS

GPS_INFO_TAG = 0x8825  # 34853: pointer from the base IFD to the GPS IFD

def get_exif_data(image_path):
    # Print every tag in the base IFD that Pillow knows by name.
    with Image.open(image_path) as img:
        image_exif = img.getexif()
        for key, val in image_exif.items():
            if key in TAGS:
                print(f"ID: {key}, TAG: {TAGS[key]}, VAL: {val}")


def get_geotagging(exif):
    if not exif:
        raise ValueError("No EXIF metadata found")
    if GPS_INFO_TAG not in exif:
        raise ValueError("No EXIF geotagging found")

    # getexif() exposes GPSInfo only as an offset, so read the GPS IFD itself
    # through the public get_ifd() call instead of the private _getexif().
    gps_ifd = exif.get_ifd(GPS_INFO_TAG)
    return {GPSTAGS[key]: val for key, val in gps_ifd.items() if key in GPSTAGS}

def get_location(image_path):
    with Image.open(image_path) as image:
        geotagging = get_geotagging(image.getexif())
    for key, val in geotagging.items():
        print(key, val)

image_path = ...
get_exif_data(image_path)
get_location(image_path)

For example, here is the data I extracted from a picture taken on my Samsung Android phone:

**get_exif_data(image_path)**
ID: 256, TAG: ImageWidth, VAL: 4000
ID: 257, TAG: ImageLength, VAL: 3000
ID: 34853, TAG: GPSInfo, VAL: 696
ID: 296, TAG: ResolutionUnit, VAL: 2
ID: 34665, TAG: ExifOffset, VAL: 238
ID: 271, TAG: Make, VAL: samsung
ID: 272, TAG: Model, VAL: SM-G998B
ID: 305, TAG: Software, VAL: G998BXXU5CVDD
ID: 274, TAG: Orientation, VAL: 6
ID: 306, TAG: DateTime, VAL: 2022:06:18 10:56:28
ID: 531, TAG: YCbCrPositioning, VAL: 1
ID: 282, TAG: XResolution, VAL: 72.0
ID: 283, TAG: YResolution, VAL: 72.0

**get_location(image_path)**
GPSLatitudeRef N
GPSLatitude (32.0, 5.0, 32.412119)
GPSLongitudeRef E
GPSLongitude (34.0, 49.0, 3.7128)
GPSAltitudeRef 0
GPSAltitude 64.0
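
The GPS values above are (degrees, minutes, seconds) triples plus a hemisphere reference, which most clustering or mapping code would first turn into signed decimal degrees. A minimal sketch of that conversion follows; the helper name `dms_to_decimal` is hypothetical and not part of Pillow or fastdup.

```python
def dms_to_decimal(dms, ref):
    """Convert an EXIF (degrees, minutes, seconds) triple to signed decimal degrees."""
    # float() also accepts the rational values Pillow may return for GPS tags.
    degrees, minutes, seconds = (float(part) for part in dms)
    decimal = degrees + minutes / 60.0 + seconds / 3600.0
    # South latitudes and West longitudes are negative by convention.
    return -decimal if ref in ("S", "W") else decimal


# Values copied from the sample output above.
latitude = dms_to_decimal((32.0, 5.0, 32.412119), "N")
longitude = dms_to_decimal((34.0, 49.0, 3.7128), "E")
print(f"{latitude:.6f}, {longitude:.6f}")
```

The sign flip on `S` and `W` is what lets the result be passed directly to functions that expect plain latitude and longitude floats.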

So a few use cases:

  1. I use it to determine which copy of an image is the original. Copied pictures are usually stripped of their geodata, so the copies are easy to spot. The DateTime tag also helps to track down the original.
  2. Knowing which picture is the original helps in understanding what the original content was and which transformations it went through over time. I am talking about a sequence of copies, each one a copy of the previous generation with small changes. That can create a sort of history path that leads you to the source.
  3. Two identical images with different EXIF data will probably indicate a bug in my code. EXIF data is not a fingerprint, but it is close enough.
  4. Time and location can be strong features for users who want to cluster their images.
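
Use case 1 could be sketched roughly as follows, assuming the duplicate groups and the EXIF fields have already been extracted. The record layout (`path`, `gps`, `datetime` keys) and the function names are hypothetical, not an existing fastdup API. The heuristic prefers images that still carry geodata and breaks ties by the earliest DateTime.

```python
from datetime import datetime

EXIF_DATETIME_FORMAT = "%Y:%m:%d %H:%M:%S"


def exif_sort_key(record):
    """Rank a record: geotagged images first, then the earliest DateTime."""
    has_gps = bool(record.get("gps"))
    taken = record.get("datetime")
    # Records without a DateTime sort last within their group.
    parsed = datetime.strptime(taken, EXIF_DATETIME_FORMAT) if taken else datetime.max
    return (not has_gps, parsed)


def pick_likely_original(duplicate_group):
    """Return the record most likely to be the original in a group of duplicates."""
    return min(duplicate_group, key=exif_sort_key)


group = [
    {"path": "copy.jpg", "gps": None, "datetime": "2022:06:18 10:56:28"},
    {"path": "original.jpg", "gps": {"GPSLatitudeRef": "N"}, "datetime": "2022:06:18 10:56:28"},
    {"path": "later.jpg", "gps": {"GPSLatitudeRef": "N"}, "datetime": "2022:07:01 09:00:00"},
]
print(pick_likely_original(group)["path"])
```

This is only a heuristic: a copy that keeps its EXIF block intact would tie with the original, so it is best treated as a ranking hint rather than proof.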

@ohade
Author

ohade commented Aug 19, 2023

Also, I use the geodata to convert the EXIF datetime to UTC, like so:

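from datetime import datetime, timezone

import pendulum
from timezonefinder import TimezoneFinder

def fix_timestamp_using_geoDataExif(latitude, longitude, timestamp):
    # (0, 0) is treated as "no GPS fix", so the timestamp is returned unchanged.
    if latitude == 0 and longitude == 0:
        return timestamp

    tf = TimezoneFinder()
    time_zone_str = tf.timezone_at(lat=latitude, lng=longitude)

    if not time_zone_str:
        return timestamp

    # Assumption: `timestamp` encodes the camera's local wall-clock time as if it were UTC.
    # pendulum.from_timestamp(timestamp, tz=...) would keep the same instant, so the
    # wall-clock fields are recovered first and then reinterpreted in the photo's timezone.
    wall_clock = datetime.fromtimestamp(timestamp, tz=timezone.utc).replace(tzinfo=None)
    local_time = pendulum.instance(wall_clock, tz=time_zone_str)
    utc_time = local_time.in_timezone('UTC')
    return int(utc_time.timestamp())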
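
The function above takes a numeric `timestamp`, while EXIF stores times as strings such as `2022:06:18 10:56:28` with no timezone attached. One way to bridge the two, sketched here with the standard library only, is to parse the string and encode its wall-clock fields as if they were UTC. The helper name is hypothetical, and treating the value as UTC is an assumption about how the input timestamp is meant to be built.

```python
import calendar
from datetime import datetime

EXIF_DATETIME_FORMAT = "%Y:%m:%d %H:%M:%S"


def exif_datetime_to_wall_clock_timestamp(value):
    """Parse an EXIF DateTime string and encode its wall-clock fields as a UTC epoch."""
    parsed = datetime.strptime(value, EXIF_DATETIME_FORMAT)
    # timegm() reads the fields as UTC, so the result does not depend on the
    # timezone of the machine running the code.
    return calendar.timegm(parsed.timetuple())


# Value copied from the DateTime tag in the sample output earlier in the thread.
wall_clock_ts = exif_datetime_to_wall_clock_timestamp("2022:06:18 10:56:28")
print(wall_clock_ts)
```

The same parsing applies to DateTimeOriginal (tag 36867), which lives in the Exif sub-IFD rather than in the base IFD shown in the sample output.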
