detect-exif 📷 — Protecting privacy one image at a time

Image credit: DALL·E 3
Privacy issues with uploading photos

I recently decided to update my personal website with some new photos from recent events. Little did I know this would lead me down a privacy rabbit hole and eventually to creating a new open-source tool!

While preparing these images for upload, I randomly checked one photo’s properties and was surprised to discover just how much personal information was embedded in the Exif (Exchangeable Image File Format) metadata:

  • Precise GPS coordinates of where the photo was taken
  • My phone’s make and model
  • The exact date and time the photo was taken

I immediately wondered: how many other images on my website or in my public repositories were silently sitting with this data?

The Problem with Exif Data

Exif is a standard that specifies the formats for images and the metadata embedded in them. This metadata serves useful purposes for photographers and image processing software (and also to display a cool map of your photos in Apple Photos), but from a privacy perspective, it can be problematic.

Here’s what surprised me most about Exif data:

  1. It persists through many editing operations - even after cropping or resizing, the GPS data often remains.
  2. Many publishing platforms and social media sites strip Exif data automatically, but many don’t.
  3. Most importantly, I couldn’t find good tools for catching this issue in a development workflow.

Building detect-exif

After discovering this privacy concern and facing the daunting task of manually going through all photos on my website, I decided to build a tool that would:

  1. Easily detect potentially sensitive metadata in images
  2. Provide clear, human-readable information about what was found
  3. Optionally clean images while preserving their visual quality (or attempt to at least)
  4. Work as a pre-commit hook to catch issues before they’re committed

Let’s dive into how I built it and how it works!

The Technical Approach

The code for this project was developed with help from Claude Sonnet 3.7, so please do take it with a grain of salt. I have read through it, but I cannot say for certain that every byte code refers to what the code says it does, or that they are all correct. I am actually surprised that Claude could pull them from its memory stores, and that it actually worked! This might need a post of its own…

I built the tool using Python with the Pillow library, which provides excellent image handling capabilities. Here’s how the core detection works:

def is_unsafe_exif(img):
    """
    Check if the image has any non-technical EXIF data that could be private.
    Instead of using a safe list approach, we'll assume all EXIF is unsafe
    except for purely technical image parameters.
    """
    if not hasattr(img, "getexif"):
        return False, {}

    exif_data = img.getexif()
    if not exif_data:
        return False, {}

    # Check if the image has any EXIF data at all (except orientation)
    has_exif = False
    for tag_id in exif_data:
        # Skip orientation tag as that's purely technical
        if tag_id != 0x0112 and exif_data[tag_id]:
            has_exif = True
            break

    if not has_exif:
        return False, {}

    # Get the full EXIF data for deeper checks
    full_exif = None
    if hasattr(img, "_getexif"):
        full_exif = img._getexif() or {}
    else:
        full_exif = exif_data

    # Collect information for reporting
    special_info = {}

    # Check for special tags we want to highlight
    # (SPECIAL_CHECK_TAGS is a module-level {tag_id: tag_name} dict, not shown here)
    for tag_id, tag_name in SPECIAL_CHECK_TAGS.items():
        if tag_id in full_exif and full_exif[tag_id]:
            special_info[tag_name] = full_exif[tag_id]

    # Special check for GPS information
    gps_info = get_gps_info(full_exif)
    if gps_info:
        special_info["GPS Coordinates"] = gps_info

    return True, special_info

Rather than using a complex classification system, I went with a simple approach: if an image has any Exif data beyond purely technical parameters (like orientation), it’s flagged. This may occasionally produce false positives, but when it comes to privacy, I think it’s better to err on the side of caution. It is also a neat illustration of precision vs recall: here I want maximum recall.

This approach is a classic example of the precision–recall tradeoff in detection systems. In machine learning terms, precision is how many of the flagged images actually contain private data, while recall is how many of the total private data items you successfully catch. For privacy protection, I’m prioritizing maximum recall (catching all potential privacy leaks) even at the cost of some precision (occasionally flagging non-problematic metadata).
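To make the tradeoff concrete, here is a minimal sketch that computes both numbers for a made-up batch of files. The file names and the `precision_recall` helper are hypothetical and only serve the illustration; they are not part of detect-exif.

```python
def precision_recall(flagged, truly_private):
    """Return (precision, recall) for a set of flagged files."""
    flagged, truly_private = set(flagged), set(truly_private)
    true_positives = len(flagged & truly_private)
    # Precision: what share of the flagged files really were a problem?
    precision = true_positives / len(flagged) if flagged else 1.0
    # Recall: what share of the real problems did we flag?
    recall = true_positives / len(truly_private) if truly_private else 1.0
    return precision, recall


# Hypothetical batch: three files carry non-technical Exif, two of them really leak something.
flagged = {"a.jpg", "b.jpg", "c.jpg"}
truly_private = {"a.jpg", "b.jpg"}

precision, recall = precision_recall(flagged, truly_private)
print(f"precision={precision:.2f} recall={recall:.2f}")  # precision=0.67 recall=1.00
```

Flagging everything costs one false alarm here (`c.jpg`), but nothing private slips through, which is the outcome I care about.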

Extracting Human-Readable GPS Information

An interesting challenge was extracting and presenting GPS data in a human-readable format:

def get_gps_info(exif_data):
    """
    Extract human-readable GPS information from EXIF data if present.
    """
    if not exif_data:
        return None

    # Check for GPSInfo tag
    gps_info = None
    if 0x8825 in exif_data:
        gps_data = exif_data[0x8825]
        if not isinstance(gps_data, dict):
            return "GPS data present but unreadable"

        # Extract latitude
        if 2 in gps_data and 1 in gps_data:
            lat = gps_data[2]
            lat_ref = gps_data[1]
            if isinstance(lat, tuple) and len(lat) == 3:
                try:
                    lat_value = lat[0] + lat[1] / 60 + lat[2] / 3600
                    if lat_ref == "S":
                        lat_value = -lat_value
                    gps_info = f"Lat: {lat_value:.6f}"
                except (TypeError, ZeroDivisionError):
                    gps_info = "Latitude present but unreadable"

        # Extract longitude
        if 4 in gps_data and 3 in gps_data:
            lon = gps_data[4]
            lon_ref = gps_data[3]
            if isinstance(lon, tuple) and len(lon) == 3:
                try:
                    lon_value = lon[0] + lon[1] / 60 + lon[2] / 3600
                    if lon_ref == "W":
                        lon_value = -lon_value
                    gps_info = f"{gps_info or ''} Lon: {lon_value:.6f}".strip()
                except (TypeError, ZeroDivisionError):
                    gps_info = (
                        f"{gps_info or ''} Longitude present but unreadable".strip()
                    )

        # If we couldn't parse specific values but GPS data exists
        if not gps_info and gps_data:
            return "GPS data present"

    return gps_info

GPS coordinates in Exif are stored as degrees, minutes, and seconds plus a hemisphere reference (N/S or E/W), so I added code to convert them to signed decimal degrees, which are easier to read.
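The conversion itself is plain arithmetic, so it can be pulled out and tried in isolation. The sketch below uses a hypothetical helper name (`dms_to_decimal`) and made-up coordinates; it mirrors the formula in the function above rather than replacing it.

```python
def dms_to_decimal(dms, ref):
    """Convert (degrees, minutes, seconds) plus a hemisphere letter to decimal degrees."""
    degrees, minutes, seconds = (float(part) for part in dms)
    value = degrees + minutes / 60 + seconds / 3600
    # South and West are expressed as negative values in decimal notation
    return -value if ref in ("S", "W") else value


# Made-up example values, not taken from a real photo
lat = dms_to_decimal((51, 30, 26.0), "N")
lon = dms_to_decimal((0, 7, 39.0), "W")
print(f"Lat: {lat:.6f} Lon: {lon:.6f}")  # Lat: 51.507222 Lon: -0.127500
```

Six decimal places is roughly 10 cm of resolution, which is a good reminder of how precise this metadata can be.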

Preserving Image Quality While Removing Metadata

A tricky aspect was ensuring that removing Exif data didn’t affect image quality or orientation. Images with rotation information in their Exif data would suddenly appear sideways if you simply stripped all metadata. The solution I developed may also avoid a common problem with other Exif tools where jpg quality drops when you edit and save the Image object itself. For my images I have not seen this quality issue, but maybe I have not looked hard enough. If it does happen, do open an issue on GitHub!

Anyway, I solved the rotation and Exif-stripping issues with the following approach:

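import io

from PIL import Image, ImageOps


def sanitize_image(img):
    """Sanitize image by removing all Exif data while preserving orientation."""
    # Remember the source format first: exif_transpose() returns a new image
    # whose .format is None, so reading it afterwards would always fall back to JPEG
    img_format = img.format or "JPEG"

    # Apply any orientation from Exif so the pixels are stored upright
    img = ImageOps.exif_transpose(img)

    # Re-encode into an in-memory buffer; no Exif is passed to save()
    with io.BytesIO() as buffer:
        img.save(buffer, format=img_format)
        buffer.seek(0)
        cleaned_img = Image.open(buffer)
        cleaned_img.load()  # read all pixel data before the buffer is closed

    return cleaned_img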

The approach above first applies any rotation specified in the Exif data, then creates a fresh image without any metadata, effectively giving us a “clean” version that still looks correct (hopefully!).
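One way to double-check the result for JPEG files, independently of Pillow, is to look at the raw bytes: Exif lives in an APP1 segment whose payload starts with `Exif\0\0`. The following is a rough sketch using only the standard library; `jpeg_has_exif` and `_segment` are hypothetical helpers, and the sample byte strings are hand-built stand-ins rather than real photos.

```python
import struct


def jpeg_has_exif(data: bytes) -> bool:
    """Return True if a JPEG byte string contains an Exif APP1 segment."""
    if data[:2] != b"\xff\xd8":  # every JPEG starts with the SOI marker
        raise ValueError("not a JPEG stream")
    pos = 2
    while pos + 4 <= len(data):
        if data[pos] != 0xFF:
            break  # lost sync with the segment structure: stop rather than guess
        marker = data[pos + 1]
        if marker == 0xDA:  # SOS: compressed image data starts, no more metadata
            break
        (length,) = struct.unpack(">H", data[pos + 2:pos + 4])  # includes its own 2 bytes
        payload = data[pos + 4:pos + 2 + length]
        if marker == 0xE1 and payload.startswith(b"Exif\x00\x00"):
            return True
        pos += 2 + length
    return False


def _segment(marker: int, payload: bytes) -> bytes:
    """Build one JPEG segment (illustration only)."""
    return b"\xff" + bytes([marker]) + struct.pack(">H", len(payload) + 2) + payload


with_exif = b"\xff\xd8" + _segment(0xE1, b"Exif\x00\x00fake") + b"\xff\xda"
without_exif = b"\xff\xd8" + _segment(0xE0, b"JFIF\x00") + b"\xff\xda"
print(jpeg_has_exif(with_exif), jpeg_has_exif(without_exif))  # True False
```

Note that this only looks for Exif. Other metadata containers, such as XMP (also stored in APP1, under a different header), are outside the scope of this check.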

Making it a pre-commit hook

For the tool to be truly useful, it needed to integrate smoothly into my development workflows. I made it work as both a standalone command-line tool and a pre-commit hook:

def process_files(files, remove=False, quiet=False):
    """Process image files to check for and optionally remove Exif data."""
    unsupported_files = []
    unsafe_exif_found = False
    sensitive_files = []

    # -----
    # Processing logic...
    # -----

    # Return non-zero exit code if unsafe Exif data was found
    # This causes pre-commit hooks to fail and prevent the commit
    return 1 if unsafe_exif_found else 0

The exit code is crucial for pre-commit integration — by returning a non-zero value when sensitive data is detected, it prevents committing those files until they’re cleaned.
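The elided processing step can be pictured roughly like this. It is a simplified sketch, not the actual detect-exif code: `run_checks` and `fake_is_unsafe` are hypothetical names, and the detector is a stand-in that only looks at the file name so the example runs without any images.

```python
import sys


def run_checks(paths, is_unsafe):
    """Return 1 if `is_unsafe` flags any path, otherwise 0."""
    flagged = [path for path in paths if is_unsafe(path)]
    for path in flagged:
        print(f"{path}: potentially sensitive Exif data found", file=sys.stderr)
    # pre-commit treats any non-zero exit status as a failed hook
    return 1 if flagged else 0


def fake_is_unsafe(path):
    """Stand-in detector for illustration; a real one would open the image."""
    return path.endswith("_with_gps.jpg")


exit_code = run_checks(["logo.png", "holiday_with_gps.jpg"], fake_is_unsafe)
print(exit_code)  # 1
# A real entry point would finish with: sys.exit(exit_code)
```

pre-commit passes the staged file names as command-line arguments, so the same function serves both the hook and manual runs.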

Try It Yourself!

Want to check your own images? You can install detect-exif with pip:

pip install detect-exif

Then run it on any image:

# Check images without modifying them
detect-exif img1.png img2.jpg

# Remove metadata while preserving orientation
detect-exif --remove img1.jpg
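If you want to audit an entire website or repository in one go, a small wrapper can collect the files and hand them to the CLI. This is a sketch under assumptions: the suffix list is my guess at common image types, not the tool's documented list of supported formats, and `find_images` and `scan` are hypothetical helpers.

```python
import subprocess
from pathlib import Path

# Assumption: common image suffixes, not necessarily what detect-exif supports
IMAGE_SUFFIXES = {".jpg", ".jpeg", ".png", ".tiff", ".webp"}


def find_images(root):
    """Return sorted paths under `root` whose suffix looks like an image."""
    return sorted(
        path
        for path in Path(root).rglob("*")
        if path.is_file() and path.suffix.lower() in IMAGE_SUFFIXES
    )


def scan(root):
    """Run the detect-exif CLI over every image found under `root`."""
    images = find_images(root)
    if not images:
        return 0
    # Same as typing `detect-exif img1.png img2.jpg ...` by hand
    return subprocess.run(["detect-exif", *map(str, images)]).returncode
```

Calling `scan(".")` from the project root checks everything without modifying any files, since `--remove` is not passed.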

For pre-commit integration, just add this to your .pre-commit-config.yaml:

- repo: https://github.com/olipinski/detect-exif
  rev: v0.1.0
  hooks:
    - id: detect-exif
      # args: ["--remove", "--quiet"]

Conclusion

I’d love to hear about your experiences with the tool or suggestions for improvements! Feel free to open an issue or submit a PR on GitHub.

This project reminded me that privacy often hinges on these small, overlooked details. By making tools that help protect privacy by default, we can all contribute to a digital world that respects personal boundaries a bit more.


Appendix: Understanding Exif Byte Codes and Standards

For those interested in the technical details, here’s a quick explanation of the mysterious byte codes mentioned in the article:

  • 0x0112 - This is the Orientation tag. It determines how the image should be rotated when displayed. Without preserving this value, images might appear sideways or upside down after metadata removal.

  • 0x8825 - This is the GPSInfo tag. It’s a pointer to a whole block of GPS-related metadata, including latitude, longitude, altitude, and timestamps. When we check for this tag, we’re seeing if the image has any location data at all.

Exif data is organized in a hierarchical structure with several categories (called IFDs or Image File Directories):

  • IFD0 (Main Image) - Contains basic image metadata
  • Exif SubIFD - Contains detailed camera settings
  • GPS IFD - Contains location information
  • Various other sections for specialized data

Each piece of information is stored with a specific tag number (like our 0x0112), making it possible to selectively preserve technical tags while removing privacy-sensitive ones.
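To show what that hierarchy looks like at the byte level, here is a small sketch that walks a hand-built Exif block using only `struct`. It assumes the standard TIFF layout (byte-order mark, magic number 42, offset to IFD0, then 12-byte entries); `read_ifd`, `walk_exif`, and the sample bytes are made up for illustration and skip all the edge cases a real parser has to handle.

```python
import struct

# Tags in IFD0 whose value is an offset to another IFD
POINTER_TAGS = {0x8769: "Exif SubIFD", 0x8825: "GPS IFD"}


def read_ifd(tiff, offset, endian):
    """Return {tag_id: raw 4-byte value field} for the IFD at `offset`."""
    (count,) = struct.unpack_from(endian + "H", tiff, offset)
    entries = {}
    for index in range(count):
        # Each entry: tag (2 bytes), type (2), value count (4), value or offset (4)
        tag, _type, _count, raw = struct.unpack_from(
            endian + "HHI4s", tiff, offset + 2 + 12 * index
        )
        entries[tag] = raw
    return entries


def walk_exif(tiff):
    """Return {ifd_name: entries} for IFD0 and any sub-IFDs it points to."""
    endian = {b"II": "<", b"MM": ">"}[tiff[:2]]
    magic, ifd0_offset = struct.unpack_from(endian + "HI", tiff, 2)
    if magic != 42:
        raise ValueError("not a TIFF header")
    ifds = {"IFD0": read_ifd(tiff, ifd0_offset, endian)}
    for tag, name in POINTER_TAGS.items():
        if tag in ifds["IFD0"]:
            (sub_offset,) = struct.unpack(endian + "I", ifds["IFD0"][tag])
            ifds[name] = read_ifd(tiff, sub_offset, endian)
    return ifds


# Hand-built sample: IFD0 holds Orientation and a GPS pointer; the GPS IFD holds GPSLatitudeRef
sample = (
    b"II" + struct.pack("<HI", 42, 8)              # header, IFD0 starts at offset 8
    + struct.pack("<H", 2)                          # IFD0 has two entries
    + struct.pack("<HHIHH", 0x0112, 3, 1, 1, 0)     # Orientation = 1
    + struct.pack("<HHII", 0x8825, 4, 1, 38)        # GPSInfo -> IFD at offset 38
    + struct.pack("<I", 0)                          # no further IFD
    + struct.pack("<H", 1)                          # GPS IFD has one entry
    + struct.pack("<HHI4s", 0x0001, 2, 2, b"N\x00") # GPSLatitudeRef = "N"
    + struct.pack("<I", 0)
)

for name, entries in walk_exif(sample).items():
    print(name, [hex(tag) for tag in entries])
```

The point is that 0x8825 in IFD0 holds no coordinates itself; it only tells the reader where the GPS block lives.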

Is the Exif schema standardized?

The Exif specification exists in a strange middle ground between standardized and manufacturer-specific. The base specification is maintained by JEITA (Japan Electronics and Information Technology Industries Association, formerly JEIDA) and defines common tags and structures. The current version is 3.0, though many devices implement older versions.

However, things get complicated for a few reasons:

  1. Official specs aren’t free - You typically need to purchase the official specification from JEITA, which is why many developers rely on reverse engineering or third-party documentation like ExifTool’s tag reference.

  2. Manufacturer extensions - Camera makers like Canon, Nikon, and Sony all implement their own proprietary Exif extensions in “MakerNote” sections. These contain custom data in formats that can be difficult to interpret without manufacturer documentation.

  3. Implementation inconsistencies - Despite the specification, implementation details vary between devices and software, which is why robust Exif parsing libraries need to handle many edge cases.

For detect-exif, I took the approach that all non-essential metadata is potentially privacy-sensitive. This is safer than trying to maintain an exhaustive list of “safe” tags, especially given the complexity of the Exif ecosystem.
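As a rough sketch of why flag-everything is the cautious default, compare it with a deny list of known-sensitive tags. The deny list is my own hypothetical counter-example, not something detect-exif uses, and both function names are made up for the illustration.

```python
ORIENTATION = 0x0112
TECHNICAL_TAGS = {ORIENTATION}  # the only tag the post treats as purely technical

# Hypothetical deny list for comparison: Make, Model, GPSInfo
KNOWN_SENSITIVE = {0x010F, 0x0110, 0x8825}


def flag_everything_else(tag_ids):
    """Flag every tag that is not explicitly technical."""
    return sorted(set(tag_ids) - TECHNICAL_TAGS)


def flag_known_sensitive(tag_ids):
    """Flag only tags that appear on the deny list."""
    return sorted(set(tag_ids) & KNOWN_SENSITIVE)


# Orientation, Make, GPSInfo, and MakerNote (0x927C)
tags_in_file = [0x0112, 0x010F, 0x8825, 0x927C]

print([hex(t) for t in flag_everything_else(tags_in_file)])  # ['0x10f', '0x8825', '0x927c']
print([hex(t) for t in flag_known_sensitive(tags_in_file)])  # ['0x10f', '0x8825']
```

The deny list silently lets the MakerNote block through, which is exactly the kind of manufacturer-specific data that is hard to reason about.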

Olaf Lipinski
Researcher in Artificial Intelligence

Making sure AI actually helps in real-world scenarios.