AI Search Deep Dive

A comprehensive look at how CullVue's semantic search works under the hood, from the model that powers it to tips for getting the best results.

How Semantic Search Works

Traditional photo search relies on filenames, dates, or tags you have manually applied. Semantic search is fundamentally different. It understands what is in your photos and lets you find them by describing their visual content in plain language.

The process works in three stages:

1. Embedding generation

When you add a watch folder, CullVue runs each image through a vision model to produce a compact numerical representation called an embedding. An embedding is a high-dimensional vector (a list of numbers) that captures the visual content of the photo: the objects, scenes, colors, textures, and composition. Two photos of a sunset at a beach will have embeddings that are numerically close to each other, even if the photos were taken on different cameras at different beaches.

2. Query encoding

When you type a search query such as "sunset at the beach", the same model converts your text into the same embedding space. The text embedding lands near the region of image embeddings that visually match your description.

3. Similarity matching

CullVue compares your text embedding against every image embedding in your library using cosine similarity, a mathematical measure of how close two vectors are in direction. The results are ranked by similarity score and the most relevant matches are displayed first. This entire comparison happens in milliseconds, even for libraries with tens of thousands of photos.
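
The three stages can be illustrated with a short sketch. The snippet below is a minimal illustration only, not CullVue's actual implementation: it assumes a hypothetical model object with encode_image and encode_text methods, plus hypothetical load_image and photo_paths helpers, and uses cosine similarity to rank the library.

    import numpy as np

    def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
        # Cosine similarity: the dot product of the two vectors divided by the
        # product of their lengths; 1.0 means they point in the same direction.
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

    # Stage 1 - indexing: turn every photo into an embedding (hypothetical model API).
    library = {path: model.encode_image(load_image(path)) for path in photo_paths}

    # Stage 2 - query encoding: project the text into the same embedding space.
    query_vec = model.encode_text("sunset at the beach")

    # Stage 3 - similarity matching: score every photo and rank by similarity.
    ranked = sorted(
        ((cosine_similarity(query_vec, vec), path) for path, vec in library.items()),
        reverse=True,
    )
    best_score, best_path = ranked[0]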

The MobileCLIP-S2 Model

CullVue uses MobileCLIP-S2 as its vision-language model. MobileCLIP is a family of efficient CLIP (Contrastive Language-Image Pre-training) models designed specifically for on-device inference.

What is CLIP?

CLIP is a type of model that learns to associate images and text in a shared embedding space. It was trained on a large dataset of image-text pairs so that it understands the relationship between visual content and natural language descriptions. This is what makes it possible to search for photos by typing a sentence rather than relying on keywords or filenames.

Why MobileCLIP-S2?

CullVue chose MobileCLIP-S2 for several reasons:

  • Optimized for on-device use — MobileCLIP-S2 is specifically designed for efficient local inference. It runs well on consumer hardware without requiring a powerful GPU.
  • Apple Neural Engine support — On Apple Silicon Macs (M1 and newer), the model takes advantage of the dedicated Neural Engine for hardware-accelerated processing, significantly speeding up both indexing and search.
  • Strong accuracy for its size — Despite being much smaller than server-class vision models, MobileCLIP-S2 provides highly relevant search results for everyday photo queries.
  • Compact model size — The model download is approximately 400 MB, a one-time download that is stored locally and never needs to be re-downloaded unless you explicitly remove it.

Runs entirely locally

The model runs entirely on your device, using the CPU, GPU, or Apple Neural Engine depending on your hardware. No internet connection is required for either indexing or searching. The model weights are stored on disk and loaded into memory only when you activate semantic search mode.
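
As a rough sketch of this load-on-demand behavior (the class name, file path, and load_clip_model loader below are illustrative assumptions, not CullVue's actual code):

    from pathlib import Path

    class SemanticSearchEngine:
        """Loads the locally stored model weights only when search is activated."""

        def __init__(self, weights_path: Path):
            self.weights_path = weights_path   # local weights file (illustrative path)
            self.model = None                  # nothing in memory until needed

        def activate(self):
            # Load weights from disk into memory when semantic search mode turns on.
            if self.model is None:
                self.model = load_clip_model(self.weights_path)  # hypothetical loader

        def deactivate(self):
            # Release the model to free RAM when switching back to filename search.
            self.model = None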

Search Tips & Examples

Semantic search understands natural language, so you do not need to use special syntax or keywords. That said, some query styles produce better results than others.

Effective queries

  • "sunset at the beach" — Photos of sunsets with ocean, sand, or coastline visible
  • "cat on a couch" — Indoor photos of cats resting on sofas or cushions
  • "birthday party with cake" — Celebration scenes with cakes, candles, or party decorations
  • "red car" — Photos featuring red-colored automobiles
  • "snowy mountain landscape" — Wide shots of snow-covered peaks and alpine scenery
  • "people hiking in a forest" — Outdoor trail photos with people walking among trees

Tips for better results

  • Be descriptive, not abstract — Describe what is visually present in the image. Queries like "golden hour portrait" work better than "beautiful photo" because the model matches on visual content, not subjective judgments.
  • Include context clues — Adding details like location, setting, or objects helps narrow results. "dog playing in the park" is more specific than just "dog".
  • Try different phrasings — If your first query does not return what you expect, rephrase it. "autumn leaves" and "fall foliage" may surface slightly different results.
  • Start broad, then narrow — If you are not sure how your photos were composed, start with a general query and refine from there.
  • Use the scope filter — Use the scope dropdown next to the search bar to limit results to a specific folder or search across all watched folders.

What semantic search does not do

Semantic search is based on visual similarity, not optical character recognition or face recognition. It will not reliably find photos containing specific text (like a sign with a particular word) or identify specific individuals by name. It understands visual concepts and scenes, not identities or written words.

How Indexing Works

Indexing is the process of generating an embedding for each photo in your library. This is what makes semantic search possible.

Background processing

Indexing runs as a background task that does not block the rest of the application. You can browse, tag, create albums, and use filename search while indexing is active. CullVue throttles the indexing process to keep the app responsive.

There are two indexing modes:

  • Foreground indexing — Triggered from the indexing modal dialog. This runs at full speed and processes images faster, which is ideal during initial setup.
  • Background indexing — Runs at lower priority when the modal is closed. This ensures the app stays snappy while gradually working through your library.
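
A minimal sketch of how such a two-priority indexing loop might be structured (illustrative only; the queue layout, the store_embedding helper, and the sleep interval are assumptions, not CullVue's actual values):

    import time
    from queue import Queue

    def run_indexer(pending: Queue, foreground: bool, embed_fn):
        # Process queued photos one at a time; embed_fn produces the embedding.
        while not pending.empty():
            photo_path = pending.get()
            store_embedding(photo_path, embed_fn(photo_path))  # hypothetical persistence helper
            if not foreground:
                # Background mode: pause briefly between images so the UI stays responsive.
                time.sleep(0.1)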

Progress tracking

The status bar at the bottom of the window shows indexing progress, including the number of images processed and the current file being analyzed. When indexing is idle, the status bar displays the total number of indexed images in your library.

Incremental updates

CullVue watches your folders in real time. When new photos are added to a watched folder, they are automatically queued for indexing. When photos are deleted, their corresponding index entries are cleaned up. You do not need to manually re-index your library after adding or removing files.
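
As a sketch of this kind of incremental updating, here is how a folder watcher could queue new photos and clean up deleted ones using the Python watchdog library (CullVue's real watcher is an implementation detail; index_photo and remove_from_index are hypothetical helpers):

    from watchdog.events import FileSystemEventHandler
    from watchdog.observers import Observer

    class WatchFolderHandler(FileSystemEventHandler):
        def on_created(self, event):
            # A new photo appeared: queue it for embedding generation.
            if not event.is_directory:
                index_photo(event.src_path)        # hypothetical helper

        def on_deleted(self, event):
            # A photo was removed: drop its embedding from the index.
            if not event.is_directory:
                remove_from_index(event.src_path)  # hypothetical helper

    observer = Observer()
    observer.schedule(WatchFolderHandler(), "/path/to/watch/folder", recursive=True)
    observer.start()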

Index storage

All embeddings are stored in a local database on your device. The index size depends on the number of photos in your library, but it is very compact. For a typical library of 10,000 photos, expect the index to use approximately 30 MB of disk space. The embeddings are small numerical vectors, not copies of your images.
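
As a rough back-of-the-envelope check of that figure, assuming a 512-dimensional float32 embedding per photo (a common output size for CLIP-style models; the exact dimension and precision CullVue uses are not documented here):

    dim = 512            # assumed embedding dimension
    bytes_per_value = 4  # float32
    photos = 10_000

    raw_bytes = photos * dim * bytes_per_value
    print(raw_bytes / 1_000_000)  # ~20 MB of raw vectors; database overhead brings it to roughly 30 MB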

Privacy & Local Processing

Privacy is a core design principle of CullVue's AI search. Every part of the pipeline runs entirely on your device.

  • No cloud processing — Your photos are never uploaded to a server. The AI model runs locally on your device's CPU, GPU, or Neural Engine.
  • No internet required — After the one-time model download, semantic search works completely offline. You can disconnect from the internet entirely and search will continue to work.
  • No data leaves your device — Your images, embeddings, search queries, and results are all processed and stored locally. Nothing is transmitted to external servers.
  • No third-party access — There are no third-party analytics, tracking pixels, or data-sharing agreements involved in the AI search feature. The only network request the app makes is an optional check for software updates, which can be disabled in Settings.
  • Embeddings are not images — The embeddings stored in the local database are compact numerical vectors. They cannot be reversed to reconstruct the original image. They contain no EXIF data, no GPS coordinates, and no personally identifiable information.

Performance

Search speed

Once your library is indexed, searches complete in milliseconds. The similarity comparison is a lightweight mathematical operation that scales well even to large libraries. A library of 10,000 photos typically returns results in under 50 ms.
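
To see why the comparison stays fast, here is a sketch of the same similarity matching done as a single vectorized operation over the whole library (assuming the embeddings are kept as one L2-normalized NumPy matrix; illustrative, not CullVue's actual code):

    import numpy as np

    # embeddings: (N, D) matrix of normalized image vectors; query: (D,) normalized text vector.
    def search(embeddings: np.ndarray, query: np.ndarray, top_k: int = 50) -> np.ndarray:
        scores = embeddings @ query                 # one matrix-vector product gives every cosine similarity
        return np.argsort(scores)[::-1][:top_k]     # indices of the best-matching photos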

Indexing speed

Indexing speed depends on your hardware. On an Apple Silicon Mac with the Neural Engine, CullVue can index several hundred images per minute. On machines without dedicated ML hardware, indexing relies on the CPU and is slower but still practical for most library sizes.

A dedicated GPU is not required but will significantly speed up the initial embedding generation on Windows and Linux.
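
As a sketch of how a backend might pick the fastest available device, assuming a PyTorch-style runtime (an assumption; the documentation does not state which inference backend CullVue uses on each platform):

    import torch

    # Prefer a CUDA GPU, then Apple's Metal backend, then fall back to the CPU.
    if torch.cuda.is_available():
        device = "cuda"
    elif torch.backends.mps.is_available():
        device = "mps"
    else:
        device = "cpu"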

Index size

The embedding index is very space-efficient. Each image embedding is a small numerical vector, so 10,000 photos produce an index of roughly 30 MB. Even libraries with 50,000+ photos will have an index well under 200 MB.

Memory usage

The AI model is loaded into memory only when semantic search mode is active. When you switch back to filename search, the model is unloaded to free up RAM. CullVue requires at least 4 GB of system RAM, though 8 GB is recommended for the best experience when using AI search alongside large photo libraries.

Supported Content & Limitations

Supported image formats

AI search indexes all image formats that CullVue supports, including:

  • Standard formats — JPEG, PNG, WebP, GIF, BMP, AVIF
  • Apple formats — HEIC, HEIF
  • RAW formats — CR2, CR3 (Canon), NEF (Nikon), ARW (Sony), DNG (Adobe), ORF (Olympus), RAF (Fujifilm), RW2 (Panasonic), PEF (Pentax)
  • Other — TIFF, TIF, SVG

Videos

Video files are recognized and displayed in the library, but they are not indexed for AI search at this time. Video files will not appear in semantic search results. Text-based filename search still works for videos.

Known limitations

  • No text recognition (OCR) — The model does not read text within images. Searching for a specific word on a sign or document will not produce reliable results.
  • No face recognition — Semantic search cannot identify specific people by name. It understands general concepts like "person" or "group of people" but not individual identities.
  • Abstract concepts — Subjective or emotional queries like "happy photo" or "my best shot" may not produce meaningful results. The model works best with concrete visual descriptions.
  • Very small or occluded objects — If the subject of your query is very small within the frame or heavily obscured, the model may not detect it reliably. Prominent visual elements produce the strongest matches.
  • Query length — Queries must be at least 3 characters long. Very short queries may produce broad or unexpected results. Longer, more descriptive queries tend to be more precise.