Why Multimodal Embeddings Could Change Enterprise Search
Imagine this.
A field technician sees a machine component on a factory floor, takes a photo of it, and the system instantly retrieves the relevant manual or maintenance documentation.
No part numbers.
No exact keywords.
Just a photo.
This kind of search experience is becoming possible with multimodal embeddings.
In simple terms, embeddings convert things like text or images into numerical representations that capture their meaning. When two things are conceptually similar, their embeddings end up closer to each other in vector space.
Recently, Google released Gemini Embedding 2, which creates embeddings for multiple types of data such as text, images, audio, and video, all in the same shared space.
This means a search system can recognize that a photo, a written description, or even a video frame of the same object are related.

I ran a quick experiment using the Gemini API.
The similarity scores are not very high yet, but the relative ranking is correct: the system understands that the sports car image is closer to “red sports car” than to “cycle.” This is probably where the technology stands today: early but promising. Performance will likely improve as models and training data evolve, but the important thing is that the foundation for multimodal embeddings has now been laid.
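The ranking behavior in that experiment can be sketched with plain cosine similarity, the standard way to compare embeddings. The vectors below are toy stand-ins for real embedding outputs (an actual run would call an embedding API and get vectors with hundreds of dimensions); only the relative geometry matters here.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy 4-dimensional "embeddings" -- hypothetical placeholders for what a
# multimodal model would return for an image and two text queries.
embeddings = {
    "sports car photo": [0.9, 0.1, 0.3, 0.2],
    "red sports car":   [0.8, 0.2, 0.4, 0.1],
    "cycle":            [0.1, 0.9, 0.2, 0.7],
}

query = embeddings["sports car photo"]
ranking = sorted(
    (label for label in embeddings if label != "sports car photo"),
    key=lambda label: cosine_similarity(query, embeddings[label]),
    reverse=True,
)
print(ranking)  # "red sports car" ranks above "cycle" for this image vector
```

The design choice worth noting is that search over a shared embedding space reduces to exactly this: embed everything once, then rank by similarity to the query vector, regardless of whether the query was text or an image.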
For businesses, this opens up some interesting possibilities.
In manufacturing and field service: Technicians could take a picture of a component and instantly find manuals, troubleshooting guides, or spare part catalogs.
In industrial supply chains: Teams could search large parts catalogs using photos, making it easier to identify compatible components.
In media and content organizations: Large archives of images and videos could be searched using natural language descriptions instead of manual tagging.
In enterprise knowledge systems: Employees could search using whatever they have available (a description, a screenshot, or a photo) and still retrieve relevant information.
We’ve already seen the shift from keyword search → semantic search.
With models like Gemini Embedding 2, we are starting to see the next step: multimodal search.