Why Multimodal Embeddings Could Change Enterprise Search
Imagine this. A field technician sees a machine component on a factory floor, takes a photo of it, and the system instantly retrieves the relevant manual or maintenance documentation. No part numbers. No exact keywords. Just a photo.

This kind of search experience is becoming possible with multimodal embeddings. In simple terms, embeddings convert things like text or images into numerical representations that capture their meaning. When two things are conceptually similar, their embeddings end up closer to each other in vector space.

Recently, Google released Gemini Embedding 2, which creates embeddings for multiple types of data, such as text, images, audio, and video, all in the same shared space. This means that a photo, a written description, or even a video frame of the same object can be understood as related by a search system.

I ran a quick experiment using the Gemini API. I compared a sports car image with two text...
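The notion of embeddings being "closer in vector space" is usually measured with cosine similarity. Here is a minimal sketch using NumPy and made-up 3-dimensional vectors (real embedding models return vectors with hundreds or thousands of dimensions; the names and values below are purely illustrative):

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine of the angle between two vectors; 1.0 means identical direction."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy "embeddings" for illustration only -- not output from any real model.
photo_of_pump = np.array([0.9, 0.1, 0.2])   # e.g. an image embedding
pump_manual   = np.array([0.8, 0.2, 0.1])   # conceptually similar document
holiday_memo  = np.array([0.1, 0.9, 0.7])   # unrelated document

print(cosine_similarity(photo_of_pump, pump_manual))   # close to 1.0
print(cosine_similarity(photo_of_pump, holiday_memo))  # noticeably lower
```

A search system built on a shared embedding space does essentially this at scale: embed the query (a photo, in the technician scenario), then rank stored documents by similarity, regardless of which modality each item came from.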