Vector search in layman's terms

Vector search has once again become a hot topic, underpinning myriad use cases in AI. From image search to fuzzy document search, it has become a widely adopted paradigm in search technology. And the most recent AI explosion, driven primarily by OpenAI’s ChatGPT, has boded well for providers of vector search technology such as Pinecone, Vectara, or Qdrant.

So it’s worth diving a bit deeper into the technology and seeing where the Omnisearch team stands on this and how we’re positioned in the market.

A short intro to vector search

What is vector search and how exactly does it differ from more traditional search methods? The key difference, unsurprisingly, is that vector search operates on vector representations of data (also known as embeddings). What exactly does this mean?

At the risk of dramatically oversimplifying a complex topic, an intuitive explanation that always made the most sense to me is that embeddings are essentially a neural network’s internal representation of the input data. Imagine a neural network trained for object classification on a large data set of images. Intuitively speaking, while its final layers will take care of assigning labels to the image, the preceding layers will contain a certain internal representation of the image.

Image credit: Baeldung on Computer Science

It’s precisely these internal representations that provide the basis for vector search. The embedding is in fact just an array of real numbers. As with any array of real numbers, we can interpret it as a point in space. The main trick that vector search is based on is that conceptually similar inputs will be mapped to points that lie close together in that space. This is why vectors are a great mechanism for problems like finding similar images. Map the images in your database to embeddings and store those embeddings. When a user uploads an image as a search query, map that image to an embedding as well and find the stored embeddings closest to it.
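
The image lookup described above can be sketched in a few lines of plain Python. This is a minimal illustration, not a production setup: the file names and the three-dimensional vectors are made up, and cosine similarity is just one common way to measure closeness. Real models produce embeddings with hundreds of dimensions.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical embeddings, stored once when the images are indexed.
image_index = {
    "beach.jpg": [0.9, 0.1, 0.0],
    "harbor.jpg": [0.7, 0.3, 0.1],
    "forest.jpg": [0.0, 0.2, 0.9],
}

def most_similar(query_embedding, index):
    """Return the name of the stored item whose embedding is closest to the query."""
    return max(index, key=lambda name: cosine_similarity(query_embedding, index[name]))

# Stand-in for the embedding of the image a user just uploaded.
query = [0.85, 0.15, 0.05]
best_match = most_similar(query, image_index)
print(best_match)
```

With these toy numbers the query lands closest to "beach.jpg", because the two vectors point in almost the same direction.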

Another very cool property is that it’s possible to get embeddings for other kinds of data and not just images, for instance, text. In fact, one of the most popular methods, Word2Vec, has been around since 2013. This method learns word embeddings and is surprisingly good at capturing conceptual relationships. As an example, the difference between the embeddings for “Paris” and “France” will be pretty much the same as the difference between “Rome” and “Italy”. The same holds for pairs like “Japan/sushi” and “Germany/bratwurst”.
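
The arithmetic behind that “Paris/France” example can be shown with a toy sketch. The vectors below are hand-picked for illustration and are not real Word2Vec output: the first dimension is assumed to encode “is a capital city” and the other three encode the country. In real embeddings the offsets are only approximately equal.

```python
# Hand-picked toy vectors, NOT output from a trained Word2Vec model.
toy_embeddings = {
    "Paris":   [1.0, 1.0, 0.0, 0.0],
    "France":  [0.0, 1.0, 0.0, 0.0],
    "Rome":    [1.0, 0.0, 1.0, 0.0],
    "Italy":   [0.0, 0.0, 1.0, 0.0],
    "Berlin":  [1.0, 0.0, 0.0, 1.0],
    "Germany": [0.0, 0.0, 0.0, 1.0],
}

def subtract(a, b):
    return [x - y for x, y in zip(a, b)]

def add(a, b):
    return [x + y for x, y in zip(a, b)]

def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

# The "capital of" offset is the same for both country pairs.
offset_france = subtract(toy_embeddings["Paris"], toy_embeddings["France"])
offset_italy = subtract(toy_embeddings["Rome"], toy_embeddings["Italy"])

# Analogy query: Paris - France + Italy should land nearest to Rome.
target = add(offset_france, toy_embeddings["Italy"])
candidates = [w for w in toy_embeddings if w not in ("Paris", "France", "Italy")]
answer = min(candidates, key=lambda w: squared_distance(target, toy_embeddings[w]))
print(answer)
```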

Vectors vs. exact search methods

The question now is how vector methods differ from more traditional, exact search methods. As always, each approach comes with pros and cons.

Vector methods are great at fuzzy search, as they’re able to find conceptually similar results even when there are no exact matches. A query for “cuisine” might return results containing “food”, and “canine” might return “dog” or even specific breeds of dogs. Overall, they’re also superior when handling long-tail queries. And that’s just the language part of the story: vector methods are a great fit for image search, as demonstrated by models like OpenAI’s CLIP.

At the same time, they’re not a silver bullet. In our case, the main drawbacks of vector methods are their relative lack of precision, their limited debuggability and interpretability, and far more difficult result reconstruction. For instance, using a more exact search method makes it a lot easier to reconstruct where a result came from: timestamps in audio/video, bounding boxes in images, and paragraphs in text or documents. And since vector methods are neural network-based, it’s more difficult to explain why a particular result was returned.
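
To make the reconstruction point concrete, here is a small sketch of an exact match over a transcript. The segments, timestamps, and the `find_exact` helper are hypothetical and only illustrate the general idea; this is not a description of how Omnisearch works internally. Because the match is exact, the hit carries its own location: the segment’s time range and the character offset of the term.

```python
# Hypothetical transcript segments: (start_seconds, end_seconds, text).
transcript = [
    (0.0, 4.2, "Welcome to the show"),
    (4.2, 9.8, "Today we talk about vector search"),
    (9.8, 15.0, "Vector search relies on embeddings"),
]

def find_exact(term, segments):
    """Return (start, end, char_offset) for every segment containing the term."""
    term = term.lower()
    hits = []
    for start, end, text in segments:
        offset = text.lower().find(term)  # -1 when the term is absent
        if offset != -1:
            hits.append((start, end, offset))
    return hits

hits = find_exact("vector search", transcript)
for start, end, offset in hits:
    print(f"match between {start}s and {end}s at character {offset}")
```

An embedding of a whole segment, by contrast, is a single point in space and does not say which words or which moment produced the match.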

The search ecosystem

The ecosystem of search startups is rather large, with each vendor offering its own take on customer problems. I would broadly classify them as follows.

Text-first search solutions: In search, text is the most fundamental type of content. Though it’s no longer enough on its own, text search is still the backbone of a truly complete search experience. Here we have a large set of great vendors: Elastic, with its industry-standard Elasticsearch open source product, is the default for many applications. Meilisearch is another extremely popular open source solution, which aims to mitigate Elasticsearch’s complexity and sometimes unintuitive defaults. Algolia provides an easy-to-use search API and is particularly strong in the lucrative e-commerce vertical.
Vector databases: Vector databases essentially operate on embeddings, as explained above. The main operation on embeddings is a “nearest neighbor” query. This, in layman’s terms, simply means answering the question “What’s the closest point to X?” This is a far more complex problem than it sounds, and vector databases specialize in it. Some of the key players in the space are Pinecone, Qdrant, and Weaviate.
Search solutions for specific content types: Twelve Labs does a great job of searching videos and images. Vectara is excellent at document processing. Overall, many search providers will focus on one part of the content spectrum and become great at it.
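
The “nearest neighbor” query mentioned in the vector database entry above can be sketched as a brute-force scan. The point names and coordinates are invented, and Euclidean distance is an assumption; the sketch is only meant to show why the problem gets hard. Every query compares against every stored point, so the cost grows with both the number of points and the number of dimensions. Vector databases typically avoid this full scan with approximate indexes such as HNSW.

```python
import heapq
import math

def k_nearest(query, points, k):
    """Exact k-nearest-neighbor search by scanning every stored point."""
    return heapq.nsmallest(k, points, key=lambda item: math.dist(query, item[1]))

# Hypothetical stored points: (name, coordinates).
points = [
    ("a", [0.0, 0.0]),
    ("b", [1.0, 1.0]),
    ("c", [5.0, 5.0]),
    ("d", [0.9, 1.2]),
]

nearest = k_nearest([1.0, 1.0], points, k=2)
names = [name for name, _ in nearest]
print(names)
```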

Omnisearch’s edge

So, what do we do differently? It boils down to three basic things:

Bundling: Unlike the solutions mentioned above, Omnisearch gives you the entire indexing and search pipeline completely out of the box. Omnisearch can take any type of content as input, extract and index the most relevant information and perform search queries, either via API or through a dashboard. Therefore, there’s no need to stitch multiple tools together.
Holistic view: Building on the above, Omnisearch benefits from having a holistic view of your content. Consider a vector database as a counterpoint. A blog post from leading vendor Qdrant describes powering a search experience by combining their product with a text search solution like Meilisearch. But combining solutions has negative effects on important aspects of search like ranking and pagination. Omnisearch avoids this and provides everything out of the box.
Result accuracy: One of Omnisearch’s most important benefits is how seamlessly it finds and displays exact results in a variety of content types, even if they’re non-textual ones. Whether it’s exact moments inside audio or video files, or bounding boxes in images or documents, it offers significantly better granularity than any competing solution.

Overall, I hope this post gives a good overview of the search landscape and our position in it. Stay tuned for more posts like this, and of course, follow us on LinkedIn or reach out via email if you have more questions about vector search.
