Exploring Vectors, Embeddings, and Vector Databases

Essential notions for chatting with your data.

TL;DR

Explore the world of vectors, embeddings, and vector databases in mathematics, physics, and machine learning. Learn how to generate embeddings using the Hugging Face Inference API and store them in vector databases for efficient retrieval. Discover the significance of vector databases in handling high-dimensional numerical vectors and their applications in various fields. Gain insights into retrieval-augmented generation (RAG) and the indexing methods employed in vector databases.

Vectors

In mathematics and physics

According to Wikipedia, a vector is "a term that refers colloquially to some quantities that cannot be expressed by a single number (a scalar), or to elements of some vector spaces". Here you can play with vectors.

In machine learning

According to Algolia’s documentation, vectors signify "input data, including bias and weight. In the same way, output from a machine-learning model (for example, a predicted class) can be put into vector format."

Here is another definition from IBM: "Vectors are arrays of numbers that can represent complex objects like words, images, videos and audio, generated by a machine learning (ML) model. High-dimensional vector data is essential to machine learning, natural language processing (NLP) and other AI tasks."

Vector space

A vector space is a set of vectors that can be added together and scaled by numbers (scalars). In machine learning, embeddings live in such a space, where proximity between vectors reflects similarity between the objects they represent.

Embeddings

Embeddings are numerical representations of documents, which can include various types such as text, images, videos, and audio. These embeddings capture semantic relationships and contextual information, enabling algorithms to perform tasks such as similarity analysis, recommendation systems, and natural language understanding.

Let’s embed (encode) the above paragraph using the Hugging Face Inference API. If you want to follow along, you need to create an account at Hugging Face. Also, an access token with write permission is required.

Using a tool such as Postman, you can generate the embedding by sending a POST request to the endpoint

 https://api-inference.huggingface.co/pipeline/feature-extraction/{model_id}

where model_id corresponds to

sentence-transformers/all-MiniLM-L6-v2

and the request body is the following JSON:

{
  "inputs": "Embeddings are numerical representations of documents, ...",
  "options": {
    "wait_for_model": true
  }
}
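Alternatively, you can make the same request from Python. Here is a minimal sketch using the requests library; it assumes your access token is available in an HF_TOKEN environment variable:

import os
import requests

# Endpoint for the feature-extraction pipeline with our chosen model.
MODEL_ID = "sentence-transformers/all-MiniLM-L6-v2"
API_URL = f"https://api-inference.huggingface.co/pipeline/feature-extraction/{MODEL_ID}"

# HF_TOKEN is an assumption; set it to your own access token.
headers = {"Authorization": f"Bearer {os.environ['HF_TOKEN']}"}
payload = {
    "inputs": "Embeddings are numerical representations of documents, ...",
    # Wait for the model to load instead of failing with a 503 error.
    "options": {"wait_for_model": True},
}

response = requests.post(API_URL, headers=headers, json=payload)
embedding = response.json()  # a list of 384 floats for this model
print(len(embedding))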

The generated embedding is a list of 384 numbers (the dimensions of the vector), representing the semantic meaning of the text.

[
  -0.003196842037141323,
  -0.07144313305616379,
  0.06496944278478622,
  ...
]

The length of this list varies depending on the model employed. If OpenAI's text-embedding-ada-002 model were used instead, the list would have 1536 dimensions.

Depending on the index schema, additional fields, such as the original text alongside its embedding, must be stored in an index of a vector database for future retrieval. But what exactly is a vector database?

Vector Databases

Vector databases specialize in handling vector data, which are mathematical entities represented across multiple dimensions. They find extensive application in semantic search, retrieval-augmented generation (RAG), and geographic information systems (GIS). Essentially, they enable the storage and retrieval of the most relevant results in a highly scalable manner.

One of the most common ways to store and search over unstructured data is through vector databases.

At the core of a vector database is its ability to determine content that is close in meaning to a given query. This is achieved by first embedding the data and storing the resulting embedding vectors. During query time, an embedding is generated for the query, which is then used to query the stored data and retrieve the entries that are most similar to the embedded query.
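"Most similar" is typically quantified with a metric such as cosine similarity. Here is a minimal sketch of comparing two embeddings (random vectors stand in for real ones):

import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    # Close to 1.0: the vectors point in a similar direction (similar meaning);
    # close to 0: the underlying texts are largely unrelated.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Two hypothetical 384-dimensional embeddings.
query_vec = np.random.rand(384)
doc_vec = np.random.rand(384)
print(cosine_similarity(query_vec, doc_vec))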

In contrast to traditional databases, which store data in rows and columns, vector databases store data in the form of vectors. This difference in storage allows for more efficient handling of high-dimensional numerical vectors, which is essential for tasks such as similarity search and content recommendation.
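For illustration, a stored record typically pairs an identifier and the original content with its embedding. A hypothetical example, written as a Python dict (field names vary between databases):

# A hypothetical record stored in a vector database index.
record = {
    "id": "doc-001",
    "text": "Embeddings are numerical representations of documents, ...",
    "embedding": [-0.003196842037141323, -0.07144313305616379, ...],  # 384 floats in total
}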

When it comes to choosing a vector database, there are numerous options available. Here are some of them:

Open-Source

Commercial and platform

Cloud providers


Retrieval Augmented Generation

Perhaps, like me, you’ve noticed that LLMs don’t always seem to know what they’re talking about. That’s because they only understand the statistical relationships among words, not their actual meanings. Here, RAG comes to the rescue. RAG is an AI framework that allows grounding an LLM on external and up-to-date information. It is a method that combines retrieving existing information from a large dataset with generating new content from that retrieved information.

Chatbot and question-answering applications are among the use cases of RAG. There are 4 steps that need to be implemented in order to get a response for a user query (see the sketch after this list):

  1. Vectorize the user query using the same model that was used to generate embeddings for your dataset.
  2. Pass the vectorized query to the vector database, which will return vectors similar to the query.
  3. Use the vectors to extract the content they relate to.
  4. Augment the user query with that content and instruct the LLM to respond only from the provided context.
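Here is a minimal sketch of these four steps in Python. The embed, search_index, and ask_llm helpers are hypothetical stand-ins for your embedding model, vector database client, and LLM client:

def answer(query: str, embed, search_index, ask_llm, top_k: int = 3) -> str:
    # 1. Vectorize the user query with the same embedding model
    #    that was used for the dataset.
    query_vector = embed(query)
    # 2. Ask the vector database for the most similar stored vectors.
    hits = search_index(query_vector, top_k=top_k)
    # 3. Extract the content (text) that the returned vectors relate to.
    context = "\n\n".join(hit["text"] for hit in hits)
    # 4. Augment the query with that content and constrain the LLM
    #    to answer only from the provided context.
    prompt = (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {query}"
    )
    return ask_llm(prompt)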

Indexes

Indexes enable faster searching. The database converts queries, whether they are images, text, or other types of documents, into vectors. Then it uses its index to determine where the new vector fits within the existing data, ultimately returning neighboring vectors. This, of course, depends on the algorithm that was set when constructing the index.

HNSW (Hierarchical Navigable Small World), a graph-based algorithm, stands out as a favored option for tasks like similarity search, recommendation systems, and content-based retrieval.
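To make this concrete, here is a minimal sketch of building and querying an HNSW index with the hnswlib library; random vectors stand in for real embeddings, and the parameter values are illustrative:

import numpy as np
import hnswlib

dim = 384  # matches the all-MiniLM-L6-v2 embedding size
num_elements = 1_000

# Build an HNSW index using cosine distance.
index = hnswlib.Index(space="cosine", dim=dim)
index.init_index(max_elements=num_elements, ef_construction=200, M=16)

# Random vectors stand in for real embeddings here.
vectors = np.float32(np.random.random((num_elements, dim)))
index.add_items(vectors, np.arange(num_elements))

# ef controls the recall/speed trade-off at query time.
index.set_ef(50)

# Retrieve the 5 nearest neighbors of the first vector.
labels, distances = index.knn_query(vectors[0], k=5)
print(labels, distances)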