Retrieval-Augmented Generation (RAG) is a powerful technique that combines document retrieval with generative language models to enhance question-answering and content-generation tasks. RAG works by fetching documents relevant to a query from a corpus and then using those documents as context when generating a response. This grounding not only increases the accuracy of the responses but also gives them a richer, more detailed backdrop, making the model's output more informative and precise. In this article, we are going to explore how to set up a robust RAG system in a Java environment, using Pinecone for vector management and OpenAI's models for generating embeddings.
Before describing the approach and the technical solution, it's worth briefly explaining what vector DBs are. In short, they are a specific type of database designed to efficiently store and retrieve vector embeddings of text. These embeddings are high-dimensional numerical representations of text, generated by models such as OpenAI's, that capture semantic meaning. The vector database allows querying these vectors to find the most similar items in the embedding space, which is crucial for the retrieval part of RAG.
Now let's proceed with the problem description and the technical solution.
In one of our projects we faced the challenge of implementing a text engine that needed to perform semantically rich search. Traditional full-text, phonetic, or distance-based search was not enough for this use case, so we decided to use RAG.
The first step in applying RAG to this use case is to populate the vector DB. In this process, every entry that can be searched is converted into an embedding and inserted into the vector DB. Each insertion carries the most semantically relevant textual information for the searchable entry, and each entry is identified by a key that links the row in the vector DB to the real data object that originated it. Once the DB is populated, we can start performing search operations: the search string is converted into an embedding and sent to the vector DB as a query, and the DB responds with the top-K most similar results, including the key of each one.
The first step in populating the DB is to generate the embeddings to be inserted. Here we use OpenAI's embeddings endpoint (note that embeddings come from a dedicated embedding model such as text-embedding-ada-002, not from a chat model like GPT-4). To call OpenAI's API from Java, we are going to use the com.theokanning.openai library, which provides a straightforward API to interact with OpenAI and generate embeddings. In this context, an embedding is just a list of double-precision numbers (Double). Below is a simple example of how to generate an embedding for a given text.
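This is a minimal sketch of that step; the `EmbeddingGenerator` class name and the choice of the `text-embedding-ada-002` model are illustrative assumptions, not necessarily what shipped in our project:

```java
import com.theokanning.openai.embedding.EmbeddingRequest;
import com.theokanning.openai.service.OpenAiService;

import java.util.List;

public class EmbeddingGenerator {

    private final OpenAiService service;

    public EmbeddingGenerator(String apiKey) {
        // OpenAiService wraps authentication and HTTP handling for the OpenAI API
        this.service = new OpenAiService(apiKey);
    }

    /** Generates the embedding (a list of Doubles) for the given text. */
    public List<Double> generateEmbedding(String text) {
        EmbeddingRequest request = EmbeddingRequest.builder()
                .model("text-embedding-ada-002") // assumed embedding model
                .input(List.of(text))
                .build();
        // The embeddings endpoint returns one embedding per input string
        return service.createEmbeddings(request)
                .getData()
                .get(0)
                .getEmbedding();
    }
}
```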
The next step, after obtaining the embedding, is to insert it into the DB together with the key that links the entry to its source element.
In this case, we are going to use Pinecone as the vector DB, since that is what we used in our original implementation.
Now, using the embedding-generation method introduced above, we can perform the insertion as shown below.
(Note: in the following piece of code we use the jakarta.json-api and OkHttp libraries for JSON serialization and HTTP communication.)
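Here is a minimal sketch of the upsert call. It assumes a Pinecone index reachable at an index host (of the general form my-index-xxxx.svc.<environment>.pinecone.io) and authenticated via the `Api-Key` header; the `PineconeInserter` class name is illustrative:

```java
import jakarta.json.Json;
import jakarta.json.JsonArrayBuilder;
import jakarta.json.JsonObject;
import okhttp3.MediaType;
import okhttp3.OkHttpClient;
import okhttp3.Request;
import okhttp3.RequestBody;
import okhttp3.Response;

import java.io.IOException;
import java.util.List;

public class PineconeInserter {

    private static final MediaType JSON_TYPE = MediaType.get("application/json");

    private final OkHttpClient client = new OkHttpClient();
    private final String indexHost; // e.g. "my-index-xxxx.svc.us-east-1.pinecone.io" (assumed)
    private final String apiKey;

    public PineconeInserter(String indexHost, String apiKey) {
        this.indexHost = indexHost;
        this.apiKey = apiKey;
    }

    /** Upserts one embedding into Pinecone, identified by the source entry's key. */
    public void upsert(String key, List<Double> embedding) throws IOException {
        // Serialize the embedding as a JSON array of numbers
        JsonArrayBuilder values = Json.createArrayBuilder();
        embedding.forEach(values::add);

        // Pinecone's /vectors/upsert endpoint expects a "vectors" array of {id, values}
        JsonObject body = Json.createObjectBuilder()
                .add("vectors", Json.createArrayBuilder()
                        .add(Json.createObjectBuilder()
                                .add("id", key)
                                .add("values", values)))
                .build();

        Request request = new Request.Builder()
                .url("https://" + indexHost + "/vectors/upsert")
                .header("Api-Key", apiKey)
                .post(RequestBody.create(body.toString(), JSON_TYPE))
                .build();

        try (Response response = client.newCall(request).execute()) {
            if (!response.isSuccessful()) {
                throw new IOException("Pinecone upsert failed: HTTP " + response.code());
            }
        }
    }
}
```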
We can execute these two code pieces in sequence (embedding generation + DB insertion) on a regular basis or on demand to keep the vector DB up to date. Let's now proceed to the querying.
Querying the DB for similarity search consists of (1) generating the embedding for the query and (2) running the search against the vector DB. We can reuse the exact same embedding-generation code introduced above, but this time we invoke Pinecone's `/query` endpoint. The structure of the implementation (very similar to the one in the previous section) is detailed below.
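A sketch of the query step follows, in the same style as the inserter above (the `PineconeQuerier` class name is illustrative, and the raw JSON response is returned as-is):

```java
import jakarta.json.Json;
import jakarta.json.JsonArrayBuilder;
import jakarta.json.JsonObject;
import okhttp3.MediaType;
import okhttp3.OkHttpClient;
import okhttp3.Request;
import okhttp3.RequestBody;
import okhttp3.Response;

import java.io.IOException;
import java.util.List;

public class PineconeQuerier {

    private static final MediaType JSON_TYPE = MediaType.get("application/json");

    private final OkHttpClient client = new OkHttpClient();
    private final String indexHost;
    private final String apiKey;

    public PineconeQuerier(String indexHost, String apiKey) {
        this.indexHost = indexHost;
        this.apiKey = apiKey;
    }

    /** Returns the raw JSON response for the topK vectors most similar to the embedding. */
    public String query(List<Double> embedding, int topK) throws IOException {
        JsonArrayBuilder vector = Json.createArrayBuilder();
        embedding.forEach(vector::add);

        // Pinecone's /query endpoint takes the query vector and the number of results
        JsonObject body = Json.createObjectBuilder()
                .add("vector", vector)
                .add("topK", topK)
                .add("includeValues", false)
                .build();

        Request request = new Request.Builder()
                .url("https://" + indexHost + "/query")
                .header("Api-Key", apiKey)
                .post(RequestBody.create(body.toString(), JSON_TYPE))
                .build();

        try (Response response = client.newCall(request).execute()) {
            if (!response.isSuccessful()) {
                throw new IOException("Pinecone query failed: HTTP " + response.code());
            }
            // The response contains a "matches" array with the id (our key) and
            // similarity score of each result; parsing is omitted here (see note below)
            return response.body().string();
        }
    }
}
```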
This query returns the top-K matching results from the DB. From those results we can extract the keys and fetch the real content from our primary data store.
Note: results are returned as JSON that needs further parsing; we omitted that part for the sake of simplicity, along with proper exception handling.
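To tie the pieces together, here is a hypothetical end-to-end wiring of the sketches above (the index host, environment variable names, entry key, and sample texts are all illustrative):

```java
public class RagDemo {
    public static void main(String[] args) throws Exception {
        // Illustrative configuration; replace with your own keys and index host
        String indexHost = "my-index-xxxx.svc.us-east-1.pinecone.io";
        EmbeddingGenerator generator = new EmbeddingGenerator(System.getenv("OPENAI_API_KEY"));
        PineconeInserter inserter = new PineconeInserter(indexHost, System.getenv("PINECONE_API_KEY"));
        PineconeQuerier querier = new PineconeQuerier(indexHost, System.getenv("PINECONE_API_KEY"));

        // Populate: embed a searchable entry and upsert it under its key
        inserter.upsert("entry-42", generator.generateEmbedding("Ergonomic walnut standing desk"));

        // Search: embed the user's query and retrieve the raw top-K response
        String matchesJson = querier.query(generator.generateEmbedding("wooden office furniture"), 5);
        System.out.println(matchesJson);
    }
}
```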
With that, we have a fully working RAG solution covering both insertion and querying. What remains is ensuring that the DB is properly indexed and fine-tuning the text we embed so that our semantic-rich search works properly.