RAG Workflow

Fig 1. Basic RAG Workflow

RAG typically involves a set of components, as illustrated in the image above. The input data set, or knowledge bank, is usually the enterprise's proprietary data containing the information to be provided to the user. The knowledge bank typically spans large amounts of data across many documents. The context window size, a property of every LLM, limits the maximum number of words we can pass as input to the model. Different LLMs have different context window sizes, but they are usually on the order of thousands of words. This means we cannot pass the entire knowledge base to the LLM and ask it to find the answer within it. Instead, we divide the documents into sections known as chunks, whose size is chosen based on the context window of the LLM we are using.
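As a minimal sketch of this chunking step, assuming a LangChain text splitter and an illustrative file path and chunk size (the right values depend on your LLM's context window):

```python
from langchain.document_loaders import TextLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter

# Load a document from the knowledge bank (the path is illustrative).
docs = TextLoader("knowledge_bank/policy_manual.txt").load()

# Split into overlapping chunks sized well below the LLM's context window.
splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=100)
chunks = splitter.split_documents(docs)

print(f"Split {len(docs)} document(s) into {len(chunks)} chunks")
```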

Each document chunk needs to be converted into a numeric representation known as an embedding before it can be stored in the vector database. Several models convert text into embeddings, and they are usually much smaller than the LLM itself. These embeddings enable fast search and retrieval of document chunks from the vector database. Whenever the user asks a question, the embedding model converts the question into an embedding, which is used for a semantic-similarity search over the vector DB to retrieve the most relevant chunks. The retrieved chunks are then re-ranked so that the most relevant results appear at the top before being passed to the LLM.
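A sketch of the embedding and retrieval step, assuming OpenAI embeddings and an in-memory Qdrant instance via LangChain (the `chunks` list comes from the splitting sketch above; the collection name and question are illustrative):

```python
from langchain.embeddings import OpenAIEmbeddings
from langchain.vectorstores import Qdrant

# Embed each chunk and store it in an in-memory Qdrant collection.
embeddings = OpenAIEmbeddings()
vectorstore = Qdrant.from_documents(
    chunks,
    embeddings,
    location=":memory:",          # point at a Qdrant server URL in production
    collection_name="knowledge_bank",
)

# At query time, the question is embedded and the closest chunks are returned.
question = "What is the refund policy for enterprise customers?"
retrieved = vectorstore.similarity_search(question, k=4)
for doc in retrieved:
    print(doc.page_content[:80])
```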

The LLM generation module typically combines the user query, the retrieved chunks, and a system prompt guiding the LLM to answer the question using the retrieved context. If all the preceding modules are implemented well, most LLMs can readily generate a relevant answer to the question.
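A minimal sketch of this generation step, assuming a recent LangChain version with a ChatOpenAI model (the prompt wording and model name are illustrative, and `question` and `retrieved` come from the retrieval sketch above):

```python
from langchain.chat_models import ChatOpenAI
from langchain.schema import HumanMessage, SystemMessage

# Assemble the retrieved chunks into a single context block.
context = "\n\n".join(doc.page_content for doc in retrieved)

system_prompt = (
    "Answer the user's question using only the context provided. "
    "If the context does not contain the answer, say you don't know."
)
user_prompt = f"Context:\n{context}\n\nQuestion: {question}"

llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)  # model name is illustrative
answer = llm.invoke([SystemMessage(content=system_prompt),
                     HumanMessage(content=user_prompt)])
print(answer.content)
```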

To evaluate the performance of an implemented RAG system, the components discussed above must be evaluated separately. The following sections explain how we use RAGAS and the different metrics it provides to calibrate the performance of RAG.

RAGAS

RAGAS is an open-source framework for evaluating the different components of a RAG system against pre-defined metrics. To ensure good performance, the retrieval and generation components of RAG have to be evaluated separately. Below are some of the metrics RAGAS computes for those components.

Context Precision measures what proportion of the retrieved chunks are relevant to the query

Context Recall measures what proportion of all the relevant chunks were actually retrieved

Faithfulness measures how factually consistent the generated answer is with the retrieved chunks

Answer Relevancy measures how relevant the generated answer is to the question

Noise Sensitivity measures how often the system generates incorrect responses when irrelevant documents appear in the retrieved context

Implementing RAGAS evaluation

RAGAS integrates with popular LLM tooling such as LangChain and OpenAI models. The evaluation of these metrics on a specific dataset is illustrated below. In the code below, we build a RAG pipeline over the customer's dataset, use LangChain utilities to chunk the data, and store the chunks in a Qdrant vector database.

Fig 2. RAGAS Metrics Generation for a Custom Dataset
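As a rough sketch of what the metric computation in Fig 2 might look like, assuming the ragas 0.1-style `evaluate` API and a hand-built evaluation set (the question, answer, contexts, and ground truth below are placeholders, not the customer's data):

```python
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import (
    answer_relevancy,
    context_precision,
    context_recall,
    faithfulness,
)

# Each row pairs a question with the RAG system's answer, the retrieved
# chunks, and a reference ("ground truth") answer. Values are placeholders.
eval_data = {
    "question": ["What is the refund policy for enterprise customers?"],
    "answer": ["Enterprise customers can request a refund within 30 days."],
    "contexts": [[
        "Refunds are available to enterprise customers within 30 days of purchase.",
    ]],
    "ground_truth": ["Enterprise customers may request refunds within 30 days of purchase."],
}

# evaluate() uses an LLM judge (OpenAI models by default) to score each metric.
results = evaluate(
    Dataset.from_dict(eval_data),
    metrics=[context_precision, context_recall, faithfulness, answer_relevancy],
)
print(results)
```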