From Demo to Deployment: Why Scaling RAG Is No Easy Task
❓RAG demos are easy. RAG in production is much harder. Why is that?
Recently I wrote about Retrieval-Augmented Generation (RAG), the technique that enables enterprises to leverage the power of large language models (link in comment).
Many companies have found that their technology teams can demonstrate RAG quite rapidly. However, when the time comes to put the solution into production, they face difficulties: users do not adopt the solution easily, and many feel it does not add much value.
We will first see why the demos are easy and then reflect on the challenges of productionising RAG.
Why are RAG demos easy?
Simple architecture: The principle of RAG is easy to grasp. A basic RAG pipeline has only a few components: source chunking, a vector database, vector similarity matching and an LLM interface. Engineers can understand the architecture quickly.
Framework availability: Helper frameworks such as Langchain and LlamaIndex make matters even simpler. They come with built-in support for chunking, vector databases and LLMs, so a working RAG pipeline can be built in a few lines of code (see the sketch after this list).
The initial results of such a pipeline are very impressive, mainly because of the language abilities of the LLM. Great demonstrations can therefore be built quickly with RAG.
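To illustrate how little code a basic pipeline needs, here is a minimal sketch using LlamaIndex. It is only an illustration: the exact imports vary by library version (this assumes llama-index 0.10+), and it presumes an OpenAI API key in the environment and a local ./data folder of documents.

```python
# Minimal RAG pipeline sketch with LlamaIndex (assumes llama-index 0.10+,
# an OpenAI API key in the environment, and documents in ./data).
from llama_index.core import SimpleDirectoryReader, VectorStoreIndex

# Load and chunk the source documents from a local folder.
documents = SimpleDirectoryReader("data").load_data()

# Build an in-memory vector index (embeddings + vector store).
index = VectorStoreIndex.from_documents(documents)

# Retrieve matching chunks for a question and let the LLM answer from them.
query_engine = index.as_query_engine()
print(query_engine.query("What does our refund policy say?"))
```

A handful of lines like these is often all it takes to produce an impressive demo, which is exactly why the gap to production surprises so many teams.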
Now we will see why using RAG in production is challenging.
Correct but not comprehensive: For simpler questions, the replies are often satisfactory. For complex questions, however, users find the answers correct but incomplete: the reply covers only the basic aspects, not what they really needed.
Source requirements: For RAG to work, the necessary knowledge sources must be added to the vector index. It takes considerable effort to judge what users need and build the index accordingly. And it does not stop there: the index must be continually updated with current information.
Easy to add, difficult to remove: Once a source is added to the index, it starts influencing the answers. Often, multiple sources contain conflicting information (for example, an old and a new price list), which makes the answers unreliable. One simple mitigation is sketched after this list.
Data privacy: Quick RAG implementations typically use an external LLM API such as OpenAI's or Google's. In such cases, an organization's internal data is sent to the outside world, creating potential data privacy issues.
Latency: As the index grows, selecting the right material for a query takes longer, which is unacceptable for real-time use.
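To make the conflicting-sources problem concrete, here is a small sketch in plain Python. The idea is to store an effective date alongside each chunk and keep only the newest version of each source before the chunks reach the LLM. The records and field names here are hypothetical, not tied to any specific framework; in a real pipeline this metadata would live in the vector store.

```python
from datetime import date

# Hypothetical chunk records; in practice this metadata sits in the vector store.
chunks = [
    {"source": "price_list", "effective_date": date(2022, 1, 1), "text": "Old prices ..."},
    {"source": "price_list", "effective_date": date(2024, 1, 1), "text": "New prices ..."},
    {"source": "faq",        "effective_date": date(2023, 6, 1), "text": "FAQ ..."},
]

def latest_per_source(retrieved):
    """Keep only the most recent version of each source, so the LLM is not
    handed conflicting information (e.g. an old and a new price list)."""
    newest = {}
    for chunk in retrieved:
        key = chunk["source"]
        if key not in newest or chunk["effective_date"] > newest[key]["effective_date"]:
            newest[key] = chunk
    return list(newest.values())

context = latest_per_source(chunks)  # only these chunks go into the prompt
```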
Overcoming these challenges involves building a more complex pipeline. I will describe that in a later post, but some techniques that can help are reranking, knowledge graphs, internal models and trust layers; a minimal reranking sketch follows below.
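As a taste of one of those techniques, here is a hedged reranking sketch using a cross-encoder from the sentence-transformers library. The model name, query and candidate texts are assumptions for illustration; in a production pipeline the candidates would be the chunks returned by the first-stage vector search.

```python
from sentence_transformers import CrossEncoder

# Candidate chunks from the first-stage vector search (illustrative only).
query = "How do I cancel an enterprise subscription?"
candidates = [
    "To cancel a personal plan, go to Settings ...",
    "Enterprise subscriptions are cancelled by contacting your account manager ...",
    "Our refund policy allows cancellations within 30 days ...",
]

# Score each (query, chunk) pair jointly; keep the highest-scoring chunks.
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")
scores = reranker.predict([(query, c) for c in candidates])
top_chunks = [c for _, c in sorted(zip(scores, candidates), reverse=True)][:2]
```

Because the cross-encoder reads the query and the chunk together, it can catch relevance that pure vector similarity misses, which directly addresses the "correct but not comprehensive" complaint.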
#llm #retrievalaugmentedgeneration #aiinbusiness
By Devesh Rajadhyax
Co-Founder, Cere Labs