RAG 101: The Secret Behind Smarter AI

A simple guide to how modern AI models retrieve real-time knowledge—without retraining

The Web Changed Everything—RAG Will Too

A cold night in 1989.

Deep inside CERN—the European Organization for Nuclear Research—a frustrated software engineer named Tim Berners-Lee was staring at his computer screen.

He was stuck. Confused.

CERN's scientific breakthroughs were being buried under heaps of disconnected documents.

The data was there, but accessing it was a nightmare.

Every insight from every experiment was isolated, making it extremely difficult to draw any meaningful conclusions.

He wanted a way to retrieve knowledge on demand, across systems. Instantly. Easily.

Tim Berners-Lee (AI Image)

So he built the World Wide Web: a system for linking and sharing documents across the Internet.

It changed how humans accessed and shared knowledge. Forever.

Today, AI is facing its own Tim Berners-Lee moment: connecting traditional AI models to the outside world.

The Problem with LLMs

Large Language Models (LLMs) are incredibly powerful, but only up to a point.

They're trained on vast datasets, but that training is frozen in time.

In other words, they can’t learn from data they haven’t seen. Their knowledge is locked at the time of training.

They don’t know your company’s latest support documents, or that Mars might’ve just been found to have three more moons (one of which NASA named Jeff for reasons no one quite understands).

They also can’t cite their sources. Ask them a question and you might get an answer, but not the origin.

And if your use case demands accuracy, traceability, or up-to-date data… well, you’re out of luck.

RAG changes the game.

RAG: Retrieval-Augmented Generation

What is RAG?

RAG is a framework that lets LLMs access external knowledge at runtime.

Instead of relying solely on what they were trained on, they retrieve relevant documents first—then generate a response based on that data.

It is similar to adding a real-time research assistant to your AI.

How RAG Works: A 3-Step Flow

1. Indexing

First, documents are split into smaller parts (called chunks). Each chunk is then turned into a vector.

A vector is like a coordinate (or address) that tells you where a piece of text "lives" in a space based on its meaning.

These coordinates are just a list of numbers that capture the meaning of the text. Texts with similar ideas end up close to each other.

Example:

Let’s say a document has this content: “Boil pasta in salted water for 10 minutes.” First, that sentence is split off as a chunk.

That chunk is then passed through an embedding model (like OpenAI’s or Sentence Transformers), which turns it into a vector.

Let’s say the result is: [0.12, 0.56, -0.34, 0.98]

Another chunk: “Cook spaghetti by boiling it until soft.”…might get: [0.14, 0.52, -0.31, 0.95]

Both sentences express similar ideas and are contextually related to each other. As you can see, their vectors are close together.

All such vectors are stored in a vector database.
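
Here’s a minimal sketch of the indexing step in Python, assuming the sentence-transformers library for embeddings and a plain in-memory list standing in for a real vector database (production setups typically use FAISS, Chroma, Pinecone, or similar):

```python
# Minimal indexing sketch (assumes: pip install sentence-transformers)
from sentence_transformers import SentenceTransformer

# Toy chunks; real pipelines use splitters that respect sentence and
# paragraph boundaries and add some overlap between chunks.
chunks = [
    "Boil pasta in salted water for 10 minutes.",
    "Cook spaghetti by boiling it until soft.",
    "The moon orbits the Earth.",
]

# The embedding model turns each chunk into a vector of numbers.
model = SentenceTransformer("all-MiniLM-L6-v2")
vectors = model.encode(chunks)  # one vector per chunk, 384 numbers each

# Stand-in "vector database": keep each chunk next to its vector.
vector_store = list(zip(chunks, vectors))
print(f"Indexed {len(vector_store)} chunks, {vectors.shape[1]} dimensions each")
```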

2. Retrieval

When a user asks a question, the system first turns that question into a vector, using the same method as before.

It then compares that question vector with all the stored vectors from step 1 to find the closest matches—i.e., the most relevant pieces of information.

This is called semantic search—looking for meaning, not exact wording.

Think of it like shouting into a vast, multidimensional room where every document (a bunch of words) floats in space. Similar ideas, like food, apple, and watermelon, cluster together.

When you ask, “Which planet has the most moons?”, the question echoes through the room:

“HEY, WHO HERE KNOWS MOONS?!”

The closest documents—those with matching content—shout back. Those are what get passed to the LLM to craft an accurate answer.

Behind the scenes, smart tools like splitters and document loaders help slice and organize content to ensure the best matches rise to the top.

Example:

Let’s make this concrete.

Say the user’s query is: “How do I make pasta?”

The question is turned into a vector: [0.13, 0.54, -0.33, 0.96]

The system then compares this question vector with the vectors in the database using a similarity or distance measure (like cosine similarity or Euclidean distance).

Such a measure tells you how alike two vectors are. With a distance like Euclidean, lower values mean higher relevance; with cosine similarity, higher values do. Either way, the chunks whose vectors sit closest to the question vector are treated as the most relevant.

For example:

| Chunk Text | Vector | Distance to Query |
| --- | --- | --- |
| "Boil pasta in salted water..." | [0.12, 0.56, -0.34, 0.98] | 0.02 |
| "Cook spaghetti by boiling..." | [0.14, 0.52, -0.31, 0.95] | 0.03 |
| "The moon orbits the Earth." | [0.80, -0.33, 0.44, -0.12] | 1.5 |

The first two are much closer to the query vector, so those chunks are retrieved.

Note that this is a simplified example. In real-world systems, each vector has hundreds or even thousands of dimensions (768 and 1,536 are common sizes), not just four.
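
Here’s a small, self-contained sketch of the retrieval step, reusing the same toy chunks and ranking them by cosine similarity (the library and model choice are just one option):

```python
# Retrieval sketch (assumes: pip install sentence-transformers numpy)
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")

chunks = [
    "Boil pasta in salted water for 10 minutes.",
    "Cook spaghetti by boiling it until soft.",
    "The moon orbits the Earth.",
]
chunk_vectors = model.encode(chunks)

def cosine_similarity(a, b):
    # Higher means closer in meaning; 1.0 means pointing the same way.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Turn the question into a vector with the same model used for indexing.
query_vector = model.encode(["How do I make pasta?"])[0]

# Score every chunk against the question and keep the closest matches.
ranked = sorted(
    zip(chunks, chunk_vectors),
    key=lambda pair: cosine_similarity(query_vector, pair[1]),
    reverse=True,
)
top_chunks = [chunk for chunk, _ in ranked[:2]]
print(top_chunks)  # the two pasta chunks rank above the moon chunk
```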

3. Generation

The language model (LLM) gets those top-matching chunks and uses them to generate a response.

Since the answer is based on actual source material, it’s usually more accurate, with fewer mistakes or “hallucinations.”

You can even trace where the information came from—useful for verifying facts or citing sources.
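
And a minimal sketch of the generation step, assuming the OpenAI Python client; the model name and prompt wording are placeholders, not a recommendation:

```python
# Generation sketch (assumes: pip install openai, OPENAI_API_KEY set)
from openai import OpenAI

client = OpenAI()

question = "How do I make pasta?"
retrieved_chunks = [  # output of the retrieval step above
    "Boil pasta in salted water for 10 minutes.",
    "Cook spaghetti by boiling it until soft.",
]

# Put the retrieved chunks into the prompt so the answer stays grounded in them.
context = "\n".join(f"- {chunk}" for chunk in retrieved_chunks)
prompt = (
    "Answer the question using only the context below, and mention which "
    f"context line you relied on.\n\nContext:\n{context}\n\nQuestion: {question}"
)

response = client.chat.completions.create(
    model="gpt-4o-mini",  # placeholder model name
    messages=[{"role": "user", "content": prompt}],
)
print(response.choices[0].message.content)
```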

LLM + RAG

Let’s say your company sells electric bikes. Sometimes the bike won’t start if it’s been stored in very low temperatures. A customer might ask your AI chatbot for help.

Here’s how different setups might respond:

  • LLM Only: “Cold can affect batteries. Try warming it up to room temperature.”

  • RAG Only: “Check that the battery is properly seated and above 5°C before starting.”

  • LLM + RAG: “The battery may be too cold for normal operation. Warm it to at least 5°C, ensure it’s properly seated (per the manual), and try starting again.”

Neither the LLM nor RAG completely solves the problem on its own. The LLM understands the cause but doesn’t give specific next steps. RAG surfaces the specific troubleshooting steps but misses the broader context.

Together, LLM + RAG gives you both broad reasoning and precise, context-aware instruction.

RAG vs No-RAG

Note: Tools like ChatGPT’s web browsing mode or Perplexity.ai retrieve real-time information, but they typically use live web search—not a pre-indexed vector store. Similar in spirit to RAG, but technically different.

Where RAG Falls Short

RAG is powerful. But it’s not perfect.

While it has real-time relevance and domain-specific knowledge, it comes with trade-offs that can’t be ignored.

🧩 System Complexity

RAG has a lot of moving parts: document indexing, retrieval, embedding, re-ranking, and generation. If any one piece misfires—say, the index is poorly structured or the retrieval logic mismatches the query—everything downstream breaks.

🔒 Data Security Concerns

Many worry about putting personal or confidential data into vector embeddings. Once encoded, it can be hard to fully control where that data ends up or how it might be exposed.

🧠 Context Windows Are Catching Up

Newer LLMs can now handle massive context windows—some reaching millions of tokens. That’s enough to preload entire product manuals or research papers directly into the prompt. In these cases, RAG might be overkill.

That’s why several alternatives and refinements to plain RAG have been explored, such as:

→ Graph RAG: 

Instead of only using flat vectors, Graph RAG builds a knowledge graph where concepts (like “Paris” or “Eiffel Tower”) are linked by relationships.

When you ask, “Where is the Eiffel Tower?”, the system doesn’t just match keywords—it follows the graph structure to find contextual answers. This reduces ambiguity and hallucinations.
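
A toy illustration of the idea, using a hand-written list of (subject, relation, object) facts in place of a real knowledge graph store:

```python
# Toy knowledge graph as (subject, relation, object) triples.
triples = [
    ("Eiffel Tower", "located_in", "Paris"),
    ("Paris", "located_in", "France"),
    ("Eiffel Tower", "built_in", "1889"),
]

def facts_about(entity):
    # Follow every edge touching the entity and turn it into a sentence.
    return [
        f"{s} {r.replace('_', ' ')} {o}"
        for s, r, o in triples
        if entity in (s, o)
    ]

# "Where is the Eiffel Tower?" -> gather connected facts, then let the LLM
# answer from them (following two hops gives "in Paris, France").
print(facts_about("Eiffel Tower"))
print(facts_about("Paris"))
```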

→ Query Rewriting:

Before searching, the user’s original query is sometimes simplified or adjusted by a mini language model.

If someone asks, “Hey there, can you give me the temperature in Celsius and Fahrenheit for London right now, please?”, the system might rewrite it to “Current temperature in London in Celsius and Fahrenheit?” It removes unnecessary bits—like “Hey there!”—and refines the main question.

This is helpful when searching large databases, as it focuses on the core request only.
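
A possible sketch of query rewriting, using a small LLM call; the model name and instructions are illustrative assumptions:

```python
# Query rewriting sketch (assumes: pip install openai, OPENAI_API_KEY set)
from openai import OpenAI

client = OpenAI()

def rewrite_query(raw_query):
    # Ask a small model to strip greetings and filler, keeping the core request.
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder for a small, cheap model
        messages=[
            {
                "role": "system",
                "content": "Rewrite the user's message as a short, search-friendly "
                           "query. Remove greetings and filler. Return only the query.",
            },
            {"role": "user", "content": raw_query},
        ],
    )
    return response.choices[0].message.content.strip()

print(rewrite_query(
    "Hey there, can you give me the temperature in Celsius and Fahrenheit "
    "for London right now, please?"
))
# e.g. "Current temperature in London in Celsius and Fahrenheit"
```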

→ Hybrid Search:

Combines the strengths of vector search (semantic similarity) with keyword search (exact matches).

If a user searches “simple tomato sauce recipe,” vectors may bring back “marinara,” while keyword matching ensures “tomato” and “sauce” are both hit. Together, they cast a wider and more precise net.
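
One way to sketch the blend is a weighted mix of a keyword-overlap score and a semantic similarity score; the 50/50 weights below are arbitrary:

```python
# Hybrid search sketch (assumes: pip install sentence-transformers numpy)
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")

docs = [
    "Classic marinara: simmer crushed tomatoes with garlic.",
    "Simple tomato sauce recipe for pasta night.",
    "How to bake sourdough bread at home.",
]
query = "simple tomato sauce recipe"

def keyword_score(q, d):
    # Fraction of query words that literally appear in the document.
    q_words = set(q.lower().split())
    return len(q_words & set(d.lower().split())) / len(q_words)

def semantic_score(q_vec, d_vec):
    return float(np.dot(q_vec, d_vec) / (np.linalg.norm(q_vec) * np.linalg.norm(d_vec)))

query_vec = model.encode([query])[0]
doc_vecs = model.encode(docs)

for doc, d_vec in zip(docs, doc_vecs):
    combined = 0.5 * keyword_score(query, doc) + 0.5 * semantic_score(query_vec, d_vec)
    print(f"{combined:.2f}  {doc}")
# The marinara doc scores on meaning alone; the literal "tomato sauce" doc scores on both.
```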

→ Re-Ranking:

After retrieval, a second model re-evaluates the relevance of documents and reorders them.

Say a query returns 10 snippets—re-ranking pushes the top 2 most relevant ones to the front, increasing precision and trust in the final response.
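
A sketch of re-ranking with a cross-encoder from the sentence-transformers library; the model name is one commonly used public checkpoint, not a requirement:

```python
# Re-ranking sketch (assumes: pip install sentence-transformers)
from sentence_transformers import CrossEncoder

# A cross-encoder reads the query and a document together, which is slower
# than vector search but usually better at judging relevance.
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

query = "How do I make pasta?"
retrieved = [  # e.g. the snippets returned by the first retrieval pass
    "The moon orbits the Earth.",
    "Cook spaghetti by boiling it until soft.",
    "Boil pasta in salted water for 10 minutes.",
]

scores = reranker.predict([(query, doc) for doc in retrieved])
reranked = [
    doc for doc, _ in sorted(zip(retrieved, scores), key=lambda p: p[1], reverse=True)
]
print(reranked[:2])  # the pasta snippets move to the front
```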

→ Cache-Augmented Generation (CAG): 

Rather than retrieving documents every time, CAG preloads frequently used knowledge (like company FAQs or policies) directly into the model’s short-term memory. This reduces latency, improves reliability, and minimizes security exposure.

If employees frequently ask about holiday work policies, CAG preloads that info so it can be answered instantly, without a database search.

Think of RAG as going to the library every time you need an answer. CAG is like having the most-used books already open on your desk.
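
A minimal sketch of the idea: the FAQ is loaded once and rides along with every request instead of being retrieved per question (the FAQ text and model name are placeholders):

```python
# CAG-style sketch (assumes: pip install openai, OPENAI_API_KEY set)
from openai import OpenAI

client = OpenAI()

# Preloaded knowledge: read once at startup, not retrieved per question.
COMPANY_FAQ = """
Holiday work policy: employees may work remotely on public holidays
with manager approval. Holiday overtime is paid at 1.5x the normal rate.
"""

def answer(question):
    # The FAQ rides along with every request; no vector search happens here.
    # Providers that cache repeated prompt prefixes make this cheap to reuse.
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder
        messages=[
            {"role": "system", "content": "Answer using this FAQ:\n" + COMPANY_FAQ},
            {"role": "user", "content": question},
        ],
    )
    return response.choices[0].message.content

print(answer("Can I work from home on a public holiday?"))
```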

What is Agentic RAG?

RAG has another limitation.

Traditional RAG pulls from a fixed data source. It works. But it’s rigid.

Agentic RAG adds a layer of decision-making. The model can choose which data source to query—and how to respond, whether with text, a chart, or even code.

Think of it as “RAG with judgment”.

Example:

  • “What’s our holiday remote work policy?” → Pulls from internal HR docs.

  • “What do most tech companies offer for remote work?” → Switches to public industry data.

  • “Who won the World Series in 2015?” → Replies: “I don’t have that information.”

In practice, Agentic RAG acts like a smart router—dynamically picking the right source for the right query to deliver the most relevant answer.
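
A stripped-down sketch of that routing idea, using an LLM call as the classifier that picks a source; the source names and model are placeholders:

```python
# Agentic RAG routing sketch (assumes: pip install openai, OPENAI_API_KEY set)
from openai import OpenAI

client = OpenAI()

SOURCES = {
    "internal_hr": "search internal HR documents",
    "public_web": "search public industry data",
    "none": "decline politely; the question is out of scope",
}

def route(question):
    # Use the model as a classifier that names the best source for the question.
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder
        messages=[
            {
                "role": "system",
                "content": "Pick the best data source for the question. Reply with "
                           "exactly one of: internal_hr, public_web, none.",
            },
            {"role": "user", "content": question},
        ],
    )
    choice = response.choices[0].message.content.strip()
    return choice if choice in SOURCES else "none"

for q in [
    "What's our holiday remote work policy?",
    "What do most tech companies offer for remote work?",
    "Who won the World Series in 2015?",
]:
    source = route(q)
    print(f"{q} -> {source}")
```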

TL;DR

  • LLMs are brilliant, but static.

  • RAG brings them to life with real-time, relevant information.

  • The future is modular: RAG, CAG, and Agentic architectures are evolving rapidly.

Just like the Web unlocked human knowledge at scale, RAG is unlocking machine intelligence that’s dynamic, reliable, and grounded in real-world context.

Cheers,

Teng Yan & Ravi