Building a Simple RAG System in Python

Learn how Retrieval-Augmented Generation works and build a simple version that lets an AI answer questions about your own documents.
Python
AI
Published

May 6, 2026

What is RAG?

Large language models like Gemini know a lot — but they don’t know your data. Retrieval-Augmented Generation (RAG) fixes this by finding relevant passages from your own documents and handing them to the model before it answers. Instead of guessing, the model reads the right content first.

The process has three steps:

1. Embed your documents — convert each one into a list of numbers representing its meaning.
2. Retrieve — when a question is asked, find the documents closest in meaning to that question.
3. Generate — send those documents plus the question to the LLM and let it answer.

Install Required Libraries

!pip install google-generativeai numpy scikit-learn

Setup and Document Library

We’ll use a small set of data science descriptions as our “documents.” In a real system these would be loaded from files or a database.

import os
import numpy as np
import google.generativeai as genai
from sklearn.metrics.pairwise import cosine_similarity

genai.configure(api_key=os.environ["GEMINI_API_KEY"])
model = genai.GenerativeModel("gemini-1.5-flash")

documents = [
    "Pandas is a Python library for data manipulation. It provides DataFrames for working with tabular data.",
    "NumPy is the foundation of scientific computing in Python. It provides fast array operations.",
    "Scikit-learn is a machine learning library for Python with tools for classification, regression, and clustering.",
    "Matplotlib is a plotting library for Python used to create charts, graphs, and visualizations.",
    "PyTorch is a deep learning framework developed by Meta, widely used for building neural networks.",
    "Polars is a fast DataFrame library written in Rust. It is often used as a high-performance alternative to Pandas.",
]

print(f"{len(documents)} documents loaded.")

Step 1 — Embed the Documents

An embedding converts text into a list of numbers that captures its meaning. Similar texts produce similar numbers. We embed all documents once and store the results.

def embed(text):
    result = genai.embed_content(model="models/text-embedding-004", content=text)
    return np.array(result["embedding"])

doc_embeddings = np.array([embed(doc) for doc in documents])
print(f"Embedded {len(documents)} documents. Each has {doc_embeddings.shape[1]} dimensions.")
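To build intuition for "similar texts produce similar numbers," here is a toy example with hand-made 3-dimensional vectors standing in for real embeddings (which have hundreds of dimensions). The vectors and the `cosine` helper are invented for illustration; they are not part of the RAG pipeline itself.

```python
import numpy as np

# Toy 3-dimensional "embeddings" (real ones have hundreds of dimensions).
# Vectors pointing in similar directions represent similar meanings.
cat = np.array([0.9, 0.1, 0.0])
kitten = np.array([0.8, 0.2, 0.1])
car = np.array([0.0, 0.1, 0.9])

def cosine(a, b):
    # Cosine similarity: the dot product of the vectors divided by
    # the product of their lengths. Ranges from -1 to 1.
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

print(f"cat vs kitten: {cosine(cat, kitten):.2f}")  # high — similar direction
print(f"cat vs car:    {cosine(cat, car):.2f}")     # near zero — unrelated
```

This is exactly the comparison `cosine_similarity` from scikit-learn performs in the retrieval step, just written out by hand.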

Step 2 — Retrieve Relevant Documents

When a question comes in, we embed it the same way and use cosine similarity to find which documents are closest in meaning.

def retrieve(question, top_k=2):
    q_embedding = embed(question).reshape(1, -1)
    scores = cosine_similarity(q_embedding, doc_embeddings)[0]
    top_indices = np.argsort(scores)[::-1][:top_k]
    return [documents[i] for i in top_indices]

# Test retrieval
question = "What library should I use to make charts?"
print("Question:", question)
print("Retrieved documents:")
for doc in retrieve(question):
    print(f"  - {doc}")
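The `argsort` line does the ranking work in `retrieve`. Here is the same trick on a hand-made score array (the scores are invented for illustration) to make the index manipulation concrete:

```python
import numpy as np

# Invented similarity scores for six documents (illustration only).
scores = np.array([0.12, 0.85, 0.30, 0.91, 0.05, 0.44])

# np.argsort returns indices that would sort the scores ascending...
ascending = np.argsort(scores)   # [4, 0, 2, 5, 1, 3]
# ...so reversing and slicing gives the indices of the top-k scores.
top_2 = ascending[::-1][:2]      # [3, 1]

print(top_2)  # documents 3 and 1 are the best matches
```

Those indices are then used to look up the matching entries in the `documents` list.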

Step 3 — Generate an Answer

We combine the retrieved documents with the question into a single prompt and send it to Gemini. The model is instructed to use only the provided context, which keeps its answers grounded in your documents.

def ask(question):
    context = "\n".join(f"- {doc}" for doc in retrieve(question))
    prompt = f"""Answer the question using only the context below.
If the answer isn't in the context, say "I don't have that information."

Context:
{context}

Question: {question}"""
    return model.generate_content(prompt).text

print("Q: What library should I use to make charts?")
print("A:", ask("What library should I use to make charts?"))

print("\nQ: What is PyTorch used for?")
print("A:", ask("What is PyTorch used for?"))

print("\nQ: How do I book a flight?")
print("A:", ask("How do I book a flight?"))  # Should say it doesn't know

Summary

In this post we built a working RAG system from scratch:

- Embedded a set of documents using Gemini’s embedding model
- Retrieved the most relevant documents for any given question using cosine similarity
- Generated grounded answers by passing retrieved context to Gemini

The last example — asking about booking a flight — shows why RAG improves reliability: the model correctly says it doesn’t have that information rather than making something up. Real-world RAG systems use vector databases like Pinecone or ChromaDB to scale this to millions of documents, but the core pipeline is exactly what you built here.
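Under the hood, a vector database wraps the same store-and-rank loop you just wrote behind an index. A minimal in-memory sketch of that idea follows; the class name and made-up 3-dimensional vectors are invented for illustration (a real vector database would also use an approximate index instead of a brute-force scan):

```python
import numpy as np

class TinyVectorStore:
    """Minimal in-memory stand-in for a vector database (illustration only)."""

    def __init__(self, dim):
        self.vectors = np.empty((0, dim))
        self.texts = []

    def add(self, text, vector):
        # Real vector DBs build an approximate index here; we just append.
        self.vectors = np.vstack([self.vectors, vector])
        self.texts.append(text)

    def query(self, vector, top_k=2):
        # Brute-force cosine ranking — exact, but O(n) per query.
        norms = np.linalg.norm(self.vectors, axis=1) * np.linalg.norm(vector)
        scores = self.vectors @ vector / norms
        top = np.argsort(scores)[::-1][:top_k]
        return [(self.texts[i], scores[i]) for i in top]

# Usage with made-up 3-d vectors standing in for real embeddings:
store = TinyVectorStore(dim=3)
store.add("pandas doc", np.array([0.9, 0.1, 0.0]))
store.add("numpy doc", np.array([0.8, 0.3, 0.1]))
store.add("flights doc", np.array([0.0, 0.2, 0.9]))

for text, score in store.query(np.array([0.85, 0.2, 0.0])):
    print(f"{text}: {score:.2f}")
```

Swapping this for Pinecone or ChromaDB changes the `add` and `query` internals, but the embed–retrieve–generate pipeline around it stays the same.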