How to Create Your Own RAG with Free LLM Models and a Knowledge Base

Alexander Uspenskiy
7 min read · Nov 10, 2024

This article explores the implementation of a straightforward yet effective question-answering system that combines modern transformer-based models. The system uses T5 (Text-to-Text Transfer Transformer) for answer generation and Sentence Transformers for semantic similarity matching.

How to write your own RAG with free LLM models and a knowledge base

In my previous article, I explained how to create a simple translation API with a web interface using a free foundational LLM model. This time, let’s dive into building a Retrieval-Augmented Generation (RAG) system using free transformer-based LLM models and a knowledge base.

RAG (Retrieval-Augmented Generation) is a technique that combines two key components:

Retrieval: First, it searches through a knowledge base (like documents, databases, etc.) to find relevant information for a given query. This usually involves:

Converting text into embeddings (numerical vectors that represent meaning)

Finding similar content using similarity measures (like cosine similarity)

Selecting the most relevant pieces of information

Generation: Then it uses a language model (like T5 in our code) to generate a response by:

Combining the retrieved information with the original question

Creating a natural language response based on this context

In our code:

  • The SentenceTransformer handles the retrieval part by creating embeddings
  • The T5 model handles the generation part by creating answers (a minimal sketch of this retrieve-then-generate flow follows below)
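
To make the two steps concrete, here is a minimal, self-contained sketch of the flow. The passages and question are made up purely for illustration; the full system later in this article follows the same pattern:

from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity

# Toy knowledge base (illustrative only)
passages = [
    "The capital of France is Paris.",
    "The largest planet is Jupiter.",
]

encoder = SentenceTransformer('paraphrase-MiniLM-L6-v2')
passage_embeddings = encoder.encode(passages)      # Retrieval: embed the knowledge base

question = "What is the capital of France?"
question_embedding = encoder.encode([question])    # Retrieval: embed the query

scores = cosine_similarity(question_embedding, passage_embeddings)[0]
context = passages[scores.argmax()]                # Retrieval: pick the most similar passage

# Generation: the retrieved context is combined with the question
# and handed to a language model (T5 in the full code below).
prompt = f"Given the context, what is the answer to the question: {question} Context: {context}"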

Benefits of RAG:

  • More accurate responses since they’re grounded in specific knowledge
  • Reduced hallucination compared to pure LLM responses
  • Ability to access up-to-date or domain-specific information
  • More controllable and transparent than pure generation

System Architecture Overview

The implementation consists of a SimpleQASystem class that orchestrates two main components:

  1. A semantic search system using Sentence Transformers
  2. An answer generation system using T5

You can download the latest version of source code here: https://github.com/alexander-uspenskiy/rag_project

System Diagram

RAG Project Setup Guide

This guide will help you set up your Retrieval-Augmented Generation (RAG) project on both macOS and Windows.

Prerequisites

For macOS:

  1. Install Homebrew (if not already installed):
/bin/bash -c "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/HEAD/install.sh)"
  2. Install Python 3.8+ using Homebrew:
brew install python@3.10

For Windows:

  1. Download and install Python 3.8+ from python.org
  2. Make sure to check “Add Python to PATH” during installation

Project Setup

Step 1: Create Project Directory

macOS:

mkdir RAG_project
cd RAG_project

Windows:

mkdir RAG_project
cd RAG_project

Step 2: Set Up Virtual Environment

macOS:

python3 -m venv venv
source venv/bin/activate

Windows:

python -m venv venv
venv\Scripts\activate
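
Step 3: Install Dependencies

With the virtual environment activated, install the libraries imported by the code in this article (torch, transformers, sentence-transformers, scikit-learn, numpy), plus sentencepiece, which the T5 tokenizer requires. The versions pinned in the repository may differ; this minimal command works on both macOS and Windows:

pip install torch transformers sentence-transformers scikit-learn numpy sentencepiece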

Core Components

1. Initialization

def __init__(self):
    self.model_name = 't5-small'
    self.tokenizer = T5Tokenizer.from_pretrained(self.model_name)
    self.model = T5ForConditionalGeneration.from_pretrained(self.model_name)
    self.encoder = SentenceTransformer('paraphrase-MiniLM-L6-v2')

The system initializes with two primary models:

  • T5-small: A smaller version of the T5 model for generating answers
  • paraphrase-MiniLM-L6-v2: A sentence transformer model for encoding text into meaningful vectors

2. Dataset Preparation

def prepare_dataset(self, data: List[Dict[str, str]]):
    self.answers = [item['answer'] for item in data]
    self.answer_embeddings = []
    for answer in self.answers:
        embedding = self.encoder.encode(answer, convert_to_tensor=True)
        self.answer_embeddings.append(embedding)

The dataset preparation phase:

  • Extracts answers from the input data
  • Creates embeddings for each answer using the sentence transformer
  • Stores both answers and their embeddings for quick retrieval

How the System Works

1. Question Processing

When a user submits a question, the system follows these steps:

Embedding Generation: The question is converted into a vector representation using the same sentence transformer model used for the answers.

Semantic Search: The system finds the most relevant stored answer by:

  • Computing cosine similarity between the question embedding and all answer embeddings
  • Selecting the answer with the highest similarity score

Context Formation: The selected answer becomes the context for T5 to generate a final response.
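
Condensed, the retrieval step boils down to a few lines. This is a standalone fragment for illustration; encoder, answers, and answer_embeddings correspond to the attributes of the same names prepared earlier, exactly as in the full get_answer implementation below:

import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# question: str, encoder: SentenceTransformer,
# answers: list of strings, answer_embeddings: list of tensors (from prepare_dataset)
question_embedding = encoder.encode(question, convert_to_tensor=True).cpu()

similarities = cosine_similarity(
    question_embedding.numpy().reshape(1, -1),
    np.array([emb.cpu().numpy() for emb in answer_embeddings])
)[0]

best_idx = np.argmax(similarities)   # highest cosine similarity wins
context = answers[best_idx]          # this answer becomes the context for T5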

2. Answer Generation

def get_answer(self, question: str) -> str:
    # ... semantic search logic ...
    input_text = f"Given the context, what is the answer to the question: {question} Context: {context}"
    input_ids = self.tokenizer(input_text, max_length=512, truncation=True,
                               padding='max_length', return_tensors='pt').input_ids
    outputs = self.model.generate(input_ids, max_length=50, num_beams=4,
                                  early_stopping=True, no_repeat_ngram_size=2)
    answer = self.tokenizer.decode(outputs[0], skip_special_tokens=True)
    return self.clean_answer(answer)

The answer generation process:

  1. Combines the question and context into a prompt for T5
  2. Tokenizes the input text with a maximum length of 512 tokens
  3. Generates an answer using beam search with these parameters:
  • max_length=50: Limits answer length
  • num_beams=4: Uses beam search with 4 beams
  • early_stopping=True: Stops generation when all beams reach an end token
  • no_repeat_ngram_size=2: Prevents repetition of bigrams

3. Answer Cleaning

def clean_answer(self, answer: str) -> str:
    words = answer.split()
    cleaned_words = []
    for i, word in enumerate(words):
        if i == 0 or word.lower() != words[i-1].lower():
            cleaned_words.append(word)
    cleaned = ' '.join(cleaned_words)
    return cleaned[0].upper() + cleaned[1:] if cleaned else cleaned
  • Removes duplicate consecutive words (case-insensitive)
  • Capitalizes the first letter of the answer
  • Removes extra whitespace

Full Source Code

You can download the latest version of source code here: https://github.com/alexander-uspenskiy/rag_project

import os
# Set tokenizers parallelism before importing libraries
os.environ["TOKENIZERS_PARALLELISM"] = "false"

import torch
from transformers import T5Tokenizer, T5ForConditionalGeneration
from typing import List, Dict
import numpy as np
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity

class SimpleQASystem:
    def __init__(self):
        """Initialize QA system using T5"""
        try:
            # Use T5 for answer generation
            self.model_name = 't5-small'
            self.tokenizer = T5Tokenizer.from_pretrained(self.model_name, legacy=False)
            self.model = T5ForConditionalGeneration.from_pretrained(self.model_name)

            # Move model to CPU explicitly to avoid memory issues
            self.device = "cpu"
            self.model = self.model.to(self.device)

            # Initialize storage
            self.answers = []
            self.answer_embeddings = None
            self.encoder = SentenceTransformer('paraphrase-MiniLM-L6-v2')

            print("System initialized successfully")

        except Exception as e:
            print(f"Initialization error: {e}")
            raise

    def prepare_dataset(self, data: List[Dict[str, str]]):
        """Prepare the dataset by storing answers and their embeddings"""
        try:
            # Store answers
            self.answers = [item['answer'] for item in data]

            # Encode answers using SentenceTransformer
            self.answer_embeddings = []
            for answer in self.answers:
                embedding = self.encoder.encode(answer, convert_to_tensor=True)
                self.answer_embeddings.append(embedding)

            print(f"Prepared {len(self.answers)} answers")

        except Exception as e:
            print(f"Dataset preparation error: {e}")
            raise

    def clean_answer(self, answer: str) -> str:
        """Clean up generated answer by removing duplicates and extra whitespace"""
        words = answer.split()
        cleaned_words = []
        for i, word in enumerate(words):
            if i == 0 or word.lower() != words[i-1].lower():
                cleaned_words.append(word)
        cleaned = ' '.join(cleaned_words)
        return cleaned[0].upper() + cleaned[1:] if cleaned else cleaned

    def get_answer(self, question: str) -> str:
        """Get answer using semantic search and T5 generation"""
        try:
            if not self.answers or self.answer_embeddings is None:
                raise ValueError("Dataset not prepared. Call prepare_dataset first.")

            # Encode question using SentenceTransformer
            question_embedding = self.encoder.encode(
                question,
                convert_to_tensor=True,
                show_progress_bar=False
            )

            # Move the question embedding to CPU (if not already)
            question_embedding = question_embedding.cpu()

            # Find most similar answer using cosine similarity
            similarities = cosine_similarity(
                question_embedding.numpy().reshape(1, -1),  # Use .numpy() for numpy compatibility
                np.array([embedding.cpu().numpy() for embedding in self.answer_embeddings])  # Move answer embeddings to CPU
            )[0]

            best_idx = np.argmax(similarities)
            context = self.answers[best_idx]

            # Generate the input text for the T5 model
            input_text = f"Given the context, what is the answer to the question: {question} Context: {context}"
            print(input_text)

            # Tokenize input text
            input_ids = self.tokenizer(
                input_text,
                max_length=512,
                truncation=True,
                padding='max_length',
                return_tensors='pt'
            ).input_ids.to(self.device)

            # Generate answer with limited max_length
            outputs = self.model.generate(
                input_ids,
                max_length=50,  # Increase length to handle more detailed answers
                num_beams=4,
                early_stopping=True,
                no_repeat_ngram_size=2
            )

            # Decode the generated answer
            answer = self.tokenizer.decode(outputs[0], skip_special_tokens=True)

            # Print the raw generated answer for debugging
            print(f"Generated answer before cleaning: {answer}")

            # Clean up the answer
            cleaned_answer = self.clean_answer(answer)
            return cleaned_answer

        except Exception as e:
            print(f"Error generating answer: {e}")
            return f"Error: {str(e)}"


def main():
    """Main function with sample usage"""
    try:
        # Sample data
        data = [
            {"question": "What is the capital of France?", "answer": "The capital of France is Paris."},
            {"question": "What is the largest planet?", "answer": "The largest planet is Jupiter."},
            {"question": "Who wrote '1984'?", "answer": "George Orwell wrote '1984'."}
        ]

        # Initialize system
        print("Initializing QA system...")
        qa_system = SimpleQASystem()

        # Prepare dataset
        print("Preparing dataset...")
        qa_system.prepare_dataset(data)

        # Start interactive Q&A session
        while True:
            # Prompt the user for a question
            test_question = input("\nPlease enter your question (or 'exit' to quit): ")

            if test_question.lower() == 'exit':
                print("Exiting the program.")
                break

            # Get and print the answer
            print(f"\nQuestion: {test_question}")
            answer = qa_system.get_answer(test_question)
            print(f"Answer: {answer}")

    except Exception as e:
        print(f"Error in main: {e}")


if __name__ == "__main__":
    main()

Performance Considerations

Memory Management:

  • The system explicitly uses CPU to avoid memory issues
  • Embeddings are converted to CPU tensors when needed
  • Input length is limited to 512 tokens

Error Handling:

  • Comprehensive try-except blocks throughout the code
  • Meaningful error messages for debugging
  • Validation checks for uninitialized components

Usage Example

# Initialize system
qa_system = SimpleQASystem()

# Prepare sample data
data = [
    {"question": "What is the capital of France?", "answer": "The capital of France is Paris."},
    {"question": "What is the largest planet?", "answer": "The largest planet is Jupiter."}
]

# Prepare dataset
qa_system.prepare_dataset(data)

# Get answer
answer = qa_system.get_answer("What is the capital of France?")

Run in terminal

Limitations and Potential Improvements

Scalability:

  • The current implementation keeps all embeddings in memory
  • Could be improved with a vector database (such as FAISS) for large-scale applications; a rough sketch of that direction follows below
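
As one possible direction, a library such as FAISS could replace the in-memory cosine-similarity computation inside get_answer. This is only a sketch, assuming the faiss-cpu package is installed (it is not part of the current project); self.encoder, self.answers, and self.answer_embeddings refer to the existing class attributes:

import faiss
import numpy as np

# Build an index over the answer embeddings
# (vectors are L2-normalized so inner product equals cosine similarity)
embeddings = np.array([emb.cpu().numpy() for emb in self.answer_embeddings]).astype('float32')
faiss.normalize_L2(embeddings)
index = faiss.IndexFlatIP(embeddings.shape[1])
index.add(embeddings)

# Query: encode the question, normalize, and fetch the single closest answer
query = self.encoder.encode([question]).astype('float32')
faiss.normalize_L2(query)
scores, ids = index.search(query, k=1)
context = self.answers[ids[0][0]]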

Answer Quality:

  • Relies heavily on the quality of the provided answer dataset
  • Limited by the context window of T5-small
  • Could benefit from answer validation or confidence scoring

Performance:

  • Using CPU only might be slower for large-scale applications
  • Could be optimized with batch processing
  • Could implement caching for frequently asked questions (see the sketch after this list)
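
For the last point, a small in-memory cache in front of get_answer avoids re-running retrieval and generation for repeated questions. A minimal sketch; the subclass name and the lowercase/strip normalization are illustrative choices, not part of the current code:

class CachedQASystem(SimpleQASystem):
    """Wraps get_answer with a simple in-memory cache for repeated questions."""

    def __init__(self):
        super().__init__()
        self._cache = {}

    def get_answer(self, question: str) -> str:
        key = question.strip().lower()   # normalize so trivial variants hit the cache
        if key not in self._cache:
            self._cache[key] = super().get_answer(question)
        return self._cache[key]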

Conclusion

This implementation provides a solid foundation for a question-answering system, combining the strengths of semantic search and transformer-based text generation. Feel free to play with model parameters (like max_length, num_beams, early_stopping, no_repeat_ngram_size, etc) to find a better way to get more coherent and stable answers. While there’s room for improvement, the current implementation offers a good balance between complexity and functionality, making it suitable for educational purposes and small to medium-scale applications.
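
For instance, switching from beam search to sampling is one easy experiment. The parameter values here are illustrative starting points rather than tuned recommendations; input_ids is the same tokenized prompt used in get_answer:

# Illustrative alternative to the beam-search settings in get_answer
outputs = self.model.generate(
    input_ids,
    max_length=80,            # allow somewhat longer answers
    do_sample=True,           # sample instead of beam search
    top_p=0.9,                # nucleus sampling
    temperature=0.7,          # lower values give more conservative output
    no_repeat_ngram_size=2
)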

Happy coding!
