We’re happy to introduce raghilda, a new Python package for building RAG (Retrieval-Augmented Generation) solutions.
RAG is a simple concept that comes up anytime you want to retrieve content for an LLM to improve or augment the generated output.
LLMs are great at reasoning and generating text, but their knowledge is frozen at training time. They can’t access private documents, recent information, or anything that wasn’t in the training data. When asked about these topics, they either refuse to answer or — worse — hallucinate a confident-sounding response. RAG solves this by giving the model access to relevant information at query time, without needing to retrain it.
In practice, most tools built on top of LLMs already do this. ChatGPT uses web search to
include recent news in its answers. Claude Code reads the codebase using tools like grep,
list_files, and symbol_search before generating code. RAG is what makes LLMs useful for
tasks that require specific, up-to-date, or private knowledge.
Modern LLMs support context windows of 100K tokens or more, which might seem to make RAG unnecessary: just paste everything into the prompt. But this doesn’t work well in practice. LLMs suffer from the “lost in the middle” effect (sometimes called “context rot”): they pay less attention to information buried in the middle of a very long prompt, so relevant content gets missed. On top of that, sending your entire knowledge base with every query is expensive and slow, and most real-world document collections are too large to fit in a single context window anyway. Long context and RAG are complementary: a larger window lets you include more retrieved chunks, but retrieval is what gives you precision, so the model sees 5 relevant paragraphs instead of 500 irrelevant pages.
raghilda
Building a good retrieval system is the hard part of RAG. raghilda is a Python framework designed to handle it. You give it URLs or file paths, and it takes care of the rest. The defaults are opinionated but transparent — every step is exposed and replaceable. You can swap the chunker, change the embedding provider, or write a custom ingestion function without fighting the framework.
- Document processing. Converting HTML pages, PDFs, and DOCX files into clean text is surprisingly messy. raghilda handles this automatically, converting documents to Markdown. For websites, find_links() crawls and discovers pages so you don’t have to list them by hand.
- Smart chunking. Naive fixed-size chunking can split a code block or paragraph in half. raghilda’s Markdown chunker splits text at semantic boundaries (headings, paragraphs, sentences) and preserves the heading hierarchy so each chunk retains context about where it came from.
- Multiple storage backends. raghilda supports DuckDB (local, zero-config), ChromaDB, and OpenAI Vector Stores. The API is the same across backends, so you can start with a local DuckDB file and move to a hosted solution later without rewriting your code.
- Hybrid retrieval. Pure vector search finds semantically similar content but misses exact keyword matches. raghilda combines semantic search, BM25 keyword matching, and attribute filters, so you can search by meaning, by keywords, and by metadata (e.g. source URL, document type, or any custom attribute) all at once.
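To make the chunking idea concrete, here is a minimal sketch of heading-aware splitting in plain Python. It is not raghilda's chunker: the function name `chunk_markdown`, the `max_chars` limit, and the output shape are all assumptions made for illustration, and overlap between chunks is left out. It only shows the principle of breaking at paragraph boundaries while carrying the heading hierarchy along with each chunk.

```python
import re

# Matches an ATX heading line such as "## Install".
HEADING = re.compile(r"^(#{1,6})\s+(.*)$")


def chunk_markdown(text: str, max_chars: int = 200) -> list[dict]:
    """Split Markdown at paragraph boundaries and record the heading path.

    Illustrative only: a hypothetical helper, not part of raghilda.
    """
    chunks = []
    path = []    # current heading hierarchy, e.g. ["Guide", "Install"]
    buffer = []  # paragraphs collected for the chunk being built

    def flush():
        if buffer:
            chunks.append({"headings": list(path), "text": "\n\n".join(buffer)})
            buffer.clear()

    # Blank lines separate blocks (paragraphs or headings).
    for block in re.split(r"\n\s*\n", text.strip()):
        block = block.strip()
        if not block:
            continue
        first_line, _, rest = block.partition("\n")
        match = HEADING.match(first_line)
        if match:
            # A heading always closes the current chunk.
            flush()
            level = len(match.group(1))
            del path[level - 1:]  # drop headings at this level or deeper
            path.append(match.group(2).strip())
            block = rest.strip()
            if not block:
                continue
        # Start a new chunk rather than exceed the size limit mid-paragraph.
        if buffer and sum(len(p) for p in buffer) + len(block) > max_chars:
            flush()
        buffer.append(block)
    flush()
    return chunks


sample = """# Guide

Intro paragraph.

## Install

Run the installer.

## Usage

Call the function."""

chunks = chunk_markdown(sample)
for chunk in chunks:
    print(" > ".join(chunk["headings"]), "|", chunk["text"])
```

Because every chunk carries its heading path, a retrieved passage can still be traced back to the section it came from, even when the passage itself never repeats the section title.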
How it works
A retrieval system has two phases: ingestion, which turns your documents into a searchable store, and retrieval, which finds the right chunks for a given query. raghilda keeps the two phases separate, and each step is an individual call you can customize or replace.
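As a rough mental model of the two phases, the sketch below builds a toy in-memory store in plain Python. Everything in it is an assumption for illustration: `ToyStore`, `ingest`, and `retrieve` are hypothetical names, not raghilda's API, and the "embedding" is just a bag-of-words count vector standing in for a real embedding model.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Toy 'embedding': word counts. A real store would call an embedding model."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[term] * b[term] for term in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class ToyStore:
    """Hypothetical in-memory store; not raghilda's store class."""

    def __init__(self):
        self.rows = []  # list of (chunk_text, vector) pairs

    def ingest(self, chunks: list[str]) -> None:
        # Ingestion phase: embed each chunk once and keep the vector.
        for chunk in chunks:
            self.rows.append((chunk, embed(chunk)))

    def retrieve(self, query: str, top_k: int = 2) -> list[str]:
        # Retrieval phase: embed the query and rank chunks by similarity.
        query_vector = embed(query)
        ranked = sorted(
            self.rows, key=lambda row: cosine(query_vector, row[1]), reverse=True
        )
        return [text for text, _ in ranked[:top_k]]


store = ToyStore()
store.ingest([
    "Chunking splits long documents into smaller passages.",
    "DuckDB is an embedded analytical database.",
    "Markdown headings start with a hash sign.",
])
top = store.retrieve("Which database is embedded?", top_k=1)
print(top)
```

Note that the same `embed` function is used for both chunks and queries, which is the property that lets similarity scores be compared at all.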
Let’s walk through a minimal example using a Wikipedia article about Princess Ragnhild of Norway.
First, you create a store with an embedding provider. The store is where your chunks and their vector embeddings will live:
|
|
Then you read the document and chunk it. read_as_markdown() converts the URL
to Markdown, and MarkdownChunker splits it into overlapping chunks at semantic boundaries:
|
|
185 chunks

Finally, you upsert the chunked document into the store and build the search indexes. Embedding is handled by the store itself: since the embedding provider is configured at creation time, all chunks in a store are guaranteed to use consistent embeddings:
|
|
During retrieval, you query the store and get back the most relevant chunks. raghilda runs semantic search and BM25 keyword search, then merges the results:
|
|
Lorentzen"), a member of the Lorentzen family of shipping magnates.
In the same year, they moved to Brazil, where her husband was an
industrialist and a main owner of Aracruz Celulose. She lived in
Brazil until her death 59 years later.
---
to Rio de Janeiro, where her husband had substantial business
holdings. Their residence in Brazil was originally temporary, but
they
---

Tip: In practice, you’ll usually work with many documents at once. Just loop over your URLs or file paths and call upsert() for each one. See the Getting Started guide for a full walkthrough.
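The merge step of hybrid retrieval deserves a closer look. This post does not say how raghilda combines the semantic and BM25 result lists, so the sketch below uses reciprocal rank fusion (RRF), one common technique for the job, purely as an illustration. The chunk IDs and the constant `k = 60` are assumptions, not values taken from raghilda.

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Merge several ranked lists of chunk IDs into one ranking.

    Each list contributes 1 / (k + rank) to every chunk it contains,
    so chunks ranked well by more than one retriever rise to the top.
    """
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] = scores.get(chunk_id, 0.0) + 1.0 / (k + rank)
    # Highest score first; ties are broken by chunk ID for determinism.
    return sorted(scores, key=lambda chunk_id: (-scores[chunk_id], chunk_id))


# Hypothetical results from two retrievers for the same query.
semantic = ["c3", "c1", "c7"]  # vector-similarity order
keyword = ["c1", "c9", "c3"]   # BM25 order

fused = reciprocal_rank_fusion([semantic, keyword])
print(fused)  # ['c1', 'c3', 'c9', 'c7']
```

RRF works on ranks rather than raw scores, which sidesteps the problem that cosine similarities and BM25 scores live on different scales.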
Using with an LLM
A retrieval system on its own just returns text chunks. To get actual answers, you connect it to an LLM. The simplest way to do this is to register a search function as a tool that the LLM can call when it needs information.
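To show what "registering a search function as a tool" amounts to, here is a library-agnostic sketch in plain Python. It is not how chatlas or any particular framework is implemented: `register_tool`, `handle_tool_call`, the JSON request shape, and the keyword-matching `search_docs` function are all hypothetical. The idea is that a tool is a function plus a description the model can read, and the host program runs the function whenever the model asks for it.

```python
import inspect
import json

# Registry of tools the model is allowed to call (hypothetical structure).
TOOLS: dict[str, dict] = {}


def register_tool(func):
    """Record a function along with the description a model would be shown."""
    TOOLS[func.__name__] = {
        "function": func,
        "description": inspect.getdoc(func) or "",
        "parameters": list(inspect.signature(func).parameters),
    }
    return func


@register_tool
def search_docs(query: str) -> list[str]:
    """Return stored passages that mention every word in the query."""
    # Stand-in for a real retrieval call against a store.
    passages = [
        "Hybrid retrieval merges keyword and vector results.",
        "Chunking splits documents into passages.",
    ]
    words = query.lower().split()
    return [p for p in passages if all(w in p.lower() for w in words)]


def handle_tool_call(request_json: str) -> str:
    """Run the tool a model asked for and return its result as JSON text."""
    request = json.loads(request_json)
    tool = TOOLS[request["name"]]
    result = tool["function"](**request["arguments"])
    return json.dumps(result)


# A model that decides it needs information would emit a request like this.
reply = handle_tool_call(
    '{"name": "search_docs", "arguments": {"query": "hybrid retrieval"}}'
)
print(reply)
```

The tool's result is sent back to the model as text, and the model then writes its answer with the retrieved passages in context.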
Here’s an example using chatlas:
|
|
Princess Ragnhild moved to Brazil in the same year she got married,
1953. Her marriage and the move to Brazil were connected, as her
husband was an industrialist and owner of business holdings in Brazil.
They settled in Rio de Janeiro and lived there until her death in 2012.

Compare this with the same question without the search tool:
|
|
Princess Ragnhild moved to Brazil in 1960.

With the search tool, the LLM retrieves the relevant chunks and grounds its answer in the actual document. Without it, the model has to rely on its training data, and may hallucinate or give a vague response.
Learn more
- Getting Started — full walkthrough building a store from a documentation site
- Examples — complete scripts showing RAG workflows with chatlas, ChromaDB, and more
- GitHub repository — source code, issues, and contributions

