Retrieval-Augmented Generation - Chunk Size
Like many others, I’ve been swept up in the recent wave of AI hype. At first, I was skeptical, but after trying out some of the latest large language models, I was genuinely surprised by how capable they have become.
That initial surprise quickly turned into curiosity, and I started digging deeper into the field. As someone who actively follows the markets, one use case stood out to me: using AI to help digest dense financial documents like 8-Ks and 10-Ks. That led me to explore retrieval-augmented generation (RAG).
I began with the LangChain documentation and started experimenting with Gemini 2.5 Flash, since it was free to use (I generally prefer free and open-source tools where I can). Unfortunately, things didn't work quite as expected: retrieval often surfaced irrelevant chunks, and the answers that came back were inaccurate. Even though Gemini offers a context window of one million tokens, the relevance of the retrieved context tended to decrease as more tokens were packed in.
This led me to focus on refining the chunking strategy.
Chunks need to be small enough to fit comfortably within the model's context window, but large enough to carry meaningful context. They also need to leave room for the additional context that accumulates from follow-up prompts as the conversation grows.
There is also the issue of splitting important content across chunks. To help prevent information loss in these cases, I introduced overlap between chunks. Overlap ensures that if a key sentence or paragraph is divided, it still appears in the neighboring chunk. However, if the relevant information is longer than the overlap size, it can still be lost. This is why tuning both chunk size and overlap is critical.
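To make this concrete, here is a minimal sketch of the kind of chunking setup I experimented with, using LangChain's RecursiveCharacterTextSplitter. The chunk_size and chunk_overlap values, and the file name, are just illustrative placeholders, not recommendations.

from langchain_text_splitters import RecursiveCharacterTextSplitter

# Illustrative values only: the right numbers depend on your documents
# and on how much room you want to leave in the context window.
splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,    # characters per chunk
    chunk_overlap=200,  # characters shared between neighboring chunks
)

with open("10k_filing.txt") as f:  # hypothetical source document
    text = f.read()

chunks = splitter.split_text(text)
print(f"{len(chunks)} chunks; the first begins: {chunks[0][:80]!r}")

The overlap means a sentence that straddles a boundary also appears at the start of the next chunk, but if the passage you care about is longer than the overlap, it can still be split, which is exactly the failure mode described above.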
If you are building something similar, I highly recommend experimenting with different chunk sizes and overlaps, and evaluating your setup using metrics like response speed, faithfulness to the source material, and overall relevance. Getting this right can make a big difference in performance.
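If it helps, here is a rough sketch of what that experimentation loop might look like. The build_rag_chain helper and the evaluation set are placeholders I am assuming for illustration; in practice you would plug in your own pipeline and a proper faithfulness or relevance metric, since the keyword-overlap score below is only a crude stand-in.

import time

def build_rag_chain(chunk_size: int, chunk_overlap: int):
    # Placeholder: wire up your own splitter, vector store, and model here.
    raise NotImplementedError("replace with your own RAG pipeline")

def score_answer(answer: str, reference: str) -> float:
    # Crude relevance proxy: fraction of reference keywords present in the answer.
    keywords = set(reference.lower().split())
    return sum(word in answer.lower() for word in keywords) / max(len(keywords), 1)

eval_set = [
    # (question, reference answer) pairs you write by hand for your documents
    ("What was total revenue for the most recent fiscal year?",
     "placeholder reference answer drawn from the filing"),
]

for chunk_size in (500, 1000, 2000):
    for overlap in (0, 100, 200):
        chain = build_rag_chain(chunk_size, overlap)
        start = time.perf_counter()
        scores = [score_answer(chain.invoke(q), ref) for q, ref in eval_set]
        elapsed = time.perf_counter() - start
        print(f"size={chunk_size} overlap={overlap} "
              f"avg_score={sum(scores) / len(scores):.2f} time={elapsed:.1f}s")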