Vector DB Search Latency

Why searching a Vector DB inside an LLM's streaming generation loop is a terrible idea.

The idea

Retrieval-Augmented Generation (RAG) searches a Vector Database for context (for example, a specific company policy) and feeds the results to an LLM. Standard RAG does one search upfront, which takes roughly 100ms. Some advanced agentic systems instead try "Iterative Search": the LLM generates a few words, realizes it needs more info, pauses, hits the Vector DB, and resumes. If each search takes 100ms and the agent searches 5 times during a single reply, that adds 500ms of blocking latency, delivered as five separate mid-sentence stalls. The typing animation stutters, and the product feels broken.

Step 1: Standard RAG. One search upfront, then a smooth, uninterrupted stream of LLM tokens.

How it works (Prefetching & Speculative Search)

You cannot block the token generation stream. To fix this, you must decouple the slow DB search from the fast LLM output:

Parallel RAG (Prefetching): While the user is typing their prompt, or while the system is generating the first "filler" sentence (e.g., "Let me look into that for you..."), you fire off the Vector DB searches in the background.
Speculative Retrieval: Retrieve far more context than you think you need (for example, topK = 50) on the very first pass. Passing 5,000 extra words into the LLM's context window is usually much faster than stopping the LLM midway to make a second network request.
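
The prefetching idea above can be sketched roughly as follows. This is a minimal illustration, not a production pattern: `fakeVectorDB`, `fakeLLM`, `sleep`, and `answerWithPrefetch` are all hypothetical stand-ins, and the delays are simulated with timers. The point it shows is that the search promise is created first and only awaited after the filler sentence has already been streamed.

```javascript
// Hypothetical stand-ins for a vector DB client and an LLM client.
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

const fakeVectorDB = {
    async search(query, { topK }) {
        await sleep(20); // simulated network round trip
        return Array.from({ length: topK }, (_, i) => `doc ${i + 1} for "${query}"`);
    },
};

const fakeLLM = {
    // Ignores the prompt and emits a canned answer, one token at a time.
    async *stream(fullPrompt) {
        for (const token of ["Refunds", "take", "5", "days."]) {
            await sleep(1);
            yield token;
        }
    },
};

async function* answerWithPrefetch(userPrompt, vectorDB, llm) {
    // Start the search but do not await it yet; it runs in the background.
    const searchPromise = vectorDB.search(userPrompt, { topK: 3 });

    // Stream a filler sentence while the search is in flight.
    for (const word of ["Let", "me", "look", "into", "that..."]) {
        yield word;
    }

    // The search has had a head start; wait only for whatever time remains.
    const docs = await searchPromise;
    const fullPrompt = `${docs.join("\n")}\n\nAnswer: ${userPrompt}`;

    // Hand the rest of the stream over to the LLM.
    yield* llm.stream(fullPrompt);
}
```

The user sees the first tokens immediately, and the database round trip overlaps with text that was going to be shown anyway. In a real system the filler would itself come from the model or a template, and the search would need error handling in case the consumer stops reading early.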

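// BAD: Iterative Search (blocks the stream)
// `llm` and `vectorDB` are placeholder clients, not a specific library.
async function* iterativeAnswer(prompt) {
    for await (const token of llm.stream(prompt)) {
        if (token === "[NEED_INFO]") {
            // The stream stalls here for ~100ms until the DB responds. UI stutters.
            const info = await vectorDB.search("...");
            llm.addContext(info);
            continue; // Don't show the control token to the user.
        }
        yield token;
    }
}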

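// GOOD: Prefetching / Fat Context
// Do one large search upfront and pass all of it into the context.
// `llm` and `vectorDB` are placeholder clients, not a specific library.
async function* fatContextAnswer(prompt) {
    // ~100ms total, paid once before the first token is streamed
    const context = await vectorDB.search(prompt, { topK: 50 });
    const fullPrompt = `${context}\n\nAnswer: ${prompt}`;

    // The stream flows without interruption
    for await (const token of llm.stream(fullPrompt)) {
        yield token;
    }
}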

Cost

Passing 50 documents into the LLM's context window instead of 3 will cost you significantly more money, since LLM APIs charge per input token. However, compute keeps getting cheaper, while network latency to the database has a floor set by the speed of light. Trading extra context tokens for zero network interruptions is almost always the right product choice for user-facing chatbots.
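
To put a rough number on that trade-off, the extra spend can be estimated from the document counts and the per-token price. The sketch below is only illustrative: `extraContextCostUSD` is a hypothetical helper, and the chunk size and price are made-up placeholders, not quotes from any provider.

```javascript
// Rough estimate of the extra input-token cost of a "fat context" request.
// All inputs are placeholders; substitute your own chunk sizes and pricing.
function extraContextCostUSD({ fatDocs, leanDocs, tokensPerDoc, usdPerMillionInputTokens }) {
    const extraTokens = (fatDocs - leanDocs) * tokensPerDoc;
    return (extraTokens * usdPerMillionInputTokens) / 1e6;
}

const perRequest = extraContextCostUSD({
    fatDocs: 50,
    leanDocs: 3,
    tokensPerDoc: 300,           // assumed average chunk size
    usdPerMillionInputTokens: 3, // assumed price, not a real quote
});

// Extra dollars per request under the assumptions above.
console.log(perRequest.toFixed(4));
```

Under these assumed numbers the extra context costs a few cents per request; multiply by your request volume to see whether that is acceptable for your product.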

Watch out for

Lost in the Middle: If you use the "Fat Context" approach and retrieve 50 documents, be aware that LLMs tend to make poor use of information buried in the middle of a very long context block. They attend most reliably to the very beginning and the very end. You should re-rank the DB results and put the most relevant documents at the very end of the prompt, closest to the question.
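
One simple way to apply this is to sort the re-ranked results in ascending order of relevance before building the prompt, so the best match lands last. The sketch below assumes each result has the shape `{ text, score }` with a higher score meaning more relevant; that shape and the helper names `orderForPrompt` and `buildPrompt` are hypothetical, not part of any particular client library.

```javascript
// Order retrieved documents so the most relevant one sits closest to the question.
// Assumes each result looks like { text, score }, higher score = more relevant.
function orderForPrompt(results) {
    // Copy before sorting so the caller's array is not mutated.
    return [...results].sort((a, b) => a.score - b.score);
}

function buildPrompt(results, question) {
    const context = orderForPrompt(results)
        .map((r) => r.text)
        .join("\n\n");
    // Same template as the fat-context example: context first, question last.
    return `${context}\n\nAnswer: ${question}`;
}
```

With this ordering the weakest matches absorb the low-attention middle of the context, and the strongest match sits right before the question.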