Why searching a Vector DB inside an LLM's streaming generation loop is a terrible idea.
Retrieval-Augmented Generation (RAG) uses a Vector Database to search for context (like finding a specific company policy) before feeding it to an LLM. Standard RAG does one search upfront, which takes maybe 100ms. But some advanced Agentic systems try to do "Iterative Search": the LLM generates a few words, realizes it needs more info, pauses, hits the Vector DB, and resumes. If the Vector DB takes 100ms to respond and the Agent searches 5 times during a single reply, you have added 5 × 100ms = 500ms of hard blocking latency to the user's experience, and it arrives as five separate stalls in the middle of the reply. The typing animation stutters, and the product feels broken.
You cannot block the token generation stream. To fix this, you must decouple the slow DB search from the fast LLM output:
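// BAD: Iterative Search (blocks the stream)
// `llm` and `vectorDB` are placeholder clients, not a specific library's API.
async function* iterativeAnswer(prompt) {
  for await (const token of llm.stream(prompt)) {
    if (token === "[NEED_INFO]") {
      // The stream stalls here for the full round trip (~100ms). The UI stutters.
      const info = await vectorDB.search("...");
      llm.addContext(info);
      continue; // do not show the control token to the user
    }
    yield token;
  }
}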
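// GOOD: Prefetching / Fat Context
// Do one large search upfront and pass all of it into the context.
async function* prefetchedAnswer(prompt) {
  const docs = await vectorDB.search(prompt, { topK: 50 }); // ~100ms, paid once
  // Assumes each result exposes a `text` field; adapt to your client's result shape.
  const context = docs.map((doc) => doc.text).join("\n\n");
  const fullPrompt = `${context}\n\nAnswer: ${prompt}`;
  // The stream now flows without interruption
  for await (const token of llm.stream(fullPrompt)) {
    yield token;
  }
}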
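To put rough numbers on the difference, here is a minimal sketch that models both loops on a virtual clock instead of calling a real LLM or database. Everything in it is an assumption chosen for illustration: 20ms per generated token, a 100ms search, a 50-token reply, and five mid-reply searches. It also ignores the extra prompt-processing time a larger context may add before the first token.

```javascript
// Virtual-clock model: no real timers, so the numbers are deterministic.
const TOKEN_MS = 20;   // assumed time to generate one token
const SEARCH_MS = 100; // assumed vector DB round trip

// Timestamps (ms) at which each token reaches the UI when the
// stream blocks on a search before the tokens listed in `searchAt`.
function iterativeTimeline(tokenCount, searchAt) {
  const times = [];
  let clock = 0;
  for (let i = 0; i < tokenCount; i++) {
    if (searchAt.includes(i)) clock += SEARCH_MS; // stream stalls mid-reply
    clock += TOKEN_MS;
    times.push(clock);
  }
  return times;
}

// Timestamps (ms) when a single search runs before the first token.
function prefetchTimeline(tokenCount) {
  const times = [];
  let clock = SEARCH_MS; // paid once, before streaming starts
  for (let i = 0; i < tokenCount; i++) {
    clock += TOKEN_MS;
    times.push(clock);
  }
  return times;
}

// Largest pause between two consecutive tokens once streaming has started.
function maxGap(times) {
  let worst = 0;
  for (let i = 1; i < times.length; i++) {
    worst = Math.max(worst, times[i] - times[i - 1]);
  }
  return worst;
}

const iterative = iterativeTimeline(50, [5, 15, 25, 35, 45]);
const prefetch = prefetchTimeline(50);

console.log("iterative:", maxGap(iterative), "ms worst gap,", iterative[iterative.length - 1], "ms total");
console.log("prefetch: ", maxGap(prefetch), "ms worst gap,", prefetch[prefetch.length - 1], "ms total");
```

Under these assumptions the iterative loop finishes at 1500ms with a worst mid-stream gap of 120ms, while the prefetch version finishes at 1100ms and never pauses longer than the 20ms token interval. The prefetch version pays its 100ms before the first token appears, where a user reads it as thinking time instead of a stutter.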
Passing 50 documents into the LLM's context window instead of 3 will cost you significantly more money, since LLM APIs charge per input token and you are sending roughly 17 times as many retrieved tokens (assuming documents of similar size). However, compute keeps getting cheaper, while network round-trip latency to the database has a floor set by physics (the speed of light) that no pricing change will remove. Trading extra context tokens for zero mid-stream network interruptions is almost always the right product choice for user-facing chatbots.
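The cost side of that trade can be estimated with a few lines of arithmetic. The sketch below uses made-up figures: 500 tokens per document and 3 USD per million input tokens are placeholders, not any provider's actual pricing, so substitute your own numbers.

```javascript
// Hypothetical figures for illustration only.
const TOKENS_PER_DOC = 500;      // assumed average document size
const USD_PER_MILLION_INPUT = 3; // assumed input-token price

// Input-token cost of the retrieved context for one request.
function contextCostUSD(docCount, tokensPerDoc, usdPerMillion) {
  return (docCount * tokensPerDoc * usdPerMillion) / 1e6;
}

const leanCost = contextCostUSD(3, TOKENS_PER_DOC, USD_PER_MILLION_INPUT);
const fatCost = contextCostUSD(50, TOKENS_PER_DOC, USD_PER_MILLION_INPUT);
const ratio = (50 * TOKENS_PER_DOC) / (3 * TOKENS_PER_DOC);

console.log(`3 docs:  $${leanCost.toFixed(4)} per request`);
console.log(`50 docs: $${fatCost.toFixed(4)} per request`);
console.log(`ratio:   ${ratio.toFixed(1)}x`);
```

With these placeholder numbers the fat context costs $0.0750 per request against $0.0045, which is the 16.7x multiplier from the document counts. Whether that is acceptable depends on request volume, which is why the trade is a product decision and not a purely technical one.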