Search & ranking systems

How Elasticsearch finds words inside millions of documents in milliseconds.

The idea

If you want to find the word "apple" in a million books, you can't read every book (a Full Scan). Instead, search engines build an Inverted Index. It maps every unique word (Term) to a list of Document IDs where it appears (Postings).

When you search for multiple words (e.g., "fast car"), the engine looks up the lists for "fast" and "car", intersects them to find documents containing both, and then ranks them using algorithms like TF-IDF (Term Frequency - Inverse Document Frequency) to show the most relevant results first.

Type a word to look up its postings list in O(1) time!

How it works (Inverted Index)

def build_index(documents):
    index = defaultdict(list)
    for doc_id, text in documents.items():
        # Tokenise and normalise (lowercase, remove punctuation)
        tokens = text.lower().split()
        for token in set(tokens): # unique terms in this doc
            index[token].append(doc_id)
    return index
    
def search(index, query):
    tokens = query.lower().split()
    if not tokens: return []
    
    # Intersect the posting lists for all query tokens
    result_set = set(index.get(tokens[0], []))
    for token in tokens[1:]:
        result_set &= set(index.get(token, []))
        
    return list(result_set)