Building a search engine
Search looks easy: type a few words, hit Enter, and instantly see results.
In this tutorial, I walk through building a simple search engine, focusing only on term search: finding documents that contain given words. I cover how we can index documents, store the index on disk, handle updates and deletions, and run queries while keeping the system efficient and reliable.
Source code: https://github.com/harshagw/postings
At the heart of most search systems is the inverted index. Instead of scanning every document for a query, we flip the relationship: we map each term to the list of documents where it appears. For example, our index might look like this:
"hello" → [doc1, doc3]
"world" → [doc2]
Now a query like “hello” or “world” becomes a quick lookup: find the term in the dictionary and retrieve its list of documents. Before diving deeper, let’s define some key terms:
- Term: A unit of text extracted from a document during tokenization, such as hello or world.
- Posting: A reference to a document containing a term. In this implementation, a posting is just a docID (for example, doc1 or doc2). A posting can also carry additional information alongside the docID, such as the term's frequency in the document or its positions within it.
- Posting List: The list of all postings (document references) for a particular term. For example, the posting list for hello is [doc1, doc3].
- Term Dictionary: A structure that maps each term to its posting list.
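To make these terms concrete, here is a minimal in-memory sketch in Python. It is illustrative only, not the repo's actual code; the `InvertedIndex` name and naive whitespace tokenizer are assumptions for the example.

```python
from collections import defaultdict

class InvertedIndex:
    """Term dictionary mapping each term to its posting list of docIDs."""

    def __init__(self):
        self.postings = defaultdict(list)  # term -> posting list

    def add(self, doc_id, text):
        # Naive tokenization on whitespace; real engines normalize further.
        for term in sorted(set(text.lower().split())):
            self.postings[term].append(doc_id)

    def search(self, term):
        return self.postings.get(term.lower(), [])

index = InvertedIndex()
index.add("doc1", "hello there")
index.add("doc2", "world")
index.add("doc3", "hello world")

print(index.search("hello"))  # ['doc1', 'doc3']
```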
The First Approach
Our starting point is simple: build an inverted index in memory. When a document arrives, we extract its terms and add the document’s ID to each term’s posting list. Query processing becomes a simple lookup:
term → posting list
This is very fast and easy if you have enough RAM. But two major problems immediately arise:
- Memory is limited: As the number of documents grows, the index may no longer fit in RAM.
- Volatility: RAM is volatile. If the process crashes or restarts, an in-memory index is lost and must be rebuilt from all documents.
Both issues are unacceptable in production. Therefore, the first requirement is clear: the index must live on disk. A naive attempt might be to periodically serialize the entire in-memory index to disk. However, rewriting large files on every update is very expensive and can lead to inconsistent state or corruption if interrupted.
To address this, we adopt an append-only strategy: once we write a piece of index data to disk, we never modify it. Also, instead of one big monolithic file, we break the index into smaller, immutable chunks called segments.
Segments
A segment is a complete inverted index written to disk exactly once. After creation, it is read-only. Each segment contains:
- A term dictionary (mapping terms to posting lists).
- The posting lists for those terms.
Segments are written sequentially and then never changed. When new documents arrive, we index them in a small in-memory structure. Once the in-memory index grows too large (or after a time interval), we flush it: write out a new segment to disk and then reset the in-memory index.
This means our on-disk index consists of multiple segments. Older segments stay intact; new ones are simply appended. This design eliminates expensive rewrites and ensures durability (since each segment, once written, is safely stored on disk).
Updates and Deletions
With immutable segments, how do we update or delete a document? We can’t modify existing segment files, so we use logical deletion combined with appending new data:
- Updating a document is effectively a delete plus an add. We mark the old version as deleted, and index the new version as a fresh document.
- Deleting a document means marking it so it’s ignored during search.
To support this, each segment assigns documents a compact internal number (docNum) when it’s written (for example, 0, 1, 2, … within that segment). Inside the segment we store an immutable mapping: docID → docNum
This lets us find a document’s docNum given its global docID. We do not modify the segment file itself for deletions. Instead, we keep a separate deletion bitmap for each segment: one bit per docNum. A bit value of 0 means the document is live, and 1 means it is deleted.
When a document is updated or deleted, we:
- Look up its docID in the segment's ID mapping to get its docNum.
- Set the corresponding bit in that segment's deletion bitmap to 1.
We only update the bitmap; the segment’s term dictionary and postings remain untouched. During search, any document whose bit is 1 is simply skipped.
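A deletion bitmap is cheap to represent: one bit per docNum, packed into bytes. The sketch below is illustrative (the `DeletionBitmap` class is not from the repo):

```python
class DeletionBitmap:
    """One bit per docNum: 0 = live, 1 = deleted."""

    def __init__(self, num_docs):
        self.bits = bytearray((num_docs + 7) // 8)  # round up to whole bytes

    def delete(self, doc_num):
        self.bits[doc_num // 8] |= 1 << (doc_num % 8)

    def is_deleted(self, doc_num):
        return bool(self.bits[doc_num // 8] & (1 << (doc_num % 8)))

bitmap = DeletionBitmap(10)
bitmap.delete(3)
print(bitmap.is_deleted(3), bitmap.is_deleted(4))  # True False
```

Because the bitmap lives outside the segment file, marking a deletion never touches the immutable segment itself.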
Where Mutable State Lives
At this point, our on-disk data is mostly immutable segments. The only mutable data is lightweight metadata:
- Segment list: A record of which segment files currently exist.
- Deletion bitmaps: The bitmaps that mark deleted documents in each segment.
- Counters/versions: Any additional info needed for consistency (e.g. a generation or timestamp).
This metadata (stored in a small file or database) tells us the current view of the index without modifying the segment files themselves. By separating immutable segment files from mutable metadata, we ensure safe, atomic updates: adding a segment or marking deletions is just an update to the metadata.
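One common way to make such metadata updates atomic on a POSIX filesystem is write-then-rename: write a complete new metadata file and rename it over the old one, so readers see either the old state or the new state, never a partial write. A sketch, assuming a simple JSON metadata file (the real implementation may use a small database instead):

```python
import json
import os
import tempfile

def save_metadata(path, segments, deleted, generation):
    """Atomically persist the segment list and deletion state."""
    meta = {"segments": segments, "deleted": deleted, "generation": generation}
    # Write to a temp file in the same directory, fsync, then rename:
    # os.replace is atomic, so a crash mid-write cannot corrupt the file.
    fd, tmp = tempfile.mkstemp(dir=os.path.dirname(path) or ".")
    with os.fdopen(fd, "w") as f:
        json.dump(meta, f)
        f.flush()
        os.fsync(f.fileno())
    os.replace(tmp, path)

save_metadata("meta.json", ["seg0", "seg1"], {"seg0": [2]}, generation=1)
with open("meta.json") as f:
    print(json.load(f)["segments"])  # ['seg0', 'seg1']
```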
How Search Runs
When we search for a term, we use a point-in-time view of all data:
- The current in-memory index (for very recent documents not yet flushed).
- All immutable segment files on disk.
- The deletion bitmaps for those segments.
We query each piece independently (often in parallel). For each segment (and the in-memory index), we look up the term in its term dictionary and retrieve the posting list. We filter out any document IDs marked as deleted by that segment’s bitmap. Finally, we merge all results from memory and all segments.
This way, searches remain simple: each segment is searched independently, and the results are combined with deletes and duplicates handled during merging.
Efficient Term Lookup
So far, we’ve solved persistence and updates. The remaining challenge is making each term lookup inside a segment fast and compact. Several data structures are possible for the term dictionary in a segment:
- Hashmap: Great for in-memory exact lookups. On disk, it falls apart. There’s no ordering, no traversal, and every lookup becomes random I/O. Prefix, range, fuzzy, and regex queries are impossible.
- Sorted array: Compact and simple, with fast exact lookups via binary search. Prefix and range queries are technically possible, but only the first seek is efficient. After that, queries degrade into sequential scans on disk, which makes fuzzy and regex queries impractical.
- B-tree: Designed for mutable, update-heavy workloads. Segments are read-only. The pointer-heavy structure wastes space and hurts cache locality, and term enumeration requires walking large subtrees for every query.
- Trie: Efficient for prefix queries by sharing common prefixes. Space efficiency is the problem. Beyond the shared prefix, each character becomes a node, leading to millions of small nodes and poor locality for large term sets.
- Finite State Automaton (FSA) / Finite State Transducer (FST): An FSA is a compacted trie that merges equivalent states, making it highly space-efficient, but it only represents the set of terms and cannot map them to postings. An FST extends this idea by associating each term with its posting list, while retaining the same compact structure. This enables fast exact lookups and naturally supports prefix, range, fuzzy, and regex queries. As a result, many high-performance search engines use FSTs for their term dictionaries.
At a high level, the goal is simple: each segment must efficiently map a term to its posting list.
Note: the same idea can be applied to the docID map. The mapping from docID → docNum can be stored using the same data structures (for example, a sorted structure or an FST), allowing us to check efficiently whether a document ID exists in a segment and, if so, retrieve its internal document number without scanning.
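For illustration, here is the sorted-array variant of the docID → docNum map, using binary search for lookups. An FST would be more compact, but the interface is the same; the `DocIdMap` name is hypothetical.

```python
import bisect

class DocIdMap:
    """Immutable docID -> docNum map backed by parallel sorted arrays."""

    def __init__(self, pairs):
        pairs = sorted(pairs)  # sort by docID once, at segment-write time
        self.doc_ids = [doc_id for doc_id, _ in pairs]
        self.doc_nums = [num for _, num in pairs]

    def lookup(self, doc_id):
        # Binary search over the sorted docIDs: O(log n), no scanning.
        i = bisect.bisect_left(self.doc_ids, doc_id)
        if i < len(self.doc_ids) and self.doc_ids[i] == doc_id:
            return self.doc_nums[i]
        return None  # docID not present in this segment

id_map = DocIdMap([("doc9", 0), ("doc2", 1), ("doc5", 2)])
print(id_map.lookup("doc5"), id_map.lookup("doc7"))  # 2 None
```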
Putting It All Together
Now let’s see how indexing, updating, and querying work in our system.
Indexing
When a new document arrives:
- Check for existing ID: If its docID already exists (in the in-memory index or an existing segment), mark the older version as deleted using the deletion process below.
- Assign docNum: Give the document a new internal docNum in the in-memory index.
- Extract terms: Tokenize the document text into terms and add this docNum to each term's posting list in memory.
- Hold in memory: Keep collecting documents in memory until the index is flushed.
At this point, the new document is indexed in RAM and will be included in search results.
Deleting
To delete a document by docID:
- In memory: If the document is still in the in-memory index, mark its docNum as deleted there.
- In segments: For each on-disk segment, look up docID in that segment's ID map. If found, set that segment's deletion bit for the corresponding docNum.
This ensures the document is excluded from future queries.
Updating
Updates are just delete-plus-add:
- Run the delete process for the existing docID (mark the old version deleted).
- Index the new version as if it were a fresh document (giving it a new docNum in memory).
The old version remains in its segment but marked deleted, and the fresh version will be indexed in memory (and later in a new segment).
Flushing
When the in-memory index grows large or at set intervals, we flush it to disk:
- Write segment: Serialize the in-memory index as a new segment file on disk (writing its term dictionary and posting lists sequentially).
- Persist deletions: Atomically update metadata with any pending deletion bitmap changes for existing segments.
- Update segment list: Add the new segment to the list of searchable segments (atomically).
- Reset memory: Clear the in-memory index to start fresh.
After flushing, all documents that were in memory are now in an immutable segment on disk.
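The flush steps above can be sketched as follows. The segment file format here (a single sorted JSON object) is purely illustrative; a real segment would use a binary layout with an FST term dictionary.

```python
import json
import os

def flush(mem_index, segment_dir, segment_list):
    """Write the in-memory index out as a new immutable segment file."""
    seg_name = f"segment_{len(segment_list)}.json"
    path = os.path.join(segment_dir, seg_name)
    segment = {
        # Terms written in sorted order so the dictionary supports
        # binary search / ordered traversal when read back.
        "terms": {t: mem_index[t] for t in sorted(mem_index)},
    }
    with open(path, "w") as f:
        json.dump(segment, f)
        f.flush()
        os.fsync(f.fileno())  # durable before it becomes searchable
    segment_list.append(seg_name)  # then persist metadata atomically
    mem_index.clear()              # reset memory for new documents

segments = []
mem = {"hello": [0, 2], "world": [1]}
flush(mem, ".", segments)
print(segments, mem)  # ['segment_0.json'] {}
```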
Searching
When answering a query for a term:
- Look up the term's posting list in the in-memory index (if present).
- Look up the term's posting list in each segment's term dictionary.
- For each retrieved posting list, filter out any document IDs marked as deleted by that segment's bitmap.
- Merge all remaining results. If a document appears multiple times (older and newer versions), keep only the newest version.
This process returns the correct set of live documents that match the term.
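Putting the query steps together as a sketch, with hypothetical data shapes (each segment a dict holding its `terms`, a `doc_ids` map from docNum back to global docID, and a `deleted` set standing in for the bitmap):

```python
def search(term, mem_index, segments):
    """Query memory first, then segments newest-first; newest version wins."""
    results, seen = [], set()
    # The in-memory index holds the most recent documents.
    for doc_id in mem_index.get(term, []):
        if doc_id not in seen:
            seen.add(doc_id)
            results.append(doc_id)
    for seg in reversed(segments):  # newest segment first
        for doc_num in seg["terms"].get(term, []):
            if doc_num in seg["deleted"]:
                continue  # skip logically deleted documents
            doc_id = seg["doc_ids"][doc_num]
            if doc_id not in seen:  # drop older versions of updated docs
                seen.add(doc_id)
                results.append(doc_id)
    return results

seg0 = {"terms": {"hello": [0, 1]},
        "doc_ids": {0: "doc1", 1: "doc2"},
        "deleted": {1}}
mem = {"hello": ["doc3"]}
print(search("hello", mem, [seg0]))  # ['doc3', 'doc1']
```

Because each segment is consulted independently, the per-segment lookups can run in parallel; only the final merge is sequential.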
Loading After a Restart
On restart, the system reconstructs a searchable view directly from disk.
- Load segment metadata and deletion state from the mutable metadata store.
- Open all segment files using memory mapping (mmap).
- Read each segment's metadata: the docID mapping, plus term dictionaries (lazy-loaded only when needed).
- Initialize an empty in-memory index for new documents.
Segments are memory-mapped because they are immutable. This allows the index to be accessed directly from disk with on-demand paging, fast startup, and low memory overhead. Only the data touched by queries is loaded into RAM.
After initialization, the system has the same point-in-time view as before the restart: immutable segments ready for search, deletion state restored, and a fresh in-memory index for new writes.
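Opening a segment with Python's `mmap` module might look like this sketch (the one-field segment file here is a stand-in for whatever layout the flush step wrote):

```python
import mmap

def open_segment(path):
    """Memory-map an immutable segment file for on-demand paging."""
    f = open(path, "rb")
    # ACCESS_READ: a read-only mapping, matching segment immutability.
    # Length 0 maps the whole file; the OS pages data in lazily,
    # so only bytes touched by queries are actually loaded into RAM.
    return mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)

# Demo: write a tiny fake segment, then map it.
with open("seg.bin", "wb") as f:
    f.write(b"hello-segment")
view = open_segment("seg.bin")
print(bytes(view[:5]))  # b'hello'
```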
We’re Done
At this point, the core search engine is complete. Everything that follows builds on this foundation. You can find the full implementation here: https://github.com/harshagw/postings
Indexing JSON Documents
A structured document like JSON is handled similarly. Treat each key-value pair as part of the document’s content. The typical approach is:
- Separate index per key: For each JSON key, build a separate term dictionary of the values under that key.
- Tokenize values: For each key’s value, break it into terms as usual.
- Query with keys: If a query specifies a key (e.g. author:smith), only search that key’s dictionary. If no key is given, search all keys’ dictionaries and merge results.
In effect, each JSON field has its own inverted index. Queries targeting specific fields only hit one index, while general searches combine results across all fields.
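A minimal per-field indexing sketch (illustrative names throughout): one posting dictionary per JSON key, queried either field-scoped or across all fields.

```python
from collections import defaultdict

field_indexes = defaultdict(lambda: defaultdict(list))  # key -> term -> docIDs

def index_json(doc_id, doc):
    # Each key-value pair is tokenized into its own field index.
    for key, value in doc.items():
        for term in str(value).lower().split():
            field_indexes[key][term].append(doc_id)

def query(term, key=None):
    if key is not None:  # e.g. author:smith searches only that field
        return list(field_indexes[key].get(term, []))
    hits = []  # no key given: search every field's dictionary and merge
    for idx in field_indexes.values():
        for doc_id in idx.get(term, []):
            if doc_id not in hits:
                hits.append(doc_id)
    return hits

index_json("doc1", {"author": "smith", "title": "search basics"})
index_json("doc2", {"author": "jones", "title": "smith biography"})
print(query("smith", key="author"))  # ['doc1']
print(query("smith"))               # ['doc1', 'doc2']
```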
Advanced Search: Prefix, Fuzzy, and Regex Queries
Advanced query types still rely on term lookup. They all come down to enumerating matching terms:
- Prefix search: If the query is a prefix, traverse the term dictionary (e.g. a trie or FSA) to list all terms with that prefix.
- Fuzzy search: Use an edit-distance automaton to enumerate all dictionary terms within a given Levenshtein distance of the query term.
- Regex search: Build a regex automaton to enumerate all terms matching the pattern.
After collecting the matching terms, we fetch each term’s posting list and merge them as usual (skipping deleted documents). This works efficiently if the term dictionary supports structured traversal (tries, FSA/FST are good for this).
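For example, with a sorted term dictionary, prefix enumeration becomes a binary search for the contiguous range of terms sharing the prefix (an FST supports the same traversal more compactly). A sketch, assuming terms are plain lowercase strings:

```python
import bisect

def terms_with_prefix(sorted_terms, prefix):
    """Enumerate all dictionary terms sharing a prefix via binary search."""
    lo = bisect.bisect_left(sorted_terms, prefix)
    # '\xff' sentinel: sorts after any realistic extension of the prefix,
    # so [lo, hi) is exactly the run of terms starting with it.
    hi = bisect.bisect_right(sorted_terms, prefix + "\xff")
    return sorted_terms[lo:hi]

terms = sorted(["hell", "hello", "helm", "help", "world"])
print(terms_with_prefix(terms, "hel"))  # ['hell', 'hello', 'helm', 'help']
```

Each enumerated term's posting list is then fetched and merged exactly as in a plain term query.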
What This Implementation Omits
Our simple search engine covers the basics, but it omits several real-world optimizations:
- Segment merging: We don’t merge older segments together to reclaim space and speed up queries.
- Advanced compression: Posting lists and dictionaries could be compressed, but aren’t here.
- Document scoring: There is no relevance ranking or scoring; results are simply all matching documents.
- Query optimization or caching: We always do a straightforward term lookup without any fancy caching layer.
- Distributed features: We aren’t dealing with shards, replicas, or clustering.
All of these can be added later. The core design remains the foundation for a robust search system.