Building a search engine

Search looks easy: type a few words, hit Enter, and instantly see results.

In this tutorial, I walk through building a simple search engine, focusing only on term search: finding documents that contain given words. I cover how we can index documents, store the index on disk, handle updates and deletions, and run queries while keeping the system efficient and reliable.

Source code: https://github.com/harshagw/postings

At the heart of most search systems is the inverted index. Instead of scanning every document for a query, we flip the relationship: we map each term to the list of documents where it appears. For example, our index might look like this:

"hello" → [doc1, doc3]
"world" → [doc2]

Now a query like “hello” or “world” becomes a quick lookup: find the term in the dictionary and retrieve its list of documents. Before diving deeper, let’s define some key terms:

  1. Term: a normalized word (for example, a lowercased token) extracted from a document’s text.

  2. Document: the unit of indexing, identified by a unique docID.

  3. Posting list: for a given term, the list of documents in which it appears.

  4. Term dictionary: the structure that maps each term to its posting list.

The First Approach

Our starting point is simple: build an inverted index in memory. When a document arrives, we extract its terms and add the document’s ID to each term’s posting list. Query processing becomes a simple lookup:

term → posting list
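A minimal sketch of this in-memory index in Python (the tokenizer and class names here are illustrative, not the repo’s actual implementation):

```python
from collections import defaultdict

class MemoryIndex:
    """Toy in-memory inverted index: term -> set of docIDs."""

    def __init__(self):
        self.postings = defaultdict(set)

    def add(self, doc_id, text):
        # Naive tokenizer: lowercase and split on whitespace.
        for term in text.lower().split():
            self.postings[term].add(doc_id)

    def search(self, term):
        # A query is just a dictionary lookup.
        return sorted(self.postings.get(term.lower(), set()))

idx = MemoryIndex()
idx.add("doc1", "hello there")
idx.add("doc2", "world")
idx.add("doc3", "hello world")
print(idx.search("hello"))  # ['doc1', 'doc3']
```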

This is very fast and easy if you have enough RAM. But two major problems immediately arise:

  1. Capacity: the index can grow beyond available RAM.

  2. Durability: a crash or restart loses the entire index, since nothing is persisted.

Both issues are unacceptable in production. Therefore, the first requirement is clear: the index must live on disk. A naive attempt might be to periodically serialize the entire in-memory index to disk. However, rewriting large files on every update is very expensive and can lead to inconsistent state or corruption if interrupted.

To address this, we adopt an append-only strategy: once we write a piece of index data to disk, we never modify it. Also, instead of one big monolithic file, we break the index into smaller, immutable chunks called segments.

Segments

A segment is a complete inverted index written to disk exactly once. After creation, it is read-only. Each segment contains:

  1. A term dictionary mapping each term to its posting list.

  2. The posting lists themselves.

  3. An immutable docID → docNum mapping for the documents it holds.

Segments are written sequentially and then never changed. When new documents arrive, we index them in a small in-memory structure. Once the in-memory index grows too large (or after a time interval), we flush it: write out a new segment to disk and then reset the in-memory index.

This means our on-disk index consists of multiple segments. Older segments stay intact; new ones are simply appended. This design eliminates expensive rewrites and ensures durability (since each segment, once written, is safely stored on disk).

Updates and Deletions

With immutable segments, how do we update or delete a document? We can’t modify existing segment files, so we use logical deletion combined with appending new data:

  1. A deletion marks the document as dead in lightweight per-segment state, without touching the segment file.

  2. An update marks the old version deleted and indexes the new version as a fresh document, which will land in a future segment.

To support this, each segment assigns documents a compact internal number (docNum) when it’s written (for example, 0, 1, 2, … within that segment). Inside the segment we store an immutable mapping: docID → docNum

This lets us find a document’s docNum given its global docID. We do not modify the segment file itself for deletions. Instead, we keep a separate deletion bitmap for each segment: one bit per docNum. A bit value of 0 means the document is live, and 1 means it is deleted.

When a document is updated or deleted, we:

  1. Look up its docID in the segment’s ID mapping to get docNum.

  2. Set the corresponding bit in that segment’s deletion bitmap to 1.

We only update the bitmap; the segment’s term dictionary and postings remain untouched. During search, any document whose bit is 1 is simply skipped.
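A small sketch of this scheme, assuming a toy Segment class (the bitmap is packed into a Python int here; a real implementation would use something like a roaring bitmap):

```python
class Segment:
    """Toy immutable segment with a mutable deletion bitmap."""

    def __init__(self, doc_ids, postings):
        # Built once at write time, never modified afterwards.
        self.doc_num = {d: n for n, d in enumerate(doc_ids)}  # docID -> docNum
        self.doc_ids = list(doc_ids)                          # docNum -> docID
        self.postings = postings                              # term -> [docNum, ...]
        self.deleted = 0                                      # bitmap packed into an int

    def delete(self, doc_id):
        n = self.doc_num.get(doc_id)
        if n is not None:
            self.deleted |= (1 << n)   # set the bit; the segment data is untouched

    def is_live(self, n):
        return not (self.deleted >> n) & 1

    def search(self, term):
        # Skip any docNum whose deletion bit is set.
        return [self.doc_ids[n] for n in self.postings.get(term, []) if self.is_live(n)]

seg = Segment(["doc1", "doc2"], {"hello": [0, 1]})
seg.delete("doc1")
print(seg.search("hello"))  # ['doc2']
```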

Where Mutable State Lives

At this point, our on-disk data is mostly immutable segments. The only mutable data is lightweight metadata:

  1. The list of currently active segments.

  2. Each segment’s deletion bitmap.

This metadata (stored in a small file or database) tells us the current view of the index without modifying the segment files themselves. By separating immutable segment files from mutable metadata, we ensure safe, atomic updates: adding a segment or marking deletions is just an update to the metadata.

How Search Runs

When we search for a term, we use a point-in-time view of all data:

  1. The current in-memory index.

  2. Every on-disk segment, paired with its deletion bitmap.

We query each piece independently (often in parallel). For each segment (and the in-memory index), we look up the term in its term dictionary and retrieve the posting list. We filter out any document IDs marked as deleted by that segment’s bitmap. Finally, we merge all results from memory and all segments.

This way, searches remain simple: each segment is searched independently, and the results are combined with deletes and duplicates handled during merging.

Efficient Term Lookup

So far, we’ve solved persistence and updates. The remaining challenge is making each term lookup inside a segment fast and compact. Several data structures are possible for the term dictionary in a segment:

  1. A sorted array of terms, searched with binary search.

  2. A hash table, which gives fast exact lookups but no ordered traversal.

  3. A trie or finite state transducer (FST), which is compact and also supports prefix and range scans.

At a high level, the goal is simple: each segment must efficiently map a term to its posting list.

Note: the same idea can be applied to the docId map. The mapping from docID → docNum can be stored using the same data structures (for example, a sorted structure or FST), allowing efficient checks for whether a document ID exists in a segment and, if so, retrieving its internal document number without scanning.
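As a deliberately simple example, here is term lookup over a sorted term list using binary search; the byte offsets are hypothetical placeholders for where each posting list would live in the segment file:

```python
import bisect

# Sorted term dictionary with parallel posting-list offsets (made-up values).
terms = ["apple", "banana", "cherry"]
offsets = [100, 240, 310]

def lookup(term):
    # Binary search: O(log n) comparisons instead of scanning every term.
    i = bisect.bisect_left(terms, term)
    if i < len(terms) and terms[i] == term:
        return offsets[i]
    return None  # term absent from this segment

print(lookup("banana"))  # 240
print(lookup("durian"))  # None
```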

Putting It All Together

Now let’s see how indexing, updating, and querying work in our system.

Indexing

When a new document arrives:

  1. Check for existing ID: If its docID already exists (in the in-memory index or an existing segment), mark the older version as deleted using the deletion process below.

  2. Assign docNum: Give the document a new internal docNum in the in-memory index.

  3. Extract terms: Tokenize the document text into terms and add this docNum to each term’s posting list in memory.

  4. Hold in memory: Keep collecting documents in memory until the index is flushed.

At this point, the new document is indexed in RAM and will be included in search results.
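The indexing steps above can be sketched as follows (the Writer class and its fields are illustrative; a real writer would also consult on-disk segments in step 1):

```python
class Writer:
    """Toy in-memory indexing path with upsert semantics."""

    def __init__(self):
        self.next_doc_num = 0
        self.doc_num = {}     # docID -> docNum (in-memory documents only)
        self.deleted = set()  # docNums marked deleted in memory
        self.postings = {}    # term -> [docNum, ...]

    def index(self, doc_id, text):
        old = self.doc_num.get(doc_id)
        if old is not None:
            self.deleted.add(old)          # step 1: mark the older version deleted
        n = self.next_doc_num              # step 2: assign a fresh docNum
        self.next_doc_num += 1
        self.doc_num[doc_id] = n
        for term in text.lower().split():  # step 3: add to each term's posting list
            self.postings.setdefault(term, []).append(n)

w = Writer()
w.index("doc1", "hello world")
w.index("doc1", "hello again")   # update: the old docNum 0 is now deleted
print(w.deleted)  # {0}
```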

Deleting

To delete a document by docID:

  1. In memory: If the document is still in the in-memory index, mark its docNum as deleted there.

  2. In segments: For each on-disk segment, look up docID in that segment’s ID map. If found, mark that segment’s deletion bit for the corresponding docNum.

This ensures the document is excluded from future queries.

Updating

Updates are just delete-plus-add:

  1. Run the delete process for the existing docID (mark the old version deleted).

  2. Index the new version as if it were a fresh document (giving it a new docNum in memory).

The old version remains in its segment but marked deleted, and the fresh version will be indexed in memory (and later in a new segment).

Flushing

When the in-memory index grows large or at set intervals, we flush it to disk:

  1. Write segment: Serialize the in-memory index as a new segment file on disk (writing its term dictionary and posting lists sequentially).

  2. Persist deletions: Atomically update metadata with any pending deletion bitmap changes for existing segments.

  3. Update segment list: Add the new segment to the list of searchable segments (atomically).

  4. Reset memory: Clear the in-memory index to start fresh.

After flushing, all documents that were in memory are now in an immutable segment on disk.
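A simplified flush might look like this; JSON is used purely for readability, and the write-then-atomic-rename pattern is what makes publishing the new segment list safe:

```python
import json
import os
import tempfile

def flush(postings, seg_path, meta_path):
    # 1. Write the new segment sequentially and fsync before publishing it.
    with open(seg_path, "w") as f:
        json.dump({t: postings[t] for t in sorted(postings)}, f)
        f.flush()
        os.fsync(f.fileno())
    # 2-3. Publish the segment by atomically replacing the metadata file.
    meta = {"segments": []}
    if os.path.exists(meta_path):
        with open(meta_path) as f:
            meta = json.load(f)
    meta["segments"].append(seg_path)
    fd, tmp = tempfile.mkstemp(dir=os.path.dirname(meta_path) or ".")
    with os.fdopen(fd, "w") as f:
        json.dump(meta, f)
    os.replace(tmp, meta_path)  # readers see the old or new list, never a half-write
    # 4. The caller then resets its in-memory index.

flush({"hello": [0, 2], "world": [1]}, "segment_0.json", "meta.json")
```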

Searching

When answering a query for a term:

  1. Look up the term’s posting list in the in-memory index (if present).

  2. Look up the term’s posting list in each segment’s term dictionary.

  3. For each retrieved posting list, filter out any document IDs marked as deleted by that segment’s bitmap.

  4. Merge all remaining results. If a document appears multiple times (older and newer versions), keep only the newest version.

This process returns the correct set of live documents that match the term.
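Putting those four steps together in a sketch (the segment structures are illustrative dicts standing in for the on-disk format):

```python
def search(term, mem_postings, mem_deleted, segments):
    hits = set()
    # Steps 2-3: gather postings from every segment, honoring its deletion bitmap.
    for seg in segments:
        for n in seg["postings"].get(term, []):
            if n not in seg["deleted"]:
                hits.add(seg["doc_ids"][n])
    # Step 1: the in-memory index (its postings hold global docIDs here).
    for doc_id in mem_postings.get(term, []):
        if doc_id not in mem_deleted:
            hits.add(doc_id)
    # Step 4: updates always delete older copies, so the set holds only live versions.
    return sorted(hits)

segments = [
    {"doc_ids": ["doc1", "doc2"], "postings": {"hello": [0, 1]}, "deleted": {0}},
]
print(search("hello", {"hello": ["doc3"]}, set(), segments))  # ['doc2', 'doc3']
```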

Loading After a Restart

On restart, the system reconstructs a searchable view directly from disk.

  1. Load segment metadata and deletion state from the mutable metadata store.

  2. Open all segment files using memory mapping (mmap).

  3. Read each segment’s metadata: the docID mapping, plus term dictionaries that are lazily loaded only when needed.

  4. Initialize an empty in-memory index for new documents.

Segments are memory-mapped because they are immutable. This allows the index to be accessed directly from disk with on-demand paging, fast startup, and low memory overhead. Only the data touched by queries is loaded into RAM.
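In Python, memory-mapping an immutable segment file looks like this (the file name and contents are made up for the example):

```python
import mmap

# Create a stand-in "segment" file for the example.
with open("segment.bin", "wb") as f:
    f.write(b"hello-postings-data")

with open("segment.bin", "rb") as f:
    mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)  # read-only map
    header = mm[0:5]  # only the touched pages are faulted into RAM
    mm.close()

print(header)  # b'hello'
```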

After initialization, the system has the same point-in-time view as before the restart: immutable segments ready for search, deletion state restored, and a fresh in-memory index for new writes.

We’re Done

At this point, the core search engine is complete. Everything that follows builds on this foundation. You can find the full implementation here: https://github.com/harshagw/postings

Indexing JSON Documents

A structured document like JSON is handled similarly. Treat each key-value pair as part of the document’s content. The typical approach is:

  1. Flatten the document into field/value pairs.

  2. Tokenize each field’s value separately.

  3. Index each token under a field-qualified term (for example, title:hello).

In effect, each JSON field has its own inverted index. Queries targeting specific fields only hit one index, while general searches combine results across all fields.
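A sketch of field-qualified indexing (the helper name and tokenizer are illustrative):

```python
import json

def field_terms(doc):
    # Prefix every token with its field name, so "title" terms and
    # "body" terms live in disjoint parts of one term dictionary.
    for field, value in doc.items():
        for token in str(value).lower().split():
            yield f"{field}:{token}"

doc = json.loads('{"title": "Hello World", "body": "search engines"}')
index = {}
for term in field_terms(doc):
    index.setdefault(term, []).append("doc1")

print(sorted(index))  # ['body:engines', 'body:search', 'title:hello', 'title:world']
```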

Advanced Search: Prefix, Fuzzy, and Regex Queries

Advanced query types still rely on term lookup. They all come down to enumerating matching terms:

  1. Prefix queries enumerate every term that starts with the given prefix.

  2. Fuzzy queries enumerate terms within a small edit distance of the query term.

  3. Regex queries enumerate terms that match the pattern.

After collecting the matching terms, we fetch each term’s posting list and merge them as usual (skipping deleted documents). This works efficiently if the term dictionary supports structured traversal (tries, FSA/FST are good for this).
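For example, prefix expansion over a sorted term dictionary is just a binary search followed by a forward scan:

```python
import bisect

terms = sorted(["hello", "help", "helm", "world"])

def expand_prefix(prefix):
    # Jump to the first term >= prefix, then scan while terms still match.
    i = bisect.bisect_left(terms, prefix)
    out = []
    while i < len(terms) and terms[i].startswith(prefix):
        out.append(terms[i])
        i += 1
    return out

print(expand_prefix("hel"))  # ['hello', 'helm', 'help']
```

Each returned term’s posting list is then fetched and merged exactly as in a normal term query.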

What This Implementation Omits

Our simple search engine covers the basics, but it omits several real-world optimizations:

  1. Segment merging (compaction), which reclaims space from deleted documents and bounds the number of segments.

  2. Relevance scoring and ranking of results.

  3. Posting-list compression.

  4. Caching and concurrency control for many readers and writers.

All of these can be added later. The core design remains the foundation for a robust search system.