From Source Code to Machine Code: The Whole Pipeline

Working doc: What I’m learning day by day. Sections will grow and get reorganized.

Computers only understand machine code. All the fancy languages we write in (Go, Python, Java, Clojure) eventually need to become a form the CPU can run.

The .exe files on Windows, the binaries inside /usr/bin on Linux, and the executables inside macOS app bundles are all the same thing: files that contain machine instructions plus metadata the operating system knows how to load.

The OS doesn’t interpret these files. It loads them into memory and jumps to the entry point recorded in the file’s header. From there, the bytes the CPU executes are raw opcodes for whatever architecture the file targets: x86-64, ARM64, RISC-V, anything.
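
You can poke at that metadata yourself. A minimal sketch in C, assuming a Linux machine (/bin/ls is just a convenient binary to inspect): every ELF executable begins with the magic bytes 0x7F 'E' 'L' 'F', which is how the loader recognizes the format before it does anything else.

    /* Print the first four bytes of a native binary. On Linux these are
     * the ELF magic number the loader checks before loading anything.
     * /bin/ls is an arbitrary example; any native executable works. */
    #include <stdio.h>

    int main(void) {
        FILE *f = fopen("/bin/ls", "rb");
        if (!f) { perror("fopen"); return 1; }

        unsigned char magic[4];
        if (fread(magic, 1, 4, f) == 4)
            printf("%02x %02x %02x %02x\n",   /* prints: 7f 45 4c 46 */
                   magic[0], magic[1], magic[2], magic[3]);
        fclose(f);
        return 0;
    }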

Every tool in this ecosystem (compilers, interpreters, virtual machines) is ultimately trying to bridge one gap: take human-readable code and turn it into something a CPU can execute, either ahead of time (as a standalone executable) or on the fly (inside another program that is itself an executable).

What compilers, interpreters, and VMs actually are

It helps to start with the simplest truth: any program is just an executable that consumes input, produces output, and may trigger side effects.

A compiler, an interpreter, or a VM is no different: they’re regular executables. What changes is what they take as input and what they produce.

Compiler

A compiler takes source code and transforms it into another language. That target might be native machine code, a VM’s bytecode, or even another high-level language. The defining property: a compiler produces an artifact ahead of time.

Virtual Machine

A virtual machine (VM) is a spec for an imaginary CPU: an instruction set plus the rules for how each instruction behaves. Real CPUs don’t understand that bytecode, so you implement a VM program that executes those instructions. That VM program is compiled separately for x86-64, ARM64, RISC-V, etc. The VM binary is architecture-specific; the bytecode is portable.
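
At its heart, every VM is a loop that fetches the next instruction and dispatches on it. A minimal sketch in C; the three opcodes are invented for this example, not taken from any real VM:

    /* A minimal stack-based VM: fetch an instruction, dispatch, repeat.
     * The opcodes are invented for this sketch; real VMs like the JVM
     * are the same loop with far richer instruction sets. */
    #include <stdio.h>

    enum { OP_PUSH, OP_ADD, OP_HALT };

    int run(const int *code) {
        int stack[64], sp = 0, pc = 0;
        for (;;) {
            switch (code[pc++]) {
            case OP_PUSH: stack[sp++] = code[pc++]; break;    /* operand follows */
            case OP_ADD:  sp--; stack[sp - 1] += stack[sp]; break;
            case OP_HALT: return stack[sp - 1];               /* top of stack */
            }
        }
    }

    int main(void) {
        /* "Bytecode" for 2 + 3: portable to any CPU this VM is built for. */
        int program[] = { OP_PUSH, 2, OP_PUSH, 3, OP_ADD, OP_HALT };
        printf("%d\n", run(program));                         /* prints 5 */
        return 0;
    }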

Interpreter

An interpreter is the simplest execution model: it walks code directly — source, AST, or bytecode — and performs operations step by step. No standalone binary, no ahead-of-time output. A VM is essentially an interpreter for a custom bytecode format; adding a JIT is an optimization, not the definition.
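
The tree-walking flavor is the easiest to picture. A sketch that evaluates a hand-built AST for 2 * (3 + 4); the node layout is invented for illustration:

    /* A tiny tree-walking interpreter: evaluate an expression AST
     * directly, with no code generation at all. */
    #include <stdio.h>

    typedef struct Node {
        char op;                    /* '+' or '*', or 0 for a literal */
        int value;                  /* used when op == 0 */
        struct Node *left, *right;
    } Node;

    int eval(const Node *n) {
        if (n->op == 0) return n->value;          /* leaf: just a number */
        int l = eval(n->left), r = eval(n->right);
        return n->op == '+' ? l + r : l * r;      /* interior: apply the op */
    }

    int main(void) {
        /* AST for 2 * (3 + 4), as a parser would have produced it. */
        Node two = {0, 2, 0, 0}, three = {0, 3, 0, 0}, four = {0, 4, 0, 0};
        Node sum = {'+', 0, &three, &four};
        Node product = {'*', 0, &two, &sum};
        printf("%d\n", eval(&product));           /* prints 14 */
        return 0;
    }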

Assembler

An assembler is a simple translator that turns assembly language into machine code. Assembly sits just one level above raw CPU instructions, so each line maps almost directly to a single opcode. We rarely write it today, but the tool itself works like a tiny, specialized translator focused only on converting mnemonics into real executable instructions.
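
A toy version of that translation, using three real single-byte x86-64 opcodes; a real assembler also encodes operands, resolves labels, and writes object files, all of which this sketch ignores:

    /* A toy "assembler": map a few single-byte x86-64 mnemonics
     * to their real opcodes. */
    #include <stdio.h>
    #include <string.h>

    static const struct { const char *mnemonic; unsigned char opcode; } table[] = {
        { "nop", 0x90 },   /* do nothing */
        { "ret", 0xC3 },   /* return from procedure */
        { "hlt", 0xF4 },   /* halt the CPU */
    };

    int assemble_line(const char *line, unsigned char *out) {
        for (size_t i = 0; i < sizeof table / sizeof table[0]; i++)
            if (strcmp(line, table[i].mnemonic) == 0) {
                *out = table[i].opcode;
                return 1;
            }
        return 0;   /* unknown mnemonic */
    }

    int main(void) {
        const char *source[] = { "nop", "ret" };
        unsigned char byte;
        for (int i = 0; i < 2; i++)
            if (assemble_line(source[i], &byte))
                printf("%s -> 0x%02X\n", source[i], byte);
        return 0;
    }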

Linker

A linker takes one or more object files and libraries and combines them into a single executable or shared library.

An object file is the output of a compiler or assembler for a single source file. It contains machine code for that file, a list of symbols it defines, a list of symbols it depends on, and relocation information for addresses that are not known yet.

The linker resolves those symbols across files, assigns final memory addresses, patches the machine code accordingly, and produces a binary in a format the operating system can load, such as ELF (Linux), Mach-O (macOS), or PE (Windows).
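
A minimal sketch of what “resolving symbols” means, using two hypothetical C files: main.o records square as an undefined symbol, and the linker matches it against the definition in square.o.

    /* square.c: defines the symbol `square`. */
    int square(int x) { return x * x; }

    /* main.c: references `square`. Its object file records the symbol
     * as undefined, with a relocation where the call address belongs. */
    extern int square(int x);
    int main(void) { return square(5); }

    /*
     * cc -c square.c           -> square.o  (defines:   square)
     * cc -c main.c             -> main.o    (undefined: square)
     * cc main.o square.o -o demo            (linker resolves the reference,
     *                                        patches the call, emits the binary)
     */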


The execution paths: how code actually reaches the CPU

No matter the language, the journey usually starts the same way.

Shared frontend

Most languages have a common frontend:

  1. Lexer: scans the raw source text into a stream of tokens.
  2. Parser: consumes those tokens and builds an abstract syntax tree (AST).

At this point, the language understands the code’s shape, but nothing can run yet.
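
To make the first stage concrete, here is a minimal lexer sketch in C. It handles only digits, '+', and whitespace; the token names are invented for this example:

    /* A minimal lexer: turn "12 + 34" into tokens. */
    #include <ctype.h>
    #include <stdio.h>

    typedef enum { TOK_NUMBER, TOK_PLUS, TOK_END } TokenKind;
    typedef struct { TokenKind kind; int value; } Token;

    Token next_token(const char **src) {
        while (isspace((unsigned char)**src)) (*src)++;      /* skip spaces */
        if (**src == '\0') return (Token){ TOK_END, 0 };
        if (**src == '+') { (*src)++; return (Token){ TOK_PLUS, 0 }; }
        int value = 0;
        while (isdigit((unsigned char)**src))                /* read a number */
            value = value * 10 + (*(*src)++ - '0');
        return (Token){ TOK_NUMBER, value };
    }

    int main(void) {
        const char *src = "12 + 34";
        for (Token t = next_token(&src); t.kind != TOK_END; t = next_token(&src)) {
            if (t.kind == TOK_NUMBER) printf("NUMBER(%d)\n", t.value);
            else                      printf("PLUS\n");
        }
        return 0;
    }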

What happens next depends on the execution model.

Path 1: Native compilation

In a native compilation pipeline, the compiler lowers the AST into machine-level representations.

The flow usually looks like:

lexer → parser → compiler → assembler (sometimes implicit) → linker → executable

The compiler may emit assembly or object code directly. The assembler turns assembly into object files. The linker combines those object files and libraries into a final executable that the OS can load and run.

Example - C, C++, Rust, and Go largely live on this path. Go compiles directly to native machine code: an x86-64 Go binary contains x86-64 instructions and won’t run on ARM64, because the CPU instruction sets differ.
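
With a C toolchain you can run those stages by hand and watch each intermediate file appear (hello.c is a stand-in name):

    /* hello.c: a one-liner to push through the pipeline step by step. */
    #include <stdio.h>
    int main(void) { printf("hi\n"); return 0; }

    /*
     * cc -S hello.c        -> hello.s  (compiler emits assembly)
     * cc -c hello.s        -> hello.o  (assembler emits an object file)
     * cc hello.o -o hello  -> hello    (linker emits the executable)
     *
     * `cc hello.c -o hello` runs all three steps in one go.
     */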

Path 2: Bytecode + virtual machine

Some languages compile to an intermediate instruction set instead of native machine code.

The flow looks like:

lexer → parser → compiler → bytecode → virtual machine

Here, the compiler produces bytecode for a virtual CPU. A virtual machine executable then interprets or JIT-compiles that bytecode at runtime. The VM itself is native machine code, but the bytecode is portable.

Example - Java, Kotlin, and many scripting languages use this model. Java compiles to JVM bytecode. The JVM (Java Virtual Machine) interprets that bytecode (and JITs hot code) on whatever CPU it’s running on. One .class file runs everywhere. Clojure emits JVM bytecode too, inheriting JVM portability and its runtime optimizations.

Path 3: Direct interpretation

The simplest model skips ahead-of-time code generation entirely.

The flow looks like:

lexer → parser → interpreter

The interpreter walks the AST (or a lightly compiled form) and executes operations directly. There is no separate executable produced for the program being run.

This model is common in early language implementations and in languages optimized for flexibility over raw performance.

Example - Early JavaScript engines and small DSLs like jq.


How a compiler is actually structured

A real compiler isn’t one big box. It’s three specialized stages, each solving a different class of problems.

Frontend

This stage understands your language. It parses the source, builds the AST, resolves identifiers, checks types, reports errors, and lowers the program into a machine-friendly intermediate form.

Middle-end

Optimizes that intermediate form: constant folding, dead-code elimination, function inlining, loop optimizations, etc.
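
Constant folding is the easiest of these to demonstrate: when every operand of an operation is known at compile time, the optimizer computes the result and deletes the operation. A sketch over a made-up IR node shape:

    /* Constant folding: if both operands of an ADD node are constants,
     * replace the node with a single constant. */
    #include <stdio.h>

    typedef enum { IR_CONST, IR_ADD } IrKind;

    typedef struct Ir {
        IrKind kind;
        int value;                 /* valid when kind == IR_CONST */
        struct Ir *lhs, *rhs;      /* valid when kind == IR_ADD */
    } Ir;

    void fold(Ir *node) {
        if (node->kind != IR_ADD) return;
        fold(node->lhs);
        fold(node->rhs);           /* fold children first (bottom-up) */
        if (node->lhs->kind == IR_CONST && node->rhs->kind == IR_CONST) {
            node->value = node->lhs->value + node->rhs->value;
            node->kind = IR_CONST; /* the ADD disappears from the program */
        }
    }

    int main(void) {
        /* IR for: (1 + 2) + 3 */
        Ir one = {IR_CONST, 1, 0, 0}, two = {IR_CONST, 2, 0, 0}, three = {IR_CONST, 3, 0, 0};
        Ir inner = {IR_ADD, 0, &one, &two};
        Ir outer = {IR_ADD, 0, &inner, &three};
        fold(&outer);
        printf("%d\n", outer.value);   /* prints 6: no additions left at runtime */
        return 0;
    }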

Backend

Turns the optimized IR into the final target: machine code, bytecode, C code, or anything else.


LLVM vs JVM: two different philosophies

Where LLVM fits

LLVM originally meant “Low Level Virtual Machine,” but today it’s a full compiler toolchain. The division of labor:

  1. You write the frontend for your language.
  2. You lower your program to LLVM IR.
  3. LLVM’s middle-end runs many optimization passes.
  4. LLVM’s backends generate machine code for x86-64, ARM64, RISC-V, WebAssembly, and more.

Languages like Rust, Swift, Julia, Zig, and many others rely on LLVM because it removes the need to build your own optimizer and code generator. There are other notable backends too: GCC (a full toolchain of its own), Cranelift, and QBE.

Where JVM fits

The JVM isn’t a static compiler backend; it’s a runtime:

  1. Your frontend produces JVM bytecode.
  2. The JVM interprets that bytecode, profiles it at runtime, applies optimizations, and JIT-compiles hot code to native machine instructions on the fly.


Bootstrapping: how a language learns to build itself

A language doesn’t start by compiling itself. It can’t: the language doesn’t exist yet. So the early versions are built in some existing host language, and only later does the language “stand on its own feet.” This process is called bootstrapping.

Key terms:

  1. Host language: the existing language the first versions of the compiler are written in.
  2. Self-hosting: a compiler written in the same language it compiles.
  3. Bootstrapping: the process of getting from the first state to the second.

Here’s how it actually works:

  1. You write compiler v1 in C (host language). It’s minimal, maybe only supports basic syntax, but it works.
  2. You extend it — still in C — and release compiler v2, a more capable version of the same compiler. Now the language is mature enough that you can write larger programs in it. So you take the next big step:
  3. You rewrite the compiler in the language itself. This produces the source code for compiler v3 — but v3 isn’t a binary yet; it’s just code written in your new language.
  4. You take compiler v2 (the C-based version) and use it to compile that v3 source code into a real executable. The output binary is compiler v3. At this point, the language becomes self-hosting.

Compiler v3 is written in the language and can compile programs written in that same language — including future versions of the compiler itself.

Now the cycle continues naturally:

  1. You improve the compiler again, producing the source for compiler v4 (still written in the language).
  2. You use compiler v3 to compile that source into the v4 compiler.

From here on, the language is independent. Each new generation of the compiler is produced by the previous one. The original C code fades away; the toolchain is now sustained by the language itself.

This is the exact path followed by Go, whose compiler was written in C until it became self-hosting in Go 1.5, and by Rust, whose first compiler was written in OCaml before rustc was rewritten in Rust itself.