Most Opcodes Are Typesafe If Memory Is Infinite

Modelling An Executable Binary In Coq

Aug 27, 2024

Most existing program verification sits on top of the compiler, and assumes that compiler transformations are sound. It is also possible, if not a little tedious, to model program semantics at the instruction level. The only trick used to do so is to assume that stack and heap memory are infinite, which allows for us to model these memory spaces as typed memory regions. After making this assumption it is then possible to project typed effects onto a definition of each individual opcode. As an example program we can model some properties of a hello world program.

What is a Typed Memory Region?

(* Memory Is Denominated in Bytes *)
Record RegionByte := mkRegionByte {
   tt : nat;      (* The type of this region represented as an Ordinal *)
   tt_byte : nat; (* The type-byte-index of this byte *)
}.
Definition beq_rb (l: RegionByte)(r: RegionByte): bool :=
   (Nat.eqb l.(tt) r.(tt)) && (Nat.eqb l.(tt_byte) r.(tt_byte)).
Definition beqo_rb (l: option RegionByte)(r: RegionByte): bool := match l with
   | Some lo => (Nat.eqb lo.(tt) r.(tt)) && (Nat.eqb lo.(tt_byte) r.(tt_byte))
   | None => false end.

In Coq we can associate an individual byte of memory with a nominal type. Each byte will then receive a type ordinal that we will associate with known properties of that type. The region byte also needs a byte index which tells us which ordered byte of the corresponding type is saved in this memory cell. Memory can be allocated in ascending order, or descending order, or broken up into pieces across registers, so we need to keep track of what goes where. We can use these properties to verify that pushing and popping to move typed regions is still typesafe.

(* Knowledge of a Memory Region is a Partial Function *)
Record Region := mkRegion {
   known : ZM.t RegionByte;
}.

(* Simplified Memory State assumes that
   1. stack space is sufficient (effectively infinite)
   2. sys_brk always succeeds at acquiring more memory (effectively infinite)
 *)
Record MemoryState := mkMemoryState {
   register_state : Region;
   stack_state : Region;
   frame_state : Region;
   heap_state : Region;
}.

A typed memory Region is then defined as a partial function that stores some information about what typed bytes are stored and where. We can model register state as a simple Region if we assign an ordinal to each register byte. When doing so it is important to remember that there are overlapping register regions for different sized registers so %AL %AX %EAX %RAX all need to have partially overlapping ordinals. After doing so we can model a standard stack-based program with known registers, stack-pointer-relative memory, base-pointer-relative memory, and heap memory. If we are careful about tracking offsets then we can assume that stack-relative and heap-relative memory are independent. We can’t model absolute chaos with this method, but many practical programs can fit into this model.

What is An Instruction?

(* Initially nothing is known about the memory state *)
Definition initial_memory_state := mkMemoryState (mkRegion ZM.empty) (mkRegion ZM.empty) (mkRegion ZM.empty) (mkRegion ZM.empty).

(* This is for internal use, it does not directly correspond to the actual instruction *)
Definition push_stack (st: MemoryState)(tt: nat)(tt_byte: nat): MemoryState :=
   let rb := mkRegionByte tt tt_byte in
   let new_stack := (ZM.fold (fun k e m => ZM.add (BinInt.Z.add k BinInt.Z.one) e m) st.(stack_state).(known) ZM.empty) in
   let new_stack := mkRegion (ZM.add BinInt.Z.zero rb new_stack) in
   mkMemoryState st.(register_state) new_stack st.(frame_state) st.(heap_state).

(* This is for internal use, it does not directly correspond to the actual instruction *)
Definition pop_stack (st: MemoryState): (MemoryState * RegionByte) :=
   let rb := region_lookup st.(stack_state) (BinInt.Z.zero) in
   let new_stack := mkRegion (ZM.fold (fun k e m => ZM.add (BinInt.Z.sub k (BinInt.Z.one)) e m) st.(stack_state).(known) ZM.empty) in
   let st := mkMemoryState st.(register_state) new_stack st.(frame_state) st.(heap_state) in
   (st , rb).

Check eq_refl : ((pop_stack initial_memory_state) = (_ , {| tt := 0; tt_byte := 0; |})).
Check eq_refl : (pop_stack (push_stack initial_memory_state 123 456) = (_ , {| tt := 123; tt_byte := 456; |})).

push and pop can be modelled as pushing some information onto the stack-pointer-relative memory. If an instruction wants to allocate data onto the base-pointer-relative section then it can add a special annotation to this instruction saying so. This is a little cheat to say that we can have the programmer add annotations to clarify intent. Remember, this is interactive theorem proving: interactive!

Actual instructions can be modelled by what effects they have on memory state. When paired with sufficient annotations, we can start to achieve more interesting effects with rich semantics to model high-level ideas about what the intention of the program is.

(* An Instruction is defined by its mnemonic and effect *)
Record Instruction := mkInstruction {
   mnemonic : string;
   effect : MemoryState -> MemoryState;
}.

Sticking It All Together

(* A Jmp Instruction *)
Record JmpInstruction := mkJmp {
   dst : string;
   condition : MemoryState;
}.

(* A Basic Block is a list of instructions with no branches *)
Record BasicBlock := mkBasicBlock {
   instructions : list Instruction;
   tail : list JmpInstruction;
}.

(* A Control Flow Graph is an index of basic blocks *)
Record ControlFlowGraph := mkCFG {
   blocks : ZM.t BasicBlock;
   labels : ZM.t string;
}.

Modelling Control Flow Graphs is fairly simple. A Basic Block is a sequential list of instructions followed by a sequential list of jumps and fall-through transitions. The graph is then defined as a mess of blocks mapped by their labels.

Defining An Actual Program

Now that we have defined some primitives we can model an actual hello world program.

Compute
   let cfg := empty_control_flow_graph in
   let cfg := (declare_global cfg "_start") in
   let cfg := (declare_text cfg) in
   let cfg := (declare_label cfg "_start") in
   let cfg := (unary_op cfg "push" (register "rbp")) in
   let cfg := (binary_op cfg "mov" (register "rsp") (register "rbp")) in
   let cfg := (unary_op cfg "call" (raw_value "main")) in
   let cfg := (binary_op cfg "mov" (register "rbp") (register "rsp")) in
   let cfg := (unary_op cfg "pop" (register "rbp")) in
   let cfg := (binary_op cfg "mov" (address "60") (register "rax")) in
   let cfg := (binary_op cfg "mov" (address "0") (register "rdi")) in
   let cfg := (zero_op cfg "syscall") in
   let cfg := (declare_label cfg "main") in
   let cfg := (binary_op cfg "mov" (address "1") (register "RAX")) in
   let cfg := (binary_op cfg "mov" (address "1") (register "RDI")) in
   let cfg := (binary_op cfg "mov" (address "uuid_0000000000007838") (register "RSI")) in
   let cfg := (binary_op cfg "mov" (address "11") (register "RDX")) in
   let cfg := (zero_op cfg "syscall") in
   let cfg := (binary_op cfg "mov" (register "rbp") (register "rsp")) in
   let cfg := (binary_op cfg "sub" (address "8") (register "rsp")) in
   let cfg := (zero_op cfg "ret") in
   let cfg := (declare_data cfg) in
   let cfg := (declare_label cfg "uuid_0000000000007838") in
   let cfg := (commit_ascii cfg "Hello World\n") in
   let cfg := (commit_zero cfg 1) in
   verify_control_flow_graph cfg.

We can use the builder model to gradually create a control flow graph. In a future post we will look at how these individual instructions are defined, but for now we can just look at the skeleton of the program which is a direct model of the actual input program. Yes, these proofs are mostly automatically generated from a source that looks like this.

.global _start
.text
_start:
        push %rbp
        mov %rsp, %rbp
        call main
        mov %rbp, %rsp
        pop %rbp
        mov $60, %rax
        mov $0, %rdi
        syscall
#expect STDOUT.ends_with("Hello World\n")
#expect EXIT_STATUS=0
main:
        mov $1, %RAX
        mov $1, %RDI
        mov $uuid_0000000000007838, %RSI
        mov $11, %RDX
        syscall
        mov %rbp, %rsp
        sub $8, %rsp
        ret
.data
uuid_0000000000007838:
        .ascii "Hello World\n"
        .zero 1

Conclusion

This is a rough sketch of low-level verification in Coq. We still need to define all the instruction set and run the verification but that will be another post. For now this verification process is live on the lmv dev branch of Lambda-Mountain.

Andrew’s Substack

Discussion about this post