Assembler in Ink, Part II: x86 assembly, instruction encoding, and debugging symbols

15 February 2021
14 mins
Contents

This is the sequel to Assembler in Ink, Part I where I started documenting my process of building an x86 assembler in Ink.

In Part I, we looked at the high-level jobs of an assembler, and how August constructs an ELF file that an operating system (specifically Linux here) could understand and turn into a process. In this second part, we’ll dive into how assembly source code is transformed into machine code and data that we can place into the ELF file. We’ll also look at adding a symbol table into the generated binary, so a disassembly will show us functions and labels we’ve defined in the source code.

See August on GitHub →

We’ll start right where we left off. In Part I, we built a program that accepted a fully encoded chunk of machine code and some data and generated an ELF binary. This time, we need to transform assembly source code into that encoded machine code, so we can give it to the ELF library.

Assembling x86 instructions

x86, especially 64-bit x86 that we’re dealing with here, is a complex instruction set encumbered with lots of historical baggage and sprawling feature sets like vector instructions and addressing modes. But for building August, we only need to support a small subset we can use to validate our ideas and write some small programs. Namely, I elected to support:

How machine instructions are encoded into bits differs pretty dramatically between architectures. Intel sticks to the style of having an opcode with optional prefixes, suffixes, and operands, while something like RISC-V sticks to a single-size instruction but with a much more complex encoding scheme that interleaves data with the instruction opcodes.

In the case of our basic x86 arithmetic instructions, I used a few different reference resources available to understand what types of instruction layouts we need to know about. Specifically, these came in handy:

Of those, the first link – the ISA reference – was most helpful, but there’s a bit of a learning curve to understanding exactly what all the notations mean. I won’t bore you with the details of that notation here, and instead tell you what I learned.

First, there are what I’d call the simple instructions, the ones that are trivial because they’re just one value. For example, the syscall instruction for invoking a system call is just two bytes, 0x0f05. nop (the no-op instruction) is another such instruction, but it’s just one byte, 0x90.

Then, there are instructions that always take the same type of operand. The jmp instruction to jump to a different part of the program is one good example. jmp always takes one immediate value operand, which is the virtual address to which the program should jump relative to the current instruction’s address. jmp is encoded with an opcode byte 0xe9 then a 32-bit little-endian number representing the offset to jump to. The branching instructions, like call, jne (jump if not equal), and jg (jump if greater than) are all laid out similarly.

Simple arithmetic instructions like inc (increment register) and neg (negate register) are also straightforward, but their operands are register values, not immediates. So they’re represented by an opcode byte, followed by an “R/M byte” encoding the operand register and some extra information about the instruction. For example, neg eax is encoded as byte 0xf7, which corresponds to either neg or not, followed by the R/M byte 0xd8, which does double-duty to encode eax and distinguish neg from not. By comparison, not eax is 0xf7d0.

There are also the arithmetic instructions with multiple operands like add and and and mov. These instructions are much more complex to encode not only because they take two arguments, but also because either of those two arguments could be any of (1) a register, (2) an immediate value encoded into the instruction, or (3) a pointer to a memory location. The same add instruction could be used in any of

add eax ebx     ; register <- register
add eax 0x1     ; register <- immediate
add eax [ebx]   ; register <- memory
add [ebx 2 ecx] eax
                ; memory <- register
add [ebx 4 ecx 0x10] 0x10
                ; memory <- immediate

Our instruction encoder needs to deal with all these cases for all the arithmetic instructions.

For me, this part took the longest to get right. I ended up writing special-case branches to handle each of a few different cases. Here’s the code that handles the add instruction, for example:

['add', _, _] -> map(args, type) :: {
    ['string', 'composite'] -> operands := encodeMem(args.0, args.1) :: {
        () -> ()
        _ -> transform('03') + operands
    }
    ['composite', 'string'] -> operands := encodeMem(args.1, args.0) :: {
        () -> ()
        _ -> transform('01') + operands
    }
    ['string', 'string'] -> transform('01') + encodeRM(args.1, args.0)
    ['string', 'number'] -> transform('81') + encodeRM(0, args.0) + toBytes(args.1, 4)
    _ -> failWith(f('Unsupported instruction: {{0}}', [instString(inst)]))
}

The first case handles memory going to a register, the second case handles a register going to memory, the third handles register-to-register, and so on. In each case, the operands are encoded a little differently, and combined with a specific opcode variant for add.

Lastly, there are a few special case instructions, like mov, which has its own rules, and push and pop that manipulate the program stack, which have their own rules too. These were pretty straightforward though.

Operand size prefixes

The last step in our instruction-encoding process is to handle instructions whose arguments are not 32-bit. 64-bit x86 instructions in August can take operands that are 2, 4, or 8 bytes in size, and memory operands that are 4 or 8 bytes in size. The full x86 instruction set supports an even greater combination of sizes, but I chose to implement these as a starting point.

These specializations to the instructions are implemented as instruction prefixes – a special byte that comes before the opcode that signals a non-default operand size (with the default being 4 bytes). For example, a mov eax ebx (32-bit) is 0x89d8, but mov rax rbx (the 64-bit variant) is 0x4889d8 with the 0x48 prefix for the 64-bit versions of the registers. To deal with these, August simply keeps track of the specified register and memory access sizes at parse time, and adds the right prefixes at the end of each instruction’s encoding step.

Memory accesses in x86

x86, being a complex instruction set, has many memory addressing modes. These are represented in my little made-up assembly notation inside brackets, like [eax] or [rbx 2 rax rcx]. In general, x86 instructions refer to memory locations by referencing up to three register values:

For example, a particular instruction may reference the memory address eax + 2 * ebx + ecx, with eax as the base.

This complexity exists so that memory access into large contiguous arrays of data are fast. For example, say we have an array of objects each taking up 8 bytes in memory. We want to get the fifth element, and specifically the value that sits on the 4th byte of the fifth element.

Normally, we’d have to do a lot of arithmetic to figure out exactly what memory address we want to access. But with a complex addressing mode, we can let the CPU do that calculation for us, by encoding the instruction so that

Since we most likely have these individual values in these registers, even this kind of a complex memory access can be optimized by the processor. Some optimizing compilers also (ab?)use complex addressing modes like this to speed up addition and multiplication by constants.

Any given memory access may come without a displacement, without an index register, or without both. And for each case, the encoding we need is slightly different. That’s mildly annoying, but these are well-documented rules, and August simply implements these rules for these addressing modes in the encodeMem function.

Encoding data and labels into program text

Encoding source code into machine instructions isn’t just a simple matter of mapping each instruction to its encoding rule. Take this program that prints out Hello World or Goodbye to standard output, depending on a variable:

_start:
    ; if 0, says hello world
    ; if 1, says goodbye
    mov eax 0
    cmp eax 0

    mov eax 0x1        ; write syscall
    mov edi 0x1        ; stdout
    jne goodbye

hello:
    mov esi msg_a
    mov edx len_a    ; length
    syscall
    jmp exit

goodbye:
    mov esi msg_b
    mov edx len_b    ; length
    syscall
    jmp exit

exit:
    mov eax 60
    mov edi 0
    syscall

section .rodata
msg_a:
    db "Hello, World!" 0xa
len_a:
    eq 14
msg_b:
    db "Goodbye!" 10
len_b:
    eq 9

Some of the operands to instructions here are pointers to data, like msg_a and len_b. Other operands are labels, like hello and exit. The assembler needs to replace references to these memory addresses with real addresses at time of assembly.

This isn’t trivial, because the encoding of instructions at the beginning of the program usually influence the location of the labels at the end of the program. To resolve our label references correctly, we need two passes over the assembly program – once to encode all the instructions and figure out where the labels are, and a second time to replace the references to labels and data with their real numerical values.

In August’s assembler, the end of the assembly step collects a list of symbols to be replaced in the first-pass program text in a list called relocations, and a second pass replaces slots in the first-pass output with correct locations gathered from the first-pass output in a second pass over the code.

Finally, the generated program text, the symbols from the source code, and any read-only data included in the .rodata section are all gathered up and returned by the assembler to be passed to the ELF file generator function from Part I of this blog. And our assembler is complete!

Disassemblers and symbol tables

There’s one part of the assembler we haven’t looked at in detail – the symbol table. The symbol table is a non-program section in the ELF file (meaning it isn’t loaded into memory when the program runs) that tells other tools like disassemblers and debuggers where labels in the source code are in the final program text.

In our readelf output from Part I, the symbol table looked like this:

Symbol table '.symtab' contains 3 entries:
   Num:    Value          Size Type    Bind   Vis      Ndx Name
     0: 0000000000000000     0 NOTYPE  LOCAL  DEFAULT  UND
     1: 0000000000401000     0 FUNC    LOCAL  DEFAULT    1 _start
     2: 0000000000401016     0 FUNC    LOCAL  DEFAULT    1 exit

… for the original assembly:

_start:
    mov eax 0x1     ; write syscall
    mov edi 0x1     ; stdout
    mov esi msg     ; string to print
    mov edx len     ; length
    syscall

exit:
    mov eax 60      ; exit syscall
    mov edi 0       ; exit code
    syscall

    ; ...

The symbol table has an entry for every symbol in the source assembly, and some metadata about each. The most important piece of data for us is the Value column, which denotes the virtual address of the instruction that the symbol points to in the running program. In this case, our program text starts at virtual address 0x401000, so our first symbol _start is at that address. exit, which appears 0x16 bytes later in the instruction stream, has virtual address 0x401016. (Sometimes, for other kinds of symbols, “value” refers to something other than an address, but we won’t dig into that detail today.)

What’s the use of a symbol table like this?

Lots of uses, it turns out. But for me, it was most useful when looking at disassembly of August’s own output. As I started writing longer and longer assembly programs to test the assembler, I started using objdump to see a disassembly of the executable and check that it matched my input. An objdump of the Hello World program without symbols reads like

$ objdump -d ./hello-world

./hello-world:     file format elf64-x86-64

Disassembly of section .text:

0000000000401000 <.text>:
  401000:	b8 01 00 00 00       	mov    $0x1,%eax
  401005:	bf 01 00 00 00       	mov    $0x1,%edi
  40100a:	be 00 50 6b 00       	mov    $0x6b5000,%esi
  40100f:	ba 0e 00 00 00       	mov    $0xe,%edx
  401014:	0f 05                	syscall
  401016:	b8 3c 00 00 00       	mov    $0x3c,%eax
  40101b:	bf 00 00 00 00       	mov    $0x0,%edi
  401020:	0f 05                	syscall
	...

Even though we had labels that held meaning in our source code, the disassembly can’t find them within the instruction stream because that data isn’t saved into our machine code. With our symbol table, though, objdump can map symbols back to their location in the disassembly:

$ objdump -d ./hello-world

./hello-world:     file format elf64-x86-64

Disassembly of section .text:

0000000000401000 <_start>:
  401000:       b8 01 00 00 00          mov    eax,0x1
  401005:       bf 01 00 00 00          mov    edi,0x1
  40100a:       be 00 50 6b 00          mov    esi,0x6b5000
  40100f:       ba 0e 00 00 00          mov    edx,0xe
  401014:       0f 05                   syscall

0000000000401016 <exit>:
  401016:       b8 3c 00 00 00          mov    eax,0x3c
  40101b:       bf 00 00 00 00          mov    edi,0x0
  401020:       0f 05                   syscall
    ...

Armed with these tools now, we can start writing real assembly programs and build and run them with August!

Assembler at last, and future work

With the ELF format library from Part I and x86 assembler from this post, we have the main components of a full assembler that can take x86 assembly source files and generate executable binaries.

The only work left to do is to write some plumbing code in a CLI script that connects the two halves together. In August, this is done in src/cli.ink. We read a source file, feed it to the assembler, and then feed its output to the ELF library to generate the file, which we save and mark as executable on disk.

This is where August is today: an assembler for single-file 64-bit x86 programs that generates 64-bit ELF binaries. Real-world assemblers do far more in practice, and there are a few areas where I’d like to experiment and expand August’s features.

Specifically, to make August useful to use in real programs, it first needs to be able to generate object files or executables that can be linked against other libraries like the system libc. This should allow programs assembled with August to, for example, call printf with arguments rather than issue the system call directly. Most modern operating system besides Linux actually require programs running on the platform to dynamically link against the system-provided libc library to make system calls, for reasons of security, performance, and forward-compatibility. So to assemble programs to run on non-Linux platforms, August needs to be able to output dynamically linked executables. This seems like mostly a matter of providing some extra metadata in the binary about the symbols to link and the libraries to link against, but I haven’t had a chance to look closely yet.

Many assemblers are also multi-architecture or multi-platform. While August currently only supports 64-bit ELF on x86, I think it would be a good challenge to try to expand the tool to support other architectures like ARM and RISC-V or other platforms that don’t use ELF as the executable format, like macOS with its Mach-O format.

Even still, in this prototype form, August was one of my most interesting technical projects in the last year because it pushed me to dig into areas of computing I wasn’t familiar with before. And it runs on Ink!

For me, building toy prototypes of components at different layers of the stack seems like the best way to keep expanding my understanding of systems up and down the stack, and I hope to keep investigating by building – perhaps the next one will be a browser. Maybe an operating system?


Assembler in Ink, Part I: processes, assembly, and ELF files