Contents
This is the sequel to Assembler in Ink, Part I where I started documenting my process of building an x86 assembler in Ink.
In Part I, we looked at the high-level jobs of an assembler, and how August constructs an ELF file that an operating system (specifically Linux here) could understand and turn into a process. In this second part, we’ll dive into how assembly source code is transformed into machine code and data that we can place into the ELF file. We’ll also look at adding a symbol table into the generated binary, so a disassembly will show us functions and labels we’ve defined in the source code.
We’ll start right where we left off. In Part I, we built a program that accepted a fully encoded chunk of machine code and some data and generated an ELF binary. This time, we need to transform assembly source code into that encoded machine code, so we can give it to the ELF library.
Assembling x86 instructions
x86, especially 64-bit x86 that we’re dealing with here, is a complex instruction set encumbered with lots of historical baggage and sprawling feature sets like vector instructions and addressing modes. But for building August, we only need to support a small subset we can use to validate our ideas and write some small programs. Namely, I elected to support:
- The basic arithmetic instructions like
add
,sub
,xor
and so on - Basic read/writes to and from memory
- Function calls
- The
syscall
instruction so we could do basic I/O like printing Hello World - Branch and jump instructions so we could write basic loops
How machine instructions are encoded into bits differs pretty dramatically between architectures. Intel sticks to the style of having an opcode with optional prefixes, suffixes, and operands, while something like RISC-V sticks to a single-size instruction but with a much more complex encoding scheme that interleaves data with the instruction opcodes.
In the case of our basic x86 arithmetic instructions, I used a few different reference resources available to understand what types of instruction layouts we need to know about. Specifically, these came in handy:
- The x86asm.net ISA reference, which is comprehensive enough for a toy assembler and easy to navigate once you get used to the compact notation
- Encoding x86 Instructions, which was a helpful guide to understanding how x86 and x64 instructions are laid out into bytes
- The x64 cheat sheet for a handy list of the core x86/x64 instruction set
Of those, the first link – the ISA reference – was most helpful, but there’s a bit of a learning curve to understanding exactly what all the notations mean. I won’t bore you with the details of that notation here, and instead tell you what I learned.
First, there are what I’d call the simple instructions, the ones that are trivial because they’re just one value. For example, the syscall
instruction for invoking a system call is just two bytes, 0x0f05
. nop
(the no-op instruction) is another such instruction, but it’s just one byte, 0x90
.
Then, there are instructions that always take the same type of operand. The jmp
instruction to jump to a different part of the program is one good example. jmp
always takes one immediate value operand, which is the virtual address to which the program should jump relative to the current instruction’s address. jmp
is encoded with an opcode byte 0xe9
then a 32-bit little-endian number representing the offset to jump to. The branching instructions, like call
, jne
(jump if not equal), and jg
(jump if greater than) are all laid out similarly.
Simple arithmetic instructions like inc
(increment register) and neg
(negate register) are also straightforward, but their operands are register values, not immediates. So they’re represented by an opcode byte, followed by an “R/M byte” encoding the operand register and some extra information about the instruction. For example, neg eax
is encoded as byte 0xf7
, which corresponds to either neg
or not
, followed by the R/M byte 0xd8
, which does double-duty to encode eax
and distinguish neg
from not
. By comparison, not eax
is 0xf7d0
.
There are also the arithmetic instructions with multiple operands like add
and and
and mov
. These instructions are much more complex to encode not only because they take two arguments, but also because either of those two arguments could be any of (1) a register, (2) an immediate value encoded into the instruction, or (3) a pointer to a memory location. The same add
instruction could be used in any of
add eax ebx ; register <- register
add eax 0x1 ; register <- immediate
add eax [ebx] ; register <- memory
add [ebx 2 ecx] eax
; memory <- register
add [ebx 4 ecx 0x10] 0x10
; memory <- immediate
Our instruction encoder needs to deal with all these cases for all the arithmetic instructions.
For me, this part took the longest to get right. I ended up writing special-case branches to handle each of a few different cases. Here’s the code that handles the add
instruction, for example:
['add', _, _] -> map(args, type) :: {
['string', 'composite'] -> operands := encodeMem(args.0, args.1) :: {
() -> ()
_ -> transform('03') + operands
}
['composite', 'string'] -> operands := encodeMem(args.1, args.0) :: {
() -> ()
_ -> transform('01') + operands
}
['string', 'string'] -> transform('01') + encodeRM(args.1, args.0)
['string', 'number'] -> transform('81') + encodeRM(0, args.0) + toBytes(args.1, 4)
_ -> failWith(f('Unsupported instruction: {{0}}', [instString(inst)]))
}
The first case handles memory going to a register, the second case handles a register going to memory, the third handles register-to-register, and so on. In each case, the operands are encoded a little differently, and combined with a specific opcode variant for add
.
Lastly, there are a few special case instructions, like mov
, which has its own rules, and push
and pop
that manipulate the program stack, which have their own rules too. These were pretty straightforward though.
Operand size prefixes
The last step in our instruction-encoding process is to handle instructions whose arguments are not 32-bit. 64-bit x86 instructions in August can take operands that are 2, 4, or 8 bytes in size, and memory operands that are 4 or 8 bytes in size. The full x86 instruction set supports an even greater combination of sizes, but I chose to implement these as a starting point.
These specializations to the instructions are implemented as instruction prefixes – a special byte that comes before the opcode that signals a non-default operand size (with the default being 4 bytes). For example, a mov eax ebx
(32-bit) is 0x89d8
, but mov rax rbx
(the 64-bit variant) is 0x4889d8
with the 0x48
prefix for the 64-bit versions of the registers. To deal with these, August simply keeps track of the specified register and memory access sizes at parse time, and adds the right prefixes at the end of each instruction’s encoding step.
Memory accesses in x86
x86, being a complex instruction set, has many memory addressing modes. These are represented in my little made-up assembly notation inside brackets, like [eax]
or [rbx 2 rax rcx]
. In general, x86 instructions refer to memory locations by referencing up to three register values:
- A base pointer, which is always required
- An index register, which can be multiplied by a scale of 1, 2, 4, or 8 and is added to the base
- A displacement, which is added to the total of the base and index + scale
For example, a particular instruction may reference the memory address eax + 2 * ebx + ecx
, with eax
as the base.
This complexity exists so that memory access into large contiguous arrays of data are fast. For example, say we have an array of objects each taking up 8 bytes in memory. We want to get the fifth element, and specifically the value that sits on the 4th byte of the fifth element.
Normally, we’d have to do a lot of arithmetic to figure out exactly what memory address we want to access. But with a complex addressing mode, we can let the CPU do that calculation for us, by encoding the instruction so that
- The base points to the start of the array
- The index is set to 5, with a scale of 8 (for 8 bytes-per-item)
- The displacement is 4, since we want the value in the middle of the 8-byte value
Since we most likely have these individual values in these registers, even this kind of a complex memory access can be optimized by the processor. Some optimizing compilers also (ab?)use complex addressing modes like this to speed up addition and multiplication by constants.
Any given memory access may come without a displacement, without an index register, or without both. And for each case, the encoding we need is slightly different. That’s mildly annoying, but these are well-documented rules, and August simply implements these rules for these addressing modes in the encodeMem
function.
Encoding data and labels into program text
Encoding source code into machine instructions isn’t just a simple matter of mapping each instruction to its encoding rule. Take this program that prints out Hello World or Goodbye to standard output, depending on a variable:
_start:
; if 0, says hello world
; if 1, says goodbye
mov eax 0
cmp eax 0
mov eax 0x1 ; write syscall
mov edi 0x1 ; stdout
jne goodbye
hello:
mov esi msg_a
mov edx len_a ; length
syscall
jmp exit
goodbye:
mov esi msg_b
mov edx len_b ; length
syscall
jmp exit
exit:
mov eax 60
mov edi 0
syscall
section .rodata
msg_a:
db "Hello, World!" 0xa
len_a:
eq 14
msg_b:
db "Goodbye!" 10
len_b:
eq 9
Some of the operands to instructions here are pointers to data, like msg_a
and len_b
. Other operands are labels, like hello
and exit
. The assembler needs to replace references to these memory addresses with real addresses at time of assembly.
This isn’t trivial, because the encoding of instructions at the beginning of the program usually influence the location of the labels at the end of the program. To resolve our label references correctly, we need two passes over the assembly program – once to encode all the instructions and figure out where the labels are, and a second time to replace the references to labels and data with their real numerical values.
In August’s assembler, the end of the assembly step collects a list of symbols to be replaced in the first-pass program text in a list called relocations
, and a second pass replaces slots in the first-pass output with correct locations gathered from the first-pass output in a second pass over the code.
Finally, the generated program text, the symbols from the source code, and any read-only data included in the .rodata
section are all gathered up and returned by the assembler to be passed to the ELF file generator function from Part I of this blog. And our assembler is complete!
Disassemblers and symbol tables
There’s one part of the assembler we haven’t looked at in detail – the symbol table. The symbol table is a non-program section in the ELF file (meaning it isn’t loaded into memory when the program runs) that tells other tools like disassemblers and debuggers where labels in the source code are in the final program text.
In our readelf
output from Part I, the symbol table looked like this:
Symbol table '.symtab' contains 3 entries:
Num: Value Size Type Bind Vis Ndx Name
0: 0000000000000000 0 NOTYPE LOCAL DEFAULT UND
1: 0000000000401000 0 FUNC LOCAL DEFAULT 1 _start
2: 0000000000401016 0 FUNC LOCAL DEFAULT 1 exit
… for the original assembly:
_start:
mov eax 0x1 ; write syscall
mov edi 0x1 ; stdout
mov esi msg ; string to print
mov edx len ; length
syscall
exit:
mov eax 60 ; exit syscall
mov edi 0 ; exit code
syscall
; ...
The symbol table has an entry for every symbol in the source assembly, and some metadata about each. The most important piece of data for us is the Value
column, which denotes the virtual address of the instruction that the symbol points to in the running program. In this case, our program text starts at virtual address 0x401000
, so our first symbol _start
is at that address. exit
, which appears 0x16
bytes later in the instruction stream, has virtual address 0x401016
. (Sometimes, for other kinds of symbols, “value” refers to something other than an address, but we won’t dig into that detail today.)
What’s the use of a symbol table like this?
Lots of uses, it turns out. But for me, it was most useful when looking at disassembly of August’s own output. As I started writing longer and longer assembly programs to test the assembler, I started using objdump
to see a disassembly of the executable and check that it matched my input. An objdump
of the Hello World program without symbols reads like
$ objdump -d ./hello-world
./hello-world: file format elf64-x86-64
Disassembly of section .text:
0000000000401000 <.text>:
401000: b8 01 00 00 00 mov $0x1,%eax
401005: bf 01 00 00 00 mov $0x1,%edi
40100a: be 00 50 6b 00 mov $0x6b5000,%esi
40100f: ba 0e 00 00 00 mov $0xe,%edx
401014: 0f 05 syscall
401016: b8 3c 00 00 00 mov $0x3c,%eax
40101b: bf 00 00 00 00 mov $0x0,%edi
401020: 0f 05 syscall
...
Even though we had labels that held meaning in our source code, the disassembly can’t find them within the instruction stream because that data isn’t saved into our machine code. With our symbol table, though, objdump
can map symbols back to their location in the disassembly:
$ objdump -d ./hello-world
./hello-world: file format elf64-x86-64
Disassembly of section .text:
0000000000401000 <_start>:
401000: b8 01 00 00 00 mov eax,0x1
401005: bf 01 00 00 00 mov edi,0x1
40100a: be 00 50 6b 00 mov esi,0x6b5000
40100f: ba 0e 00 00 00 mov edx,0xe
401014: 0f 05 syscall
0000000000401016 <exit>:
401016: b8 3c 00 00 00 mov eax,0x3c
40101b: bf 00 00 00 00 mov edi,0x0
401020: 0f 05 syscall
...
Armed with these tools now, we can start writing real assembly programs and build and run them with August!
Assembler at last, and future work
With the ELF format library from Part I and x86 assembler from this post, we have the main components of a full assembler that can take x86 assembly source files and generate executable binaries.
The only work left to do is to write some plumbing code in a CLI script that connects the two halves together. In August, this is done in src/cli.ink. We read a source file, feed it to the assembler, and then feed its output to the ELF library to generate the file, which we save and mark as executable on disk.
This is where August is today: an assembler for single-file 64-bit x86 programs that generates 64-bit ELF binaries. Real-world assemblers do far more in practice, and there are a few areas where I’d like to experiment and expand August’s features.
Specifically, to make August useful to use in real programs, it first needs to be able to generate object files or executables that can be linked against other libraries like the system libc. This should allow programs assembled with August to, for example, call printf
with arguments rather than issue the system call directly. Most modern operating system besides Linux actually require programs running on the platform to dynamically link against the system-provided libc library to make system calls, for reasons of security, performance, and forward-compatibility. So to assemble programs to run on non-Linux platforms, August needs to be able to output dynamically linked executables. This seems like mostly a matter of providing some extra metadata in the binary about the symbols to link and the libraries to link against, but I haven’t had a chance to look closely yet.
Many assemblers are also multi-architecture or multi-platform. While August currently only supports 64-bit ELF on x86, I think it would be a good challenge to try to expand the tool to support other architectures like ARM and RISC-V or other platforms that don’t use ELF as the executable format, like macOS with its Mach-O format.
Even still, in this prototype form, August was one of my most interesting technical projects in the last year because it pushed me to dig into areas of computing I wasn’t familiar with before. And it runs on Ink!
For me, building toy prototypes of components at different layers of the stack seems like the best way to keep expanding my understanding of systems up and down the stack, and I hope to keep investigating by building – perhaps the next one will be a browser. Maybe an operating system?
← Assembler in Ink, Part I: processes, assembly, and ELF files