~remexre/blog

Latest reboot of the blog

Another year-long gap, five years after the last one. Yet again, the cause is “problems with the static site generator made it so I didn’t want to write when I had writing time.”

As before, I only migrated the articles that I’ve had cause to refer people to. The blog is still in the same Git repo (https://git.sr.ht/~remexre/blog), with snapshots of old versions at Git tags:

This time around, I’m authoring in the HTML you see rendered here, without a static site generator in between. Can’t break it if it’s not there, after all.

At the moment, this also means that there’s no way to comment. It’s not like I was getting tons of them before, so I may or may not bring them back.

If you want to comment on a post here, send an email to ~remexre/public-inbox@lists.sr.ht — I’ll make an effort to link them here, and maybe build something to do this if I’m getting more than two or three per year. To view the archives of that list, see here.

Using Nix Flakes and direnv to manage a dev shell for projects that don’t use Nix

Typically, projects that I need to do this for are associated with some organization; e.g. melt for the lab I’m in. This approach relies on having a directory that contains the non-Nix project, and organizations tend to be convenient, so I make this file structure:

~/src/
├── melt/
│   ├── .envrc
│   ├── shell/
│   │   ├── flake.lock
│   │   └── flake.nix
│   ├── shell.nix
│   └── example-project/
│       └── …
└── …

The shell directory contains a flake with a devShells.default that contains all the dependencies I need. Notably, it does not reference example-project in any way. It can be a Git repo if this is to your liking, but this isn’t necessary.

Run nix shell in this directory at least once, to create or update flake.lock — one of the disadvantages of this approach is that flake.lock is not automatically updated.

The shell.nix is used to let us import the flake and get its dev shell in the non-flake Nix world. The version of it I currently use is:

let
  inherit (builtins) currentSystem getFlake;
  flake = getFlake (toString ./shell);
in
flake.outputs.devShells.${currentSystem}.default

Finally, the .envrc just says to load the dev shell from the shell.nix:

use nix

Note that direnv won’t be monitoring flake.nix for changes, so you may need to run direnv reload after updating it.


Discussion

Pinebook Pro OSDev: Hello World

I recently got a Pinebook Pro, and I want to port StahlOS to it. The journey of a thousand miles begins with a single step, so here’s a journal entry on getting a “Hello, world” program running on it.

The Hardware

This is written about (and largely on) the ANSI model of the Pinebook Pro. Additionally, I use a cable to access the serial port, which wires it up to a 3.5mm headphone jack physical connector. Pine sells a nicely packaged cable, but as people on the forums note (and I’ve verified), it runs at 5 volts, which causes spooky behavior (up to and including hardware damage) on the Pinebook. Instead, I’m using some jumpers spliced to a headphone cord I cut in half to provide the physical connector to the board.

Initially, I tried Adafruit’s USB to TTL Serial Cable, since I already had it sitting around. However, it turns out it’s based on the CP2102 chip, which only supports speeds up to 1Mbps (1000000 baud). The RK3399 (the board inside the Pinebook Pro), however, boots at 1.5Mbps (1500000 baud). Instead, I bought a converter based on the PL2302DX, which can do up to 12Mbps.

Out of the Box

The Pinebook Pro ships with an ancient (oldstable?) version of Debian. I ditched it for Manjaro running off an SD card. From the default install, I systemctl disabled lightdm and installed i3 instead. I’m doing most of my usage (including typing this!) from the kernel console in tmux, though.

A hardware switch needs to be flipped inside the device to enable UART2, which provides a serial port over the headphone jack. The Pine wiki documents the location of the switch fairly well; check it for pictures.

U-Boot

Once the serial port is connected, rebooting the machine shows the boot logs. Mashing Ctrl-C (or any key on the Manjaro U-Boot, it appears) gets a shell, with vaguely sh-like semantics.

I recommend updating to Manjaro’s U-Boot; the U-Boot the Pinebook Pro ships with (as of the January 2020 batch) can’t boot from 64-bit ELF files. If your machine boots with an amber power LED instead of a green one, that’s a strong indicator you’re on the newer Manjaro U-Boot.

The commands in the shell I find most useful are:

Toolchain

You’ll need at least binutils for the aarch64-none-elf target. On Gentoo, this is fairly easy with crossdev. It’ll also probably be useful to have your system binutils be built multitarget; this doesn’t apply to gas, though, so the aarch64-none-elf versions are still necessary.

Writing to the UART

The RK3399’s Technical Reference Manual is your friend for all of this; it notes that UART2 is mapped to 0xff1a0000. There’s also some information on how to interface with the chip; if you’re familar with programming the 8250 or 16550 UARTs, I believe it’s effectively the latter. (Note that unlike how x86 serial ports are typically connected, the UARTs in the Pinebook Pro are all memory-mapped.)

We can write to the UART with an assembly sequence like:

ldr x0, =0xff1a0000 /* Load the x0 register with 0xff1a0000 */
mov x1, '!'         /* Load the x1 register with '!' (zero-extended) */
strb w1, [x0]       /* Store the value in x1 to the address given by x0 */

This stores the ! character in the Transmit Holding Register of the UART. Technically, we need to wait for the Transmit Holding Register Empty Bit of the Line Status Register to be 1. We do this with:

	ldr x0, =0xff1a0000
wait_for_tx_ok:
	ldrb w1, [x0, #0x14]      /* Offset the address in x0 by 0x14 */
	tbz w1, 5, wait_for_tx_ok /* Loop if bit 5 of x1 is zero */

Of course, a real OS should use the FIFOs, be interrupt-triggered, and maybe even use DMA. That’s outside the scope of this article, but I’ll probably touch on it in a future post.

Putting it All Together

We can use the above with a bit of glue code to make our “Hello, world” program:

.section .text

.global _start
_start:
	ldr x0, =0xff1a0000
	ldr x3, =msg
	mov x4, len
	bl write_string     /* Call the write_string procedure */
	b .                 /* Infinite loop */

/** write_string: Writes a string to a UART
 *
 * Input:
 *   x0: UART base address
 *   x3: Address of first character of string
 *   x4: Length of string
 *
 * Side Effects:
 * - Trashes x1, x2, x5
 */
write_string:
	cbz x4, write_string.end /* If x4 is zero, go to write_string.end */

write_string.wait_for_tx_ok:
	ldrb w1, [x0, #0x14]
	tbz w1, 5, write_string.wait_for_tx_ok

	ldrb w2, [x3], #1 /* After fetching a byte to w2, increment x3 */
	sub x4, x4, 1     /* Decrement x4 */
	strb w2, [x0]

	b write_string
write_string.end:
	ret

.section .rodata

msg: .string "Hello, world!\r\n"
.equ len, . - msg

We also need a linker script for this:

OUTPUT_FORMAT(elf64-littleaarch64)
ENTRY(_start)

MEMORY {
	kernel : ORIGIN = 0x00280000, LENGTH = 0x00080000
}

SECTIONS {
	.text : {
		. += SIZEOF_HEADERS;
		*(.text)
	} > kernel
	.rodata : { *(.rodata) } > kernel
}

We compile and link with:

aarch64-none-elf-as -o main.o main.s
aarch64-none-elf-ld -o main.elf -T linker.ld main.o -N -z max-page-size=4096

Aside: Tricks for a Smaller Executable

Thanks to clever and doug16k in the #osdev channel on Freenode (now on libera.chat) for showing me a couple of tricks to reduce the size of the ELF file:

This brings the binary size down from 66k to 1.3k.

Hello, world!

Finally, we’re ready to run our program. Connect the Pinebook Pro to your serial port, connect Minicom to the serial port, and boot it. Hit a key to drop to the U-Boot shell, then run loady 0x00880000 to start the upload. Hit Ctrl-A, S to open Minicom’s “Send files” menu. Once the file is uploaded, run bootelf 0x00880000. If all’s gone well, you should see Hello, world! printed, followed by the machine hanging.

The StahlDream

The StahlDream is the shorthand I use for constructing a personal computing system, from the hardware up. This encompasses several programming languages, an operating system, at least two databases. (And that’s before I get to any actual applications.)

I realized I don’t actually have all this written down in one place anywhere, so this serves as a snapshot into the current vision.

This is certainly inspired by Oberon, which showed one could create a single-machine system in a few thousand lines of code (12,227 by one count). I don’t think I can fit a system in so few lines of code, but I also want to have a much more complicated language in use for most of the system.

StahlOS

The part I’m currently working on is an operating system on which the rest of the system runs. The system is quite minimal — there’s neither MMU-based process isolation, nor pre-emptive multitasking (though the latter point may change).

The StahlOS model fundamentally relies on all actual machine code executed by the CPU to be trustworthy, and disallows loading binaries other than the kernel itself. Instead, system drivers and essential processes are written in StahlOS Forth, which user programs are compiled to at runtime. Even the Forth programs themselves are only loadable from a read-only filesystem (excepting those that are compiled into the kernel), since Forth is low-level enough that it might as well be machine code, security-wise.

Each of the compiler processes only produce memory-safe code; assuming this property and the correctness of the above TCB, all code the system can run is memory-safe. This allows wholly ignoring the MMU as a security mechanism. Instead, memory ownership is implicit, and controlled at the page level. Each process (with exceptions such as a low-level debugging REPL) is shared-nothing, with message-passing as a fundamental operation, implemented as the transfer of page ownership. (I haven’t ruled out eventually allowing shared pages with STM or similar for concurrent modification, but I’m leery of shared pointers.)

StahlOS will also provide Erlang-like mechanisms for orchestrating processes (i.e. monitors and links). However, cross-machine message-passing will not be directly supported (and for this reason, message-passing should only really be considered a intra-app mechanism, despite being inter-process). Instead, applications should generally use the tuple space as a synchronization point: it’s not significantly more expensive than message passing for local communications, but allows remote communications.

Languages

Stahl

The primary language being designed currently, and the most complicated one by far, is Stahl. Stahl is a dependently typed lambda calculus with a Lisp-like syntax.

I’m probably basing the core type theory on the type theory presented in Homotopy Type Theory, but without the univalence axiom (at least, until I can figure out how to make it computable).

I also want to make the language the testbed for experimenting with automated theorem proving and making manual theorem proving convenient in a “casual” setting (e.g. from a smartphone while on a bus).

StahlOS Forth

StahlOS uses a Forth dialect as the low-level programming language. Forth is the best language I’ve found for bare-metal development. A Forth system can be constructed with amazingly little machine code; the resulting language is capable of Common Lisp-tier metaprogramming, while also being able to peek and poke at memory, without needing dynamic memory allocation.

Databases

Tuple Space

Tuple spaces are a sufficiently old, and sufficiently nice-seeming database abstraction that I’m honestly surprised there isn’t a high-quality implementation some programming subculture is smugly using (in same way similar subcultures exist for e.g. Smalltalk, Erlang, Common Lisp).

Essentially, a tuple space is a distributed multiset with five primitive operations:

With a sufficiently expressive pattern language, it becomes easy to have applications sharing a database with loose coupling between them.

I need more design work to determine many of the details of this tuple space (as well as a name for it!) — particularly, I’m unsure of how precisely I want to make the database distributed. Given that I’m using it as a coordination mechanism as well as a (short-term) database, it’s not clear what semantics I actually want on netsplit. Furthermore, it seems like there ought to be a large class of optimizations I could apply to make common patterns of use more efficient, though these might require real-world usage data to evaluate.

G1

I’m already writing about G1 elsewhere, but I’ll summarize how it fits into the larger StahlDream.

The tuple space doesn’t seem particularly good as a database for bulk storage — I’m planning to implement it with the expectation that it will contain at most a few megabytes of data at once. I therefore want a flexible database for storing and querying larger data.

I’m planning to implement the tuple space on top of both StahlOS and some Linux system, likely with the Linux implementation being in Rust. The thinking here is somewhat similar to Erlang’s port drivers, which allow interfacing with a native-code process as if it were an Erlang process. G1 can then be easily bridged to StahlOS, by acting on the tuple space directly.

Helpful Tools for CSCI2021

radare2

radare2 can replace GDB, and has many more analysis tools.

Installing

Check your repos. It’s in the repos for Arch, Ubuntu, and Homebrew (for you macOS kids).

Example of Usage

Start radare2 with radare2 -d <program>.

Radare2 has very terse commands, unlike GDB. Reading a tutorial is highly recommended; try this one.

However, here’s a cool demo of one of the more useful things. Load your bomblab file with the above commands. Then run the commands:

aaa
VV @ sym.initialize_bomb

You can now use the arrow keys or Vim-style hjkl scrolling to pan around the control-flow graph of your bomb.

radare2 drawing a control-flow graph

Godbolt

Godbolt Compiler Explorer is a useful online tool for reading the assembly output of C code. It highlights the lines different blocks of assembly come from too, which makes reading it much easier.

Protip: Use -O1 in the “Compiler Flags” field — it makes the code a lot more efficient without sacrificing much readability (and sometimes improving it).

Godbolt showing the disassembly of a function

Clang

Clang gives much better error messages than GCC. Just replace gcc in your commands with clang. It’s the default C compiler on macOS, and is installed on the CSELabs machines (and again is probably in your standard repos).

For example, instead of:

gcc -o main main.c

run

clang -o main main.c

Useful flags

Other flags that can check your code include:

valgrind

Valgrind can help find the causes of segmentation faults and memory leaks a lot better than most programmers. Run your program with it to find them.

For example, instead of:

./main

run

valgrind ./main

Installing

Check your repos.

Reading Valgrind’s output

After running valgrind, you might get output like:

==30038== Memcheck, a memory error detector
==30038== Copyright (C) 2002-2017, and GNU GPL'd, by Julian Seward et al.
==30038== Using Valgrind-3.13.0 and LibVEX; rerun with -h for copyright info
==30038== Command: ./a.out
==30038==
==30038== Invalid read of size 1
==30038==    at 0x108611: main (main.c:3)
==30038==  Address 0x0 is not stack'd, malloc'd or (recently) free'd
==30038==
==30038==
==30038== Process terminating with default action of signal 11 (SIGSEGV): dumping core
==30038==  Access not within mapped region at address 0x0
==30038==    at 0x108611: main (main.c:3)
==30038==  If you believe this happened as a result of a stack
==30038==  overflow in your program's main thread (unlikely but
==30038==  possible), you can try to increase the size of the
==30038==  main thread stack using the --main-stacksize= flag.
==30038==  The main thread stack size used in this run was 8388608.
==30038==
==30038== HEAP SUMMARY:
==30038==     in use at exit: 0 bytes in 0 blocks
==30038==   total heap usage: 0 allocs, 0 frees, 0 bytes allocated
==30038==
==30038== All heap blocks were freed -- no leaks are possible
==30038==
==30038== For counts of detected and suppressed errors, rerun with: -v
==30038== ERROR SUMMARY: 1 errors from 1 contexts (suppressed: 0 from 0)
[1]    30038 segmentation fault (core dumped)  valgrind ./a.out

This may look difficult to read, but the important part is the middle section:

==30038== Invalid read of size 1
==30038==    at 0x108611: main (main.c:3)
==30038==  Address 0x0 is not stack'd, malloc'd or (recently) free'd

Let’s break this down line-by-line.

Invalid read of size 1

The error in your program is that it tried to read one byte from memory in a way that was invalid.

The only common one-byte type is a char, so we can be pretty sure that it was that.

at 0x108611: main (main.c:3)

You can ignore 0x108611 — it’s the memory address the code was at. If it’s the only piece of information present, you might’ve tried to call a string as a function or something similar. Otherwise, the other two pieces of information are much more useful.

We know that it’s in the main function, specifically at line 3 of main.c. If a line number isn’t present, compile your program with -g and run it again.

Address 0x0 is not stack'd, malloc'd or (recently) free'd

From this, we know that the memory address we couldn’t read from was 0x0. Since this is NULL, we know that we’re trying to read from a null pointer. not stack'd, malloc'd or (recently) free'd tells us that this pointer is neither a stack nor a heap pointer, which is obviously true for NULL.

NASM

NASM is an assembler that is often preferred to GAS (the assembler taught directly in class). It uses the more intuitive Intel syntax rather than the AT&T syntax used by GAS, and is versatile enough to have your entire attacklab payload be created from a single assembly file, rather than needing to stich together a bunch of printf calls with cat.

Intel vs. AT&T Syntax

C version:

int main(void) {
	int n = 20;
	while(n != 1) {
		if(n % 2 == 0) {
			n = n / 2;
		} else {
			n = 3 * n + 1;
		}
	}
	return n - 1;
}

GAS/AT&T Syntax version:

main:
	movl $20, %eax               # int n = 20;
	jmp .test                    # while(n != 1) {
.loop:
	testl $1, %eax
	jz .if_true                  #   if(n % 2 == 0)
.if_true:                            #   {
    shrl $1, %eax                    #     n = n / 2;
	jmp .test                    #   }
.if_false:                           #   else {
	leal 1(%rax, %rax, 2), %eax  #     n = 3 * n + 1;
.test:                               #   }
	cmpl $1, %eax
	jne .loop                    # }
.end:
	dec %eax                     # return n - 1;
	ret

Intel Syntax version:

main:
	mov eax, 20                ; int n = 20;
	jmp .test                  ; while(n != 1) {
.loop:
	test eax, 1
	jz .if_true                ;   if(n % 2 == 0)
.if_true:                          ;   {
	shr eax, 1                 ;     n = n / 2;
	jmp .test                  ;   }
.if_false:                         ;   else {
	lea eax, [eax + 2*eax + 1] ;     n = 3 * n + 1;
.test:                             ;   }
	cmp eax, 1
	jne .loop                  ; }
.end:
	dec eax                    ; return n - 1;
	ret

As you can see, the Intel syntax version is more C-like (n = 20 becomes mov eax, 20), and has less visual noise (20 is obviously a number, you don’t need to call it $20). This is especially noticeable in the lea instructions corresponding to n = 3 * n + 1:

; Intel
lea eax, [eax + 2*eax + 1]
# AT&T
leal 1(%rax, %rax, 2), %eax

I really have no idea what the person who came up with 1(%rax, %rax, 2) was thinking...

Misc. Tips

Argument Passing Order

The mnemonic to remember is:

From first to last, these are:

So if we have the code:

int foo(int x, unsigned int y, char* z);

int main(void) {
	foo(1, 2, NULL);
	return 0;
}

This will turn into the assembly:

main:
	; MOVing to a register that starts with e
	; will clear the upper half of the r register
	; that it corresponds to.
	mov edi, 1
	mov esi, 2
	xor edx, edx ; Or `mov edx, 0'
	call foo

	; return 0
	xor eax, eax
	ret