A bootstrapping OS with minimal binary seed (mirror)


`asmc`, a bootstrapping OS with minimal binary seed

asmc is an extremely minimal operating system, whose binary seed (the compiled machine code that is fed to the CPU when the system boots) is very small (around 6 KiB). This compiled code is enough to contain a minimal I/O environment and compiler, which is used to compile a more powerful environment and compiler, which is in turn used to compile even more powerful things, until a fully running system is bootstrapped. In this way nearly all the code running on your computer is compiled during the boot procedure, except the initial seed, which ideally is kept as small as possible.

This at least is the plan; for the moment we are not yet at the point where we manage to compile a real system environment. However, work is ongoing (and you can contribute!).

The name asmc indicates two of the most prominent languages used in this process: Assembly (in which the initial seed is written) and C (one of the first targets we aim at). The initial plan was to embed an Assembly compiler in the binary seed and then use Assembly to produce a C compiler. In the end a different path was devised: the initial seed is written in Assembly and embeds a G compiler (where G is a custom language, sitting somewhere between Assembly and C, conceived to be very easy to compile); the G compiler is then used to produce a C compiler. Assembly is never directly used in this chain, although it is of course always there behind the scenes.

asmc is currently able to:

boot from a minimal seed of around 6 KiB;
compile and execute source code written in the G language (see below);
initialize a minimal environment, including a terminal (writing to serial port and to monitor), dynamic memory allocation, simple data structures, disk access and a simple virtual file system;
compile a basic assembler and a basic C compiler, both of which are written in G;
use them to compile a lightly patched version of tinycc, using a custom basic C standard library;
use tinycc to compile another copy of itself (in principle this can be repeated as many times as you want);
use the second tinycc to compile a patched and customized version of iPXE;
use iPXE to initialize the network card and be ready to download more source code from the network.

Hopefully asmc will eventually be able to:

download the Linux source code, patch it and compile via a custom build script;
compile a minimal userspace environment, which includes at least a copy of tinycc;
boot Linux with this userspace, and be ready to compile other source code to continue bootstrapping.

Enough talking, show me the deal!

You should use Linux to compile asmc, although some parts of it can also be built on macOS. If you use Debian, install the prerequisites with

sudo apt-get install build-essential nasm qemu-system-x86 python3 qemu-utils

If you cloned the Git repository, you will probably want to check out the submodules as well:

git submodule init
git submodule update --recursive

Then just call make in the toplevel directory of the repository. A subdirectory named build will be created, with all compilation artifacts inside it. Before running the virtual machine you have to start the web server that serves code for the later stages of asmc: open another terminal, enter the http directory and run

python3 -m http.server 8080

Then:

qemu-system-i386 -m 256M -hda build/boot_asmg.x86.qcow2 -serial stdio -display none

(if your host system supports it, you can add -enable-kvm to benefit from KVM acceleration)
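The two steps above (serving the http directory, then booting the image in QEMU) can be scripted together. The following is only a minimal sketch and is not part of the repository: the function names `qemu_command` and `run_asmc` are hypothetical, and the image path, port and QEMU options are simply the ones given above.

```python
import subprocess
import sys


def qemu_command(image="build/boot_asmg.x86.qcow2", memory="256M", kvm=False):
    """Build the QEMU command line shown above as an argument list."""
    cmd = [
        "qemu-system-i386",
        "-m", memory,
        "-hda", image,
        "-serial", "stdio",
        "-display", "none",
    ]
    if kvm:
        # Only pass this if the host actually supports KVM.
        cmd.append("-enable-kvm")
    return cmd


def run_asmc(http_dir="http", port=8080, kvm=False):
    """Serve http_dir, boot asmc in QEMU, then stop the server (hypothetical helper)."""
    server = subprocess.Popen(
        [sys.executable, "-m", "http.server", str(port)],
        cwd=http_dir,
    )
    try:
        # Blocks until QEMU exits and returns its exit code.
        return subprocess.call(qemu_command(kvm=kvm))
    finally:
        server.terminate()
        server.wait()
```

Calling `run_asmc()` from the toplevel directory of the repository, after `make` has finished, would then replace the two terminals with a single command.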

WARNING! ATTENTION! asmc can also be run on real hardware. However, remember that you are giving full control over your hardware to an experimental program whose author is not an expert operating system programmer: it could have bugs and overwrite your disk, damage the hardware and so on. I only run it on an old, otherwise unused laptop that once belonged to my grandmother. No problem has ever become apparent, but you can never be too careful!

Why all of this?

Well, the first and most important reason was learning. So far I learnt how to write a basic boot loader, a basic operating system and a few language compilers (for Assembly, G and C). I learnt to write simple Assembly and I invented the G language, which I found pretty satisfying for the specific domain it was designed for (more on this below).

Other than that, it bothers me that the fine art of programming is currently based on a misconception: that there are two worlds, the "source world" and the "executable world", and that given the source you can build the executable. This is not completely true: to pass from the source to the executable you need another executable (the compiler). And to compile the compiler, you most often need the compiler itself. In the current situation, if all the executable binary code in the world were erased by some magic power and only the source code remained, we would not be able to rebuild the executable code, because we would not have working compilers.

The aim of the Bootstrappable project is to recover from this situation, i.e., produce a path to rebootstrap all the executable world from the source world that we already have. Source code is knowledge; executable code is a way to use this knowledge, but it is not knowledge itself. It should be derivable from knowledge without having to depend on anything else.

See the site of the Bootstrappable project for additional practical and philosophical reasons. The asmc project is my personal contribution to Bootstrappable.

Of course it is not possible to completely remove the dependency on some executable code for bootstrapping, because at some point you have to power up your CPU and you will have to feed it some executable code (which is called the "seed"). The target is to reduce this seed as much as possible, so that it can be directly inspected. Currently asmc is seeded by around 15 KiB of code (plus, unfortunately, the BIOS and the various microcode and firmware in the computer, which are not even open in most cases), which is pretty good. Maybe in the future I'll be able to shrink it even more (there is some room for optimization). At some point I would also like to convert it to a free architecture, like RISC-V, but this will require a major rewrite of code generation for all compilers and assemblers. I am not aware of completely free and Linux-capable RISC-V implementations, so for the moment I am concentrating on Intel processors.

Besides the Bootstrappable projects (many are listed in the wiki page), one great inspiration for asmc was TCCBOOT, by Fabrice Bellard (who is also the original author of tinycc). TCCBOOT uses tinycc to build a stripped down version of the Linux kernel at boot time and then executes it, which is kind of what asmc is trying to do, except that asmc is trying to compile the compiler as well.

Design considerations

Ideally the system seed written in Assembly should be as simple and small as possible. Since that is the part that must already be built when the system boots, it should be verifiable by hand, i.e., it should be so simple that a human being can take a printout of the code and of the binary hex dump and check opcode by opcode that the code was translated correctly. This is very tedious, so everything that is not strictly necessary for building later stages should be moved to later stages.
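To give an idea of what such a hex dump printout looks like, here is a minimal sketch in Python. It is only an illustration and not a tool shipped with asmc: the function name `hexdump` and the output layout (offset, then bytes in hexadecimal) are assumptions made for this example.

```python
def hexdump(data, width=16):
    """Return one line per `width` bytes: the offset, then the bytes in hex."""
    lines = []
    for offset in range(0, len(data), width):
        chunk = data[offset:offset + width]
        hex_bytes = " ".join(f"{byte:02x}" for byte in chunk)
        lines.append(f"{offset:08x}  {hex_bytes}")
    return "\n".join(lines)
```

For instance, `hexdump(bytes([0xB8, 0x01, 0x00, 0x00, 0x00, 0xC3]))` gives `00000000  b8 01 00 00 00 c3`, which on i386 is the encoding of `mov eax, 1` followed by `ret`. Checking the seed by hand means doing this comparison for every instruction of the Assembly source.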

All other design criteria are of lesser concern: in particular efficiency is not a target (all first-stage compilers are definitely not efficient, both in terms of their execution time and of the generated code; however, ideally they are meant to be dropped as soon as a real compiler is built).

The coding style is also very inhomogeneous, mostly because I am working with languages in which I had very little experience before starting this project (I had never written more than a few lines of Assembly together; the G language did not even exist when I started, because I invented it, so I could not possibly have had prior experience). While writing I established my own style, but never went back to fix already written code. So in theory, by looking at the style you can probably reconstruct the order in which I wrote the code.

The G language

My initial idea, when I began working on asmc, was to embed an assembler in the initial seed and then use Assembly to write a C compiler. At some point I realized that bridging the gap between Assembly and C with just one jump is very hard: Assembly is very low level, and C, when you try to write its compiler, is much higher level than one would expect in the beginning. At the same time, Assembly is harder to compile properly than I initially expected: there are quite a few syntax variations and opcode encoding can be rather quirky. So it is not the ideal thing to put in the binary seed.

Then I set out to invent a language which could offer a somewhat C-like writing experience, but that was as simple as possible to parse and compile (without, of course, attempting to do any optimization whatsoever). What eventually came out is the G language.

In my experience, once you have ditched the optimization part, the two difficult things to do to compile C are parsing the input (a lot of operators, different priorities and syntaxes) and handling the type system (functions, pointers, arrays, type decaying, lvalues, ...). So I decided that G had to be C without types and without complicated syntax. G has just one type, which is the equivalent of a 32-bit int in C and is also used for storing pointers, and expressions are written in reverse Polish notation, so it is basically a stack machine. Of course G is very tied to the i386 architecture, but it is not meant to do much else.
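The reason reverse Polish notation with a single 32-bit type is so easy to process can be shown with a small sketch. The following Python code is not a G compiler and does not use G's actual syntax: the operator tokens and the name `eval_rpn` are assumptions chosen for illustration, and it evaluates the expression directly instead of emitting i386 code. It only shows that no operator priorities and no type rules are needed: each token either pushes a word or pops two words and pushes one.

```python
MASK32 = 0xFFFFFFFF  # every value is a single 32-bit word

# Hypothetical operator set for illustration; not G's actual operators.
BINARY_OPS = {
    "+": lambda a, b: a + b,
    "-": lambda a, b: a - b,
    "*": lambda a, b: a * b,
}


def eval_rpn(source):
    """Evaluate a whitespace-separated reverse Polish expression on a stack."""
    stack = []
    for token in source.split():
        if token in BINARY_OPS:
            # Operands were pushed left to right, so the right one is on top.
            right = stack.pop()
            left = stack.pop()
            stack.append(BINARY_OPS[token](left, right) & MASK32)
        else:
            # Anything else is taken as an integer literal (decimal or 0x...).
            stack.append(int(token, 0) & MASK32)
    if len(stack) != 1:
        raise ValueError("expression must leave exactly one value on the stack")
    return stack[0]
```

With this sketch, `eval_rpn("2 3 + 4 *")` computes (2 + 3) * 4 without any parentheses or precedence table, and `eval_rpn("0 1 -")` wraps around to `0xFFFFFFFF`, as a 32-bit word would.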

The syntax of the G language is explained in a dedicated document.

Other material

asmc was presented at DebConf 2019. Some older or more technical documentation is maintained in another file.

For other topics on bootstrappability, see the Bootstrappable website, the Bootstrapping wiki and the collection of talks and notes curated by OriansJ.

License

Most of the original code I wrote is covered by the GNU General Public License, version 3 or later. Code that was imported from other projects, with or without modifications, is covered by their own licenses, which are usually either the GPL again or very liberal licenses. Therefore, I believe that the combined project is again distributable under the terms of the GPL-3+ license.

Individual files' headers detail the licensing conditions for that specific file. Having taken material from many different sources, I tried my best to respect all the necessary conditions. Please contact me if you become aware of some mistake on my side.

Author

Giovanni Mascellani gio@debian.org
