A bootstrapping OS with minimal binary seed (mirror)
`asmc`, a bootstrapping OS with minimal binary seed

`asmc` is an extremely minimal operating system whose binary seed
(the compiled machine code that is fed to the CPU when the system
boots) is very small (around 6 KiB). This compiled code is just enough
to contain a minimal IO environment and compiler, which is used to
compile a more powerful environment and compiler, which is in turn
used to compile even more powerful things, until a fully running
system is bootstrapped. In this way nearly all code running on your
computer is compiled during the boot procedure, except the initial
seed, which ideally is kept as small as possible.
This, at least, is the plan; at the moment we are not yet able to compile a real system environment. However, work is ongoing (and you can contribute!).
The name `asmc` indicates two of the most prominent languages used in
this process: Assembly (in which the initial seed is written) and C
(one of the first targets we aim for). The initial plan was to embed
an assembler in the binary seed and then use Assembly to produce a C
compiler. In the end a different path was devised: the initial seed is
written in Assembly and embeds a G compiler (where G is a custom
language, sitting somewhere between Assembly and C, conceived to be
very easy to compile); the G compiler is then used to produce a C
compiler. Assembly is never directly used in this chain, although it
is of course always there behind the scenes.
`asmc` is currently able to:

* boot from a minimal seed of around 6 KiB;
* compile and execute source code written in the G language (see below);
* initialize a minimal environment, including a terminal (writing to the serial port and to the monitor), dynamic memory allocation, simple data structures, disk access and a simple virtual file system;
* compile a basic assembler and a basic C compiler, both of which are written in G;
* use them to compile a lightly patched version of tinycc, using a custom basic C standard library;
* use tinycc to compile another copy of itself (in principle this can be repeated as many times as you want);
* use the second tinycc to compile a patched and customized version of iPXE;
* use iPXE to initialize the network card and be ready to download more source code from the network.
Hopefully `asmc` will eventually be able to:

* download the Linux source code, patch it and compile it via a custom build script;
* compile a minimal userspace environment, which includes at least a copy of tinycc;
* boot Linux with this userspace, and be ready to compile other source code to continue bootstrapping.
You should use Linux to compile `asmc`, although some parts of it can
also be built on macOS. If you use Debian, install the prerequisites
with:
sudo apt-get install build-essential nasm qemu-system-x86 python3 qemu-utils
If you cloned the Git repository, you will probably want to check out the submodules as well:
git submodule init
git submodule update --recursive
Then just call `make` in the toplevel directory of the repository. A
subdirectory named `build` will be created, with all compilation
artifacts inside it. Before running the virtual machine you have to
start the web server that serves code for the later stages of `asmc`:
open another terminal, enter the `http` directory and run
python3 -m http.server 8080
Then:
qemu-system-i386 -m 256M -hda build/boot_asmg.x86.qcow2 -serial stdio -display none
(If your host system supports it, you can add `-enable-kvm` to benefit
from KVM acceleration.)
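The two commands above can also be assembled by a small helper, which makes the optional KVM flag explicit. This is only a sketch: the function names `http_server_command` and `qemu_command` are invented for this illustration and are not part of the repository; the arguments themselves are exactly the ones given above.

```python
import shlex


def http_server_command(port=8080):
    """Command for the web server; it must be run from inside the http directory."""
    return ["python3", "-m", "http.server", str(port)]


def qemu_command(image="build/boot_asmg.x86.qcow2", enable_kvm=False):
    """Command that boots the disk image produced by make under QEMU."""
    cmd = ["qemu-system-i386", "-m", "256M", "-hda", image,
           "-serial", "stdio", "-display", "none"]
    if enable_kvm:
        # Only useful if the host system supports KVM.
        cmd.append("-enable-kvm")
    return cmd


if __name__ == "__main__":
    # Print the commands instead of running them, so they can be inspected first.
    print("cd http && " + shlex.join(http_server_command()))
    print(shlex.join(qemu_command(enable_kvm=False)))
```

To actually launch them, the lists could be passed to `subprocess.Popen` (with `cwd="http"` for the web server) and `subprocess.run` (for QEMU).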
WARNING! ATTENTION! `asmc` can also be run on real hardware. However,
remember that you are giving full control over your hardware to an
experimental program whose author is not an expert operating system
programmer: it could have bugs and overwrite your disk, damage the
hardware or worse. I only run it on an old, otherwise unused laptop
that once belonged to my grandmother. No problem has ever become
apparent, but you can never be too careful!
Well, the first and most important reason was learning. So far I have learnt how to write a basic boot loader, a basic operating system and a few language compilers (for Assembly, G and C). I have learnt to write simple Assembly and I invented the G language, which I found pretty satisfying for the specific domain it was designed for (more on this below).
Other than that, it bothers me that the fine art of programming is currently based on a misconception: that there are two worlds, the "source world" and the "executable world", and that given the source you can build the executable. This is not completely true: to pass from the source to the executable you need another executable (the compiler). And to compile the compiler, you most often need the compiler itself. In the current situation, if all the executable binary code in the world were erased by some magic power and only the source code remained, we would not be able to rebuild the executable code, because we would not have working compilers.
The aim of the Bootstrappable project is to recover from this situation, i.e., to produce a path to rebootstrap the whole executable world from the source world that we already have. Source code is knowledge; executable code is a way to use this knowledge, but it is not knowledge itself. It should be derivable from knowledge without having to depend on anything else.
See the site of the Bootstrappable project for additional practical
and philosophical reasons. The `asmc` project is my personal
contribution to Bootstrappable.
Of course it is not possible to completely remove the dependency on
some executable code for bootstrapping, because at some point you have
to power up your CPU and feed it some executable code (which is called
the "seed"). The aim is to reduce this seed as much as possible, so
that it can be directly inspected. Currently `asmc` is seeded by
around 15 KiB of code (plus, unfortunately, the BIOS and the various
microcode and firmware in the computer, which are not even open in
most cases), which is pretty good. Maybe in the future I'll be able to
shrink it even more (there is some room for optimization). At some
point I would also like to port it to a free architecture, like
RISC-V, but this will require a major rewrite of code generation for
all compilers and assemblers. I am not aware of completely free and
Linux-capable RISC-V implementations, so for the moment I am
concentrating on Intel processors.
Besides the Bootstrappable projects (many are listed in the wiki
page), one great inspiration for `asmc` was TCCBOOT, by Fabrice
Bellard (also the original author of tinycc). TCCBOOT uses tinycc to
build a stripped down version of the Linux kernel at boot time and
then executes it, which is kind of what `asmc` is trying to do, except
that `asmc` is trying to compile the compiler as well.
Ideally the system seed written in Assembly should be as simple and small as possible. Since that is the part that must already be built when the system boots, it should be verifiable by hand, i.e., it should be so simple that a human being can take a printout of the code and of the binary hex dump and check opcode by opcode that the code was translated correctly. This is very tedious, so everything that is not strictly necessary for building later stages should be moved to later stages.
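As an illustration of what such a printout could look like, here is a minimal Python sketch that renders bytes as an offset column followed by hex values. The function name `hex_dump` and the output layout are assumptions made for this example, not a tool shipped with the project; in practice any hex dump utility would do. The sample bytes are the i386 encoding of `mov eax, 1` (opcode `B8` followed by a little-endian 32-bit immediate).

```python
def hex_dump(data, width=16):
    """Return one line per `width` bytes: the offset, then the bytes in hex."""
    lines = []
    for offset in range(0, len(data), width):
        chunk = data[offset:offset + width]
        hex_bytes = " ".join(f"{b:02x}" for b in chunk)
        lines.append(f"{offset:08x}  {hex_bytes}")
    return lines


# Five bytes encoding `mov eax, 1` in 32-bit mode.
sample = bytes([0xB8, 0x01, 0x00, 0x00, 0x00])
for line in hex_dump(sample):
    print(line)  # prints "00000000  b8 01 00 00 00"
```

A reader checking the seed by hand would compare each such line against the corresponding Assembly instruction and its documented opcode.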
All other design criteria are of lesser concern. In particular, efficiency is not a goal: all the first-stage compilers are definitely inefficient, both in their execution time and in the code they generate; however, they are meant to be dropped as soon as a real compiler is built.
The coding style is also very inconsistent, mostly because I am working with languages in which I had very little experience before starting this project (I had never written more than a few lines of Assembly in one go; the G language did not even exist when I started, because I invented it, so I could not possibly have had prior experience). While writing I established my own style, but I never went back to fix code that was already written. So in theory, by looking at the style, you can probably reconstruct the order in which I wrote the code.
My initial idea, when I began working on `asmc`, was to embed an
assembler in the initial seed and then use Assembly to write a C
compiler. At some point I realized that bridging the gap between
Assembly and C with just one jump is very hard: Assembly is very low
level, and C, when you try to write a compiler for it, is much higher
level than one would expect at first. At the same time, Assembly is
harder to compile properly than I initially expected: there are quite
a few syntax variations and opcode encoding can be rather quirky. So
it is not the ideal thing to put in the binary seed.
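One example of such an encoding quirk, chosen here purely for illustration (it is not taken from the project's code), is that the i386 instruction `add eax, 1` has three valid encodings: a short form that only works for EAX (`05`), a general form with a 32-bit immediate (`81 /0`), and a form with a sign-extended 8-bit immediate (`83 /0`). The Python sketch below builds all three; the function names are hypothetical.

```python
import struct

EAX = 0  # register number of EAX in i386 instruction encodings


def add_eax_short_form(imm):
    """ADD EAX, imm32 using the accumulator-only opcode 05."""
    return bytes([0x05]) + struct.pack("<I", imm & 0xFFFFFFFF)


def add_reg_imm32(reg, imm):
    """ADD r32, imm32 using opcode 81 with /0 in the ModRM byte."""
    modrm = 0xC0 | (0 << 3) | reg  # mod=11 (register operand), /0 selects ADD
    return bytes([0x81, modrm]) + struct.pack("<I", imm & 0xFFFFFFFF)


def add_reg_imm8(reg, imm):
    """ADD r32, imm8 using opcode 83 /0; the CPU sign-extends the byte."""
    if not -128 <= imm <= 127:
        raise ValueError("immediate does not fit in a signed byte")
    modrm = 0xC0 | (0 << 3) | reg
    return bytes([0x83, modrm]) + struct.pack("<b", imm)


print(add_eax_short_form(1).hex(" "))  # prints "05 01 00 00 00"
print(add_reg_imm32(EAX, 1).hex(" "))  # prints "81 c0 01 00 00 00"
print(add_reg_imm8(EAX, 1).hex(" "))   # prints "83 c0 01"
```

An assembler has to either pick one of these forms or implement the selection logic, and similar special cases exist for many other instructions.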
Then I set out to invent a language which could offer a somewhat C-like writing experience, but that was as simple as possible to parse and compile (without, of course, attempting any optimization whatsoever). What eventually came out is the G language.
In my experience, once you have ditched the optimization part, the two
difficult things about compiling C are parsing the input (a lot of
operators, different priorities and syntaxes) and handling the type
system (functions, pointers, arrays, type decaying, lvalues, ...). So
I decided that G had to be C without types and without complicated
syntax. G has just one type, which is the equivalent of a 32-bit `int`
in C and is also used for storing pointers, and expressions are
written in reverse Polish notation, so it is basically a stack
machine. Of course G is very tied to the i386 architecture, but it is
not meant to do much else.
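To illustrate why reverse Polish notation over a single 32-bit type is so easy to process, here is a minimal Python sketch of an evaluator for such expressions. It is not G and does not use G's actual syntax; the token set (`+`, `-`, `*` and integer literals) and the name `eval_rpn` are assumptions made for this example only.

```python
MASK = 0xFFFFFFFF  # every value is a single 32-bit word


def eval_rpn(source):
    """Evaluate a whitespace-separated RPN expression with 32-bit wraparound."""
    ops = {
        "+": lambda a, b: a + b,
        "-": lambda a, b: a - b,
        "*": lambda a, b: a * b,
    }
    stack = []
    for token in source.split():
        if token in ops:
            b = stack.pop()  # the right operand was pushed last
            a = stack.pop()
            stack.append(ops[token](a, b) & MASK)
        else:
            # Anything else is an integer literal (decimal or 0x-prefixed hex).
            stack.append(int(token, 0) & MASK)
    if len(stack) != 1:
        raise ValueError("expression must leave exactly one value on the stack")
    return stack[0]


print(eval_rpn("2 3 + 4 *"))       # (2 + 3) * 4, prints 20
print(eval_rpn("0xFFFFFFFF 1 +"))  # wraps around to 0
```

No operator precedence, parentheses or type checks are needed: each token is handled as soon as it is read, which is what makes this style cheap to parse and to compile.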
The syntax of the G language is explained in a dedicated document.
Some older or more technical documentation is maintained in another file.
Most of the original code I wrote is covered by the GNU General Public License, version 3 or later. Code that was imported from other projects, with or without modifications, is covered by their own licenses, which are usually either the GPL again or very liberal licenses. Therefore, I believe that the combined project is again distributable under the terms of the GPL-3+ license.
Individual files' headers detail the licensing conditions for that specific file. Having taken material from many different sources, I tried my best to respect all the necessary conditions. Please contact me if you become aware of some mistake on my side.
Giovanni Mascellani gio@debian.org