
Older documentation for asmc

This file contains additional documentation for asmc. Some of the things here might not be completely up-to-date, but they can be useful anyway to figure out what's happening inside asmc.

How to interpret all the writings

For the full story, read below and the code. However, just to have an idea of what is happening, if you use the command above to boot boot_asmg.x86 the following will happen:

  • The first log lines are written by the bootloader. At this point it is mostly concerned with loading the actual kernel into RAM, enabling some obscure features of the PC platform that are needed to boot properly, and entering the CPU's protected mode.

  • At some point the asmc kernel is finally run, and it writes Hello, asmc! to the log. This is where the asmc binary seed first enters execution. It just initializes some data structures, then invokes its embedded G compiler to compile the file main.g and calls the main routine.

  • This is the point where for the first time code that has just been compiled is fed to the CPU, so in a sense the binary seed is not in control of the main program any more (but still gets called as a library, for example to compile other G sources). The message Hello, G! is written to the log and immediately after other G sources are compiled, first to introduce some library code (like malloc/free, code for handling dynamic vectors and maps, some basic disk/filesystem driver and other utilities) and then to compile an assembler and a C compiler. These two compilers are not meant to be complete: they are just enough to build the following step, which is tinycc.

  • Then a suite of C test programs is compiled and executed. They test part of the C compiler and the C standard library. In principle all the tests should pass and every malloc-ed block should be free-d.

  • After all tests have passed, tinycc is finally compiled. This takes a while (around 20 seconds on my machine, with KVM enabled), because the previous C compiler is quite inefficient. During preprocessing, progress is indicated by open and closed square brackets, which mark when a file starts being included and when its inclusion is finished. During compilation (which consists of three stages), progress is indicated by dots, where each dot corresponds to a thousand tokens processed.

  • At last, tinycc is run by means of its libtcc interface. A small test program is compiled and executed, showing a glimpse of the third level of compiled code from the beginning of the asmc run.

  • In the end some statistics are printed, hopefully showing that all allocated memory has been deallocated (not that it matters much, since the machine is going to be powered off anyway, but I like resources to be deinitialized properly).

How to fiddle with flags

asmg's behaviour can be customized thanks to some flags at the beginning of the asmg/main.g file. There you can enable:

  • RUN_ASM: it will compile an assembler and then run it on the file test/test.asm. The assembler is not complete at all, but ideally it should be more or less enough to assemble a lightly patched version of FASM (see next point).

  • RUN_FASM (currently unmaintained): compile the assembler and then use it to assemble FASM, as mentioned above. In theory it should work, but in practice it does not: the assembled FASM crashes for some reason I could not understand. There is definitely a bug in my assembler (or at least some unmet FASM assumption), but I could not find it so far. However, the bulk of the project is not here.

  • RUN_C: it will compile the assembler and the C compiler and then use them to compile the program in diskfs/tests/test.c. In the source code there are flags to dump debugging information, including a dump of the machine code itself. It is useful to debug the C compiler itself. Also, feel free to edit the test program to test your own C programs (but expect frequent breakages!).

  • RUN_MESCC (currently unmaintained; only this port is unmaintained, the original project is going on): it will compile a port of the mescc/M2-Planet toolchain, which is basically an independent C compiler with different features and bugs than mine. This port just tracks the upstream program; no original development is done here. See below for more precise links. The test program in test/test_mes.c will then be compiled and executed.

  • RUN_MCPP (currently unmaintained): it will compile the assembler and the C compiler and then use them to try compiling a lightly patched version of mcpp, which is a complete C preprocessor. Since the preprocessor embedded in asmc's C compiler is rather lacking, the idea is that mcpp could be used instead to compile C sources that require deep preprocessing capabilities. However, at this point, mcpp itself does not compile, so at some point asmc will die with a failed assertion. Also, it nowadays seems that asmc is able to preprocess tinycc by itself, so there is no point anymore in going forward with this subproject.

  • RUN_TINYCC: here is where the juice stays! This will compile the assembler and the C compiler, and then compile tinycc, as mentioned above. Then it will use tinycc to compile and execute a little C program. In the future the bootstrapping chain will continue here.

  • TEST_MAP: there are three implementations of an associative array in asmc, of increasing complexity (see below). This tests the selected implementation, and was used in the past to check new implementations for correctness.

  • TEST_INT64: implementing 64-bit integers on a 32-bit platform is somewhat tricky. The G language itself only supports 32-bit numbers, so some additional Assembly code was required to implement 64-bit operations. The division code is particularly tricky. However, 64-bit integers are required by tinycc, which needs support for long long types, so they were implemented at some point. This enables some tests on the resulting implementation.

  • TEST_C: it will compile the C compiler and run the test suite.

By default the three TEST_* flags and RUN_TINYCC are enabled in asmc.
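To give an idea of why 64-bit arithmetic is tricky when only 32-bit operations are available, here is a minimal sketch in C. It is not asmc's actual code (which, as said above, is written in Assembly): the pair64 type and all the function names are made up for this example, and only unsigned addition, subtraction and shift-and-subtract division are shown.

```c
#include <stdint.h>
#include <stddef.h>

/* Hypothetical 64-bit value stored as two 32-bit halves. */
typedef struct { uint32_t lo, hi; } pair64;

/* Addition: the carry out of the low half is detected by wrap-around. */
static pair64 pair64_add(pair64 a, pair64 b) {
    pair64 r;
    r.lo = a.lo + b.lo;
    r.hi = a.hi + b.hi + (r.lo < a.lo);
    return r;
}

/* Subtraction: a borrow is needed when the low half of a is smaller. */
static pair64 pair64_sub(pair64 a, pair64 b) {
    pair64 r;
    r.lo = a.lo - b.lo;
    r.hi = a.hi - b.hi - (a.lo < b.lo);
    return r;
}

/* Unsigned comparison a >= b, high halves first. */
static int pair64_geq(pair64 a, pair64 b) {
    return a.hi != b.hi ? a.hi > b.hi : a.lo >= b.lo;
}

/* Shift left by one bit, moving the top bit of lo into hi. */
static pair64 pair64_shl1(pair64 a) {
    pair64 r;
    r.hi = (a.hi << 1) | (a.lo >> 31);
    r.lo = a.lo << 1;
    return r;
}

/* Unsigned division, one bit of the dividend at a time.
   The divisor d must be nonzero; rem may be NULL. */
static pair64 pair64_udiv(pair64 n, pair64 d, pair64 *rem) {
    pair64 q = {0, 0}, r = {0, 0};
    for (int i = 63; i >= 0; i--) {
        uint32_t bit = i >= 32 ? (n.hi >> (i - 32)) & 1u : (n.lo >> i) & 1u;
        r = pair64_shl1(r);
        r.lo |= bit;
        q = pair64_shl1(q);
        if (pair64_geq(r, d)) {
            r = pair64_sub(r, d);
            q.lo |= 1u;
        }
    }
    if (rem != NULL) *rem = r;
    return q;
}
```

Signed division would additionally need to fix up the signs of the operands and of the result, which is part of what makes the real thing more delicate than this sketch.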

There is also another pack of flags that control which malloc implementation asmg is going to use. There are four at this point. All of them obtain memory with the platform_allocate call (see below), which is similar to UNIX's brk (and does not allow releasing memory back).

  • USE_TRIVIAL_MALLOC: just map malloc to platform_allocate and discard free. Very quick, but wastes all free-ed memory.

  • USE_SIMPLE_MALLOC: a simple freelist implementation, ported from here, which is probably rather memory efficient, but can be linear in time, so it easily becomes a bottleneck.

  • USE_CHECKED_MALLOC: somewhat similar to USE_TRIVIAL_MALLOC, but checks that your program uses malloc and free correctly (i.e., that you do not overflow or underflow your allocations, and that you do not double free or use after free). As a result it is very slow and memory-inefficient, but if your program runs with it, it most probably allocates and deallocates memory correctly. It is a kind of valgrind checker.

  • USE_KMALLOC: a port (with some modifications, mainly due to the fact that there is no paging in asmc) of kmalloc. Very quick and rather memory-efficient. Basically the best option currently available in asmc (unless you want to debug memory allocation), so it is also the default one.
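As an illustration of the simplest of these strategies, here is a minimal sketch in C of what USE_TRIVIAL_MALLOC amounts to. The real code is written in G and uses the kernel's platform_allocate; the stub below (a bump pointer over a static arena, 4-byte rounding, NULL on exhaustion) and the names trivial_malloc and trivial_free are assumptions made only for this example.

```c
#include <stddef.h>

/* Stand-in for the kernel's platform_allocate (an assumption, not the
   real routine): a bump pointer over a fixed static arena. */
static unsigned char arena[1 << 16];
static size_t arena_used = 0;

static void *platform_allocate(int size) {
    /* Round the request up to a multiple of 4 bytes. */
    size_t n = ((size_t)size + 3u) & ~(size_t)3u;
    if (n > sizeof(arena) - arena_used) return NULL;
    void *p = arena + arena_used;
    arena_used += n;
    return p;
}

/* malloc is just platform_allocate... */
static void *trivial_malloc(int size) {
    return platform_allocate(size);
}

/* ...and free discards its argument, so freed memory is never reused. */
static void trivial_free(void *ptr) {
    (void)ptr;
}
```

With this scheme an allocation costs only a pointer increment, but a program that allocates and frees in a loop keeps consuming fresh memory, which is why the other implementations exist.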

A third pack of flags is for controlling the associative array (map) implementation used by asmc.

  • USE_SIMPLE_MAP: the original map implementation, based on linear arrays, which require a full traversal of the array for basically every operation. Very slow.

  • USE_AVL_MAP: a new implementation based on AVL trees. In the end it was never finished (because at some point I decided to switch to red-black trees), so it implements a binary search tree, but without rebalancing. The tree is not guaranteed to be balanced, but since most of the time data arrive in random order, it usually is. Practical performance is comparable with red-black trees.

  • USE_RB_MAP: the final and default implementation, using properly balanced red-black trees.

In theory all three of them should work, with different performance. In practice, only the red-black tree is routinely used and thus tested.
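The remark about the unfinished AVL map can be made concrete with a small sketch in C of a binary search tree without rebalancing. This is not the code in asmc (which is written in G); the node layout, the integer keys and the function names are assumptions chosen for brevity.

```c
#include <stdlib.h>

/* Hypothetical node layout; integer keys are an assumption. */
typedef struct node {
    int key;
    int value;
    struct node *left, *right;
} node;

/* Insert or update a key and return the (possibly new) root.
   No rebalancing is performed. */
static node *bst_set(node *root, int key, int value) {
    node **slot = &root;
    while (*slot != NULL) {
        if (key == (*slot)->key) {
            (*slot)->value = value;
            return root;
        }
        slot = key < (*slot)->key ? &(*slot)->left : &(*slot)->right;
    }
    *slot = malloc(sizeof(node));
    if (*slot == NULL) abort();
    (*slot)->key = key;
    (*slot)->value = value;
    (*slot)->left = (*slot)->right = NULL;
    return root;
}

/* Return a pointer to the stored value, or NULL if the key is absent. */
static int *bst_get(node *root, int key) {
    while (root != NULL && root->key != key)
        root = key < root->key ? root->left : root->right;
    return root != NULL ? &root->value : NULL;
}

/* Number of nodes on the longest path from the root to a leaf. */
static int bst_height(const node *root) {
    if (root == NULL) return 0;
    int l = bst_height(root->left);
    int r = bst_height(root->right);
    return 1 + (l > r ? l : r);
}
```

Inserting the keys 1 to 15 in ascending order gives a tree of height 15 (in practice a linked list), while inserting them in the order 8, 4, 12, 2, 6, 10, 14, 1, 3, 5, 7, 9, 11, 13, 15 gives height 4. A red-black tree keeps the height logarithmic whatever the insertion order.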

What is inside this repository

  • lib contains a very small kernel, designed to run on an i386 CPU in protected mode (with a flat memory model and without memory paging). The kernel offers an interface for writing to the console and to the serial port, a simple read-only ramdisk and some library routines for later stages.

The kernel can be booted with multiboot, or can simply be loaded at 1 MB and jumped in at its very beginning. The ramdisk must be appended to it in the ar format.

The kernel must be compiled with a payload, into which it jumps after loading. Three payloads are provided, detailed later.

In this directory there are also some C files that in theory enable you to host a payload directly in a Linux process. They were mostly useful in the beginning to test the code in a more friendly environment than a virtual machine, but they are far from being perfect and not very useful nowadays.

  • asmg is a G compiler written in Assembly. G is a custom language I invented for this project, described below in more detail. As soon as it is ready, it compiles the file main.g and jumps to main. Here is where most of the development is concentrated nowadays. asmg can be compiled by asmasm. See above for what is implemented in the G environment.

  • asmg0 is an effort at reducing the asmg binary seed even further, by introducing a smaller language called G0 between the binary seed and the G language. It is currently a very experimental effort (even more experimental than the rest) and it does not work at all.

  • boot contains a simple bootloader that can be used to boot all of the above (in the minimalistic style of the rest of the project). For the moment it cannot be compiled with asmasm, because it must use some system-level opcodes that are not supported by asmasm, so you have to use NASM. It works under QEMU and in principle it also works on bare metal, at least on the machines that I tried (old computers that I had around). As already outlined above, this is not tested software: you should never run it on a computer whose contents you cannot afford to have erased.

  • attic contains some earlier test code that is not used any more and is probably broken. It is probably not very interesting to most users, and might be removed altogether at some point.

  • test contains some test programs for the Assembly and C compilers contained in asmg.

  • diskfs contains the files that are made available to the virtual file system in asmg.

The platform interface exposed by the kernel

The kernel and library in the directory lib offer a simple API to later stages, which is described here. All calls follow the usual cdecl ABI (arguments pushed right to left; caller cleans up; return value in EAX or EDX:EAX; stack aligned to 4; EAX, ECX and EDX are caller-saved and the other registers are callee-saved; objects are returned via an additional first argument).

  • platform_exit() Exit successfully; it will never return.

  • platform_panic() Exit unsuccessfully, writing a panic error message; it will never return. In the earlier stages there is nearly no error diagnosing facility, so if the program terminates with a panic message you are on your own finding the problem. Next time you want an easy life please write in Java.

  • platform_write_char(int fd, int c) Write character c to file fd. Writing to a filesystem is not supported yet. There are only two virtual files: file 0, which writes in memory at the address contained in write_mem_ptr and then increments the address; and file 1, which writes to the console and to the serial port. There is also file 2, which just maps to 1.

  • platform_log(int fd, char *s) Write a NUL-terminated string s into fd, by repeatedly calling platform_write_char.

  • platform_open_file(char *fname) Open file fname for reading, returning the associated fd number. Opened files cannot be closed, for the moment.

  • platform_read_char(int fd) Read a char from file fd and return it. Return -1 (i.e., 0xffffffff) at EOF.

  • platform_reset_file(int fd) Seek back to the beginning of the file. Other seeks are not supported.

  • platform_allocate(int size) Simple memory allocator returning a pointer to a memory region of at least size bytes. It works in a similar way to sbrk on UNIX platforms, so you cannot return a memory region to the pool, unless it is the last one that was allocated. But you can implement your own malloc/free on top of it, as it is actually later done in G.

  • platform_get_symbol(char *name, int *arity) Return the address of symbol name, panicking if it does not exist. If arity is not NULL, return there the symbol arity (i.e., the number of parameters, which is relevant for the G language, but not for Assembly). The number -1 (0xffffffff) is returned if arity is undefined.

  • platform_setjmp(void *env) Copy the content of the general purpose registers in the buffer pointed by env (which must be at least 26 bytes long). This is used to implement the setjmp call in the C compiler.

  • platform_longjmp(void *env, int status) Restore the content of the general purpose registers from the buffer pointed by env, except EAX which is set to status. This is used to implement the longjmp call in the C compiler.

Another routine is provided when compiling the kernel with asmg:

  • platform_g_compile(char *filename) Compile the G program in filename, panicking if an error is found.

Symbols generated by either of the two compilers can be recovered with platform_get_symbol.

The G compiler also exports a few internal calls to give the G program some introspection capabilities, used to generate stack traces on assertions. They are not documented and are not to be used other than in these debugging utilities.
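The file descriptor convention of platform_write_char and the way platform_log is layered on top of it can be sketched in C as follows. This is only an illustration of the behaviour described above, not the kernel's code: putchar stands in for the console and serial port output, unknown descriptors are silently ignored, and the terminating zero byte is assumed not to be written.

```c
#include <stdio.h>

/* Destination for file 0; the caller must point it at writable memory. */
static char *write_mem_ptr;

static void platform_write_char(int fd, int c) {
    if (fd == 0) {
        /* File 0: store the character in memory and advance the pointer. */
        *write_mem_ptr++ = (char)c;
    } else if (fd == 1 || fd == 2) {
        /* Files 1 and 2: console and serial port (here, just stdout). */
        putchar(c);
    }
}

/* Write the string one character at a time, stopping at the terminator
   (which, by assumption, is not itself written). */
static void platform_log(int fd, char *s) {
    while (*s != '\0')
        platform_write_char(fd, *s++);
}
```

For example, after setting write_mem_ptr to the beginning of a buffer, platform_log(0, "Hello") leaves the five characters in the buffer and write_mem_ptr pointing just past them.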

Ported programs

This repository contains the following code ported to G:

Other programs are used by means of Git submodules (see the contrib directory), so their exact version is encoded in the Git repository itself and is not repeated here.