|
@@ -0,0 +1,81 @@
|
|
|
+Here's an overview of v86's workings. For details, check the
|
|
|
+[source](https://github.com/copy/v86/tree/master/src).
|
|
|
+
|
|
|
+The major limitations of WebAssembly are (for the purpose of making emulators with jit):
|
|
|
+
|
|
|
+- structured control flow (no arbitrary jumps)
|
|
|
+- no control over registers (you can't keep hardware registers in wasm locals across functions)
|
|
|
+- no mmap (paging needs to be fully emulated)
|
|
|
+- no patching
|
|
|
+- module generation is fairly slow, but at least it's asynchronous, so other things can keep running
|
|
|
+- there is some memory overhead per module, so you can't generate more than a few thousand
|
|
|
+
|
|
|
+v86 has an interpreted mode, which collects entry points (targets of function
|
|
|
+calls and indirect jumps). It also measures the hotness per page, so that
|
|
|
+compilation is focused on code that is often executed. Once a page is
|
|
|
+considered hot, code is generated for the entire page and up to `MAX_PAGES`
|
|
|
+that are directly reachable from it.
|
|
|
+
|
|
|
+v86 generates a single function with a big switch statement (brtable), to
|
|
|
+ensure that all functions and targets of indirect jumps are reachable from
|
|
|
+other modules. The remaining control flow is handled using the "stackifier"
|
|
|
+algorithm (well-explained in
|
|
|
+[this blog post](https://medium.com/leaningtech/solving-the-structured-control-flow-problem-once-and-for-all-5123117b1ee2)).
|
|
|
+At the moment, there is no linking of wasm modules. The current module is
|
|
|
+exited, and the main loop detects if a new module can be entered.
|
|
|
+
|
|
|
+In practice, I found that browsers don't handle this structure (deep brtables,
|
|
|
+with locals being used across the entire function) very well, and `MAX_PAGES`
|
|
|
+has to be set to fairly low, otherwise memory usage blows up. It's likely that
|
|
|
+improvements are possible (generating fewer entry points, splitting code across
|
|
|
+multiple functions).
|
|
|
+
|
|
|
+Code-generation happens in two passes. The first pass finds all basic block
|
|
|
+boundaries, the second generates code for each basic block. Instruction
|
|
|
+decoding is generated by a [set of
|
|
|
+scripts](https://github.com/copy/v86/tree/master/gen) from a [table of
|
|
|
+instructions](https://github.com/copy/v86/blob/master/gen/x86_table.js). It's
|
|
|
+also used to [generate
|
|
|
+tests](https://github.com/copy/v86/blob/master/tests/nasm/create_tests.js).
|
|
|
+
|
|
|
+To handle paging, v86 generates code similar to this (see `gen_safe_read`):
|
|
|
+
|
|
|
+```
|
|
|
+entry <- tlb[addr >> 12 << 2]
|
|
|
+if entry & MASK == TLB_VALID && (addr & 0xFFF) <= 0xFFC - bytes: goto fast
|
|
|
+entry <- safe_read_jit_slow(addr, instruction_pointer)
|
|
|
+if page_fault: goto exit-with-pagefault
|
|
|
+fast: mem[(entry & ~0xFFF) ^ addr]
|
|
|
+```
|
|
|
+
|
|
|
+There is a 4 MB cache that acts like a tlb. It contains the physical address,
|
|
|
+read-only bit, whether the page contains code (in order to invalidate it on
|
|
|
+write) and whether the page points to mmio. Any of those cases are handled in
|
|
|
+the slow path (`safe_read_jit_slow`), as well as walking the page tables and
|
|
|
+triggering page faults. The fast path is taken in the vast majority of times.
|
|
|
+
|
|
|
+The remaining code generation is mostly a straight-forward, 1-to-1 translation
|
|
|
+of x86 to wasm. The only analysis done is to optimise generation of condional
|
|
|
+jumps immediately after arithmetic instructions, e.g.:
|
|
|
+
|
|
|
+```
|
|
|
+cmp eax, 52
|
|
|
+setb eax
|
|
|
+```
|
|
|
+
|
|
|
+becomes:
|
|
|
+
|
|
|
+```
|
|
|
+... // code for cmp
|
|
|
+eax <- eax < 52
|
|
|
+```
|
|
|
+
|
|
|
+A lazy flag mechanism is used to speed arithmetic (applies to both jit and
|
|
|
+interpreted mode, see
|
|
|
+[`arith.rs`](https://github.com/copy/v86/blob/master/src/rust/cpu/arith.rs) and
|
|
|
+[`misc_instr.rs`](https://github.com/copy/v86/blob/master/src/rust/cpu/misc_instr.rs)).
|
|
|
+There's a wip that tries to elide most lazy-flags updates:
|
|
|
+https://github.com/copy/v86/pull/466
|
|
|
+
|
|
|
+FPU instructions are emulated using softfloat (very slow, but unfortunately
|
|
|
+some code relies on 80 bit floats).
|