- <html>
- <title>
- data
- </title>
- <body BGCOLOR="#FFFFFF" TEXT="#000000" LINK="#0000FF" VLINK="#330088" ALINK="#FF0044">
- <H1>How to Use the Plan 9 C Compiler
- </H1>
- <DL><DD><I>Rob Pike<br>
- rob@plan9.bell-labs.com<br>
- </I></DL>
- <H4>Introduction
- </H4>
- <P>
- The C compiler on Plan 9 is a wholly new program; in fact
- it was the first piece of software written for what would
- eventually become Plan 9 from Bell Labs.
- Programmers familiar with existing C compilers will find
- a number of differences in both the language the Plan 9 compiler
- accepts and in how the compiler is used.
- </P>
- <P>
- The compiler is really a set of compilers, one for each
- architecture (MIPS, SPARC, Motorola 68020, Intel 386, etc.)
- that accept a dialect of ANSI C and efficiently produce
- fairly good code for the target machine.
- There is a packaging of the compiler that accepts strict ANSI C for
- a POSIX environment, but this document focuses on the
- native Plan 9 environment, that in which all the system source and
- almost all the utilities are written.
- </P>
- <H4>Source
- </H4>
- <P>
- The language accepted by the compilers is the core ANSI C language
- with some modest extensions,
- a greatly simplified preprocessor,
- a smaller library that includes system calls and related facilities,
- and a completely different structure for include files.
- </P>
- <P>
- Official ANSI C accepts the old (K&R) style of declarations for
- functions; the Plan 9 compilers
- are more demanding.
- Without an explicit compiler flag
- (<TT>-B</TT>)
- whose use is discouraged, the compilers insist
- on new-style function declarations, that is, prototypes for
- function arguments.
- The function declarations in the libraries' include files are
- all in the new style so the interfaces are checked at compile time.
- For C programmers who have not yet switched to function prototypes
- the clumsy syntax may seem repellent but the payoff in stronger typing
- is substantial.
- Those who wish to import existing software to Plan 9 are urged
- to use the opportunity to update their code.
- </P>
- <P>
- The compilers include an integrated preprocessor that accepts the familiar
- <TT>#include</TT>,
- <TT>#define</TT>
- for macros both with and without arguments,
- <TT>#undef</TT>,
- <TT>#line</TT>,
- <TT>#ifdef</TT>,
- <TT>#ifndef</TT>,
- and
- <TT>#endif</TT>.
- It
- supports neither
- <TT>#if</TT>
- nor
- <TT>##</TT>,
- although it does
- honor a few
- <TT>#pragmas</TT>.
- The
- <TT>#if</TT>
- directive was omitted because it greatly complicates the
- preprocessor, is never necessary, and is usually abused.
- Conditional compilation in general makes code hard to understand;
- the Plan 9 source uses it sparingly.
- Also, because the compilers remove dead code, regular
- <TT>if</TT>
- statements with constant conditions are more readable equivalents to many
- <TT>#ifs</TT>.
- To compile imported code ineluctably fouled by
- <TT>#if</TT>
- there is a separate command,
- <TT>/bin/cpp</TT>,
- that implements the complete ANSI C preprocessor specification.
- </P>
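- <P>
- As a rough illustration of the point about constant conditions, the sketch
- below guards a tracing statement with an ordinary
- <TT>if</TT>
- on an enumeration constant instead of an
- <TT>#if</TT>.
- The names
- <TT>Debug</TT>,
- <TT>tracecount</TT>,
- and
- <TT>checksum</TT>
- are invented for this example, and it is written in portable C rather than
- Plan 9 C so it can be tried with any compiler.
- </P>

```c
#include <stddef.h>

/* Hypothetical compile-time switch; an enum constant stands in for a #define. */
enum { Debug = 0 };

/* Counts how many times the debugging branch actually ran. */
int tracecount;

/* Sum the bytes of buf; the tracing statement is guarded by a constant condition. */
unsigned long
checksum(unsigned char *buf, size_t n)
{
	unsigned long sum;
	size_t i;

	sum = 0;
	for(i = 0; i < n; i++){
		if(Debug)	/* constant false: dead code a compiler may drop */
			tracecount++;
		sum += buf[i];
	}
	return sum;
}
```

- <P>
- Unlike text hidden by the preprocessor, the guarded statement is still parsed
- and type-checked whichever way the constant is set.
- </P>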
- <P>
- Include files fall into two groups: machine-dependent and machine-independent.
- The machine-independent files occupy the directory
- <TT>/sys/include</TT>;
- the others are placed in a directory appropriate to the machine, such as
- <TT>/mips/include</TT>.
- The compiler searches for include files
- first in the machine-dependent directory and then
- in the machine-independent directory.
- At the time of writing there are thirty-one machine-independent include
- files and two (per machine) machine-dependent ones:
- <TT><ureg.h></TT>
- and
- <TT><u.h></TT>.
- The first describes the layout of registers on the system stack,
- for use by the debugger.
- The second defines some
- architecture-dependent types such as
- <TT>jmp_buf</TT>
- for
- <TT>setjmp</TT>
- and the
- <TT>va_arg</TT>
- and
- <TT>va_list</TT>
- macros for handling arguments to variadic functions,
- as well as a set of
- <TT>typedef</TT>
- abbreviations for
- <TT>unsigned</TT>
- <TT>short</TT>
- and so on.
- </P>
- <P>
- Here is an excerpt from
- <TT>/68020/include/u.h</TT>:
- <DL><DT><DD><TT><PRE>
- #define nil ((void*)0)
- typedef unsigned short ushort;
- typedef unsigned char uchar;
- typedef unsigned long ulong;
- typedef unsigned int uint;
- typedef signed char schar;
- typedef long long vlong;
- typedef long jmp_buf[2];
- #define JMPBUFSP 0
- #define JMPBUFPC 1
- #define JMPBUFDPC 0
- </PRE></TT></DL>
- Plan 9 programs use
- <TT>nil</TT>
- for the name of the zero-valued pointer.
- The type
- <TT>vlong</TT>
- is the largest integer type available; on most architectures it
- is a 64-bit value.
- A couple of other types in
- <TT><u.h></TT>
- are
- <TT>u32int</TT>,
- which is guaranteed to have exactly 32 bits (a possibility on all the supported architectures) and
- <TT>mpdigit</TT>,
- which is used by the multiprecision math package
- <TT><mp.h></TT>.
- The
- <TT>#define</TT>
- constants permit an architecture-independent (but compiler-dependent)
- implementation of stack-switching using
- <TT>setjmp</TT>
- and
- <TT>longjmp</TT>.
- </P>
- <P>
- Every Plan 9 C program begins
- <DL><DT><DD><TT><PRE>
- #include <u.h>
- </PRE></TT></DL>
- because all the other installed header files use the
- <TT>typedefs</TT>
- declared in
- <TT><u.h></TT>.
- </P>
- <P>
- In strict ANSI C, include files are grouped to collect related functions
- in a single file: one for string functions, one for memory functions,
- one for I/O, and none for system calls.
- Each include file is protected by an
- <TT>#ifdef</TT>
- to guarantee its contents are seen by the compiler only once.
- Plan 9 takes a different approach. Other than a few include
- files that define external formats such as archives, the files in
- <TT>/sys/include</TT>
- correspond to
- <I>libraries.</I>
- If a program is using a library, it includes the corresponding header.
- The default C library comprises string functions, memory functions, and
- so on, largely as in ANSI C, some formatted I/O routines,
- plus all the system calls and related functions.
- To use these functions, one must
- <TT>#include</TT>
- the file
- <TT><libc.h></TT>,
- which in turn must follow
- <TT><u.h></TT>,
- to define their prototypes for the compiler.
- Here is the complete source to the traditional first C program:
- <DL><DT><DD><TT><PRE>
- #include <u.h>
- #include <libc.h>
- void
- main(void)
- {
- 	print("hello world\n");
- 	exits(0);
- }
- </PRE></TT></DL>
- The
- <TT>print</TT>
- routine and its relatives
- <TT>fprint</TT>
- and
- <TT>sprint</TT>
- resemble the similarly-named functions in Standard I/O but are not
- attached to a specific I/O library.
- In Plan 9
- <TT>main</TT>
- is not integer-valued; it should call
- <TT>exits</TT>,
- which takes a string argument (or null; here ANSI C converts the 0 to a
- <TT>char*</TT>).
- All these functions are, of course, documented in the Programmer's Manual.
- </P>
- <P>
- To use
- <TT>printf</TT>,
- <TT><stdio.h></TT>
- must be included to define the function prototype for
- <TT>printf</TT>:
- <DL><DT><DD><TT><PRE>
- #include <u.h>
- #include <libc.h>
- #include <stdio.h>
- void
- main(int argc, char *argv[])
- {
- 	printf("%s: hello world; argc = %d\n", argv[0], argc);
- 	exits(0);
- }
- </PRE></TT></DL>
- In practice, Standard I/O is not used much in Plan 9. I/O libraries are
- discussed in a later section of this document.
- </P>
- <P>
- There are libraries for handling regular expressions, raster graphics,
- windows, and so on, and each has an associated include file.
- The manual for each library states which include files are needed.
- The files are not protected against multiple inclusion and themselves
- contain no nested
- <TT>#includes</TT>.
- Instead the
- programmer is expected to sort out the requirements
- and to
- <TT>#include</TT>
- the necessary files once at the top of each source file. In practice this is
- trivial: this way of handling include files is so straightforward
- that it is rare for a source file to contain more than half a dozen
- <TT>#includes</TT>.
- </P>
- <P>
- The compilers do their own register allocation so the
- <TT>register</TT>
- keyword is ignored.
- For different reasons,
- <TT>volatile</TT>
- and
- <TT>const</TT>
- are also ignored.
- </P>
- <P>
- To make it easier to share code with other systems, Plan 9 has a version
- of the compiler,
- <TT>pcc</TT>,
- that provides the standard ANSI C preprocessor, headers, and libraries
- with POSIX extensions.
- <TT>Pcc</TT>
- is recommended only
- when broad external portability is mandated. It compiles slower,
- produces slower code (it takes extra work to simulate POSIX on Plan 9),
- eliminates those parts of the Plan 9 interface
- not related to POSIX, and illustrates the clumsiness of an environment
- designed by committee.
- <TT>Pcc</TT>
- is described in more detail in
- ``APE - The ANSI/POSIX Environment'',
- by Howard Trickey.
- </P>
- <H4>Process
- </H4>
- <P>
- Each CPU architecture supported by Plan 9 is identified by a single,
- arbitrary, alphanumeric character:
- <TT>k</TT>
- for SPARC,
- <TT>q</TT>
- for Motorola PowerPC 603 and 604,
- <TT>v</TT>
- for MIPS,
- <TT>1</TT>
- for Motorola 68000,
- <TT>2</TT>
- for Motorola 68020 and 68040,
- <TT>5</TT>
- for Acorn ARM 7500,
- <TT>6</TT>
- for Intel 960,
- <TT>7</TT>
- for DEC Alpha,
- <TT>8</TT>
- for Intel 386, and
- <TT>9</TT>
- for AMD 29000.
- The character labels the support tools and files for that architecture.
- For instance, for the 68020 the compiler is
- <TT>2c</TT>,
- the assembler is
- <TT>2a</TT>,
- the link editor/loader is
- <TT>2l</TT>,
- the object files are suffixed
- <TT>.2</TT>,
- and the default name for an executable file is
- <TT>2.out</TT>.
- Before we can use the compiler we therefore need to know which
- machine we are compiling for.
- The next section explains how this decision is made; for the moment
- assume we are building 68020 binaries and make the mental substitution for
- <TT>2</TT>
- appropriate to the machine you are actually using.
- </P>
- <P>
- To convert source to an executable binary is a two-step process.
- First run the compiler,
- <TT>2c</TT>,
- on the source, say
- <TT>file.c</TT>,
- to generate an object file
- <TT>file.2</TT>.
- Then run the loader,
- <TT>2l</TT>,
- to generate an executable
- <TT>2.out</TT>
- that may be run (on a 680X0 machine):
- <DL><DT><DD><TT><PRE>
- 2c file.c
- 2l file.2
- 2.out
- </PRE></TT></DL>
- The loader automatically links with whatever libraries the program
- needs, usually including the standard C library as defined by
- <TT><libc.h></TT>.
- Of course the compiler and loader have lots of options, both familiar and new;
- see the manual for details.
- The compiler does not generate an executable automatically;
- the output of the compiler must be given to the loader.
- Since most compilation is done under the control of
- <TT>mk</TT>
- (see below), this is rarely an inconvenience.
- </P>
- <P>
- The distribution of work between the compiler and loader is unusual.
- The compiler integrates preprocessing, parsing, register allocation,
- code generation and some assembly.
- Combining these tasks in a single program is part of the reason for
- the compiler's efficiency.
- The loader performs instruction selection, branch folding,
- and instruction scheduling,
- and writes the final executable.
- There is no separate C preprocessor and no assembler in the usual pipeline.
- Instead the intermediate object file
- (here a
- <TT>.2</TT>
- file) is a type of binary assembly language.
- The instructions in the intermediate format are not exactly those in
- the machine. For example, on the 68020 the object file may specify
- a MOVE instruction but the loader will decide just which variant of
- the MOVE instruction (MOVE immediate, MOVE quick, MOVE address,
- etc.) is most efficient.
- </P>
- <P>
- The assembler,
- <TT>2a</TT>,
- is just a translator between the textual and binary
- representations of the object file format.
- It is not an assembler in the traditional sense. It has limited
- macro capabilities (the same as the integral C preprocessor in the compiler),
- clumsy syntax, and minimal error checking. For instance, the assembler
- will accept an instruction (such as memory-to-memory MOVE on the MIPS) that the
- machine does not actually support; only when the output of the assembler
- is passed to the loader will the error be discovered.
- The assembler is intended only for writing things that need access to instructions
- invisible from C,
- such as the machine-dependent
- part of an operating system;
- very little code in Plan 9 is in assembly language.
- </P>
- <P>
- The compilers take an option
- <TT>-S</TT>
- that causes them to print on their standard output the generated code
- in a format acceptable as input to the assemblers.
- This is of course merely a formatting of the
- data in the object file; therefore the assembler is just
- an
- ASCII-to-binary converter for this format.
- Other than the specific instructions, the input to the assemblers
- is largely architecture-independent; see
- ``A Manual for the Plan 9 Assembler'',
- by Rob Pike,
- for more information.
- </P>
- <P>
- The loader is an integral part of the compilation process.
- Each library header file contains a
- <TT>#pragma</TT>
- that tells the loader the name of the associated archive; it is
- not necessary to tell the loader which libraries a program uses.
- The C run-time startup is found, by default, in the C library.
- The loader starts with an undefined
- symbol,
- <TT>_main</TT>,
- that is resolved by pulling in the run-time startup code from the library.
- (The loader undefines
- <TT>_mainp</TT>
- when profiling is enabled, to force loading of the profiling start-up
- instead.)
- </P>
- <P>
- Unlike its counterpart on other systems, the Plan 9 loader rearranges
- data to optimize access. This means the order of variables in the
- loaded program is unrelated to their order in the source.
- Most programs don't care, but some assume that, for example, the
- variables declared by
- <DL><DT><DD><TT><PRE>
- int a;
- int b;
- </PRE></TT></DL>
- will appear at adjacent addresses in memory. On Plan 9, they won't.
- </P>
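- <P>
- A program that really does need two values at adjacent addresses can say so
- explicitly, by making them elements of one array (or members of one
- structure) rather than relying on the order of separate declarations.
- The following is only a sketch in portable C; the names
- <TT>pair</TT>,
- <TT>A</TT>,
- <TT>B</TT>,
- and
- <TT>adjacent</TT>
- are invented for the example.
- </P>

```c
/* Hypothetical example: keep two related integers in one object. */
enum { A, B, Npair };

int pair[Npair];	/* pair[A] and pair[B] are contiguous by definition */

/* Report whether element B immediately follows element A in memory. */
int
adjacent(void)
{
	return &pair[B] == &pair[A] + 1;
}
```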
- <H4>Heterogeneity
- </H4>
- <P>
- When the system starts or a user logs in, the environment is configured
- so the appropriate binaries are available in
- <TT>/bin</TT>.
- The configuration process is controlled by an environment variable,
- <TT>$cputype</TT>,
- with a value such as
- <TT>mips</TT>,
- <TT>68020</TT>,
- <TT>386</TT>,
- or
- <TT>sparc</TT>.
- For each architecture there is a directory in the root,
- with the appropriate name,
- that holds the binary and library files for that architecture.
- Thus
- <TT>/mips/lib</TT>
- contains the object code libraries for MIPS programs,
- <TT>/mips/include</TT>
- holds MIPS-specific include files, and
- <TT>/mips/bin</TT>
- has the MIPS binaries.
- These binaries are attached to
- <TT>/bin</TT>
- at boot time by binding
- <TT>/$cputype/bin</TT>
- to
- <TT>/bin</TT>,
- so
- <TT>/bin</TT>
- always contains the correct files.
- </P>
- <P>
- The MIPS compiler,
- <TT>vc</TT>,
- by definition
- produces object files for the MIPS architecture,
- regardless of the architecture of the machine on which the compiler is running.
- There is a version of
- <TT>vc</TT>
- compiled for each architecture:
- <TT>/mips/bin/vc</TT>,
- <TT>/68020/bin/vc</TT>,
- <TT>/sparc/bin/vc</TT>,
- and so on,
- each capable of producing MIPS object files regardless of the native
- instruction set.
- If one is running on a SPARC,
- <TT>/sparc/bin/vc</TT>
- will compile programs for the MIPS;
- if one is running on machine
- <TT>$cputype</TT>,
- <TT>/$cputype/bin/vc</TT>
- will compile programs for the MIPS.
- </P>
- <P>
- Because of the bindings that assemble
- <TT>/bin</TT>,
- the shell always looks for a command, say
- <TT>date</TT>,
- in
- <TT>/bin</TT>
- and automatically finds the file
- <TT>/$cputype/bin/date</TT>.
- Therefore the MIPS compiler is known as just
- <TT>vc</TT>;
- the shell will invoke
- <TT>/bin/vc</TT>
- and that is guaranteed to be the version of the MIPS compiler
- appropriate for the machine running the command.
- Regardless of the architecture of the compiling machine,
- <TT>/bin/vc</TT>
- is
- <I>always</I>
- the MIPS compiler.
- </P>
- <P>
- Also, the output of
- <TT>vc</TT>
- and
- <TT>vl</TT>
- is completely independent of the machine type on which they are executed:
- <TT>.v</TT>
- files compiled (with
- <TT>vc</TT>)
- on a SPARC may be linked (with
- <TT>vl</TT>)
- on a 386.
- (The resulting
- <TT>v.out</TT>
- will run, of course, only on a MIPS.)
- Similarly, the MIPS libraries in
- <TT>/mips/lib</TT>
- are suitable for loading with
- <TT>vl</TT>
- on any machine; there is only one set of MIPS libraries, not one
- set for each architecture that supports the MIPS compiler.
- </P>
- <H4>Heterogeneity and <TT>mk</TT>
- </H4>
- <P>
- Most software on Plan 9 is compiled under the control of
- <TT>mk</TT>,
- a descendant of
- <TT>make</TT>
- that is documented in the Programmer's Manual.
- A convention used throughout the
- <TT>mkfiles</TT>
- makes it easy to compile the source into binary suitable for any architecture.
- </P>
- <P>
- The variable
- <TT>$cputype</TT>
- is advisory: it reports the architecture of the current environment, and should
- not be modified. A second variable,
- <TT>$objtype</TT>,
- is used to set which architecture is being
- <I>compiled</I>
- for.
- The value of
- <TT>$objtype</TT>
- can be used by a
- <TT>mkfile</TT>
- to configure the compilation environment.
- </P>
- <P>
- In each machine's root directory there is a short
- <TT>mkfile</TT>
- that defines a set of macros for the compiler, loader, etc.
- Here is
- <TT>/mips/mkfile</TT>:
- <DL><DT><DD><TT><PRE>
- </sys/src/mkfile.proto
- CC=vc
- LD=vl
- O=v
- AS=va
- </PRE></TT></DL>
- The line
- <DL><DT><DD><TT><PRE>
- </sys/src/mkfile.proto
- </PRE></TT></DL>
- causes
- <TT>mk</TT>
- to include the file
- <TT>/sys/src/mkfile.proto</TT>,
- which contains general definitions:
- <DL><DT><DD><TT><PRE>
- #
- # common mkfile parameters shared by all architectures
- #
- OS=v486xq7
- CPUS=mips 386 power alpha
- CFLAGS=-FVw
- LEX=lex
- YACC=yacc
- MK=/bin/mk
- </PRE></TT></DL>
- <TT>CC</TT>
- is obviously the compiler,
- <TT>AS</TT>
- the assembler, and
- <TT>LD</TT>
- the loader.
- <TT>O</TT>
- is the suffix for the object files and
- <TT>CPUS</TT>
- and
- <TT>OS</TT>
- are used in special rules described below.
- </P>
- <P>
- Here is a
- <TT>mkfile</TT>
- to build the installed source for
- <TT>sam</TT>:
- <DL><DT><DD><TT><PRE>
- </$objtype/mkfile
- OBJ=sam.$O address.$O buffer.$O cmd.$O disc.$O error.$O \
- 	file.$O io.$O list.$O mesg.$O moveto.$O multi.$O \
- 	plan9.$O rasp.$O regexp.$O string.$O sys.$O xec.$O
- $O.out: $OBJ
- 	$LD $OBJ
- install: $O.out
- 	cp $O.out /$objtype/bin/sam
- installall:
- 	for(objtype in $CPUS) mk install
- %.$O: %.c
- 	$CC $CFLAGS $stem.c
- $OBJ: sam.h errors.h mesg.h
- address.$O cmd.$O parse.$O xec.$O unix.$O: parse.h
- clean:V:
- 	rm -f [$OS].out *.[$OS] y.tab.?
- </PRE></TT></DL>
- (The actual
- <TT>mkfile</TT>
- imports most of its rules from other secondary files, but
- this example works and is not misleading.)
- The first line causes
- <TT>mk</TT>
- to include the contents of
- <TT>/$objtype/mkfile</TT>
- in the current
- <TT>mkfile</TT>.
- If
- <TT>$objtype</TT>
- is
- <TT>mips</TT>,
- this inserts the MIPS macro definitions into the
- <TT>mkfile</TT>.
- In this case the rule for
- <TT>$O.out</TT>
- uses the MIPS tools to build
- <TT>v.out</TT>.
- The
- <TT>%.$O</TT>
- rule in the file uses
- <TT>mk</TT>'s
- pattern matching facilities to convert the source files to the object
- files through the compiler.
- (The text of the rules is passed directly to the shell,
- <TT>rc</TT>,
- without further translation.
- See the
- <TT>mk</TT>
- manual if any of this is unfamiliar.)
- Because the default rule builds
- <TT>$O.out</TT>
- rather than
- <TT>sam</TT>,
- it is possible to maintain binaries for multiple machines in the
- same source directory without conflict.
- This is also, of course, why the output files from the various
- compilers and loaders
- have distinct names.
- </P>
- <P>
- The rest of the
- <TT>mkfile</TT>
- should be easy to follow; notice how the rules for
- <TT>clean</TT>
- and
- <TT>installall</TT>
- (that is, install versions for all architectures) use other macros
- defined in
- <TT>/$objtype/mkfile</TT>.
- In Plan 9,
- <TT>mkfiles</TT>
- for commands conventionally contain rules to
- <TT>install</TT>
- (compile and install the version for
- <TT>$objtype</TT>),
- <TT>installall</TT>
- (compile and install for all
- <TT>$objtypes</TT>),
- and
- <TT>clean</TT>
- (remove all object files, binaries, etc.).
- </P>
- <P>
- The
- <TT>mkfile</TT>
- is easy to use. To build a MIPS binary,
- <TT>v.out</TT>:
- <DL><DT><DD><TT><PRE>
- % objtype=mips
- % mk
- </PRE></TT></DL>
- To build and install a MIPS binary:
- <DL><DT><DD><TT><PRE>
- % objtype=mips
- % mk install
- </PRE></TT></DL>
- To build and install all versions:
- <DL><DT><DD><TT><PRE>
- % mk installall
- </PRE></TT></DL>
- These conventions make cross-compilation as easy to manage
- as traditional native compilation.
- Plan 9 programs compile and run without change on machines from
- large multiprocessors to laptops. For more information about this process, see
- ``Plan 9 Mkfiles'',
- by Bob Flandrena.
- </P>
- <H4>Portability
- </H4>
- <P>
- Within Plan 9, it is painless to write portable programs, programs whose
- source is independent of the machine on which they execute.
- The operating system is fixed and the compiler, headers and libraries
- are constant so most of the stumbling blocks to portability are removed.
- Attention to a few details can avoid those that remain.
- </P>
- <P>
- Plan 9 is a heterogeneous environment, so programs must
- <I>expect</I>
- that external files will be written by programs on machines of different
- architectures.
- The compilers, for instance, must handle without confusion
- object files written by other machines.
- The traditional approach to this problem is to pepper the source with
- <TT>#ifdefs</TT>
- to turn byte-swapping on and off.
- Plan 9 takes a different approach: of the handful of machine-dependent
- <TT>#ifdefs</TT>
- in all the source, almost all are deep in the libraries.
- Instead programs read and write files in a defined format,
- either (for low volume applications) as formatted text, or
- (for high volume applications) as binary in a known byte order.
- If the external data were written with the most significant
- byte first, the following code reads a 4-byte integer correctly
- regardless of the architecture of the executing machine (assuming
- an unsigned long holds 4 bytes):
- <DL><DT><DD><TT><PRE>
- ulong
- getlong(void)
- {
- 	ulong l;
- 	/* cast before shifting so the arithmetic is unsigned */
- 	l = (ulong)(getchar()&0xFF)<<24;
- 	l |= (ulong)(getchar()&0xFF)<<16;
- 	l |= (ulong)(getchar()&0xFF)<<8;
- 	l |= (ulong)(getchar()&0xFF)<<0;
- 	return l;
- }
- </PRE></TT></DL>
- Note that this code does not `swap' the bytes; instead it just reads
- them in the correct order.
- Variations of this code will handle any binary format
- and also avoid problems
- involving how structures are padded, how words are aligned,
- and other impediments to portability.
- Be aware, though, that extra care is needed to handle floating point data.
- </P>
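- <P>
- As one possible variation, the sketch below applies the same idea to a byte
- buffer rather than standard input: it writes a 32-bit value most significant
- byte first and reads one back in either byte order.
- The function names
- <TT>putlongbe</TT>,
- <TT>getlongbe</TT>,
- and
- <TT>getlongle</TT>
- are invented for this example and are not library routines.
- </P>

```c
/* Store the low 32 bits of l at p, most significant byte first. */
void
putlongbe(unsigned char *p, unsigned long l)
{
	p[0] = (l>>24) & 0xFF;
	p[1] = (l>>16) & 0xFF;
	p[2] = (l>>8) & 0xFF;
	p[3] = l & 0xFF;
}

/* Read a 4-byte integer stored at p, most significant byte first. */
unsigned long
getlongbe(unsigned char *p)
{
	return ((unsigned long)p[0]<<24) | ((unsigned long)p[1]<<16)
		| ((unsigned long)p[2]<<8) | (unsigned long)p[3];
}

/* The same, for data written least significant byte first. */
unsigned long
getlongle(unsigned char *p)
{
	return ((unsigned long)p[3]<<24) | ((unsigned long)p[2]<<16)
		| ((unsigned long)p[1]<<8) | (unsigned long)p[0];
}
```

- <P>
- Neither reader inspects the byte order of the executing machine; the
- format of the data alone determines which one to call.
- </P>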
- <P>
- Efficiency hounds will argue that this method is unnecessarily slow and clumsy
- when the executing machine has the same byte order (and padding and alignment)
- as the data.
- The CPU cost of I/O processing
- is rarely the bottleneck for an application, however,
- and the gain in simplicity of porting and maintaining the code greatly outweighs
- the minor speed loss from handling data in this general way.
- This method is how the Plan 9 compilers, the window system, and even the file
- servers transmit data between programs.
- </P>
- <P>
- To port programs beyond Plan 9, where the system interface is more variable,
- it is probably necessary to use
- <TT>pcc</TT>
- and hope that the target machine supports ANSI C and POSIX.
- </P>
- <H4>I/O
- </H4>
- <P>
- The default C library, defined by the include file
- <TT><libc.h></TT>,
- contains no buffered I/O package.
- It does have several entry points for printing formatted text:
- <TT>print</TT>
- outputs text to the standard output,
- <TT>fprint</TT>
- outputs text to a specified integer file descriptor, and
- <TT>sprint</TT>
- places text in a character array.
- To access library routines for buffered I/O, a program must
- explicitly include the header file associated with an appropriate library.
- </P>
- <P>
- The recommended I/O library, used by most Plan 9 utilities, is
- <TT>bio</TT>
- (buffered I/O), defined by
- <TT><bio.h></TT>.
- There also exists an implementation of ANSI Standard I/O,
- <TT>stdio</TT>.
- </P>
- <P>
- <TT>Bio</TT>
- is small and efficient, particularly for buffer-at-a-time or
- line-at-a-time I/O.
- Even for character-at-a-time I/O, however, it is significantly faster than
- the Standard I/O library,
- <TT>stdio</TT>.
- Its interface is compact and regular, although it lacks a few conveniences.
- The most noticeable is that one must explicitly define buffers for standard
- input and output;
- <TT>bio</TT>
- does not predefine them. Here is a program to copy input to output a byte
- at a time using
- <TT>bio</TT>:
- <DL><DT><DD><TT><PRE>
- #include <u.h>
- #include <libc.h>
- #include <bio.h>
- Biobuf bin;
- Biobuf bout;
- void
- main(void)
- {
- 	int c;
- 	Binit(&bin, 0, OREAD);
- 	Binit(&bout, 1, OWRITE);
- 	while((c=Bgetc(&bin)) != Beof)
- 		Bputc(&bout, c);
- 	exits(0);
- }
- </PRE></TT></DL>
- For peak performance, we could replace
- <TT>Bgetc</TT>
- and
- <TT>Bputc</TT>
- by their equivalent in-line macros
- <TT>BGETC</TT>
- and
- <TT>BPUTC</TT>
- but
- the performance gain would be modest.
- For more information on
- <TT>bio</TT>,
- see the Programmer's Manual.
- </P>
- <P>
- Perhaps the most dramatic difference in the I/O interface of Plan 9 from other
- systems' is that text is not ASCII.
- The format for
- text in Plan 9 is a byte-stream encoding of 16-bit characters.
- The character set is based on the Unicode Standard and is backward compatible with
- ASCII:
- characters with value 0 through 127 are the same in both sets.
- The 16-bit characters, called
- <I>runes</I>
- in Plan 9, are encoded using a representation called
- UTF,
- an encoding that is becoming accepted as a standard.
- (ISO calls it UTF-8;
- throughout Plan 9 it's just called
- UTF.)
- UTF
- defines multibyte sequences to
- represent character values from 0 to 65535.
- In
- UTF,
- character values up to 127 decimal, 7F hexadecimal, represent themselves,
- so straight
- ASCII
- files are also valid
- UTF.
- Also,
- UTF
- guarantees that bytes with values 0 to 127 (NUL to DEL, inclusive)
- will appear only when they represent themselves, so programs that read bytes
- looking for plain ASCII characters will continue to work.
- Any program that expects a one-to-one correspondence between bytes and
- characters will, however, need to be modified.
- An example is parsing file names.
- File names, like all text, are in
- UTF,
- so it is incorrect to search for a character in a string by
- <TT>strchr(filename,</TT>
- <TT>c)</TT>
- because the character might have a multi-byte encoding.
- The correct method is to call
- <TT>utfrune(filename,</TT>
- <TT>c)</TT>,
- defined in
- <A href="/magic/man2html/2/rune"><I>rune</I>(2),
- </A>which interprets the file name as a sequence of encoded characters
- rather than bytes.
- In fact, even when you know the character is a single byte
- that can represent only itself,
- it is safer to use
- <TT>utfrune</TT>
- because that assumes nothing about the character set
- and its representation.
- </P>
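- <P>
- To make the difference concrete, here is a sketch of a search in the spirit of
- <TT>utfrune</TT>,
- written in portable C.
- It is not the library's implementation: the name
- <TT>findrune</TT>
- is invented, characters are assumed to fit in 16 bits, and the string is
- assumed to be well-formed
- UTF.
- For a character above 127 it builds the multi-byte encoding and searches for
- that byte sequence, which a single-byte
- <TT>strchr</TT>
- cannot do.
- </P>

```c
#include <string.h>

/*
 * Hypothetical stand-in for utfrune: find the first occurrence of the
 * 16-bit character c in the UTF string s, or return 0 if it is absent.
 */
char*
findrune(char *s, unsigned int c)
{
	char enc[4];

	c &= 0xFFFF;		/* this sketch handles 16-bit characters only */
	if(c < 0x80)		/* one byte, represents only itself */
		return strchr(s, (int)c);
	if(c < 0x800){		/* two bytes: 110xxxxx 10xxxxxx */
		enc[0] = (char)(0xC0 | (c>>6));
		enc[1] = (char)(0x80 | (c&0x3F));
		enc[2] = 0;
	}else{			/* three bytes: 1110xxxx 10xxxxxx 10xxxxxx */
		enc[0] = (char)(0xE0 | (c>>12));
		enc[1] = (char)(0x80 | ((c>>6)&0x3F));
		enc[2] = (char)(0x80 | (c&0x3F));
		enc[3] = 0;
	}
	return strstr(s, enc);
}
```

- <P>
- Searching for the encoded bytes is safe because, in
- UTF,
- the first byte of a multi-byte sequence can never be mistaken for a
- continuation byte.
- </P>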
- <P>
- The library defines several symbols relevant to the representation of characters.
- Any byte with unsigned value less than
- <TT>Runesync</TT>
- will not appear in any multi-byte encoding of a character.
- <TT>Utfrune</TT>
- compares the character being searched against
- <TT>Runesync</TT>
- to see if it is sufficient to call
- <TT>strchr</TT>
- or if the byte stream must be interpreted.
- Any character with value less than
- <TT>Runeself</TT>
- is represented by a single byte with the same value.
- Finally, when errors are encountered converting
- to runes from a byte stream, the library returns the rune value
- <TT>Runeerror</TT>
- and advances a single byte. This permits programs to find runes
- embedded in binary data.
- </P>
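- <P>
- The error convention can be sketched as follows, again in portable C and
- under stated assumptions: the decoder
- <TT>decoderune</TT>
- and the constants
- <TT>Self</TT>
- and
- <TT>Badrune</TT>
- are invented for this example, the value given to
- <TT>Badrune</TT>
- is not necessarily the library's
- <TT>Runeerror</TT>,
- characters are limited to 16 bits, and the input is assumed to be
- zero-terminated.
- </P>

```c
enum {
	Self	= 0x80,		/* values below this encode as themselves */
	Badrune	= 0xFFFD	/* assumed error value, for illustration only */
};

/*
 * Decode one 16-bit character from the zero-terminated UTF bytes at s
 * into *r and return the number of bytes consumed. A malformed sequence
 * yields Badrune and consumes exactly one byte.
 */
int
decoderune(unsigned int *r, unsigned char *s)
{
	unsigned int c, c1, c2;

	c = s[0];
	if(c < Self){				/* 0xxxxxxx */
		*r = c;
		return 1;
	}
	c1 = s[1] ^ 0x80;			/* a continuation byte becomes 0..0x3F */
	if(c >= 0xC0 && c < 0xE0 && c1 < 0x40){	/* 110xxxxx 10xxxxxx */
		*r = ((c&0x1F)<<6) | c1;
		if(*r >= 0x80)			/* reject overlong forms */
			return 2;
	}else if(c >= 0xE0 && c < 0xF0 && c1 < 0x40){
		c2 = s[2] ^ 0x80;
		if(c2 < 0x40){			/* 1110xxxx 10xxxxxx 10xxxxxx */
			*r = ((c&0x0F)<<12) | (c1<<6) | c2;
			if(*r >= 0x800)
				return 3;
		}
	}
	*r = Badrune;
	return 1;
}
```

- <P>
- Because a bad byte costs only one position, a loop that calls such a decoder
- repeatedly resynchronizes at the next valid character.
- </P>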
- <P>
- <TT>Bio</TT>
- includes routines
- <TT>Bgetrune</TT>
- and
- <TT>Bputrune</TT>
- to transform the external byte stream
- UTF
- format to and from
- internal 16-bit runes.
- Also, the
- <TT>%s</TT>
- format to
- <TT>print</TT>
- accepts
- UTF;
- <TT>%c</TT>
- prints a character after narrowing it to 8 bits.
- The
- <TT>%S</TT>
- format prints a null-terminated sequence of runes;
- <TT>%C</TT>
- prints a character after narrowing it to 16 bits.
- For more information, see the Programmer's Manual, in particular
- <A href="/magic/man2html/6/utf"><I>utf</I>(6)
- </A>and
- <A href="/magic/man2html/2/rune"><I>rune</I>(2),
- </A>and the paper,
- ``Hello world, or
- Καλημέρα κόσμε, or
- こんにちは 世界'',
- by Rob Pike and
- Ken Thompson;
- there is not room for the full story here.
- </P>
- <P>
- These issues affect the compiler in several ways.
- First, the C source is in
- UTF.
- ANSI says C variables are formed from
- ASCII
- alphanumerics, but comments and literal strings may contain any characters
- encoded in the native encoding, here
- UTF.
- The declaration
- <DL><DT><DD><TT><PRE>
- char *cp = "abcÿ";
- </PRE></TT></DL>
- initializes the variable
- <TT>cp</TT>
- to point to an array of bytes holding the
- UTF
- representation of the characters
- <TT>abcÿ.</TT>
- The type
- <TT>Rune</TT>
- is defined in
- <TT><u.h></TT>
- to be
- <TT>ushort</TT>,
- which is also the `wide character' type in the compiler.
- Therefore the declaration
- <DL><DT><DD><TT><PRE>
- Rune *rp = L"abcÿ";
- </PRE></TT></DL>
- initializes the variable
- <TT>rp</TT>
- to point to an array of unsigned short integers holding the 16-bit
- values of the characters
- <TT>abcÿ</TT>.
- Note that in both these declarations the characters in the source
- that represent
- <TT>abcÿ</TT>
- are the same; what changes is how those characters are represented
- in memory in the program.
- The following two lines:
- <DL><DT><DD><TT><PRE>
- print("%s\n", "abcÿ");
- print("%S\n", L"abcÿ");
- </PRE></TT></DL>
- produce the same
- UTF
- string on their output, the first by copying the bytes, the second
- by converting from runes to bytes.
- </P>
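- <P>
- The conversion from runes to bytes can be pictured with the following
- sketch in portable C.
- It is only an illustration of the encoding step, not the code inside
- <TT>print</TT>;
- the name
- <TT>runestoutf</TT>
- is invented and characters are assumed to be 16 bits wide.
- </P>

```c
/*
 * Hypothetical helper: convert a zero-terminated array of 16-bit
 * characters into UTF bytes in buf, which must have room for 3 bytes
 * per character plus a terminating zero. Returns the number of bytes
 * written, not counting the terminator.
 */
int
runestoutf(char *buf, unsigned short *r)
{
	char *p;
	unsigned int c;

	p = buf;
	while((c = *r++) != 0){
		if(c < 0x80)
			*p++ = (char)c;
		else if(c < 0x800){
			*p++ = (char)(0xC0 | (c>>6));
			*p++ = (char)(0x80 | (c&0x3F));
		}else{
			*p++ = (char)(0xE0 | (c>>12));
			*p++ = (char)(0x80 | ((c>>6)&0x3F));
			*p++ = (char)(0x80 | (c&0x3F));
		}
	}
	*p = 0;
	return (int)(p - buf);
}
```

- <P>
- For the four characters
- <TT>abcÿ</TT>
- this produces five bytes, since
- <TT>ÿ</TT>
- (value 255) needs a two-byte encoding.
- </P>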
- <P>
- In C, character constants are integers but narrowed through the
- <TT>char</TT>
- type.
- The Unicode character
- <TT>ÿ</TT>
- has value 255, so if the
- <TT>char</TT>
- type is signed,
- the constant
- <TT>'ÿ'</TT>
- has value -1 (which is equal to EOF).
- On the other hand,
- <TT>L'ÿ'</TT>
- narrows through the wide character type,
- <TT>ushort</TT>,
- and therefore has value 255.
- </P>
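- <P>
- The two narrowings can be imitated in portable C with explicit casts.
- The helper names below are invented, and the result shown for the signed case
- assumes an 8-bit two's-complement
- <TT>char</TT>,
- which is what common machines provide.
- </P>

```c
/* The value narrowed through a signed 8-bit type, as 'ÿ' would be. */
int
narrowchar(int c)
{
	return (signed char)c;
}

/* The same value narrowed through the 16-bit wide character type. */
int
narrowwide(int c)
{
	return (unsigned short)c;
}
```

- <P>
- Under those assumptions the first function maps 255 to -1 and the second
- leaves it as 255.
- </P>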
- <P>
- Finally, although it's not ANSI C, the Plan 9 C compilers
- assume any character with value above
- <TT>Runeself</TT>
- is an alphanumeric,
- so α is a legal, if non-portable, variable name.
- </P>
- <H4>Arguments
- </H4>
- <P>
- Some macros are defined
- in
- <TT>&lt;libc.h&gt;</TT>
- for parsing the arguments to
- <TT>main()</TT>.
- They are described in
- <A href="/magic/man2html/2/arg"><I>arg</I>(2)</A>
- but are fairly self-explanatory.
- There are four macros:
- <TT>ARGBEGIN</TT>
- and
- <TT>ARGEND</TT>
- are used to bracket a hidden
- <TT>switch</TT>
- statement within which
- <TT>ARGC</TT>
- returns the current option character (rune) being processed and
- <TT>ARGF</TT>
- returns the argument to the option, as in the loader option
- <TT>-o</TT>
- <TT>file</TT>.
- Here, for example, is the code at the beginning of
- <TT>main()</TT>
- in
- <TT>ramfs.c</TT>
- (see
- <A href="/magic/man2html/1/ramfs"><I>ramfs</I>(1)</A>)
- that cracks its arguments:
- <DL><DT><DD><TT><PRE>
- void
- main(int argc, char *argv[])
- {
-     char *defmnt;
-     int p[2];
-     int mfd[2];
-     int stdio = 0;
-     defmnt = "/tmp";
-     ARGBEGIN{
-     case 'i':
-         defmnt = 0;
-         stdio = 1;
-         mfd[0] = 0;
-         mfd[1] = 1;
-         break;
-     case 's':
-         defmnt = 0;
-         break;
-     case 'm':
-         defmnt = ARGF();
-         break;
-     default:
-         usage();
-     }ARGEND
- </PRE></TT></DL>
- </P>
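- <P>
- For readers without
- <TT>&lt;libc.h&gt;</TT>
- at hand, the loop hidden by the macros can be approximated in portable C.
- The sketch below is an assumption about the general shape, not the text of
- the macros;
- <TT>Opts</TT>
- and
- <TT>parseopts</TT>
- are hypothetical names, and only the three options of the example are handled.
- </P>

```c
#include <stddef.h>

/* Hypothetical result record; the names are invented for this sketch. */
typedef struct Opts Opts;
struct Opts
{
	const char *defmnt;	/* mount point, or NULL after -i or -s */
	int stdio;		/* set by -i */
	int bad;		/* set on an unknown option or missing argument */
};

/*
 * Parse -i, -s and -m dir in the manner of the ramfs example.
 * Returns the index of the first argument that is not an option.
 */
int
parseopts(int argc, char *argv[], Opts *o)
{
	int i;
	char *s;

	o->defmnt = "/tmp";
	o->stdio = 0;
	o->bad = 0;
	for(i = 1; i < argc; i++){
		s = argv[i];
		if(s[0] != '-' || s[1] == '\0')
			break;			/* not an option word */
		if(s[1] == '-' && s[2] == '\0'){
			i++;			/* "--" ends the options */
			break;
		}
		for(s++; *s != '\0'; s++){
			if(*s == 'i'){
				o->defmnt = NULL;
				o->stdio = 1;
			}else if(*s == 's'){
				o->defmnt = NULL;
			}else if(*s == 'm'){
				/* option argument: rest of this word, else the next word */
				if(s[1] != '\0')
					o->defmnt = s + 1;
				else if(i + 1 < argc)
					o->defmnt = argv[++i];
				else
					o->bad = 1;
				break;		/* the argument used up the word */
			}else
				o->bad = 1;
		}
	}
	return i;
}
```

- <P>
- In this sketch the argument to
- <TT>-m</TT>
- may be attached to the option or given as the following word, and several
- option letters may share one word.
- </P>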
- <H4>Extensions
- </H4>
- <P>
- The compiler has several extensions to ANSI C, all of which are used
- extensively in the system source.
- First,
- <I>structure</I>
- <I>displays</I>
- permit
- <TT>struct</TT>
- expressions to be formed dynamically.
- Given these declarations:
- <DL><DT><DD><TT><PRE>
- typedef struct Point Point;
- typedef struct Rectangle Rectangle;
- struct Point
- {
-     int x, y;
- };
- struct Rectangle
- {
-     Point min, max;
- };
- Point p, q, add(Point, Point);
- Rectangle r;
- int x, y;
- </PRE></TT></DL>
- this assignment may appear anywhere an assignment is legal:
- <DL><DT><DD><TT><PRE>
- r = (Rectangle){add(p, q), (Point){x, y+3}};
- </PRE></TT></DL>
- The syntax is the same as for initializing a structure but with
- a leading cast.
- </P>
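- <P>
- The 1999 revision of the C standard adopted the same notation under the name
- <I>compound literal</I>,
- so the example can be tried with any C99 compiler.
- The body of
- <TT>add</TT>
- and the wrapper
- <TT>mkrect</TT>
- below are assumptions made only to give a complete unit.
- </P>

```c
typedef struct Point Point;
typedef struct Rectangle Rectangle;
struct Point
{
	int x, y;
};
struct Rectangle
{
	Point min, max;
};

/* Sum two points component by component (an assumed definition). */
Point
add(Point a, Point b)
{
	return (Point){a.x + b.x, a.y + b.y};
}

/* Build a rectangle using the same expression shape as the text. */
Rectangle
mkrect(Point p, Point q, int x, int y)
{
	Rectangle r;

	r = (Rectangle){add(p, q), (Point){x, y+3}};
	return r;
}
```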
- <P>
- If an
- <I>anonymous</I>
- <I>structure</I>
- or
- <I>union</I>
- is declared within another structure or union, the members of the internal
- structure or union are addressable without prefix in the outer structure.
- This feature eliminates the clumsy naming of nested structures and,
- particularly, unions.
- For example, after these declarations,
- <DL><DT><DD><TT><PRE>
- struct Lock
- {
-     int locked;
- };
- struct Node
- {
-     int type;
-     union{
-         double dval;
-         float fval;
-         long lval;
-     }; /* anonymous union */
-     struct Lock; /* anonymous structure */
- } *node;
- void lock(struct Lock*);
- </PRE></TT></DL>
- one may refer to
- <TT>node->type</TT>,
- <TT>node->dval</TT>,
- <TT>node->fval</TT>,
- <TT>node->lval</TT>,
- and
- <TT>node->locked</TT>.
- Moreover, the address of a
- <TT>struct</TT>
- <TT>Node</TT>
- may be used without a cast anywhere that the address of a
- <TT>struct</TT>
- <TT>Lock</TT>
- is used, such as in argument lists.
- The compiler automatically promotes the type and adjusts the address.
- Thus one may invoke
- <TT>lock(node)</TT>.
- </P>
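- <P>
- Standard C since the 2011 revision accepts the untagged anonymous union, but
- neither a tagged anonymous member such as
- <TT>struct Lock;</TT>
- nor the automatic pointer promotion.
- A portable approximation, with the hypothetical member name
- <TT>lk</TT>
- and helper
- <TT>locknode</TT>,
- might look like this:
- </P>

```c
struct Lock
{
	int locked;
};
struct Node
{
	int type;
	union{
		double dval;
		float fval;
		long lval;
	};			/* anonymous union, standard since C11 */
	struct Lock lk;		/* named member; ISO C has no tagged anonymous struct */
};

/* Mark a lock as held (an assumed body for the function the text declares). */
void
lock(struct Lock *l)
{
	l->locked = 1;
}

/* ISO C must name the member where Plan 9 C would accept lock(n). */
void
locknode(struct Node *n)
{
	lock(&n->lk);
}
```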
- <P>
- Anonymous structures and unions may be accessed by type name
- if (and only if) they are declared using a
- <TT>typedef</TT>
- name.
- For example, using the above declaration for
- <TT>Point</TT>,
- one may declare
- <DL><DT><DD><TT><PRE>
- struct
- {
-     int type;
-     Point;
- } p;
- </PRE></TT></DL>
- and refer to
- <TT>p.Point</TT>.
- </P>
- <P>
- In the initialization of arrays, a number in square brackets before an
- element sets the index for the initialization. For example, to initialize
- some elements in
- a table of function pointers indexed by
- ASCII
- character,
- <DL><DT><DD><TT><PRE>
- void percent(void), slash(void);
- void (*func[128])(void) =
- {
-     ['%'] percent,
-     ['/'] slash,
- };
- </PRE></TT></DL>
- </P>
- <P>
- A similar syntax allows one to initialize structure elements:
- <DL><DT><DD><TT><PRE>
- Point p =
- {
-     .y 100,
-     .x 200
- };
- </PRE></TT></DL>
- These initialization syntaxes were later added to ANSI C, with the addition of an
- equals sign between the index or tag and the value.
- The Plan 9 compiler accepts either form.
- </P>
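- <P>
- Written with the equals sign, both examples should compile under any C99
- compiler as well.
- The sketch below repeats them in that form; the empty bodies of
- <TT>percent</TT>
- and
- <TT>slash</TT>
- and the variable name
- <TT>pt</TT>
- are placeholders.
- </P>

```c
typedef struct Point Point;
struct Point
{
	int x, y;
};

/* Placeholder handlers; real ones would do some work. */
void
percent(void)
{
}

void
slash(void)
{
}

/* Table indexed by ASCII character; slots not named are null pointers. */
void (*func[128])(void) =
{
	['%'] = percent,
	['/'] = slash,
};

/* Members may be named in any order. */
Point pt =
{
	.y = 100,
	.x = 200
};
```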
- <P>
- Finally, the declaration
- <DL><DT><DD><TT><PRE>
- extern register reg;
- </PRE></TT></DL>
- (<I>this</I>
- appearance of the register keyword is not ignored)
- allocates a global register to hold the variable
- <TT>reg</TT>.
- External registers must be used carefully: they need to be declared in
- <I>all</I>
- source files and libraries in the program to guarantee the register
- is not allocated temporarily for other purposes.
- Especially on machines with few registers, such as the i386,
- it is easy to link accidentally with code that has already usurped
- the global registers and there is no diagnostic when this happens.
- Used wisely, though, external registers are powerful.
- The Plan 9 operating system uses them to access per-process and
- per-machine data structures on a multiprocessor. The storage class they provide
- is hard to create in other ways.
- </P>
- <H4>The compile-time environment
- </H4>
- <P>
- The code generated by the compilers is `optimized' by default:
- variables are placed in registers and peephole optimizations are
- performed.
- The compiler flag
- <TT>-N</TT>
- disables these optimizations.
- Registerization is done locally rather than throughout a function:
- whether a variable occupies a register or
- the memory location identified in the symbol
- table depends on the activity of the variable and may change
- throughout the life of the variable.
- The
- <TT>-N</TT>
- flag is rarely needed;
- its main use is to simplify debugging.
- There is no information in the symbol table to identify the
- registerization of a variable, so
- <TT>-N</TT>
- guarantees the variable is always where the symbol table says it is.
- </P>
- <P>
- Another flag,
- <TT>-w</TT>,
- turns
- <I>on</I>
- warnings about portability and problems detected in flow analysis.
- Most code in Plan 9 is compiled with warnings enabled;
- these warnings plus the type checking offered by function prototypes
- provide most of the support of the Unix tool
- <TT>lint</TT>
- more accurately and with less chatter.
- Two of the warnings,
- `used and not set' and `set and not used', are almost always accurate but
- may be triggered spuriously by code with invisible control flow,
- such as in routines that call
- <TT>longjmp</TT>.
- The compiler statements
- <DL><DT><DD><TT><PRE>
- SET(v1);
- USED(v2);
- </PRE></TT></DL>
- decorate the flow graph to silence the compiler.
- Either statement accepts a comma-separated list of variables.
- Use them carefully: they may silence real errors.
- For the common case of unused parameters to a function,
- leaving the name off the declaration silences the warnings.
- That is, listing the type of a parameter but giving it no
- associated variable name does the trick.
- </P>
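- <P>
- Code that must also compile elsewhere sometimes supplies macro stand-ins for
- these statements.
- The definitions below are one such assumption, with the hypothetical names
- <TT>P9_SET</TT>
- and
- <TT>P9_USED</TT>;
- unlike the compiler statements, which only decorate the flow graph,
- <TT>P9_SET</TT>
- really stores a zero.
- </P>

```c
/*
 * Portable stand-ins, not the Plan 9 compiler statements.
 * P9_USED evaluates its argument and discards the value;
 * P9_SET assigns zero so the variable is visibly initialized.
 */
#define P9_USED(x)	((void)(x))
#define P9_SET(x)	((x) = 0)

/* Return the larger of a and b; the flags parameter is reserved. */
int
larger(int a, int b, int flags)
{
	int m;

	P9_USED(flags);		/* parameter deliberately unused */
	P9_SET(m);		/* every path below assigns m before it is read */
	if(a > b)
		m = a;
	else
		m = b;
	return m;
}
```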
- <H4>Debugging
- </H4>
- <P>
- There are two debuggers available on Plan 9.
- The first, and older, is
- <TT>db</TT>,
- a revision of Unix
- <TT>adb</TT>.
- The other,
- <TT>acid</TT>,
- is a source-level debugger whose commands are statements in
- a true programming language.
- <TT>Acid</TT>
- is the preferred debugger, but since it
- borrows some elements of
- <TT>db</TT>,
- notably the formats for displaying values, it is worth knowing a little bit about
- <TT>db</TT>.
- </P>
- <P>
- Both debuggers support multiple architectures in a single program; that is,
- the programs are
- <TT>db</TT>
- and
- <TT>acid</TT>,
- not for example
- <TT>vdb</TT>
- and
- <TT>vacid</TT>.
- They also support cross-architecture debugging comfortably:
- one may debug a 68020 binary on a MIPS.
- </P>
- <P>
- Imagine a program has crashed mysteriously:
- <DL><DT><DD><TT><PRE>
- % X11/X
- Fatal server bug!
- failed to create default stipple
- X 106: suicide: sys: trap: fault read addr=0x0 pc=0x00105fb8
- %
- </PRE></TT></DL>
- When a process dies on Plan 9 it hangs in the `broken' state
- for debugging.
- Attach a debugger to the process by naming its process id:
- <DL><DT><DD><TT><PRE>
- % acid 106
- /proc/106/text:mips plan 9 executable
- /sys/lib/acid/port
- /sys/lib/acid/mips
- acid:
- </PRE></TT></DL>
- The
- <TT>acid</TT>
- function
- <TT>stk()</TT>
- reports the stack traceback:
- <DL><DT><DD><TT><PRE>
- acid: stk()
- At pc:0x105fb8:abort+0x24 /sys/src/ape/lib/ap/stdio/abort.c:6
- abort() /sys/src/ape/lib/ap/stdio/abort.c:4
- called from FatalError+#4e
- /sys/src/X/mit/server/dix/misc.c:421
- FatalError(s9=#e02, s8=#4901d200, s7=#2, s6=#72701, s5=#1,
- s4=#7270d, s3=#6, s2=#12, s1=#ff37f1c, s0=#6, f=#7270f)
- /sys/src/X/mit/server/dix/misc.c:416
- called from gnotscreeninit+#4ce
- /sys/src/X/mit/server/ddx/gnot/gnot.c:792
- gnotscreeninit(snum=#0, sc=#80db0)
- /sys/src/X/mit/server/ddx/gnot/gnot.c:766
- called from AddScreen+#16e
- /n/bootes/sys/src/X/mit/server/dix/main.c:610
- AddScreen(pfnInit=0x0000129c,argc=0x00000001,argv=0x7fffffe4)
- /sys/src/X/mit/server/dix/main.c:530
- called from InitOutput+0x80
- /sys/src/X/mit/server/ddx/brazil/brddx.c:522
- InitOutput(argc=0x00000001,argv=0x7fffffe4)
- /sys/src/X/mit/server/ddx/brazil/brddx.c:511
- called from main+0x294
- /sys/src/X/mit/server/dix/main.c:225
- main(argc=0x00000001,argv=0x7fffffe4)
- /sys/src/X/mit/server/dix/main.c:136
- called from _main+0x24
- /sys/src/ape/lib/ap/mips/main9.s:8
- </PRE></TT></DL>
- The function
- <TT>lstk()</TT>
- is similar but
- also reports the values of local variables.
- Note that the traceback includes full file names; this is a boon to debugging,
- although it makes the output much noisier.
- </P>
- <P>
- To use
- <TT>acid</TT>
- well you will need to learn its input language; see the
- ``Acid Manual'',
- by Phil Winterbottom,
- for details. For simple debugging, however, the information in the manual page is
- sufficient. In particular, it describes the most useful functions
- for examining a process.
- </P>
- <P>
- The compiler does not place
- information describing the types of variables in the executable,
- but a compile-time flag provides crude support for symbolic debugging.
- The
- <TT>-a</TT>
- flag to the compiler suppresses code generation
- and instead emits source text in the
- <TT>acid</TT>
- language to format and display data structure types defined in the program.
- The easiest way to use this feature is to put a rule in the
- <TT>mkfile</TT>:
- <DL><DT><DD><TT><PRE>
- syms:   main.$O
-     $CC -a main.c > syms
- </PRE></TT></DL>
- Then from within
- <TT>acid</TT>,
- <DL><DT><DD><TT><PRE>
- acid: include("sourcedirectory/syms")
- </PRE></TT></DL>
- to read in the relevant definitions.
- (For multi-file source, you need to be a little fancier;
- see
- <A href="/magic/man2html/1/2c"><I>2c</I>(1)</A>).
- This text includes, for each defined compound
- type, a function with that name that may be called with the address of a structure
- of that type to display its contents.
- For example, if
- <TT>rect</TT>
- is a global variable of type
- <TT>Rectangle</TT>,
- one may execute
- <DL><DT><DD><TT><PRE>
- Rectangle(*rect)
- </PRE></TT></DL>
- to display it.
- The
- <TT>*</TT>
- (indirection) operator is necessary because
- of the way
- <TT>acid</TT>
- works: each global symbol in the program is defined as a variable by
- <TT>acid</TT>,
- with value equal to the
- <I>address</I>
- of the symbol.
- </P>
- <P>
- Another common technique is to write by hand special
- <TT>acid</TT>
- code to define functions to aid debugging, initialize the debugger, and so on.
- Conventionally, this is placed in a file called
- <TT>acid</TT>
- in the source directory; it has a line
- <DL><DT><DD><TT><PRE>
- include("sourcedirectory/syms");
- </PRE></TT></DL>
- to load the compiler-produced symbols. One may edit the compiler output directly but
- it is wiser to keep the hand-generated
- <TT>acid</TT>
- separate from the machine-generated.
- </P>
- <P>
- To make things simple, the default rules in the system
- <TT>mkfiles</TT>
- include entries to make
- <TT>foo.acid</TT>
- from
- <TT>foo.c</TT>,
- so one may use
- <TT>mk</TT>
- to automate the production of
- <TT>acid</TT>
- definitions for a given C source file.
- </P>
- <P>
- There is much more to say here. See the
- <TT>acid</TT>
- manual page, the reference manual, or the paper
- ``Acid: A Debugger Built From A Language'',
- also by Phil Winterbottom.
- </P>
- <br> <br>
- <A href=http://www.lucent.com/copyright.html>
- Copyright</A> © 2004 Lucent Technologies Inc. All rights reserved.
- </body></html>
|