- Keeping data small
- When many applets are compiled into busybox, the rw data and
- bss of all applets are concatenated, including those from libc
- if a static busybox is built. When busybox is started, _all_ this data
- is allocated, not just the part for the selected applet.
- What "allocated" means exactly depends on the arch.
- On NOMMU it probably bites the most, actually using real
- RAM for rwdata and bss. On i386, bss is lazily allocated
- by COWed zero pages. Not sure about rwdata - also COW?
- In order to keep busybox friendly to NOMMU and small-mem systems,
- we should avoid large global data in our applets, and should
- minimize usage of libc functions which implicitly use
- such structures.
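- As a rough illustration of what counts as rw data and bss, here is a
- minimal sketch. It assumes a typical ELF toolchain; the variable and
- function names are hypothetical and do not come from busybox. Running
- 'size' on the resulting object file should show where each object lands.

```c
#include <assert.h>
#include <stddef.h>

/* Nonzero initializer: typically placed in .data (rw data). */
int counter = 5;

/* No initializer, so zero-filled: typically placed in .bss. */
char scratch[4096];

/* Read-only: typically placed in .rodata, which is not per-process rw data. */
const char banner[] = "hello";

/* Bytes of writable static storage contributed by this file. */
size_t writable_footprint(void)
{
	/* scratch was never written, so it must still read as zero */
	assert(scratch[0] == 0);
	return sizeof(counter) + sizeof(scratch);
}
```

- Every such object in every applet adds to the total described above,
- whether or not that applet is the one being run.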
- A small experiment to measure "parasitic" bbox memory consumption:
- here we start 1000 "busybox sleep 10" in parallel.
- The busybox binary is a practically allyesconfig static one,
- built against uclibc. Run on an x86-64 machine with a 64-bit kernel:
- bash-3.2# nmeter '%t %c %m %p %[pn]'
- 23:17:28 .......... 168M 0 147
- 23:17:29 .......... 168M 0 147
- 23:17:30 U......... 168M 1 147
- 23:17:31 SU........ 181M 244 391
- 23:17:32 SSSSUUU... 223M 757 1147
- 23:17:33 UUU....... 223M 0 1147
- 23:17:34 U......... 223M 1 1147
- 23:17:35 .......... 223M 0 1147
- 23:17:36 .......... 223M 0 1147
- 23:17:37 S......... 223M 0 1147
- 23:17:38 .......... 223M 1 1147
- 23:17:39 .......... 223M 0 1147
- 23:17:40 .......... 223M 0 1147
- 23:17:41 .......... 210M 0 906
- 23:17:42 .......... 168M 1 147
- 23:17:43 .......... 168M 0 147
- Memory use rises from 168M to 223M, so the 1000 processes require 55M
- of memory. Thus one trivial busybox applet takes 55k of memory on
- a 64-bit x86 kernel. On a 32-bit kernel we need ~26k per applet.
- Script:
- i=1000; while test $i != 0; do
- echo -n .
- busybox sleep 10 &
- i=$((i - 1))
- done
- echo
- wait
- (Data from NOMMU arches are sought. Provide 'size busybox' output too)
- Example 1
- One example of how to reduce global data usage is in
- archival/libarchive/decompress_gunzip.c:
- /* This is somewhat complex-looking arrangement, but it allows
- * to place decompressor state either in bss or in
- * malloc'ed space simply by changing #defines below.
- * Sizes on i386:
- * text data bss dec hex
- * 5256 0 108 5364 14f4 - bss
- * 4915 0 0 4915 1333 - malloc
- */
- #define STATE_IN_BSS 0
- #define STATE_IN_MALLOC 1
- (see the rest of the file to get the idea)
- This example completely eliminates globals in that module.
- The required memory is allocated in unpack_gz_stream() [the module's
- main function] and then passed down as a parameter to all subroutines
- which need to access 'globals'.
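- A minimal sketch of this pattern follows. It is not the code from
- decompress_gunzip.c: the struct, its fields and the function names are
- hypothetical, and plain calloc() stands in for the busybox allocator.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical state: everything that would otherwise be global. */
typedef struct state_t {
	unsigned bit_count;        /* bits consumed so far */
	unsigned char window[64];  /* scratch buffer, formerly a static array */
	size_t window_len;
} state_t;

/* Subroutines take the state explicitly instead of touching globals. */
static void feed_byte(state_t *state, unsigned char c)
{
	if (state->window_len < sizeof(state->window))
		state->window[state->window_len++] = c;
	state->bit_count += 8;
}

/* Top-level entry point: owns the allocation and hands it down. */
unsigned count_bits(const unsigned char *buf, size_t len)
{
	state_t *state = calloc(1, sizeof(*state));
	unsigned result;
	size_t i;

	assert(state != NULL);  /* a real applet would use xzalloc() instead */
	for (i = 0; i < len; i++)
		feed_byte(state, buf[i]);
	result = state->bit_count;
	free(state);
	return result;
}
```

- Calling count_bits() on a 3-byte buffer returns 24. Nothing in this
- file adds to data or bss.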
- Example 2
- In case you don't want to pass this additional parameter everywhere,
- take a look at archival/gzip.c. Here all global data is replaced by
- a single global pointer (ptr_to_globals) to allocated storage.
- In order not to duplicate ptr_to_globals in every applet, you can
- reuse the single common one. It is defined in libbb/ptr_to_globals.c
- as struct globals *const ptr_to_globals, but the struct globals is
- NOT defined in libbb.h. You first define your own struct:
- struct globals { int a; char buf[1000]; };
- and then declare that ptr_to_globals is a pointer to it:
- #define G (*ptr_to_globals)
- ptr_to_globals is declared as a constant pointer.
- This helps gcc understand that it won't change, resulting in noticeably
- smaller code. In order to assign it, use the SET_PTR_TO_GLOBALS macro:
- SET_PTR_TO_GLOBALS(xzalloc(sizeof(G)));
- Typically this is done in <applet>_main(). Another variation is
- to use the stack:
- int <applet>_main(...)
- {
- #undef G
- struct globals G;
- memset(&G, 0, sizeof(G));
- SET_PTR_TO_GLOBALS(&G);
- Now you can reference "globals" by G.a, G.buf and so on, in any function.
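- The sketch below imitates this arrangement outside of the busybox tree
- so that it can be compiled on its own. Assumptions: the pointer is
- declared non-const and assigned directly (the real one is const and is
- set through SET_PTR_TO_GLOBALS), calloc() stands in for xzalloc(), and
- the names INIT_G, bump and demo_main are chosen for illustration only.

```c
#include <assert.h>
#include <stdlib.h>

/* Applet-private globals; each applet defines its own layout. */
struct globals {
	int a;
	char buf[1000];
};

/* Stand-in for the shared pointer from libbb/ptr_to_globals.c.
 * Assumption: non-const here to keep the sketch simple. */
static struct globals *ptr_to_globals;

#define G (*ptr_to_globals)
/* Hypothetical helper doing what <applet>_main() would do first. */
#define INIT_G() (ptr_to_globals = calloc(1, sizeof(G)))

static void bump(void)
{
	G.a++;            /* reads like a global, lives on the heap */
	G.buf[0] = 'x';
}

int demo_main(void)
{
	int result;

	INIT_G();
	assert(ptr_to_globals != NULL);
	bump();
	bump();
	result = G.a;
	free(ptr_to_globals);
	ptr_to_globals = NULL;
	return result;
}
```

- demo_main() returns 2. The only static storage used is the pointer.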
- bb_common_bufsiz1
- There is one big common buffer in bss - bb_common_bufsiz1. It is a much
- earlier mechanism to reduce bss usage. Each applet can use it for
- its needs. Library functions are prohibited from using it.
- The 'G.' trick can also be done using bb_common_bufsiz1 instead of a malloced buffer:
- #define G (*(struct globals*)&bb_common_bufsiz1)
- Be careful, though, and use it only if the globals fit into bb_common_bufsiz1.
- Since bb_common_bufsiz1 is BUFSIZ + 1 bytes long and BUFSIZ can change
- from one libc to another, you have to add a compile-time check for it:
- if (sizeof(struct globals) > sizeof(bb_common_bufsiz1))
- BUG_<applet>_globals_too_big();
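- With a C11 compiler the same size check can be written as a static
- assertion. This is a different technique from the BUG_ call above, shown
- only as a sketch. common_buf is a local stand-in for bb_common_bufsiz1,
- and the struct fields and spare_bytes() are hypothetical.

```c
#include <assert.h>  /* static_assert (C11) */
#include <stddef.h>
#include <stdio.h>   /* BUFSIZ */

/* Local stand-in for bb_common_bufsiz1: BUFSIZ + 1 bytes, as in the text. */
char common_buf[BUFSIZ + 1];

struct globals {
	int dev_fd;
	long sector;
	char line[80];
};

/* The build fails here if the globals outgrow the common buffer. */
static_assert(sizeof(struct globals) <= sizeof(common_buf),
	"globals do not fit into the common buffer");

/* Bytes of the common buffer left over after the globals. */
size_t spare_bytes(void)
{
	return sizeof(common_buf) - sizeof(struct globals);
}
```

- If a field is later enlarged past the buffer size, compilation stops at
- the assertion instead of silently overrunning the buffer at run time.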
- Drawbacks
- You have to initialize it by hand. xzalloc() can be helpful in clearing
- allocated storage to 0, but anything more must be done by hand.
- All global variables are prefixed by 'G.' now. If this makes code
- less readable, use #defines:
- #define dev_fd (G.dev_fd)
- #define sector (G.sector)
- Finding non-shared duplicated strings
- strings busybox | sort | uniq -c | sort -nr
- gcc's data alignment problem
- The following attribute added in vi.c:
- static int tabstop;
- static struct termios term_orig __attribute__ ((aligned (4)));
- static struct termios term_vi __attribute__ ((aligned (4)));
- reduces bss size by 32 bytes, because gcc sometimes aligns structures to
- ridiculously large values. The asm output diff for the above example:
- tabstop:
- .zero 4
- .section .bss.term_orig,"aw",@nobits
- - .align 32
- + .align 4
- .type term_orig, @object
- .size term_orig, 60
- term_orig:
- .zero 60
- .section .bss.term_vi,"aw",@nobits
- - .align 32
- + .align 4
- .type term_vi, @object
- .size term_vi, 60
- gcc doesn't seem to have options for altering this behaviour.
- Tested with gcc 3.4.3 and 4.1.1:
- char c = 1;
- // gcc aligns to 32 bytes if sizeof(struct) >= 32
- struct {
- int a,b,c,d;
- int i1,i2,i3;
- } s28 = { 1 }; // struct will be aligned to 4 bytes
- struct {
- int a,b,c,d;
- int i1,i2,i3,i4;
- } s32 = { 1 }; // struct will be aligned to 32 bytes
- // same for arrays
- char vc31[31] = { 1 }; // unaligned
- char vc32[32] = { 1 }; // aligned to 32 bytes
- -fpack-struct=1 reduces alignment of s28 to 1 (but probably
- will break layout of many libc structs) but s32 and vc32
- are still aligned to 32 bytes.
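- To see what a given compiler does, a small probe like the following can
- help. It is a sketch that assumes gcc or a compatible compiler (for the
- attribute syntax) and 4-byte ints. The object names and
- observed_alignment() are hypothetical. An address can be more aligned
- than required by chance, so the asm output (gcc -S) remains the
- authoritative check.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct s28 { int a, b, c, d; int i1, i2, i3; };
struct s32 { int a, b, c, d; int i1, i2, i3, i4; };

struct s28 plain28 = { 1 };
struct s32 plain32 = { 1 };
/* Same type, but with the alignment requested explicitly as in vi.c. */
struct s32 capped32 __attribute__ ((aligned (4))) = { 1 };

/* Largest power of two (up to 64) that divides the object's address. */
unsigned observed_alignment(const void *p)
{
	uintptr_t addr = (uintptr_t)p;
	unsigned align = 1;

	assert(p != NULL);
	while (align < 64 && (addr % (align * 2)) == 0)
		align *= 2;
	return align;
}
```

- Printing observed_alignment(&plain32) and observed_alignment(&capped32)
- shows whether the attribute makes a difference on the toolchain at hand.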
- I will try to cook up a patch to add a gcc option for disabling it.
- Meanwhile, this is where it can be disabled in gcc source:
- gcc/config/i386/i386.c
- int
- ix86_data_alignment (tree type, int align)
- {
- #if 0
- if (AGGREGATE_TYPE_P (type)
- && TYPE_SIZE (type)
- && TREE_CODE (TYPE_SIZE (type)) == INTEGER_CST
- && (TREE_INT_CST_LOW (TYPE_SIZE (type)) >= 256
- || TREE_INT_CST_HIGH (TYPE_SIZE (type))) && align < 256)
- return 256;
- #endif
- Result (non-static busybox built against glibc):
- # size /usr/srcdevel/bbox/fix/busybox.t0/busybox busybox
-    text    data     bss     dec     hex filename
-  634416    2736   23856  661008   a1610 busybox
-  632580    2672   22944  658196   a0b14 busybox_noalign
- Keeping code small
- Use scripts/bloat-o-meter to check that introduced changes
- did not generate unnecessary bloat. This script needs unstripped binaries
- to generate a detailed report. To automate this, just use
- "make bloatcheck". It requires a busybox_old binary to be present;
- use "make baseline" to generate it from unmodified source, or
- copy busybox_unstripped to busybox_old before modifying sources
- and rebuilding.
- Set CONFIG_EXTRA_CFLAGS="-fno-inline-functions-called-once",
- run "make bloatcheck", and look at the biggest auto-inlined functions.
- Now set CONFIG_EXTRA_CFLAGS back to "", but add NOINLINE
- to some of these functions. In the 1.16.x timeframe, the results were
- (annotated "make bloatcheck" output):
- function                   old     new   delta
- expand_vars_to_list          -    1712   +1712   win
- lzo1x_optimize               -    1429   +1429   win
- arith_apply                  -    1326   +1326   win
- read_interfaces              -    1163   +1163   loss, leave w/o NOINLINE
- logdir_open                  -    1148   +1148   win
- check_deps                   -    1148   +1148   loss
- rewrite                      -    1039   +1039   win
- run_pipe                   358    1396   +1038   win
- write_status_file            -    1029   +1029   almost the same, leave w/o NOINLINE
- dump_identity                -     987    +987   win
- mainQSort3                   -     921    +921   win
- parse_one_line               -     916    +916   loss
- summarize                    -     897    +897   almost the same
- do_shm                       -     884    +884   win
- cpio_o                       -     863    +863   win
- subCommand                   -     841    +841   loss
- receive                      -     834    +834   loss
- 855 bytes saved in total.
- scripts/mkdiff_obj_bloat may be useful to automate this process: run
- "scripts/mkdiff_obj_bloat NORMALLY_BUILT_TREE FORCED_NOINLINE_TREE"
- and select modules which shrank.
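- For reference, this is roughly what marking a function NOINLINE looks
- like. It is a sketch: the macro is spelled out locally as the gcc
- noinline attribute so the file compiles on its own, and checksum() and
- checksum_str() are hypothetical functions, not busybox code.

```c
#include <assert.h>
#include <stddef.h>

/* Assumption: defined locally; inside busybox the macro is already available. */
#ifndef NOINLINE
# define NOINLINE __attribute__((__noinline__))
#endif

/* Static helper with a single caller: a candidate for auto-inlining. */
static NOINLINE unsigned checksum(const unsigned char *p, size_t len)
{
	unsigned sum = 0;

	while (len--)
		sum = sum * 31 + *p++;
	return sum;
}

/* The only caller. Without NOINLINE, gcc may fold checksum() into it. */
unsigned checksum_str(const char *s)
{
	size_t len = 0;

	assert(s != NULL);
	while (s[len])
		len++;
	return checksum((const unsigned char *)s, len);
}
```

- Whether the attribute is a win or a loss for a given function has to be
- measured with "make bloatcheck", as in the table above.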