Keeping data small

When many applets are compiled into busybox, the rw data and bss of
all applets are concatenated, including those from libc if busybox
is built statically. When busybox is started, _all_ of this data
is allocated, not just the part belonging to the selected applet.

What "allocated" means exactly depends on the arch.
NOMMU is probably where it bites the most, since rwdata and bss
occupy real RAM there. On i386, bss is lazily allocated
by COWed zero pages. Not sure about rwdata - also COW?

In order to keep busybox friendly to NOMMU and small-mem systems,
we should avoid large global data in our applets, and should
minimize the use of libc functions which implicitly use
such structures.

Small experiment to measure "parasitic" bbox memory consumption:
here we start 1000 "busybox sleep 10" processes in parallel.
The busybox binary is practically an allyesconfig static one,
built against uclibc. Run on an x86-64 machine with a 64-bit kernel:

    bash-3.2# nmeter '%t %c %m %p %[pn]'
    23:17:28 .......... 168M    0  147
    23:17:29 .......... 168M    0  147
    23:17:30 U......... 168M    1  147
    23:17:31 SU........ 181M  244  391
    23:17:32 SSSSUUU... 223M  757 1147
    23:17:33 UUU....... 223M    0 1147
    23:17:34 U......... 223M    1 1147
    23:17:35 .......... 223M    0 1147
    23:17:36 .......... 223M    0 1147
    23:17:37 S......... 223M    0 1147
    23:17:38 .......... 223M    1 1147
    23:17:39 .......... 223M    0 1147
    23:17:40 .......... 223M    0 1147
    23:17:41 .......... 210M    0  906
    23:17:42 .......... 168M    1  147
    23:17:43 .......... 168M    0  147

The 1000 processes require 55M of memory (223M - 168M). Thus one trivial
busybox applet takes about 55k of memory on a 64-bit x86 kernel.
On a 32-bit kernel we need ~26k per applet.

Script used for the experiment:

    i=1000; while test $i != 0; do
        echo -n .
        busybox sleep 30 &
        i=$((i - 1))
    done
    echo
    wait

(Data from NOMMU arches are sought. Provide 'size busybox' output too)

Example 1

One example of how to reduce global data usage is in
archival/libarchive/decompress_gunzip.c:

    /* This is somewhat complex-looking arrangement, but it allows
     * to place decompressor state either in bss or in
     * malloc'ed space simply by changing #defines below.
     * Sizes on i386:
     * text    data     bss     dec     hex
     * 5256       0     108    5364    14f4 - bss
     * 4915       0       0    4915    1333 - malloc
     */
    #define STATE_IN_BSS 0
    #define STATE_IN_MALLOC 1

(see the rest of the file to get the idea)

This example completely eliminates globals in that module.
The required memory is allocated in unpack_gz_stream() [the module's
main function] and then passed down as a parameter to all subroutines
which need to access 'globals'.

Example 2

In case you don't want to pass this additional parameter everywhere,
take a look at archival/gzip.c. Here all global data is replaced by
a single global pointer (ptr_to_globals) to allocated storage.

In order to not duplicate ptr_to_globals in every applet, you can
reuse a single common one. It is defined in libbb/ptr_to_globals.c
as struct globals *const ptr_to_globals, but struct globals itself is
NOT defined in libbb.h. You first define your own struct:

    struct globals { int a; char buf[1000]; };

and then declare that ptr_to_globals is a pointer to it:

    #define G (*ptr_to_globals)

ptr_to_globals is declared as a constant pointer.
This helps gcc understand that it won't change, resulting in noticeably
smaller code. In order to assign it, use the SET_PTR_TO_GLOBALS macro:

    SET_PTR_TO_GLOBALS(xzalloc(sizeof(G)));

Typically this is done in <applet>_main(). Another variation is
to use the stack:

    int <applet>_main(...)
    {
    #undef G
        struct globals G;
        memset(&G, 0, sizeof(G));
        SET_PTR_TO_GLOBALS(&G);

Now you can reference "globals" by G.a, G.buf and so on, in any function.

bb_common_bufsiz1

There is one big common buffer in bss - bb_common_bufsiz1. It is a much
earlier mechanism to reduce bss usage. Each applet can use it for
its own needs. Library functions are prohibited from using it.

The 'G.' trick can be done using bb_common_bufsiz1 instead of a malloced buffer:

    #define G (*(struct globals*)&bb_common_bufsiz1)

Be careful, though, and use it only if globals fit into bb_common_bufsiz1.
Since bb_common_bufsiz1 is BUFSIZ + 1 bytes long and BUFSIZ can change
from one libc to another, you have to add a compile-time check for it:

    if (sizeof(struct globals) > sizeof(bb_common_bufsiz1))
        BUG_<applet>_globals_too_big();

Drawbacks

You have to initialize the globals by hand. xzalloc() can be helpful in
clearing allocated storage to 0, but anything more must be done by hand.

All global variables are prefixed by 'G.' now. If this makes code
less readable, use #defines:

    #define dev_fd (G.dev_fd)
    #define sector (G.sector)

Finding non-shared duplicated strings

    strings busybox | sort | uniq -c | sort -nr

gcc's data alignment problem

The following attribute added in vi.c:

    static int tabstop;
    static struct termios term_orig __attribute__ ((aligned (4)));
    static struct termios term_vi __attribute__ ((aligned (4)));

reduces bss size by 32 bytes, because gcc sometimes aligns structures to
ridiculously large values. asm output diff for above example:

     tabstop:
            .zero   4
            .section        .bss.term_orig,"aw",@nobits
    -       .align 32
    +       .align 4
            .type   term_orig, @object
            .size   term_orig, 60
     term_orig:
            .zero   60
            .section        .bss.term_vi,"aw",@nobits
    -       .align 32
    +       .align 4
            .type   term_vi, @object
            .size   term_vi, 60

gcc doesn't seem to have options for altering this behaviour.

gcc 3.4.3 and 4.1.1 tested:

    char c = 1;
    // gcc aligns to 32 bytes if sizeof(struct) >= 32
    struct {
        int a,b,c,d;
        int i1,i2,i3;
    } s28 = { 1 };    // struct will be aligned to 4 bytes
    struct {
        int a,b,c,d;
        int i1,i2,i3,i4;
    } s32 = { 1 };    // struct will be aligned to 32 bytes
    // same for arrays
    char vc31[31] = { 1 }; // unaligned
    char vc32[32] = { 1 }; // aligned to 32 bytes

-fpack-struct=1 reduces alignment of s28 to 1 (but probably
will break layout of many libc structs) but s32 and vc32
are still aligned to 32 bytes.

I will try to cook up a patch to add a gcc option for disabling it.
Meanwhile, this is where it can be disabled in gcc source:

gcc/config/i386/i386.c

    int
    ix86_data_alignment (tree type, int align)
    {
    #if 0
      if (AGGREGATE_TYPE_P (type)
          && TYPE_SIZE (type)
          && TREE_CODE (TYPE_SIZE (type)) == INTEGER_CST
          && (TREE_INT_CST_LOW (TYPE_SIZE (type)) >= 256
              || TREE_INT_CST_HIGH (TYPE_SIZE (type))) && align < 256)
        return 256;
    #endif

Result (non-static busybox built against glibc):

    # size /usr/srcdevel/bbox/fix/busybox.t0/busybox busybox
       text    data     bss     dec     hex filename
     634416    2736   23856  661008   a1610 busybox
     632580    2672   22944  658196   a0b14 busybox_noalign

Keeping code small

Use scripts/bloat-o-meter to check whether the introduced changes
generated unnecessary bloat. This script needs unstripped binaries
to generate a detailed report. To automate this, just use
"make bloatcheck". It requires a busybox_old binary to be present:
use "make baseline" to generate it from unmodified source, or
copy busybox_unstripped to busybox_old before modifying sources
and rebuilding.

Set CONFIG_EXTRA_CFLAGS="-fno-inline-functions-called-once",
run "make bloatcheck", and look at the biggest auto-inlined functions.
Now, set CONFIG_EXTRA_CFLAGS back to "", but add NOINLINE
to some of these functions. In the 1.16.x timeframe, the results were
(annotated "make bloatcheck" output):

    function               old   new  delta
    expand_vars_to_list      -  1712  +1712 win
    lzo1x_optimize           -  1429  +1429 win
    arith_apply              -  1326  +1326 win
    read_interfaces          -  1163  +1163 loss, leave w/o NOINLINE
    logdir_open              -  1148  +1148 win
    check_deps               -  1148  +1148 loss
    rewrite                  -  1039  +1039 win
    run_pipe               358  1396  +1038 win
    write_status_file        -  1029  +1029 almost the same, leave w/o NOINLINE
    dump_identity            -   987   +987 win
    mainQSort3               -   921   +921 win
    parse_one_line           -   916   +916 loss
    summarize                -   897   +897 almost the same
    do_shm                   -   884   +884 win
    cpio_o                   -   863   +863 win
    subCommand               -   841   +841 loss
    receive                  -   834   +834 loss

855 bytes saved in total.

scripts/mkdiff_obj_bloat may be useful to automate this process: run
"scripts/mkdiff_obj_bloat NORMALLY_BUILT_TREE FORCED_NOINLINE_TREE"
and select modules which shrank.