Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Stack usage issues (due to inlining, etc.) #13

Open
pfalcon opened this issue Aug 3, 2018 · 7 comments
Open

Stack usage issues (due to inlining, etc.) #13

pfalcon opened this issue Aug 3, 2018 · 7 comments

Comments

@pfalcon
Copy link
Owner

pfalcon commented Aug 3, 2018

Master has issues with pretty big stack usage in some (many?) cases. This is due to fact that medern compilers prefer (only able to?) to allocate single big static stack frame for entire function at once, instead of growing/shrinking frame as they go thru various subblocks of a function. For example fo,

int i;
if (rare_condition) {
    char foo[1024];
    ...
}

1024 + 4 bytes of stack will be allocated always, even though, as suggested by the code, 1024 of those will be needed rarely. It would be smarter to allocate those 1024 only when control flow goes into the "if" body, but this seems to be the lost skill for modern gcc's and clang's. (Indeed, adjusting stack pointer in multiple places would take more code and more cycles, so they optimized it to allocate frame once at function start - ad absurdum, as the code above shows).

This is all aggravated by function inlining. As it's again the codesize saving measure, it happens not just for functions explicitly marked inline, but also may happen for static functions (for example, always would happen for a static function with a single caller).

Many times, this is a bit less serious than completely grave, because other stack-hungry functions aren't called from such functions. But they always can be, e.g. an interrupt can occur when inside such stack-hoarding function, or it may deal with user (Python) type and call Python-level method, leading to really huge stack usage.

@pfalcon
Copy link
Owner Author

pfalcon commented Aug 3, 2018

Related gcc option is -fstack-usage (produces .su file besides output .o). Also, possible to get warning on too big stack usage. e.g.: -Wstack-usage=256.

@pfalcon
Copy link
Owner Author

pfalcon commented Aug 3, 2018

A case study:
Case study (x86_64 figures):

objstr.su:

objstr.c:305:10:mp_obj_str_binary_op	304	dynamic,bounded

This happens because static str_modulo_format() gets inlined into its single caller mp_obj_str_binary_op(). It would seem that marking str_modulo_format() with MP_NOINLINE is the obvious fix. It helps,but the result not uncontroversial:

objstr.c:1390:17:str_modulo_format	288	dynamic,bounded
objstr.c:305:10:mp_obj_str_binary_op	144	static

So, if we call mp_obj_str_binary_op for something else than % operator, it helps, stack usage is down to 144. However, if we call it for %, the stack usage is now 432 instead of older 304.

This happens because gcc of course uses live range packing technique, reusing stack slots for different local vars, which can't be used at the same time. Inlining opens more opportunity for such packing, hence the results above.

So, there's no "absolute" rule of thumb how to fix it, it all compromises, as usual with optimization. However, some "absolute" conclusion can be made from above: the stack usage of mp_obj_str_binary_op() is really a problem. Even after deinlining str_modulo_format(), it's big. That's because it again packs too much different code (switch branches), some having big allocations, which apply only to a particular operation, but affecting stack allocation of the entire func.

@pfalcon
Copy link
Owner Author

pfalcon commented Aug 4, 2018

but this seems to be the lost skill for modern gcc's and clang's

Well... https://gcc.gnu.org/onlinedocs/gcc/Optimize-Options.html :

-fconserve-stack

Attempt to minimize stack usage. The compiler attempts to use less stack space, even if that makes the program slower. This option implies setting the large-stack-frame parameter to 100 and the large-stack-frame-growth parameter to 400.

But, it has minimal effect on MicroPython codebase:

--- objstr.su	2018-08-04 21:02:10.000000000 +0300
+++ /home/pfalcon/projects/micropython/ports/unix/build/py/objstr.su	2018-08-04 21:03:39.948462750 +0300
@@ -31,7 +31,8 @@
 objstr.c:1811:17:str_lower	8	static
 objstr.c:1816:17:str_upper	8	static
 objstr.c:2070:10:mp_obj_new_str	8	static
-objstr.c:133:10:mp_obj_str_make_new	96	static
+objstr.c:133:10:mp_obj_str_make_new.part.2	80	static
+objstr.c:133:10:mp_obj_str_make_new	80	static
 objstr.c:1883:17:bytes_decode	32	static
 objstr.c:2084:10:mp_obj_str_intern	32	static
 objstr.c:2095:10:mp_obj_new_bytes	8	static
@@ -52,7 +53,8 @@
 objstr.c:2131:6:mp_obj_str_get_qstr	16	static
 objstr.c:2144:13:mp_obj_str_get_str	32	static
 objstr.c:2154:13:mp_obj_str_get_data	32	static
-objstr.c:303:10:mp_obj_str_binary_op	288	dynamic,bounded
+objstr.c:1388:17:str_modulo_format	256	dynamic,bounded
+objstr.c:303:10:mp_obj_str_binary_op	128	static
 objstr.c:495:10:mp_obj_str_split	96	static
 objstr.c:618:17:str_rsplit	112	static
 objstr.c:749:17:str_startswith	64	static

So, it chose not to inline str_modulo_format() into mp_obj_str_binary_op(), and also chose to split mp_obj_str_make_new() into 2 functions (which is unlikely a good move).

@pfalcon
Copy link
Owner Author

pfalcon commented Aug 5, 2018

Another issue in stack usage is ABI stack alignment. By the latest bloat fashion, this is usually times the native word size of a platform. x86_32 is particularly affected, with "modern ABI" having stack alignment of 16, 4 times the native word width.

Here's an example that such an alignment can easily cause ~2x more stack usage, diff of -fstack-usage result for -mpreferred-stack-boundary=2 (4 bytes stack alignment) vs -mpreferred-stack-boundary=4 (16 bytes stack alignment, default):

-objstr.c:2249:17:bytes_it_iternext	20	dynamic,bounded
+objstr.c:2249:17:bytes_it_iternext	48	dynamic,bounded
-objstr.c:1786:17:str_partitioner	68	dynamic,bounded
+objstr.c:1786:17:str_partitioner	112	dynamic,bounded

The GCC manual, e.g. https://gcc.gnu.org/onlinedocs/gcc-4.5.3/gcc/i386-and-x86_002d64-Options.html itself says:

This extra alignment does consume extra stack space, and generally increases code size. Code that is sensitive to stack space usage, such as embedded systems and operating system kernels, may want to reduce the preferred alignment to -mpreferred-stack-boundary=2.

Unfortunately, -mpreferred-stack-boundary= is really a (sub)target-specific option, for example, using it for x86_64 leads to compile error that supported values are 4..12.

@pfalcon
Copy link
Owner Author

pfalcon commented Aug 5, 2018

Another thing is that -fomit-frame-pointer is apparently should be enabled. Again helps poor x86_32 to push less of ebp on the stack. But there could be peculiar arch for which -fomit-frame-pointer is not implemented, thus leading in build error.

@pfalcon
Copy link
Owner Author

pfalcon commented Aug 5, 2018

but this seems to be the lost skill for modern gcc's and clang's.

However, some "absolute" conclusion can be made from above: the stack usage of mp_obj_str_binary_op() is really a problem.

I played with few more gcc options related to stack, but none of them helped with the above.

So, the only solution seems to be avoiding definition of any variables in "dispatcher" functions like *_binary_op, and instead indeed just dispatch to a particular op handler, hoping for tail calls. This hoping works (for popular platforms). E.g.:

mp_obj_t mp_obj_str_binary_op(mp_binary_op_t op, mp_obj_t lhs_in, mp_obj_t rhs_in) {
+    switch (op) {
+        case MP_BINARY_OP_EQUAL: // This will be passed only for bytes, str is dealt with in mp_obj_equal()
+        case MP_BINARY_OP_LESS:
+        case MP_BINARY_OP_LESS_EQUAL:
+        case MP_BINARY_OP_MORE:
+        case MP_BINARY_OP_MORE_EQUAL:
+            return str_cmp_op(op, lhs_in, rhs_in);
+        default:
+            return MP_OBJ_NULL; // op not supported
+    }

}

leads to the good code like:

0806305d <mp_obj_str_binary_op>:
 806305d:       8b 44 24 04             mov    0x4(%esp),%eax
 8063061:       83 f8 04                cmp    $0x4,%eax
 8063064:       77 05                   ja     806306b <mp_obj_str_binary_op+0xe>
 8063066:       e9 66 ff ff ff          jmp    8062fd1 <str_cmp_op>
 806306b:       31 c0                   xor    %eax,%eax
 806306d:       c3                      ret

But doing it this way means duplicating validation/arg type dispatching code. So again, consideration of what's better in a particular case should be done.

In the case of mp_obj_str_binary_op, it's definitely makes sense to do that, because "%s" % foo is a common operation, it may result in calling __str__ on foo, which can lead to recursion and big stack usage.

@pfalcon
Copy link
Owner Author

pfalcon commented Aug 5, 2018

-fomit-frame-pointer

But it has controversial effect on uPy codebase. First of all, with unwind table bloat not disabled, it has horrendous effect on x86_32:

- 358639           4992    1396  365027   591e3 micropython.size-x86
+ 406591           4992    1396  412979   64d33 micropython.size-x86

With crap disabled, it's more tolerable, though still bloating:

- 279329           4996    1408  285733   45c25 micropython.size-x86
+ 280033           4996    1408  286437   45ee5 micropython.size-x86

Doesn't have effect on arm-linux-gnueabihf-gcc (because it's default?).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

1 participant