Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

SSP and the Solo5 ABI (was Multiple definition error of __stack_chk_fail ...) #335

Open
niks3089 opened this issue Mar 11, 2019 · 11 comments
Labels
design Design / discussion. notabug Not a bug. Was "invalid", but that sounds too negative.

Comments

@niks3089
Copy link

When I am trying to run a test on InlcudeOS with solo5, I see the following error

ubuntu:~/projects/IncludeOS]$ ./test_solo5_hvt.sh
~/projects/IncludeOS/examples/demo_service ~/projects/IncludeOS
tap100 already exists
* <VMRunner>:  Building with cmake ([])
> -- using detected config file /home/nikhil/projects/IncludeOS/examples/demo_service/config.json
> -- Configuring done
> -- Generating done
> -- Build files have been written to: /home/nikhil/projects/IncludeOS/examples/demo_service/build
* <VMRunner>:  Building with 'make' (params=[])
/home/nikhil/includeos//x86_64/lib/solo5_hvt.o: In function `__stack_chk_fail':
/home/nikhil/.conan/data/solo5/0.4.1/includeos/test/build/5d88dc4b26402479b7aa0fcb29ca9b9c28da4c9f/solo5/bindings/crt.c:41: multiple definition of `__stack_chk_fail'
/home/nikhil/includeos//x86_64/lib/libc.a(__stack_chk_fail.o):/home/gonzo/.conan/data/musl/v1.1.18/includeos/test/build/b6ca6a0ffff110bf17b843d4258482a94281eb43/musl/arch/x86_64/atomic_arch.h:108: first defined here
make[2]: *** [demo.elf.bin] Error 1
make[1]: *** [CMakeFiles/demo.elf.bin.dir/all] Error 2
make: *** [all] Error 2
Exception while building:  Command '['make', '-j4']' returned non-zero exit status 2

[ BUILD_FAIL ] building with cmake failed


INSTALL FAILED ON COMMAND: boot --with-solo5-hvt .ubuntu:~/projects/IncludeOS]$ ./test_solo5_hvt.sh
~/projects/IncludeOS/examples/demo_service ~/projects/IncludeOS
> -- Configuring done
> -- Generating done
> -- Build files have been written to: /home/nikhil/projects/IncludeOS/examples/demo_service/build
* <VMRunner>:  Building with 'make' (params=[])
/home/nikhil/includeos//x86_64/lib/solo5_hvt.o: In function `__stack_chk_fail':
/home/nikhil/.conan/data/solo5/0.4.1/includeos/test/build/5d88dc4b26402479b7aa0fcb29ca9b9c28da4c9f/solo5/bindings/crt.c:41: multiple definition of `__stack_chk_fail'
/home/nikhil/includeos//x86_64/lib/libc.a(__stack_chk_fail.o):/home/gonzo/.conan/data/musl/v1.1.18/includeos/test/build/b6ca6a0ffff110bf17b843d4258482a94281eb43/musl/arch/x86_64/atomic_arch.h:108: first defined here
make[2]: *** [demo.elf.bin] Error 1
make[1]: *** [CMakeFiles/demo.elf.bin.dir/all] Error 2
make: *** [all] Error 2
Exception while building:  Command '['make', '-j4']' returned non-zero exit status 2

[ BUILD_FAIL ] building with cmake failed

IncludeOS uses musl as a C library and the above function is a part of the library. After discussing with IncludeOS as to what could be done to address this, we think the simplest solution is to make solo5's __stack_chk_fail weak so that musl's definition is prioritised. If you guys are ok with this, I will go ahead and make the change

@mato
Copy link
Member

mato commented Mar 11, 2019

Fundamentally, this is a layering issue. __stack_chk_fail is part of the C compiler runtime, of which there can be only one. If we make it a weak symbol, that solves your (IncludeOS) immediate problem but leaves us with two new problems:

  1. We have to maintain __stack_chk_fail as a weak symbol, forever.
  2. Someone, somewhere, will accidentally link in a __stack_chk_fail that does the wrong thing and be very unhappy.

Unless you have a better idea, my only suggestion at this time would be to patch your copy of musl so that it does not define this symbol.

@ricarkol
Copy link
Collaborator

ricarkol commented Mar 11, 2019

Someone, somewhere, will accidentally link in a __stack_chk_fail that does the wrong thing and be very unhappy.

Could that be checked on the unikernel build side? For example, includeos doesn't allow a user-defined __stack_chk_fail (as it's already defining it in musl). The issue is that the unikernel knows best how to handle the failure, like logging an appropriate error in the right place.

@mato
Copy link
Member

mato commented Mar 11, 2019

The issue is that the unikernel knows best how to handle the failure, like logging an appropriate error in the right place.

No it doesn't. If you land in __stack_chk_fail() your stack may have been corrupted by an attacker. There is nothing you can do safely at this point except print a message and abort with prejudice.

@niks3089
Copy link
Author

They might have a usecase where they have a cloud server, and you configure it to restart itself when something bad happens however, before that happens, it needs to log something and send a few messages. IncludeOS does that now, because when we panic we can run a few safe lines of code

@mato
Copy link
Member

mato commented Mar 11, 2019

They might have a usecase where they have a cloud server, and you configure it to restart itself when something bad happens

The Solo5 __stack_chk_fail() exits with the equivalent of solo5_abort(), which is exactly what this is designed for. If restart-after-abort is desired, it is the job of whatever higher-level stack is orchestrating Solo5 unikernels to detect that case and do so.

however, before that happens, it needs to log something and send a few messages. IncludeOS does that now, because when we panic we can run a few safe lines of code

I repeat, there is nothing safe you can do if you land in __stack_chk_fail(). Especially not "send a few messages".

@alfreb
Copy link

alfreb commented Mar 13, 2019

The way I understand it this is not a question of IncludeOS vs. Solo5, it's Solo5 vs. a major implementation of libc. We're just trying to link against both libraries.

As you're putting it @mato

__stack_chk_fail is part of the C compiler runtime, of which there can be only one.

Is it a goal for Solo5 to provide more of this runtime in the future or is going to be just this one function? This will be important when we decide if it's feasible for us to maintain our own patches to libc to be able to run on solo5.

@alfreb
Copy link

alfreb commented Mar 13, 2019

For example, one could argue that mmap / malloc should be provided by solo5, as solo5 is controlling virtual memory, the primary tool for providing security features such as heap randomization. We currently can't provide guard pages between heap allocated chunks of memory inside solo5 - which is one of the reasons for why we moved to musl in the first place (e.g. it doesn't use onlybrk but also mmap as backend for malloc)

@mato
Copy link
Member

mato commented Mar 13, 2019

@alfred-bratterud:

Is it a goal for Solo5 to provide more of this runtime in the future or is going to be just this one function? This will be important when we decide if it's feasible for us to maintain our own patches to libc to be able to run on solo5.

That's a good question, thanks for asking.

For the benefit of others not necessarily familiar with the details, first I want to step back a bit and explain why __stack_chk_fail ended up in the Solo5 ABI. Please bear with me if this repeats things that might seem obvious to you.

What are the Solo5 API and ABI, exactly?

To ensure we're on the same page regarding terminology, here are my definitions of these two terms, as seen by a libOS that wants to support Solo5:

  1. The Solo5 API is composed of the C types and functions declared in solo5.h.
  2. The Solo5 ABI is the CPU architecture-specific binary file format and calling convention as used by the Solo5 bindings (solo5_<target>.o) at link time. This is not actually specified anywhere, so I'll specify it here: ELF64 using the Linux userspace calling convention for the CPU architecture in question (x86-64) (arm64).

Note for completeness: The binary file format of the unikernel (i.e. the artefact resulting from linking Solo5 with a libOS) is generally the same as/based on that used by the Solo5 ABI, with slight target-specific variations (e.g. a multiboot header for virtio, a shared object for Genode vs a static executable for others).

What does Stack Smashing Protection have to do with it?

SSP modifies the calling convention used by the ABI at the compiler level. In order for the Solo5 ABI as defined above to actually work in the presence of an SSP-enabled compiler, both the libOS (caller) and bindings (callee) have to agree on the calling convention used across the ABI boundary, including the implementation details of SSP (which, as I found out the hard way while implementing #294, are not actually specified anywhere!).

This implies that both __stack_chk_guard (as a plain global, which is the convention I settled on for Solo5) and __stack_chk_fail must be provided by the Solo5 bindings as part of the ABI. Trying to support any other combination(s) on our side is out of the question.

What about supporting more of the C runtime?

To answer your original question, consistent with the goal of Solo5 being as minimalist as possible yet still useful for getting real work done with unikernels, I don't plan on adding any more of the "C runtime" to the Solo5 ABI unless it's absolutely necessary.

In an ideal world, Solo5 would not have to provide __stack_chk_fail OR any parts of the "C compiler runtime" at all. Unfortunately, we don't live in an ideal world, so I think we're stuck with it.

Why doesn't Solo5 support mmap / malloc?

(This is a different topic, but since you asked)

For example, one could argue that mmap / malloc should be provided by solo5, as solo5 is controlling virtual memory, the primary tool for providing security features such as heap randomization [,,,]

Security always involves some set of tradeoffs. Solo5 aims for portability of unikernels across a large set of different targets, while preserving as much isolation of the unikernel from its host as possible. Not allowing control over the guest's virtual memory space is a conscious design decision, which among other things significantly improves the isolation properties of an spt-targetted unikernel. For example, there have been exploits against the Linux kernel which only need access to clone() and munmap().

If we implement PIE/ASLR at some stage (see #304), then at least the location of the entire memory region provided to the unikernel for use as its "heap + stack" will be randomized, which should address some of your concern.

@mato mato changed the title Multiple definition error of __stack_chk_fail when trying to build IncludeOS with latest solo5 SSP and the Solo5 ABI (was Multiple definition error of __stack_chk_fail ...) Mar 13, 2019
@hannesm
Copy link
Contributor

hannesm commented Mar 16, 2019

as further reference on malloc, which used to be part of solo5, please take a look at the discussion in #223 and references in there

@alfreb
Copy link

alfreb commented Mar 16, 2019

Thanks @mato / @hannesm.

I think it's problematic that the solo5 hvt is now directly incompatible with libc; e.g. both musl, glibc and bsd libc (which iirc you're patching post build with an objcopy). I might need some further education as to your reasoning for this; e.g. I see you're mentioning missing standardisation for calling conventions wrt. SSP, but I don't see how linking against solo5 hvt should be different from linking to any other prebuilt library? The usual procedure is to link against libc / libc equivalent for getting e.g.__stack_chck_fail and the rest of crt, and into any other library for other stuff, regardless of whether it's compiled with the same compiler. Would be great if you could help me understand the difference in this case.

Cheers.

@mato
Copy link
Member

mato commented Mar 21, 2019

@alfred-bratterud Sorry, I spent half a day writing the above text, I'm surprised that it's not clear.

Quoting the important part:

In order for the Solo5 ABI as defined above to actually work in the presence of an SSP-enabled compiler, both the libOS (caller) and bindings (callee) have to agree on the calling convention used across the ABI boundary, including the implementation details of SSP.

So, the core issue is:

  1. The Solo5 ABI is based on "plain C function calls".
  2. In order for a "plain C function call" to work in the presence of SSP, both sides (caller, in your case includeOS and callee, i.e. Solo5 bindings) have to use the same SSP ABI for that function call.
  3. The same SSP ABI means: the same locations of __stack_chk_fail (which is causing you grief) and __stack_chk_guard (which is not causing you grief, but will cause you grief as soon as you "fix" your first problem).

If you want the details, read up on these references cribbed from #293:

Alternatively, build a copy of Solo5 master and read through objdump --disassemble tests/test_hello/test_hello.hvt at your leisure, concentrating on the use of __stack_chk_guard.

As to what can be done about this:

I thought of various workarounds like not exporting __stack_chk_{guard, fail} from the Solo5 bindings and instead having it (in the MirageOS case) defined one layer down, so in ocaml-freestanding, but ultimately it boils down to the three points I made above -- it won't work. Feel free to prove me wrong.

The only other option that I can see is re-designing the ABI and defining our own, non-standard, calling convention for all the solo5 functions. That might work, and might actually have other interesting advantages, but it's a lot of work to get right across all the supported targets, tenders and CPU architectures.

@mato mato added notabug Not a bug. Was "invalid", but that sounds too negative. design Design / discussion. labels Mar 21, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
design Design / discussion. notabug Not a bug. Was "invalid", but that sounds too negative.
Projects
None yet
Development

No branches or pull requests

5 participants