Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Feature: Save symbolic data from *.ktest files as a flat binary file #1681

Open
wants to merge 3 commits into
base: master
Choose a base branch
from

Conversation

tkuchta
Copy link
Contributor

@tkuchta tkuchta commented Jan 8, 2024

Summary:

This PR implements a new feature enabled by the --write-bin-tests flag.
When the flag is used, for every *.ktest file, KLEE stores a corresponding *.ktest.buffer file which contains only the data of the symbolic variables stored as a flat binary file.

The use case for these binary files is the following: imagine we want to use KLEE in a hybrid mode with a fuzzer on the same test harness. Currently, if we want to extract data from KLEE and use it with fuzzer, we need to parse *.ktest files which contain metadata in addition to the pure symbolic variable contents.
The new feature allows to take the *.ktest.buffer files and feed them directly to a fuzzer (e.g. AFL++).

Checklist:

  • [ x ] The PR addresses a single issue. If it can be divided into multiple independent PRs, please do so.
  • [ x ] The PR is divided into a logical sequence of commits OR a single commit is sufficient.
  • [ x ] There are no unnecessary commits (e.g. commits fixing issues in a previous commit in the same PR).
  • [ x ] Each commit has a meaningful message documenting what it does.
  • [ x ] All messages added to the codebase, all comments, as well as commit messages are spellchecked.
  • [ x ] The code is commented OR not applicable/necessary.
  • [ x ] The patch is formatted via clang-format OR not applicable (if explicitly overridden leave unchecked and explain).
  • [ x ] There are test cases for the code you added or modified OR no such test cases are required.

@tkuchta
Copy link
Contributor Author

tkuchta commented Feb 13, 2024

Do you think this feature might be useful?

Copy link

codecov bot commented Feb 13, 2024

Codecov Report

Merging #1681 (7e51d43) into master (c966cc6) will increase coverage by 0.00%.
The diff coverage is 80.00%.

Additional details and impacted files
@@           Coverage Diff           @@
##           master    #1681   +/-   ##
=======================================
  Coverage   70.18%   70.19%           
=======================================
  Files         162      162           
  Lines       19397    19407   +10     
  Branches     4632     4635    +3     
=======================================
+ Hits        13614    13622    +8     
- Misses       3743     3744    +1     
- Partials     2040     2041    +1     
Files Coverage Δ
tools/klee/main.cpp 64.95% <80.00%> (+0.23%) ⬆️

@ccadar
Copy link
Contributor

ccadar commented Feb 16, 2024

@tkuchta thanks for the contribution and sorry for the delayed answer. We are prioritising PRs for an upcoming KLEE release.

The feature is self-contained enough, so we could add it -- do others have a different opinion?

@251
Copy link
Contributor

251 commented Feb 16, 2024

I think the approach is a bit odd since it dumps all objects into a single file in arbitrary order (e.g. even stat structs, symbolic arguments, ...). In my opinion this should either be a feature of ktest-tool (which is already able to dump single objects into a file) or be refactored to just dump the content of concretised file inputs.

@tkuchta
Copy link
Contributor Author

tkuchta commented Feb 19, 2024

@251 @ccadar thank you for your comments! My intention was to dump the concrete values of symbolic objects in the order of object creation, e.g.

klee_make_symbolic(&x, sizeof(x), "x");
klee_make_symbolic(&y, sizeof(y), "y"); 

that would create a file with the following contents:

<concrete bytes of x><concrete bytes of y>

The reason for doing that is when we use a fuzzer and substitute klee_make_symbolic calls with a call that takes data from a fuzzer we can leverage easily test cases generated by KLEE. For example, we can copy these files straight into the input directory of afl.

I can see two advantages of doing that operation in KLEE:

  1. speed: w do it when the test case is generated in C++ code, while ktest-tool is a python program, which can make a difference if there are a lot of test cases and a lot of symbolic objects

  2. avoiding a second step: we have a data ready to be fed into a fuzzer just when the test case is generated.

it dumps all objects into a single file in arbitrary order
@251 my intention was to dump the objects sequentially as I noted above. Perhaps my code does something else by mistake - please let me know if that is the case.

even stat structs, symbolic arguments
@251 Could you please provide more details about these cases? I probably only considered programs with file input so likely I haven't come across this scenario. Thank you!

Thank you again for the comments!

@danielschemmel
Copy link
Contributor

I have to admit that to me it seems as if this approach is too much to take in "because, why not", but not enough to properly integrate with a fuzzer.

@tkuchta
Copy link
Contributor Author

tkuchta commented Feb 26, 2024

Hi @danielschemmel , thank you for the comment. Could you possibly let me know what would be missing for the fuzzer integration, please?
I use the files stored like that directly as AFL inputs - maybe I am missing something important here.
Thanks!

@danielschemmel
Copy link
Contributor

For one, I think that for fuzzer integration it is necessary to specify the output folder (e.g., AFL's queue), instead of just using klee-last.

Secondly, I think there are a few important special cases to consider. The most important one is stdin:

ktest-tool test000001.ktest
ktest file : 'test000001.ktest'
args       : ['stdin.bc', '--sym-stdin', '8']
num objects: 2
object 0: name: 'stdin'
object 0: size: 8
object 0: data: b'\xef\xbe\xad\xde\xff\xff\xff\xff'
object 0: hex : 0xefbeaddeffffffff
object 0: int : -559038737
object 0: uint: 18446744073150512879
object 0: text: ........
object 1: name: 'stdin_stat'
object 1: size: 144
object 1: data: b'%\x00\x00\x00\x00\x00\x00\x00\x01\x00\x00\x00\x00\x00\x00\x00\x01\x00\x00\x00\x00\x00\x00\x00\xa4\x81\x00\x00\xe8\x03\x00\x00\xe8\x03\x00\x00\xff\xff\xff\xff\x00\x00\x00\x00\x00\x00\x00\x00\xff\xff\xff\xff\xff\xff\xff\xff\x00\x10\x00\x00\x00\x00\x00\x00\xff\xff\xff\xff\xff\xff\xff\xff.*\x1eb\x00\x00\x00\x00\xff\xff\xff\xff\xff\xff\xff\xff\xbd\xc2\xdce\x00\x00\x00\x00\xff\xff\xff\xff\xff\xff\xff\xff\xbd\xc2\xdce\x00\x00\x00\x00\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff'
object 1: hex : 0x250000000000000001000000000000000100000000000000a4810000e8030000e8030000ffffffff0000000000000000ffffffffffffffff0010000000000000ffffffffffffffff2e2a1e6200000000ffffffffffffffffbdc2dc6500000000ffffffffffffffffbdc2dc6500000000ffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffff
object 1: text: %........................................................................*.b...............e...............e....................................

Should ideally yield a buffer file that contains exactly b'\xef\xbe\xad\xde\xff\xff\xff\xff'. However, since there are two objects in the ktest file, the output is less sensible:

00000000: efbe adde ffff ffff 2500 0000 0000 0000  ........%.......
00000010: 0100 0000 0000 0000 0100 0000 0000 0000  ................
00000020: a481 0000 e803 0000 e803 0000 ffff ffff  ................
00000030: 0000 0000 0000 0000 ffff ffff ffff ffff  ................
00000040: 0010 0000 0000 0000 ffff ffff ffff ffff  ................
00000050: 2e2a 1e62 0000 0000 ffff ffff ffff ffff  .*.b............
00000060: bdc2 dc65 0000 0000 ffff ffff ffff ffff  ...e............
00000070: bdc2 dc65 0000 0000 ffff ffff ffff ffff  ...e............
00000080: ffff ffff ffff ffff ffff ffff ffff ffff  ................
00000090: ffff ffff ffff ffff                      ........

This means, that there is a non-obvious issue with this example:

$ klee --libc=uclibc --posix-runtime --write-bin-tests stdin.bc --sym-stdin 8
$ ./stdin < klee-last/test000001.ktest.buffer

@tkuchta
Copy link
Contributor Author

tkuchta commented Feb 26, 2024

Thank you for the comment @danielschemmel

Indeed, I only considered the file input.

I was wondering if stdin would be meaningful in case of the fuzzer; in other words if this is possible to feed AFL with more inputs externally when it assumes getting the data from stdin rather than from a file? Would it also take the data from the queue directory and treat it as stdin?

I agree we could add the output files destination and perhaps default it to klee-last.

@251
Copy link
Contributor

251 commented Feb 26, 2024

Hi @tkuchta,

Indeed, I only considered the file input.

Symbolic files have the same issue (that's what I meant with "stat structs" above):

object 3: name: 'A_data'
object 3: size: 8
...
object 4: name: 'A_data_stat'
object 4: size: 144
...

Also, the order of objects can be dependent on the external environment and can create non-determinism when "replayed" (or executed via AFL). Of course, for fuzzing this might be less important.

I'd also prefer more sophisticated support for fuzzers or would just use a ktest-tool hack + watchexec (or similar) in your case.

@tkuchta
Copy link
Contributor Author

tkuchta commented Feb 26, 2024

Hi @251,
Thanks for the comment!

Symbolic files have the same issue (that's what I meant with "stat structs" above):
I wasn't clear enough. In my use case I use klee_make_symbolic in the symbolic part. This works better for me as I do not need to specify the input size up-front. But I can see that for symbolic files there is the "stats" part, got it.

Also, the order of objects can be dependent on the external environment and can create non-determinism when "replayed" (or executed via AFL).
Could you please provide an example in which situation would that happen? Do you mean multi-threaded apps?

@danielschemmel
Copy link
Contributor

One design that I could imagine here is to somehow™ mark the structure belonging to stdin and symbolic files, and to put them into their own binary files.

That would nicely circumvent any non-determinism issues: The variables file contains all custom variables in the order in which they were created (by klee_make_symbolic) in the source program, a stdin file for stdin (if it is symbolic) and a file for each symbolic file. These could then have appropriate names, e.g., test000001.variables, test000001.stdin and test000001.$FILENAME.

AFL (and other fuzzers) usually provide a stream of symbolic data on stdin or as a file (path). This most closely maps to the symbolic stdin/symbolic files mechanism in KLEE, which means that we should support it like this for a general fuzzer integration.

@251
Copy link
Contributor

251 commented Feb 26, 2024

Could you please provide an example in which situation would that happen?

When you rely on the environment:

if (isThursday || diskIsFull || tmpDirIsEmpty || machineNameLength > 8 || ...)
  klee_make_symbolic(&a, ...)
else
  klee_make_symbolic(&b, ...)

@tkuchta
Copy link
Contributor Author

tkuchta commented Feb 27, 2024

@danielschemmel , @251 Thank you for the comments!
My current solution implements the case with klee_make_symbolic on the KLEE side and input files on the fuzzer side (this is my use case).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants