Skip to content

Latest commit

 

History

History
372 lines (308 loc) · 14.3 KB

CallGraph.md

File metadata and controls

372 lines (308 loc) · 14.3 KB

Call Graph Generator

Overview

A call graph is a directed graph that represents relationships between functions of a program. The nodes (vertices) of the graph are functions defined within the program (or its dependencies). A directed edge from some node1 to some node2 represents that node1 calls node2 at some point in the program.

In addition, the call graph generated by MIRAI includes type information as part of the edges of the graph. Supposing node1 calls node2 with two arguments, having types t1 and t2, respectively, there will be two edges in the call graph between node and node2, each edge labeled with one of the two types.

MIRAI derives the call graph of a program through static analysis (see this documentation for an overview of MIRAI). As such the derived call graph is not guaranteed to be precise; it may not always be possible to determine which functions are called in the presence of indirection. However, benchmarking (see the section below titled "Resolved calls") indicates the generated call graph is highly precise.

After the target program is statically analyzed, MIRAI supports output of the generated call graph in two output formats:

  • dot: for use with Graphviz to create a visualization of the call graph.
  • ddlog: the call graph in the form of Datalog input relations.

Please see their respective sections below for more information.

Resolved calls

MIRAI's call graph generation is highly precise; correctly resolving calls to all of the tests in the call graph generator benchmark.

Below we present the results of MIRAI on the benchmark:

Call type Resolved %
static dispatch 100%
dynamic dispatch 100%
generic 100%
function pointer 100%
macro 100%
conditionally compiled 50%

In all cases but "conditionally compiled", MIRAI is able to resolve all of the calls. For the "conditionally compiled", MIRAI receives a 50% score because conditional compilation is performed by Cargo before MIRAI is invoked (see the Architecture documentation). MIRAI does not have access to the functions that were not included in the conditional compilation. For the calls that are included, MIARI does resolve them.

Configuration

MIRAI's call graph generator output can be configured with the CLI option --call_graph_config <path_to_config>. A valid JSON configuration file must exist at that path using the following schema:

{
    "call_sites_output_path": "path/to/call_sites.json",
    "dot_output_path": "path/to/graph.dot",
    "reductions": [
        {"Slice": "function name"},
        "Fold",
        "Clean",
        "Deduplicate",
    ],
    "included_crates": ["crate_name"],
    "datalog_config": {
        "ddlog_output_path": "path/to/graph.dat" | "path/to/datalog/",
        "type_map_output_path": "path/to/types.json",
        "type_relations_path": "path/to/type_relations.json",
        "datalog_backend": "DifferentialDatalog" | "Souffle"
    },
}

Below, we explain each field in detail:

  • "call_sites_output_path": (Optional) Path where call site output of graph will be saved, if provided. See the section below on "Call site output".
  • "dot_output_path": (Optional) Path where dot output of graph will be saved, if provided. See the section below on "Dot output".
  • "reductions": Possibly empty list of reductions to apply to the call graph. See the subsection below on "Graph reductions".
  • "included_crates": List of crate names to include in the graph, if the Fold reduction is used. Can be empty if the Fold reduction is not being used.
  • "datalog_config": (Optional) Configuration for Datalog output (see below). If left unspecified, no Datalog output will be generated.

Datalog configuration:

  • "ddlog_output_path": Path where Datalog output of graph will be saved. See the section below on "Datalog output".
  • "type_map_output_path": Path where type map output of graph will be saved. See the section below on "Type Map output".
  • "type_relations_path": (Optional) Path to file where manually-added type relations will read from, if provided. See the section below on "Type relations"
  • "datalog_backend": The Datalog output backend to use. Currently Differential Datalog and Soufflé are supported. Note that if Soufflé is used "ddlog_output_path" should be the path to a directory rather than a file.

Graph reductions

A number of graph reductions are supported for the generated call graph. These reductions reduce the size of the graph by excluding nodes and edges that may not be relevant to the user in order to make the graph easier to read and analyze.

Four call graph reductions are supported:

  1. Slice(function_name): This reduction returns a sub-graph of nodes that are reachable from the node associated with the given function name. Expected configuration format is: {"Slice": "function name"}.
  2. Fold: This reduction removes all nodes that do not belong to at least one of the crates specified in "included_crates". If there exists a path from an included node to another included node that goes through one or more excluded nodes, that path is preserved through the creation of a new edge connecting the included nodes. Expected configuration format is "Fold".
  3. Deduplicate: This reduction reduces the set of edges in the graph such that each there exists at most one edge connecting any two nodes. This has the effect of removing type information from the graph, and makes it more suitable for visualization. Expected configuration format is "Deduplicate".
  4. Clean: This reduction removes orphan nodes from the graph. These are nodes that have no incoming or outgoing edges. Expected configuration format is "Clean".

Reductions may be specified in any order (and even multiple times), and they are applied sequentially in the order specified.

Call site output

This is a JSON encoded file providing a list of call sites where each call site is contains a source location, the caller and the callee. The source location consists of an index into the vector contents of the files property, along with a 1-based start line number and start column of the call expression. The caller and callee values are indices into the vector contents of the callables property.

File entries are paths: Relative paths from the crate root directory for files that are compiled by this build. Absolute (but "virtual") paths for files from prebuild libraries, such as the Rust standard library.

Callable entries are fully qualified names of functions, where the first qualifier is the name of the crate in which the callable is defined. The other qualifiers are module names and type names. Qualified names are unique.

When a call site is inside a macro invocation (such as println!(..)), the source location is that of the macro invocation rather than the source file where the macro is defined. This has the consequence that there may be more than one entry in the call_sites property with the same source location.

Dot output

The call graph generator supports dot output of the call graph. This output format is intended for use with Graphviz, which ingests the dot file format to produce a visualization of the graph in multiple formats.

In this form nodes are labeled with their corresponding function name while edges are not labeled. It is recommended to use the Deduplicate reduction to remove extra edges between nodes.

Here is an example of the dot output:

digraph {
    0 [ label = "\"static[8787]::main\"" ]
    1 [ label = "\"static[8787]::fn1\"" ]
    2 [ label = "\"static[8787]::fn2\"" ]
    3 [ label = "\"static[8787]::fn3\"" ]
    0 -> 1 [ ]
    1 -> 2 [ ]
    2 -> 3 [ ]
}

Supposing this is stored in a file called graph.dot, this can be used to produce a PDF of the graph using the Graphviz dot tool:

$ dot -Tpdf graph.dot -o graph.pdf

Datalog output

The call graph generator also supports Datalog output of the call graph. Datalog is a logic programming language that can be used, in conjunction with the generated call graph, to write an analysis that operates on the call graph.

Here is an example of the Differential Datalog output for the prior graph:

start;
insert Edge(0,0,1);
insert Edge(1,1,2);
insert Edge(2,2,3);
insert EdgeType(0,0);
insert EdgeType(1,0);
insert EdgeType(2,0);
commit;

The call graph generator generates two types of relations:

  1. Graph relations: Specify properties about the graph itself.
  2. Type relations: Specify properties about the types of the edges of the graph.

These relations are explained in detail below.

Note: To preserve type information it is recommended to not use the Deduplicate reduction.

Graph relations

There are three graph relations output by MIRAI:

  1. Edge(id,n1,n2): This specifies that there is a directed (call) edge from n1 to n2. The edge has the identifier id.
  2. EdgeType(id,t): This specifies that the edge with the identifier id has type t.
  3. Dom(n1,n2): This specifies that call to node n1 dominates the call to node n2 in the call graph of the function body where n1 and n2are called; the call to n1 occurs before n2 on all paths through the function body.

Note that all identifiers of these relations, e.g., id, n1, t, are output as u32 indexes. Node indexes can be traced back to their associated function name by looking at the dot file output. Type indexes can be traced back to an associated type string but looking at the output Type Map.

Type Map output

The output Datalog relations use indexes (unique identifiers) to refer to nodes of the graph and types of edges. As explained above, node identifers can be traced back to their function name by looking at the dot output of the call graph generator.

Type indexes can be traced back to their associated type string by looking at the output Type Map. Here is an example type map:

{
    "0": "u32",
    "1": "u8",
}

This map encodes that if relation is found that refers to the type 0, e.g., EdgeType(1,0), that relation is referring to the (Rust) type u32.

Type relations

There are two type relations supported by MIRAI:

  1. EqType(t1,t2): This encodes that types t1 and t2 are equivalent for the purpose of analysis.
  2. Member(t1,t2): This encodes that type t2 is a "member" of type t1. This is more general than subtyping. For example, we might say that T is a "member" of Vec<T>.

Type relations are determined in two ways. First, the call graph generator includes heuristics for deriving type relations for some common patterns. For example, Member(Vec<T>, T) is always derived when a type Vec<T> is encountered (as long as type T by itself is used elsewhere in the program).

Second, type relations may be manually input via the path specified in the config file: type_relations_path. This should be a path to a JSON file with the following schema:

{
    "relations": [
        {
            "kind": "Member" | "Eq",
            "type1": "T1",
            "type2": "T2"
        },
        ...
    ]    
}

A list of relations is provided. For each relation, its kind ("Member" or "Eq") is specified and the two types of the relation are specified. Concrete types (non-generic) must be used for the generated type relations to be included because MIRAI produces concrete type information.

Example: Reachability analysis

Below, we show a simple Datalog analysis that finds all nodes reachable from a given node. In this example we use the Differential Datalog implementation of Datalog.

The Datalog analysis saved in reachable.dl:

input relation Edge(id: u32, node1: u32, node2: u32)
input relation EdgeType(id: u32, type_id: u32)

output relation Reachable(node1: u32, node2: u32)
Reachable(node1, node2) :- Edge(_, node1, node2).
Reachable(node1, node3) :- Edge(_, node1, node2), Reachable(node2, node3).

The call graph Datalog output generated by MIRAI (in graph.dat):

start;
insert Edge(0,0,1);
insert Edge(1,1,2);
insert Edge(2,2,3);
insert EdgeType(0,0);
insert EdgeType(1,0);
insert EdgeType(2,0);
commit;
dump Reachable;

To run the analysis with Differential Datalog:

ddlog -i reachable.dl &&
(cd base_ddlog && cargo build --release) &&
./reachable_ddlog/target/release/reachable_cli < ../graph.dat

Finally, the output:

Reachable{.node1 = 0, .node2 = 1}
Reachable{.node1 = 0, .node2 = 2}
Reachable{.node1 = 0, .node2 = 3}
Reachable{.node1 = 1, .node2 = 2}
Reachable{.node1 = 1, .node2 = 3}
Reachable{.node1 = 2, .node2 = 3}

As we can see, there is an output relation for every pair of nodes n1, n2 such that n2 is reachable from n1.

This example could also be done with Soufflé Datalog with minor changes to the syntax:

.decl Edge(id: number, node1: number, node2: number)
.input Edge(io=file, delimiter=",")
.decl EdgeType(id: number, type_id: number)
.input EdgeType(io=file, delimiter=",")

.decl Reachable(node1: number, node2: number)
Reachable(node1, node2) :- Edge(_, node1, node2).
Reachable(node1, node3) :- Edge(_, node1, node2), Reachable(node2, node3).

.output Reachable(io=file, delimiter=",")

The input relations will also need to be formatted differently; each relation need to be placed into a separate file titled by the relation:

  • Edge.facts
    0,0,1
    1,1,2
    2,2,3
    
  • EdgeType.facts
    0,0
    1,0
    2,0
    

To run the analysis with Soufflé: souffle reachable.dl. The output will be stored into a file called Reachable.csv and will match the results we saw before.

An interesting feature of Soufflé is the ability to see proof trees for the output. To do this, first run Soufflé's explain command: souffle -t explain reachable. Then, a particular output relation can be explained:

Enter command > explain Reachable(1,3)
              Edge(2, 2, 3)   
              -----------(R1) 
Edge(1, 1, 2) Reachable(2, 3) 
--------------------------(R2)
       Reachable(1, 3)        

We can see how Reachable(1, 3) was derived transitively.