Dynamic Control-flow Graph Generation with PinPlay*
A control-flow graph (CFG) is a fundamental structure used in computer science and engineering for describing and analyzing the structure of an algorithm or program. A dynamic control-flow graph (DCFG) is a specialized CFG that adds data from a specific execution of a program. We provide a tool for generating a DCFG based on the Pin binary-instrumentation package. We also provide an application-programmer interface (API) to access the DCFG data from within another Pin tool or a standalone program. More details follow.
We start with a classical control-flow graph (CFG) definition as described by Frances Allen in a 1970 ACM SIGPLAN article:
- A control-flow graph is defined as a directed graph in which the nodes represent basic blocks and the edges represent control-flow paths.
- A directed graph is connected if any node in the graph can be obtained (reached) from any other node.
- A basic block is a linear sequence of program instructions having one entry point (the first instruction executed) and one exit point (the last instruction executed).
Typically, a CFG is defined statically, i.e., there is exactly one CFG for a given binary, it contains no information about the execution path of any particular workload, and the nodes and edges are determined by all the possible reachable code paths in the binary. It does not typically include edges created by unexpected exceptions and other non-control-flow instructions.
We define a DCFG as a CFG with the following differences:
- A DCFG contains a start node that has no predecessor nodes and whose successor node contains the first instruction executed per thread. It also contains an end node that has no successor nodes and whose predecessor node contains the last instruction executed per thread.
- Each edge of a DCFG is augmented with a dynamic count to indicate how many times it was traversed per thread by a given program execution. Except for the start and end nodes, the dynamic count of any node is equal to the sum of the counts of all of its incoming edges, which is also equal to the sum of all of its outgoing edges.
- A DCFG need not contain nodes or edges that are not executed. A DCFG is allowed to contain nodes and edges that are not executed; they will simply have counts of zero.
- A DCFG contains edges representing all code paths actually executed, even if they are due to non-control-flow instructions. For example, a floating-point instruction that causes an unmasked exception during execution may create an edge to the exception-handling code.
The blocks in a DCFG can be combined into higher-level constructs such as loops, routines, and binary images. Dynamic data such as loop iteration counts can be deduced from the underlying edge counts.
Even though a DCFG contains dynamic information such as edge counts, in general, it does not allow one to recreate the order in which edges were executed. This additional information is needed for many types of dynamic analysis. So, in addition to the DCFG, we define a DCFG-Trace as the sequence of edges taken during execution of the workload. The exact sequence of basic blocks, routines, and loops taken by a workload can then be recreated from the combination of its DCFG and DCFG-Trace.
Creating DCFG Data
A DCFG and/or a DCFG-Trace can be created using a PinPlay tool that is linked with the DCFG PinPlay library “dcfg-pinplay.so” (for Linux). Linking with this library provides a number of command-line knobs. The “-dcfg” knob creates a DCFG datafile and adding the “-dcfg:write_trace” knob creates a DCFG-Trace datafile. Both of these files are in JSON format. See the help messages for other knobs beginning with “-dcfg:” for other options.
Using DCFG Data
A DCFG and/or a DCFG-Trace can be used in a variety of ways. For example:
- A program can be written in any language that is capable of parsing JSON files to input and process the raw DCFG data.
- A standalone C++ can be written using the C++ DCFG APIs to input the DCFG data and conveniently access the data.
- A PinPlay tool can be written using the C++ DCFG APIs to not only access the recorded data but also instrument code during replay based on the data.
Documentation is available for the following:
- DCFG format description:A description of the DCFG and DCFG-Trace JSON format.
- DCFG API description: The API classes and their methods to activate DCFG creation and access the resulting data via C++ code.
DCFG library/example-tools are distributed with the PinPlay kit for PLDI2016 (pinplay-drdebug-pldi2016-*).
For a description of the tools, see PinPlay Tutorial at PLDI 2016 [Slides].
- Graph-matching-based simulation-region selection for multiple binaries, Yount, C., Patil, H. Islam, M.S., Srikanth, A.; Performance Analysis of Systems and Software (ISPASS), 2015 IEEE International Symposium on , vol., no., pp.52,61, 29-31 March 2015. [Slides] [Paper]