LLVMPin Instrumentation Framework

The LLVMPin framework is designed to make instrumenting programs with LLVM easy. It borrows the notion of instrumentation callback routines from Pin / ATOM and works in two phases. The first phase, the instrumentation phase, modifies the LLVM IR of a compiled program by inserting function calls to analysis routines. This is performed offline and can make use of static analysis. The second phase, the analysis phase, occurs when the modified program is executed. In general, the program runs as normal but also executes the various analysis routines that were inserted during the instrumentation phase. The analysis phase has no access to the traditional LLVM APIs and cannot perform static analysis.
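The two phases can be illustrated with a toy model that does not use LLVM at all. This is only a sketch: ToyIns, ToyProgram, ToyInstrument, ToyRun and ToyCountInstructions are hypothetical names invented for illustration, and the real framework rewrites LLVM IR rather than attaching callbacks to a vector.

```cpp
#include <cassert>
#include <functional>
#include <stdint.h>
#include <string>
#include <vector>

// Toy stand-in for an IR instruction: an opcode name plus the analysis
// callbacks that the instrumentation phase attached to it.
struct ToyIns {
  std::string opcode;
  std::vector<std::function<void()> > before;  // calls inserted before the op
};

typedef std::vector<ToyIns> ToyProgram;

// Phase 1 (offline): walk the program and insert a callback before every
// instruction. Static information such as the opcode is available here.
inline void ToyInstrument(ToyProgram &prog,
                          const std::function<void()> &analysis) {
  for (std::size_t i = 0; i < prog.size(); ++i)
    prog[i].before.push_back(analysis);
}

// Phase 2 (run time): execute the program. The inserted callbacks run
// alongside the original instructions. Returns the instructions executed.
inline uint64_t ToyRun(const ToyProgram &prog) {
  uint64_t executed = 0;
  for (std::size_t i = 0; i < prog.size(); ++i) {
    for (std::size_t c = 0; c < prog[i].before.size(); ++c)
      prog[i].before[c]();
    ++executed;  // stands in for executing the instruction itself
  }
  return executed;
}

// Convenience driver: instrument a copy of the program with a counter,
// run it, and return the dynamic instruction count.
inline uint64_t ToyCountInstructions(ToyProgram prog) {
  uint64_t icount = 0;
  ToyInstrument(prog, [&icount] { ++icount; });
  uint64_t executed = ToyRun(prog);
  assert(icount == executed);  // one callback per executed instruction
  return icount;
}
```

The point of the split is that ToyInstrument sees the whole program but never runs it, while the callback sees only what it was handed at run time.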

Installation

The LLVMPin files are currently located on sdr (sdr.cs.colorado.edu). To use the framework, make a copy of the llvmpin directory and then compile the framework using make. You should also modify the setup.sh or setup.csh file and update the LLVMPIN variable to point to the copied location. The setup file also includes a reference to a setup file for the general LLVM distribution itself, which I have set up in my public directory. If you want to use a different LLVM (i.e., one you installed yourself), feel free to comment out this line and update the script as appropriate. When done with all modifications, you can source the appropriate file whenever you want to use LLVMPin, or add the relevant bits to your profile. For example:

      > cp -r /home/esl/blomsted/public/llvmpin .
      > cd llvmpin
      > make
      > (Modify setup.sh / setup.csh)
      > source setup.sh
      > (Enjoy!)
    

Getting Started

Using LLVMPin is a lot like using LLVM, since LLVMPin only provides an API aimed at adding new instrumentation features to LLVM rather than wrapping the entire LLVM API in a Pin-compatible layer. Where possible, the existing LLVM API is used in instrumentation tools. As such, you should have some basic familiarity with LLVM. I suggest you read through the LLVM Programmer's Manual and the Writing an LLVM Pass tutorial in full. Also, keep the following around for reference: the LLVM Language Reference Manual and the LLVM API doxygen. Those are really the only LLVM docs that matter, and the doxygen is where you'll spend most of your time on a daily basis.

Using the LLVMPin framework consists of writing an LLVMPin tool (which should be located in the Tools directory of the llvmpin directory) and then using the provided 'llvmpin-gen' and 'llvmpin' scripts to compile and use your tool. Let's teach by example. The following code is a tool that counts the number of dynamic instructions executed during a run and prints the result when the program terminates. (Note: For LLVMPin, an "instruction" is a single LLVM IR instruction, not a native instruction.)

// icount.cpp
#ifdef BEGIN_ANALYSIS
#include "LLVMPin.h"
#include <stdint.h>
#include <iostream>

static uint64_t icount;

ANALYSIS void A_Ins() {
  icount++;
}

ANALYSIS void Fini(uint32_t code, void *v) {
  std::cout << icount << "\n";
}

void AnalysisInit() {
  PIN_AddFiniFunction(Fini, 0);
}
#endif // END_ANALYSIS

#ifdef BEGIN_INSTRUMENTATION
#include "LLVMPin.h"
#include "llvm/Module.h"
#include "llvm/Support/InstIterator.h"

using namespace llvmpin;
using namespace llvm;

REQUIRED(AnalysisUsage &AU) {
}

INSTRUMENT(Pass *P, Module &M) {
  for(Module::iterator fi = M.begin(), fi_end = M.end(); fi != fi_end; ++fi) {
    if(fi->isDeclaration())
      continue;

    for(Function::iterator bi = fi->begin(), bi_end = fi->end(); bi != bi_end; ++bi) {
      for(BasicBlock::iterator ii = bi->begin(), ii_end = bi->end(); ii != ii_end; ++ii) {
        Instruction *ins = &*ii;
        INS_InsertCall(ins, IPOINT_BEFORE, "A_Ins", IARG_END);
      }
    }
  }
}
#endif // END_INSTRUMENTATION

Using the 'llvmpin-gen' and 'llvmpin' scripts, the above tool can be compiled and used to apply instrumentation to a program that was compiled into LLVM bitcode. The 'llvmpin' script will generate both a new bitcode file with the instrumentation applied, as well as a native executable generated from the instrumented bitcode. For example:

    > llvm-g++ -O0 -emit-llvm -c test.c
    > llvmpin-gen icount
    > llvmpin icount test.o
    Generated test_icount.bc
    Generated test_icount
    > ./test_icount
    

As shown above, LLVMPin tools consist of instrumentation code and analysis code. When writing a tool, the analysis and instrumentation code can either be included in a single C++ file with appropriate #ifdefs (as above), or be split into two files named tool_i.cpp and tool_a.cpp. The llvmpin scripts automatically determine the appropriate files, so the same commands can be entered into your shell regardless of which method you chose. To make things clear, the above example could have been broken into two separate files as follows:

// icount_a.cpp
#include "LLVMPin.h"
#include <stdint.h>
#include <iostream>

static uint64_t icount;

ANALYSIS void A_Ins() {
  icount++;
}

ANALYSIS void Fini(uint32_t code, void *v) {
  std::cout << icount << "\n";
}

void AnalysisInit() {
  PIN_AddFiniFunction(Fini, 0);
}

// icount_i.cpp
#include "LLVMPin.h"
#include "llvm/Module.h"
#include "llvm/Support/InstIterator.h"

using namespace llvmpin;
using namespace llvm;

REQUIRED(AnalysisUsage &AU) {
}

INSTRUMENT(Pass *P, Module &M) {
  for(Module::iterator fi = M.begin(), fi_end = M.end(); fi != fi_end; ++fi) {
    if(fi->isDeclaration())
      continue;

    for(Function::iterator bi = fi->begin(), bi_end = fi->end(); bi != bi_end; ++bi) {
      for(BasicBlock::iterator ii = bi->begin(), ii_end = bi->end(); ii != ii_end; ++ii) {
        Instruction *ins = &*ii;
        INS_InsertCall(ins, IPOINT_BEFORE, "A_Ins", IARG_END);
      }
    }
  }
}

Tool Explanation

Your instrumentation code must include both the REQUIRED and INSTRUMENT routines. The REQUIRED routine can be used to list existing LLVM analysis passes that your instrumentation code requires, such as DominatorTree analysis, AliasAnalysis, etc. More details can be found in the "Writing an LLVM Pass" tutorial mentioned above. In this example, the REQUIRED routine is empty because we don't require anything. The INSTRUMENT routine is the main function for your instrumentation code, and is in charge of inserting calls to analysis functions using the INS_InsertCall API. In this example, we use the LLVM API to iterate over all of the functions in the module, all of the basic blocks within each function, and all of the instructions within each basic block, inserting a call to the A_Ins analysis routine for each instruction.

The isDeclaration check distinguishes functions defined in the module from external functions, which are only declared. External functions have no body in the module, so there are no basic blocks or instructions to iterate over. This is a limitation of the LLVMPin approach: unlike Pin, you can only instrument code that is compiled into LLVM bitcode. System libraries and other external code are not instrumented. The INS_InsertCall API is the main component of LLVMPin and is discussed later in this document.
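The effect of skipping declarations can be seen in a small LLVM-free model of the same nested traversal. ToyBlock, ToyFunction, ToyModule and ToyCollectSites are hypothetical names used only for this sketch, and the toy isDeclaration simply treats a function without blocks as external.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Toy stand-ins for BasicBlock / Function / Module.
struct ToyBlock {
  std::vector<std::string> instructions;  // opcode names
};

struct ToyFunction {
  std::string name;
  std::vector<ToyBlock> blocks;  // empty when the function is only declared
  bool isDeclaration() const { return blocks.empty(); }
};

typedef std::vector<ToyFunction> ToyModule;

// Returns one "function:opcode" entry per instruction that would receive an
// analysis call. Declarations contribute nothing because they have no body.
inline std::vector<std::string> ToyCollectSites(const ToyModule &m) {
  std::vector<std::string> sites;
  for (std::size_t f = 0; f < m.size(); ++f) {
    if (m[f].isDeclaration())
      continue;  // external function: nothing to instrument
    for (std::size_t b = 0; b < m[f].blocks.size(); ++b) {
      const ToyBlock &bb = m[f].blocks[b];
      for (std::size_t i = 0; i < bb.instructions.size(); ++i) {
        assert(!bb.instructions[i].empty());  // every toy op has an opcode
        sites.push_back(m[f].name + ":" + bb.instructions[i]);
      }
    }
  }
  return sites;
}
```

A module that declares printf and defines main with three instructions yields three sites, all inside main. The call into printf is visible as an instruction in main, but nothing inside printf is.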

Your analysis code must include the AnalysisInit routine as well as any analysis functions that you referenced in your instrumentation code. The AnalysisInit routine is executed before the main function of the instrumented program, and can be used to set up your analysis data structures, open file descriptors, etc. You can also call PIN_AddFiniFunction within the AnalysisInit routine to register a callback that is executed when the instrumented program ends. By default, there is no fini callback unless you register one. Otherwise, your analysis code is standard C++ code.
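As a rough mental model of the init/fini lifecycle, the following sketch keeps a list of registered callbacks and runs them once at "exit". It is not the LLVMPin runtime: ToyAddFiniFunction, ToyRunFini, toy_fini_list, toy_icount and ToyFini are hypothetical stand-ins, and running the callbacks in registration order is an assumption.

```cpp
#include <cassert>
#include <cstddef>
#include <stdint.h>
#include <utility>
#include <vector>

// Same shape as the fini callbacks in the examples:
// void (uint32_t exit_code, void *user_arg).
typedef void (*FiniTy)(uint32_t, void *);

// Hypothetical runtime state: registered callbacks with their arguments.
static std::vector<std::pair<FiniTy, void *> > toy_fini_list;

// Stand-in for PIN_AddFiniFunction: remember the callback and its argument.
inline void ToyAddFiniFunction(FiniTy fn, void *arg) {
  assert(fn != 0);
  toy_fini_list.push_back(std::make_pair(fn, arg));
}

// Stand-in for what a runtime might do when the program ends: invoke every
// registered callback once, then forget them. With nothing registered,
// nothing runs.
inline void ToyRunFini(uint32_t exit_code) {
  for (std::size_t i = 0; i < toy_fini_list.size(); ++i)
    toy_fini_list[i].first(exit_code, toy_fini_list[i].second);
  toy_fini_list.clear();
}

// Example analysis state and fini callback: the callback reports the
// counter through the user pointer it was registered with.
static uint64_t toy_icount = 0;
static void ToyFini(uint32_t /*code*/, void *v) {
  *static_cast<uint64_t *>(v) = toy_icount;
}
```

The second argument given at registration is handed back unchanged, which is how a tool can pass its own state to the callback without a global.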

LLVMPin API

At present, the LLVMPin API is very basic. There are plans to add to the API over time, but the existing API already allows for a variety of useful profiling tools (as will be shown in the examples section further down).

void INS_InsertCall(Instruction *ins, IPOINT pos, string fname, IARG iargs...)

The INS_InsertCall routine is used during instrumentation to add a callback to an analysis routine. The first parameter is a pointer to the LLVM Instruction that you want to instrument, the second parameter determines when the call should occur relative to the instruction (IPOINT_BEFORE, IPOINT_AFTER, or IPOINT_ANYWHERE), and the third parameter is the name of the analysis function. At present, not all IPOINTs are handled correctly so you are encouraged to use IPOINT_ANYWHERE and write tools in a way such that the position is not important.

The fourth parameter is a list of arguments that you want LLVMPin to pass into your analysis routine. This is a variable-length parameter that must be terminated with IARG_END. The usefulness of this framework stems from the various IARGs that can be used. The table below lists the currently supported IARGs.

IARG | Type | Description
IARG_INST_PTR | uint32_t | Under Pin, this corresponds to the instruction pointer (IP) address of the instruction. Under LLVMPin, it is a unique id corresponding to an LLVM IR instruction in the module. LLVMPin uniquely numbers all instructions in a module before invoking INSTRUMENT.
IARG_MEMORY_EA | uintptr_t | For an instruction that references memory (directly or indirectly), this provides the effective address of the memory location referenced.
IARG_MEMORY_SIZE | uintptr_t | For an instruction that references memory (directly or indirectly), this provides the size in bytes of the memory reference.
IARG_MY_FUNADDR | uintptr_t | Provides the address of the function that the instruction is located within. This address corresponds to the actual native address as compiled. You can use the RTN APIs discussed below to map this address into useful information.
IARG_BRANCH_TARGET_ADDR | uintptr_t | For branch, call, and return instructions, this provides the address of the branch destination. For all instructions except calls, this is the same kind of unique id used by IARG_INST_PTR. For call instructions, it is the native address of the called routine. Most tools will want to check whether an instruction is a call and use two separate analysis routines: one that logs ids for non-calls, and one that uses the address along with the RTN API to log information for calls. For IARG_BRANCH_TARGET_ADDR to work with return instructions, you must first call MISC_InitReturnStack in your INSTRUMENT routine. Enabling the return stack slows down profiling, so only do so if you need to profile return addresses.
IARG_UINT32 | uint32_t | Allows you to pass an arbitrary 32-bit value into an analysis routine. This IARG must be followed by the value you want to pass in. E.g. INS_InsertCall(..., IARG_UINT32, 42, ...).
IARG_PTR | void* | Allows you to pass an arbitrary void* value into an analysis routine. This IARG must be followed by the value you want to pass in. E.g. INS_InsertCall(..., IARG_PTR, ptr, ...).
IARG_ADDRINT | uintptr_t | Allows you to pass an arbitrary uintptr_t value into an analysis routine. This is an integer large enough to hold a pointer on your architecture. This IARG must be followed by the value you want to pass in. E.g. INS_InsertCall(..., IARG_ADDRINT, 42, ...).
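To see why the list must end with IARG_END, and why value-carrying IARGs must be followed immediately by their value, it helps to look at how a sentinel-terminated variadic list is typically consumed in C++. The sketch below is not LLVMPin's implementation. The ToyIARG enum, ToyArg and ToyCollectArgs are hypothetical and cover only a few tags.

```cpp
#include <cassert>
#include <cstdarg>
#include <stdint.h>
#include <vector>

// Hypothetical argument tags mirroring a few of the IARGs in the table.
enum ToyIARG {
  TOY_IARG_INST_PTR,
  TOY_IARG_MEMORY_EA,
  TOY_IARG_UINT32,  // must be followed by a 32-bit value
  TOY_IARG_END      // terminator
};

struct ToyArg {
  ToyIARG kind;
  uint32_t value;  // only meaningful for TOY_IARG_UINT32
};

// Walks a variadic tag list until TOY_IARG_END, consuming the extra value
// that follows TOY_IARG_UINT32.
inline std::vector<ToyArg> ToyCollectArgs(int first, ...) {
  std::vector<ToyArg> out;
  va_list ap;
  va_start(ap, first);
  for (int tag = first; tag != TOY_IARG_END; tag = va_arg(ap, int)) {
    assert(tag >= 0 && tag < TOY_IARG_END);  // reject unknown tags
    ToyArg a = { static_cast<ToyIARG>(tag), 0 };
    if (tag == TOY_IARG_UINT32)
      a.value = va_arg(ap, uint32_t);  // the value travels right after the tag
    out.push_back(a);
  }
  va_end(ap);
  return out;
}
```

A call such as ToyCollectArgs(TOY_IARG_INST_PTR, TOY_IARG_UINT32, 42, TOY_IARG_END) yields two entries. Without the terminator the loop would keep reading past the supplied arguments, which is undefined behavior in C++, so a forgotten IARG_END is worth checking for first when an instrumentation call misbehaves.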

void MISC_InitReturnStack()

Sets up the instrumentation used to maintain the return stack, which provides branch targets for return instructions. This slows things down a bit, so use it only if you plan to use IARG_BRANCH_TARGET_ADDR within an analysis callback that profiles return instructions. This is called from within instrumentation code. Note that the control tracing example in this document passes a pointer to the module, as in MISC_InitReturnStack(&M).
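The document does not describe how the return stack is implemented. One plausible scheme, sketched here under that caveat, is a shadow stack: every call pushes the id of the instruction where execution resumes, and every return pops it to learn its target. ToyReturnStack and its methods are hypothetical names.

```cpp
#include <cassert>
#include <cstddef>
#include <stdint.h>
#include <vector>

// Hypothetical shadow stack of instruction ids to return to.
class ToyReturnStack {
 public:
  // Called at a call site: remember the id of the instruction that
  // execution resumes at once the callee returns.
  void OnCall(uint32_t resume_id) { stack_.push_back(resume_id); }

  // Called at a return instruction: the most recent call site's resume id
  // is the branch target of this return.
  uint32_t OnReturn() {
    assert(!stack_.empty());  // a return must match an earlier call
    uint32_t target = stack_.back();
    stack_.pop_back();
    return target;
  }

  std::size_t Depth() const { return stack_.size(); }

 private:
  std::vector<uint32_t> stack_;
};
```

Under this scheme every call and return pays for an extra push or pop, which is consistent with the advice to enable the return stack only when return targets are actually needed.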

void PIN_AddFiniFunction(FiniTy func_ptr, void *arg)

Used within analysis code, usually in the AnalysisInit routine, to set up a fini callback that is executed when the program terminates. The first parameter is a pointer to a function of the form void (uint32_t, void*), and the second parameter is an arbitrary value that will be passed as the second argument to the fini routine.

RTN RTN_FindByAddress(uintptr_t addr)

Used within analysis code to return an RTN object for the function located at the provided memory address. Generally used to map the values provided by IARG_BRANCH_TARGET_ADDR (for calls) and IARG_MY_FUNADDR to RTN objects so that additional information can be queried.

string RTN_Name(RTN rtn)

Used within analysis code, returns the name of the provided routine.

uint32_t RTN_FirstIns(RTN rtn)

Used within analysis code, returns the unique id of the first instruction within the routine.

bool RTN_IsExternal(RTN rtn)

Used within analysis code, returns true if the routine is external.
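The RTN routines amount to a lookup from a native function address to a record describing that function. The sketch below models that idea with a std::map keyed by entry address. ToyRTN, ToyRtnTable, ToyFindByAddress and ToyDescribeCall are hypothetical, and how LLVMPin actually stores this information is not covered by this document.

```cpp
#include <cassert>
#include <map>
#include <stdint.h>
#include <string>

// Hypothetical routine record carrying the fields the RTN_* accessors expose.
struct ToyRTN {
  std::string name;    // cf. RTN_Name
  uint32_t first_ins;  // cf. RTN_FirstIns
  bool is_external;    // cf. RTN_IsExternal
};

// Hypothetical table keyed by the native entry address of each routine.
typedef std::map<uintptr_t, ToyRTN> ToyRtnTable;

// Exact-match lookup on the entry address. Returns 0 if the address is
// not a known routine entry.
inline const ToyRTN *ToyFindByAddress(const ToyRtnTable &table,
                                      uintptr_t addr) {
  ToyRtnTable::const_iterator it = table.find(addr);
  return it == table.end() ? 0 : &it->second;
}

// Formats a call edge in the "ins(ID) -> function(NAME)" style used by the
// control tracing example.
inline std::string ToyDescribeCall(const ToyRtnTable &table, uint32_t src,
                                   uintptr_t faddr) {
  const ToyRTN *rtn = ToyFindByAddress(table, faddr);
  assert(rtn != 0);
  return "ins(" + std::to_string(src) + ") -> function(" + rtn->name + ")";
}
```

An analysis routine that receives IARG_BRANCH_TARGET_ADDR for a call would do the equivalent of ToyDescribeCall: resolve the address first, then ask for the name.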

Examples

The following example prints out a memory trace of a program.

// Simple memory trace tool
// Logs memory accesses to the file memory.log
// Logs are of the form (<instruction id>, [write|read], <addr>, <size>)

#include "LLVMPin.h"

#ifdef BEGIN_ANALYSIS
#include <stdint.h>
#include <stdio.h>

FILE *fd_log;

void AnalysisFini(uint32_t code, void *v) {
  fclose(fd_log);
}

void AnalysisInit() {
  PIN_AddFiniFunction(AnalysisFini, 0);
  fd_log = fopen("memory.log", "w");
}

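ANALYSIS void LogAccess(uint32_t id, uint32_t type, uintptr_t addr,
                        uintptr_t size) {
  // uintptr_t may be wider than int, so cast to unsigned long and use
  // matching format specifiers. The address is printed in hex.
  fprintf(fd_log, "(%u, %s, %lx, %lu)\n", id, type ? "write" : "read",
          (unsigned long)addr, (unsigned long)size);
}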

#endif // END_ANALYSIS

#ifdef BEGIN_INSTRUMENTATION
#include "llvm/Module.h"
#include "llvm/Instructions.h"  // LoadInst, StoreInst

using namespace llvm;
using namespace llvmpin;

REQUIRED(AnalysisUsage &AU) {
}

INSTRUMENT(Pass *P, Module &M) {
  for(Module::iterator fi = M.begin(), fi_end = M.end(); fi != fi_end; ++fi) {
    if(fi->isDeclaration())
      continue;

    for(Function::iterator bi = fi->begin(), bi_end = fi->end();
        bi != bi_end; ++bi) {
      for(BasicBlock::iterator ii = bi->begin(), ii_end = bi->end();
          ii != ii_end; ++ii) {
        Instruction *ins = &*ii;
        if(isa<LoadInst>(ins)) {
          INS_InsertCall(ins, IPOINT_BEFORE, "LogAccess",
                         IARG_INST_PTR,
                         IARG_UINT32, 0,  // type: 0 = read
                         IARG_MEMORY_EA,
                         IARG_MEMORY_SIZE,
                         IARG_END);
        }
        if(isa<StoreInst>(ins)) {
          INS_InsertCall(ins, IPOINT_BEFORE, "LogAccess",
                         IARG_INST_PTR,
                         IARG_UINT32, 1,  // type: 1 = write
                         IARG_MEMORY_EA,
                         IARG_MEMORY_SIZE,
                         IARG_END);
        }
      }
    }
  }
}
#endif // END_INSTRUMENTATION
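Since memory.log has a fixed line format, (<instruction id>, [write|read], <addr>, <size>) with the address in hexadecimal, it is easy to post-process. The following is a minimal sketch of a line parser. ToyAccess and ToyParseAccess are hypothetical helpers and not part of LLVMPin.

```cpp
#include <cassert>
#include <stdio.h>
#include <string.h>

// One parsed record from memory.log.
struct ToyAccess {
  unsigned id;         // instruction id
  bool is_write;       // true for "write", false for "read"
  unsigned long addr;  // effective address (logged in hex)
  unsigned long size;  // access size in bytes
};

// Parses one line of the form "(12, write, 7ffe1234, 4)".
// Returns false if the line does not match that shape.
inline bool ToyParseAccess(const char *line, ToyAccess *out) {
  assert(line != 0 && out != 0);
  char kind[8];
  if (sscanf(line, "(%u, %7[a-z], %lx, %lu)", &out->id, kind, &out->addr,
             &out->size) != 4)
    return false;
  if (strcmp(kind, "write") == 0)
    out->is_write = true;
  else if (strcmp(kind, "read") == 0)
    out->is_write = false;
  else
    return false;
  return true;
}
```

Feeding each line of the log through ToyParseAccess gives a stream of records that can be filtered by instruction id or bucketed by address, for example to count writes per cache line.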

The following example prints out the dynamic control flow of a program. Notice how calls are handled differently from other control instructions. It also demonstrates the use of LLVM's command line support library.

// control_i.cpp
/* Control Tracing Tool
   This tool generates either a control flow trace of a program or an
   edge profile depending on the command line passed to llvmpin.

   Trace:        llvmpin control <program-bitcode> -trace
   Edge Profile: llvmpin control <program-bitcode> -edgeprof

   The results are printed in the following form, depending on whether the target is
   an instruction or a function. Of course, you could always modify the code
   to print the first instruction in a function rather than the function name if
   desired.

   ins(ID) -> ins(ID)
   ins(ID) -> function(NAME)

   For an edge profile, there is only a single edge printed along with the
   observed total edge count.

   The output file is control_trace.log for a trace, and control_edge.log for
   an edge profile.
*/

#include "LLVMPin.h"
#include "llvm/Support/CommandLine.h"

using namespace std;
using namespace llvmpin;
using namespace llvm;

cl::opt<bool> TraceKnob("trace", cl::desc("Requests a control flow trace"));
cl::opt<bool> EdgeKnob("edgeprof", cl::desc("Requests an edge profile"));

REQUIRED(AnalysisUsage &AU) {
}

INSTRUMENT(Pass *P, Module &M) {
  if(!(TraceKnob || EdgeKnob)) {
    llvm::cerr << "You must specify either -trace or -edgeprof.\n";
    exit(1);
  }

  // Set up return stack profiling. Necessary to track targets of return edges.
  MISC_InitReturnStack(&M);

  // Generates and initializes a global variable that can be accessed
  // in the analysis code. Used to pass the mode (trace vs edge) to the analysis
  // code.
  GLOBAL_Create(&M, "tool_mode",
		ConstantInt::get(Type::Int32Ty, EdgeKnob ? 1 : 0));

  const char *call_func = "Trace_LogCall";
  const char *branch_func = "Trace_LogBranch";
  if(EdgeKnob) {
    call_func = "Edge_LogCall";
    branch_func = "Edge_LogBranch";
  }

  for(Module::iterator fi = M.begin(), fi_end = M.end(); fi != fi_end; ++fi) {
    if(fi->isDeclaration())
      continue;

    for(Function::iterator bi = fi->begin(), bi_end = fi->end();
        bi != bi_end; ++bi) {
      for(BasicBlock::iterator ii = bi->begin(), ii_end = bi->end();
          ii != ii_end; ++ii) {
        Instruction *ins = &*ii;

        // Note: ReturnInst are considered branches
        if(INS_IsBranchOrCall(ins)) {
          if(INS_IsCall(ins))
            INS_InsertCall(ins, IPOINT_ANYWHERE, call_func, IARG_INST_PTR,
                           IARG_BRANCH_TARGET_ADDR, IARG_END);
          else
            INS_InsertCall(ins, IPOINT_ANYWHERE, branch_func, IARG_INST_PTR,
                           IARG_BRANCH_TARGET_ADDR, IARG_END);
        }
      }
    }
  }
}

// control_a.cpp
#include "LLVMPin.h"
#include <stdio.h>
#include <map>
#include <utility>

using namespace std;

extern int tool_mode;

FILE *fd_log;
typedef map<pair<uint32_t, uintptr_t>, uint64_t> call_count_ty;
typedef map<pair<uint32_t, uint32_t>,  uint64_t> edge_count_ty;
call_count_ty call_count;
edge_count_ty edge_count;

ANALYSIS void Trace_LogCall(uint32_t src, uintptr_t faddr) {
  RTN rtn = RTN_FindByAddress(faddr);
  // Instruction ids are unsigned, so print them with %u.
  fprintf(fd_log, "ins(%u) -> function(%s)\n", src, RTN_Name(rtn).c_str());
}

ANALYSIS void Trace_LogBranch(uint32_t src, uint32_t dest) {
  fprintf(fd_log, "ins(%u) -> ins(%u)\n", src, dest);
}

ANALYSIS void Edge_LogCall(uint32_t src, uintptr_t faddr) {
  call_count[make_pair(src, faddr)]++;
}

ANALYSIS void Edge_LogBranch(uint32_t src, uint32_t dest) {
  edge_count[make_pair(src, dest)]++;
}

void AnalysisFini(uint32_t code, void *v) {
  if(tool_mode == 1) {
    // Print edge profile
    for(edge_count_ty::iterator i = edge_count.begin(), e = edge_count.end();
        i != e; ++i) {
      // Counts are uint64_t, so cast and print them as unsigned long long.
      fprintf(fd_log, "ins(%u) -> ins(%u): %llu\n",
              i->first.first, i->first.second,
              (unsigned long long)i->second);
    }
    for(call_count_ty::iterator i = call_count.begin(), e = call_count.end();
        i != e; ++i) {
      RTN rtn = RTN_FindByAddress(i->first.second);
      fprintf(fd_log, "ins(%u) -> function(%s): %llu\n", i->first.first,
              RTN_Name(rtn).c_str(), (unsigned long long)i->second);
    }
  }
  fclose(fd_log);
}

void AnalysisInit() {
  PIN_AddFiniFunction(AnalysisFini, 0);
  if(tool_mode == 0)
    fd_log = fopen("control_trace.log", "w");   
  else
    fd_log = fopen("control_edge.log", "w");
}