From f81838660445c15a61a516bc7fe214760c105f69 Mon Sep 17 00:00:00 2001 From: Brian Gaeke Date: Thu, 26 Jun 2003 20:37:42 +0000 Subject: Here are the notes from our Reoptimizer meetings. git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@6923 91177308-0d34-0410-b5e6-96231b3b80d8 --- docs/HistoricalNotes/2003-06-25-Reoptimizer1.txt | 137 +++++++++++++++++++++++ docs/HistoricalNotes/2003-06-26-Reoptimizer2.txt | 110 ++++++++++++++++++ 2 files changed, 247 insertions(+) create mode 100644 docs/HistoricalNotes/2003-06-25-Reoptimizer1.txt create mode 100644 docs/HistoricalNotes/2003-06-26-Reoptimizer2.txt (limited to 'docs/HistoricalNotes') diff --git a/docs/HistoricalNotes/2003-06-25-Reoptimizer1.txt b/docs/HistoricalNotes/2003-06-25-Reoptimizer1.txt new file mode 100644 index 0000000000..a745784639 --- /dev/null +++ b/docs/HistoricalNotes/2003-06-25-Reoptimizer1.txt @@ -0,0 +1,137 @@ +Wed Jun 25 15:13:51 CDT 2003 + +First-level instrumentation +--------------------------- + +We use opt to do Bytecode-to-bytecode instrumentation. Look at +back-edges and insert llvm_first_trigger() function call which takes +no arguments and no return value. This instrumentation is designed to +be easy to remove, for instance by writing a NOP over the function +call instruction. + +Keep count of every call to llvm_first_trigger(), and maintain +counters in a map indexed by return address. If the trigger count +exceeds a threshold, we identify a hot loop and perform second-level +instrumentation on the hot loop region (the instructions between the +target of the back-edge and the branch that causes the back-edge). We +do not move code across basic-block boundaries. + + +Second-level instrumentation +--------------------------- + +We remove the first-level instrumentation by overwriting the CALL to +llvm_first_trigger() with a NOP. + +The reoptimizer maintains a map between machine-code basic blocks and +LLVM BasicBlock*s. We only keep track of paths that start at the +first machine-code basic block of the hot loop region. + +How do we keep track of which edges to instrument, and which edges are +exits from the hot region? 3 step process. + +1) Do a DFS from the first machine-code basic block of the hot loop +region and mark reachable edges. + +2) Do a DFS from the last machine-code basic block of the hot loop +region IGNORING back edges, and mark the edges which are reachable in +1) and also in 2) (i.e., must be reachable from both the start BB and +the end BB of the hot region). + +3) Mark BBs which end in edges that exit the hot region; we need to +instrument these differently. + +Assume that there is 1 free register. On SPARC we use %g1, which LLC +has agreed not to use. Shift a 1 into it at the beginning. At every +edge which corresponds to a conditional branch, we shift 0 for not +taken and 1 for taken into a register. This uniquely numbers the paths +through the hot region. Silently fail if we need more than 64 bits. + +At the end BB we call countPath and increment the counter based on %g1 +and the return address of the countPath call. We keep track of the +number of iterations and the number of paths. We only run this +version 30 or 40 times. + +Find the BBs that total 90% or more of execution, and aggregate them +together to form our trace. But we do not allow more than 5 paths; if +we have more than 5 we take the ones that are executed the most. We +verify our assumption that we picked a hot back-edge in first-level +instrumentation, by making sure that the number of times we took an +exit edge from the hot trace is less than 10% of the number of +iterations. + +LLC has been taught to recognize llvm_first_trigger() calls and NOT +generate saves and restores of caller-saved registers around these +calls. + + +Phase behavior +-------------- + +We turn off llvm_first_trigger() calls with NOPs, but this would hide +phase behavior from us (when some funcs/traces stop being hot and +others become hot.) + +We have a SIGALRM timer that counts time for us. Every time we get a +SIGALRM we look at our priority queue of locations where we have +removed llvm_first_trigger() calls. Each location is inserted along +with a time when we will next turn instrumentation back on for that +call site. If the time has arrived for a particular call site, we pop +that off the prio. queue and turn instrumentation back on for that +call site. + + +Generating traces +----------------- + +When we finally generate an optimized trace we first copy the code +into the trace cache. This leaves us with 3 copies of the code: the +original code, the instrumented code, and the optimized trace. The +optimized trace does not have instrumentation. The original code and +the instrumented code are modified to have a branch to the trace +cache, where the optimized traces are kept. + +We copy the code from the original to the instrumentation version +by tracing the LLVM-to-Machine code basic block map and then copying +each machine code basic block we think is in the hot region into the +trace cache. Then we instrument that code. The process is similar for +generating the final optimized trace; we copy the same basic blocks +because we might need to put in fixup code for exit BBs. + +LLVM basic blocks are not typically used in the Reoptimizer except +for the mapping information. + +We are restricted to using single instructions to branch between the +original code, trace, and instrumented code. So we have to keep the +code copies in memory near the original code (they can't be far enough +away that a single pc-relative branch would not work.) Malloc() or +data region space is too far away. this impacts the design of the +trace cache. + +We use a dummy function that is full of a bunch of for loops which we +overwrite with trace-cache code. The trace manager keeps track of +whether or not we have enough space in the trace cache, etc. + +The trace insertion routine takes an original start address, a vector +of machine instructions representing the trace, index of branches and +their corresponding absolute targets, and index of calls and their +corresponding absolute targets. + +The trace insertion routine is responsible for inserting branches from +the beginning of the original code to the beginning of the optimized +trace. This is because at some point the trace cache may run out of +space and it may have to evict a trace, at which point the branch to +the trace would also have to be removed. It uses a round-robin +replacement policy; we have found that this is almost as good as LRU +and better than random (especially because of problems fitting the new +trace in.) + +We cannot deal with discontiguous trace cache areas. The trace cache +is supposed to be cache-line-aligned, but it is not page-aligned. + +We generate instrumentation traces and optimized traces into separate +trace caches. We keep the instrumented code around because you don't +want to delete a trace when you still might have to return to it +(i.e., return from a llvm_first_trigger() or countPath() call.) + + diff --git a/docs/HistoricalNotes/2003-06-26-Reoptimizer2.txt b/docs/HistoricalNotes/2003-06-26-Reoptimizer2.txt new file mode 100644 index 0000000000..ec4b93fea0 --- /dev/null +++ b/docs/HistoricalNotes/2003-06-26-Reoptimizer2.txt @@ -0,0 +1,110 @@ +Thu Jun 26 14:43:04 CDT 2003 + +Information about BinInterface +------------------------------ + +Take in a set of instructions with some particular register +allocation. It allows you to add, modify, or delete some instructions, +in SSA form (kind of like LLVM's MachineInstrs.) Then re-allocate +registers. It assumes that the transformations you are doing are safe. +It does not update the mapping information or the LLVM representation +for the modified trace (so it would not, for instance, support +multiple optimization passes; passes have to be aware of and update +manually the mapping information.) + +The way you use it is you take the original code and provide it to +BinInterface; then you do optimizations to it, then you put it in the +trace cache. + +The BinInterface tries to find live-outs for traces so that it can do +register allocation on just the trace, and stitch the trace back into +the original code. It has to preserve the live-ins and live-outs when +it does its register allocation. (On exits from the trace we have +epilogues that copy live-outs back into the right registers, but +live-ins have to be in the right registers.) + + +Limitations of BinInterface +--------------------------- + +It does copy insertions for PHIs, which it infers from the machine +code. The mapping info inserted by LLC is not sufficient to determine +the PHIs. + +It does not handle integer or floating-point condition codes and it +does not handle floating-point register allocation. + +It is not aggressively able to use lots of registers. + +There is a problem with alloca: we cannot find our spill space for +spilling registers, normally allocated on the stack, if the trace +follows an alloca(). What might be an acceptable solution would be to +disable trace generation on functions that have variable-sized +alloca()s. Variable-sized allocas in the trace would also probably +screw things up. + +Because of the FP and alloca limitations, the BinInterface is +completely disabled right now. + + +Demo +---- + +This is a demo of the Ball & Larus version that does NOT use 2-level +profiling. + +1. Compile program with llvm-gcc. +2. Run opt -lowerswitch -paths -emitfuncs on the bytecode. + -lowerswitch change switch statements to branches + -paths Ball & Larus path-profiling algorithm + -emitfuncs emit the table of functions +3. Run llc to generate SPARC assembly code for the result of step 2. +4. Use g++ to link the (instrumented) assembly code. + +We use a script to do all this: +------------------------------------------------------------------------------ +#!/bin/sh +llvm-gcc $1.c -o $1 +opt -lowerswitch -paths -emitfuncs $1.bc > $1.run.bc +llc -f $1.run.bc +LIBS=$HOME/llvm_sparc/lib/Debug +GXX=/usr/dcs/software/evaluation/bin/g++ +$GXX -g -L $LIBS $1.run.s -o $1.run.llc \ +$LIBS/tracecache.o \ +$LIBS/mapinfo.o \ +$LIBS/trigger.o \ +$LIBS/profpaths.o \ +$LIBS/bininterface.o \ +$LIBS/support.o \ +$LIBS/vmcore.o \ +$LIBS/transformutils.o \ +$LIBS/bcreader.o \ +-lscalaropts -lscalaropts -lanalysis \ +-lmalloc -lcpc -lm -ldl +------------------------------------------------------------------------------ + +5. Run the resulting binary. You will see output from BinInterface +(described below) intermixed with the output from the program. + + +Output from BinInterface +------------------------ + +BinInterface's debugging code prints out the following stuff in order: + +1. Initial code provided to BinInterface with original register +allocation. + +2. Section 0 is the trace prolog, consisting mainly of live-ins and +register saves which will be restored in epilogs. + +3. Section 1 is the trace itself, in SSA form used by BinInterface, +along with the PHIs that are inserted. +PHIs are followed by the copies that implement them. +Each branch (i.e., out of the trace) is annotated with the +section number that represents the epilog it branches to. + +4. All the other sections starting with Section 2 are trace epilogs. +Every branch from the trace has to go to some epilog. + +5. After the last section is the register allocation output. -- cgit v1.2.3