Introduction

Let’s assume we need to do the dataflow analysis in a particular execution path in a certain binary. In order to collect as much data as possible, we should single-step a certain execution path, save registers values in each step, and then do some analysis. If we have all registers values, we can deduce values assigned to/from memory locations, by looking at instructions semantics.

Available methods

Let’s focus on the first stage: single-stepping. We have the following methods:
Method 1. win32 API debugging facilities
We can do it in an “official” way, that is:

  • attaching a debugger
  • forcing single-stepping by setting TF bit in eflags
  • collecting register values each time on return from WaitForDebugEvent()

However, it is hopelessly slow, because a context switch is necessary after each instruction, and the debugger needs to issue a few system calls to retrieve context and resume execution.

Method 2. In-process EXCEPTION_SINGLE_STEP trapping
A better way is to trap EXCEPTION_SINGLE_STEP not in the debugger, but in the analyzed process itself. We can set up a SEH, and in the SEH handler collect necessary data, and later resume the execution. We can inject into a process a dll which will do the necessary preparations. The “sha1sum_test.exe” binary, if given a second argument, will execute the critical loop with TF set in eflags, and an exception handler will be called after each instruction.The speed gain is about x10 in comparison with the previous solution. Still, exception dispatching both in kernel and in userspace components imposes significant overhead.
Visit http://www.cybertech.net and you can find more advanced implementations.
Maybe it would be more efficient to implement a fast path in the kernel exception handler (just collect register values and resume execution).

A faster solution

Method 3. [purely in] User Mode Single Stepping
Why do we need TF at all ? If the instruction at address X is about to be executed, we can overwrite the next instruction with “jmp our_handler“. (we will need to make the .text segment writable first). our_handler should

  1. switch to a temporary stack; save the registers with pusha+pushf
  2. restore the overwritten instructions
  3. move the saved registers values to some storage
  4. compute where the current instruction transfer the execution; let it be the address X’
  5. overwrite X’ with “jmp our_handler
  6. restore registers with popf+popa; restore eriginal esp; return to the next instruction

The tough part is 4. We need the following:

  • for instructions which do not transfer control (so, anything besides jmp/jxx/call/loop/ret), we need to know an instruction length. It is easy: we can compute all instructions lengths *before* running a program, store it in some file, which will be subsequently mmapped accessible by our_handler.
  • for jmp/ret/loop/”call fixed_addr” we need to add the jump offset to the current address - easy.
  • for jxx instructions, we need to consult eflags whether the execution is altered or not - doable.
  • if we face a computed call/jump, we could disassemble it on the fly and deduce the target, but it is complicated due to variety of addressing modes of 386. The easier way is to trap to debugger, which will single-step the problematic instruction, and later resume software tracing. The overhead should be small because computed calls/jumps are relatively rare. And we can still simulate the most frequent cases, say “call eax”.
    Additionally, this approach helps when our disassembler cannot recognize a particular instruction.

Implementation

The above functionality has been implemented in “umss” project, in McAfee labs. The package contains the following components:

  • umss.cpp: it is supposed to write a map of instructions lengths. It uses the “boomerang” project (http://boomerang.sourceforge.net/). In fact, if we just need to get instructions lengths, any disassembly library would do; however, boomerang is unmatched when it comes to analyse instructions semantics (the said analysis is still to be implemented).
  • inject.dll: it is a library to be injected into any process. It implements single-stepping. If it does not know how to handle a particular instruction, it jumps to “\xcc”, and the attached debugger takes care of it.
  • tracer.cpp. It implements the rest of the required functionality.

In order to collect some benchmarks, a simple program was written which runs a loop a given number of times. It can be traced with umss, or, if given two arguments, trace itself with method II. Results:

  • native run (without any tracing):
    ret=-787054544, time=0.047312ms, loops/ms=211361.374858
  • tracing with EXCEPTION_SINGLE_STEP handler (two arguments given to the test program):
    ret=-787054544, time=1085.968872ms, loops/ms=9.208367
  • ordinary tracing with WaitForDebugEvent():
    ret=-787054544, time=9999.467773ms, loops/ms=1.000053
  • umss:
    ret=-787054544, time=95.365204ms, loops/ms=104.860050

As we see, umss method is about 10 times faster then exception handler, and over 100 faster than the ordinary debugging.
All the execution times were obtained with disabled storing of register values (only the overhead of tracing was important). Anyway, in umss the log file is memory mapped, so especially in case of a SMP (or dual-core) system the performance impact imposed by disk writes should be minimal.Additionally, in order to improve the efficiency, we do not want to trace through library calls (well, it should be configurable which dll we want to trace). If inject.dll observes that the execution leaves the .exe segment, it will overwrite the return address location with its own handler and execute t he library function without tracing; when the library function returns, tracing resumes.

Currently the umss package is in early stage, just enough to confirm usability of the approach and conduct benchmarks. It should be straightforward to implement simple enhancements:

  • implement more computed jump/call instructions
  • currently only a single executable section map is supported
  • implement injecting the dll upon LOAD_DLL_DEBUG_EVENT of a library we want to trace
  • perhaps optimize inject.dll better. The interesting part is that it should execute only ca 80 own instruction (per each instruction in the traced process) in a typical case, yet the performance hit is x2000. Probably the parallelism of Pentium is affected, as well as memory caches efficiency.
  • finally, implement the crucial part: flow analysis

The umss package can be downloaded from Sourceforge umss download page