Showing posts with label python. Show all posts
Showing posts with label python. Show all posts

Thursday, August 23, 2007

PyDbg hacks

Pedram just posted on his OpenRCE blog some awesome PyDBG hacks.

Wednesday, August 22, 2007

pefile 1.2.7

Just pushed out an updated version of pefile with some minor enhancements and fixes:
  • Added additional IMAGE_SUBSYSTEM_* flags
  • Added processing of the Optional Header's DllCharacteristics
  • Time/date fileds are now reported as UTC times
  • Added warning message for suspicious entry point addresses
  • Several minor parsing bugs fixed

Tuesday, August 21, 2007

Great Python overview

Alex Martelli gave a talk in the Baypiggies meeting providing a great overview of Python. Check out the video and slides.

Friday, August 10, 2007

Black Hat Slides

Although originally Halvar Flake and I were supposed to present together in a quick turbo-talk at Black Hat in Las Vegas, he unfortunately couldn't make it to the conference for reasons that have been already discussed.

I ended up sticking mostly to the original plan for the talk and presented some Python tools to automate reverse engineering and analysis processes.

I've just put the slides up here.

Friday, July 06, 2007

Scanning data for entropy anomalies II

Recently Phantal (aka Brian) left some comments on my blog and in OpenRCE on some calculations he did following up on my post Scanning data for entropy anomalies.

He develops the algorithm aiming at improving the execution speed of the entropy "scanner" example I had shown. I ran through his steps and arrived to the same conclusions he did on his latest comment. I just thought it'd be worth showing his work as a separate post rather than just a comment.

His idea is, by looking into the standard definition of entropy , to isolate all that doesn't change in the expression when the window slides and just update the entropy, instead of blindly recalculating it's value from scratch for each offset of the scan window.

Shannon' s entropy, usually represented as H, takes the following form if we work with the 256 possible byte values as the symbols :



where p(b) is the probability of the occurrence of a given byte.

H, the entropy, will tend towards its maximum value, 8, if the data has the maximum possible entropy. In such case the probability of each byte occurring would be the same which produces

Note that, although this is usually thought of as measuring the "amount of randomness", it is not that much the case. A sequence of bytes starting at 0 and increasing until 255 going through all the values in order would reach the maximum entropy value 8, even that it is all but random.

The probability of a given byte appearing in our window can also be expressed as . being the number of times the byte appears within the window and the width of the window.

The expression for the entropy can be expanded as follows


The entropy after sliding the window, , will have the same sum expansion except for two terms, the ones of the bytes going out and entering the window. We can then just update those and recalculate the expression by first removing the old values for the incoming and outgoing bytes and then adding the new values for both, after updating their count.



and that's all. Now on to some implementations in Mathematica and Python (but creating a Mathematica function with Pythonika). His implementation in C can be found in the comments of the previous post.



EntropyScan = Function[{Data, WindowScanSize},

  SummationTerm [Prob_] := If[Prob > 0, Prob Log[2, Prob], 0];

  (* Get the initial chunk and calculate the entropy *)
  CurrentChunk = Data[[ Range[1, WindowScanSize] ]];

  (* Calculate initial byte count and probabilities *)
  ByteCounts = Table[Count[CurrentChunk, i - 1], {i, 1, 256}];
  ByteProbs = Table[ByteCounts[[i]]/WindowScanSize, {i, 1, 256}];

  FilteredByteProbs = Select[ByteProbs, # > 0 &];
  H = - Total[
    Table[FilteredByteProbs[[i]] Log[2, FilteredByteProbs[[i]]],
    {i, 1, Length[FilteredByteProbs]}]];
  Entropies = {H};

  (* Slide the window and recalculate for incoming and outgoing bytes *)
  For[offset = 1, offset + WindowScanSize <= Length[Data], offset++,

    (* Get incoming and outgoing bytes *)
    ByteOut = Data[[offset]] + 1;
    ByteIn = Data[[offset + WindowScanSize]] + 1;

    (* Get the old probabilities *)
    OldValByteOut = SummationTerm[ByteProbs[[ByteOut]]];
    OldValByteIn = SummationTerm[ByteProbs[[ByteIn]]];

    (* Update counters and values *)
    ByteCounts[[ByteOut]]--;
    ByteCounts[[ByteIn]]++;
    ByteProbs[[ByteOut]] = ByteCounts[[ByteOut]]/WindowScanSize;
    ByteProbs[[ByteIn]] = ByteCounts[[ByteIn]]/WindowScanSize;

    (* Get the new probabilities *)
    ValByteOut = SummationTerm[ByteProbs[[ByteOut]]];
    ValByteIn = SummationTerm[ByteProbs[[ByteIn]]];

    (* Update the entropy *)
    H = H + OldValByteOut + OldValByteIn - ValByteIn - ValByteOut;

    Entropies = Append[Entropies, H];
  ];

  Entropies
];





Py["import math"]

EntropyScanPython = PyFunction["\<
def entropy_scan(args):
  data = args[0]
  window_size = float(args[1])

  summation_term = lambda p: p*math.log(p,2) if p>0 else 0

  current_chunk = data[:int(window_size)]
  byte_counts = [
    len(filter(lambda a:a==i, current_chunk))
    for i in range(256)]
  byte_probs = [byte_counts[i]/window_size for i in range(256)]

  H = -sum(
    [byte_probs[i]*math.log(byte_probs[i], 2)
      for i in range(256) if byte_probs[i]>0])
  entropies = [H]

  for offset in range(len(data)-window_size):
    byte_out, byte_in = data[offset], data[int(offset+window_size)]

    old_val_byte_out = summation_term(byte_probs[byte_out])
    old_val_byte_in = summation_term(byte_probs[byte_in])

    byte_counts[byte_out] -= 1;
    byte_counts[byte_in] += 1;
    byte_probs[byte_out] = byte_counts[byte_out]/window_size;
    byte_probs[byte_in] = byte_counts[byte_in]/window_size;

    val_byte_out = summation_term(byte_probs[byte_out])
    val_byte_in = summation_term(byte_probs[byte_in])

    H = H + old_val_byte_out + old_val_byte_in - val_byte_out - val_byte_in
    entropies.append(H)
  return entropies
\>"];

Friday, June 22, 2007

Powerset and the garden path

I've recently bumped again into Powerset. I had previously heard about them when they got some people from PARC (if I remember correctly) and went into attempting to build something that I had always dreamed about. The guys at Powerset are tackling one of the hardest and most interesting (in my opinion) problems currently known, that is, helping computers process and "understand" natural language and use those results to make information more accessible. From my humble amateur-linguistic-aficionado point of view, they are doing it a great work there. Soon I will have a chance to see it live, first hand, and I can't wait.

In one of their latest posts they discuss some ambiguities that arise from using words with several meanings in contexts where the least used of the meanings is taken into use, leading to misunderstandings.

To put it in other terms, the problems arise when using the less known meanings of words in a way that the brain is misled when starting to read a sentence and leads to misunderstand the subsequent words (which can also have several meanings which depend on how one understood the start of the sentence) .
Normally, once the sentence has been read several times, the brain finally "switches" into the right interpretation of the different meanings of those words in a way that the whole construct becomes coherent.
I personally see it as resembling the visual phenomena where the brain interprets specially crafted images in different ways, switching back and forth between their different interpretations, like in the Young Girl-Old Woman Illusion or the Rabbit-Duck one.

In the case of these garden path sentences, as they are commonly called, the brain gets confused because of the dependencies between the words and their meanings.

As the brain starts reading a sentence, it will attempt to predict what follows, and it's amazingly good at that. The trick is to throw it off track by using words with multiple meanings.

In the example that they have as their post title "Search Engines Leaking Oil for Holes" the brain is tricked by taking the most common meaning of the first two words (a composite noun or collocation) and attempting to interpret it in a way that later becomes rather confusing when reaching "leaking oil".



Re-reading the sentence can lead to a second interpretation



In their post they ask how hard would be to find an automated way of generating such garden path sentences and they describe a pseudo-algorithm like the following:


You can make your own garden path sentences by following a few simple heuristics (...). The trick is to choose words that can act as both nouns and verbs, or as both adjectives and nouns, words like store, search, and post. Then follow the ambiguous word by another word that can take on more than one form. The hard part is to then add on another noun phrase that makes sense with the less common interpretation of the second word.


Trying to follow their heuristics, the first thing to do would be to find sets of words that can be both a noun and a verb or and adjective and a noun. Thanks to WordNet, PyWordNet and the mash-up of those and more provided by the guys from NodeBox that's not such a hard task as it would have otherwise been without such toolset.

Sets of words fulfilling those requirements can be build in a few lines of Python.


# Collect nouns, verbs and adjectives
verbs = set( wordnet.V.keys() )
nouns = set( wordnet.N.keys() )
adjectives = set( wordnet.ADJ.keys() )

# Pick the ones that can work both as nouns and verbs or as nouns and adjectives
noun_verbs = verbs.intersection(nouns)
noun_adjectives = adjectives.intersection(nouns)

print 'Found % d words that are both verbs and nouns' % len(noun_verbs)
print 'Found % d words that are both adjectives and nouns' % len(noun_adjectives)

Found 4096 words that are both verbs and nouns
Found 3138 words that are both adjectives and nouns



I will also need to have some means of knowing which words are more likely to follow a given one. For that I will reach into some datasets I collected years ago for some computational linguistics experiments I did. Using a small corpora of 2.071.007 sentences built out of books from the Project Gutenberg and parsing it through some Python code I obtained 16.057.624 word pairs, 2.365.383 of them unique. That will provide me with some numbers on what words are likely to follow others.

I can now look for frequently used words that can be both nouns and verbs. In the following line "occurrences" is a list containing all the words and the number of times they appear. They are filtered to only show the ones that are both nouns and verbs.


print [word for word in occurrences[:300] if word[0] in noun_verbs]

{{"be", 10070}, {"have", 7827}, {"like", 6577}, {"will", 6201}, {"out", 5422}, {"still", 4136}, {"even", 4049}, {"man", 3957}, {"can", 3866}, {"down", 3376}, {"see", 3104}, {"do", 3097}, {"time", 2729}, {"people", 2663}, {"well", 2602}, {"last", 2581}, {"back", 2337}, {"white", 2250}, {"make", 2088}, {"till", 2083}, {"come", 2048}, {"black", 2021}, {"general", 2004}, {"found", 1935}, {"light", 1918}, {"round", 1910}, {"go", 1880}, {"better", 1815}, {"face", 1755}, {"saw", 1742}, {"lay", 1740}, {"work", 1682}, {"form", 1678}, {"let", 1673}, {"right", 1654}, {"set", 1647}, {"lord", 1621}, {"look", 1579}, {"take", 1577}, {"hand", 1574}, {"head", 1546}, {"full", 1544}, {"best", 1538}, {"put", 1534}, {"state", 1531}, {"party", 1522}, {"love", 1517}, {"place", 1493}, {"house", 1491}, {"say", 1440}, {"get", 1401}, {"part", 1386}, {"water", 1385}, {"name", 1384}, {"second", 1370}, {"give", 1344}, {"felt", 1342}, {"present", 1327}, {"fell", 1320}, {"land", 1319}, {"use", 1311}}



Now given a word it's possible to find other words that would often follow it and can also have several functions. For instance, lets see what comes out for "look":


# Pick words following 'look' that can be both nouns and verbs
succeeding_words = [p for p in word_sparse['look'].items () if p[0] in noun_verbs]
# Sort them by the most frequently used to the least
succeeding_words.sort ( lambda a, b : -1 if a[1] > b[1] else 0 if a[1] == b[0] else 1)
print succeeding_words[: 100]

"[('like', 255), ('out', 185), ('down', 124), ('back', 115), ('forward', 82), ('round', 49), ('well', 42), ('pale', 26), ('better', 16), ('black', 9), ('right', 7), ('full', 6), ('white', 5), ('blue', 5), ('grave', 5), ('even', 4), ('still', 4), ('double', 4), ('cross', 4), ('close', 3)]



And the results for "form"


[('name', 185), ('part', 18), ('can', 8), ('saint', 8), ('like', 7), ('see', 5), ('will', 5), ('state', 4), ('ice', 3), ('till', 3), ('lay', 3), ('french', 3), ('people', 3), ('found', 2), ('out', 2), ('put', 2), ('well', 2), ('note', 2), ('black', 2), ('starch', 2)]



Although not being a native English speaker makes this a tiny bit more challenging, I can see how one could play with combinations like "look, like", "look, still", "look, well", "form, name", "form, like", etc. to build slightly confusing sentences.

Collocations also are great to mislead the brain whenever one of the words has more than a meaning ("visitor center", "search engines", "meeting point") .
A quick hack to try to spot some automatically could be to look for pairs of words often appearing together and having the desired properties of fulfilling more than one function.
But given the low quality results in the shown next; one could, for instance, also take into account the relative frequency of a noun-noun compound as compared to other pairings of the nouns, to try to see how much more often those two words appear together than with others. There's extensive literature on how to improve this and this was meant as a short-ish blog post after all.



print [ p for p in word_pairs_occurrences[:10000] if en.is_noun(p[0][0]) and en.is_verb(p[0][0]) and en.is_noun(p[0][1]) and en.is_verb(p[0][1]) ]

{{will, be}, {can, be}, {labor, force}, {will, have}, {be, found}, {can, do}, {come, back}, {exchange, rate}, {will, do}, {will, make}, {can, read}, {prime, minister}, {will, go}, {will, give}, {have, come}, {come, out}, {go, back}, {right, hand}, {set, out}, {go, out}, {find, out}, {can, see}, {will, come}, {come, down}, {will, take}, {have, found}, {short, form}, {will, tell}, {birth, total}, {get, out}, {go, down}, {land, use}, {be, put}, {can, tell}, {father, brown}, {will, find}, {white, man}, {put, out}, {take, care}, {can, get}, {dare, say}, {will, see}, {can, make}, {be, well}, {short, time}, {can, have}, {found, out}, {lay, down}, {second, time}, {be, better}, {be, read}, {can, think}, {go, home}, {lord, will}, {birth, rate}, {hoist, side}, {meter, gauge}, {ftp, program}, {be, true}, {be, like}, {last, time}, {look, like}, {will, say}, {man, can}, {set, down}, {license, fee}, {come, home}, {can, find}, {make, out}, {put, down}, {give, notice}, {can, say}, {be, cut}, {take, place}, {low, voice}, {will, try}, {cast, out}, {get, index}, {have, put}, {lie, down}, {can, go}, {radio, relay}, {still, be}, {will, get}, {be, ready}, {well, be}, {wait, till}, {get, back}, {tax, return}, {free, copyright}, {fell, down}, {can, copy}, {set, bin}, {have, felt}, {look, out}, {be, out}, {form, name}, {satellite, earth}, {burst, out}, {will, keep}, {be, free}, {can, give}, {double, track}, {people, have}, {cut, down}, {will, show}, {fish, catch}, {turn, out}, {carry, out}, {well, have}, {work, force}, {be, set}, {have, set}, {miss, garland}, {will, put}, {can, take}, {do, well}, {let, go}, {mine, hand}, {earth, station}, {fell, back}, {take, heed}, {short, distance}, {air, force}, {can, help}, {will, help}, {cry, out}, {will, let}, {free, state}, {feel, like}, {will, cause}, {present, time}, {will, think}, {be, present}, {will, return}, {cast, down}, {black, man}, {narrow, gauge}, {bulletin, board}, {man, be}, {be, right}, {dry, tree}, {will, set}, {be, back}, {point, out}, {right, side}, {can, come}, {look, down}, {will, call}, {run, down}, {file, size}, {major, transport}, {labor, party}, {be, content}, {will, leave}, {man, will}, {will, look}, {can, use}, {need, be}}


Definitely the problem is very challenging with current tools, but it's always fun to give it a spin. With a few hours and limited tools I could only get to think of some ways to find good candidate words for garden path sentences. Definitely nowhere close to actually completing full sentences.

It would be great to expand on this toy research and make it actually useful and interesting. Using larger data sets (like this Google data set) from which to extract word relationships would be a good way to start. Having statistics for trigrams, fourgrams, etc. of words would make things better, having more contextual information would be possible to get more meaningful constructs by ensuring that the chosen words occur close within a small context.

I can think of more ways of improving it, most of them involving large datasets and lots of computational power... gosh, I'm getting carried away thinking about this...

Looking forward to Powerset letting people play with their tools, I' m sure that implementing ideas like the one discussed in this rant will become much easier.

Thursday, June 14, 2007

BinNavi: Simplifying code II. The implementation

Following up the ideas presented on a previous post, this one will discuss their implementation.

We wanted to write some code that would help us understand the meaning of certain operations encoded in assembly instructions. We wanted to do that by going through the instructions and building extended expressions representing the relations between the input and output values of a basic block.

In order to implement the desired functionality, we will write some code to replace parts of the operand trees with existing, known values, from the already analyzed code.
  • The code will iterate through the operand tree until it finds leafs (labeled as ’l’)
  • The leafs will then be substituted if their contents are known
  • The new, augmented, operand tree will be returned

The main part of this simple operand expression "builder" is to parse instructions and keep track of the values of the operands for the current instruction. If any values can be augmented by including the operations performed in previous instructions, those will be replaced by the result of such operations.
  • The function will take a dictionary (aka associative array) instance that will map the registers and memory addresses to their known contents
  • It will parse each instruction and update the dictionary according to the semantics of the opcode which in this example won't be totally accurate but will suffice to illustrate the process
  • It will return an updated dictionary with the results of tracking the values

This first function will retrieve an operand tree and return it as a nested list.
(The code presented in this post makes use of BinNavi's Python scripting to access operands, instructions, etc)


# Retrieve an operand as nested lists
#
def get_optree(op, parent=None):
    if not parent:
        return get_optree(op, op.get_roots()[0])[0]
    tree = []
    children = op.get_children(parent)
    for child in children:
        subtree = get_optree(op, child)
        if subtree:
            tree.append( [str(child), subtree] )
        else:
            tree.append( [’l’, str(child)] )
    return tree



And the result of running it on the following instructions will be:

004A4703: lea b4 ecx,b4 ss [ebp+4294967240]
004A4706: push b4 26
004A4708: sub b4 eax,b4 ecx
004A470A: pop b4 ecx
004A470B: add b4 eax,b4 13

[[’l’, ’ecx’],
    [’ss’, [[’[’, [
    [’+’, [[’l’, ’ebp’], [’l’, ’4294967240’]]]]]]]]
[[’l’, ’26’]]
[[’l’, ’eax’], [’l’, ’ecx’]]
[[’l’, ’ecx’]]
[[’l’, ’eax’], [’l’, ’13’]]



The following function is in charge of substituting any values in the tree for which their content is known. For example, if ecx is used as a source operand but in a previous instruction is was assigned a value of 10, we would substitute ecx in the tree with 10.


# Substitute the leafs in the operand tree
# with the last values obtained during the
# analysis
#
def substitute_leafs(tree, operand_trees):
    if not isinstance(tree, list):
        return tree
    if isinstance(tree[0], str) and tree[0]==’l’:
        return operand_trees.get(str(tree), tree)
    else:
        for idx, elm in enumerate(tree):
            tree[idx] = substitute_leafs(elm, operand_trees)
    return tree


The next function will be responsible for parsing the semantics of the instructions (in a merely illustrative way, it's far from implementing strictly the semantics of the Intel instructions used here). Instructions responsible for basic arithmetic and assignment operations will be processed and their results stored in the operand tree that we are processing.


# Parse instructions

def parse_instruction(insn, stack, operand_trees=dict()):
    # Process the MOV instruction
    if insn.mnem == 'mov':
        # Get the operands
        dst, src = map(get_optree, insn.operands)
        # Perform substitutions for any known values
        src = substitute_leafs(src, operand_trees)
        # Assing the source operand to the destination
        operand_trees[str(dst)] = src
    elif insn.mnem == 'add':
        dst, src = map(get_optree, insn.operands)
        dst_expr = operand_trees.get(str(dst), dst)
        src = substitute_leafs(src, operand_trees)
        src_expr = operand_trees.get(str(src), src)
        operand_trees[str(dst)] = ['+', [dst_expr, src_expr]]
    elif insn.mnem == 'sub':
        dst, src = map(get_optree, insn.operands)
        dst_expr = operand_trees.get(str(dst), dst)
        src = substitute_leafs(src, operand_trees)
        src_expr = operand_trees.get(str(src), src)
        operand_trees[str(dst)] = ['+', [dst_expr, ['-', [src_expr]]]]
    elif insn.mnem == 'lea':
        dst, src = map(get_optree, insn.operands)
        src = substitute_leafs(src, operand_trees)
        src_expr = operand_trees.get(str(src), src)
        operand_trees[str(dst)] = src[1][0][1][0]
    elif insn.mnem == 'push':
        dst, src = ['l', 'STACK_TOP'], get_optree(insn.operands[0])
        src = substitute_leafs(src, operand_trees)
        src_expr = operand_trees.get(str(src), src)
        stack.append(src)
    elif insn.mnem == 'pop':
        dst, src = get_optree(insn.operands[0]), ['l', 'STACK_TOP']
        src = stack.pop()
        operand_trees[str(dst)] = src
    elif insn.mnem == 'cdq':
        pass
    elif insn.mnem == 'idiv':
        src = get_optree(insn.operands[0])
        # dst is always eax
        dst = ['l', 'eax']
        src = substitute_leafs(src, operand_trees)
        src_expr = operand_trees.get(str(src), src)
        orig_src = operand_trees.get(str(dst), dst)
        operand_trees[str(dst)] = ['*',
            [orig_src, ['INV', [src_expr]]] ]
        operand_trees[str(['l', 'edx'])] = ['%',
            [orig_src, ['mod', [src_expr]]] ]
    return operand_trees


After running the previous functions on a simple basic block, we will obtain operand trees representing the operations performed by the instructions on their operands.



In the end we can ask the value of a register and we will see, in a single extended expression, the value of the register as a result of all operations taking place in the basic block.


>>> operand_trees["['l', 'ecx']"]
['l', '(0x1A, 26)']

>>> operand_trees["['l', 'al']"]
['ss',
  [['[',
    [['+',
      [['l', 'ebp'],
      ['%',
        [['+',
          [['+',
            [['l', 'eax'],
            ['-', [['+',
              [['l', 'ebp'],
              ['l', 'lowercase_chars']]]]]]],
          ['l', '13']]],
        ['mod', [['l', '(0x1A, 26)']]]]],
      ['l', 'lowercase_chars']]]]]]]




We can also graphically represent the result of the operand composition in BinNavi.
The following function will traverse the operand tree and add the nodes to the graph instance provided.


# Convert nested lists into a graph object
# that can be visualized
#
def tree_to_graph(tree, g=Graph(), parent=None):
    if isinstance(tree[1], list):
        node_str = tree[0]
        node = g.add_node(str(node_str))
        if parent:
            g.add_edge(parent, node)
        for child in tree[1]:
            tree_to_graph(child, g, node)
    else:
        leaf = g.add_node(str(tree[1]))
        if parent:
            g.add_edge(parent, leaf)
    return g



And running it on one of the composed operand trees results in:


tree_to_graph(operand_trees["['l', 'al']"], Graph()).show()





As a last step we can represent the expression textually and add it to the node.


# Convert nested lists into a single expression as
# a string
#

def tree_to_expression(tree):
    root = tree[0]
    children = tree[1]
    if isinstance(children, list):
        parsed_children = [tree_to_expression(child) for child in children]
    else:
        return children
    if root == '*':
        return '(' + '*'.join(parsed_children) + ')'
    elif root == '+':
        return '(' + '+'.join(parsed_children) + ')'
    elif root == '-':
        return '(-' + parsed_children[0] + ')'
    elif root == '[':
        return '[' + ' '.join(parsed_children) + ']'
    elif root == 'INV':
        return 'INV('+parsed_children[0]+')'
    elif root == 'mod':
        return parsed_children[0]
    elif root == '%':
        return '(' + parsed_children[0]+'%'+parsed_children[1] + ')'
    elif root == 'l':
        return children
    elif root == 'ss':
        return parsed_children[0]



And running the following in BinNavi's Python interpreter will generate the expression and add it to the basic block as a comment in BinNavi's view.


fgv = Graphview.get_open_views()[0]
expr = tree_to_expression(operand_trees["['l', 'al']"])
fgv.set_node_comment(0x4a4703L, expr)





The resulting expression represents the ROT13 operation for lowercase characters. (The code belongs to the Mydoom.D worm)
ebp+lowercase_chars points to a string containing the alphabet. eax is the pointer to the character to encode/decode within the alphabet, and its position is calculated by subtracting the beginning address of the alphabet from eax. Then 13 is added to it and wrapped around by getting the value modulo 26 and added again to the beginning of the alphabet to obtain the final decoded/encoded character.
This expression is a nice and condensed form of expressing the functionality implemented in the basic block.

This post illustrates the versatility of a properly abstracted representation of the operands and instructions and the level of automation that can be achieved in a relatively simple manner with the help of the interactive Python scripting in BinNavi.

Sunday, May 20, 2007

ida2sql, exporting IDA's dissasemblies to SQL

Because BinNavi nowadays reads all the disassembly information from a SQL database, we needed some means of exporting the information to it. ida2sql is the result, it is a monster set of Python scripts I wrote (all nicely wrapped in a couple of files for easy installation) that will export the information from an IDB (only Intel, ARM and PPC so far. The latter two in experimental mode) into a MySQL database.
It's available for download from my site together with installation and usage instructions. It needs the IDAPython plug-in to run.

Any feedback is welcome.

I posted a while ago about the database schema. One can do pretty neat things when having the dissasembly in such form...

Tuesday, May 15, 2007

BinNavi: Simplifying code

This in a slightly expanded example of what Halvar and I showed in the Sabre-Security trainings last October. With it I illustrate how to, in BinNavi, build a small dataflow analyzer in Python so that it can do some work for us.

Lets imagine that there's a basic block like the following, which is part of the ROT13 routine in the Mydoom worm.



Ideally we would like to have such tool what would spare us from having to figure out what the assembly code in that block does, regardless of how simple it might be in this case. Such functionality would make analysis much less tedious and less error prone.

Given that BinNavi offers an interactive Python interpreter with access to every single function, basic block, instruction, operand and expressions within an operand, this is a problem that we could try to address with some scripting.



The dataflow reconstruction


Assembly instructions move data around and modify it by performing different operations on it. Values are fetched from registers, memory or specified explicitly and the results are stored back in registers or memory.

So, if we would like to be able to transform the operations into something more readable we could track all the assignments and operations and compose some sort of combined expression. That's actually much easier than it might sound like.

Taking advantage of Python's dictionaries (aka hashes, aka associative arrays, aka maps) we could proceed as follows.

We process individually each instruction. As each has different semantics, we have to deal with them individually (some groups of similar instructions can be handled somehow generically tho.).

First the instruction might push or pop something from the stack and we could emulate that behaviour with a simple list.

Second the instruction might get data from a source register/memory/immediate and, upon operating on it, assign it to a destination operand. We will just consider these two cases for the time being, this is only an exercise after all ;)

In this case we could store in our dictionary the source item using as key the destination register or memory location. That would allow us to keep the current state of the registers and memory locations. For instance: lea eax, [edi+2]

The destination of the lea operation is eax the source/result is [edi+2]. Hence we could have an entry in our dictionary that maps eax to edi+2, there are no brackets here because lea will store the value of the addition, not its contents, so we don't need the brackets (which stand for memory-dereference)

If a register is accessed later on by another instruction and we know where the value comes from (because we stored the previous assignment operation in our dictionary) we could compose it together, having a combined expression that tracks all data. Following the example, imagine the instruction mov bl, [eax+esi], that will put into bl the value pointed to by [eax+esi], in our dictionary we would map bl to have the value [eax+esi]

But, if we would check if we have any of the expression parts already in our dictionary (remember that eax maps to edi+2) then we can expand [eax+esi] into [edi+2+esi]. One might be able to infer now that this procedure can be taken further, composing longer expressions that would reduce the amount and complexity of the assembly code to look at.

As another illustrative example, one could assign the the AL key in the dictionary the nested lists representing the expression tree for the memory reference as seen in the following figure.



As just seen, operands are treated as graphs by BinNavi. That allows for a versatile manipulation of their contents. We can transform the tree to nested lists and operate on those.


  • The form of the nested lists is

  • [root , [child_1 , child_2 , ...]]

  • Where each children has the same form


For instance, the following two operand trees



would take the following from as nested lists:

[b4, [ecx]]

[b4, [ss, ['[', [+, [ebp], [4294967240]]]]]

In short, in order to do basic reconstruction of the data flow the following operations need to be performed:

  • Keep track of all assigned values using a dictionary

  • Substituting leafs with values already in our dictionary

  • Perform substitutions and assign SRCs to DSTs hence updating the dictionary to the last state


In a following post, I'll show an implementation using BinNavi's embedded Python interpreter.

Saturday, May 12, 2007

Scanning data for entropy anomalies

l0re just asked the following question in the OpenRCE forums:

I'm currently searching for a tool that does an entropy analyse. I want it to use it for finding a RSA key in a binary file. I have seen a tool that could do this on a workshop but unfortunately I don't know the name of tool and I can't find it with help of google. Does any one know the name of the tool or a tool that could do this?

I'm don't know of such tool from the top of my head although PEiD and OllyDBG both do statistical tests in order to detect possibly compressed/packed executables.

But having to come up with such things is one of the reasons why I love Python and Mathematica+Pythonika. With both it's possible to put together, in a few minutes the desired functionality.

So, the idea is to spot the typical high entropy that should be exhibited by something like a RSA key stored in binary form. Assuming that it's stored within data with significantly lower entropy, such as a standard executable file (that is, not packed or compressed itself), it should be easy to spot visually. Let's check...

First we need a function that calculates the entropy of a given chunk of data. The following code will take a Python string and calculate it's byte entropy, returning a real number in the range 0.0 and 8.0.
Values close to 8.0 would indicate a high entropy, hence the likelihood of compressed or otherwise highly random data. Low values would indicate low complexity data such as text or executable instructions or any other data exhibiting clear patterns.


import math

def H(data):
  if not data:
    return 0
  entropy = 0
  for x in range(256):
    p_x = float(data.count(chr(x)))/len(data)
    if p_x > 0:
      entropy += - p_x*math.log(p_x, 2)
  return entropy



Next we want to be able to take a chunk of data and run the entropy calculation function all across it, on byte increments, with a defined block size. Starting from the byte at offset 0, we will calculate the entropy of each data chunk of the given size and return it's value. The function is an iterator so that we can easily get a list of entropies for all offsets that we can next feed into a plotting function.


def entropy_scan (data, block_size) :
  for block in (
    data[x:block_size+x]
    for x in range (len (data) - block_size) ):

    yield H (block)



Now we need some test data, the following code will generate a low-entropy chunk of data 1024 bytes long, followed by a high-entropy one (assuming the random generator is good enough, which is the case for the example) also 1024 bytes long and closing with 1024 bytes more of low entropy data.


data = ''.join (
  [chr (random.randint (0, 64)) for x in xrange (1024)] +
  [chr (random.randint (0, 255)) for x in xrange (1024)] +
  [chr (random.randint (0, 64)) for x in xrange (1024)] )



If we run the Python code within Mathematica


ListPlot[ Py["\<
  list(
    entropy_scan( data, 256 ) )
\>"] ]


we obtain the following plot



displaying a noticeable bump in the region where the higher entropy data lies within our test data.

Update:

Deadhacker has posted an augmented version of my hack that does not rely on Mathematica in addition of being able to run on arbitrary files passed as arguments to his script.

Tuesday, April 03, 2007

IDAPython 0.9.0

Gergely Erdélyi has just put out the last release of IDAPython, 0.9.0

It can be found in his site together with some brief release notes. This release supports Python 2.5 among a good deal of other enhancements and additions.

Monday, December 11, 2006

Intel binaries for Pythonika

I've just uploaded new Pythonika packages (tar.gz and zip) to my site. The only change is that there are now compiled versions for Intel of the MathLink module for OS X. For Python 2.3 and 2.5.

Monday, November 06, 2006

Pythonika

I have finally managed to release Pythonika! I wrote Pythonika quite a while ago and was never getting around to push it out.

Pythonika is a MathLink module for Mathematica that makes it possible to write Python code within Mathematica's notebooks. It handles the conversion of Python and Mathematica objects transparently and allows to use all of Python's standard modules.

I'm a big fan of Python and I've been using Pythonika for a while. I hope more people will find it useful.

Pythonika is available at:

http://dkbza.org/pythonika.html

(an example notebook is available on the previous link as well as in the downloaded package)

The download includes source code and binaries for OSX/Windows/Linux.

Thursday, June 22, 2006

Recon 2006

Last weekend I had the pleasure of assisting to the reverse engineering conference Recon in Montreal. Recon has probably been one of the best events I had the luck to assist to so far. The quality of the speakers was really high and the people assisting were incredible. The whole event was very well organized, with no hiccups.
I had many really enjoyable talks and met great people.

Regarding the talks I specially enjoyed Bunie's Disassembling And Patching Hardware. I did study electronics some years back, although never to the level of what Bunnie was showing. I found the talk really interesting.
Alex Ionescu's Subverting Windows 2003 SP1 Kernel Integrity Protection was tremendously informative too, going really deep into w2k3 protection mechanisms.
Skype's talk by Fabrice Desclaux and Kostya Kortchinsky was truly amusing, they really took apart every single part of Skype.
Spoonm showcased IDARub, now Ruby lovers have a tool as good as IDAPython, if not better. IDARub's network support is something that IDAPython will surely catch up with soon.
Pedram Amini released PaiMei(download), a tool that surely will be talked about a lot. A full reverse engineering framework in Python. It really looked mighty powerful.

Also, I ended up cooking up two quick (ended up being not-so-quick) turbo-talks; Win32 Static Analysis In Python showing two Python modules I wrote a while ago pefile and pydasm and Unpacking Bird's Eye View showing some results on tracing unpackers using some internal tools. More on that last one on another post...


Thursday, January 19, 2006

IDAPython 0.8.0 released

Dyce has released IDAPython 0.8.0. I know of plenty of people who have been waiting for this one, me for one. Also, the Windows version is linked against Python 2.4.

Tuesday, December 27, 2005

Some Python loving

This post has very good points on Python and Java. Having played with Eclipse myself and coding lots of Python with no IDE (SubEthaEdit rules!) I could hardly agree more with him.