Showing posts with label pythonika. Show all posts
Showing posts with label pythonika. Show all posts

Friday, July 06, 2007

Scanning data for entropy anomalies II

Recently Phantal (aka Brian) left some comments on my blog and in OpenRCE on some calculations he did following up on my post Scanning data for entropy anomalies.

He develops the algorithm aiming at improving the execution speed of the entropy "scanner" example I had shown. I ran through his steps and arrived to the same conclusions he did on his latest comment. I just thought it'd be worth showing his work as a separate post rather than just a comment.

His idea is, by looking into the standard definition of entropy , to isolate all that doesn't change in the expression when the window slides and just update the entropy, instead of blindly recalculating it's value from scratch for each offset of the scan window.

Shannon' s entropy, usually represented as H, takes the following form if we work with the 256 possible byte values as the symbols :



where p(b) is the probability of the occurrence of a given byte.

H, the entropy, will tend towards its maximum value, 8, if the data has the maximum possible entropy. In such case the probability of each byte occurring would be the same which produces

Note that, although this is usually thought of as measuring the "amount of randomness", it is not that much the case. A sequence of bytes starting at 0 and increasing until 255 going through all the values in order would reach the maximum entropy value 8, even that it is all but random.

The probability of a given byte appearing in our window can also be expressed as . being the number of times the byte appears within the window and the width of the window.

The expression for the entropy can be expanded as follows


The entropy after sliding the window, , will have the same sum expansion except for two terms, the ones of the bytes going out and entering the window. We can then just update those and recalculate the expression by first removing the old values for the incoming and outgoing bytes and then adding the new values for both, after updating their count.



and that's all. Now on to some implementations in Mathematica and Python (but creating a Mathematica function with Pythonika). His implementation in C can be found in the comments of the previous post.



EntropyScan = Function[{Data, WindowScanSize},

  SummationTerm [Prob_] := If[Prob > 0, Prob Log[2, Prob], 0];

  (* Get the initial chunk and calculate the entropy *)
  CurrentChunk = Data[[ Range[1, WindowScanSize] ]];

  (* Calculate initial byte count and probabilities *)
  ByteCounts = Table[Count[CurrentChunk, i - 1], {i, 1, 256}];
  ByteProbs = Table[ByteCounts[[i]]/WindowScanSize, {i, 1, 256}];

  FilteredByteProbs = Select[ByteProbs, # > 0 &];
  H = - Total[
    Table[FilteredByteProbs[[i]] Log[2, FilteredByteProbs[[i]]],
    {i, 1, Length[FilteredByteProbs]}]];
  Entropies = {H};

  (* Slide the window and recalculate for incoming and outgoing bytes *)
  For[offset = 1, offset + WindowScanSize <= Length[Data], offset++,

    (* Get incoming and outgoing bytes *)
    ByteOut = Data[[offset]] + 1;
    ByteIn = Data[[offset + WindowScanSize]] + 1;

    (* Get the old probabilities *)
    OldValByteOut = SummationTerm[ByteProbs[[ByteOut]]];
    OldValByteIn = SummationTerm[ByteProbs[[ByteIn]]];

    (* Update counters and values *)
    ByteCounts[[ByteOut]]--;
    ByteCounts[[ByteIn]]++;
    ByteProbs[[ByteOut]] = ByteCounts[[ByteOut]]/WindowScanSize;
    ByteProbs[[ByteIn]] = ByteCounts[[ByteIn]]/WindowScanSize;

    (* Get the new probabilities *)
    ValByteOut = SummationTerm[ByteProbs[[ByteOut]]];
    ValByteIn = SummationTerm[ByteProbs[[ByteIn]]];

    (* Update the entropy *)
    H = H + OldValByteOut + OldValByteIn - ValByteIn - ValByteOut;

    Entropies = Append[Entropies, H];
  ];

  Entropies
];





Py["import math"]

EntropyScanPython = PyFunction["\<
def entropy_scan(args):
  data = args[0]
  window_size = float(args[1])

  summation_term = lambda p: p*math.log(p,2) if p>0 else 0

  current_chunk = data[:int(window_size)]
  byte_counts = [
    len(filter(lambda a:a==i, current_chunk))
    for i in range(256)]
  byte_probs = [byte_counts[i]/window_size for i in range(256)]

  H = -sum(
    [byte_probs[i]*math.log(byte_probs[i], 2)
      for i in range(256) if byte_probs[i]>0])
  entropies = [H]

  for offset in range(len(data)-window_size):
    byte_out, byte_in = data[offset], data[int(offset+window_size)]

    old_val_byte_out = summation_term(byte_probs[byte_out])
    old_val_byte_in = summation_term(byte_probs[byte_in])

    byte_counts[byte_out] -= 1;
    byte_counts[byte_in] += 1;
    byte_probs[byte_out] = byte_counts[byte_out]/window_size;
    byte_probs[byte_in] = byte_counts[byte_in]/window_size;

    val_byte_out = summation_term(byte_probs[byte_out])
    val_byte_in = summation_term(byte_probs[byte_in])

    H = H + old_val_byte_out + old_val_byte_in - val_byte_out - val_byte_in
    entropies.append(H)
  return entropies
\>"];

Saturday, May 12, 2007

Scanning data for entropy anomalies

l0re just asked the following question in the OpenRCE forums:

I'm currently searching for a tool that does an entropy analyse. I want it to use it for finding a RSA key in a binary file. I have seen a tool that could do this on a workshop but unfortunately I don't know the name of tool and I can't find it with help of google. Does any one know the name of the tool or a tool that could do this?

I'm don't know of such tool from the top of my head although PEiD and OllyDBG both do statistical tests in order to detect possibly compressed/packed executables.

But having to come up with such things is one of the reasons why I love Python and Mathematica+Pythonika. With both it's possible to put together, in a few minutes the desired functionality.

So, the idea is to spot the typical high entropy that should be exhibited by something like a RSA key stored in binary form. Assuming that it's stored within data with significantly lower entropy, such as a standard executable file (that is, not packed or compressed itself), it should be easy to spot visually. Let's check...

First we need a function that calculates the entropy of a given chunk of data. The following code will take a Python string and calculate it's byte entropy, returning a real number in the range 0.0 and 8.0.
Values close to 8.0 would indicate a high entropy, hence the likelihood of compressed or otherwise highly random data. Low values would indicate low complexity data such as text or executable instructions or any other data exhibiting clear patterns.


import math

def H(data):
  if not data:
    return 0
  entropy = 0
  for x in range(256):
    p_x = float(data.count(chr(x)))/len(data)
    if p_x > 0:
      entropy += - p_x*math.log(p_x, 2)
  return entropy



Next we want to be able to take a chunk of data and run the entropy calculation function all across it, on byte increments, with a defined block size. Starting from the byte at offset 0, we will calculate the entropy of each data chunk of the given size and return it's value. The function is an iterator so that we can easily get a list of entropies for all offsets that we can next feed into a plotting function.


def entropy_scan (data, block_size) :
  for block in (
    data[x:block_size+x]
    for x in range (len (data) - block_size) ):

    yield H (block)



Now we need some test data, the following code will generate a low-entropy chunk of data 1024 bytes long, followed by a high-entropy one (assuming the random generator is good enough, which is the case for the example) also 1024 bytes long and closing with 1024 bytes more of low entropy data.


data = ''.join (
  [chr (random.randint (0, 64)) for x in xrange (1024)] +
  [chr (random.randint (0, 255)) for x in xrange (1024)] +
  [chr (random.randint (0, 64)) for x in xrange (1024)] )



If we run the Python code within Mathematica


ListPlot[ Py["\<
  list(
    entropy_scan( data, 256 ) )
\>"] ]


we obtain the following plot



displaying a noticeable bump in the region where the higher entropy data lies within our test data.

Update:

Deadhacker has posted an augmented version of my hack that does not rely on Mathematica in addition of being able to run on arbitrary files passed as arguments to his script.

Monday, December 11, 2006

Intel binaries for Pythonika

I've just uploaded new Pythonika packages (tar.gz and zip) to my site. The only change is that there are now compiled versions for Intel of the MathLink module for OS X. For Python 2.3 and 2.5.

Thursday, November 16, 2006

NumPy arrays and Pythonika

If someone tries to pass NumPy's ndarray objects into Mathematica with something like:

Py["numpy.array([[1,2,3],[4,5,6],[7,8,9]])"]

the following error will appear:

Object type can't be converted!

That's due to the fact that Pythonika doesn't know what to do when finding objects of such type (or any type that it's not one of Python's basic types).

In order to get around that, one can do something like:

Py["numpy.array([[1,2,3],[4,5,6],[7,8,9]]).tolist()"]

Which will return the expected nested lists in Mathematica:

{{1, 2, 3}, {4, 5, 6}, {7, 8, 9}}

Monday, November 06, 2006

Pythonika

I have finally managed to release Pythonika! I wrote Pythonika quite a while ago and was never getting around to push it out.

Pythonika is a MathLink module for Mathematica that makes it possible to write Python code within Mathematica's notebooks. It handles the conversion of Python and Mathematica objects transparently and allows to use all of Python's standard modules.

I'm a big fan of Python and I've been using Pythonika for a while. I hope more people will find it useful.

Pythonika is available at:

http://dkbza.org/pythonika.html

(an example notebook is available on the previous link as well as in the downloaded package)

The download includes source code and binaries for OSX/Windows/Linux.