Friday, July 06, 2007

Scanning data for entropy anomalies II

Recently Phantal (aka Brian) left some comments on my blog and in OpenRCE on some calculations he did following up on my post Scanning data for entropy anomalies.

He develops the algorithm aiming at improving the execution speed of the entropy "scanner" example I had shown. I ran through his steps and arrived to the same conclusions he did on his latest comment. I just thought it'd be worth showing his work as a separate post rather than just a comment.

His idea is, by looking into the standard definition of entropy , to isolate all that doesn't change in the expression when the window slides and just update the entropy, instead of blindly recalculating it's value from scratch for each offset of the scan window.

Shannon' s entropy, usually represented as H, takes the following form if we work with the 256 possible byte values as the symbols :



where p(b) is the probability of the occurrence of a given byte.

H, the entropy, will tend towards its maximum value, 8, if the data has the maximum possible entropy. In such case the probability of each byte occurring would be the same which produces

Note that, although this is usually thought of as measuring the "amount of randomness", it is not that much the case. A sequence of bytes starting at 0 and increasing until 255 going through all the values in order would reach the maximum entropy value 8, even that it is all but random.

The probability of a given byte appearing in our window can also be expressed as . being the number of times the byte appears within the window and the width of the window.

The expression for the entropy can be expanded as follows


The entropy after sliding the window, , will have the same sum expansion except for two terms, the ones of the bytes going out and entering the window. We can then just update those and recalculate the expression by first removing the old values for the incoming and outgoing bytes and then adding the new values for both, after updating their count.



and that's all. Now on to some implementations in Mathematica and Python (but creating a Mathematica function with Pythonika). His implementation in C can be found in the comments of the previous post.



EntropyScan = Function[{Data, WindowScanSize},

  SummationTerm [Prob_] := If[Prob > 0, Prob Log[2, Prob], 0];

  (* Get the initial chunk and calculate the entropy *)
  CurrentChunk = Data[[ Range[1, WindowScanSize] ]];

  (* Calculate initial byte count and probabilities *)
  ByteCounts = Table[Count[CurrentChunk, i - 1], {i, 1, 256}];
  ByteProbs = Table[ByteCounts[[i]]/WindowScanSize, {i, 1, 256}];

  FilteredByteProbs = Select[ByteProbs, # > 0 &];
  H = - Total[
    Table[FilteredByteProbs[[i]] Log[2, FilteredByteProbs[[i]]],
    {i, 1, Length[FilteredByteProbs]}]];
  Entropies = {H};

  (* Slide the window and recalculate for incoming and outgoing bytes *)
  For[offset = 1, offset + WindowScanSize <= Length[Data], offset++,

    (* Get incoming and outgoing bytes *)
    ByteOut = Data[[offset]] + 1;
    ByteIn = Data[[offset + WindowScanSize]] + 1;

    (* Get the old probabilities *)
    OldValByteOut = SummationTerm[ByteProbs[[ByteOut]]];
    OldValByteIn = SummationTerm[ByteProbs[[ByteIn]]];

    (* Update counters and values *)
    ByteCounts[[ByteOut]]--;
    ByteCounts[[ByteIn]]++;
    ByteProbs[[ByteOut]] = ByteCounts[[ByteOut]]/WindowScanSize;
    ByteProbs[[ByteIn]] = ByteCounts[[ByteIn]]/WindowScanSize;

    (* Get the new probabilities *)
    ValByteOut = SummationTerm[ByteProbs[[ByteOut]]];
    ValByteIn = SummationTerm[ByteProbs[[ByteIn]]];

    (* Update the entropy *)
    H = H + OldValByteOut + OldValByteIn - ValByteIn - ValByteOut;

    Entropies = Append[Entropies, H];
  ];

  Entropies
];





Py["import math"]

EntropyScanPython = PyFunction["\<
def entropy_scan(args):
  data = args[0]
  window_size = float(args[1])

  summation_term = lambda p: p*math.log(p,2) if p>0 else 0

  current_chunk = data[:int(window_size)]
  byte_counts = [
    len(filter(lambda a:a==i, current_chunk))
    for i in range(256)]
  byte_probs = [byte_counts[i]/window_size for i in range(256)]

  H = -sum(
    [byte_probs[i]*math.log(byte_probs[i], 2)
      for i in range(256) if byte_probs[i]>0])
  entropies = [H]

  for offset in range(len(data)-window_size):
    byte_out, byte_in = data[offset], data[int(offset+window_size)]

    old_val_byte_out = summation_term(byte_probs[byte_out])
    old_val_byte_in = summation_term(byte_probs[byte_in])

    byte_counts[byte_out] -= 1;
    byte_counts[byte_in] += 1;
    byte_probs[byte_out] = byte_counts[byte_out]/window_size;
    byte_probs[byte_in] = byte_counts[byte_in]/window_size;

    val_byte_out = summation_term(byte_probs[byte_out])
    val_byte_in = summation_term(byte_probs[byte_in])

    H = H + old_val_byte_out + old_val_byte_in - val_byte_out - val_byte_in
    entropies.append(H)
  return entropies
\>"];

4 comments:

Cody said...

Good post Ero. I really like seeing mathematics applied to reverse engineering.

Brian said...

There's another problem. I realized it the other night, went to sleep, and forgot all about it.

Anyway, the probability is being calculated wrong. You have a window of length w. The probability that each byte is a particular value is 1/256. If you count up each character in the window, though, what if your window size is say, 512 bytes and there are 512 'a' characters? You then end up with p(a) = 512/256=2 ... probabilities should always fall in the range [0,1].

Similarly, if the window size is 16 bytes and it's full of 'a' characters again, you get 16/256 = 1/32 ... which is saying amongst 16 items, only 1 in 32 are a certain type instead of 100%.

Unfortunately, my background in prob/stats is limited. Throw an ODE my way and I'd probably be able to solve it, but ask me how to [properly] calculate probabilities ... and I'll kinda mumble to myself and look stupidly in your general direction. I think, though, that the correct answer is this: There are w positions in the window, and n possible values for each byte, making for n * w possible combinations. From that set, we're selecting all items q where they have the same type and counting them (c(q)). If we count up all items in the window, then throw them into a bag and shake it up, the probability of drawing an item of type q will be the probability:

p = c(q)/(n*w)

Worst-case scenario, if you end up with all of the same character, you will get:

p = c(q)/(256*w)
... c(q) = w, since all are the same.
p = w/(256 * w)
p = 1/256
en = -p * log_2(p)
en = -1/256 * log_2(1/256)
en = 1/256 * log_2(256)
en = 1/256 * 8
en = 1/32

Best case, all characters are different, so the probability is:

p = c(q)/(256*w)
... c(q) = 1 since all are different:
p = 1/(256*w)
en = -sum:k=0...255:
..1/(256*w) * log_2[1/(256*w)]
.. = -256 * 1/(256 * w) log_2(1/(256 *w)
.. = 1/w * log_2[256*w]
.. = 1/w * (8 + log_2(w))

So, those are your upper and lower bounds on the entropy:

en = [1/32, 1/w(8 + log_2(w)]

The closer to 1/32 you get, the more predictable it is. The closer to that other thing, the less predictable, etc.

-Brian

Brian said...

That doesn't feel right to me, either. Maybe it's just p = c(q)/w. I'll check my prob/stats text sometime this week when I get a moment and see if I can work it out.

-Brian

Brian said...

I think I've got a great way to improve on this. I'll wait 'til after the semester is over to flesh it out (these last few weeks are going to be hell), but I'll give you the jist and you can work it out if you're interested.

First off, I think there will be some minor corrections to our previous work on this.

I'm finishing up 2 semesters of prob/stats, and I've learned a lot about calculating estimated probabilities, means, standard deviations, but more importantly I've picked up some tricks for testing the likelihood of a given statistic occurring given the population's mean, standard deviation, etc.

In short, I think there is a relatively straightforward way of automating this process beyond some hackish attempt at finding a range of entropy values that exceeds some threshold for a sequence of bytes longer than X.

To go along with that, I think using some form of regression analysis will come in quite handy. After having generated the entropy values, it wouldn't be hard at all to 'smooth' the data-set a bit to get something that is fairly representative of the population at large, but is smoother, then perform polynomial regression, find points on this polynomial where the slope is 0 (calculus stuff), then determine whether the inflection points on either side of the portion of the curve leading up to the slope 0 point constitute a range large enough to warrant someone taking a closer look at the file in question.

Obviously this would only be handy if you need to scan a large number of files for entropy anomalies, so if you have a single program and you need an RSA key from it, this technique wouldn't be QUITE as handy, but ... ::shrug::

-Brian