I'm currently searching for a tool that does an entropy analyse. I want it to use it for finding a RSA key in a binary file. I have seen a tool that could do this on a workshop but unfortunately I don't know the name of tool and I can't find it with help of google. Does any one know the name of the tool or a tool that could do this?
I'm don't know of such tool from the top of my head although PEiD and OllyDBG both do statistical tests in order to detect possibly compressed/packed executables.
But having to come up with such things is one of the reasons why I love Python and Mathematica+Pythonika. With both it's possible to put together, in a few minutes the desired functionality.
So, the idea is to spot the typical high entropy that should be exhibited by something like a RSA key stored in binary form. Assuming that it's stored within data with significantly lower entropy, such as a standard executable file (that is, not packed or compressed itself), it should be easy to spot visually. Let's check...
First we need a function that calculates the entropy of a given chunk of data. The following code will take a Python string and calculate it's byte entropy, returning a real number in the range 0.0 and 8.0.
Values close to 8.0 would indicate a high entropy, hence the likelihood of compressed or otherwise highly random data. Low values would indicate low complexity data such as text or executable instructions or any other data exhibiting clear patterns.
| import math def H(data): if not data: return 0 entropy = 0 for x in range(256): p_x = float(data.count(chr(x)))/len(data) if p_x > 0: entropy += - p_x*math.log(p_x, 2) return entropy |
Next we want to be able to take a chunk of data and run the entropy calculation function all across it, on byte increments, with a defined block size. Starting from the byte at offset 0, we will calculate the entropy of each data chunk of the given size and return it's value. The function is an iterator so that we can easily get a list of entropies for all offsets that we can next feed into a plotting function.
| def entropy_scan (data, block_size) : for block in ( data[x:block_size+x] for x in range (len (data) - block_size) ): yield H (block) |
Now we need some test data, the following code will generate a low-entropy chunk of data 1024 bytes long, followed by a high-entropy one (assuming the random generator is good enough, which is the case for the example) also 1024 bytes long and closing with 1024 bytes more of low entropy data.
| data = ''.join ( [chr (random.randint (0, 64)) for x in xrange (1024)] + [chr (random.randint (0, 255)) for x in xrange (1024)] + [chr (random.randint (0, 64)) for x in xrange (1024)] ) |
If we run the Python code within Mathematica
| ListPlot[ Py["\< list( entropy_scan( data, 256 ) ) \>"] ] |
we obtain the following plot

displaying a noticeable bump in the region where the higher entropy data lies within our test data.


13 comments:
Interesting post :-), i specially like the way you have "connected" all these programs to get what you want ;-)
It´s a common task in unpacking software doing some kind of entropy analysis at the entry point or defined points like data sections, overlay, etc.. in order to know if they must going to emulate code or just try some deeper analysis algorithms. There are also some programs that find static cryptographic constants while trying to know if a program is using RSA, CRC32 and stuff like that. But i don´t think this would be too useful when trying to get something like an RSA key because you can find "entropy" in so many places (resources, data tables, ... ) and there are problems to solve like where the buffer starts?, what is the size of the key?.
Perhaps a good approach can be an IDA plugin doing both analysis: find crypto consts, get xrefs from code to them to find the algorithm and then just try to find calls referencing buffers with high entropy. I think this could be done easily with the help of your binnavi software ;-D
Cheers,
Mario Ballano
I agree, this would merely indicate where it could be (if, as I mentioned in the post, it's stored in binary form. If it'd be in text-form the entropy would be much lower)
As well, as you note, there are lots of areas that could have high entropy within the file. But it's better to have some likely starting points than none :-p
IDA actually comes with a plug-in that scans for common initial vectors and other constants used in different crypto-algorithms which helps a lot when one wants to pinpoint the location of cryptographic code.
I have tried this ida plugin and looks nice :-). I´m agree with you, this could be a good way to get some starting points, anyway the best entropy calculator is the human eye with the help of hiew ;-))
Cheers,
Mario
I don't know if running time is a problem with your code (maybe it's dominated by I/O?), but you can make it drastically faster by replacing this
p_x = float(data.count(chr(x)))/len(data)
with something that builds a histogram in a dict and then computes the entrophy of the dict. You are essentially doing 256 passes through the data instead of just one.
Ero, thanks for the code. Wrote a script that includes your code and hopeful provides some rudimentary feedback before one moves toward disassembly.
It will print out ascii and hex block samples of data when they pass a given entropy threshold. Could add logic to print out total size of consecutive blocks that had large entropy. Would also be of use to have the full block of data to easily run through a decompression routine or with a mime/magic match.
hrm. I was hoping for the ability to post a picture, it'd be easier since html doesn't offer a rich enviornment for math-speak.
Anyway, I've refactored this algorithm to be more computationally efficient. My application for this concept is on a very large dataset, so execution time is important.
You start with this eqn:
en = sum over k = 0...255:
px log_2(px)
... which approaches the value 8.0 as the entropy of the block of data you're looking at gets higher. In this eqn, px is the probability that each possible byte occurs.
Here's the resulting algorithm I came up with (you get two choices; I haven't [yet] tested to see whether either gives you any performance gain over the other):
log_2(mult: k=0...255: ck^ck) -> 8(wsize + 256)
... that is to say, the eqn on the left gets closer to the eqn on the right from the left as the entropy approaches 8.0 from the left. wsize is the number of bytes you're calculating entropy for, mult means the same as summation, but multiplication instead of sums. ck is the number of times each byte appeared in the block.
The 2nd choice is to raise both sides to the power of 2, resulting in:
mult: k=0...255: ck^ck -> 2^8(wsize + 256)
There's likely very little difference between the two in time taken, but, I figured I'd state them anyway. In these eqns, ck is the count of each of the possible 256 bytes (assuming the sample has a range of possible values in 0...255), wsize is the size of the block you're trying to calculate entropy for, and mult is the same as summation, except the operation is multiplication instead of addition.
So, first I'll show why this representation is more useful, then for those curious I'll step through the derivation.
The problem with the original [shannon] entropy eqn is that there's no shortcuts you can take. That is to say, how many values could you compute ahead of time and make use of repeatedly to improve performance?
With this representation you can keep around an array of each ck^-ck. Each time the window slides by 1, calculate the new ck^ck, divide out the old one, and multiply in the new one. Congratulations, you just saved yourself from executing 256 log_2s each time you calculate entropy for a new window.
To sum up, here's what you'd want to do:
Calculate: 2^8(wsize + 256)
... or 8(wsize + 256), whichever you picked.
This will never change, so you only calc it once.
Iterate over your window the first time, adding 1 to one of 256 integers in an array (one for each possible value). You then calculate the product of all ck^-ck. Now or later you can compare it to the value calculated above to see whether it has crossed a threshhold of unpredictability.
When the window slides, subtract 1 from your byte counter array for the byte that left the window, and add 1 for the byte that just entered the window. For the counter representing the old value, calculate ck^-ck and divide that off the product you calculated during the first run. Now multiply in the new ck^-ck. There's where the primary savings is.
So, if you don't care about why this works, just that it works, you can stop reading here and go on your merry way.
To start with, shannon entropy is defined as:
en = (-1)sum: k=0...255: pk log_2(pk)
Logs have a convenient property:
a*log(f) = log(f^a)
So, that's what we're gonna do first:
en = (-1)sum: k=0...255: log_2(pk^pk)
Then, another convenient property is:
log(f) + log(g) = log(f*g), so:
en = (-1)log_2(mult: k=0...255: pk^pk)
Carry in the -1, and you have:
en = log_2(mult: k=0...255: pk^-pk)
To make things a little simpler to write, set: p = mult: k=0...255: pk^pk :
en = log_2(1/p) -> 8.0 as 1/p -> 256, which occurs when all ck is at or near 1.
p = mult: k=0...255: (ck/256)^(-ck/256)
= mult: k=0...255: [(ck/256)^-ck]^(1/256)
q = p^256
= mult: k=0...255: (256/ck)^(ck)
= mult: k=0...255: 256^(ck) * (1/ck)^(ck)
= mult: k=0...255: 256^ck
* mult: k=0...255: ck^-ck
The first set of multiplications simplifies to:
256^(sum: k=0...255: ck)
Remember, ck is how many of each byte value. Add them together and you should get the total number of bytes in the set you're calculating entropy for, so this simplifies to:
256^wsize
... where wsize is the number of bytes in the window. this can be further simplified as:
2^8wsize
which gives:
q = 2^8wsize * mult: k=0...255: ck^-ck
Recall that entropy was:
en = log_2(1/p)
= log_2[(1/q)^(1/256)]
= 1/256 log_2[1/q] -> 8.0
log_2[1/q] -> 2^11
-log_2[q] -> 2^11
log_2[q] -> -2^11
q -> 1/2^2^11
2^8wsize * mult: k=0...255: ck^-ck -> 1/2^2^11
mult: k=0...255: ck^-ck -> 2^-8wsize * 2^-2048
mult: k=0...255: ck^-ck -> 2^-8(wsize + 256)
Last, we don't want to be calculating reciprocols of numbers if we don't have to, so:
mult: k=0...255: ck^ck -> 2^8(wsize + 256)
-Brian
I made one mistake in that (that I've found so far).
At the point where you're sliding the window, you need to divide off ck^ck for both counts that change, and multiply in the two new ck^ck's. I had said you only need to divide off the ck^ck for the one that slid out, and multiply in the one that slid into the window.
-Brian
I finally got some time to try implementing that algorithm I described & found a flaw in my logic: c1^c1 * c2^c2 * ... * ck^ck is a number much larger than 32-bits. Fortunately I came up with a fast way to do SE, so feel free to delete those comments if you like:
for (i = 0; i < 256; i++)
ek[i] = i/256 * Log(2,i); // pre-calculate the entropy for each possible probability
for (i = 0; i < windowsize; i++)
ck[buffer[i]]++; // count how many of each byte there is in the window
for (i = 0; i < 256; i++)
entropy[0] -= ek[ck[buffer[i]]]; // calculate the entropy for the first window position
for (i = 1; i < buffer.length - wsize; i++)
{ // slide the window position and update the new entropy as it slides.
index1 = i-1;
index2 = i+windowsize-1;
entropy[i] = entropy[i-1] + ek[ck[buffer[index1]]] + ek[ck[buffer[index2]]];
index1 = ck[buffer[index1]]--;
index2 = ck[buffer[index2]]++;
entropy[i] -= ek[index1] + ek[index2];
}
-Brian
Hey Brian!
That's funny. I was just about to made a blog post about your comments and I had reached myself the same conclusion. The Mathmatica and Python implementations I made are nearly identical to what you just posted!
Great work! :)
A colleague and I have been using sliding window metrics for some time in characterizing both local and global properties of data streams. One of the metrics that we have used with much success is 0-order entropy.
Some years ago, I wrote a sliding entropy calculation program in C for efficiency. I followed much the same reasoning as was recently posted. In particular, if the formula for 0-order entropy is rewritten in terms of frequencies, it's relatively easy to see that the entropy for a window depends on that of the previous window in a straight forward way.
However, keep in mind that entropy is defined in terms of an alphabet and that alphabet changes as the window slides. Replacing the number of unique characters in a window by the number of characters in the character set may be useful, but it isn't exactly entropy.
Consider the following simple example:
Let the alphabet be "ab" and consider a window of size 4. We'll compute the entropy of the window "aaaa." The size of the alphabet is 1 and the frequency f("a") = 4. Sum of the products of the probabilities and logs is 4 * (1/1 * -lg(1/1)) = 4 * 0 and that's the correct answer. The 0 result means that each character in the window is completely predictable with no further information.
Consider what happens if we use 256 as the size of the alphabet. Then we'd have that the entropy, under one suggested formulation, would be 256 * 1/256 * -lg(1/256) = 8, the maximum possible over that alphabet. Unfortunately, we'd be assigning a maximum entropy to an obviously strongly patterned window.
I like to characterize entropy numbers as indicators of predictability. For example, if I give you the string "aaaa" and say that I've chosen a character from it at random and ask you to guess my choice, the answer is trivial and you need no further information to tell me "it's an 'a'" and you'll be right 100% of the time. On the other hand, if the string given is "abbb," you'll be right 75% of the time by guessing 'b.' Computing the entropy, we find a fairly small number indicating that you wouldn't need much more information to be right 100% of the time.
Now, suppose the string is "abcd." Here is the maximum entropy case that was described and you have an entropy of 2. Sure enough, your reply amounts to a shot in the dark.
Note that we haven't mentioned anything about the position of the characters. For order-0 entropy, the strings 'abbb,' 'babb,' 'bbab,' and 'bbba' are equivalent.
If however I tell you I'm thinking of an English word and the first letter is 't' and ask you to guess the next letter, you'd probably guess 'h' and you'd be right a fair percentage of the time. Similar circumstances arise in sliding metrics that can be advantageous. But, that's order-1 entropy, and so on.
-Wilbon Davis
The name of the tool is: CrypTool "http://www.cryptool.org" it is one of the best tool for cryptography I have ever seen for educational purposes. The best thing about it is that it is a free tool and it comes with the source code written in C++, and yes you can check the Entropy analyzer Source Code...
Hi Ero,
The tool they are talking about may be this one:
http://www.lsv.ens-cachan.fr/~olivain/net-entropy/
I have known of it for a while and even wrote to the authors in an attempt to get it for academic research purposes with no luck.
NL
Hi,
Thank you for this interesting article that made me think about the uses of entropy in malware analysis.
There appears to be a small mistake in the code above. You've been taking the log to the base 2, which is correct for a binary base, but wrong for ascii. Since ascii contains 256 possibilities, the correct base for the log is 256.
(256^n instead of 2^n where n is the length of the "string" and the base is the number of possible values that can be chosen - only 2 for binary, but 256 for ascii).
Fixing this will give you proper normalized results - a value between 0 and 1, which you can multiple by 100 to get a percentage if you wish. This is more intuitive then the results of the existing code which fall on a scale of 0 to 8.
You can easily expand on what you already have to include calculations for binary (by using base 2 again), and hex (by using base 16).
A comment to Wilbon: Only a small note to ponder on the statement regarding
"Now, suppose the string is "abcd." Here is the maximum entropy case that was described and you have an entropy of 2. Sure enough, your reply amounts to a shot in the dark."
In that case, you're not totally shooting in the dark and your chances of being right are a bit better than you'd think because having been given one letter (as per your previous examples), at maximum entropy, you can at least state the the other values are not the given letter. Meaning, if you have 'abcd' (or better yet something more 'random' like 'cadb'), and you're told the first letter is 'c' AND that it has the max entropy value, if you're asked to choose the second letter you can rule out 'c', so your chance of being right is 1 in 3 (pick either 'a', 'b', or 'd').
The point is that entropy isn't a measure of randomness as is commonly stated, but a measure of disorder in the system. (Some people will only say that entropy is the mathematical object defined by its formula. To paraphrase a physicist, "nobody really knows what entropy is, anyway!")
I've reworked the Python code in this article and gotten good results, so thanks again for doing the groundwork. I'd like to share some of the more interesting results that will, hopefully, illuminate my comments to Wilbon.
Using an online password generator to generate a random ascii password 256 characters long, the (revised) entropy calculator consistently provides results around 50%. Hmmm, at first you may think that it should be higher -closer to 100%. So what IS 100%? If you create a string ascii 256 characters long with no repeating characters, i.e. a representation of every ascii character, then that result is 100%. That's maximum disorder. On the other side of the spectrum, having a 256 character long string of only 1 character, say only the letter 'a', gives a result of 0%.
Therefore maximum entropy doesn't mean maximum randomness. Maximum randomness appears to fall somewhere near the middle (~50%) of the entropy spectrum.
Sevak Avakians
Post a Comment