Monday, July 25, 2005
Funny ice
The other day I was fixing myself a drink and I got an interesting surprise when I fetched the ice-cubes from the freezer. Some of them had developed an upwards growing spike. I sure was puzzled, although it did make sense to think of it as some result of ice expanding as it changes from water to solid form. This page does a good work explaining the trick.
NCD and malware
Stephanie Wehner published recently an article on using Normalized Compression Distance (NCD) to classify and detect similarities between malware samples and network traffic. I was curious on how well that worked out and being related to information theory, I had to play with it.
The idea is pretty neat, NCD is actually a pretty interesting metric but (I think) of limited use in real-life malware classification as it relies on the good compressibility of the data (executable files in this case), so that common features and other redundancy will be compressed together when looking at two samples at the same time.
Unfortunately, most malware uses some sort of compression which will remove most resemblance among members of a same family or other related classes. Some compressors like UPX still leave enough data to get some interesting results. But other packers do a better job and render NCD not too useful. Yet, NCD still manages to somehow classify samples.

The graph shown here was created calculating the NCD between all the pairs of over 580 samples. The compression method used was BZip2. The clustering was done using UPGMA, commonly used to calculate phylogenies. The clustering algorithm used in her paper is different, leading to a different dendrogram.
(In my personal experience UPGMA perform pretty decently when I cluster malware according to code metrics)
If the samples could be analyzed once they have been uncompressed/unpacked/unscrambled NCD should give much better results. Currently, for good packers it clusters more by the type of packer than by the type of malware.
Another drawback of this approach to automatically classifying samples is that it's very computationally intensive, as it requires to calculate, for every new sample, all the NCDs with the stored ones, although that could be optimized by just comparing against the representatives of different clusters.
The idea is pretty neat, NCD is actually a pretty interesting metric but (I think) of limited use in real-life malware classification as it relies on the good compressibility of the data (executable files in this case), so that common features and other redundancy will be compressed together when looking at two samples at the same time.
Unfortunately, most malware uses some sort of compression which will remove most resemblance among members of a same family or other related classes. Some compressors like UPX still leave enough data to get some interesting results. But other packers do a better job and render NCD not too useful. Yet, NCD still manages to somehow classify samples.

The graph shown here was created calculating the NCD between all the pairs of over 580 samples. The compression method used was BZip2. The clustering was done using UPGMA, commonly used to calculate phylogenies. The clustering algorithm used in her paper is different, leading to a different dendrogram.
(In my personal experience UPGMA perform pretty decently when I cluster malware according to code metrics)
If the samples could be analyzed once they have been uncompressed/unpacked/unscrambled NCD should give much better results. Currently, for good packers it clusters more by the type of packer than by the type of malware.
Another drawback of this approach to automatically classifying samples is that it's very computationally intensive, as it requires to calculate, for every new sample, all the NCDs with the stored ones, although that could be optimized by just comparing against the representatives of different clusters.
Using complexity theory to measure similarity
The Normalized Compression Distance (NCD) will, basically, compare the length of the compressed data of two binary objects separately and compare it to the length of the compressed data resulting from compressing the concatenation of both.
If they resemble a lot each other, the length of the concatenated data will be significantly shorter than the sum of the independently compressed files, indicating then, some similarity.
Thinking of the raw data of two images, if both have a big portion of blue sky and green from some fields and trees they will share some patterns like the distribution of certain color ranges. Then if the images A and B compress certain amounts separately, i.e. Ac and Ab , the combined data of both AB (by appending one to the other) will compress better than the sum of the images compressed separately ABc < A + B, as the combined data has patterns shared by both.
It can be thought as using features of one of the images to try to describe the other. Intuitively, if one tries to describe a car by comparing to another, (like this but bigger wheels, this color, etc) will have less work than describing by comparing it to a tree (has wheels, no leaves, not alive, it moves... etc) so , the closer the two objects, the least effort trying to describe one based on the other and that translates in the digital world as better compression.
For the raw stuff:
The Similarity Metric (MingLi, XinChen, XinLi, BinMa, andPaul M.B. Vitányi)
If they resemble a lot each other, the length of the concatenated data will be significantly shorter than the sum of the independently compressed files, indicating then, some similarity.
Thinking of the raw data of two images, if both have a big portion of blue sky and green from some fields and trees they will share some patterns like the distribution of certain color ranges. Then if the images A and B compress certain amounts separately, i.e. Ac and Ab , the combined data of both AB (by appending one to the other) will compress better than the sum of the images compressed separately ABc < A + B, as the combined data has patterns shared by both.
It can be thought as using features of one of the images to try to describe the other. Intuitively, if one tries to describe a car by comparing to another, (like this but bigger wheels, this color, etc) will have less work than describing by comparing it to a tree (has wheels, no leaves, not alive, it moves... etc) so , the closer the two objects, the least effort trying to describe one based on the other and that translates in the digital world as better compression.
For the raw stuff:
The Similarity Metric (MingLi, XinChen, XinLi, BinMa, andPaul M.B. Vitányi)
Elegant programs and complexity theory
This is a rant on the whole concept, which I find extremely interesting. I will describe in my own words how I understand it.
In the digital world, anything is expressed with ones and zeroes. And in such expressions, similar objects (be it pictures, sounds, text, etc) tend to share some features. Objects, when stored in their raw form, contain certain patterns or redundancy, for instance; raw sound data (storing just the intensity of the sound at a given instant in time, i.e. PCM) or raw image data (storing the intensity values for different colors of a picture).
Whenever the real sound or image contains repeated features (a melody repeating in a song or a visual pattern tiled in a picture), those features will translate as repeated strings of ones and zeroes in their digital representation.
Now, compression (Huffman coding, ZIP, RAR, gz, bzip2, etc) just attempts to capture as many of the pattern as possible and output data describing them... of the sort, this many things like this, that many like that. Simply put, if some data has lots of repeated patterns of ones and zeroes it does not take too much to describe the original data in function of the pattern. On the other hand, if there are no patterns the shorter way of describing it might actually be just giving the raw data as is.
This brings us to the notion of elegant programs and algorithmic information theory and amazing geniuses such as Kolmogorov and Gregory Chaitin, who go on defining the complexity of something by the length of the shorter program that generates the data whose complexity we are trying to measure. While this is just a mere theoretical idea, it's close to the idea of a compressor plus the compressed data, both form a program and some data that, when run, will give the original pattern. If the compressed data in shorter than the original one, we assume that its complexity is proportional to how much it's compressed. The more compressed, the least the complexity (i.e. more pattern and redundancy).
In the digital world, anything is expressed with ones and zeroes. And in such expressions, similar objects (be it pictures, sounds, text, etc) tend to share some features. Objects, when stored in their raw form, contain certain patterns or redundancy, for instance; raw sound data (storing just the intensity of the sound at a given instant in time, i.e. PCM) or raw image data (storing the intensity values for different colors of a picture).
Whenever the real sound or image contains repeated features (a melody repeating in a song or a visual pattern tiled in a picture), those features will translate as repeated strings of ones and zeroes in their digital representation.
Now, compression (Huffman coding, ZIP, RAR, gz, bzip2, etc) just attempts to capture as many of the pattern as possible and output data describing them... of the sort, this many things like this, that many like that. Simply put, if some data has lots of repeated patterns of ones and zeroes it does not take too much to describe the original data in function of the pattern. On the other hand, if there are no patterns the shorter way of describing it might actually be just giving the raw data as is.
This brings us to the notion of elegant programs and algorithmic information theory and amazing geniuses such as Kolmogorov and Gregory Chaitin, who go on defining the complexity of something by the length of the shorter program that generates the data whose complexity we are trying to measure. While this is just a mere theoretical idea, it's close to the idea of a compressor plus the compressed data, both form a program and some data that, when run, will give the original pattern. If the compressed data in shorter than the original one, we assume that its complexity is proportional to how much it's compressed. The more compressed, the least the complexity (i.e. more pattern and redundancy).
Subscribe to:
Posts (Atom)