So I thought: a trick will usually be regarded as "common" once it gets implemented in some packer, since packers try to make analysis difficult and will embed whichever tricks are good/popular within the underground at the time in order to make the reverse engineering process as cumbersome as possible. Therefore, if I could somehow place packers in time, I'd have a starting point...
That led me to remember Google Groups. It's possible to restrict queries to date ranges, and the archives go back to 1981. I quickly put together a script to scan from 1981 to 2007, one month at a time, for a set of popular packers.
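The month-by-month scan boils down to producing one date range per calendar month. Here is a minimal Python sketch of that loop; `month_windows` is a hypothetical helper written for illustration, not the original script, and the actual search request is deliberately left out.

```python
from datetime import date, timedelta


def month_windows(first_year, last_year):
    """Yield (start, end) dates, one pair per calendar month, both inclusive."""
    for year in range(first_year, last_year + 1):
        for month in range(1, 13):
            start = date(year, month, 1)
            # The day before the first of the next month is the last day of this one.
            if month == 12:
                next_start = date(year + 1, 1, 1)
            else:
                next_start = date(year, month + 1, 1)
            yield start, next_start - timedelta(days=1)


windows = list(month_windows(1981, 2007))
print(len(windows), windows[0], windows[-1])
```

Covering 1981 through 2007 gives 27 x 12 = 324 windows per search term, so the number of queries grows quickly with the number of packers.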
The most painful part of the whole process was fooling Google... they sure do not like robots... whenever they get a bunch of very simple automated queries they'll serve back a "403 Forbidden" saying the queries look like they are coming from a virus or spyware app...
But my script is good, it's no evil spyware... so I got into the mood of working my way around the checks. I needed to run quite a few queries (> 10K), so I had better make Google believe I'm not a robot. Besides finding the right timing for the queries (too frequent will make Google sad), I had to distribute the search over a few hosts and randomize the headers, the User-Agents and the query itself (just throw in some randomized, "orthogonal" search terms that have nothing to do with your query). After that the script was good to go...
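To make those ingredients concrete, here is a rough Python sketch that only plans the requests: it rotates hosts, picks a random User-Agent, pads the query with a random term and draws a random delay. Everything in it is an assumption for illustration: the host names, the User-Agent strings, the delay bounds, and in particular the choice to append the random term as an excluded word (`-term`). The post does not say how the "orthogonal" terms were combined with the real query, and no network call is made here.

```python
import random
from itertools import cycle

# Placeholder values, not the ones actually used.
USER_AGENTS = ["ExampleBrowser/1.0", "ExampleBrowser/2.0", "OtherBrowser/0.9"]
HOSTS = ["host-a", "host-b", "host-c"]
LETTERS = "abcdefghijklmnopqrstuvwxyz"


def plan_requests(queries, rng, min_delay=20.0, max_delay=60.0):
    """Return one dict per query describing where, how and when to send it."""
    hosts = cycle(HOSTS)  # spread the load round-robin over the hosts
    plan = []
    for query in queries:
        # Random 8-letter token, unrelated to the real search term.
        noise = "".join(rng.choice(LETTERS) for _ in range(8))
        plan.append({
            "host": next(hosts),
            "user_agent": rng.choice(USER_AGENTS),
            # Assumption: exclude the token so the hits are not narrowed.
            "query": f"{query} -{noise}",
            # Seconds to wait before sending; the bounds are guesses.
            "delay": rng.uniform(min_delay, max_delay),
        })
    return plan


for item in plan_requests(["yoda protector", "examplepacker"], random.Random()):
    print(item)
```

Passing in the `random.Random` instance keeps the plan reproducible when a seed is supplied, which helps when debugging a run that got blocked.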
So, after mining the news groups for popular packer names ( the search string was, most of the time, "








The results will have some inaccuracies, as it's possible that some of the terms appeared in news posts unrelated to the packers. Yet I think they look plausible. When the volume of hits is high enough, or constant over time, I take it to indicate the approximate release date of the packer in question, or at least the first public discussion about it; I would tend to think the two are not likely to be far apart.
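One way to turn that "high enough or constant over time" rule of thumb into something mechanical is to look for the first month that starts a sustained run of hits, so that an isolated stray hit does not count. The Python sketch below does that; the function name, the thresholds and the sample counts are all invented for illustration and are not the data or the method behind the results.

```python
def first_sustained_month(hits, min_hits=2, run_length=3):
    """Return the label of the first month that starts `run_length`
    consecutive months with at least `min_hits` hits, or None.

    `hits` is a chronological list of (month_label, hit_count) pairs.
    """
    run = 0
    for i, (_, count) in enumerate(hits):
        # Extend the current run, or reset it on a quiet month.
        run = run + 1 if count >= min_hits else 0
        if run == run_length:
            return hits[i - run_length + 1][0]
    return None


# Made-up counts: one stray early hit, then steady discussion.
sample = [
    ("2000-01", 1), ("2000-02", 0), ("2000-03", 0),
    ("2000-04", 2), ("2000-05", 3), ("2000-06", 2),
]
print(first_sustained_month(sample))
```

With sparse data the thresholds matter a lot, so they are best treated as knobs to experiment with rather than fixed values.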
If someone can either corroborate or refute the data, I'll be glad to hear it.
I also did some tests overlaying virus release times in order to try to spot correlations between big outbreaks and news posts about packers, but I couldn't see anything particularly significant.



6 comments:
Very interesting work. Could this be generalized to include determination of the "first" occurrence of just about any topic? Also, did you think about using Tor or something akin to a mix or onion routing system to anonymize the sources when executing the script?
I'd love to see this baby in action.
txs
Using onion routing did cross my mind. That seems the way to go to make something more "scalable"... I just wanted to keep it within the one-night-hack range.. I know myself and if I start putting stuff into it I'll be weeks cooking up some monster python script... otoh, it'd be just making the script use the onion proxy right...? ;)
(never used onion myself.. yet)
I guess one can generalize it to more cases, limited only by the fact that news postings might not be a representative "space" from which to draw samples for some topics... I guess it's better for some subjects than others. I'd think that for techie stuff it is somewhat useful.
Interesting entry! :)
There is also one more problem: some packers have several versions that showed up at different moments in time, and different versions implement different things.
The part about fooling Google was very amusing. You've always got some crazy shit up your sleeve involving statistics, Python, Mathematica, and pretty diagrams.
Ero,
Cool research above. However, I don't think Yoda Protector was around in 1992 ;)
Keep up the great work, though!
- Jason
I guess Yoda was a cool Jedi back then already :-p
It's one hit... that's the issue when the dataset is so sparse... any "noise" is significant...