Showing posts with label code opimization. Show all posts
Showing posts with label code opimization. Show all posts

Sunday, September 09, 2007

Reverse engineering a compiler-produced artifact

In our training, Pedram and I deal with some very simple compiler optimizations or artifacts. Although they represent the same semantics that the programmer defined in the original source code, those optimizations are sometimes cumbersome to convert back to a meaningful high-level representation.

The other day I was just studying a piece of code and bumped into a relatively common pattern. The code I was looking at was supposed to represent a division of a function's argument by a constant. But in the disassembled code I was studying I could only see a multiplication. This can be slightly confusing unless one has seen a bit more assembly than what is healthy and remembers some of the compiler-produced fun that goes on...

A couple of things to remember:
  • Compilers love to work with multiples of 2. The processor can can just shift registers left and right (shifting is incredibly fast, that is moving the contents of a register left or right padding with o or 1 as appropriate). Shifting to the left for multiplication by 2 and towards the right for division by 2 (this is akin to having a number in base 10 and multiplying by 10 by adding zeros to the right and dividing by by removing the rightmost digit).
  • Compilers hate to use the division instruction. The division takes a lot of steps, or cycles, for the CPU to complete. Hence they will avoid to use it at all cost.

The code looked like this:
(irrelevant interleaved code left out)

mov ecx, [esp+4+arg_4]
mov eax, 66666667h
imul ecx
sar edx, 3


  • In the snippet we can see function argument being multiplied by 0x66666667, and the result being stored as a 64 bit value in EDX:EAX (topmost 32 bits in EDX, the lower 32 in EAX)
  • Then the top 32 bits are shifted ("arithmetically") to the right. That is, divided by 2 thrice, same as 2^3 = 8. Effectively dividing the value by 8.
  • But the division is applied only to the top 32 bits, ignoring the lower 32. That could be understood to also mean that, by taking the topmost 32 bits and ignoring the bottom ones, the result of the multiplication is implicitly being divided by 2^32. (That's only guessed by the subsequent usage of the value just obtained, there's never again a reference to the lower 32bits, so I assume that they are discarded)


What do we have so far?

[ (Value * 0x66666667) / 2^32 ] / 2^3 ]

But, what's that 0x66666667? why to multiply by something so large and then divide?
The reason is that such computation allows the processor to keep most of the precision of the division it is trying to perform, still obtaining an integer in the end but without having to resort to using floating point arithmetic (which is far slower)

Let's do an example in base 10. Imagine that you only can multiply and divide by 10 (shifting numbers left and right) and we want to divide a number by 30. By shifting we can only divide by 10, 100, 1000, etc

But we have that: Value/30 = value * 1/3 * 1/10

Given that, represented as an integer, 1/3 would produce 0 we can "scale" it by multiplying by a large constant that later, once we are done, we divide by to get the value we're after. Given that the easiest for us is to multiply/divide by 10, we can "scale" 1/3 and make it 100000/3 which approximately equals 33333, which is a nice integer value. We would want to make this value as large as it fits in our registers in order to be as precise as possible. The bigger it is the more precision it will retain for subsequent operations.

Value/30 = ( Value*33333 ) / 1000000

Hence, we now have a clue now of where that 0x66666667 value might be coming from. Given that the processor works in base 2. We can assume that it's going to prefer multiples of 2. Also, given that it will try to obtain the largest value that fits in a 32bit register, that gives us an idea of the range of the power-of-two in use. We can get there with a bit of trial and error (We want to obtain an integer as a result of dividing a power of two by 0x66666667).

2.0^33/0x66666667 = 4.9999999982537702 ~= 5

Therefore:

0x66666667 ~= 2^33/5

So, in the end we get to

( [ (Value * 2^33)/5] /2^32 ) / 2^3

And with some algebra it simplifies to:

Value / (5*2^2) = Value/20

Effectively dividing the value by 20, without actually using the division instruction. That's to the extent that compilers will go to avoid using the division instruction...

Reverse engineering is fun isn't it?

Update: Given that this is a relatively old and well known optimization strategy it's only natural that it had been discussed before. It was just brought to my attention that Ilfak had blogged about a similar optimization and this chapter (PDF) from Hacker's Delight provides more details.

Saturday, January 20, 2007

BinNavi's basic block handling

A while back I talked about the problem of highly optimized code and the resulting problems when we want to store it in a database while allowing for all possible constructs.
In this post I'll show how the recently released BinNavi 1.2 handles some cases where code is shared among functions and basic blocks exhibit non-typical characteristics.

Note: I could not get the Microsoft or Intel compilers to produce compiled code with functions sharing basic blocks as an example for this post. Using PGO (Profiling Guided Optimization) and other optimizations proved fun but the furthest I could go was producing multi-chunked functions. I might need to play more with it in order to get chunks to be shared... Anyway, I just picked a Microsoft DLL for the example. Specifically wkssvc.dll.

In this DLL there are two functions sharing some blocks on their exit paths. The shared code is shown in green in this graphic exported from BinNavi.





Interestingly enough, those blocks share exactly the same code but one function has six of them while the other has seven. I previously commented on the issue here and happends because of a reference in one of the functions that targets the shared code and causes a block to split.

Here one can see the shared code grouped and highlighted. Click on the image for a larger view.



Now a zoomed version of the block on the left:



And the one on the right:



Within the database, BinNavi handles this by allowing instructions to belong to multiple basic blocks, as well as basic blocks to belong to multiple functions. And it just works...

Friday, December 08, 2006

Simply blocks, basically...

A few days ago I bumped into something I was not really counting on seeing. Compiler optimizations have surely gone a long way.

Some background first...

After having seeing functions split over non-contiguous basic blocks for some time now, it was quite natural to think that some of those basic blocks could be shared among functions ( obviously, only the ones leading to the function exit points as once the shared code is reached, there's no way of getting flow back to basic blocks not shared by those functions).

Then we have that functions can be split with their blocks in different parts of a binary and some of those blocks shared. The reason for the splitting comes from doing profiling in normal use-cases of the applications and trying to group frequently accessed code into as few pages of the executable as possible, so that a minimum set of those need to be mapped at one time in memory. Only when infrequently visited code is reached some new pages new to be mapped. The following figures illustrates the concept.

UPDATE: Just got told that the reason for the splitting is more likely to be there to take advantage of the internal CPU instruction cache than of memory paging. Keeping the frequently traversed code together will result in less instructions being fetched from RAM (slower) for that code area. Also will allow to fit more code in the code-cache by moving away the less used blocks.

Here we can see the blocks being laid out continuously in memory. As can be normally seen in non-optimized code.

Unsplit functions

This would be how the same function would be laid out if profiling information is incorporated, so that frequently traversed paths are together within the code (in the same memory page if possible, in order to reduce memory footprint and paging).

Split functions


Once one has the splitting, the idea of sharing comes naturally.

This results in that, from the disassembler point of view, one has to allow for those chunks and also for those chunks to be assigned to an arbitrary number of "owning" or parent functions.

Shared blocks

What is more interesting, and the subject of this post, is the fact that instructions can also belong to different basic blocks. At least, under one view. This arises from cases where extensive optimizations are used.

A couple of days ago I was looking into an optimized binary (the craziest I have seen in a while) and how it was mapping into the SQL representation we are using at Sabre, there were some problems when exporting the information from IDA. (IDA can't really handle too well (yet) heavily-"chunked" code, so I have to account for that and build intelligence that analyzes the code for cases like the one I'm discussing here)

The problem was with two functions sharing a number of basic blocks, the funny side was that, depending which function one analyzes the flow among the shared blocks will look different. And the cause is fairly obvious too once one realizes why the problem appears.

A conditional branch from the non-shared code in one of the functions targeting the shared code will cause a split in the flow. A split which is not present from the other function's point of view. The following figure shows the result of a branch into shared code from only one of the sharing functions.

Evil branch

There are two solutions for this problem. One would be to represent the same basic blocks all over the binary, which would introduce a non-natural spit in a function, the other way would be allowing to have different "views" of the code, using the basic blocks simply as a representation of the underlying model (the disassembled instructions), so that different basic blocks would contain the same instructions and those basic blocks would accurately represent the flow in the two functions...
In the next figure, the colored basic blocks contain the same instructions in both functions, but the flow is different because of the branching.

Different basic blocks, same code

I'm leaning towards the second approach (the one in the previous figure), our SQL schema should support it trivially, which is fairly neat.