ASM-code size optimization tricks


:::[ CONTENTS ]:::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::

#############################################
# --[ ASM-code size hand optimization tricks
#############################################

# ---[ Purpose / use cases
===================//===
- Modifying compiled binaries when only a limited amount of space is available for the modified code.
- Developing shellcode for 0-day exploits, where the size of the shellcode is again limited.
- In both cases the space available is usually very modest, so the job has to be done in as few bytes as possible.
- There are many tricks that can be used to reduce the code size even further; the common ones are collected below.

# ---[ Zeroing of CPU registers
===========================//===

mov eax,0 ; 5 bytes -> B8 00 00 00 00
xor eax,eax ; 2 bytes -> 33 C0
sub eax,eax ; 2 bytes -> 2B C0
and eax,0 ; 3 bytes -> 83 E0 00

Even the simplest operation, loading zero with mov, takes 5 bytes, because the constant is encoded as a full 32-bit immediate.
Using xor (or sub) on the register itself does the same job in 2 bytes. Unlike mov, these instructions also modify the CPU flags.
The value 0 is often passed as a default parameter to WinAPI functions.

# ---[ Example: standard version of code
=========================================//===

push    offset szSansSerif      ; lpFace                        ; 5 bytes
push    0                       ; pitch and family              ; 2 bytes
push    0                       ; output quality                ; 2 bytes
push    0                       ; clipping precision            ; 2 bytes
push    0                       ; output precision              ; 2 bytes
push    1                       ; char set identifier           ; 2 bytes
push    0                       ; strikeout attribute flag      ; 2 bytes
push    1                       ; underline attribute flag      ; 2 bytes
push    0                       ; italic attribute flag         ; 2 bytes
push    400                     ; font weight(normal)           ; 5 bytes
push    0                       ; base-line orientation angle   ; 2 bytes
push    0                       ; angle of escapement           ; 2 bytes
push    0                       ; logical average character     ; 2 bytes
push    0Dh                     ; logical height of font        ; 2 bytes
call    CreateFontA

>> Above, the CreateFontA parameters take 34 bytes in this case.

# ---[ Size optimized version
=========================================//===

sub     eax,eax                                                 ; 2 bytes
push    offset szSansSerif      ; lpFace                        ; 5 bytes
push    eax                     ; pitch and family              ; 1 byte
push    eax                     ; output quality                ; 1 byte
push    eax                     ; clipping precision            ; 1 byte
push    eax                     ; output precision              ; 1 byte
push    1                       ; char set identifier           ; 2 bytes
push    eax                     ; strikeout attribute flag      ; 1 byte
push    1                       ; underline attribute flag      ; 2 bytes
push    eax                     ; italic attribute flag         ; 1 byte
push    400                     ; font weight(normal)           ; 5 bytes
push    eax                     ; base-line orientation angle   ; 1 byte
push    eax                     ; angle of escapement           ; 1 byte
push    eax                     ; logical average character     ; 1 byte
push    0Dh                     ; logical height of font        ; 2 bytes
call    CreateFontA

>> Above, the CreateFontA parameters take 27 bytes, a small profit of 7 bytes.

# ---[ Passing series of the same values
=========================================//===

When the same value is passed to a function several times, it is usually done like this:

push    0               ; 2 bytes
push    0               ; 2 bytes
push    0               ; 2 bytes
push    0               ; 2 bytes
push    0               ; 2 bytes
push    0               ; 2 bytes
push    0               ; 2 bytes
================================
                    = 14 bytes

Or more size optimized, like this:

sub     eax,eax         ; 2 bytes
push    eax             ; 1 byte
push    eax             ; 1 byte
push    eax             ; 1 byte
push    eax             ; 1 byte
push    eax             ; 1 byte
push    eax             ; 1 byte
push    eax             ; 1 byte
===============================
                    = 9 bytes

But it can be size-optimized further using a simple loop (note that the loop leaves ecx set to 0):

sub     eax,eax         ; 2 bytes

push    7               ; 2 bytes
pop     ecx             ; 1 byte

@save_args:
push    eax             ; 1 byte
loop    @save_args      ; 2 bytes
================================
                    = 8 bytes
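
The three variants above can be compared for any number of arguments. What follows is only a minimal Python sketch: the function names are made up for illustration, the per-instruction byte counts are the ones listed above, and the loop form assumes the count fits the short push (at most 127).

```python
def size_push_imm(n):
    """n times 'push 0' (2 bytes each)."""
    return 2 * n

def size_push_reg(n):
    """'sub eax,eax' (2 bytes) followed by n times 'push eax' (1 byte each)."""
    return 2 + n

def size_push_loop(n):
    """sub eax,eax + push n + pop ecx + push eax + loop: constant size."""
    if not 1 <= n <= 127:
        raise ValueError("count must fit the short 'push imm8' form")
    return 2 + 2 + 1 + 1 + 2

# Print the size of each variant for a few argument counts.
for n in (3, 6, 7, 12):
    print(n, size_push_imm(n), size_push_reg(n), size_push_loop(n))
```

Under these counts the loop only pays off from 7 identical arguments upwards; with 6 it ties with the "push eax" version, and below that it is larger.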

# ---[ Zeroing the EDX register
=========================================//===

The goal is to zero the EDX register. Normally we would use "xor edx,edx" (2 bytes), but there is a shorter way:
the cdq instruction (it stands for Convert Doubleword to Quadword).

cdq = fills every bit of the edx register with the sign bit of the eax register
(the sign bit is the most significant bit of the register value, so in this case it's bit 31).

So if we know that eax holds e.g. 1, then executing the cdq instruction will reset edx to zero.

In short: to get EDX = 0 with cdq, you must be sure EAX is non-negative (bit 31 clear) before executing it.

*Example:

xor eax, eax   ; EAX = 0
cdq            ; EDX = 0 (because EAX is non-negative)

mov eax, 123   ; any positive value
cdq            ; EDX = 0

*Important example insight:

- EDX becomes 0FFFFFFFFh (-1) if EAX is negative.

mov eax, -1
cdq            ; EDX = 0xFFFFFFFF, not zero

If you are not sure about the content of the eax register (for example, after a function call), you shouldn't use cdq for zeroing, because it can lead to errors:

eax=80000001h = 10000000000000000000000000000001b
                ^ most significant bit of the EAX register is set to 1

Executing cdq here fills edx with the sign bit of eax, which is 1, so edx will hold 0FFFFFFFFh.

The cdq instruction takes only one byte (99h).
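
The behaviour of cdq described above can be modelled in a few lines. This is only an illustrative Python sketch (the function name cdq is reused here for the model, it is not a real API), treating eax as an unsigned 32-bit number:

```python
def cdq(eax):
    """Return the value cdq leaves in edx for a given 32-bit eax."""
    eax &= 0xFFFFFFFF                  # keep only the low 32 bits
    sign = (eax >> 31) & 1             # bit 31 is the sign bit
    return 0xFFFFFFFF if sign else 0   # edx = sign bit copied into all 32 bits

# Show edx for a few non-negative and negative eax values.
for value in (0, 123, 0x7FFFFFFF, 0x80000001, 0xFFFFFFFF):
    print(f"eax={value:08X}h -> edx={cdq(value):08X}h")
```

Any eax value with bit 31 clear yields edx = 0; everything from 80000000h upwards yields 0FFFFFFFFh.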

# ---[ Transferring 32-bit values from 0-255 range to the CPU registers
================================================================//===

mov  eax,7Fh                 ; B8 7F 00 00 00 - 5 bytes

sub  eax,eax                 ; 2B C0          - 4 bytes in total
mov  al,7Fh                  ; B0 7F

push 7Fh                     ; 6A 7F          - 3 bytes in total
pop  eax                     ; 58

It is often necessary to load a value from the 0-255 range into a 32-bit register. We can do it like this:

mov     eax,4           ; B8 04 00 00 00 - 5 bytes

This instruction takes 5 bytes, because the value 4 is encoded as a full 32-bit immediate that needs 4 bytes.
The smallest solution is to "push" the value onto the stack and "pop" it back into the CPU register:

push    4                       ; 6A 04 - 3 bytes in total
pop     eax                     ; 58

This time it takes only 3 bytes. Even though it takes up more space in the source code, it takes up fewer bytes in the compiled binary!

It should be mentioned that the assembler emits the short form of the push instruction (opcode 6Ah) only if the value fits in a signed byte, i.e. it lies between -128 and 127.

push -127 <- the short form of the push instruction is used for negative values in that range too

The short form can also be forced with a helper macro:

pushb   macro   byteval
db      06Ah,byteval
endm

pushb   080h    ; emit the short form with the byte 80h (128)
pop     eax

After these instructions are executed, eax holds 0FFFFFF80h (-80h). But why not 00000080h?

In the short form of the push instruction the immediate byte is sign-extended, so byte values in the range 128-255 are treated as negative numbers.

The sign bit (bit 7) of the encoded byte is copied to all the upper bits of the pushed value:

00000000 00000000 00000000 10000000 = 00000080h
                           ^sign bit of the encoded byte

11111111 11111111 11111111 10000000 = FFFFFF80h
                           ^sign bit copied into the upper 24 bits
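
The sign extension shown in the diagram can be reproduced with a small model. This is a hedged Python sketch, not assembler output: push_imm8_pop is a hypothetical helper name that mimics "db 06Ah,byteval" followed by "pop eax".

```python
def push_imm8_pop(byteval):
    """Value a 32-bit register holds after 'db 6Ah,byteval' + 'pop reg'."""
    if not 0 <= byteval <= 0xFF:
        raise ValueError("byteval must be a single byte")
    if byteval & 0x80:                 # sign bit (bit 7) is set
        return 0xFFFFFF00 | byteval    # upper 24 bits are filled with ones
    return byteval                     # upper 24 bits stay zero

# Values up to 7Fh come back unchanged, values from 80h come back sign-extended.
for b in (0x04, 0x7F, 0x80, 0xFF):
    print(f"{b:02X}h -> {push_imm8_pop(b):08X}h")
```

Only bytes in the 0-127 range come back unchanged, which is why the push/pop trick is safe for those values and not for 128-255.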

There is another trick that makes the code a little shorter when you need to load a value from the 128-255 range, zero-extended, into a 32-bit register:

# Standard way:

mov     eax,255 ; B8 FF 00 00 00 - 5 bytes

# Size optimized way:

xor     eax,eax ; 33 C0 - 2 bytes
mov     al,255  ; B0 FF - 2 bytes (4 bytes in total)


# ---[ The use of error codes returned by functions
================================================================//===

This is another trick often overlooked by HLL compilers.

Functions by definition return some values. In the case of WinAPI functions, the returned value is always stored in the eax register.

Depending on the function, the returned values differ: it could be 0, -1, a file handle, etc.

For example, the CreateFileA function returns -1 (INVALID_HANDLE_VALUE) in the eax register when it fails, e.g. when we don't have access to the file we wanted to open.

But another WinAPI function, like CreateIcon, returns 0 in eax if there is an error.

Once we have checked the MSDN documentation for a function, we can use those values to our advantage:

push    ...
call    LoadBitmapA

The documentation of the LoadBitmapA function says it returns the handle to the bitmap on success and 0 on error.

push    ...
call    LoadBitmapA
cmp     eax,0           ; 83 F8 00
jz      @error

The cmp eax,0 instruction takes 3 bytes. Can we do better? Of course, by using logical operations like "or" or "test":

call    LoadBitmapA
or      eax,eax         ; 0B C0
jz      @error

or:

call    LoadBitmapA
test    eax,eax         ; 85 C0
jz      @error

Both the or and test instructions set the CPU zero flag if the eax register holds 0,
which gives the same result as the cmp eax,0 instruction while being 1 byte smaller in the output code.

We can optimize it even further by using xchg instruction:

call    LoadBitmapA
xchg    eax,ecx         ; 1 byte
jecxz   @error          ; jecxz instruction takes 2 bytes (the same as jxx short range branches)

The jecxz instruction jumps to the given label if the ecx register is 0. Note that after the xchg the returned value is in ecx, not in eax.

But there is a catch! jecxz exists only as a short jump: the destination must lie within -128 to +127 bytes
of the end of the instruction in the compiled code.

So if the destination, in our case the @error label, is further away than that, you will get an error message from the assembler.

Some assemblers, like the old-school TASM, will automatically translate a jecxz whose destination is out of range to:

    call    LoadBitmapA
    xchg    eax,ecx
    jecxz   @dummy

    jmp     @next

@dummy:
    jmp     @error

@next:

Many WinAPI functions return -1 (0FFFFFFFFh) on error. How can we check for it? The simplest way is of course:

call    CreateFileA
cmp     eax,-1          ; 83 F8 FF
je      @error

We can get the same result using much more size optimized code:

call    CreateFileA
inc     eax             ; if -1 was returned, the inc instruction will set the EAX register to 0
je      @error          ; and we can detect it with a conditional JE/JZ instruction
dec     eax             ; if there wasn't an error, restore the originally returned value

In this case, the resulting code (4 bytes) is 1 byte smaller than the one using cmp eax,-1 (5 bytes).
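
The inc/je/dec sequence can be checked against a simple model of 32-bit arithmetic. This is a minimal Python sketch under the assumption that only the zero flag matters here; inc32, dec32 and check_minus_one are hypothetical names used for illustration.

```python
MASK = 0xFFFFFFFF

def inc32(value):
    """32-bit inc: returns (result, zero_flag)."""
    result = (value + 1) & MASK
    return result, result == 0

def dec32(value):
    """32-bit dec: returns (result, zero_flag)."""
    result = (value - 1) & MASK
    return result, result == 0

def check_minus_one(eax):
    """Mirror of: inc eax / je @error / dec eax."""
    eax, zf = inc32(eax)
    if zf:                      # je @error is taken, eax is now 0
        return "error", eax
    eax, _ = dec32(eax)         # restore the returned value
    return "ok", eax

print(check_minus_one(0xFFFFFFFF))
print(check_minus_one(0x00000124))
```

The only input that wraps around to 0 is 0FFFFFFFFh, so no valid return value is mistaken for an error. On the error path eax is left at 0 rather than -1, which matters if the error handler inspects it.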

# ---[ Exchanging CPU registers values
================================================================//===

Say a value of 4 is stored in the eax register and a value of 98 in the edx register. How do we exchange those two registers?

push    eax <- 4 bytes in total
push    edx
pop     eax
pop     edx

mov     ebx,eax <- 6 bytes in total, even bigger (and it clobbers ebx)
mov     eax,edx
mov     edx,ebx

xor     edx,eax <- clever trick with the logical "xor" instruction, but still 6 bytes
xor     eax,edx
xor     edx,eax

xchg    eax,edx         ; 92h (1 byte)

xchg    edx,esi         ; 87h 0D6h (2 bytes)

The xchg instruction takes only 1 byte in output code, but only if 
one of the exchanged registers is eax. Otherwise it's encoded as 2 bytes.

Many other instructions are also smaller if you use the eax register:

add     edi,400000h     ; 6 bytes -> 81 C7 00 00 40 00
add     eax,400000h     ; 5 bytes -> 05 00 00 40 00

So it's the same add instruction, but if eax is the destination
the output code is 1 byte smaller. Keep that in mind.

# ---[ CPU string instructions
================================================================//===

There is a separate set of string instructions in x86 CPUs. They address memory implicitly through the esi and edi registers.

Some of those instructions are rarely used by modern compilers, 
but they have one advantage to us - the size of the output code.

Let's look at this example. We have a simple loop and after each
iteration, we increase the value of the esi pointer by 4.

_loop_label:
    ...
    ...
    ...
    add     esi,4
    loop    _loop_label

Easy & simple. But the:

add esi,4               ; 83 C6 04

instruction takes 3 bytes. We can use the string instruction
"lodsd" to make our code shorter, because it advances esi in the same way (as long as DF = 0):

lodsd                   ; AD     = add esi,4
lodsw                   ; 66 AD  = add esi,2
lodsb                   ; AC     = add esi,1

There are 3 variants of this instruction, operating on 32-bit, 16-bit and 8-bit values. Besides advancing esi, each one loads the value at [esi]:

lodsd                   ; mov eax,dword ptr[esi]
                        ; add esi,4

lodsw                   ; mov ax,word ptr[esi]
                        ; add esi,2

lodsb                   ; mov al,byte ptr[esi]
                        ; inc esi

_loop_label:
    ...
    ...
    ...
    lodsd                   ; mov eax,dword ptr[esi]
                            ; add esi,4
    loop    _loop_label

So we can use it as a short version of the add esi,4 instruction. Just keep in mind that it reads memory through the pointer
in the esi register (so esi cannot hold an arbitrary value, it must point to some readable data) and that it overwrites the eax register.

If you need to preserve the value of the eax register you can do it like this:

_loop_label:
    ...
    push    eax

    lodsd

    pop     eax

    loop    _loop_label

There is also the scasX instruction. It compares the value in the eax register with the value pointed to by
the edi register and then increases edi (if the direction flag DF is 0, which the cld instruction ensures) or
decreases it (if DF is 1, which the std instruction sets).
It also comes in 3 variants for 32-bit, 16-bit and 8-bit comparisons. In order to
use it, you need to make sure the edi register points to a valid data buffer, so again it cannot be
an arbitrary value, because the read will end with an exception (access violation). It also changes the CPU flags.

So if the register you want to increase is edi, instead of this:

add     edi,4           ; 83 C7 04

it's better to use:

scasd                   ; AF
scasw                   ; 66 AF
scasb                   ; AE

and it works like this:

scasd                   ; cmp eax,dword ptr[edi]
                        ; add edi,4

scasw                   ; cmp ax,word ptr[edi]
                        ; add edi,2

scasb                   ; cmp al,byte ptr[edi]
                        ; inc edi

The CPU direction flag decides whether the value of the edi register is increased or decreased:

std                     ; set DF (Direction Flag), 1 byte
scasd                   ; cmp eax,dword ptr[edi]
                        ; sub edi,4

Keep in mind that the direction flag (DF) is clear when the application starts,
at least for Windows PE executables, and it is also expected to be clear whenever a WinAPI function is called.

So if you ever set it with the "std" instruction, make sure to clear it again afterward with "cld",
otherwise you might end up with hard-to-find bugs caused by library or OS code that assumes DF = 0.

std                     ; set DF (Direction Flag), 1 byte
lodsd                   ; mov eax,dword ptr[esi]
                        ; sub esi,4
...
...
cld                     ; restore DF to its expected default state
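
The effect of the direction flag on lodsd can be sketched with a tiny memory model. This is an illustrative Python sketch only: memory is a bytes object standing in for the data buffer, esi is an offset into it, and the lodsd function is a made-up helper, not a real API.

```python
import struct

def lodsd(memory, esi, df=0):
    """Model of lodsd: returns (eax, new_esi)."""
    if esi < 0 or esi + 4 > len(memory):
        # Stand-in for the access violation a bad pointer would cause.
        raise IndexError("esi does not point to 4 readable bytes")
    (eax,) = struct.unpack_from("<I", memory, esi)   # mov eax,dword ptr[esi]
    esi = esi - 4 if df else esi + 4                  # DF=1: sub esi,4 / DF=0: add esi,4
    return eax, esi

# Three little-endian dwords laid out one after another.
memory = struct.pack("<3I", 0x11111111, 0x22222222, 0x33333333)

print(lodsd(memory, 0))          # forward walk, as after cld
print(lodsd(memory, 8, df=1))    # backward walk, as after std
```

With DF clear the pointer walks forward through the buffer, with DF set it walks backward; in both cases eax is overwritten with the dword that was read.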


#############################################
# --[ Win32 Shellcode Stuff
#############################################

int 0x2e <- System Calls

There is an inherent problem with using kernel32.dll from shellcode: kernel32.dll is not guaranteed to be
loaded at the same address on every version of Windows.
Addresses of functions in kernel32.dll therefore cannot be hardcoded in shellcode without giving up reliability.
Many current implementations of Windows shellcode make the mistake of hardcoding addresses into the code itself.
There are ways to find the base address of kernel32.dll without hardcoding any addresses at all.



#############################################
# --[ Extra
#############################################

MMX, SSE, AVX <- for speed optimization