Inspecting the Implementation of QEMU: How QEMU Handles Self-modifying Code

2022. 4. 19. 00:22ㆍResearch

A few weeks ago, I was trying to solve "malware" on pwnable.kr. I already knew that QEMU uses a compiler-like technology called TCG, so I could easily infer that there might be timing issues related to code modification at runtime, and I solved it rather "easily". However, I wanted to find out why it behaves that way. In other words, I wanted to see with my own eyes how this works in the real QEMU code.

With that in mind, I quickly searched for the QEMU repository on Google. I opened the repository and enthusiastically started analyzing the code of QEMU... was the plan, but my laziness yelled at me to immediately close the repository and open Twitch.

Several weeks have passed since then, but the curiosity seems to have been waiting for the right time to show up again and quench the thirst for knowledge. That time finally came this weekend.

0. Synopsis

QEMU is an emulator: it emulates software built for a target architecture so that the software can run even if the target architecture differs from the host architecture. QEMU is powerful enough to emulate not only a single executable, but also a full operating system kernel and, of course, the applications executed on top of the emulated operating system.

QEMU analyzes the assembly inside the software and translates it into host architecture assembly, or in some special cases simply interprets it. To do this, QEMU uses an older technology called Tiny Code Generator (TCG). The overall flow is to first translate the target assembly into the TCG Intermediate Representation (IR), and then translate the IR into host assembly where applicable. They say that TCG began with the aim of becoming the backend of the gcc compiler^[footnote:1]. I guess that is why the overall picture of this technology is so similar to the overall picture of modern compilers.

For most applications, once the code is compiled into a binary, the code does not change. In that case, QEMU could translate each piece of code into host assembly once, and TCG would never need to run on it again until the process dies. With self-modifying code, however, the memory that stores the code may change, which means the code itself has changed. If QEMU executes code in that region again, it must retranslate it to preserve the correct behavior of the emulated program. This is the origin of the question: how does QEMU handle this in the real code?

Changed code must be retranslated in self-modifying code

1. QEMU Introduction

QEMU is software that emulates various kinds of software. Since it works by emulation, it has advantages and disadvantages. A big advantage of QEMU is that a broader range of applications can be executed on the same machine, because it lets you get over the hurdle of architecture differences. For example, you can execute ARM applications on an Intel or AMD CPU. However, this is only part of the picture. Since QEMU has to translate across architectural differences, it is intrinsically slower than running native instructions. You can think of it as hiring an interpreter to get over a language barrier, as when two countries that speak different languages hold a summit. This performance penalty applies even when the target architecture is the same as the host architecture. So if possible, you might as well execute the application directly on your machine.

In this article, we are going to take a look at the actual implementation of QEMU. This is possible because QEMU is thankfully open source, licensed mainly under GPLv2, though the licensing of specific files can vary. Here, we will use the code in the QEMU GitHub repository. The version that we will use is v4.2.0. If you would like, you can browse the files by opening the commit tagged as v4.2.0.

2. QEMU Tiny Code Generator

QEMU Tiny Code Generator, QEMU TCG for short, is the translator at the core of overcoming architectural differences. As in the picture above, TCG translates the target architecture assembly into host architecture assembly. We are not going to go through the TCG internals thoroughly, but we may take a glimpse at them later in this article.

QEMU TCG tries to minimize the overhead of translating code by using self-modifying code. Self-modifying code is code that modifies or produces other code to execute. This means that the program itself can generate new code or functions to execute at runtime.

Let's look at an example by actually executing QEMU. I won't show you how to obtain the binary. Just get the source code from the QEMU GitHub repository, or install it using your Linux distro's package manager.

QEMU can emit logs as specified by the user. In the qemu help message below, we can see that qemu provides a '-d' option that enables logging of the specified items. It says that we can get a list of the items with the '-d help' option. Let's do it right away.

user@DESKTOP-4E0H5D7:~/qemu_analysis$ ./qemu-x86_64 -h
usage: qemu-x86_64 [options] program [arguments...]
Linux CPU emulator (compiled for x86_64 emulation)

Options and associated environment variables:

Argument             Env-variable      Description
-h                                     print this help
-help
-g port              QEMU_GDB          wait gdb connection to 'port'
-L path              QEMU_LD_PREFIX    set the elf interpreter prefix to 'path'
-s size              QEMU_STACK_SIZE   set the stack size to 'size' bytes
-cpu model           QEMU_CPU          select CPU (-cpu help for list)
-E var=value         QEMU_SET_ENV      sets targets environment variable (see below)
-U var               QEMU_UNSET_ENV    unsets targets environment variable (see below)
-0 argv0             QEMU_ARGV0        forces target process argv[0] to be 'argv0'
-r uname             QEMU_UNAME        set qemu uname release string to 'uname'
-B address           QEMU_GUEST_BASE   set guest_base address to 'address'
-R size              QEMU_RESERVED_VA  reserve 'size' bytes for guest virtual address space
-d item[,...]        QEMU_LOG          enable logging of specified items (use '-d help' for a list of items)
-dfilter range[,...] QEMU_DFILTER      filter logging based on address range
-D logfile           QEMU_LOG_FILENAME write logs to 'logfile' (default stderr)
-p pagesize          QEMU_PAGESIZE     set the host page size to 'pagesize'
-singlestep          QEMU_SINGLESTEP   run in singlestep mode
-strace              QEMU_STRACE       log system calls
-seed                QEMU_RAND_SEED    Seed for pseudo-random number generator
-trace               QEMU_TRACE        [[enable=]<pattern>][,events=<file>][,file=<file>]
-version             QEMU_VERSION      display version information and exit

The '-d help' output, shown below, again lists lots of items to log. Since we want to look at the translation process, we will pass four items: in_asm, op, op_opt, and out_asm. We need in_asm and out_asm to look at the translation input and output. In addition, QEMU uses an intermediate representation, as I described before. The intermediate representation ops, IRs for short, are called "micro ops" here and can be logged with the op item. We will also look at the IRs after optimization with the op_opt item, which shows the IR sequence right before it gets translated into the output assembly.

user@DESKTOP-4E0H5D7:~/qemu_analysis$ ./qemu-x86_64 -d help
Log items (comma separated):
out_asm         show generated host assembly code for each compiled TB
in_asm          show target assembly code for each compiled TB
op              show micro ops for each compiled TB
op_opt          show micro ops after optimization
op_ind          show micro ops before indirect lowering
int             show interrupts/exceptions in short format
exec            show trace before each executed TB (lots of logs)
cpu             show CPU registers before entering a TB (lots of logs)
fpu             include FPU registers in the 'cpu' logging
mmu             log MMU-related activities
pcall           x86 only: show protected mode far calls/returns/exceptions
cpu_reset       show CPU state before CPU resets
unimp           log unimplemented functionality
guest_errors    log when the guest OS does something invalid (eg accessing a
                non-existent register)
page            dump pages at beginning of user mode emulation
nochain         do not chain compiled TBs so that "exec" and "cpu" show
                complete traces
trace:PATTERN   enable trace events

Use "-d trace:help" to get a list of trace events.

I will compile this hello world example and execute it.

#include <stdio.h>

int main(){
	printf("Hello, World!\n");
}

Now let's really execute it! Note that this emits lots of logs, so you may want to redirect them to a file.

user@DESKTOP-4E0H5D7:~/qemu_analysis$ ./qemu-x86_64 -d in_asm,op,op_opt,out_asm -D hello.qemulog hello
Hello, World!

I used the -D option to redirect the log to the file "hello.qemulog". Let's look at the file. We don't need to look at all of it. We will just try to find the main function. To find the main function accurately, we first dump the emulated binary and take a look at how the main function is compiled.

user@DESKTOP-4E0H5D7:~/qemu_analysis$ objdump -d hello

hello:     file format elf64-x86-64


Disassembly of section .init:

0000000000001000 <_init>:
    1000:       f3 0f 1e fa             endbr64
    1004:       48 83 ec 08             sub    $0x8,%rsp
    1008:       48 8b 05 d9 2f 00 00    mov    0x2fd9(%rip),%rax        # 3fe8 <__gmon_start__>
    100f:       48 85 c0                test   %rax,%rax
    1012:       74 02                   je     1016 <_init+0x16>
    1014:       ff d0                   callq  *%rax
    
...
    
0000000000001149 <main>:
    1149:       f3 0f 1e fa             endbr64
    114d:       55                      push   %rbp
    114e:       48 89 e5                mov    %rsp,%rbp
    1151:       48 8d 3d ac 0e 00 00    lea    0xeac(%rip),%rdi        # 2004 <_IO_stdin_used+0x4>
    1158:       e8 f3 fe ff ff          callq  1050 <puts@plt>
    115d:       b8 00 00 00 00          mov    $0x0,%eax
    1162:       5d                      pop    %rbp
    1163:       c3                      retq
    1164:       66 2e 0f 1f 84 00 00    nopw   %cs:0x0(%rax,%rax,1)
    116b:       00 00 00
    116e:       66 90                   xchg   %ax,%ax

...

Looking at the main function, we can see that the assembly sequence is straightforward. First, since the format string does not contain a format specifier, i.e., no formatting happens, the printf call is replaced with a call to puts. This is a really basic optimization these days; I didn't even pass any optimization options, but it happens automatically. Second, the argument is loaded using the lea instruction. Since it is a constant string, it is just a pointer into a part of the binary file, so rip-relative addressing with lea is sufficient. Third, the return value of main, 0, is loaded into the return value register using the mov instruction. Lastly, the remaining code at the start and end is the stack frame setup and cleanup code.

Now let's look at the log from QEMU. We will follow the movq instruction on the second line of the main function assembly below.

----------------
IN: main

...

0x400000114d:  55                       pushq    %rbp
0x400000114e:  48 89 e5                 movq     %rsp, %rbp
0x4000001151:  48 8d 3d ac 0e 00 00     leaq     0xeac(%rip), %rdi
0x4000001158:  e8 f3 fe ff ff           callq    0x4000001050

We can easily match the log to the input assembly because the QEMU log kindly tells us the input assembly address. The OP section below says that instead of moving the value directly from rsp to rbp, the value goes through a temporary register, tmp0.

OP:

...

---- 000000400000114e 0000000000000000
 mov_i64 tmp0,rsp
 mov_i64 rbp,tmp0

...

We can see that after optimization, it is optimized to just directly move the value from rsp to rbp.

OP after optimization and liveness analysis:

...

---- 000000400000114d 0000000000000000
 mov_i64 tmp0,rbp
 movi_i64 tmp13,$0x8
 sub_i64 tmp2,rsp,tmp13
 qemu_st_i64 tmp0,tmp2,leq,0
 mov_i64 rsp,tmp2

---- 000000400000114e 0000000000000000
 mov_i64 rbp,rsp                          sync: 0  dead: 0  pref=0xffff
 
...

Now we need to think a bit to find which instructions are the output of the IR above. I can tell you that the guest rsp value is held in the host register rbx. How do I know that? To explain this, I included the IR of the preceding pushq instruction above. It loads the immediate value 8 and then subtracts it from rsp. This subtraction corresponds to the subq instruction in the output assembly, which operates on rbx right after rbx is loaded from 0x20(%rbp). So the rsp value is in the rbx register. Next, we can see that two consecutive mov_i64 instructions in the IR above store the same value to rsp and then to rbp. In the output assembly below, they appear as two consecutive movq instructions that store rbx to 0x20(%rbp) and 0x28(%rbp), which appear to be the memory slots holding the guest rsp and rbp. That part is the final output. You can also see that the instructions are interleaved. It seems that some additional optimization takes place while emitting the host assembly.

OUT: [size=117]

...

0x557ff4fb1b4b:  48 8b 5d 20              movq     0x20(%rbp), %rbx
0x557ff4fb1b4f:  48 83 eb 08              subq     $8, %rbx
0x557ff4fb1b53:  4c 8b 65 28              movq     0x28(%rbp), %r12
0x557ff4fb1b57:  4c 89 23                 movq     %r12, (%rbx)
0x557ff4fb1b5a:  48 89 5d 20              movq     %rbx, 0x20(%rbp)
0x557ff4fb1b5e:  48 89 5d 28              movq     %rbx, 0x28(%rbp)
0x557ff4fb1b62:  49 bc 04 20 00 00 40 00  movabsq  $0x4000002004, %r12
0x557ff4fb1b6a:  00 00

...

3. Finding the QEMU Emulation Main Loop

Now that we briefly understand how the IR translation works, we need to see where it is actually called in the QEMU implementation. To find the main function, I will use a little hack with objdump. The objdump tool has a '-l' option that shows the source line information that each piece of assembly originated from. Searching for the main function, we can see that it is in the linux-user/main.c file. Now we can take a look at the source code.

user@DESKTOP-4E0H5D7:~/qemu_analysis/qemu/build$ objdump -xd -l qemu-x86_64

...

0000000000072ed0 <main>:
main():
/home/user/qemu_analysis/qemu/linux-user/main.c:618
   72ed0:       f3 0f 1e fa             endbr64
   72ed4:       41 57                   push   %r15
   72ed6:       41 56                   push   %r14
   72ed8:       41 55                   push   %r13
   72eda:       41 54                   push   %r12
   72edc:       49 89 f4                mov    %rsi,%r12
   72edf:       55                      push   %rbp
   72ee0:       53                      push   %rbx
   72ee1:       48 81 ec 48 06 00 00    sub    $0x648,%rsp
   72ee8:       89 7c 24 08             mov    %edi,0x8(%rsp)

...

Finally, we can look at C code, which is much friendlier than assembly. QEMU won't exit until the emulated program requests it, so it must have some main loop that it keeps executing, perhaps forever if the program never exits. Keeping that in mind, the cpu_loop function call seems interesting.

int main(int argc, char **argv, char **envp)
{
    struct target_pt_regs regs1, *regs = &regs1;
    struct image_info info1, *info = &info1;
    struct linux_binprm bprm;
    TaskState *ts;
    CPUArchState *env;
    CPUState *cpu;
    int optind;
    char **target_environ, **wrk;
    char **target_argv;
    int target_argc;
    int i;
    int ret;
    int execfd;
    unsigned long max_reserved_va;

...

    cpu_loop(env);
    /* never exits */
    return 0;
}

We continue using the same kind of hack to find the source code.

00000000000eb510 <cpu_loop>:
cpu_loop():
/home/user/qemu_analysis/qemu/linux-user/x86_64/../i386/cpu_loop.c:85
   eb510:       f3 0f 1e fa             endbr64
   eb514:       41 55                   push   %r13
   eb516:       49 89 fd                mov    %rdi,%r13
   eb519:       41 54                   push   %r12
   eb51b:       55                      push   %rbp

We can see that this function lives at linux-user/x86_64/../i386/cpu_loop.c, which is linux-user/i386/cpu_loop.c. Taking a look at the source code, there is a loop in it. Now, we want to find where the TCG-related actions happen, so we want to go in further. The cpu_exec function seems interesting in that translation may happen during execution. Let's take a look at that function.

void cpu_loop(CPUX86State *env)
{
    CPUState *cs = env_cpu(env);
    int trapnr;
    abi_ulong pc;
    abi_ulong ret;
    target_siginfo_t info;

    for(;;) {
        cpu_exec_start(cs);
        trapnr = cpu_exec(cs);
        cpu_exec_end(cs);
        process_queued_cpu_work(cs);

        switch(trapnr) {

...

        }
        process_pending_signals(env);
    }
}

Looking at objdump, the cpu_exec function is located in accel/tcg/cpu-exec.c. Let's take a look at it.

00000000000b7960 <cpu_exec>:
cpu_exec():
/home/user/qemu_analysis/qemu/accel/tcg/cpu-exec.c:662
   b7960:       f3 0f 1e fa             endbr64
   b7964:       41 57                   push   %r15
   b7966:       41 56                   push   %r14
   b7968:       41 55                   push   %r13

It kindly says in the comment, "main execution loop". Jackpot! It is highly likely what we were looking for. We can see the struct type TranslationBlock, which sounds like a code compiling related type, like a basic block in the control flow graph. There are two interesting functions, tb_find and cpu_loop_exec_tb, which both contain the acronym "tb", which would be the acronym for "TranslationBlock". So it looks like we found the right place.

/* main execution loop */

int cpu_exec(CPUState *cpu)
{
    CPUClass *cc = CPU_GET_CLASS(cpu);
    int ret;
    SyncClocks sc = { 0 };

...

    /* if an exception is pending, we execute it here */
    while (!cpu_handle_exception(cpu, &ret)) {
        TranslationBlock *last_tb = NULL;
        int tb_exit = 0;

        while (!cpu_handle_interrupt(cpu, &last_tb)) {
            uint32_t cflags = cpu->cflags_next_tb;
            TranslationBlock *tb;

            /* When requested, use an exact setting for cflags for the next
               execution.  This is used for icount, precise smc, and stop-
               after-access watchpoints.  Since this request should never
               have CF_INVALID set, -1 is a convenient invalid value that
               does not require tcg headers for cpu_common_reset.  */
            if (cflags == -1) {
                cflags = curr_cflags();
            } else {
                cpu->cflags_next_tb = -1;
            }

            tb = tb_find(cpu, last_tb, tb_exit, cflags);
            cpu_loop_exec_tb(cpu, tb, &last_tb, &tb_exit);
            /* Try to align the host and virtual clocks
               if the guest is in advance */
            align_clocks(&sc, cpu);
        }
    }

    cc->cpu_exec_exit(cpu);
    rcu_read_unlock();

    return ret;
}

4. Constructing a Self-modifying Code

Now that we have found the entrance to code translation and execution, we need to get to the main goal of this article. By this time, you may have forgotten what that goal was. The title says "How QEMU Handles Self-modifying Code". So how does it? To find out, I am not going to just do more static analysis, blindly searching for another jackpot. So what will we do now? We will do some smart dynamic analysis.

To do dynamic analysis, we need a self-modifying program to run. Since really compiling something at runtime is overly complex for a simple payload, we will just write a short assembly function and copy its bytes around.

We make a simple assembly function that does a syscall to print a string. It can be written as follows.

mov edx, 6
mov esi, 0x20000
mov edi, 1
mov eax, 1
syscall
ret

This can be written equivalently in C as follows. It issues the write system call, which writes the 6 bytes at address 0x20000 to stdout. Also, I didn't use any position-dependent instructions, so this function can be copied to any address without any issues.

#include <unistd.h>

long foo(void) {
	/* 0x20000 is the address where the string is assumed to live */
	return write(STDOUT_FILENO, (void *)0x20000, 6);
}

For this code to work, the memory at address 0x20000 must be accessible. So we will mmap a page at address 0x20000 and copy "hello\n", which is 6 bytes, into it.

To embed this code into C code, we will get the byte string which encodes this assembly code using an online assembly tool. It kindly gives the byte string in C array initializer format, so we will use it right away.

Now we embed it in C code. We will mmap a writable and executable page, copy the assembly code there, and then call it as a function.

#include <stdlib.h>
#include <stdio.h>
#include <stdint.h>
#include <string.h>
#include <assert.h>
#include <sys/mman.h>
#include <time.h>

/*
 * assuming the string "hello\n" is placed at 0x20000
 * equivalent to: return write(1, (void *)0x20000, 6);
 */
uint8_t shellcode[] = { 
    0xBA, 0x06, 0x00, 0x00, 0x00, 
    0xBE, 0x00, 0x00, 0x02, 0x00, 
    0xBF, 0x01, 0x00, 0x00, 0x00, 
    0xB8, 0x01, 0x00, 0x00, 0x00, 
    0x0F, 0x05, 
    0xC3 
};

int main(){
        int ret;
        void *p = NULL;
        void (*shellcode_copied)(void);

        p = mmap((void *)0x20000UL, 0x1000, PROT_WRITE|PROT_READ|PROT_EXEC, MAP_ANONYMOUS|MAP_SHARED, -1, 0);
        if(p == MAP_FAILED){
                perror("mmap");
                return 0;
        }
        assert(p == (void *)0x20000UL);

        memcpy(p, "hello\n", 6);

        shellcode_copied = (void (*)(void))((uint8_t *)p + 0x100);

        for(int k=0;k<0x10;++k){
                memcpy(shellcode_copied, shellcode, sizeof(shellcode));
                shellcode_copied();
        }

        ret = munmap(p, 0x1000);
        if(ret){
                perror("munmap");
                return 0;
        }
}

Now we execute it natively. It seems to work. Next, let's run it under QEMU inside gdb to look at it more thoroughly.

user@DESKTOP-4E0H5D7:~/qemu_analysis$ cc -o shellcode_exec shellcode_exec.c
user@DESKTOP-4E0H5D7:~/qemu_analysis$ ./shellcode_exec
hello
hello
hello
hello
hello
hello
hello
hello
hello
hello
hello
hello
hello
hello
hello
hello

What? Why SIGSEGV? When I first saw the gdb session below, I thought I had stepped on a rare QEMU bug. The truth is that it isn't a bug. QEMU uses the SIGSEGV signal to handle self-modifying code efficiently.

user@DESKTOP-4E0H5D7:~/qemu_analysis$ gdb -q qemu-x86_64
Reading symbols from qemu-x86_64...
(gdb) r shellcode_exec
Starting program: /home/krlee/qemu_analysis/qemu-x86_64 shellcode_exec
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
[New Thread 0x7ffff7a93700 (LWP 237)]
hello

Thread 1 "qemu-x86_64" received signal SIGSEGV, Segmentation fault.
0x0000555555a2677e in static_code_gen_buffer ()
(gdb)

If you look at the signal manual, you can see that, surprisingly, the SIGSEGV signal can be caught. The manual only says that SIGKILL and SIGSTOP cannot be caught, so SIGSEGV can be. QEMU uses this facility to efficiently detect writes to already translated code and to invalidate the translations when that happens. In fact, without gdb, the program runs without any visible sign of the signal.

user@DESKTOP-4E0H5D7:~/qemu_analysis$ man 7 signal
...
       SIGQUIT      P1990      Core    Quit from keyboard
       SIGSEGV      P1990      Core    Invalid memory reference
       SIGSTKFLT      -        Term    Stack fault on coprocessor (unused)
...
       SIGWINCH       -        Ign     Window resize signal (4.3BSD, Sun)

       The signals SIGKILL and SIGSTOP cannot be caught, blocked, or ignored.
...

user@DESKTOP-4E0H5D7:~/qemu_analysis$ ./qemu-x86_64 shellcode_exec
hello
hello
hello
hello
hello
hello
hello
hello
hello
hello
hello
hello
hello
hello
hello
hello

So now we can see that we have constructed a correctly running self-modifying code sample and it seems to be executed in QEMU without any faults.

5. Translation Block eviction on SIGSEGV handler in QEMU

This is the climax of this article. We are now looking at the implementation of how translated self-modifying code gets evicted. On SIGSEGV, gdb stops at static_code_gen_buffer. That is not where the signal is handled, but where it was raised. If we run the stepi command once and then look at the backtrace, it says that we are currently in the signal handler function named host_signal_handler.

Thread 1 "qemu-x86_64" received signal SIGSEGV, Segmentation fault.
0x0000555555a2677e in static_code_gen_buffer ()
(gdb) si
host_signal_handler (host_signum=11, info=0x7fffffffd170, puc=0x7fffffffd040)
    at /home/krlee/qemu_analysis/qemu/linux-user/signal.c:653
653     {
(gdb) bt
#0  host_signal_handler (host_signum=11, info=0x7fffffffd170, puc=0x7fffffffd040)
    at /home/krlee/qemu_analysis/qemu/linux-user/signal.c:653
#1  <signal handler called>
#2  0x0000555555a2677e in static_code_gen_buffer ()
#3  0x000055555560bce0 in cpu_tb_exec (itb=<optimized out>, cpu=0x555555a27000 <static_code_gen_buffer+607808>)
    at /home/krlee/qemu_analysis/qemu/accel/tcg/cpu-exec.c:172
#4  cpu_loop_exec_tb (tb_exit=<synthetic pointer>, last_tb=<synthetic pointer>, tb=<optimized out>,
    cpu=0x555555a27000 <static_code_gen_buffer+607808>) at /home/krlee/qemu_analysis/qemu/accel/tcg/cpu-exec.c:618
#5  cpu_exec (cpu=cpu@entry=0x5555579cba20) at /home/krlee/qemu_analysis/qemu/accel/tcg/cpu-exec.c:731
#6  0x000055555563f558 in cpu_loop (env=0x5555579d3cf0)
    at /home/krlee/qemu_analysis/qemu/linux-user/x86_64/../i386/cpu_loop.c:94
#7  0x00005555555c746d in main (argc=<optimized out>, argv=0x7fffffffe338, envp=<optimized out>)
    at /home/krlee/qemu_analysis/qemu/linux-user/main.c:865

At the same time, gdb kindly tells us where the source code is. If we look at the implementation, the comment again confesses that the CPU emulator uses some host signals to detect exceptions. So I guess we are on the right track. Let's step into cpu_signal_handler to continue.

static void host_signal_handler(int host_signum, siginfo_t *info,
                                void *puc)
{
    CPUArchState *env = thread_cpu->env_ptr;
    CPUState *cpu = env_cpu(env);
    TaskState *ts = cpu->opaque;

    int sig;
    target_siginfo_t tinfo;
    ucontext_t *uc = puc;
    struct emulated_sigtable *k;

    /* the CPU emulator uses some host signals to detect exceptions,
       we forward to it some signals */
    if ((host_signum == SIGSEGV || host_signum == SIGBUS)
        && info->si_code > 0) {
        if (cpu_signal_handler(host_signum, info, puc))
            return;
    }

...

    /* interrupt the virtual CPU as soon as possible */
    cpu_exit(thread_cpu);
}

Again using gdb, we locate the source code.

(gdb) n
654         CPUArchState *env = thread_cpu->env_ptr;
(gdb)
656         TaskState *ts = cpu->opaque;
(gdb)
665         if ((host_signum == SIGSEGV || host_signum == SIGBUS)
(gdb)
667             if (cpu_signal_handler(host_signum, info, puc))
(gdb) s
cpu_x86_signal_handler (host_signum=11, pinfo=pinfo@entry=0x7fffffffd170, puc=0x7fffffffd040)
    at /home/krlee/qemu_analysis/qemu/accel/tcg/user-exec.c:306
306     {

The main part seems to be in the function handle_cpu_signal. Let's keep moving.

int cpu_signal_handler(int host_signum, void *pinfo,
                       void *puc)
{
    siginfo_t *info = pinfo;
    unsigned long pc;
#if defined(__NetBSD__) || defined(__FreeBSD__) || defined(__DragonFly__)
    ucontext_t *uc = puc;
#elif defined(__OpenBSD__)
    struct sigcontext *uc = puc;
#else
    ucontext_t *uc = puc;
#endif

    pc = PC_sig(uc);
    return handle_cpu_signal(pc, info,
                             TRAP_sig(uc) == 0xe ? (ERROR_sig(uc) >> 1) & 1 : 0,
                             &MASK_sig(uc));
}

Using dynamic analysis, we can see that the function page_unprotect is called.

handle_cpu_signal (old_set=0x7fffffffd168, is_write=1, info=0x7fffffffd170, pc=93824997287806) at /home/krlee/qemu_analysis/qemu/accel/tcg/user-exec.c:64
64          CPUState *cpu = current_cpu;
(gdb) n
69          switch (helper_retaddr) {
(gdb)
97              pc += GETPC_ADJ;
(gdb)
98              break;
(gdb)
126         if (!cpu || !cpu->running) {
(gdb)
147         if (is_write && info->si_signo == SIGSEGV && info->si_code == SEGV_ACCERR &&
(gdb)
149             switch (page_unprotect(h2g(address), pc)) {

The comment on the function page_unprotect says that it invalidates the code and unprotects the page. It looks up the page information, invalidates all the translation blocks on the page, and then finally restores the real write permission of the host page with mprotect.

/* called from signal handler: invalidate the code and unprotect the
 * page. Return 0 if the fault was not handled, 1 if it was handled,
 * and 2 if it was handled but the caller must cause the TB to be
 * immediately exited. (We can only return 2 if the 'pc' argument is
 * non-zero.)
 */
int page_unprotect(target_ulong address, uintptr_t pc)
{

...

    p = page_find(address >> TARGET_PAGE_BITS);
    if (!p) {
        mmap_unlock();
        return 0;
    }

...

            for (addr = host_start; addr < host_end; addr += TARGET_PAGE_SIZE) {
                p = page_find(addr >> TARGET_PAGE_BITS);
                p->flags |= PAGE_WRITE;
                prot |= p->flags;

                /* and since the content will be modified, we must invalidate
                   the corresponding translated code. */
                current_tb_invalidated |= tb_invalidate_phys_page(addr, pc);
#ifdef CONFIG_USER_ONLY
                if (DEBUG_TB_CHECK_GATE) {
                    tb_invalidate_check(addr);
                }
#endif
            }
            mprotect((void *)g2h(host_start), qemu_host_page_size,
                     prot & PAGE_BITS);

...

    return 0;
}

At this point, it seems sufficient to stop, but let's go a few steps further. Let's look at the function tb_invalidate_phys_page. It has a loop that iterates over all the translation blocks registered on the page and invalidates each of them. This may be a source of inefficiency if self-modifying code keeps writing to and jumping within the same page, though that might not be the common case. Let's step further and see the function tb_phys_invalidate.

static bool tb_invalidate_phys_page(tb_page_addr_t addr, uintptr_t pc)
{
    TranslationBlock *tb;
    PageDesc *p;
    int n;

...

    PAGE_FOR_EACH_TB(p, tb, n) {
#ifdef TARGET_HAS_PRECISE_SMC
        if (current_tb == tb &&
            (tb_cflags(current_tb) & CF_COUNT_MASK) != 1) {
                /* If we are modifying the current TB, we must stop
                   its execution. We could be more precise by checking
                   that the modification is after the current PC, but it
                   would require a specialized function to partially
                   restore the CPU state */

            current_tb_modified = 1;
            cpu_restore_state_from_tb(cpu, current_tb, pc, true);
            cpu_get_tb_cpu_state(env, &current_pc, &current_cs_base,
                                 &current_flags);
        }
#endif /* TARGET_HAS_PRECISE_SMC */
        tb_phys_invalidate(tb, addr);
    }

...

    return false;
}

The comment says it invalidates one TB. The main implementation seems to be in function do_tb_phys_invalidate. Let's get into it.

/* invalidate one TB
 *
 * Called with mmap_lock held in user-mode.
 */
void tb_phys_invalidate(TranslationBlock *tb, tb_page_addr_t page_addr)
{
    if (page_addr == -1 && tb->page_addr[0] != -1) {
        page_lock_tb(tb);
        do_tb_phys_invalidate(tb, true);
        page_unlock_tb(tb);
    } else {
        do_tb_phys_invalidate(tb, false);
    }
}

I copied the whole function because it contains a lot to observe. It marks the translation block as invalid and removes it from every structure it may be in: the hash table, the page lists, each CPU's jump cache, and the jump lists. I think this is a good place to stop, since we have found the location where the page permission is changed and the cache is invalidated.

static void do_tb_phys_invalidate(TranslationBlock *tb, bool rm_from_page_list)
{
    CPUState *cpu;
    PageDesc *p;
    uint32_t h;
    tb_page_addr_t phys_pc;

    assert_memory_lock();

    /* make sure no further incoming jumps will be chained to this TB */
    qemu_spin_lock(&tb->jmp_lock);
    atomic_set(&tb->cflags, tb->cflags | CF_INVALID);
    qemu_spin_unlock(&tb->jmp_lock);

    /* remove the TB from the hash list */
    phys_pc = tb->page_addr[0] + (tb->pc & ~TARGET_PAGE_MASK);
    h = tb_hash_func(phys_pc, tb->pc, tb->flags, tb_cflags(tb) & CF_HASH_MASK,
                     tb->trace_vcpu_dstate);
    if (!(tb->cflags & CF_NOCACHE) &&
        !qht_remove(&tb_ctx.htable, tb, h)) {
        return;
    }

    /* remove the TB from the page list */
    if (rm_from_page_list) {
        p = page_find(tb->page_addr[0] >> TARGET_PAGE_BITS);
        tb_page_remove(p, tb);
        invalidate_page_bitmap(p);
        if (tb->page_addr[1] != -1) {
            p = page_find(tb->page_addr[1] >> TARGET_PAGE_BITS);
            tb_page_remove(p, tb);
            invalidate_page_bitmap(p);
        }
    }

    /* remove the TB from the hash list */
    h = tb_jmp_cache_hash_func(tb->pc);
    CPU_FOREACH(cpu) {
        if (atomic_read(&cpu->tb_jmp_cache[h]) == tb) {
            atomic_set(&cpu->tb_jmp_cache[h], NULL);
        }
    }

    /* suppress this TB from the two jump lists */
    tb_remove_from_jmp_list(tb, 0);
    tb_remove_from_jmp_list(tb, 1);

    /* suppress any remaining jumps to this TB */
    tb_jmp_unlink(tb);

    atomic_set(&tcg_ctx->tb_phys_invalidate_count,
               tcg_ctx->tb_phys_invalidate_count + 1);
}
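
To make the invalidation steps a bit more concrete, here is a minimal toy model in C. It is only a sketch and not QEMU code: the names toy_tb, toy_page, jmp_cache, and JMP_CACHE_SIZE are made up for illustration, and it keeps just two of the structures seen above (a per-page list of blocks and a small jump cache indexed by a hash of the pc).

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical toy types; QEMU's real structures are far richer. */
#define JMP_CACHE_SIZE 8                 /* made-up size, must be a power of two */

struct toy_tb {
    unsigned long pc;                    /* guest address the block was translated from */
    bool invalid;                        /* plays the role of CF_INVALID */
    struct toy_tb *page_next;            /* next block on the same guest page */
};

struct toy_page {
    struct toy_tb *first_tb;             /* list of blocks translated from this page */
};

static struct toy_tb *jmp_cache[JMP_CACHE_SIZE];

static unsigned jmp_hash(unsigned long pc)
{
    return (unsigned)(pc & (JMP_CACHE_SIZE - 1));
}

/* Register a freshly "translated" block on its page and in the jump cache. */
static void toy_page_add(struct toy_page *p, struct toy_tb *tb)
{
    assert(p != NULL && tb != NULL);
    tb->invalid = false;
    tb->page_next = p->first_tb;
    p->first_tb = tb;
    jmp_cache[jmp_hash(tb->pc)] = tb;
}

/* Fast lookup: only returns a block that is cached and still valid. */
static struct toy_tb *toy_lookup(unsigned long pc)
{
    struct toy_tb *tb = jmp_cache[jmp_hash(pc)];
    return (tb != NULL && tb->pc == pc && !tb->invalid) ? tb : NULL;
}

/* Invalidate one block: mark it, then drop it from the jump cache. */
static void toy_tb_invalidate(struct toy_tb *tb)
{
    unsigned h = jmp_hash(tb->pc);
    tb->invalid = true;
    if (jmp_cache[h] == tb) {
        jmp_cache[h] = NULL;
    }
}

/* Invalidate every block on the page; returns how many were dropped. */
static int toy_page_invalidate(struct toy_page *p)
{
    int n = 0;
    struct toy_tb *tb = p->first_tb;
    while (tb != NULL) {
        struct toy_tb *next = tb->page_next;
        toy_tb_invalidate(tb);
        tb->page_next = NULL;
        tb = next;
        n++;
    }
    p->first_tb = NULL;
    return n;
}
```

After toy_page_invalidate runs, a lookup for any pc on that page misses, so the next execution has to translate the block again. That is the same effect the real function achieves by unhooking the TB from all of its lists.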

7. Then Where Does Page Permission Get Discarded?

So if the page is marked not writable, the write permission must have been discarded at some point, because the guest process has to write code into that page before any translation block is made for it, i.e., before the code is executed. Where would that be?

Again, we use dynamic analysis. Now that we know the write permission is added back by an mprotect call, it is highly likely that the same permission is discarded by an mprotect call as well. So we put a breakpoint on mprotect and see where it is reached in gdb.

If you keep continuing several times, you can see that the backtrace alternates between two different traces. One comes from the signal handler (through page_unprotect), while the other comes from the function tb_find, which is called from cpu_exec, the main loop that we found. So the latter seems to be what we are looking for.

(gdb) b mprotect
Breakpoint 4 at 0x7ffff7c249d0: file ../sysdeps/unix/syscall-template.S, line 78.
(gdb) c
Continuing.

Thread 1 "qemu-x86_64" hit Breakpoint 4, mprotect () at ../sysdeps/unix/syscall-template.S:78
78      ../sysdeps/unix/syscall-template.S: No such file or directory.
(gdb) bt
#0  mprotect () at ../sysdeps/unix/syscall-template.S:78
#1  0x000055555560e583 in page_unprotect (address=<optimized out>, pc=pc@entry=93824997287808) at /home/krlee/qemu_analysis/qemu/accel/tcg/translate-all.c:2658
#2  0x000055555560ed7d in handle_cpu_signal (old_set=0x7fffffffd168, is_write=<optimized out>, info=0x7fffffffd170, pc=93824997287808) at /home/krlee/qemu_analysis/qemu/accel/tcg/user-exec.c:149
#3  cpu_x86_signal_handler (host_signum=<optimized out>, pinfo=pinfo@entry=0x7fffffffd170, puc=0x7fffffffd040) at /home/krlee/qemu_analysis/qemu/accel/tcg/user-exec.c:318
#4  0x0000555555638c6a in host_signal_handler (host_signum=11, info=0x7fffffffd170, puc=0x7fffffffd040) at /home/krlee/qemu_analysis/qemu/linux-user/signal.c:667
#5  <signal handler called>
#6  0x0000555555a2677e in static_code_gen_buffer ()
#7  0x000055555560bce0 in cpu_tb_exec (itb=<optimized out>, cpu=0x555555a26dc0 <static_code_gen_buffer+607232>) at /home/krlee/qemu_analysis/qemu/accel/tcg/cpu-exec.c:172
#8  cpu_loop_exec_tb (tb_exit=<synthetic pointer>, last_tb=<synthetic pointer>, tb=<optimized out>, cpu=0x555555a26dc0 <static_code_gen_buffer+607232>) at /home/krlee/qemu_analysis/qemu/accel/tcg/cpu-exec.c:618
#9  cpu_exec (cpu=cpu@entry=0x5555579cba20) at /home/krlee/qemu_analysis/qemu/accel/tcg/cpu-exec.c:731
#10 0x000055555563f558 in cpu_loop (env=0x5555579d3cf0) at /home/krlee/qemu_analysis/qemu/linux-user/x86_64/../i386/cpu_loop.c:94
#11 0x00005555555c746d in main (argc=<optimized out>, argv=0x7fffffffe338, envp=<optimized out>) at /home/krlee/qemu_analysis/qemu/linux-user/main.c:865
(gdb) c
Continuing.

Thread 1 "qemu-x86_64" hit Breakpoint 4, mprotect () at ../sysdeps/unix/syscall-template.S:78
78      in ../sysdeps/unix/syscall-template.S
(gdb) bt
#0  mprotect () at ../sysdeps/unix/syscall-template.S:78
#1  0x000055555560daaf in tb_page_add (tb=0x555555a26ec0 <static_code_gen_buffer+607488>, n=0, p=<optimized out>, p=<optimized out>, page_addr=131072) at /home/krlee/qemu_analysis/qemu/accel/tcg/translate-all.c:1567
#2  tb_page_add (p=0x555557a1e410, p=0x555557a1e410, page_addr=131072, n=0, tb=0x555555a26ec0 <static_code_gen_buffer+607488>) at /home/krlee/qemu_analysis/qemu/accel/tcg/translate-all.c:1530
#3  tb_link_page (phys_page2=<optimized out>, phys_pc=131328, tb=0x555555a26ec0 <static_code_gen_buffer+607488>) at /home/krlee/qemu_analysis/qemu/accel/tcg/translate-all.c:1622
#4  tb_gen_code (cpu=cpu@entry=0x5555579cba20, pc=pc@entry=131328, cs_base=cs_base@entry=0, flags=flags@entry=4243635, cflags=<optimized out>, cflags@entry=0) at /home/krlee/qemu_analysis/qemu/accel/tcg/translate-all.c:1866
#5  0x000055555560befc in tb_find (cf_mask=0, tb_exit=0, last_tb=0x0, cpu=0x0) at /home/krlee/qemu_analysis/qemu/accel/tcg/cpu-exec.c:406
#6  cpu_exec (cpu=cpu@entry=0x5555579cba20) at /home/krlee/qemu_analysis/qemu/accel/tcg/cpu-exec.c:730
#7  0x000055555563f558 in cpu_loop (env=0x5555579d3cf0) at /home/krlee/qemu_analysis/qemu/linux-user/x86_64/../i386/cpu_loop.c:94
#8  0x00005555555c746d in main (argc=<optimized out>, argv=0x7fffffffe338, envp=<optimized out>) at /home/krlee/qemu_analysis/qemu/linux-user/main.c:865

Since the backtrace tells us all the source code locations, let's just follow it. This is the function tb_gen_code, which calls tb_link_page. It is a long function, and it deserves a closer look. It does what we expected: it allocates a buffer for a translation block, generates intermediate code, and then finally generates host assembly code. Then it saves the result in the cache through tb_link_page. Let's see more in the function tb_link_page.

/* Called with mmap_lock held for user mode emulation.  */
TranslationBlock *tb_gen_code(CPUState *cpu,
                              target_ulong pc, target_ulong cs_base,
                              uint32_t flags, int cflags)
{
    CPUArchState *env = cpu->env_ptr;
    TranslationBlock *tb, *existing_tb;
    tb_page_addr_t phys_pc, phys_page2;
    target_ulong virt_page2;
    tcg_insn_unit *gen_code_buf;
    int gen_code_size, search_size, max_insns;
#ifdef CONFIG_PROFILER
    TCGProfile *prof = &tcg_ctx->prof;
    int64_t ti;
#endif

...

    tb = tcg_tb_alloc(tcg_ctx);

...

    tcg_func_start(tcg_ctx);

    tcg_ctx->cpu = env_cpu(env);
    gen_intermediate_code(cpu, tb, max_insns);
    tcg_ctx->cpu = NULL;

...

    gen_code_size = tcg_gen_code(tcg_ctx, tb);
    if (unlikely(gen_code_size < 0)) {
        switch (gen_code_size) {

...

        default:
            g_assert_not_reached();
        }
    }

...

    existing_tb = tb_link_page(tb, phys_pc, phys_page2);

...

    tcg_tb_insert(tb);
    return tb;
}

This is the function tb_link_page, which calls tb_page_add. The comment says that it adds a new TB and links it to the physical page tables.

/* add a new TB and link it to the physical page tables. phys_page2 is
 * (-1) to indicate that only one page contains the TB.
 *
 * Called with mmap_lock held for user-mode emulation.
 *
 * Returns a pointer @tb, or a pointer to an existing TB that matches @tb.
 * Note that in !user-mode, another thread might have already added a TB
 * for the same block of guest code that @tb corresponds to. In that case,
 * the caller should discard the original @tb, and use instead the returned TB.
 */
static TranslationBlock *
tb_link_page(TranslationBlock *tb, tb_page_addr_t phys_pc,
             tb_page_addr_t phys_page2)
{

...

    /*
     * Add the TB to the page list, acquiring first the pages's locks.
     * We keep the locks held until after inserting the TB in the hash table,
     * so that if the insertion fails we know for sure that the TBs are still
     * in the page descriptors.
     * Note that inserting into the hash table first isn't an option, since
     * we can only insert TBs that are fully initialized.
     */
    page_lock_pair(&p, phys_pc, &p2, phys_page2, 1);
    tb_page_add(p, tb, 0, phys_pc & TARGET_PAGE_MASK);
    if (p2) {
        tb_page_add(p2, tb, 1, phys_page2);
    } else {
        tb->page_addr[1] = -1;
    }

...

    return tb;
}

This is the function tb_page_add. We can see that it calls mprotect, and that the protection passed to it has the write permission bit masked out (& ~PAGE_WRITE). So we have successfully found where the write permission is discarded: it happens right after a translation block is created on the page.

/* add the tb in the target page and protect it if necessary
 *
 * Called with mmap_lock held for user-mode emulation.
 * Called with @p->lock held in !user-mode.
 */
static inline void tb_page_add(PageDesc *p, TranslationBlock *tb,
                               unsigned int n, tb_page_addr_t page_addr)
{

...

#if defined(CONFIG_USER_ONLY)
    if (p->flags & PAGE_WRITE) {
        target_ulong addr;
        PageDesc *p2;
        int prot;

        /* force the host page as non writable (writes will have a
           page fault + mprotect overhead) */
        page_addr &= qemu_host_page_mask;
        prot = 0;
        for (addr = page_addr; addr < page_addr + qemu_host_page_size;
            addr += TARGET_PAGE_SIZE) {

            p2 = page_find(addr >> TARGET_PAGE_BITS);
            if (!p2) {
                continue;
            }
            prot |= p2->flags;
            p2->flags &= ~PAGE_WRITE;
          }
        mprotect(g2h(page_addr), qemu_host_page_size,
                 (prot & PAGE_BITS) & ~PAGE_WRITE);
        if (DEBUG_TB_INVALIDATE_GATE) {
            printf("protecting code page: 0x" TB_PAGE_ADDR_FMT "\n", page_addr);
        }
    }
#else

...

#endif
}
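
The loop before the mprotect call is easy to skim past, so here is a small sketch of what it computes, assuming a host page that covers several target pages. The flag values and the function name toy_protect_host_page are hypothetical and chosen only for this example; the point is just that the flags of all target pages are ORed together, each page loses its write flag, and the final protection has the write bit cleared.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical flag values for this toy model only. */
enum {
    TOY_PAGE_READ  = 1,
    TOY_PAGE_WRITE = 2,
    TOY_PAGE_EXEC  = 4,
    TOY_PAGE_BITS  = TOY_PAGE_READ | TOY_PAGE_WRITE | TOY_PAGE_EXEC
};

/*
 * flags[] holds the flags of the n target pages that share one host page.
 * Returns the protection the host page should get: the union of all
 * target page permissions, minus write. Each target page also has its
 * write flag cleared, mirroring "p2->flags &= ~PAGE_WRITE".
 */
static int toy_protect_host_page(int *flags, size_t n)
{
    int prot = 0;
    size_t i;

    assert(flags != NULL);
    for (i = 0; i < n; i++) {
        prot |= flags[i];                 /* collect permissions first */
        flags[i] &= ~TOY_PAGE_WRITE;      /* then mark the page read-only */
    }
    return (prot & TOY_PAGE_BITS) & ~TOY_PAGE_WRITE;
}
```

For example, two target pages with read+write and read+exec end up with a host protection of read+exec, so any later write to either of them faults.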

8. Overall Execution Flow

So in summary, when ordinary code is emulated, guest instructions are translated the first time they are executed, the translated block is cached, and then the translated instructions are run. Self-modifying code needs an extra mechanism to ensure that a cached translation block still matches the instructions currently in the emulated process memory.

In the user-mode emulation traced here, QEMU uses SIGSEGV to detect that instructions in memory might have changed. When a page contains an instruction that has been translated and cached, the page is marked as not writable in the host process (tb_page_add). If the guest then tries to overwrite those instructions, SIGSEGV is raised in the host process, so QEMU knows that the page is being written to and that the cached translations might now be garbage. At this point, QEMU simply invalidates all the translation blocks on that page, marks the page writable again (page_unprotect), and proceeds. This is how self-modifying code is handled specially.
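
The same trick can be tried outside QEMU. The following is a minimal sketch for Linux, not QEMU's implementation: guest_page, cached_tbs, on_segv, and smc_demo are names invented for this example, and the "cache" is just a counter. It write-protects a page after pretending to translate it, lets a write fault, and has the SIGSEGV handler drop the cache and restore the write permission so that the faulting store is retried.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <signal.h>
#include <stdint.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/* All names below are hypothetical; this is a toy model, not QEMU code. */
static unsigned char *guest_page;           /* one page standing in for guest code */
static size_t page_size;
static volatile sig_atomic_t cached_tbs;    /* pretend number of cached translations */
static volatile sig_atomic_t faults_seen;   /* write faults handled so far */

/* Plays the role of page_unprotect: drop the cache, make the page writable. */
static void on_segv(int sig, siginfo_t *info, void *uctx)
{
    uintptr_t fault = (uintptr_t)info->si_addr;
    uintptr_t start = (uintptr_t)guest_page;
    (void)uctx;

    if (fault < start || fault >= start + page_size) {
        /* Not our protected page: restore the default action and crash. */
        signal(sig, SIG_DFL);
        return;
    }
    faults_seen++;
    cached_tbs = 0;                          /* "invalidate all TBs on the page" */
    /* mprotect is not on the POSIX async-signal-safe list; fine for a demo. */
    mprotect(guest_page, page_size, PROT_READ | PROT_WRITE);
    /* Returning restarts the faulting store, which now succeeds. */
}

/* Returns the number of handled faults, or -1 if the setup failed. */
static int smc_demo(void)
{
    struct sigaction sa;
    volatile unsigned char *p;

    page_size = (size_t)sysconf(_SC_PAGESIZE);
    guest_page = mmap(NULL, page_size, PROT_READ | PROT_WRITE,
                      MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (guest_page == MAP_FAILED) {
        return -1;
    }
    p = guest_page;

    memset(&sa, 0, sizeof sa);
    sa.sa_sigaction = on_segv;
    sa.sa_flags = SA_SIGINFO;
    sigemptyset(&sa.sa_mask);
    if (sigaction(SIGSEGV, &sa, NULL) != 0) {
        return -1;
    }

    p[0] = 0x90;            /* guest writes code: page is writable, no fault */

    cached_tbs = 1;         /* "translate" it, then write-protect the page */
    if (mprotect(guest_page, page_size, PROT_READ) != 0) {
        return -1;
    }

    p[0] = 0xc3;            /* self-modifying write: faults once, then retried */

    assert(cached_tbs == 0);    /* the handler must have dropped the cache */
    return (int)faults_seen;
}
```

Calling smc_demo from a main function is expected to report exactly one handled fault, with the second write landing in memory after the handler returns. Note that the first write costs nothing extra, which matches the observation that the protection is only installed once a translation block exists for the page.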

[Footnote 1] https://en.wikipedia.org/wiki/QEMU#Tiny_Code_Generator
