All parts: Part 1 | Part 2 | Part 3 | Part 4 | Part 5
The bootstrapping process is the key to understanding how the Go runtime works, and learning it is essential if you want to move forward with Go. The fifth installment in our Golang Internals series is therefore dedicated to the Go runtime and, specifically, the Go bootstrap process. This time you will learn about:
- Go bootstrapping
- resizable stacks implementation
- internal TLS implementation
Note that this post contains a lot of assembler code and you will need at least some basic knowledge of it to proceed (here is a quick guide to Go’s assembler). So let’s get going!
Finding an entry point
First, we need to find what function is executed immediately after we start a Go program. To do this, we will write a simple Go app:
package main

func main() {
    print(123)
}
Then we need to compile and link it:
go tool 6g test.go
go tool 6l test.6
This will create an executable file called 6.out in your current directory. The next step involves the objdump tool, which is specific to Linux. Windows and Mac users can find analogs or skip this step altogether. Now run the following command:
objdump -f 6.out
You should get output that will contain the start address:
6.out: file format elf64-x86-64
architecture: i386:x86-64, flags 0x00000112:
EXEC_P, HAS_SYMS, D_PAGED
start address 0x000000000042f160
Next, we need to disassemble our executable and find what function is located at this address:
objdump -d 6.out > disassemble.txt
Then we need to open the disassemble.txt file and search for “42f160.” Here is what I got:
000000000042f160 <_rt0_amd64_linux>:
  42f160: 48 8d 74 24 08        lea 0x8(%rsp),%rsi
  42f165: 48 8b 3c 24           mov (%rsp),%rdi
  42f169: 48 8d 05 10 00 00 00  lea 0x10(%rip),%rax # 42f180
  42f170: ff e0                 jmpq *%rax
Nice, we have found it! The entry point for my OS and architecture is a function called _rt0_amd64_linux.
The starting sequence
Now we need to find this function in the Go runtime sources. It is located in the rt0_linux_amd64.s file. If you look inside the Go runtime package, you can find many filenames with suffixes related to OS and architecture names. When the runtime package is built, only the files that correspond to the current OS and architecture are selected. The rest are skipped. Let’s take a closer look at rt0_linux_amd64.s:
TEXT _rt0_amd64_linux(SB),NOSPLIT,$-8
    LEAQ 8(SP), SI // argv
    MOVQ 0(SP), DI // argc
    MOVQ $main(SB), AX
    JMP AX
TEXT main(SB),NOSPLIT,$-8
    MOVQ $runtime·rt0_go(SB), AX
    JMP AX
The _rt0_amd64_linux function is very simple. It saves the arguments (argc and argv) in registers (DI and SI) and then jumps to the main function. The arguments are located on the stack and can be accessed via the SP (stack pointer) register. The main function is also very simple. It jumps to runtime.rt0_go. The runtime.rt0_go function is longer and more complicated, so I will break it into small parts and describe each one separately.
The first section goes like this:
    MOVQ DI, AX // argc
    MOVQ SI, BX // argv
    SUBQ $(4*8+7), SP // 2args 2auto
    ANDQ $~15, SP
    MOVQ AX, 16(SP)
    MOVQ BX, 24(SP)
Here, we copy the previously saved command-line argument values into the AX and BX registers and decrease the stack pointer. This reserves room for four 8-byte slots (two arguments and two automatic variables, as the comment says), and the ANDQ instruction rounds the stack pointer down so that it is 16-byte aligned. Finally, we move the arguments back to the stack.
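The stack pointer arithmetic above can be replayed on a plain integer. This is only a sketch: the function name reserveAndAlign and the starting address are made up for illustration and are not part of the runtime.

```go
package main

import "fmt"

// reserveAndAlign is a hypothetical helper that mirrors two instructions
// from runtime.rt0_go on an ordinary integer:
//
//	SUBQ $(4*8+7), SP
//	ANDQ $~15, SP
func reserveAndAlign(sp uint64) uint64 {
	sp -= 4*8 + 7 // four 8-byte slots plus 7 bytes of slack
	sp &^= 15     // clear the low four bits: round down to a multiple of 16
	return sp
}

func main() {
	sp := uint64(0x7ffc12345678) // made-up initial stack pointer
	aligned := reserveAndAlign(sp)
	fmt.Printf("before: %#x\n", sp)
	fmt.Printf("after:  %#x\n", aligned)
	fmt.Println("16-byte aligned:", aligned%16 == 0)
	fmt.Println("bytes reserved:", sp-aligned)
}
```

Whatever the starting value, the result is a multiple of 16 and lies at least 32 bytes below the original stack pointer, which is enough for the two slots at 16(SP) and 24(SP) used afterwards.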
    // create istack out of the given (operating system) stack.
    // _cgo_init may update stackguard.
    MOVQ $runtime·g0(SB), DI
    LEAQ (-64*1024+104)(SP), BX
    MOVQ BX, g_stackguard0(DI)
    MOVQ BX, g_stackguard1(DI)
    MOVQ BX, (g_stack+stack_lo)(DI)
    MOVQ SP, (g_stack+stack_hi)(DI)
The second part is a bit more tricky. First, we load the address of the global runtime.g0 variable into the DI register. This variable is defined in the proc1.go file and is of the runtime.g type. Variables of this type are created for each goroutine in the system. As you can guess, runtime.g0 describes a root goroutine. Then we initialize the fields that describe the stack of the root goroutine. The meaning of stack.lo and stack.hi should be clear. These are pointers to the beginning and the end of the stack for the current goroutine, but what are the stackguard0 and stackguard1 fields? To understand this, we need to set aside the investigation of the runtime.rt0_go function and take a closer look at stack growth in Go.
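Before moving on, the bounds that this fragment stores for runtime.g0 can be sketched in Go. The helper name g0StackBounds and the sample stack pointer are assumptions made for illustration; only the constant -64*1024+104 comes from the assembly above.

```go
package main

import "fmt"

// g0StackBounds is a hypothetical helper that mirrors
//
//	LEAQ (-64*1024+104)(SP), BX
//
// BX becomes stack.lo (and both stack guards), while SP becomes stack.hi.
func g0StackBounds(sp uint64) (lo, hi uint64) {
	lo = sp - 64*1024 + 104
	hi = sp
	return lo, hi
}

func main() {
	lo, hi := g0StackBounds(0x7ffc12345650) // made-up stack pointer
	fmt.Printf("stack.lo = %#x\n", lo)
	fmt.Printf("stack.hi = %#x\n", hi)
	fmt.Println("usable bytes:", hi-lo)
}
```

In other words, the root goroutine is given a window of a little under 64 KB carved out of the stack that the operating system provided.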
Resizable stack implementation in Go
The Go language uses resizable stacks. Each goroutine starts with a small stack and its size changes each time a certain threshold is reached. Obviously, there is a way to check whether we have reached this threshold or not. In fact, the check is performed at the beginning of each function. To see how it works, let’s compile our sample program one more time with the -S flag (this will show the generated assembler code). The beginning of the main function looks like this:
"".main t=1 size=48 value=0 args=0x0 locals=0x8
    0x0000 00000 (test.go:3) TEXT "".main+0(SB),$8-0
    0x0000 00000 (test.go:3) MOVQ (TLS),CX
    0x0009 00009 (test.go:3) CMPQ SP,16(CX)
    0x000d 00013 (test.go:3) JHI ,22
    0x000f 00015 (test.go:3) CALL ,runtime.morestack_noctxt(SB)
    0x0014 00020 (test.go:3) JMP ,0
    0x0016 00022 (test.go:3) SUBQ $8,SP
First, we load a value from thread local storage (TLS) to the CX register (I have already explained what TLS is in one of my previous posts). This value always contains a pointer to the runtime.g structure that corresponds to the current goroutine. Then we compare the stack pointer to the value located at an offset of 16 bytes in the runtime.g structure. We can easily calculate that this corresponds to the stackguard0 field.
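The offset calculation can be checked with a small Go program. The struct below is not the real runtime.g; it is a hypothetical mirror that copies only the first fields (the stack bounds followed by the two guards), which is all we need to compute the offset.

```go
package main

import (
	"fmt"
	"unsafe"
)

// ptrSize is the size of a pointer-sized word on the current platform.
const ptrSize = unsafe.Sizeof(uintptr(0))

// stack mirrors the two stack bounds stored at the start of runtime.g.
type stack struct {
	lo uintptr
	hi uintptr
}

// gPrefix is a hypothetical copy of the first few fields of runtime.g.
type gPrefix struct {
	stack       stack
	stackguard0 uintptr
	stackguard1 uintptr
}

// stackguard0Offset reports where stackguard0 sits inside gPrefix.
func stackguard0Offset() uintptr {
	var gp gPrefix
	return unsafe.Offsetof(gp.stackguard0)
}

func main() {
	// On a 64-bit platform this prints 16, matching CMPQ SP,16(CX).
	fmt.Println(stackguard0Offset())
}
```

The two stack bounds take up two machine words, so the guard that follows them lands at 2 × 8 = 16 bytes on amd64.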
So, this is how we check whether we have reached the stack threshold. If the stack pointer is still above stackguard0, the JHI instruction jumps straight to the body of the function. Otherwise, we call the runtime.morestack_noctxt function and then jump back to the beginning of the function to repeat the check, so the call is retried until enough memory has been allocated for the stack. The stackguard1 field works very similarly to stackguard0, but it is used inside the C stack growth prologue instead of Go. The inner workings of runtime.morestack_noctxt are also a very interesting topic, but we will discuss them later. For now, let’s return to the bootstrap process.
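The comparison made by the prologue can be modeled as a one-line predicate. This is a simplified sketch with a made-up function name and made-up addresses; it covers only the CMPQ and JHI pair, not what runtime.morestack_noctxt does afterwards.

```go
package main

import "fmt"

// needsMoreStack is a hypothetical model of
//
//	CMPQ SP,16(CX)
//	JHI  ,22
//
// JHI is an unsigned "jump if higher", so the function body is entered
// directly only while SP is strictly above stackguard0.
func needsMoreStack(sp, stackguard0 uint64) bool {
	return sp <= stackguard0
}

func main() {
	const guard = 0xc000001000 // made-up stackguard0 value

	for _, sp := range []uint64{0xc000002000, 0xc000001000, 0xc000000ff8} {
		fmt.Printf("sp=%#x needs more stack: %v\n", sp, needsMoreStack(sp, guard))
	}
}
```

Because stacks grow toward lower addresses, a smaller SP means a deeper stack, which is why the check fires when SP drops to or below the guard.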
Continuing the investigation of Go bootstrapping
We will proceed with the starting sequence by looking at the next portion of code inside the runtime.rt0_go function:
    // find out information about the processor we're on
    MOVQ $0, AX
    CPUID
    CMPQ AX, $0
    JE nocpuinfo
    // Figure out how to serialize RDTSC.
    // On Intel processors LFENCE is enough. AMD requires MFENCE.
    // Don't know about the rest, so let's do MFENCE.
    CMPL BX, $0x756E6547 // "Genu"
    JNE notintel
    CMPL DX, $0x49656E69 // "ineI"
    JNE notintel
    CMPL CX, $0x6C65746E // "ntel"
    JNE notintel
    MOVB $1, runtime·lfenceBeforeRdtsc(SB)
notintel:
    MOVQ $1, AX
    CPUID
    MOVL CX, runtime·cpuid_ecx(SB)
    MOVL DX, runtime·cpuid_edx(SB)
nocpuinfo:
This part is not crucial for understanding major Go concepts, so we will look through it briefly. Here, we are trying to figure out what processor we are using. If it is Intel, we set the runtime·lfenceBeforeRdtsc variable. The runtime·cputicks function is the only place where this variable is used. Depending on the value of runtime·lfenceBeforeRdtsc, it uses a different assembler instruction (LFENCE or MFENCE, according to the comment) to serialize RDTSC when reading CPU ticks. Finally, we execute the CPUID instruction once more, this time with AX set to 1, and save the result in the runtime·cpuid_ecx and runtime·cpuid_edx variables. These are used in the alg.go file to select a proper hashing algorithm that is natively supported by your computer’s architecture.
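The three magic constants in this fragment are simply the vendor string packed into registers, four bytes at a time, in little-endian order. A short Go sketch (the function name vendorString is made up for illustration) reassembles them:

```go
package main

import (
	"encoding/binary"
	"fmt"
)

// vendorString rebuilds the 12-byte CPUID vendor ID from the register
// values, in the order the assembly compares them: BX, DX, CX.
func vendorString(ebx, edx, ecx uint32) string {
	b := make([]byte, 12)
	binary.LittleEndian.PutUint32(b[0:4], ebx)
	binary.LittleEndian.PutUint32(b[4:8], edx)
	binary.LittleEndian.PutUint32(b[8:12], ecx)
	return string(b)
}

func main() {
	// Constants copied from the runtime.rt0_go fragment above.
	fmt.Println(vendorString(0x756E6547, 0x49656E69, 0x6C65746E)) // GenuineIntel
}
```

This is why the comments next to the constants read "Genu", "ineI", and "ntel".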
Ok, let’s move on and examine another portion of code:
    // if there is an _cgo_init, call it.
    MOVQ _cgo_init(SB), AX
    TESTQ AX, AX
    JZ needtls
    // g0 already in DI
    MOVQ DI, CX // Win64 uses CX for first parameter
    MOVQ $setg_gcc<>(SB), SI
    CALL AX
    // update stackguard after _cgo_init
    MOVQ $runtime·g0(SB), CX
    MOVQ (g_stack+stack_lo)(CX), AX
    ADDQ $const__StackGuard, AX
    MOVQ AX, g_stackguard0(CX)
    MOVQ AX, g_stackguard1(CX)
    CMPL runtime·iswindows(SB), $0
    JEQ ok
This fragment is only executed when cgo is enabled. cgo is a topic for a separate discussion and we might talk about it in one of the upcoming posts. At this point, we only want to understand the basic bootstrap workflow, so we will skip it.
The next code fragment is responsible for setting up TLS:
needtls:
    // skip TLS setup on Plan 9
    CMPL runtime·isplan9(SB), $1
    JEQ ok
    // skip TLS setup on Solaris
    CMPL runtime·issolaris(SB), $1
    JEQ ok
    LEAQ runtime·tls0(SB), DI
    CALL runtime·settls(SB)
    // store through it, to make sure it works
    get_tls(BX)
    MOVQ $0x123, g(BX)
    MOVQ runtime·tls0(SB), AX
    CMPQ AX, $0x123
    JEQ 2(PC)
    MOVL AX, 0 // abort
I have already mentioned TLS before. Now it is time to understand how it is implemented.
Internal TLS implementation
If you look at the previous code fragment carefully, you can easily understand that the only lines that do actual work are:
    LEAQ runtime·tls0(SB), DI
    CALL runtime·settls(SB)
Everything else either skips TLS setup on operating systems where it is not performed (Plan 9 and Solaris) or checks that TLS works correctly. The two lines above store the address of the runtime·tls0 variable in the DI register and call the runtime·settls function. The code of this function is shown below:
// set tls base to DI
TEXT runtime·settls(SB),NOSPLIT,$32
    ADDQ $8, DI // ELF wants to use -8(FS)
    MOVQ DI, SI
    MOVQ $0x1002, DI // ARCH_SET_FS
    MOVQ $158, AX // arch_prctl
    SYSCALL
    CMPQ AX, $0xfffffffffffff001
    JLS 2(PC)
    MOVL $0xf1, 0xf1 // crash
    RET
From the comments, we can understand that this function makes an arch_prctl system call and passes ARCH_SET_FS as an argument. We can also see that this system call sets a base for the FS segment register. In our case, we set TLS to point to the runtime·tls0 variable.
Do you remember the instruction that we saw at the beginning of the assembler code for the main function?
0x0000 00000 (test.go:3) MOVQ (TLS),CX
I have previously explained that it loads the address of the runtime.g structure instance into the CX register. This structure describes the current goroutine and is stored in thread local storage. Now we can find out how this instruction is translated into machine code. If you open the previously created disassemble.txt file and look for the main.main function, the first instruction inside it should look like this:
400c00: 64 48 8b 0c 25 f0 ff mov %fs:0xfffffffffffffff0,%rcx
The colon in this instruction (%fs:0xfffffffffffffff0) stands for segmentation addressing (you can read more on it here).
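The large hexadecimal number in that instruction is a negative displacement printed as an unsigned 64-bit value. A tiny Go sketch (the function name is made up for illustration) makes the sign visible:

```go
package main

import "fmt"

// signedDisplacement reinterprets a displacement, as printed by objdump,
// as a two's-complement signed 64-bit number.
func signedDisplacement(raw uint64) int64 {
	return int64(raw)
}

func main() {
	fmt.Println(signedDisplacement(0xfffffffffffffff0)) // -16
}
```

So the instruction reads the value stored 16 bytes below the address held in the FS base.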
Returning to the starting sequence
Finally, let’s look at the last two parts of the runtime.rt0_go function:
ok:
    // set the per-goroutine and per-mach "registers"
    get_tls(BX)
    LEAQ runtime·g0(SB), CX
    MOVQ CX, g(BX)
    LEAQ runtime·m0(SB), AX
    // save m->g0 = g0
    MOVQ CX, m_g0(AX)
    // save m0 to g0->m
    MOVQ AX, g_m(CX)
Here, we load the TLS address into the BX register and save the address of the runtime·g0 variable in TLS. We also initialize the runtime.m0 variable. If runtime.g0 stands for root goroutine, then runtime.m0 corresponds to the root operating system thread used to run this goroutine. We may take a closer look at runtime.g0 and runtime.m0 structures in upcoming blog posts.
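The wiring done by this fragment can be expressed in a few lines of Go. The types below are hypothetical, stripped-down stand-ins for runtime.g and runtime.m that keep only the two pointers involved, and tlsG is a made-up variable that plays the role of the TLS slot.

```go
package main

import "fmt"

// g is a stand-in for runtime.g with only the back-pointer to its thread.
type g struct {
	m *m
}

// m is a stand-in for runtime.m with only the pointer to its root goroutine.
type m struct {
	g0 *g
}

var (
	g0   g  // root goroutine
	m0   m  // root operating system thread
	tlsG *g // made-up stand-in for the g slot in thread local storage
)

// linkRoots mirrors the assembly: store g0 in TLS, then link m0 and g0.
func linkRoots() {
	tlsG = &g0  // MOVQ CX, g(BX)
	m0.g0 = &g0 // MOVQ CX, m_g0(AX)
	g0.m = &m0  // MOVQ AX, g_m(CX)
}

func main() {
	linkRoots()
	fmt.Println("TLS holds g0:", tlsG == &g0)
	fmt.Println("m0.g0 is g0: ", m0.g0 == &g0)
	fmt.Println("g0.m is m0:  ", g0.m == &m0)
}
```

After this step, the root goroutine and the root thread can reach each other, and the current goroutine can be found through TLS.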
The final part of the starting sequence initializes arguments and calls different functions, but this is a topic for a separate discussion.
More on Golang
So, we have learned the inner mechanisms of the bootstrap process and found out how stacks are implemented. To move forward, we need to analyze the last part of the starting sequence. That will be the subject of my next post. If you want to get notified as soon as it comes out, hit the subscribe button below or follow @altoros.
About the author: Sergey Matyukevich is a Cloud Engineer and Go Developer at Altoros. With 6+ years in software engineering, he is an expert in cloud automation and designing architectures for complex cloud-based systems. An active member of the Go community, Sergey is a frequent contributor to open-source projects, such as Ubuntu and Juju Charms.