通过内存分配来学习 go 中的机制

胡大海 · · 503 次点击 · · 开始浏览

这是一个创建于的文章，其中的信息可能已经有所发展或是发生改变。

第一次，站长亲自招 Gopher 了>>>

前言

在前一篇博客中，我介绍了逃逸分析的基础场景。但是还有一些其他场景，我并没有做介绍。为了介绍其他场景，我专门写了了一个程序用于 debug，这个程序中分配内存的方式比较让人吃惊。

程序

为了更多的学习io包，我尝试了一个快速的项目。找到字节流中的字符串 elvis，并且替换为首字母大写的字符串 Elvis。

代码中列出了两个用于解决这个这个问题的函数。这个博客主要集中于函数algOne，因为这个函数用到了io包。

下面的数据中，一个是输入，一个是希望通过函数algOne作用之后的输出。

Listing 1

Input:
abcelvisaElvisabcelviseelvisaelvisaabeeeelvise l v i saa bb e l v i saa elvi
selvielviselvielvielviselvi1elvielviselvis

Output:
abcElvisaElvisabcElviseElvisaElvisaabeeeElvise l v i saa bb e l v i saa elvi
selviElviselvielviElviselvi1elviElvisElvis
复制代码

下面是函数algOne

Listing 2

 80 func algOne(data []byte, find []byte, repl []byte, output *bytes.Buffer) {
 81
 82     // Use a bytes Buffer to provide a stream to process.
 83     input := bytes.NewBuffer(data)
 84
 85     // The number of bytes we are looking for.
 86     size := len(find)
 87
 88     // Declare the buffers we need to process the stream.
 89     buf := make([]byte, size)
 90     end := size - 1
 91
 92     // Read in an initial number of bytes we need to get started.
 93     if n, err := io.ReadFull(input, buf[:end]); err != nil {
 94         output.Write(buf[:n])
 95         return
 96     }
 97
 98     for {
 99
100         // Read in one byte from the input stream.
101         if _, err := io.ReadFull(input, buf[end:]); err != nil {
102
103             // Flush the reset of the bytes we have.
104             output.Write(buf[:end])
105             return
106         }
107
108         // If we have a match, replace the bytes.
109         if bytes.Compare(buf, find) == 0 {
110             output.Write(repl)
111
112             // Read a new initial number of bytes.
113             if n, err := io.ReadFull(input, buf[:end]); err != nil {
114                 output.Write(buf[:n])
115                 return
116             }
117
118             continue
119         }
120
121         // Write the front byte since it has been compared.
122         output.WriteByte(buf[0])
123
124         // Slice that front byte out.
125         copy(buf, buf[1:])
126     }
127 }
复制代码

我想知道这个函数的表现以及函数给堆上的压力。为了了解这些，我们需要运行下 benchmark。

Benchmarking

下面是用来运行函数algOne来处流数据的 benchmark 函数

Listing 3

15 func BenchmarkAlgorithmOne(b *testing.B) {
16     var output bytes.Buffer
17     in := assembleInputStream()
18     find := []byte("elvis")
19     repl := []byte("Elvis")
20
21     b.ResetTimer()
22
23     for i := 0; i < b.N; i++ {
24         output.Reset()
25         algOne(in, find, repl, &output)
26     }
27 }
复制代码

有了这个函数，我们就可以运行go test了，并且可以使用选项-bench，-benchtime和-benchmem选项。

Listing 4

$ go test -run none -bench AlgorithmOne -benchtime 3s -benchmem
BenchmarkAlgorithmOne-8    	2000000 	     2522 ns/op       117 B/op  	      2 allocs/op
复制代码

在运行 benchmark 之后，我们可以看到函数algOne函数的每次操作都分配了两次内存，并且分配的内存大小为 117 字节。这个表现非常好了，但是我们需要知道是哪些代码造成了这些内存的分配。为了知道这些，我们需要产生运行 benchmark 的 profiling data。

Profiling

为了产生 profile data，我们需要运行 benchmark，不过这次需要使用选项 -memprofile选项。

Listing 5

$ go test -run none -bench AlgorithmOne -benchtime 3s -benchmem -memprofile mem.out
BenchmarkAlgorithmOne-8    	2000000 	     2570 ns/op       117 B/op  	      2 allocs/op
复制代码

在程序运行完之后，就会产生两个新的文件。

Listing 6

~/code/go/src/.../memcpu
$ ls -l
total 9248
-rw-r--r--  1 bill  staff      209 May 22 18:11 mem.out       (NEW)
-rwxr-xr-x  1 bill  staff  2847600 May 22 18:10 memcpu.test   (NEW)
-rw-r--r--  1 bill  staff     4761 May 22 18:01 stream.go
-rw-r--r--  1 bill  staff      880 May 22 14:49 stream_test.go
复制代码

源码所在的文件夹为memcpu，函数algOne就存在于文件stream.go中，函数BenchmarkAlgorithmOne存在于stream_test.go。两个产生的文件分别是mem.out和memcpu.test。文件mem.out包含了 profiles data。文件memcpu.test是一个二进制文件，当我们需要看 profile data 的时候需要使用到这个文件。

有了 profile data 和二进制文件，我们就可以运行pprof工具来学习 profile data。

Listing 7

$ go tool pprof -alloc_space memcpu.test mem.out
Entering interactive mode (type "help" for commands)
(pprof) _
复制代码

当需要 profiling memory 并且寻找容易解决的问题的时候，我们需要使用选项-alloc_space而不是默认的选项-inuse_space。这个选项会展示每次分配内存的情况，而不管你 take the profile 的时候，分配的内存是否还在使用。

通过pprof的作用，我们可以使用list命令来检查函数algOne的情况。list命令接受一个正则表达式，用于匹配表达式匹配的函数。

Listing 8

(pprof) list algOne
Total: 335.03MB
ROUTINE ======================== .../memcpu.algOne in code/go/src/.../memcpu/stream.go
 335.03MB   335.03MB (flat, cum)   100% of Total
        .          .     78:
        .          .     79:// algOne is one way to solve the problem.
        .          .     80:func algOne(data []byte, find []byte, repl []byte, output *bytes.Buffer) {
        .          .     81:
        .          .     82: // Use a bytes Buffer to provide a stream to process.
 318.53MB   318.53MB     83: input := bytes.NewBuffer(data)
        .          .     84:
        .          .     85: // The number of bytes we are looking for.
        .          .     86: size := len(find)
        .          .     87:
        .          .     88: // Declare the buffers we need to process the stream.
  16.50MB    16.50MB     89: buf := make([]byte, size)
        .          .     90: end := size - 1
        .          .     91:
        .          .     92: // Read in an initial number of bytes we need to get started.
        .          .     93: if n, err := io.ReadFull(input, buf[:end]); err != nil || n < end {
        .          .     94:       output.Write(buf[:n])
(pprof) _
复制代码

基于这个 profile，我们可以知道input以及切片buf的底层数组被分配到了堆。由于input是指针，所以这个 profile 是说明，input所指向的bytes.Buffer是分配的到堆的。所以我们先聚焦于变量input的变量的分配，并且理解是如何分配的。

由于函数bytes.NewBuffer创建的变量，和函数algOne共享，所以导致变量分配到堆。并且flat列(pprof 输出的第一列)出现的值告诉我们这个值是分配到堆的，因为函数algOne共享变量的原因导致的变量分配逃逸到堆。

flat列表示的是函数的堆的分配，可以看看list命令展示函数Benchmark是如何调用函数algOne的。

Listing 9

(pprof) list Benchmark
Total: 335.03MB
ROUTINE ======================== .../memcpu.BenchmarkAlgorithmOne in code/go/src/.../memcpu/stream_test.go
        0   335.03MB (flat, cum)   100% of Total
        .          .     18: find := []byte("elvis")
        .          .     19: repl := []byte("Elvis")
        .          .     20:
        .          .     21: b.ResetTimer()
        .          .     22:
        .   335.03MB     23: for i := 0; i < b.N; i++ {
        .          .     24:       output.Reset()
        .          .     25:       algOne(in, find, repl, &output)
        .          .     26: }
        .          .     27:}
        .          .     28:
(pprof) _
复制代码

由于只有第二列cum才有值，所以函数Benchmark函数并不直接的创建任何变量到堆的。在循环内部，每次对函数调用的时候都会分配变量到堆。你可以看到两次对list命令调用的时候，分配的值到堆是匹配的(译者注：$$318.53 + 16.50 = 335.03$$)。

到此呢，我们仍然不知道为什么bytes.Buffer会创建变量到堆。这个时候可以使用go build命令的-gcflags "-m -m"选项了。profiler会告诉我们值逃逸到的堆，而go build命令会告诉我们为什么。

编译器报告

我们可以让编译器告诉我们代码里面变量逃逸到堆的原因。

Listing 10

$ go build -gcflags "-m -m"
复制代码

这个命令会产生非常多的输出。我们需要找到的就是包含stream.go:83的行，因为stream.go是文件的名称，并且第 83 行含有代码来构建bytes.buffer的值。在搜索之后，找到了如下 6 行。

Listing 11

./stream.go:83: inlining call to bytes.NewBuffer func([]byte) *bytes.Buffer { return &bytes.Buffer literal }

./stream.go:83: &bytes.Buffer literal escapes to heap
./stream.go:83:   from ~r0 (assign-pair) at ./stream.go:83
./stream.go:83:   from input (assigned) at ./stream.go:83
./stream.go:83:   from input (interface-converted) at ./stream.go:93
./stream.go:83:   from input (passed to call[argument escapes]) at ./stream.go:93
复制代码

第一行是非常有意思的

Listing 12

./stream.go:83: inlining call to bytes.NewBuffer func([]byte) *bytes.Buffer { return &bytes.Buffer literal }
复制代码

这句话告诉了我们bytes.Buffer逃逸到堆的原因并不是对函数bytes.Buffer调用造成的。因为bytes.Buffer压根没有被调用，函数的操作被内联到了调用的地方。

第 83 行的的如下代码

Listing 13

83     input := bytes.NewBuffer(data)
复制代码

由于编译器选择把bytes.NewBuffer内联到代码里面，所以上面的代码在实际调用的时候是如下的

Listing 14

input := &bytes.Buffer{buf: data}
复制代码

这就意味着函数algOne是直接创建bytes.Buffer的。那么到底是什么导致 input 被分配到堆中的呢？答案就在剩下的五行报告中。

Listing 15

./stream.go:83: &bytes.Buffer literal escapes to heap
./stream.go:83:   from ~r0 (assign-pair) at ./stream.go:83
./stream.go:83:   from input (assigned) at ./stream.go:83
./stream.go:83:   from input (interface-converted) at ./stream.go:93
./stream.go:83:   from input (passed to call[argument escapes]) at ./stream.go:93
复制代码

上面的这些内容告诉我们是第 93 行造成的值逃逸的。因为input变量被赋值给了一个接口。

接口

我并没有印象在代码中对接口有过赋值的操作。但是如果看了第 93 行代码，问题就变得清晰了。

Listing 16

 93     if n, err := io.ReadFull(input, buf[:end]); err != nil {
 94         output.Write(buf[:n])
 95         return
 96     }
复制代码

由于调用了io.ReadFull函数，所以造成了对接口的赋值。如果你看了io.ReadFull的定义，你可以看到函数io.ReadFull接受的第一个参数是一个接口。

Listing 17

type Reader interface {
      Read(p []byte) (n int, err error)
}

func ReadFull(r Reader, buf []byte) (n int, err error) {
      return ReadAtLeast(r, buf, len(buf))
}
复制代码

这个说明了，把bytes.Buffer的地址传递给函数，然后函数把这个地址作为一个接口存储，这就造成了变量逃逸到了堆。现在我们看到了使用接口的代价：变量分配到堆和变量的间接使用(如果分配到栈，变量的访问速度会更快)。如果使用接口并没有使得代码变得更好，那就最好别使用接口。我跟随这下面这些指导来使用接口

当有下面几种情况的时候，我会使用接口

用户需要自己实现接口的细节
API 有许多实现方法，需要各自维护其细节
API 的部分操作随着时间会改变，需要解耦

不需要使用接口的情况如下

为了使用接口而使用接口
用于完成一个算法
当用户可以自己定义接口的时候

现在我们需要问自己，这个算法真的需要使用io.ReadFull函数吗？答案是否定的，因为bytes.Buffer类型有一系列方法可以使用，并且使用这些方法可以有效的避免变量被分配到堆。

现在我们可以移去io包，并使用input变量已有的方法Read。

下面的代码移去了io包，为了保持新的代码行和原来的代码行不变，使用了变量_来避免导入io包。这样就可以保持io包还在引入的行列中。

Listing 18

 12 import (
 13     "bytes"
 14     "fmt"
 15     _ "io"
 16 )

 80 func algOne(data []byte, find []byte, repl []byte, output *bytes.Buffer) {
 81
 82     // Use a bytes Buffer to provide a stream to process.
 83     input := bytes.NewBuffer(data)
 84
 85     // The number of bytes we are looking for.
 86     size := len(find)
 87
 88     // Declare the buffers we need to process the stream.
 89     buf := make([]byte, size)
 90     end := size - 1
 91
 92     // Read in an initial number of bytes we need to get started.
 93     if n, err := input.Read(buf[:end]); err != nil || n < end {
 94         output.Write(buf[:n])
 95         return
 96     }
 97
 98     for {
 99
100         // Read in one byte from the input stream.
101         if _, err := input.Read(buf[end:]); err != nil {
102
103             // Flush the reset of the bytes we have.
104             output.Write(buf[:end])
105             return
106         }
107
108         // If we have a match, replace the bytes.
109         if bytes.Compare(buf, find) == 0 {
110             output.Write(repl)
111
112             // Read a new initial number of bytes.
113             if n, err := input.Read(buf[:end]); err != nil || n < end {
114                 output.Write(buf[:n])
115                 return
116             }
117
118             continue
119         }
120
121         // Write the front byte since it has been compared.
122         output.WriteByte(buf[0])
123
124         // Slice that front byte out.
125         copy(buf, buf[1:])
126     }
127 }
复制代码

当我们再次运行 benchmark 的时候，就可以看到变量bytes.Buffer不再分配到堆中了。

Listing 19

$ go test -run none -bench AlgorithmOne -benchtime 3s -benchmem -memprofile mem.out
BenchmarkAlgorithmOne-8    	2000000 	     1814 ns/op         5 B/op  	      1 allocs/op
复制代码

也可以从上面的输出看到，代码性能提升了约 29%。代码花费的时间由 2570 ns/op 到 1814 ns/op。既然这个问题解决了，我们现在就可以聚焦于切片buf背后的数组分配到了堆的问题。如果我们使用新的代码，来运行得到 profile 的结果，我们也许就可以解决这个问题了。

Listing 20

$ go tool pprof -alloc_space memcpu.test mem.out
Entering interactive mode (type "help" for commands)
(pprof) list algOne
Total: 7.50MB
ROUTINE ======================== .../memcpu.BenchmarkAlgorithmOne in code/go/src/.../memcpu/stream_test.go
     11MB       11MB (flat, cum)   100% of Total
        .          .     84:
        .          .     85: // The number of bytes we are looking for.
        .          .     86: size := len(find)
        .          .     87:
        .          .     88: // Declare the buffers we need to process the stream.
     11MB       11MB     89: buf := make([]byte, size)
        .          .     90: end := size - 1
        .          .     91:
        .          .     92: // Read in an initial number of bytes we need to get started.
        .          .     93: if n, err := input.Read(buf[:end]); err != nil || n < end {
        .          .     94:       output.Write(buf[:n])
复制代码

现在唯一分配到堆的一行就是第 89 行了，这部分的分配就是切片底层的数组。

栈帧

我们需要知道为什么buf底层的数组分配到了堆。再次运行go build指令，并且使用参数-gcflags "-m -m"，在输出的结果中搜索stream.go:89。

Listing 21

$ go build -gcflags "-m -m"
./stream.go:89: make([]byte, size) escapes to heap
./stream.go:89:   from make([]byte, size) (too large for stack) at ./stream.go:89
复制代码

报告中说的是分配的数组对于栈来说太大了。这个信息是非常的有迷惑性的。因为并不是底层数组太大了，而是编译器在编译的时候不知道底层数组的大小。

只有在编译器在编译期间知道值的大小的时候，值才会被分配到栈。这是因为每个函数的栈帧的大小都是在编译期间计算的。如果编译器不知道一个值的大小，那么编译器会把值分配到堆上。

为了展示这个，我们暂时硬编码切片的大小为 5 到代码中去

Listing 22

 89     buf := make([]byte, 5)
复制代码

这个时候再运行 benchmark，所有的分配到堆的操作都没有了。

Listing 23

$ go test -run none -bench AlgorithmOne -benchtime 3s -benchmem
BenchmarkAlgorithmOne-8    	3000000 	     1720 ns/op         0 B/op  	      0 allocs/op
复制代码

如果再次查看编译器的报告，你会发现没有变量的逃逸行为

Listing 24

$ go build -gcflags "-m -m"
./stream.go:83: algOne &bytes.Buffer literal does not escape
./stream.go:89: algOne make([]byte, 5) does not escape
复制代码

显然，并不能硬编码切片的大小到代码中，所以代码中知道存在着一次的变量分配到堆的操作。

分配和性能

有了三次的修改，我们可以查看、对比每次修改后的性能

Listing 25

Before any optimization
BenchmarkAlgorithmOne-8    	2000000 	     2570 ns/op       117 B/op  	      2 allocs/op

Removing the bytes.Buffer allocation
BenchmarkAlgorithmOne-8    	2000000 	     1814 ns/op         5 B/op  	      1 allocs/op

Removing the backing array allocation
BenchmarkAlgorithmOne-8    	3000000 	     1720 ns/op         0 B/op  	      0 allocs/op
复制代码

在第一优化的时候，性能提升大约 29%。第二次优化之后，性能提升约 33%。通过这些数据，我们可以看到变量分配到堆是影响程序性能的。

结论

go 有许多让人吃惊的工具，来让我们理解编译器在涉及到逃逸分析是所做的决定的原由。基于这些信息，我们可以修改代码以保持可以存在于栈中的变量避免存在于堆中。你并不需要完成一个在堆上不分配内存的程序，但是你需要使得这些操作尽可能的避免。

永远不要基于程序的性能写代码，因为你不想猜测程序的性能。我们应该首先基于正确性来写代码。这意味着需要聚焦于整体性，可读性和简单性。在你有了一个程序的时候，确认下程序是否运行的足够块。如果不够快，那么可以使用 go 提供的工具来找到修复程序运行慢的问题。

有疑问加站长微信联系（非本文作者）

本文来自：掘金

感谢作者：胡大海

查看原文：通过内存分配来学习 go 中的机制

入群交流（和以上内容无关）：加入Go大咖交流群，或添加微信：liuxiaoyan-s 备注：入群；或加QQ群：692541889

503 次点击

加入收藏微博

收入我的专栏

上一篇：go实现laravel的encrypt()和decrypt()方法

下一篇：PHP程序员学习路线

代码

函数

变量

0 回复

添加一条新回复（您需要登录后才能回复没有账号？）

请尽量让自己的回复能够对别人有帮助
支持 Markdown 格式, **粗体**、~~删除线~~、`单行代码`
支持 @ 本站用户；支持表情（输入 : 提示），见 Emoji cheat sheet
图片支持拖拽、截图粘贴等方式上传

关注我

扫码关注领全套学习资料
加入 QQ 群：
- 192706294（已满）
- 731990104（已满）
- 798786647（已满）
- 729884609（已满）
- 977810755（已满）
- 815126783（已满）
- 812540095（已满）
- 1006366459（已满）
- 692541889
加入微信群：liuxiaoyan-s，备注入群
也欢迎加入知识星球 Go粉丝们（免费）

通过内存分配来学习 go 中的机制

前言

程序

Benchmarking

Profiling

编译器报告

接口

栈帧

分配和性能

结论

用户登录

今日阅读排行

一周阅读排行

关注我

前言

程序

Benchmarking

Profiling

编译器报告

接口

栈帧

分配和性能

结论

通过内存分配来学习 go 中的机制

前言

程序

Benchmarking

Profiling

编译器报告

接口

栈帧

分配和性能

结论

用户登录

今日阅读排行

一周阅读排行

关注我

给该专栏投稿 写篇新文章

收入到我管理的专栏 新建专栏

前言

程序

Benchmarking

Profiling

编译器报告

接口

栈帧

分配和性能

结论

给该专栏投稿写篇新文章

收入到我管理的专栏新建专栏