剖析与优化 Go 的 web 应用

lightfish-zhang · 2018-03-25 22:32:08 · 2940 次点击 · 预计阅读时间 18 分钟 · 大约8小时之前开始浏览

这是一个创建于 2018-03-25 22:32:08 的文章，其中的信息可能已经有所发展或是发生改变。

Go 语言有一个很强大的内置分析器（profiler），支持CPU、内存、协程与阻塞/抢占（block/contention）的分析。

开启分析器（profiler）

Go 提供了一个低级的分析 API runtime/pprof ，但如果你在开发一个长期运行的服务，使用更高级的 net/http/pprof 包会更加便利。

你只需要在代码中加入 import _ "net/http/pprof" ，它就会自动注册所需的 HTTP 处理器（Handler）。

package main

import (
    "net/http"
    _ "net/http/pprof"
)

func hiHandler(w http.ResponseWriter, r *http.Request) {
    w.Write([]byte("hi"))
}

func main() {
    http.HandleFunc("/", hiHandler)
    http.ListenAndServe(":8080", nil)
}

如果你的 web 应用使用自定义的 URL 路由，你需要手动注册一些 HTTP 端点（endpoints）　。

package main

import (
    "net/http"
    "net/http/pprof"
)

func hiHandler(w http.ResponseWriter, r *http.Request) {
    w.Write([]byte("hi"))
}

func main() {
    r := http.NewServeMux()
    r.HandleFunc("/", hiHandler)

    // Register pprof handlers
    r.HandleFunc("/debug/pprof/", pprof.Index)
    r.HandleFunc("/debug/pprof/cmdline", pprof.Cmdline)
    r.HandleFunc("/debug/pprof/profile", pprof.Profile)
    r.HandleFunc("/debug/pprof/symbol", pprof.Symbol)
    r.HandleFunc("/debug/pprof/trace", pprof.Trace)

    http.ListenAndServe(":8080", r)
}

如上代码那样，开启 web 应用，然后使用 pprof 工具：

go tool pprof [binary] http://127.0.0.1:8080/debug/pprof/profile

pprof 的最大的优点之一是它是的性能负载很小，可以在生产环境中使用，不会对 web 请求响应造成明显的性能消耗。

但是在深入挖掘 pprof 之前，我们需要一个真实案例来展示如何在 GO 应用中检查并解决性能问题。

案例： Left-pad 微服务

假设，你需要开发一个全新的微服务，为输入的字符串添加左填充

$ curl "http://127.0.0.1:8080/v1/leftpad/?str=test&len=10&chr=*"
{"str":"******test"}

这个服务需要收集基本的指标（metric），如请求的数量与每个请求的响应时间。收集到的所有指标都应该发送到一个指标聚合器（metric aggregator）（例如 StatsD）除此之外，这个服务需要日志记录这个请求的详细信息，如 URL，IP 地址与 user-agent 。

你可以在 Github 上看到这个微服务的初步实现，tag 为 v1

编译并运行这个应用

go build && ./goprofex

性能分析

我们将要测试这个微服务每秒可以处理多少个请求，可以使用这个工具 Apache Benchmark tool :

ab -k -c 8 -n 100000 "http://127.0.0.1:8080/v1/leftpad/?str=test&len=50&chr=*"
# -k   Enables HTTP keep-alive
# -c   Number of concurrent requests
# -n   Number of total requests to make

测试结果不差，但可以做到更快

Requests per second:    22810.15 [#/sec] (mean)
Time per request:       0.042 [ms] (mean, across all concurrent requests)

注：上面的测试结果的执行环境：笔记本 MacBook Pro Late 2013 (2.6 GHz Intel Core i5, 8 GB 1600 MHz DDR3, macOS 10.12.3) , Go编译器版本是1.8 。

CPU 分析（CPU profile）

再次执行 Apache benchmark tool ，但这次使用更高的请求数量（1百万应该足够了），并同时执行 pprof ：

go tool pprof goprofex http://127.0.0.1:8080/debug/pprof/profile

这个 CPU profiler 默认执行30秒。它使用采样的方式来确定哪些函数花费了大多数的CPU时间。Go runtime 每10毫秒就停止执行过程并记录每一个运行中的协程的当前堆栈信息。

当 pprof 进入交互模式，输入 top，这条命令会展示收集样本中最常出现的函数列表。在我们的案例中，是所有 runtime 与标准库函数，这不是很有用。

(pprof) top
63.77s of 69.02s total (92.39%)
Dropped 331 nodes (cum <= 0.35s)
Showing top 10 nodes out of 78 (cum >= 0.64s)
      flat  flat%   sum%        cum   cum%
    50.79s 73.59% 73.59%     50.92s 73.78%  syscall.Syscall
     4.66s  6.75% 80.34%      4.66s  6.75%  runtime.kevent
     2.65s  3.84% 84.18%      2.65s  3.84%  runtime.usleep
     1.88s  2.72% 86.90%      1.88s  2.72%  runtime.freedefer
     1.31s  1.90% 88.80%      1.31s  1.90%  runtime.mach_semaphore_signal
     1.10s  1.59% 90.39%      1.10s  1.59%  runtime.mach_semaphore_wait
     0.51s  0.74% 91.13%      0.61s  0.88%  log.(*Logger).formatHeader
     0.49s  0.71% 91.84%      1.06s  1.54%  runtime.mallocgc
     0.21s   0.3% 92.15%      0.56s  0.81%  runtime.concatstrings
     0.17s  0.25% 92.39%      0.64s  0.93%  fmt.(*pp).doPrintf

有一个更好的方法来查看高级别的性能概况 —— web 命令，它会生成一个热点（hot spots）的 SVG 图像，可以在浏览器中打开它：

从上图你可以看到这个应用花费了 CPU 大量的时间在 logging、测试报告（metric reporting ）上，以及部分时间在垃圾回收上。

使用 list 命令可以 inspect 每个函数的详细代码，例如 list leftpad ：

(pprof) list leftpad
ROUTINE ======================== main.leftpad in /Users/artem/go/src/github.com/akrylysov/goprofex/leftpad.go
      20ms      490ms (flat, cum)  0.71% of Total
         .          .      3:func leftpad(s string, length int, char rune) string {
         .          .      4:   for len(s) < length {
      20ms      490ms      5:       s = string(char) + s
         .          .      6:   }
         .          .      7:   return s
         .          .      8:}

对无惧查看反汇编代码的人而言，可以使用 pprof 的 disasm 命令，它有助于查看实际的处理器指令：

(pprof) disasm leftpad
ROUTINE ======================== main.leftpad
      20ms      490ms (flat, cum)  0.71% of Total
         .          .    1312ab0: GS MOVQ GS:0x8a0, CX
         .          .    1312ab9: CMPQ 0x10(CX), SP
         .          .    1312abd: JBE 0x1312b5e
         .          .    1312ac3: SUBQ $0x48, SP
         .          .    1312ac7: MOVQ BP, 0x40(SP)
         .          .    1312acc: LEAQ 0x40(SP), BP
         .          .    1312ad1: MOVQ 0x50(SP), AX
         .          .    1312ad6: MOVQ 0x58(SP), CX
...

函数堆栈分析（Heap profile）

执行堆栈分析器（heap profiler）

go tool pprof goprofex http://127.0.0.1:8080/debug/pprof/heap

默认情况下，它显示当前正在使用的内存量：

(pprof) top
512.17kB of 512.17kB total (  100%)
Dropped 85 nodes (cum <= 2.56kB)
Showing top 10 nodes out of 13 (cum >= 512.17kB)
      flat  flat%   sum%        cum   cum%
  512.17kB   100%   100%   512.17kB   100%  runtime.mapassign
         0     0%   100%   512.17kB   100%  main.leftpadHandler
         0     0%   100%   512.17kB   100%  main.timedHandler.func1
         0     0%   100%   512.17kB   100%  net/http.(*Request).FormValue
         0     0%   100%   512.17kB   100%  net/http.(*Request).ParseForm
         0     0%   100%   512.17kB   100%  net/http.(*Request).ParseMultipartForm
         0     0%   100%   512.17kB   100%  net/http.(*ServeMux).ServeHTTP
         0     0%   100%   512.17kB   100%  net/http.(*conn).serve
         0     0%   100%   512.17kB   100%  net/http.HandlerFunc.ServeHTTP
         0     0%   100%   512.17kB   100%  net/http.serverHandler.ServeHTTP

但是我们更感兴趣的是分配的对象的数量，执行 pprof 时使用选项 -alloc_objects

go tool pprof -alloc_objects goprofex http://127.0.0.1:8080/debug/pprof/heap

几乎 70% 的对象仅由两个函数分配　——　leftpad 与 StatsD ，我们需要更仔细的查看它们：

(pprof) top
559346486 of 633887751 total (88.24%)
Dropped 32 nodes (cum <= 3169438)
Showing top 10 nodes out of 46 (cum >= 14866706)
      flat  flat%   sum%        cum   cum%
 218124937 34.41% 34.41%  218124937 34.41%  main.leftpad
 116692715 18.41% 52.82%  218702222 34.50%  main.(*StatsD).Send
  52326692  8.25% 61.07%   57278218  9.04%  fmt.Sprintf
  39437390  6.22% 67.30%   39437390  6.22%  strconv.FormatFloat
  30689052  4.84% 72.14%   30689052  4.84%  strings.NewReplacer
  29869965  4.71% 76.85%   29968270  4.73%  net/textproto.(*Reader).ReadMIMEHeader
  20441700  3.22% 80.07%   20441700  3.22%  net/url.parseQuery
  19071266  3.01% 83.08%  374683692 59.11%  main.leftpadHandler
  17826063  2.81% 85.90%  558753994 88.15%  main.timedHandler.func1
  14866706  2.35% 88.24%   14866706  2.35%  net/http.Header.clone

还有一些非常有用的调试内存问题的选项， -inuse_objects 可以显示正在使用的对象的数量，-alloc_space 可以显示程序启动以来分配的多少内存。

自动内存分配很便利，但世上没有免费的午餐。动态内存分配不仅比堆栈分配要慢得多，还会间接地影响性能。你在堆上分配的每一块内存都会增加 GC 的负担，并且占用更多的 CPU 资源。要使垃圾回收花费更少的时间，唯一的方法是减少内存分配。

逃逸分析（Escape analysis）

无论何时使用 ＆ 运算符来获取指向变量的指针或使用 make 或 new 分配新值，它并不一定意味着它被分配在堆上。

func foo(a []string) {
    fmt.Println(len(a))
}

func main() {
    foo(make([]string, 8))
}

在上面的例子中，　make([]string, 8) 是在栈上分配内存的。Go 通过 escape analysis 来判断使用堆而不是栈来分配内存是否安全。你可以添加选项 -gcflags=-m 来查看逃逸分析（escape analysis）的结果：

5  type X struct {v int}
6
7  func foo(x *X) {
8       fmt.Println(x.v)
9  }
10
11 func main() {
12      x := &X{1}
13      foo(x)
14 }

go build -gcflags=-m
./main.go:7: foo x does not escape
./main.go:12: main &X literal does not escape

Go 编译器足够智能，可以将一些动态分配转换为栈分配。但你如果使用接口来处理变量，会导致糟糕的情况。

// Example 1
type Fooer interface {
    foo(a []string)
}

type FooerX struct{}

func (FooerX) foo(a []string) {
    fmt.Println(len(a))
}

func main() {
    a := make([]string, 8) // make([]string, 8) escapes to heap
    var fooer Fooer
    fooer = FooerX{}
    fooer.foo(a)
}

// Example 2
func foo(a interface{}) string {
    return a.(fmt.Stringer).String()
}

func main() {
    foo(make([]string, 8)) // make([]string, 8) escapes to heap
}

Dmitry Vyukov 的论文 Go Escape Analysis Flaws 讲述了更多的逃逸分析（escape analysis）无法处理的案例。

一般来说，对于你不需要再修改数据的小结构体，你应该使用值传参而不是指针传参。

注：对于大结构体，使用指针传参而不是值传参（复制整个结构体）的性能消耗更低。

协程分析（Goroutine profile）

Goroutine profile 会转储协程的调用堆栈与运行中的协程数量

go tool pprof goprofex http://127.0.0.1:8080/debug/pprof/goroutine

上图只有18个活跃中的协程，这是非常小的数字。拥有数千个运行中的协程的情况并不少见，但并不会显著降低性能。

阻塞分析（Block profile）

阻塞分析会显示导致阻塞的函数调用，它们使用了同步原语（synchronization primitives），如互斥锁（mutexes）和 channels 。

在执行 block contention profile 之前，你必须设置使用runtime.SetBlockProfileRate 设置 profiling rate　。你可以在 main 函数或者 init 函数中添加这个调用。

go tool pprof goprofex http://127.0.0.1:8080/debug/pprof/block

timedHandler 与 leftpadHandler 花费了大量的时间来等待 log.Printf 中的互斥锁。导致这个结果的原因是 log package 的实现使用了互斥锁来对多个协程共享的文件进行同步访问（synchronize access）。

指标（Benchmarking）

正如我们之前注意的，在这个案例的最大的几个性能杀手是 log package ，leftpad 与 StatsD.Send 函数。现在我们找到了性能瓶颈，但是在优化代码之前，我们需要一个可重复的方法来对我们关注的代码进行性能测试。Go 的 testing package 包含了这样的一个机制。你需要在测试文件中创建一个函数，以 func BenchmarkXxx(*testing.B) 的格式。

func BenchmarkStatsD(b *testing.B) {
    statsd := StatsD{
        Namespace:  "namespace",
        SampleRate: 0.5,
    }
    for i := 0; i < b.N; i++ {
        statsd.Incr("test")
    }
}

也可以使用 net/http/httptest 对这整个 HTTP 程序进行基准测试：

func BenchmarkLeftpadHandler(b *testing.B) {
    r := httptest.NewRequest("GET", "/v1/leftpad/?str=test&len=50&chr=*", nil)
    for i := 0; i < b.N; i++ {
        w := httptest.NewRecorder()
        leftpadHandler(w, r)
    }
}

执行基准测试：

go test -bench=. -benchmem

它会显示每次迭代需要的时间量，以及内存/分配数量（amount of memory/number of allocations）：

BenchmarkTimedHandler-4           200000          6511 ns/op        1621 B/op         41 allocs/op
BenchmarkLeftpadHandler-4         200000         10546 ns/op        3297 B/op         75 allocs/op
BenchmarkLeftpad10-4             5000000           339 ns/op          64 B/op          6 allocs/op
BenchmarkLeftpad50-4              500000          3079 ns/op        1568 B/op         46 allocs/op
BenchmarkStatsD-4                1000000          1516 ns/op         560 B/op         15 allocs/op

优化性能

Logging

让应用运行更快，一个很好又不是经常管用的方法是，让它执行更少的工作。除了 debug 的目的之外，这行代码 log.Printf("%s request took %v", name, elapsed) 在 web service 中不需要。所有非必要的 logs 应该在生产环境中被移除代码或者关闭功能。可以使用分级日志（a leveled logger）来解决这个问题，比如这些很棒的日志工具库（logging libraries）

关于打日志或者其他一般的 I/O 操作，另一个重要的事情是尽可能使用有缓冲的输入输出（buffered input/output），这样可以减少系统调用的次数。通常，并不是每个 logger 调用都需要立即写入文件 —— 使用 bufio package 来实现 buffered I/O 。我们可以使用 bufio.NewWriter 或者 bufio.NewWriterSize 来简单地封装 io.Writer 对象，再传递给 logger ：

log.SetOutput(bufio.NewWriterSize(f, 1024*16))

左填充（leftpad）

再看一遍 leftpad 函数

func leftpad(s string, length int, char rune) string {
    for len(s) < length {
        s = string(char) + s
    }
    return s
}

在每一个循环中连接字符串的做法并不高效，因为每一次循环迭代都会分配一个新的字符串（反复分配内存空间）。有一种更好的方法来构建字符串，使用 bytes.Buffer ：

func leftpad(s string, length int, char rune) string {
    buf := bytes.Buffer{}
    for i := 0; i < length-len(s); i++ {
        buf.WriteRune(char)
    }
    buf.WriteString(s)
    return buf.String()
}

另外，我们还可以使用 string.Repeat ，使代码更加简洁：

func leftpad(s string, length int, char rune) string {
    if len(s) < length {
        return strings.Repeat(string(char), length-len(s)) + s
    }
    return s
}

StatsD client

接下来需要优化的代码是 StatsD.Send 函数：

func (s *StatsD) Send(stat string, kind string, delta float64) {
    buf := fmt.Sprintf("%s.", s.Namespace)
    trimmedStat := strings.NewReplacer(":", "_", "|", "_", "@", "_").Replace(stat)
    buf += fmt.Sprintf("%s:%s|%s", trimmedStat, delta, kind)
    if s.SampleRate != 0 && s.SampleRate < 1 {
        buf += fmt.Sprintf("|@%s", strconv.FormatFloat(s.SampleRate, 'f', -1, 64))
    }
    ioutil.Discard.Write([]byte(buf)) // TODO: Write to a socket
}

以下是有一些可能的值得改进的地方：

Sprintf 对字符串格式化非常便利，它性能表现很好，除非你每秒调用它几千次。不过，它把输入参数进行字符串格式化的时候会消耗 CPU 时间，而且每次调用都会分配一个新的字符串。为了更好的性能优化，我们可以使用 bytes.Buffer + Buffer.WriteString/Buffer.WriteByte 来替换它。
这个函数不需要每一次都创建一个新的 Replacer 实例，它可以声明为全局变量，或者作为 StatsD 结构体的一部分。
用 strconv.AppendFloat 替换 strconv.FormatFloat ，并且使用堆栈上分配的 buffer 来传递变量，防止额外的堆分配。

func (s *StatsD) Send(stat string, kind string, delta float64) {
    buf := bytes.Buffer{}
    buf.WriteString(s.Namespace)
    buf.WriteByte('.')
    buf.WriteString(reservedReplacer.Replace(stat))
    buf.WriteByte(':')
    buf.Write(strconv.AppendFloat(make([]byte, 0, 24), delta, 'f', -1, 64))
    buf.WriteByte('|')
    buf.WriteString(kind)
    if s.SampleRate != 0 && s.SampleRate < 1 {
        buf.WriteString("|@")
        buf.Write(strconv.AppendFloat(make([]byte, 0, 24), s.SampleRate, 'f', -1, 64))
    }
    buf.WriteTo(ioutil.Discard) // TODO: Write to a socket
}

这样做，将分配数量（number of allocations）从14减少到1个，并且使 Send 运行快了4倍。

BenchmarkStatsD-4                5000000           381 ns/op         112 B/op          1 allocs/op

测试优化结果

做了所有优化之后，基准测试显示出非常好的性能提升：

benchmark                     old ns/op     new ns/op     delta
BenchmarkTimedHandler-4       6511          1181          -81.86%
BenchmarkLeftpadHandler-4     10546         3337          -68.36%
BenchmarkLeftpad10-4          339           136           -59.88%
BenchmarkLeftpad50-4          3079          201           -93.47%
BenchmarkStatsD-4             1516          381           -74.87%

benchmark                     old allocs     new allocs     delta
BenchmarkTimedHandler-4       41             5              -87.80%
BenchmarkLeftpadHandler-4     75             18             -76.00%
BenchmarkLeftpad10-4          6              3              -50.00%
BenchmarkLeftpad50-4          46             3              -93.48%
BenchmarkStatsD-4             15             1              -93.33%

benchmark                     old bytes     new bytes     delta
BenchmarkTimedHandler-4       1621          448           -72.36%
BenchmarkLeftpadHandler-4     3297          1416          -57.05%
BenchmarkLeftpad10-4          64            24            -62.50%
BenchmarkLeftpad50-4          1568          160           -89.80%
BenchmarkStatsD-4             560           112           -80.00%

注: 作者使用 benchcmp 来对比结果：

再一次运行 ab

Requests per second:    32619.54 [#/sec] (mean)
Time per request:       0.030 [ms] (mean, across all concurrent requests)

这个 web 服务现在可以每秒多处理10000个请求！　

优化技巧

避免不必要的 heap 内存分配。
对于不大的结构体，值传参比指针传参更好。
如果你事先知道长度，最好提前分配 maps 或者 slice 的内存。
生产环境下，非必要情况不打日志。
如果你要频繁进行连续的读写，请使用缓冲读写（buffered I/O）
如果你的应用广泛使用 JSON，请考虑使用解析器/序列化器（parser/serializer generators）（作者个人更喜欢 easyjson）
在主要路径上的每一个操作都很关键（Every operation matters in a hot path）

结论

有时候，性能瓶颈可能不是你预想那样，理解应用程序真实性能的最好途径是认真分析它。

你可以在 Github 上找到本案例的完整的源代码，初始版本 tag 为 v1，优化版本 tag 为 v2 。比较这两个版本的传送门。

作者并非以英语为母语，并且他在努力提高英语水平，如果原文有表达问题或者语法错误，请纠正他。

via: http://artem.krylysov.com/blog/2017/03/13/profiling-and-optimizing-go-web-applications/

作者：Artem Krylsov 译者：lightfish-zhang 校对：Unknwon

本文由 GCTT 原创编译，Go语言中文网荣誉推出

本文由 GCTT 原创翻译，Go语言中文网首发。也想加入译者行列，为开源做一些自己的贡献么？欢迎加入 GCTT！
翻译工作和译文发表仅用于学习和交流目的，翻译工作遵照 CC-BY-NC-SA 协议规定，如果我们的工作有侵犯到您的权益，请及时联系我们。
欢迎遵照 CC-BY-NC-SA 协议规定转载，敬请在正文中标注并保留原文/译文链接和作者/译者等信息。
文章仅代表作者的知识和看法，如有不同观点，请楼下排队吐槽

有疑问加站长微信联系（非本文作者））

入群交流（和以上内容无关）：加入Go大咖交流群，或添加微信：liuxiaoyan-s 备注：入群；或加QQ群：692541889

2940 次点击 ∙ 1 赞

加入收藏微博

被以下专栏收入，发现更多相似内容

GCTT - Go 中文翻译组

经典收录收入我的专栏

上一篇：go基础

下一篇：Go 语言的数据结构：链表

pprof

web

net

runtime

1 回复 | 直到 2018-03-27 16:33:36

monkee · #1 · 7年之前

赞👍

添加一条新回复（您需要登录后才能回复没有账号？）

请尽量让自己的回复能够对别人有帮助
支持 Markdown 格式, **粗体**、~~删除线~~、`单行代码`
支持 @ 本站用户；支持表情（输入 : 提示），见 Emoji cheat sheet
图片支持拖拽、截图粘贴等方式上传

剖析与优化 Go 的 web 应用

开启分析器（profiler）

案例： Left-pad 微服务

性能分析

CPU 分析（CPU profile）

函数堆栈分析（Heap profile）

逃逸分析（Escape analysis）

协程分析（Goroutine profile）

阻塞分析（Block profile）

指标（Benchmarking）

优化性能

Logging

左填充（leftpad）

StatsD client

测试优化结果

优化技巧

结论

用户登录

今日阅读排行

一周阅读排行

关注我

剖析与优化 Go 的 web 应用

开启分析器（profiler）

案例： Left-pad 微服务

性能分析

CPU 分析（CPU profile）

函数堆栈分析（Heap profile）

逃逸分析（Escape analysis）

协程分析（Goroutine profile）

阻塞分析（Block profile）

指标（Benchmarking）

优化性能

Logging

左填充（leftpad）

StatsD client

测试优化结果

优化技巧

结论

用户登录

今日阅读排行

一周阅读排行

关注我

给该专栏投稿 写篇新文章

收入到我管理的专栏 新建专栏

给该专栏投稿写篇新文章

收入到我管理的专栏新建专栏