golang性能调试优化方法

在 golang 程序中，有哪些内容需要调试优化？

一般常规内容：

cpu：程序对cpu的使用情况 - 使用时长，占比等
内存：程序对cpu的使用情况 - 使用时长，占比，内存泄露等。如果在往里分，程序堆、栈使用情况
I/O：IO的使用情况 - 哪个程序IO占用时间比较长

golang 程序中：

goroutine：go的协程使用情况，调用链的情况
goroutine leak：goroutine泄露检查
go dead lock：死锁的检测分析
data race detector：数据竞争分析，其实也与死锁分析有关

golang 性能调试优化方法：

Benchmark：基准测试，对特定代码的运行时间和内存信息等进行测试
Profiling：程序分析，程序的运行画像，在程序执行期间，通过采样收集的数据对程序进行分析
Trace：跟踪，在程序执行期间，通过采集发生的事件数据对程序进行分析

profiling 和 trace 有啥区别？
profiling 分析没有时间线，trace 分析有时间线。

概要

pprof 是 golang 官方提供的性能调优分析工具，可以对程序进行性能分析，并可视化数据，看起来相当的直观。

当你的 go 程序遇到性能瓶颈时，可以使用这个工具来进行调试并优化程序。

本文将对下面 golang 中 2 个监控性能的包 pprof 进行运用：

runtime/pprof：采集程序运行数据进行性能分析，一般用于后台工具型应用。
net/http/pprof：对 runtime/pprof 的二次封装，一般应用于 http server ，采集 http server 运行时数据进行分析。

pprof 开启后，每隔一段时间就会采集当前程序的堆栈信息，获取函数的 cpu、内存等使用情况。通过对采样的数据进行分析，形成一个数据分析报告。

pprof 以 profile.proto 的格式保存数据，然后根据这个数据生成可视化的分析报告，支持文本形式和图形形式报告。

profile.proto 里具体的数据格式是 protocol buffers。

pprof 使用模式

Report generation：报告生成
Interactive terminal use：交互式终端
Web interface：Web 界面

pprof开启后，每隔一段时间(10ms)就会收集当前的堆栈信息，获取各个函数占用的CPU以及内存资源，然后通过对这些采样数据进行分析，形成一个性能分析报告。

性能优化主要有一下几个方面：

CPU Profile：报告程序的CPU使用情况，按照一定频率去采集应用程序在CPU和寄存器上面的数据。
Memory Profile（Heap Profile）:报告程序的内存使用情况。
Block Profiling: 报告goroutines不在运行状态的情况，可以用来分析和查找死锁等性能瓶颈。
Goroutine Profiling: 报告goroutines的使用情况，有哪些roroutines，它们的调用关系是怎样的。

注意：我们只应该在性能测试的时候才在代码中引入pprof

生成profile方法

golang目前提供了3中profile，分别是 cpu profile, memery profile, blocking profile, 对于如何生成这些profile有两种办法，
一种是使用net/http/pprof包，一种是需要自己手写代码。

net/http/pprof

这种方法是非常非常简单的，只需要引入 net/http/pprof 包就可以了，网页上可以查看
其实net/http/pprof中只是使用runtime/pprof包来进行封装了一下，并在http端口上暴露出来。

默认Mux

package main

import (
    "fmt"
    "log"
    "net/http"
    _ "net/http/pprof"
)

func main() {
    go func() {
        for {
            fmt.Println("hello world")
        }
    }()
    log.Fatal(http.ListenAndServe("0.0.0.0:8080", nil))
}

自定义Mux

package main

import (
	"github.com/gin-gonic/gin"
	"net/http/pprof"
)

func main() {
	r := gin.Default()
	
	// "github.com/gin-contrib/pprof" 
	// pprof.Register(r)
	
	prefixRouter := r.Group("/debug/pprof")
	{
		prefixRouter.GET("/", gin.WrapF(pprof.Index))
		prefixRouter.GET("/cmdline", gin.WrapF(pprof.Cmdline))
		prefixRouter.GET("/profile", gin.WrapF(pprof.Profile))
		prefixRouter.POST("/symbol", gin.WrapF(pprof.Symbol))
		prefixRouter.GET("/symbol", gin.WrapF(pprof.Symbol))
		prefixRouter.GET("/trace", gin.WrapF(pprof.Trace))
		prefixRouter.GET("/allocs", gin.WrapH(pprof.Handler("allocs")))
		prefixRouter.GET("/block", gin.WrapH(pprof.Handler("block")))
		prefixRouter.GET("/goroutine", gin.WrapH(pprof.Handler("goroutine")))
		prefixRouter.GET("/heap", gin.WrapH(pprof.Handler("heap")))
		prefixRouter.GET("/mutex", gin.WrapH(pprof.Handler("mutex")))
		prefixRouter.GET("/threadcreate", gin.WrapH(pprof.Handler("threadcreate")))
	}
	r.Run(":8080")
}

使用go tool pprof采集数据

查看

1 2	http://127.0.0.1:8080/debug/pprof 查看整体信息 http://127.0.0.1:8080/debug/pprof/profile 可以将cpu profile下载下来观察分析

查看信息内容

allocs: 内存分配情况的采样信息
blocks: 阻塞操作情况的采样信息
cmdline: 显示程序启动命令参数及其参数
goroutine: 显示当前所有协程的堆栈信息
heap: 堆上的内存分配情况的采样信息
mutex: 锁竞争情况的采样信息
profile: cpu占用情况的采样信息，点击会下载文件
threadcreate: 系统线程创建情况的采样信息
trace: 程序运行跟踪信息

采集

go tool pprof http://127.0.0.1:8080/debug/pprof/profile # (pprof) traces
go tool pprof http://127.0.0.1:8080/debug/pprof/mutex # (pprof) traces Unlock
go tool pprof http://127.0.0.1:8080/debug/pprof/heap
go tool pprof http://127.0.0.1:8080/debug/pprof/block # (pprof) traces runtime.chanrecv1 找到协程使用list查看

# 以下命令的意思是采集协程数据并持续20S
go tool pprof --seconds 20 http://127.0.0.1:8080/debug/pprof/goroutine
go tool pprof http://127.0.0.1:8080/debug/pprof/goroutine?second=20

# inuse_space表示查看常驻内存的使用情况（list 查看相应的函数）
go tool pprof -inuse_space http://127.0.0.1:8080/debug/pprof/heap
# alloc_objects分析应用程序的内存临时分配情况（list 查看相应的函数）
go tool pprof -alloc_objects http://127.0.0.1:8080/debug/pprof/heap

runtime/pprof

自定编码实现profile的采集

package main

import (
	"bytes"
	"flag"
	"log"
	"math/rand"
	"os"
	"runtime"
	"runtime/pprof"
	"sync"
)

var cpuprofile = flag.String("cpuprofile", "", "write cpu profile to `file`")
var memprofile = flag.String("memprofile", "", "write mem profile to `file`")

func main() {
	flag.Parse()

	if *cpuprofile != "" {
		f, err := os.Create(*cpuprofile)
		if err != nil {
			log.Fatal("could not create CPU profile: ", err)
		}
		defer f.Close() // error handling omitted for example
		if err := pprof.StartCPUProfile(f); err != nil {
			log.Fatal("could not start CPU profile: ", err)
		}
		defer pprof.StopCPUProfile()
	}

	// ... rest of the program ...
	var wg sync.WaitGroup
	wg.Add(200)
	for i := 0; i < 200; i++ {
		go cyclenum(30000, &wg)
	}
	writeBytes()
	wg.Wait()

	if *memprofile != "" {
		f, err := os.Create(*memprofile)
		if err != nil {
			log.Fatal("could not create memory profile: ", err)
		}
		defer f.Close() // error handling omitted for example
		runtime.GC()    // get up-to-date statistics
		if err := pprof.WriteHeapProfile(f); err != nil {
			log.Fatal("could not write memory profile: ", err)
		}
	}
}
func cyclenum(num int, wg *sync.WaitGroup) {
	slice := make([]int, 0)
	for i := 0; i < num; i++ {
		for j := 0; j < num; j++ {
			j = i + j
			slice = append(slice, j)
		}
	}
	wg.Done()
}
func writeBytes() *bytes.Buffer {
	var buff bytes.Buffer
	for i := 0; i < 30000; i++ {
		buff.Write([]byte{'0' + byte(rand.Intn(10))})
	}
	return &buff
}

go build test.go
./test.exe --cpuprofile=cpu.pprof  --memprofile=mem.pprof

go tool pprof cpu.pprof

程序退出时会生成两个profile，一个是cpu.pprof的，一个是mem.pprof。

go tool pprof 简单的使用格式为：
go tool pprof [binary] [source]

binary：是应用的二进制文件，用来解析各种符号
source：表示 profile 数据的来源，可以是本地的文件，也可以是 http 地址

go test

编写一个test程序,可以更加精准地进行测试

-bench=. 进行性能测试，"."是正则匹配，匹配了所有的测试函数
-benchmem 打印出申请内存的次数。一般用于简单的性能测试，不会导出数据文件。
-blockprofile block.out 将协程的阻塞数据写入特定的文件（block.out）。如果-c，则写成二进制文件。
-cpuprofile cpu.out 将协程的CPU使用数据写入特定的文件（cpu.out）。如果-c，则写成二进制文件。
-memprofile mem.out 将协程的内存申请数据写入特定的文件（mem.out）。如果-c，则写成二进制文件。
-mutexprofile mutex.out 将协程的互斥数据写入特定的文件（mutex.out）。如果-c，则写成二进制文件。
-trace trace.out 将执行调用链写入特定文件（trace.out）。

可以看到，go test会把数据导出在特定的文件之中。之后分析数据就需要读取这些数据。这些数据本身的可读性很差，还是需要借助go tool pprof来分析。

1 2	# -run=none 只进行基准测试 go test -v -run=none -bench=. -blockprofile block.out -cpuprofile cpu.out -memprofile mem.out -mutexprofile mutex.out -trace trace.out

分析

go tool pprof block.out
go tool pprof cpu.out
go tool pprof mem.out
go tool pprof mutex

go tool trace trace.out

分析数据

命令行交互分析

1 2	go tool pprof cpu.pprof go tool pprof test.exe cpu.pprof

$ go tool pprof cpu.pprof
Type: cpu
Time: Jun 11, 2022 at 8:07pm (CST)
Duration: 515.79ms, Total samples = 1.47s (285.00%)
Entering interactive mode (type "help" for commands, "o" for options)
(pprof)

Type: 分析类型，这里是 cpu

Duration: 程序执行的时长

Duration 下面还有一行提示，这是交互模式（通过输入 help 获取帮助信息，输入 o 获取选项信息）。

pprof命令

(pprof) help
  Commands:
    callgrind        Outputs a graph in callgrind format
    comments         Output all profile comments
    disasm           Output assembly listings annotated with samples
    dot              Outputs a graph in DOT format
    eog              Visualize graph through eog
    evince           Visualize graph through evince
    gif              Outputs a graph image in GIF format
    gv               Visualize graph through gv
    kcachegrind      Visualize report in KCachegrind
    list             Output annotated source for functions matching regexp
    pdf              Outputs a graph in PDF format
    peek             Output callers/callees of functions matching regexp
    png              Outputs a graph image in PNG format
    proto            Outputs the profile in compressed protobuf format
    ps               Outputs a graph in PS format
    raw              Outputs a text representation of the raw profile
    svg              Outputs a graph in SVG format
    tags             Outputs all tags in the profile
    text             Outputs top entries in text form
    top              Outputs top entries in text form
    topproto         Outputs top entries in compressed protobuf format
    traces           Outputs all profile samples in text form
    tree             Outputs a text rendering of call graph
    web              Visualize graph through web browser
    weblist          Display annotated source in a web browser
    o/options        List options and their current values
    q/quit/exit/^D   Exit pprof

  Options:
    call_tree        Create a context-sensitive call tree
    compact_labels   Show minimal headers
    divide_by        Ratio to divide all samples before visualization
    drop_negative    Ignore negative differences
    edgefraction     Hide edges below <f>*total
    focus            Restricts to samples going through a node matching regexp
    hide             Skips nodes matching regexp
    ignore           Skips paths going through any nodes matching regexp
    intel_syntax     Show assembly in Intel syntax
    mean             Average sample value over first value (count)
    nodecount        Max number of nodes to show
    nodefraction     Hide nodes below <f>*total
    noinlines        Ignore inlines.
    normalize        Scales profile based on the base profile.
      files            Aggregate at the file level.
      lines            Aggregate at the source code line level.
      addresses        Aggregate at the address level.
    sort
      cum              Sort entries based on cumulative weight
      flat             Sort entries based on own weight
  :   Clear focus/ignore/hide/tagfocus/tagignore

  type "help <cmd|option>" for more information

常用命令

top

top 默认查看程序中占用cpu前10位的函数。
top 3 可以查看程序中占用CPU前三位的函数。

(pprof) top
Showing nodes accounting for 1.41s, 95.92% of 1.47s total
Showing top 10 nodes out of 70
      flat  flat%   sum%        cum   cum%
     0.82s 55.78% 55.78%      0.82s 55.78%  runtime.memmove
     0.18s 12.24% 68.03%      0.18s 12.24%  runtime.memclrNoHeapPointers
     0.13s  8.84% 76.87%      1.21s 82.31%  main.cyclenum
     0.12s  8.16% 85.03%      0.12s  8.16%  runtime.stdcall2
     0.07s  4.76% 89.80%      0.07s  4.76%  runtime._ExternalCode
     0.03s  2.04% 91.84%      0.03s  2.04%  runtime.procyield
     0.02s  1.36% 93.20%      0.02s  1.36%  runtime.asyncPreempt
     0.02s  1.36% 94.56%      0.03s  2.04%  runtime.gentraceback
     0.01s  0.68% 95.24%      0.01s  0.68%  runtime.(*pageAlloc).find
     0.01s  0.68% 95.92%      0.01s  0.68%  runtime.casgstatus
(pprof) top 3
Showing nodes accounting for 1.46s, 99.32% of 1.47s total
Showing top 15 nodes out of 70
      flat  flat%   sum%        cum   cum%
     0.82s 55.78% 55.78%      0.82s 55.78%  runtime.memmove
     0.18s 12.24% 68.03%      0.18s 12.24%  runtime.memclrNoHeapPointers
     0.13s  8.84% 76.87%      1.21s 82.31%  main.cyclenum

最后一列为函数名称，其他各项内容意义如下：

flat:当前函数占用CPU的耗时
flat%:当前函数占用CPU的耗时百分比
sum%:函数占用CPU的累积耗时百分比
cum：当前函数+调用当前函数的占用CPU总耗时
cum%: 当前函数+调用当前函数的占用CPU总耗时百分比

从字段数据我们可以看出哪一个函数比较耗费时间，就可以对这个函数进一步分析

list

我们还可以使用list <函数名>命令查看具体的函数分析
从上面采样数据可以分析出总耗时最长的函数是main.cycylenum,用list cyclenum命令进行分析

(pprof) list cyclenum
Total: 1.47s
ROUTINE ======================== main.cyclenum in E:\project\goproject\test\test.go
     130ms      1.21s (flat, cum) 82.31% of Total
         .          .     51:   }
         .          .     52:}
         .          .     53:func cyclenum(num int, wg *sync.WaitGroup) {
         .          .     54:   slice := make([]int, 0)
         .          .     55:   for i := 0; i < num; i++ {
      70ms       80ms     56:           for j := 0; j < num; j++ {
      30ms       30ms     57:                   j = i + j
      30ms      1.10s     58:                   slice = append(slice, j)
         .          .     59:           }
         .          .     60:   }
         .          .     61:   wg.Done()
         .          .     62:}
         .          .     63:func writeBytes() *bytes.Buffer {

text

(pprof) text
Showing nodes accounting for 1.41s, 95.92% of 1.47s total
Showing top 10 nodes out of 70
      flat  flat%   sum%        cum   cum%
     0.82s 55.78% 55.78%      0.82s 55.78%  runtime.memmove
     0.18s 12.24% 68.03%      0.18s 12.24%  runtime.memclrNoHeapPointers
     0.13s  8.84% 76.87%      1.21s 82.31%  main.cyclenum
     0.12s  8.16% 85.03%      0.12s  8.16%  runtime.stdcall2
     0.07s  4.76% 89.80%      0.07s  4.76%  runtime._ExternalCode
     0.03s  2.04% 91.84%      0.03s  2.04%  runtime.procyield
     0.02s  1.36% 93.20%      0.02s  1.36%  runtime.asyncPreempt
     0.02s  1.36% 94.56%      0.03s  2.04%  runtime.gentraceback
     0.01s  0.68% 95.24%      0.01s  0.68%  runtime.(*pageAlloc).find
     0.01s  0.68% 95.92%      0.01s  0.68%  runtime.casgstatus

pdf

pdf命令可以生成可视化的pdf文件。

help

help命令可以提供所有pprof支持的命令说明。

图形可视化分析

安装Graphviz:

macos: brew install graphviz

ubuntu: apt-get install graphviz

centos: yum install graphviz

windows: https://graphviz.org/download/

dot -version命令查看是否安装成功

web命令生成svg

在命令行里输入 web 命令，就可以生成一个 svg 格式的文件，用浏览器打开即可查看 svg 文件。

每个框代表一个函数，理论上框越大表示占用的 cpu 资源越多

每个框之间的线条代表函数之间的调用关系，线条上的数字表示函数调用的次数

每个框中第一行数字表示当前函数占用 cpu 的百分比，第二行数字表示当前函数累计占用 cpu 的百分比

$ go tool pprof cpu.pprof 
Type: cpu
Time: Jun 11, 2022 at 8:07pm (CST)
Duration: 515.79ms, Total samples = 1.47s (285.00%)
Entering interactive mode (type "help" for commands, "o" for options)
(pprof) web

生成pdf

$ go tool pprof cpu.pprof 
Type: cpu
Time: Jun 11, 2022 at 8:07pm (CST)
Duration: 515.79ms, Total samples = 1.47s (285.00%)
Entering interactive mode (type "help" for commands, "o" for options)
(pprof) pdf

go-torch生成火焰图

用 pprof 生成的采样数据，要把它转换成火焰图，就要使用一个转换工具 go-torch，这个工具是 uber 开源，它是用 go 语言编写的，
可以直接读取 pprof 采集的数据，并生成一张火焰图， svg 格式的文件。

生成火焰图的程序 FlameGraph 是用 perl 写的，所以先要安装执行 perl 语言的环境。
在终端下执行命令：perl -h ，输出了帮助信息，则说明安装成功

1 2	go install github.com/uber/go-torch@latest git clone https://github.com/brendangregg/FlameGraph.git

把FlameGraph中的文件加入Path中
go-torch -b cpu.pprof
go-torch -b cpu.pprof -f cpu_flamegraph.svg
火焰图 svg 文件，你可以点击上面的每个方块来查看分析它上面的内容。

火焰图的调用顺序从下到上，每个方块代表一个函数，它上面一层表示这个函数会调用哪些函数，
方块的大小代表了占用 CPU 使用时长长短。

go tool可视化分析

1	go tool pprof -http=:8080 cpu.pprof

Graph

节点的颜色越红，其cum和cum%越大。其颜色越灰白，则cum和cum%越小。

节点越大，其flat和flat%越大；其越小，则flat和flat%越小

线条代表了函数的调用链，线条越粗，代表指向的函数消耗了越多的资源。反之亦然。

线条的样式代表了调用关系。实线代表直接调用；虚线代表中间少了几个节点；带有inline字段表示该函数被内联进了调用方（不用在意，可以理解成实线）。

Flame Graph

火焰图每个方块代表一个函数，它下面一层表示这个函数会调用哪些函数，方块的大小代表了占用 CPU 使用的长短。
火焰图的配色并没有特殊的意义，默认的红、黄配色是为了更像火焰而已。

测试覆盖率

语句的覆盖率是指在测试中至少被运行一次的代码占总代码数的比例。
go test -run=Converage 标志参数通过在测试代码中插入生成钩子来统计覆盖率数据。在运行每个测试前，它将待测代码拷贝一份并做修改，在每个词法块都会设置一个布尔标志变量。当被修改后的被测试代码运行退出时，将统计日志数据写入c.out文件，并打印一部分执行的语句的一个总结。

# 用-coverprofile标志参数，生成覆盖率文件c.out
go test -run=Coverage -coverprofile=cover.out
# 测试覆盖率工具，打印了测试日志，生成一个HTML报告
go tool cover -html=cover.out

注意

获取的 Profiling 数据是动态的，要想获得有效的数据，请保证应用处于较大的负载（比如正在生成中运行的服务，或者通过其他工具模拟访问压力）。
否则如果应用处于空闲状态，得到的结果可能没有任何意义。

参考

https://go.dev/blog/pprof
https://blog.csdn.net/winter_wu_1998/article/details/104579701
https://segmentfault.com/a/1190000041261187
https://juejin.cn/post/6860844230273662984