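CPU 100% and Memory Leak Troubleshooting in Practice

This article provides a systematic troubleshooting guide for server performance problems such as CPU saturation and memory leaks.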

小二 · Published 2026-06-04 10:02:36

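Preface

💡 Pain points: Is the server CPU suddenly maxed out and the service slow to respond? Does memory keep growing with no obvious leak? Does the program get slower the longer it runs?

🎯 Solution: learn a CPU and memory troubleshooting workflow, from system monitoring and process analysis to locating and fixing the problem.

[Figure: CPU/memory troubleshooting flowchart]

Common types of performance problems:

Type	Symptom	Cause
CPU 100%	All requests slow down	Infinite loop / compute-intensive work
Single core at 100%	Single-thread bottleneck	GIL / lock contention
Memory leak	Memory grows continuously	Objects never released
Out of memory	OOM crash	Memory growth reaches the limit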

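1. CPU Troubleshooting

1.1 System CPU Monitoring

# ===== System CPU monitoring =====

# 1. top - real-time monitoring (sorted by CPU)
top
# Press P: sort by CPU usage
# Press M: sort by memory usage
# Press 1: show each CPU core
# Press q: quit

# 2. htop - a friendlier top (if installed)
htop
# F6: sort
# F5: tree view
# F3: search

# 3. mpstat - per-core CPU usage (requires sysstat)
mpstat -P ALL 1 5
# -P ALL: all CPUs
# 1: refresh every second
# 5: 5 reports in total

# Sample output: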
# 09:00:00 AM  CPU    %usr   %nice    %sys %iowait    %irq   %soft  %steal  %guest  %gnice   %idle
# 09:00:01 AM  all    5.23    0.00    1.45    0.12    0.00    0.05    0.00    0.00    0.00   93.15
# 09:00:01 AM    0    3.12    0.00    0.89    0.05    0.00    0.02    0.00    0.00    0.00   95.92
# 09:00:01 AM    1    7.34    0.00    2.01    0.20    0.00    0.08    0.00    0.00    0.00   90.37

# 4. sar - system activity report
sar -u 1 10       # CPU usage
sar -P ALL 1 10   # per-CPU usage
sar -q 1 10       # run queue length and load

# 5. uptime - quick look at the load
uptime
# Output: 09:00:00 up 10 days,  3:22,  2 users,  load average: 5.23, 4.56, 3.21
# load average: 1-minute, 5-minute, and 15-minute averages

# 6. Python script for CPU monitoring

#!/usr/bin/env python3
"""CPU monitoring script"""

import psutil
import time
from datetime import datetime

def monitor_cpu(interval=1, duration=60):
    """Print overall and per-core CPU usage every `interval` seconds."""
    print(f"{'Time':<20} {'CPU%':<10} {'Per-core %':<40}")
    print("-" * 70)
    
    start = time.time()
    while time.time() - start < duration:
        # Take one blocking per-core sample and average it for the overall figure.
        # A separate non-blocking per-core call returns meaningless values the first time.
        cpu_per_core = psutil.cpu_percent(interval=interval, percpu=True)
        cpu_percent = sum(cpu_per_core) / len(cpu_per_core)
        
        now = datetime.now().strftime('%Y-%m-%d %H:%M:%S')
        cores = ", ".join([f"C{i}:{p:.1f}%" for i, p in enumerate(cpu_per_core)])
        
        print(f"{now:<20} {cpu_percent:<10.1f} {cores:<40}")

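def get_high_cpu_processes(top_n=10, sample_seconds=1.0):
    """Print the processes using the most CPU."""
    # The first cpu_percent() reading for a process is always 0.0, so take a
    # baseline reading, wait, and only then collect the real values.
    for p in psutil.process_iter():
        try:
            p.cpu_percent(interval=None)
        except (psutil.NoSuchProcess, psutil.AccessDenied):
            pass
    time.sleep(sample_seconds)
    
    processes = []
    for p in psutil.process_iter(['pid', 'name', 'cpu_percent', 'memory_percent']):
        pinfo = p.info
        # Values are None when access to the process was denied
        if pinfo['cpu_percent']:
            processes.append(pinfo)
    
    # Sort by CPU usage
    processes.sort(key=lambda x: x['cpu_percent'], reverse=True)
    
    print(f"\nTop {top_n} processes by CPU:")
    print(f"{'PID':<10} {'Name':<30} {'CPU%':<10} {'MEM%':<10}")
    print("-" * 60)
    
    for proc in processes[:top_n]:
        print(f"{proc['pid']:<10} {proc['name'] or '?':<30} "
              f"{proc['cpu_percent']:<10.1f} {proc['memory_percent'] or 0.0:<10.2f}")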

if __name__ == '__main__':
    import sys
    
    if len(sys.argv) > 1 and sys.argv[1] == 'top':
        get_high_cpu_processes()
    else:
        monitor_cpu(interval=2, duration=30)
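
The load average reported by uptime is easier to judge once it is divided by the number of CPU cores. The following is a minimal sketch, assuming the common rule of thumb that a per-core value above 1.0 means more runnable tasks than cores. The helper name load_per_core and the 4-core host are illustrative assumptions, not part of the original scripts.

```python
import os

def load_per_core(load_avg, cpu_count):
    """Divide each load-average value by the number of CPU cores."""
    if cpu_count <= 0:
        raise ValueError("cpu_count must be positive")
    return tuple(round(value / cpu_count, 2) for value in load_avg)

if __name__ == "__main__":
    # Values from the uptime sample above, on a hypothetical 4-core host
    print(load_per_core((5.23, 4.56, 3.21), 4))
    # Live values (os.getloadavg exists on Unix-like systems only)
    if hasattr(os, "getloadavg"):
        print(load_per_core(os.getloadavg(), os.cpu_count() or 1))
```

On that hypothetical host the 1-minute figure is above 1.0 per core while the 15-minute figure is below it, which suggests the load is rising.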

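1.2 Process CPU Analysis

# ===== Process CPU analysis =====

# 1. top for a specific process
top -p <PID>
top -p 12345

# 2. ps - list processes by CPU
# Note: ps reports %CPU averaged over the process lifetime, not the current rate
ps aux | sort -k3 -rn | head -20
ps -eo pid,ppid,%cpu,%mem,cmd | sort -k3 -rn | head -20

# 3. pidstat - per-process statistics (requires sysstat)
pidstat -p 12345 1 5    # monitor one process
pidstat -u 1 5          # CPU for all processes
pidstat -r 1 5          # memory for all processes

# 4. Show the threads of a process
ps -T -p <PID>
# -T: show threads
# SPID: thread ID
# CMD: command name

# 5. top with threads
top -H -p <PID>
# -H: show threads
# Press Shift+H to toggle between thread and process view

# 6. List the files a process has open
lsof -p <PID>

# 7. strace - trace system calls (adds significant overhead; use with care on a busy process)
strace -p <PID> -c    # count system calls
strace -p <PID> -f    # also follow threads and child processes
strace -p <PID> -T    # show time spent in each call

# 8. perf - performance profiling
perf top -p <PID>     # live view
perf record -p <PID>   # record samples (Ctrl+C to stop)
perf report            # view the report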
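
What top -H shows interactively can also be scripted by sampling per-thread CPU time twice and ranking the difference. The sketch below is Linux-only and assumes the field layout of /proc/<pid>/task/<tid>/stat from proc(5), where utime and stime are fields 14 and 15. The function names are illustrative.

```python
import os
import time

def parse_cpu_ticks(stat_line):
    """Return utime + stime (clock ticks) from one /proc/<pid>/task/<tid>/stat line."""
    # The command name (field 2) may contain spaces, so split after the last ')'
    fields = stat_line.rsplit(")", 1)[1].split()
    # fields[0] is field 3 (state), so fields 14 and 15 are at indexes 11 and 12
    return int(fields[11]) + int(fields[12])

def read_thread_ticks(pid):
    """Map thread ID -> CPU ticks for every thread of a process (Linux only)."""
    ticks = {}
    task_dir = f"/proc/{pid}/task"
    for tid in os.listdir(task_dir):
        try:
            with open(f"{task_dir}/{tid}/stat") as f:
                ticks[int(tid)] = parse_cpu_ticks(f.read())
        except (FileNotFoundError, ProcessLookupError):
            pass  # the thread exited between listdir and open
    return ticks

def hottest_threads(before, after, top_n=5):
    """Rank threads by the ticks they consumed between two samples."""
    deltas = {tid: after[tid] - before.get(tid, 0) for tid in after}
    return sorted(deltas.items(), key=lambda item: item[1], reverse=True)[:top_n]

if __name__ == "__main__" and os.path.isdir("/proc/self/task"):
    pid = os.getpid()  # replace with the PID under investigation
    first = read_thread_ticks(pid)
    time.sleep(1)
    for tid, delta in hottest_threads(first, read_thread_ticks(pid)):
        print(f"thread {tid}: {delta} ticks")
```

The thread IDs printed here are the same ones shown in the PID column of top -H, so a hot thread can be matched against a stack dump afterwards.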

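1.3 Causes of High CPU Usage

Scenario	Symptom	Cause
Infinite loop	One core at 100%	Logic error in the code
GIL	Python limited to one core	Threads competing for the GIL
Compute-intensive work	CPU stays high	High algorithmic complexity
Lock contention	Several cores high at once	Locks too coarse-grained
Context switching	Large number of threads	Too many threads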
2. Memory Troubleshooting

2.1 System Memory Monitoring

# ===== System memory monitoring =====

# 1. free - show memory usage
free -h
# -h: human-readable units
# Output:
#               total        used        free      shared  buff/cache   available
# Mem:           31Gi       15Gi       2.3Gi       1.2Gi        13Gi        14Gi
# Swap:         8.0Gi       0B         8.0Gi

# 2. vmstat - virtual memory statistics
vmstat 1 10
# procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
#  r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
#  1  0      0 2354688  524288 13107200    0    0     0    0   100  200  5  2 93  0  0

# 3. /proc/meminfo - detailed memory information
cat /proc/meminfo
# MemTotal:       32768000 kB
# MemFree:         2354688 kB
# MemAvailable:   14680064 kB
# Buffers:          524288 kB
# Cached:         13107200 kB
# SwapCached:            0 kB
# ...

# 4. smem - per-process memory report (if installed)
smem -r -k
# -r: reverse the sort order (largest first)
# -k: show unit suffixes

# 5. Python memory monitoring

#!/usr/bin/env python3
"""Memory monitoring script"""

import psutil
import time
from datetime import datetime

def monitor_memory(interval=2, duration=60):
    """Print system memory usage every `interval` seconds."""
    print(f"{'Time':<20} {'Total':<12} {'Used':<12} {'Available':<12} {'Used%':<10} {'Swap':<12}")
    print("-" * 78)
    
    start = time.time()
    while time.time() - start < duration:
        mem = psutil.virtual_memory()
        swap = psutil.swap_memory()
        
        now = datetime.now().strftime('%Y-%m-%d %H:%M:%S')
        
        print(f"{now:<20} "
              f"{format_bytes(mem.total):<12} "
              f"{format_bytes(mem.used):<12} "
              f"{format_bytes(mem.available):<12} "
              f"{mem.percent:<10.1f} "
              f"{format_bytes(swap.used):<12}")
        
        if time.time() - start >= duration:
            break
        time.sleep(interval)

def format_bytes(bytes_value):
    """Format a byte count with a binary unit suffix."""
    for unit in ['B', 'KB', 'MB', 'GB', 'TB']:
        if bytes_value < 1024.0:
            return f"{bytes_value:.1f}{unit}"
        bytes_value /= 1024.0
    return f"{bytes_value:.1f}PB"

def get_high_memory_processes(top_n=10):
    """Print the processes using the most memory."""
    processes = []
    for p in psutil.process_iter(['pid', 'name', 'memory_info', 'memory_percent']):
        try:
            pinfo = p.info
            mem_rss = pinfo['memory_info'].rss if pinfo['memory_info'] else 0
            processes.append({
                'pid': pinfo['pid'],
                'name': pinfo['name'],
                'memory_rss': mem_rss,
                'memory_percent': pinfo['memory_percent']
            })
        except (psutil.NoSuchProcess, psutil.AccessDenied):
            pass
    
    # Sort by resident memory
    processes.sort(key=lambda x: x['memory_rss'], reverse=True)
    
    print(f"\nTop {top_n} processes by memory:")
    print(f"{'PID':<10} {'Name':<30} {'RSS':<15} {'MEM%':<10}")
    print("-" * 65)
    
    for proc in processes[:top_n]:
        print(f"{proc['pid']:<10} {proc['name']:<30} "
              f"{format_bytes(proc['memory_rss']):<15} {proc['memory_percent']:<10.2f}")

if __name__ == '__main__':
    import sys
    
    if len(sys.argv) > 1 and sys.argv[1] == 'top':
        get_high_memory_processes()
    else:
        monitor_memory(interval=2, duration=30)
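
When psutil is not installed, the same usage figure can be derived directly from /proc/meminfo. This is a minimal sketch that reuses the sample values shown earlier; it assumes usage is best measured against MemAvailable, because MemFree treats reclaimable page cache as used. The function names are illustrative.

```python
def parse_meminfo(text):
    """Parse /proc/meminfo-style text into a dict of values in kB."""
    values = {}
    for line in text.splitlines():
        name, _, rest = line.partition(":")
        parts = rest.split()
        if parts and parts[0].isdigit():
            values[name.strip()] = int(parts[0])
    return values

def used_percent(meminfo):
    """Memory in use as a percentage, based on MemAvailable rather than MemFree."""
    return (1 - meminfo["MemAvailable"] / meminfo["MemTotal"]) * 100

# Sample values copied from the /proc/meminfo listing above
SAMPLE = """\
MemTotal:       32768000 kB
MemFree:         2354688 kB
MemAvailable:   14680064 kB
"""

if __name__ == "__main__":
    info = parse_meminfo(SAMPLE)
    print(f"used: {used_percent(info):.1f}%")
    # On a Linux host, read the live file instead:
    # with open("/proc/meminfo") as f:
    #     print(f"used: {used_percent(parse_meminfo(f.read())):.1f}%")
```

With the sample values the result is 55.2%, whereas a calculation based on MemFree would report about 93% and make a healthy host look nearly full.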

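2.2 Memory Leak Detection

# ===== Memory leak detection =====

# 1. valgrind - leak detection for native programs (Linux)
valgrind --leak-check=full --show-leak-kinds=all --track-origins=yes ./your_program

# 2. gdb - generate a core dump from a running process
gdb -p <PID>
# (gdb) generate-core-file
# (gdb) quit
# or, without an interactive session: gcore <PID>

# 3. pmap - show the process memory map
pmap -x <PID>
# -x: extended format

# 4. /proc/<PID>/smaps - detailed per-mapping memory information
grep -A 10 "\[heap\]" /proc/<PID>/smaps
# Or read the per-process totals (Linux 4.14+)
cat /proc/<PID>/smaps_rollup

# 5. Python memory leak detection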

#!/usr/bin/env python3
"""Python memory leak detection"""

import gc
import sys
import tracemalloc
import psutil
import os
import time

def get_memory_usage():
    """Return the memory usage of the current process."""
    process = psutil.Process(os.getpid())
    mem_info = process.memory_info()
    return {
        'rss': mem_info.rss,  # resident (physical) memory
        'vms': mem_info.vms,  # virtual memory
        'rss_mb': mem_info.rss / 1024 / 1024,
        'vms_mb': mem_info.vms / 1024 / 1024
    }

def track_memory_leak():
    """Track memory growth."""
    print("Memory leak tracking test")
    print("=" * 60)
    
    # Record the baseline
    gc.collect()
    initial = get_memory_usage()
    print(f"Initial memory: {initial['rss_mb']:.2f} MB")
    
    # Create objects and keep references to them
    leaked_objects = []
    
    for i in range(100):
        # Allocate a large object
        data = [x for x in range(10000)]
        leaked_objects.append(data)
        
        if i % 10 == 0:
            gc.collect()
            current = get_memory_usage()
            growth = current['rss_mb'] - initial['rss_mb']
            print(f"Iteration {i:3d}: {current['rss_mb']:.2f} MB (growth: {growth:+.2f} MB)")
        
        time.sleep(0.1)
    
    # Final memory
    gc.collect()
    final = get_memory_usage()
    total_growth = final['rss_mb'] - initial['rss_mb']
    print(f"\nFinal memory: {final['rss_mb']:.2f} MB")
    print(f"Total growth: {total_growth:+.2f} MB")
    print(f"Leaked objects: {len(leaked_objects)}")

def trace_memory():
    """Track allocations with tracemalloc."""
    print("\nTracking allocations with tracemalloc")
    print("=" * 60)
    
    tracemalloc.start()
    
    # Snapshot before the workload
    snapshot1 = tracemalloc.take_snapshot()
    
    # Run the workload
    data = []
    for i in range(1000):
        data.append([x for x in range(100)])
    
    # Snapshot after the workload
    snapshot2 = tracemalloc.take_snapshot()
    
    # Compare the two snapshots line by line
    top_stats = snapshot2.compare_to(snapshot1, 'lineno')
    
    print("\nTop 10 memory increases:")
    for stat in top_stats[:10]:
        print(stat)

def find_memory_leaks():
    """Look for suspicious objects."""
    print("\nLooking for leaked objects")
    print("=" * 60)
    
    # Force a collection; the return value is the number of unreachable objects found
    unreachable = gc.collect()
    print(f"Collected {unreachable} unreachable objects")
    
    # Objects the collector found but could not free
    garbage = gc.garbage
    print(f"Uncollectable objects: {len(garbage)}")
    
    # Count live objects by type
    types = {}
    for obj in gc.get_objects():
        obj_type = type(obj).__name__
        types[obj_type] = types.get(obj_type, 0) + 1
    
    print("\nObject counts by type (Top 20):")
    sorted_types = sorted(types.items(), key=lambda x: x[1], reverse=True)
    for obj_type, count in sorted_types[:20]:
        print(f"  {obj_type}: {count}")

if __name__ == '__main__':
    track_memory_leak()
    trace_memory()
    find_memory_leaks()
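
A single RSS reading says little; what separates a leak from normal fluctuation is the trend. The following sketch fits a least-squares line through (seconds, RSS in MB) samples and flags sustained growth. The 10 MB per hour threshold and the function names are assumptions chosen for illustration, not values from the article.

```python
def growth_rate(samples):
    """Least-squares slope of (seconds, rss_mb) samples, in MB per second."""
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples")
    mean_t = sum(t for t, _ in samples) / n
    mean_m = sum(m for _, m in samples) / n
    var_t = sum((t - mean_t) ** 2 for t, _ in samples)
    if var_t == 0:
        raise ValueError("samples must span more than one timestamp")
    cov = sum((t - mean_t) * (m - mean_m) for t, m in samples)
    return cov / var_t

def looks_like_leak(samples, threshold_mb_per_hour=10.0):
    """True when the fitted growth exceeds the (assumed) hourly threshold."""
    return growth_rate(samples) * 3600 > threshold_mb_per_hour

if __name__ == "__main__":
    steady_growth = [(0, 100.0), (10, 110.0), (20, 120.0)]
    sawtooth = [(0, 100.0), (10, 120.0), (20, 100.0), (30, 120.0), (40, 100.0)]
    print(looks_like_leak(steady_growth))  # grows 1 MB per second
    print(looks_like_leak(sawtooth))       # oscillates around a flat trend
```

The sawtooth series is the pattern a garbage-collected process typically shows; it has a flat trend, so it is not reported as a leak even though individual readings rise.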

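2.3 Go Memory Leak Detection

// Go memory leak detection

package main

import (
    "fmt"
    "log"
    "net/http"
    _ "net/http/pprof"
    "runtime"
    "time"
)

// Package-level variables that simulate a leak: whatever they reference is never collected
var globalCache = make(map[string][]byte)
var leakSlice [][]byte

func main() {
    // Start the pprof HTTP server
    go func() {
        log.Println(http.ListenAndServe("localhost:6060", nil))
    }()
    
    // Simulate the leak: append 1 MiB every second and never release it
    go func() {
        for {
            leakSlice = append(leakSlice, make([]byte, 1<<20))
            time.Sleep(time.Second)
        }
    }()
    
    // Print memory statistics periodically
    ticker := time.NewTicker(5 * time.Second)
    defer ticker.Stop()
    
    var m runtime.MemStats
    
    for {
        <-ticker.C
        runtime.ReadMemStats(&m)
        
        fmt.Printf("Alloc = %v MiB\t", m.Alloc/1024/1024)
        fmt.Printf("TotalAlloc = %v MiB\t", m.TotalAlloc/1024/1024)
        fmt.Printf("Sys = %v MiB\t", m.Sys/1024/1024)
        fmt.Printf("NumGC = %v\n", m.NumGC)
    }
}

// Using pprof:
// 1. Open http://localhost:6060/debug/pprof/heap
// 2. Analyze with go tool pprof:
//    go tool pprof http://localhost:6060/debug/pprof/heap
// 3. At the pprof prompt:
//    top -cum     // sort by cumulative allocations
//    list <func>  // show allocations line by line for a function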
3. Flame Graph Analysis

3.1 perf + Flame Graphs

# ===== perf + flame graphs =====

# 1. Install the FlameGraph tools
# git clone https://github.com/brendangregg/FlameGraph.git
# export PATH=$PATH:/path/to/FlameGraph

# 2. Record CPU samples
perf record -F 99 -p <PID> -g -- sleep 30
# -F 99: sample at 99 Hz
# -p <PID>: profile the given process
# -g: record call stacks
# sleep 30: run for 30 seconds

# 3. View the report
perf report

# 4. Generate the flame graph (run from the FlameGraph directory)
perf script -i perf.data > perf.unfold
./stackcollapse-perf.pl perf.unfold > perf.folded
./flamegraph.pl perf.folded > cpu_flamegraph.svg

# 5. Profile the whole system
perf record -F 99 -ag -- sleep 60
perf script | ./stackcollapse-perf.pl > out.folded
./flamegraph.pl out.folded > system_flamegraph.svg
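
The intermediate .folded file is what flamegraph.pl actually draws: one line per unique call stack, frames joined by semicolons, followed by a sample count. The sketch below illustrates that collapsing step on hand-written samples. It is a simplified illustration of the idea, not a reimplementation of stackcollapse-perf.pl, and the function names in the samples are made up.

```python
from collections import Counter

def fold_stacks(samples):
    """Collapse sampled call stacks (root frame first) into folded lines with counts."""
    counts = Counter(";".join(stack) for stack in samples)
    return [f"{stack} {count}" for stack, count in sorted(counts.items())]

if __name__ == "__main__":
    # Four hypothetical samples; each list is one captured call stack
    samples = [
        ["main", "handle", "parse"],
        ["main", "handle", "parse"],
        ["main", "handle", "render"],
        ["main", "idle"],
    ]
    for line in fold_stacks(samples):
        print(line)
```

In the rendered graph the width of a frame is proportional to its count, so parse would be twice as wide as render here.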

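3.2 py-spy Flame Graphs

# ===== py-spy flame graphs =====

# Install
# pip install py-spy

# Record a CPU flame graph
py-spy record -o profile.svg --pid <PID>
# or
py-spy record -o profile.svg -- python your_script.py

# py-spy samples CPU only and has no memory mode; for a memory flame graph
# use a memory profiler such as memray:
# pip install memray
# memray run -o memray.bin your_script.py
# memray flamegraph memray.bin

# Live sampling view
py-spy top --pid <PID>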

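3.3 Node.js Flame Graphs

# ===== Node.js flame graphs =====

# 1. Install clinic.js
# npm install -g clinic

# 2. Overall health diagnosis
clinic doctor -- node server.js

# 3. Flame graph
clinic flame -- node server.js

# 4. Manual profiling with the built-in V8 profiler
# Start the server
node --prof server.js

# Generate load
# ab -n 1000 -c 10 http://localhost:3000/

# Process the log into a text summary
node --prof-process isolate-*.log > processed.txt

# Or render a flame graph with flamebearer (npm install -g flamebearer)
node --prof-process --preprocess -j isolate-*.log | flamebearer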

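3.4 Java Flame Graphs

# ===== Java flame graphs =====

# 1. Use async-profiler
# Download: https://github.com/jvm-profiling-tools/async-profiler
# (releases from 3.0 on ship the launcher as bin/asprof instead of profiler.sh)

# 2. Record a CPU flame graph (2.x and later write flame graphs as HTML)
./profiler.sh -d 30 -e cpu -f profile.html <PID>

# 3. Record memory allocations
./profiler.sh -d 30 -e alloc -f alloc.html <PID>

# 4. List running Java processes
jps -l
# Output: 12345 com.example.MyApplication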

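4. Common Problem Scenarios

4.1 Python GIL Problems

# ===== Python GIL problems =====

# Problem: multithreading does not speed up CPU-bound tasks

# Wrong approach
import threading
import time

def cpu_task(n):
    """CPU-bound task"""
    result = 0
    for i in range(n):
        result += i * i
    return result

# Using threads (no speedup: only one thread executes Python bytecode at a time)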
threads = []
start = time.time()
for _ in range(4):
    t = threading.Thread(target=cpu_task, args=(10000000,))
    threads.append(t)
    t.start()
for t in threads:
    t.join()
print(f"Multithreaded time: {time.time() - start:.2f}s")  # about the same as running the 4 tasks serially

# Correct option 1: multiple processes
import multiprocessing

def cpu_task_mp(n):
    return sum(i * i for i in range(n))

if __name__ == '__main__':
    with multiprocessing.Pool(4) as pool:
        start = time.time()
        results = pool.map(cpu_task_mp, [10000000] * 4)
        print(f"Multiprocess time: {time.time() - start:.2f}s")  # up to ~4x faster on 4 or more cores

# Correct option 2: use a C extension
# Move the CPU-bound code into Cython or C

# Correct option 3: asyncio + thread pool (for I/O-bound work)
import asyncio
from concurrent.futures import ThreadPoolExecutor

def blocking_io_task():
    # Blocking I/O call (simulated with sleep)
    time.sleep(1)
    return "done"

async def main():
    loop = asyncio.get_running_loop()
    with ThreadPoolExecutor(max_workers=4) as executor:
        # Run the blocking calls in worker threads so the event loop stays responsive
        tasks = [loop.run_in_executor(executor, blocking_io_task) for _ in range(10)]
        results = await asyncio.gather(*tasks)
    return results

# asyncio.run(main())
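4.2 Java Memory Leaks

// ===== Java memory leak scenarios =====

import java.io.FileInputStream;
import java.io.IOException;
import java.lang.ref.WeakReference;
import java.util.*;
import javax.swing.JButton;

public class MemoryLeakExamples {
    
    // Leak 1: a static collection holds objects forever
    static List<Object> cache = new ArrayList<>();
    
    // Leak 2: resource never closed
    public void leakExample1() throws IOException {
        // The stream is never closed, so its file descriptor leaks
        FileInputStream fis = new FileInputStream("file.txt");
        // Fix: use try-with-resources, or close it in a finally block
    }
    
    private JButton button;
    
    // Leak 3: listener never removed
    public void leakExample2() {
        // A listener is added but never removed
        button.addActionListener(e -> {/* ... */});
        // button holds the listener, and the listener holds the enclosing instance
    }
    
    // Leak 4: String.intern()
    public void leakExample3() {
        // Interning large numbers of distinct strings can bloat the string pool
        String s = new String("large_string").intern();
    }
    
    // Leak 5: ThreadLocal never cleaned up
    ThreadLocal<byte[]> buffer = new ThreadLocal<>();
    
    // The correct pattern
    public void correctPattern() {
        try {
            // use the resource
        } finally {
            // always clean up
            buffer.remove();
        }
    }
    
    // Use WeakReference to avoid leaks
    public void correctWithWeakRef() {
        // A weakly referenced object can be reclaimed by the GC
        WeakReference<Object> ref = new WeakReference<>(new Object());
        // Once no strong reference remains, the object is collected
    }
}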

4.3 Node.js Memory Leaks

// Node.js memory leaks

// Leak 1: global variables
global.someCache = new Map();  // never released

// Leak 2: closure references
function createLeak() {
    const largeArray = new Array(1000000);
    
    return function() {
        // the closure references largeArray
        return largeArray.length;
    };
}

const leakedFn = createLeak();
// largeArray stays alive for as long as leakedFn is reachable

// Leak 3: event listeners accumulate
class EventEmitter {
    constructor() {
        this.events = {};
    }
    
    on(event, listener) {
        if (!this.events[event]) {
            this.events[event] = [];
        }
        this.events[event].push(listener);
        // listeners are never removed!
    }
}

// Leak 4: cache without an upper bound
const cache = new Map();
function getData(key) {
    if (!cache.has(key)) {
        cache.set(key, fetchFromDB(key));
    }
    return cache.get(key);
}
// cache grows without limit

// Correct: a bounded cache
// (newer major versions of lru-cache use a named export: const { LRUCache } = require('lru-cache'))
const LRU = require('lru-cache');
const boundedCache = new LRU({ max: 1000 });  // cap the size

// Analyze with heapdump
const heapdump = require('heapdump');

// Write a snapshot
heapdump.writeSnapshot('./heapdump.heapsnapshot');

// Or trigger a snapshot from a signal inside the application
process.on('SIGUSR2', () => {
    heapdump.writeSnapshot();
});

5. Production Cases

5.1 Case: Python Memory Leak

# ===== Case: fixing a Python memory leak =====

# Problem: after a few days of uptime the service's memory keeps growing until it is OOM-killed

# 1. Reproduce the problem
import requests
import time
import tracemalloc

tracemalloc.start()

def process_request():
    # Simulate handling a request
    data = requests.get('https://api.example.com/data').json()
    
    # Problem code: every response is stored
    global results
    results.append(data)  # results are never removed!
    
    return len(results)

# Test
results = []
for i in range(100):
    process_request()
    if i % 10 == 0:
        current, peak = tracemalloc.get_traced_memory()
        print(f"Request {i}: memory {current / 1024 / 1024:.1f} MB, peak {peak / 1024 / 1024:.1f} MB")

# Diagnose with memory_profiler
# pip install memory_profiler
# mprof run python script.py
# mprof plot
# 2. Fix
from collections import OrderedDict, deque

class RequestProcessor:
    def __init__(self, max_size=1000):
        # Bounded queue: the oldest result is dropped automatically
        self.results = deque(maxlen=max_size)
        # Bounded LRU cache
        self.cache = OrderedDict()
        self.max_size = max_size
    
    def process_request(self, request_id):
        # Check the cache first
        cache_key = f"request:{request_id}"
        
        if cache_key in self.cache:
            self.cache.move_to_end(cache_key)  # mark as recently used
            return self.cache[cache_key]
        
        # Process the request
        data = self._fetch_data(request_id)
        
        # Bounded storage
        self.results.append(data)
        
        # Bounded cache: evict the least recently used entry when full
        self.cache[cache_key] = data
        if len(self.cache) > self.max_size:
            self.cache.popitem(last=False)
        
        return data
    
    def _fetch_data(self, request_id):
        # Simulated data fetch
        return {'id': request_id, 'data': 'x' * 1000}
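# 3. Use weak references
import weakref

class Cache:
    def __init__(self):
        # Entries disappear automatically once nothing else references the value
        self._cache = weakref.WeakValueDictionary()
    
    def get(self, key):
        return self._cache.get(key)
    
    def set(self, key, value):
        # value must support weak references (class instances do; plain dict, list, str, and int do not)
        self._cache[key] = value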

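# 4. Periodic cleanup
# Note: gc.collect() only frees unreachable reference cycles; objects that are
# still referenced (such as the unbounded list above) are not released by it.
import gc

class PeriodicCleaner:
    def __init__(self, interval=3600):
        self.interval = interval
        self.last_clean = time.time()
    
    def check(self):
        now = time.time()
        if now - self.last_clean > self.interval:
            gc.collect()
            self.last_clean = now
            print("GC cleanup executed")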
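
The fixes above bound the cache by size only; when cached data also goes stale, a time-to-live (TTL) is a natural addition. The following sketch combines LRU eviction with expiry using only the standard library. The class name TTLCache, its defaults, and the injectable clock parameter are illustrative assumptions.

```python
import time
from collections import OrderedDict

class TTLCache:
    """Bounded cache: evicts the least recently used entry and expires old ones."""

    def __init__(self, max_size=1000, ttl_seconds=300.0, clock=time.monotonic):
        self.max_size = max_size
        self.ttl = ttl_seconds
        self.clock = clock          # injectable so expiry can be tested without sleeping
        self._data = OrderedDict()  # key -> (expires_at, value)

    def get(self, key, default=None):
        item = self._data.get(key)
        if item is None:
            return default
        expires_at, value = item
        if self.clock() >= expires_at:
            del self._data[key]     # expired: drop it lazily on access
            return default
        self._data.move_to_end(key) # mark as recently used
        return value

    def set(self, key, value):
        self._data[key] = (self.clock() + self.ttl, value)
        self._data.move_to_end(key)
        while len(self._data) > self.max_size:
            self._data.popitem(last=False)  # evict the least recently used entry

    def __len__(self):
        return len(self._data)

if __name__ == "__main__":
    cache = TTLCache(max_size=2, ttl_seconds=60)
    cache.set("request:1", {"id": 1})
    cache.set("request:2", {"id": 2})
    cache.set("request:3", {"id": 3})   # evicts request:1
    print(cache.get("request:1"), len(cache))
```

Expired entries are removed only when they are read, so memory stays bounded by max_size rather than by the TTL; that keeps the sketch free of background threads.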

5.2 Case: Java Thread Pool Leak

// Case: a misconfigured thread pool leaks memory

import java.util.concurrent.*;
import java.util.*;

public class ThreadPoolLeak {
    
    // Wrong: unbounded number of threads
    ExecutorService badPool = Executors.newCachedThreadPool();
    
    // Correct: bounded queue + bounded pool
    ExecutorService goodPool = new ThreadPoolExecutor(
        4,                      // corePoolSize
        8,                      // maximumPoolSize
        60L, TimeUnit.SECONDS,  // keepAliveTime
        new LinkedBlockingQueue<>(100),  // bounded queue
        new ThreadPoolExecutor.CallerRunsPolicy()  // rejection policy
    );
    
    // Problem: shutting down without waiting for tasks to finish
    public void badShutdown() {
        ExecutorService pool = Executors.newFixedThreadPool(4);
        
        for (int i = 0; i < 100; i++) {
            final int taskId = i;
            pool.submit(() -> {
                try {
                    Thread.sleep(1000);
                    System.out.println("Task " + taskId + " done");
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });
        }
        
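        pool.shutdown();  // returns immediately without waiting!
        // Queued tasks keep running after this method returns, so the caller may
        // release resources they still need; shutdownNow() would interrupt them instead
    }
    
    // Correct: wait for the tasks to finish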
    public void goodShutdown() {
        ExecutorService pool = Executors.newFixedThreadPool(4);
        
        try {
            for (int i = 0; i < 100; i++) {
                final int taskId = i;
                pool.submit(() -> {
                    // task logic
                });
            }
        } finally {
            pool.shutdown();
            try {
                if (!pool.awaitTermination(60, TimeUnit.SECONDS)) {
                    pool.shutdownNow();
                }
            } catch (InterruptedException e) {
                pool.shutdownNow();
                Thread.currentThread().interrupt();
            }
        }
    }
    
    // Memory leak: ThreadLocal never cleaned up
    static ThreadLocal<List<Object>> threadLocalData = new ThreadLocal<>();
    
    public static void wrong() {
        // Set but never removed
        threadLocalData.set(new ArrayList<>());
        // The Entry in the thread's ThreadLocalMap keeps the value alive as long as the pooled thread lives
    }
    
    public static void correct() {
        try {
            threadLocalData.set(new ArrayList<>());
            // use it
        } finally {
            // Clean up!
            threadLocalData.remove();
        }
    }
}

5.3 Case: Redis Connection Leak

# Case: Redis connection leak

import redis
import time
import threading

# Wrong: a new client (with its own connection pool) per request
def bad_get_user(user_id):
    r = redis.Redis(host='localhost', port=6379)  # created on every call!
    return r.get(f'user:{user_id}')

# Wrong: creating clients inside a loop
def bad_batch_get(user_ids):
    results = []
    for user_id in user_ids:
        r = redis.Redis(host='localhost', port=6379)  # leak!
        results.append(r.get(f'user:{user_id}'))
    return results

# Correct: use a connection pool
class UserCache:
    def __init__(self):
        # Shared, reusable connection pool
        self.pool = redis.ConnectionPool(
            host='localhost',
            port=6379,
            max_connections=50,
            decode_responses=True
        )
    
    def get_user(self, user_id):
        r = redis.Redis(connection_pool=self.pool)
        return r.get(f'user:{user_id}')
    
    def batch_get(self, user_ids):
        r = redis.Redis(connection_pool=self.pool)
        pipe = r.pipeline()
        
        for user_id in user_ids:
            pipe.get(f'user:{user_id}')
        
        return pipe.execute()

# Correct: share a single instance (singleton) across threads
class RedisClient:
    _instance = None
    _lock = threading.Lock()
    
    def __new__(cls):
        if cls._instance is None:
            with cls._lock:
                if cls._instance is None:
                    cls._instance = super().__new__(cls)
                    cls._instance.pool = redis.ConnectionPool(
                        host='localhost',
                        port=6379,
                        max_connections=50
                    )
        return cls._instance
    
    @property
    def client(self):
        return redis.Redis(connection_pool=self.pool)

# Usage
client = RedisClient()
user_data = client.client.get('user:123')
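
Why a pool stops the leak is easier to see in a stripped-down model: connections are created lazily, capped at a maximum, and always handed back. The sketch below uses only the standard library; FakeConnection and BoundedPool are hypothetical stand-ins written for illustration and are not the redis-py implementation.

```python
import queue
from contextlib import contextmanager

class FakeConnection:
    """Stand-in for a real client connection (hypothetical, for illustration)."""
    created = 0

    def __init__(self):
        FakeConnection.created += 1

class BoundedPool:
    """Hand out at most max_connections connections and reuse them."""

    def __init__(self, factory, max_connections=50):
        self._factory = factory
        self._pool = queue.LifoQueue(max_connections)
        for _ in range(max_connections):
            self._pool.put(None)  # None marks a slot whose connection is not created yet

    @contextmanager
    def connection(self, timeout=None):
        # Blocks (or raises queue.Empty after `timeout`) when every slot is in use
        conn = self._pool.get(timeout=timeout)
        try:
            if conn is None:
                conn = self._factory()  # create lazily on first use of the slot
            yield conn
        finally:
            self._pool.put(conn)        # always give the slot back

if __name__ == "__main__":
    pool = BoundedPool(FakeConnection, max_connections=2)
    for _ in range(1000):
        with pool.connection():
            pass  # issue commands here
    print(f"connections created: {FakeConnection.created}")
```

A thousand sequential requests reuse a single connection here, whereas the per-request pattern shown earlier would have opened a thousand.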

6. Monitoring and Alerting

6.1 Prometheus Monitoring

# ===== Prometheus monitoring configuration =====

# node_exporter metrics (host level)
# - node_cpu_seconds_total
# - node_memory_MemAvailable_bytes
# - node_memory_MemTotal_bytes
# Per-process metrics exposed by instrumented applications (Prometheus client libraries)
# - process_cpu_seconds_total
# - process_resident_memory_bytes

# prometheus.yml
scrape_configs:
  - job_name: 'node'
    static_configs:
      - targets: ['localhost:9100']

# Alert rules (a separate file listed under rule_files in prometheus.yml)
groups:
  - name: resource-alerts
    rules:
      - alert: HighCPU
        expr: 100 - (avg by(instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High CPU usage"
          description: "Instance {{ $labels.instance }} CPU usage is {{ $value }}%"
      
      - alert: HighMemory
        expr: (1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100 > 90
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High memory usage"
          description: "Instance {{ $labels.instance }} memory usage is {{ $value }}%"
      
      - alert: MemoryLeakSuspected
        expr: predict_linear(node_memory_MemAvailable_bytes[1h], 4 * 3600) < 0
        for: 30m
        labels:
          severity: info
        annotations:
          summary: "Possible memory leak"
          description: "Available memory on {{ $labels.instance }} keeps shrinking and is projected to run out within 4 hours"
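
The HighCPU expression works on cumulative counters: node_cpu_seconds_total only ever increases, and usage is whatever share of the elapsed CPU time was not idle. The sketch below reproduces that arithmetic on two hand-written counter samples. The mode names follow /proc/stat, and the numbers are made up for illustration.

```python
def cpu_usage_percent(before, after):
    """CPU usage between two samples of cumulative per-mode CPU seconds."""
    total = sum(after.values()) - sum(before.values())
    if total <= 0:
        raise ValueError("samples must be taken at different times")
    idle = after["idle"] - before["idle"]
    return 100 * (1 - idle / total)

if __name__ == "__main__":
    # Hypothetical counter values taken 100 CPU-seconds apart
    before = {"user": 100.0, "system": 50.0, "idle": 850.0}
    after = {"user": 130.0, "system": 60.0, "idle": 910.0}
    print(f"{cpu_usage_percent(before, after):.1f}%")
```

Of the 100 CPU-seconds that elapsed between the samples, 60 were idle, so usage is 40%; the PromQL rate() call performs the same subtraction over its 5-minute window.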

6.2 Grafana Dashboard

// Grafana dashboard JSON (simplified)
{
  "panels": [
    {
      "title": "CPU Usage",
      "type": "graph",
      "targets": [
        {
          "expr": "100 - (avg by(instance) (rate(node_cpu_seconds_total{mode=\"idle\"}[5m])) * 100)",
          "legendFormat": "{{instance}}"
        }
      ]
    },
    {
      "title": "内存使用率",
      "type": "graph",
      "targets": [
        {
          "expr": "(1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100",
          "legendFormat": "{{instance}}"
        }
      ]
    },
    {
      "title": "Top 10 Processes by Memory",
      "type": "table",
      "targets": [
        {
          "expr": "topk(10, process_resident_memory_bytes)",
          "format": "table"
        }
      ]
    }
  ]
}

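7. Summary

7.1 Troubleshooting Workflow

7.2 Command Quick Reference

Command	Purpose
`top`	Real-time CPU/memory monitoring
`htop`	A friendlier top
`vmstat`	Virtual memory statistics
`pidstat`	Per-process CPU/memory statistics
`pmap`	Process memory map
`perf top`	CPU sampling profiler
`valgrind`	Memory leak detection
`py-spy`	Python flame graphs

7.3 Common Problems and Fixes

Problem	Cause	Fix
Single core at 100%	Infinite loop / GIL	Optimize the algorithm / use multiple processes
All cores at 100%	Lock contention	Use finer-grained locks
Memory grows continuously	Leak / unbounded cache	Release references / bound the cache
Memory grows periodically	GC not triggered	Tune GC parameters
Process OOM	Memory leak	Fix the code

7.4 Best Practices

Practice	Description
Monitor first	Use Prometheus/Grafana
Flame graph analysis	Pinpoint hot spots precisely
Bounded caches	Use LRU/TTL
Connection pools	Reuse connections
Periodic GC	Python/Java cleanup
Resource cleanup	finally / with statements

This article is based on common performance problems. Questions are welcome in the comments!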
