首先我的服务器配置是

  • 显卡型号:NVIDIA RTX PRO 6000 Black Edition(专业级高端显卡)
  • 显存规格:总显存 97887 MiB(约 95.6 GB),当前已使用 87788 MiB(约 85.7 GB),剩余约 9.9 GB
  • 驱动 / 算力:NVIDIA 驱动版本 580.119.02,CUDA 版本 13.0(适配高版本深度学习框架)
nvidia-smi
Thu Mar 12 20:23:09 2026       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 580.119.02             Driver Version: 580.119.02     CUDA Version: 13.0     |
+-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA RTX PRO 6000 Blac...    Off |   00000000:01:00.0 Off |                  Off |
| 30%   36C    P8             19W /  600W |   87788MiB /  97887MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI              PID   Type   Process name                        GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|    0   N/A  N/A           57161      C   VLLM::EngineCore                      87728MiB |
|    0   N/A  N/A          140517      G   /usr/lib/xorg/Xorg                       40MiB |
+-----------------------------------------------------------------------------------------+

第一步安装虚拟环境的配置

  1. 原本的python环境是python 1.13.3
  2. 安装miniconda3 用来准备运行环境
  3. 配置系统的镜像源(这里使用的是华为的镜像源,因为是内网环境,对镜像源有要求)
sudo sed -i "s@http://.*security.ubuntu.com@http://mirrors.tools.huawei.com@g" ubuntu.sources
sudo sed -i "s@http://.*archive.ubuntu.com@http://mirrors.tools.huawei.com@g" ubuntu.sources



pip config set global.trusted-host mirrors.tools.huawei.com
pip config set global.index-url http://mirrors.tools.huawei.com/pypi/simple/

在miniconda3的解压包下找到.condarc文件,里面也配一下镜像源

channels:
  - defaults
show_channel_urls: true
default_channels:
   - main
channel_alias: http://conda.rnd.huawei.com/repository/conda-proxy
channel_priority: strict
env_dirs:
   - /opt/miniconda3/envs

环境配置好之后创建一个合适版本的的虚拟环境

conda create --name VLLM python=3.13 -y 创建虚拟环境

第二步安装vllm的前置条件

  • sudo apt update   (执行可以忽略)
  • sudo apt install -y build-essential dkm  (执行可以忽略)
  • sudo update-initramfs -u 
  • 安装nivdia驱动 执行nvidia-smi有内容算成功
  • 下载nivdia-smi对应的cuda驱动 https://developer.nvidia.com/cuda-toolkit-archive 对应自己的系统的版本 比如我的是cuda_13.0.0_580.65.06_linux.run 我的系统版本就是13.0,我的下载到/home/zhike/data/program下
  1. sudo chmod a+x cuda_13.0.0_580.65.06_linux.run 
  2. ssh bash cuda_13.0.0_580.65.06_linux.run
  3. 配置环境变量
    export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/cuda-13.0/lib64
    export PATH=$PATH:/usr/local/cuda-13.0/bin
    export CUDA_HOME=$CUDA_HOME:/usr/local/cuda-13.0

第三步下载pytorch

  • 下载pytorch选择对应的cuda和python版本 https://download.pytorch.org/whl/torch/ 
  • source ~/.bashrc
  • conda activate vLLM 进去虚拟环境后进行安装torch的whl文件
  • pip install torch-2.10.0+cu130-cp313-cp313-manylinux_2_28_x86_64.whl
  • 安装成功后pip install vLLM
  • 安装成功后执行 vllm --version 后显示版本说明安装成功了

下载 Qwen3-Coder-Next 大模型

  • 将Qwen3-Coder-Next 下载想要放置的位置
  • 使用huggingface 或者Modelscope 或者git下载下来(安装的vllm自带了huggingface)
  • 切记要是使用源文件,不要指针文件
  • 然后cd到下载到的文件目录下执行启动程序

第四步启动

  • 这个是可以启动量化后的Qwen3-Coder-Next,但是原生的跑不起来
  • python -m vllm.entrypoints.api_server     --model .     --trust-remote-code     --port 8000     --host 0.0.0.0     --load-format safetensors     --dtype bfloat16     --tokenizer-mode slow     --gpu-memory-utilization 0.9     --max-model-len 8192     > /tmp/vllm.log
  • 这个控制了量化,启动原生的Qwen3-Coder-Next
  • python -m vllm.entrypoints.api_server --model . --trust-remote-code --port 8000 --host 0.0.0.0 --dtype float16 --gpu-memory-utilization 0.85 --max-model-len 2048 --quantization fp8 --enforce-eager --swap-space 10
  • 这个可以启动原生的Qwen3-Coder-Next并且打开chat模式
  • python -m vllm.entrypoints.openai.api_server --model . --trust-remote-code --port 8000 --host 100.102.39.213 --dtype float16 --gpu-memory-utilization 0.85 --max-model-len 2048 --quantization fp8 --enforce-eager --swap-space 10 --served-model-name qwen3 
    

使用fp8量化的配置参数参考①:

nohup python -m vllm.entrypoints.openai.api_server \
  --model /home/zhike/data/ModelScope/Qwen3-Coder-Next \
  --trust-remote-code \
  --port 8000 \
  --host 100.102.39.213 \
  --gpu-memory-utilization 0.92 \
  --quantization fp8 \
  --served-model-name Qwen/Qwen3-Coder-Next \
  --enable-auto-tool-choice \
  --tool-call-parser qwen3_coder \
  --max-model-len 65536 \      # ← 放弃 128K,改用 64K(更现实)
  --enable-prefix-caching \
  --max-num-seqs 8 \          # ← ≥4 并发,实际可达 8
  --max-num-batched-tokens 65536 \
  --block-size 32 \           # ← 提升至 32(Qwen 友好)
  --max-logprobs 20 \
  --disable-log-stats > ~/logs/vllm/vllm_$(date +%Y%m%d_%H%M%S).log 2>&1 &

量化nvfpv问题

转换代码:

import os
import ssl

# === 关键:强制使用官方 Hugging Face Hub ===
os.environ["HF_ENDPOINT"] = "https://huggingface.co"

# 如仍有 SSL 问题(极少见),再额外禁用验证(仅内网环境):
# os.environ["CURL_CA_BUNDLE"] = ""
# os.environ["REQUESTS_CA_BUNDLE"] = ""
# ssl._create_default_https_context = ssl._create_unverified_context

# === 正常导入 ===
from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import QuantizationModifier
from transformers import AutoTokenizer, AutoModelForCausalLM
from datasets import load_dataset

# 1. 配置路径
MODEL_PATH = "/home/zhike/data/ModelScope/Qwen3-Coder-Next"
OUTPUT_PATH = "/home/zhike/data/ModelScope/Qwen3-Coder-Next-NVFP4"

# 2. 加载模型和分词器
model = AutoModelForCausalLM.from_pretrained(MODEL_PATH, device_map="auto", torch_dtype="auto")
tokenizer = AutoTokenizer.from_pretrained(MODEL_PATH)

# 3. 准备校准数据集 → 此处不再报错!
dataset = load_dataset("HuggingFaceH4/ultrachat_200k", split="train_sft")
dataset = dataset.select(range(512))

def preprocess(example):
    return {"text": tokenizer.apply_chat_template(example["messages"], tokenize=False)}
dataset = dataset.map(preprocess)

def tokenize(sample):
    return tokenizer(sample["text"], padding=False, max_length=2048, truncation=True, add_special_tokens=False)
dataset = dataset.map(tokenize, remove_columns=dataset.column_names)

# 4. 配置 NVFP4 量化方案
recipe = QuantizationModifier(targets="Linear", scheme="NVFP4", ignore=["lm_head"])

# 5. 执行量化
oneshot(model=model, dataset=dataset, recipe=recipe, max_seq_length=2048)

# 6. 保存模型
model.save_pretrained(OUTPUT_PATH, save_compressed=True)
tokenizer.save_pretrained(OUTPUT_PATH)
print(f"✅ NVFP4 模型已保存至: {OUTPUT_PATH}")

报错:

(EngineCore pid=2958568) /bin/sh: 1: :/usr/local/cuda-13.0/bin/nvcc: not found
(EngineCore pid=2958568) ninja: build stopped: subcommand failed.
(EngineCore pid=2958568) 
[rank0]:[W423 11:03:59.936941636 ProcessGroupNCCL.cpp:1553] Warning: WARNING: destroy_process_group() was not called before program exit, which can leak resources. For more info, please see https://pytorch.org/docs/stable/distributed.html#shutdown (function operator())
(APIServer pid=2958399) Traceback (most recent call last):
(APIServer pid=2958399)   File "<frozen runpy>", line 198, in _run_module_as_main
(APIServer pid=2958399)   File "<frozen runpy>", line 88, in _run_code
(APIServer pid=2958399)   File "/home/zhike/.conda/envs/AI_env/lib/python3.13/site-packages/vllm/entrypoints/openai/api_server.py", line 710, in <module>
(APIServer pid=2958399)     uvloop.run(run_server(args))
(APIServer pid=2958399)     ~~~~~~~~~~^^^^^^^^^^^^^^^^^^
(APIServer pid=2958399)   File "/home/zhike/.conda/envs/AI_env/lib/python3.13/site-packages/uvloop/__init__.py", line 96, in run
(APIServer pid=2958399)     return __asyncio.run(
(APIServer pid=2958399)            ~~~~~~~~~~~~~^
(APIServer pid=2958399)         wrapper(),
(APIServer pid=2958399)         ^^^^^^^^^^
(APIServer pid=2958399)     ...<2 lines>...
(APIServer pid=2958399)         **run_kwargs
(APIServer pid=2958399)         ^^^^^^^^^^^^
(APIServer pid=2958399)     )
(APIServer pid=2958399)     ^
(APIServer pid=2958399)   File "/home/zhike/.conda/envs/AI_env/lib/python3.13/asyncio/runners.py", line 195, in run
(APIServer pid=2958399)     return runner.run(main)
(APIServer pid=2958399)            ~~~~~~~~~~^^^^^^
(APIServer pid=2958399)   File "/home/zhike/.conda/envs/AI_env/lib/python3.13/asyncio/runners.py", line 118, in run
(APIServer pid=2958399)     return self._loop.run_until_complete(task)
(APIServer pid=2958399)            ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^^^^^^
(APIServer pid=2958399)   File "uvloop/loop.pyx", line 1518, in uvloop.loop.Loop.run_until_complete
(APIServer pid=2958399)   File "/home/zhike/.conda/envs/AI_env/lib/python3.13/site-packages/uvloop/__init__.py", line 48, in wrapper
(APIServer pid=2958399)     return await main
(APIServer pid=2958399)            ^^^^^^^^^^
(APIServer pid=2958399)   File "/home/zhike/.conda/envs/AI_env/lib/python3.13/site-packages/vllm/entrypoints/openai/api_server.py", line 670, in run_server
(APIServer pid=2958399)     await run_server_worker(listen_address, sock, args, **uvicorn_kwargs)
(APIServer pid=2958399)   File "/home/zhike/.conda/envs/AI_env/lib/python3.13/site-packages/vllm/entrypoints/openai/api_server.py", line 684, in run_server_worker
(APIServer pid=2958399)     async with build_async_engine_client(
(APIServer pid=2958399)                ~~~~~~~~~~~~~~~~~~~~~~~~~^
(APIServer pid=2958399)         args,
(APIServer pid=2958399)         ^^^^^
(APIServer pid=2958399)         client_config=client_config,
(APIServer pid=2958399)         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=2958399)     ) as engine_client:
(APIServer pid=2958399)     ^
(APIServer pid=2958399)   File "/home/zhike/.conda/envs/AI_env/lib/python3.13/contextlib.py", line 214, in __aenter__
(APIServer pid=2958399)     return await anext(self.gen)
(APIServer pid=2958399)            ^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=2958399)   File "/home/zhike/.conda/envs/AI_env/lib/python3.13/site-packages/vllm/entrypoints/openai/api_server.py", line 100, in build_async_engine_client
(APIServer pid=2958399)     async with build_async_engine_client_from_engine_args(
(APIServer pid=2958399)                ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^
(APIServer pid=2958399)         engine_args,
(APIServer pid=2958399)         ^^^^^^^^^^^^
(APIServer pid=2958399)         usage_context=usage_context,
(APIServer pid=2958399)         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=2958399)         client_config=client_config,
(APIServer pid=2958399)         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=2958399)     ) as engine:
(APIServer pid=2958399)     ^
(APIServer pid=2958399)   File "/home/zhike/.conda/envs/AI_env/lib/python3.13/contextlib.py", line 214, in __aenter__
(APIServer pid=2958399)     return await anext(self.gen)
(APIServer pid=2958399)            ^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=2958399)   File "/home/zhike/.conda/envs/AI_env/lib/python3.13/site-packages/vllm/entrypoints/openai/api_server.py", line 136, in build_async_engine_client_from_engine_args
(APIServer pid=2958399)     async_llm = AsyncLLM.from_vllm_config(
(APIServer pid=2958399)         vllm_config=vllm_config,
(APIServer pid=2958399)     ...<6 lines>...
(APIServer pid=2958399)         client_index=client_index,
(APIServer pid=2958399)     )
(APIServer pid=2958399)   File "/home/zhike/.conda/envs/AI_env/lib/python3.13/site-packages/vllm/v1/engine/async_llm.py", line 225, in from_vllm_config
(APIServer pid=2958399)     return cls(
(APIServer pid=2958399)         vllm_config=vllm_config,
(APIServer pid=2958399)     ...<9 lines>...
(APIServer pid=2958399)         client_index=client_index,
(APIServer pid=2958399)     )
(APIServer pid=2958399)   File "/home/zhike/.conda/envs/AI_env/lib/python3.13/site-packages/vllm/v1/engine/async_llm.py", line 154, in __init__
(APIServer pid=2958399)     self.engine_core = EngineCoreClient.make_async_mp_client(
(APIServer pid=2958399)                        ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^
(APIServer pid=2958399)         vllm_config=vllm_config,
(APIServer pid=2958399)         ^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=2958399)     ...<4 lines>...
(APIServer pid=2958399)         client_index=client_index,
(APIServer pid=2958399)         ^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=2958399)     )
(APIServer pid=2958399)     ^
(APIServer pid=2958399)   File "/home/zhike/.conda/envs/AI_env/lib/python3.13/site-packages/vllm/tracing/otel.py", line 178, in sync_wrapper
(APIServer pid=2958399)     return func(*args, **kwargs)
(APIServer pid=2958399)   File "/home/zhike/.conda/envs/AI_env/lib/python3.13/site-packages/vllm/v1/engine/core_client.py", line 130, in make_async_mp_client
(APIServer pid=2958399)     return AsyncMPClient(*client_args)
(APIServer pid=2958399)   File "/home/zhike/.conda/envs/AI_env/lib/python3.13/site-packages/vllm/tracing/otel.py", line 178, in sync_wrapper
(APIServer pid=2958399)     return func(*args, **kwargs)
(APIServer pid=2958399)   File "/home/zhike/.conda/envs/AI_env/lib/python3.13/site-packages/vllm/v1/engine/core_client.py", line 887, in __init__
(APIServer pid=2958399)     super().__init__(
(APIServer pid=2958399)     ~~~~~~~~~~~~~~~~^
(APIServer pid=2958399)         asyncio_mode=True,
(APIServer pid=2958399)         ^^^^^^^^^^^^^^^^^^
(APIServer pid=2958399)     ...<3 lines>...
(APIServer pid=2958399)         client_addresses=client_addresses,
(APIServer pid=2958399)         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=2958399)     )
(APIServer pid=2958399)     ^
(APIServer pid=2958399)   File "/home/zhike/.conda/envs/AI_env/lib/python3.13/site-packages/vllm/v1/engine/core_client.py", line 535, in __init__
(APIServer pid=2958399)     with launch_core_engines(
(APIServer pid=2958399)          ~~~~~~~~~~~~~~~~~~~^
(APIServer pid=2958399)         vllm_config, executor_class, log_stats, addresses
(APIServer pid=2958399)         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=2958399)     ) as (engine_manager, coordinator, addresses, tensor_queue):
(APIServer pid=2958399)     ^
(APIServer pid=2958399)   File "/home/zhike/.conda/envs/AI_env/lib/python3.13/contextlib.py", line 148, in __exit__
(APIServer pid=2958399)     next(self.gen)
(APIServer pid=2958399)     ~~~~^^^^^^^^^^
(APIServer pid=2958399)   File "/home/zhike/.conda/envs/AI_env/lib/python3.13/site-packages/vllm/v1/engine/utils.py", line 998, in launch_core_engines
(APIServer pid=2958399)     wait_for_engine_startup(
(APIServer pid=2958399)     ~~~~~~~~~~~~~~~~~~~~~~~^
(APIServer pid=2958399)         handshake_socket,
(APIServer pid=2958399)         ^^^^^^^^^^^^^^^^^
(APIServer pid=2958399)     ...<6 lines>...
(APIServer pid=2958399)         coordinator.proc if coordinator else None,
(APIServer pid=2958399)         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=2958399)     )
(APIServer pid=2958399)     ^
(APIServer pid=2958399)   File "/home/zhike/.conda/envs/AI_env/lib/python3.13/site-packages/vllm/v1/engine/utils.py", line 1057, in wait_for_engine_startup
(APIServer pid=2958399)     raise RuntimeError(
(APIServer pid=2958399)     ...<3 lines>...
(APIServer pid=2958399)     )
(APIServer pid=2958399) RuntimeError: Engine core initialization failed. See root cause above. Failed core proc(s): {}

执行: python -m vllm.entrypoints.openai.api_server   --model /home/zhike/data/ModelScope/Qwen3-Coder-Next-NVFP4   --trust-remote-code   --port 8000   --host 100.102.39.213   --gpu-memory-utilization 0.90   --served-model-name Qwen/Qwen3-Coder-Next   --enable-auto-tool-choice   --tool-call-parser qwen3_coder   --max-model-len 32768   --enable-prefix-caching   --max-num-seqs 16   --max-num-batched-tokens 131072   --block-size 64   --max-logprobs 20   --disable-log-stats   --enforce-eage

本地查询:

(base) zhike@zhike:~/data$ which nvcc
/usr/local/cuda-13.0/bin/nvcc
(base) zhike@zhike:~/data$ nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2025 NVIDIA Corporation
Built on Wed_Jul_16_07:30:01_PM_PDT_2025
Cuda compilation tools, release 13.0, V13.0.48
Build cuda_13.0.r13.0/compiler.36260728_0
(base) zhike@zhike:~/data$ nvidia-smi
Thu Apr 23 11:47:29 2026       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 580.119.02             Driver Version: 580.119.02     CUDA Version: 13.0     |
+-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA RTX PRO 6000 Blac...    Off |   00000000:01:00.0 Off |                  Off |
| 30%   35C    P8             19W /  600W |   91455MiB /  97887MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI              PID   Type   Process name                        GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|    0   N/A  N/A            3926      G   /usr/lib/xorg/Xorg                       75MiB |
|    0   N/A  N/A         3014542      C   VLLM::EngineCore                      91352MiB |
+-----------------------------------------------------------------------------------------+
(base) zhike@zhike:~/data$ echo $PATH
/usr/local/cuda-13.0/bin:/home/zhike/.local/bin:/opt/miniconda3/bin:/opt/miniconda3/condabin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/local/games:/snap/bin:/usr/local/cuda-13.0/bin

(base) zhike@zhike:~/data$ echo 'export CUDA_HOME=/usr/local/cuda-13.0' 
export CUDA_HOME=/usr/local/cuda-13.0

解决:
出现找不到nvcc的情况

unset CUDA_HOME NVCC CUDA_PATH

# 2. 显式指定正确路径(注意不要带前导/后导冒号)
export CUDA_HOME=/usr/local/cuda-13.0
export NVCC=$CUDA_HOME/bin/nvcc
export PATH=$CUDA_HOME/bin:$PATH
export LD_LIBRARY_PATH=$CUDA_HOME/lib64:$LD_LIBRARY_PATH

# 3. 验证 nvcc 是否可正常调用
$NVCC --version

rm -rf ~/.cache/flashinfer
# 若使用 VLLM 默认缓存目录,也可清理:
rm -rf ~/.cache/vllm

执行:

python -m vllm.entrypoints.openai.api_server --model /home/zhike/data/ModelScope/coder-next-nvfp4 --trust-remote-code --port 8000 --host 0.0.0.0 --gpu-memory-utilization 0.85 --served-model-name Qwen3-Coder-Next-nvfp4 --enable-auto-tool-choice --tool-call-parser qwen3_coder --max-model-len 65536 --max-num-batched-tokens 128288 --max-num-seqs 8 --block-size 64 --enable-prefix-caching --max-logprobs 20 --disable-log-stats --dtype auto --enforce-eager

参数

值示例

类型

作用与说明

⚠️ 注意事项/优化建议

--model

/home/zhike/data/ModelScope/coder-next-nvfp4

路径

指定本地模型目录路径(已量化为 NVFP4 格式)

✅ 确保路径存在 config.jsontokenizer.jsonmodel.safetensors 等完整文件;路径中不得含空格或中文

--trust-remote-code

布尔开关

允许加载模型/分词器中的自定义 Python 代码(如 Qwen3-Coder 的 CodeLlama 模块)

🔒 仅用于可信来源模型;生产环境建议配合代码审计

--port

8000

整数

HTTP 服务监听端口

✅ 需防火墙放行;避免与其它服务冲突(如 8001 用于健康检查)

--host

0.0.0.0

字符串

绑定的网络接口地址

0.0.0.0 允许内网访问; 公网暴露需搭配 Nginx/SSL 与认证

--gpu-memory-utilization

0.85

浮点数(0.0~1.0)

预留给 vLLM 的 GPU 显存比例(非全部显存!)

⚠️ NVFP4 长上下文推荐 0.7~0.750.85 在 64K 上下文+高并发下仍可能 OOM;预留空间需 ≥ 10~15%

--served-model-name

Qwen3-Coder-Next-nvfp4

字符串

API 中展示的模型名称(即 /v1/models 返回的 id

✅ 客户端调用时 model="Qwen3-Coder-Next-nvfp4"不影响实际加载路径

--enable-auto-tool-choice

布尔开关

启用工具调用自动识别(如函数/代码执行)

✅ 需配合 --tool-call-parser 使用

--tool-call-parser

qwen3_coder

字符串

指定工具调用输出解析器(对应模型类型)

🔧 仅支持 qwen3_coder / qwen2 / mistral 等;错误选择会导致工具调用失败

--max-model-len

65536

整数(token)

上下文最大长度(KV Cache 预分配上限)

⚠️ 核心显存瓶颈! vLLM 会按此值 8seq × 256K 预分配 KV Cache
• 64K ≈ 10 ~12GB(NVFP4)<br>• 256K 可达 35~40GB
✅ 日常开发建议 32768~65536

--max-num-batched-tokens

128288

整数

单批次最大 token 总数batch_size × seq_len

⚠️ 必须 ≥ --max-model-len(否则无法处理单条长请求)
128288 仅支持 2 条 64K 请求
• 若需更高吞吐 → 提高至 262144(支持 4 条 64K)

--max-num-seqs

8

整数

单批次最大并发序列数

⚠️ 与 --max-num-batched-tokens 共同决定并发能力:
max_num_seq ≤ min(8, floor(128288 / 65536)) = 1 → 实际仅支持 1 条请求
✅ 优化:若用 --max-num-batched-tokens 262144,则 --max-num-seqs=4 更合理

--block-size

64

整数(token)

PagedAttention 分块大小(页大小)

✅ 默认推荐值(32/64/128)
64 是 NVFP4/长上下文友好选择(减少内部碎片)

--enable-prefix-caching

布尔开关

启用 前缀缓存(相同 prompt 首部分复用计算)

✅ 大幅提升多轮/同构请求效率(如 IDE 插件)
⚠️ 首次启动会生成缓存哈希表(内存略增)

--max-logprobs

20

整数

返回的最大 logprobs 令牌数(用于 token 概率分析)

✅ 推荐 5~20;设为 0 可禁用以节省微小开销

--disable-log-stats

布尔开关

关闭 vLLM 后台性能日志(每秒打印 token/s、延迟等)

✅ 生产环境建议开启(减少 I/O);调试时关闭( --no-disable-log-stats

--dtype

auto

字符串

模型权重精度

auto 会自动识别 NVFP4 → 使用 torch.float16torch.bfloat16
⚠️ 不建议手动设为 fp16/ bf16(NVFP4 本身已量化)

--enforce-eager

布尔开关

强制禁用 CUDA Graph(始终用 PyTorch eager 模式)

⚠️ 不推荐生产环境!
• 启用后:峰值显存 ↑ 15% ~30%,推理速度 ↓ 10%~25%
• 仅用于调试/兼容性问题(如某些 GPU 架构报错)

Logo

openEuler 是由开放原子开源基金会孵化的全场景开源操作系统项目,面向数字基础设施四大核心场景(服务器、云计算、边缘计算、嵌入式),全面支持 ARM、x86、RISC-V、loongArch、PowerPC、SW-64 等多样性计算架构

更多推荐