本地服务器部署vllm+Qwen3-Coder-Next的模型

在miniconda3的解压包下找到.condarc文件，里面也配一下镜像源。环境配置好之后创建一个合适版本的的虚拟环境。出现找不到nvcc的情况。

我叫Double

591人浏览 · 2026-05-13 23:52:59

我叫Double · 2026-05-13 23:52:59 发布

首先我的服务器配置是

显卡型号：NVIDIA RTX PRO 6000 Black Edition（专业级高端显卡）
显存规格：总显存 97887 MiB（约 95.6 GB），当前已使用 87788 MiB（约 85.7 GB），剩余约 9.9 GB
驱动 / 算力：NVIDIA 驱动版本 580.119.02，CUDA 版本 13.0（适配高版本深度学习框架）

nvidia-smi
Thu Mar 12 20:23:09 2026       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 580.119.02             Driver Version: 580.119.02     CUDA Version: 13.0     |
+-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA RTX PRO 6000 Blac...    Off |   00000000:01:00.0 Off |                  Off |
| 30%   36C    P8             19W /  600W |   87788MiB /  97887MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI              PID   Type   Process name                        GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|    0   N/A  N/A           57161      C   VLLM::EngineCore                      87728MiB |
|    0   N/A  N/A          140517      G   /usr/lib/xorg/Xorg                       40MiB |
+-----------------------------------------------------------------------------------------+

第一步安装虚拟环境的配置

原本的python环境是python 1.13.3
安装miniconda3 用来准备运行环境
配置系统的镜像源（这里使用的是华为的镜像源，因为是内网环境，对镜像源有要求）

sudo sed -i "s@http://.*security.ubuntu.com@http://mirrors.tools.huawei.com@g" ubuntu.sources
sudo sed -i "s@http://.*archive.ubuntu.com@http://mirrors.tools.huawei.com@g" ubuntu.sources



pip config set global.trusted-host mirrors.tools.huawei.com
pip config set global.index-url http://mirrors.tools.huawei.com/pypi/simple/

在miniconda3的解压包下找到.condarc文件，里面也配一下镜像源

channels:
  - defaults
show_channel_urls: true
default_channels:
   - main
channel_alias: http://conda.rnd.huawei.com/repository/conda-proxy
channel_priority: strict
env_dirs:
   - /opt/miniconda3/envs

环境配置好之后创建一个合适版本的的虚拟环境

conda create --name VLLM python=3.13 -y 创建虚拟环境

第二步安装vllm的前置条件

sudo apt update （执行可以忽略）
sudo apt install -y build-essential dkm （执行可以忽略）
sudo update-initramfs -u
安装nivdia驱动执行nvidia-smi有内容算成功
下载nivdia-smi对应的cuda驱动 https://developer.nvidia.com/cuda-toolkit-archive 对应自己的系统的版本比如我的是cuda_13.0.0_580.65.06_linux.run 我的系统版本就是13.0，我的下载到/home/zhike/data/program下

sudo chmod a+x cuda_13.0.0_580.65.06_linux.run

ssh bash cuda_13.0.0_580.65.06_linux.run
配置环境变量
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/cuda-13.0/lib64
export PATH=$PATH:/usr/local/cuda-13.0/bin
export CUDA_HOME=$CUDA_HOME:/usr/local/cuda-13.0

第三步下载pytorch

下载pytorch选择对应的cuda和python版本 https://download.pytorch.org/whl/torch/
source ~/.bashrc
conda activate vLLM 进去虚拟环境后进行安装torch的whl文件
pip install torch-2.10.0+cu130-cp313-cp313-manylinux_2_28_x86_64.whl
安装成功后pip install vLLM
安装成功后执行 vllm --version 后显示版本说明安装成功了

下载 Qwen3-Coder-Next 大模型

将Qwen3-Coder-Next 下载想要放置的位置
使用huggingface 或者Modelscope 或者git下载下来（安装的vllm自带了huggingface）
切记要是使用源文件，不要指针文件
然后cd到下载到的文件目录下执行启动程序

第四步启动

这个是可以启动量化后的Qwen3-Coder-Next，但是原生的跑不起来

python -m vllm.entrypoints.api_server     --model .     --trust-remote-code     --port 8000     --host 0.0.0.0     --load-format safetensors     --dtype bfloat16     --tokenizer-mode slow     --gpu-memory-utilization 0.9     --max-model-len 8192     > /tmp/vllm.log

这个控制了量化，启动原生的Qwen3-Coder-Next

python -m vllm.entrypoints.api_server --model . --trust-remote-code --port 8000 --host 0.0.0.0 --dtype float16 --gpu-memory-utilization 0.85 --max-model-len 2048 --quantization fp8 --enforce-eager --swap-space 10

这个可以启动原生的Qwen3-Coder-Next并且打开chat模式

python -m vllm.entrypoints.openai.api_server --model . --trust-remote-code --port 8000 --host 100.102.39.213 --dtype float16 --gpu-memory-utilization 0.85 --max-model-len 2048 --quantization fp8 --enforce-eager --swap-space 10 --served-model-name qwen3

使用fp8量化的配置参数参考①：

nohup python -m vllm.entrypoints.openai.api_server \
  --model /home/zhike/data/ModelScope/Qwen3-Coder-Next \
  --trust-remote-code \
  --port 8000 \
  --host 100.102.39.213 \
  --gpu-memory-utilization 0.92 \
  --quantization fp8 \
  --served-model-name Qwen/Qwen3-Coder-Next \
  --enable-auto-tool-choice \
  --tool-call-parser qwen3_coder \
  --max-model-len 65536 \      # ← 放弃 128K，改用 64K（更现实）
  --enable-prefix-caching \
  --max-num-seqs 8 \          # ← ≥4 并发，实际可达 8
  --max-num-batched-tokens 65536 \
  --block-size 32 \           # ← 提升至 32（Qwen 友好）
  --max-logprobs 20 \
  --disable-log-stats > ~/logs/vllm/vllm_$(date +%Y%m%d_%H%M%S).log 2>&1 &

量化nvfpv问题

转换代码：

import os
import ssl

# === 关键：强制使用官方 Hugging Face Hub ===
os.environ["HF_ENDPOINT"] = "https://huggingface.co"

# 如仍有 SSL 问题（极少见），再额外禁用验证（仅内网环境）：
# os.environ["CURL_CA_BUNDLE"] = ""
# os.environ["REQUESTS_CA_BUNDLE"] = ""
# ssl._create_default_https_context = ssl._create_unverified_context

# === 正常导入 ===
from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import QuantizationModifier
from transformers import AutoTokenizer, AutoModelForCausalLM
from datasets import load_dataset

# 1. 配置路径
MODEL_PATH = "/home/zhike/data/ModelScope/Qwen3-Coder-Next"
OUTPUT_PATH = "/home/zhike/data/ModelScope/Qwen3-Coder-Next-NVFP4"

# 2. 加载模型和分词器
model = AutoModelForCausalLM.from_pretrained(MODEL_PATH, device_map="auto", torch_dtype="auto")
tokenizer = AutoTokenizer.from_pretrained(MODEL_PATH)

# 3. 准备校准数据集 → 此处不再报错！
dataset = load_dataset("HuggingFaceH4/ultrachat_200k", split="train_sft")
dataset = dataset.select(range(512))

def preprocess(example):
    return {"text": tokenizer.apply_chat_template(example["messages"], tokenize=False)}
dataset = dataset.map(preprocess)

def tokenize(sample):
    return tokenizer(sample["text"], padding=False, max_length=2048, truncation=True, add_special_tokens=False)
dataset = dataset.map(tokenize, remove_columns=dataset.column_names)

# 4. 配置 NVFP4 量化方案
recipe = QuantizationModifier(targets="Linear", scheme="NVFP4", ignore=["lm_head"])

# 5. 执行量化
oneshot(model=model, dataset=dataset, recipe=recipe, max_seq_length=2048)

# 6. 保存模型
model.save_pretrained(OUTPUT_PATH, save_compressed=True)
tokenizer.save_pretrained(OUTPUT_PATH)
print(f"✅ NVFP4 模型已保存至: {OUTPUT_PATH}")

报错：

(EngineCore pid=2958568) /bin/sh: 1: :/usr/local/cuda-13.0/bin/nvcc: not found
(EngineCore pid=2958568) ninja: build stopped: subcommand failed.
(EngineCore pid=2958568) 
[rank0]:[W423 11:03:59.936941636 ProcessGroupNCCL.cpp:1553] Warning: WARNING: destroy_process_group() was not called before program exit, which can leak resources. For more info, please see https://pytorch.org/docs/stable/distributed.html#shutdown (function operator())
(APIServer pid=2958399) Traceback (most recent call last):
(APIServer pid=2958399)   File "<frozen runpy>", line 198, in _run_module_as_main
(APIServer pid=2958399)   File "<frozen runpy>", line 88, in _run_code
(APIServer pid=2958399)   File "/home/zhike/.conda/envs/AI_env/lib/python3.13/site-packages/vllm/entrypoints/openai/api_server.py", line 710, in <module>
(APIServer pid=2958399)     uvloop.run(run_server(args))
(APIServer pid=2958399)     ~~~~~~~~~~^^^^^^^^^^^^^^^^^^
(APIServer pid=2958399)   File "/home/zhike/.conda/envs/AI_env/lib/python3.13/site-packages/uvloop/__init__.py", line 96, in run
(APIServer pid=2958399)     return __asyncio.run(
(APIServer pid=2958399)            ~~~~~~~~~~~~~^
(APIServer pid=2958399)         wrapper(),
(APIServer pid=2958399)         ^^^^^^^^^^
(APIServer pid=2958399)     ...<2 lines>...
(APIServer pid=2958399)         **run_kwargs
(APIServer pid=2958399)         ^^^^^^^^^^^^
(APIServer pid=2958399)     )
(APIServer pid=2958399)     ^
(APIServer pid=2958399)   File "/home/zhike/.conda/envs/AI_env/lib/python3.13/asyncio/runners.py", line 195, in run
(APIServer pid=2958399)     return runner.run(main)
(APIServer pid=2958399)            ~~~~~~~~~~^^^^^^
(APIServer pid=2958399)   File "/home/zhike/.conda/envs/AI_env/lib/python3.13/asyncio/runners.py", line 118, in run
(APIServer pid=2958399)     return self._loop.run_until_complete(task)
(APIServer pid=2958399)            ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^^^^^^
(APIServer pid=2958399)   File "uvloop/loop.pyx", line 1518, in uvloop.loop.Loop.run_until_complete
(APIServer pid=2958399)   File "/home/zhike/.conda/envs/AI_env/lib/python3.13/site-packages/uvloop/__init__.py", line 48, in wrapper
(APIServer pid=2958399)     return await main
(APIServer pid=2958399)            ^^^^^^^^^^
(APIServer pid=2958399)   File "/home/zhike/.conda/envs/AI_env/lib/python3.13/site-packages/vllm/entrypoints/openai/api_server.py", line 670, in run_server
(APIServer pid=2958399)     await run_server_worker(listen_address, sock, args, **uvicorn_kwargs)
(APIServer pid=2958399)   File "/home/zhike/.conda/envs/AI_env/lib/python3.13/site-packages/vllm/entrypoints/openai/api_server.py", line 684, in run_server_worker
(APIServer pid=2958399)     async with build_async_engine_client(
(APIServer pid=2958399)                ~~~~~~~~~~~~~~~~~~~~~~~~~^
(APIServer pid=2958399)         args,
(APIServer pid=2958399)         ^^^^^
(APIServer pid=2958399)         client_config=client_config,
(APIServer pid=2958399)         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=2958399)     ) as engine_client:
(APIServer pid=2958399)     ^
(APIServer pid=2958399)   File "/home/zhike/.conda/envs/AI_env/lib/python3.13/contextlib.py", line 214, in __aenter__
(APIServer pid=2958399)     return await anext(self.gen)
(APIServer pid=2958399)            ^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=2958399)   File "/home/zhike/.conda/envs/AI_env/lib/python3.13/site-packages/vllm/entrypoints/openai/api_server.py", line 100, in build_async_engine_client
(APIServer pid=2958399)     async with build_async_engine_client_from_engine_args(
(APIServer pid=2958399)                ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^
(APIServer pid=2958399)         engine_args,
(APIServer pid=2958399)         ^^^^^^^^^^^^
(APIServer pid=2958399)         usage_context=usage_context,
(APIServer pid=2958399)         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=2958399)         client_config=client_config,
(APIServer pid=2958399)         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=2958399)     ) as engine:
(APIServer pid=2958399)     ^
(APIServer pid=2958399)   File "/home/zhike/.conda/envs/AI_env/lib/python3.13/contextlib.py", line 214, in __aenter__
(APIServer pid=2958399)     return await anext(self.gen)
(APIServer pid=2958399)            ^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=2958399)   File "/home/zhike/.conda/envs/AI_env/lib/python3.13/site-packages/vllm/entrypoints/openai/api_server.py", line 136, in build_async_engine_client_from_engine_args
(APIServer pid=2958399)     async_llm = AsyncLLM.from_vllm_config(
(APIServer pid=2958399)         vllm_config=vllm_config,
(APIServer pid=2958399)     ...<6 lines>...
(APIServer pid=2958399)         client_index=client_index,
(APIServer pid=2958399)     )
(APIServer pid=2958399)   File "/home/zhike/.conda/envs/AI_env/lib/python3.13/site-packages/vllm/v1/engine/async_llm.py", line 225, in from_vllm_config
(APIServer pid=2958399)     return cls(
(APIServer pid=2958399)         vllm_config=vllm_config,
(APIServer pid=2958399)     ...<9 lines>...
(APIServer pid=2958399)         client_index=client_index,
(APIServer pid=2958399)     )
(APIServer pid=2958399)   File "/home/zhike/.conda/envs/AI_env/lib/python3.13/site-packages/vllm/v1/engine/async_llm.py", line 154, in __init__
(APIServer pid=2958399)     self.engine_core = EngineCoreClient.make_async_mp_client(
(APIServer pid=2958399)                        ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^
(APIServer pid=2958399)         vllm_config=vllm_config,
(APIServer pid=2958399)         ^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=2958399)     ...<4 lines>...
(APIServer pid=2958399)         client_index=client_index,
(APIServer pid=2958399)         ^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=2958399)     )
(APIServer pid=2958399)     ^
(APIServer pid=2958399)   File "/home/zhike/.conda/envs/AI_env/lib/python3.13/site-packages/vllm/tracing/otel.py", line 178, in sync_wrapper
(APIServer pid=2958399)     return func(*args, **kwargs)
(APIServer pid=2958399)   File "/home/zhike/.conda/envs/AI_env/lib/python3.13/site-packages/vllm/v1/engine/core_client.py", line 130, in make_async_mp_client
(APIServer pid=2958399)     return AsyncMPClient(*client_args)
(APIServer pid=2958399)   File "/home/zhike/.conda/envs/AI_env/lib/python3.13/site-packages/vllm/tracing/otel.py", line 178, in sync_wrapper
(APIServer pid=2958399)     return func(*args, **kwargs)
(APIServer pid=2958399)   File "/home/zhike/.conda/envs/AI_env/lib/python3.13/site-packages/vllm/v1/engine/core_client.py", line 887, in __init__
(APIServer pid=2958399)     super().__init__(
(APIServer pid=2958399)     ~~~~~~~~~~~~~~~~^
(APIServer pid=2958399)         asyncio_mode=True,
(APIServer pid=2958399)         ^^^^^^^^^^^^^^^^^^
(APIServer pid=2958399)     ...<3 lines>...
(APIServer pid=2958399)         client_addresses=client_addresses,
(APIServer pid=2958399)         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=2958399)     )
(APIServer pid=2958399)     ^
(APIServer pid=2958399)   File "/home/zhike/.conda/envs/AI_env/lib/python3.13/site-packages/vllm/v1/engine/core_client.py", line 535, in __init__
(APIServer pid=2958399)     with launch_core_engines(
(APIServer pid=2958399)          ~~~~~~~~~~~~~~~~~~~^
(APIServer pid=2958399)         vllm_config, executor_class, log_stats, addresses
(APIServer pid=2958399)         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=2958399)     ) as (engine_manager, coordinator, addresses, tensor_queue):
(APIServer pid=2958399)     ^
(APIServer pid=2958399)   File "/home/zhike/.conda/envs/AI_env/lib/python3.13/contextlib.py", line 148, in __exit__
(APIServer pid=2958399)     next(self.gen)
(APIServer pid=2958399)     ~~~~^^^^^^^^^^
(APIServer pid=2958399)   File "/home/zhike/.conda/envs/AI_env/lib/python3.13/site-packages/vllm/v1/engine/utils.py", line 998, in launch_core_engines
(APIServer pid=2958399)     wait_for_engine_startup(
(APIServer pid=2958399)     ~~~~~~~~~~~~~~~~~~~~~~~^
(APIServer pid=2958399)         handshake_socket,
(APIServer pid=2958399)         ^^^^^^^^^^^^^^^^^
(APIServer pid=2958399)     ...<6 lines>...
(APIServer pid=2958399)         coordinator.proc if coordinator else None,
(APIServer pid=2958399)         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=2958399)     )
(APIServer pid=2958399)     ^
(APIServer pid=2958399)   File "/home/zhike/.conda/envs/AI_env/lib/python3.13/site-packages/vllm/v1/engine/utils.py", line 1057, in wait_for_engine_startup
(APIServer pid=2958399)     raise RuntimeError(
(APIServer pid=2958399)     ...<3 lines>...
(APIServer pid=2958399)     )
(APIServer pid=2958399) RuntimeError: Engine core initialization failed. See root cause above. Failed core proc(s): {}

执行： python -m vllm.entrypoints.openai.api_server --model /home/zhike/data/ModelScope/Qwen3-Coder-Next-NVFP4 --trust-remote-code --port 8000 --host 100.102.39.213 --gpu-memory-utilization 0.90 --served-model-name Qwen/Qwen3-Coder-Next --enable-auto-tool-choice --tool-call-parser qwen3_coder --max-model-len 32768 --enable-prefix-caching --max-num-seqs 16 --max-num-batched-tokens 131072 --block-size 64 --max-logprobs 20 --disable-log-stats --enforce-eage

本地查询：

(base) zhike@zhike:~/data$ which nvcc
/usr/local/cuda-13.0/bin/nvcc
(base) zhike@zhike:~/data$ nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2025 NVIDIA Corporation
Built on Wed_Jul_16_07:30:01_PM_PDT_2025
Cuda compilation tools, release 13.0, V13.0.48
Build cuda_13.0.r13.0/compiler.36260728_0
(base) zhike@zhike:~/data$ nvidia-smi
Thu Apr 23 11:47:29 2026       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 580.119.02             Driver Version: 580.119.02     CUDA Version: 13.0     |
+-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA RTX PRO 6000 Blac...    Off |   00000000:01:00.0 Off |                  Off |
| 30%   35C    P8             19W /  600W |   91455MiB /  97887MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI              PID   Type   Process name                        GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|    0   N/A  N/A            3926      G   /usr/lib/xorg/Xorg                       75MiB |
|    0   N/A  N/A         3014542      C   VLLM::EngineCore                      91352MiB |
+-----------------------------------------------------------------------------------------+
(base) zhike@zhike:~/data$ echo $PATH
/usr/local/cuda-13.0/bin:/home/zhike/.local/bin:/opt/miniconda3/bin:/opt/miniconda3/condabin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/local/games:/snap/bin:/usr/local/cuda-13.0/bin

(base) zhike@zhike:~/data$ echo 'export CUDA_HOME=/usr/local/cuda-13.0' 
export CUDA_HOME=/usr/local/cuda-13.0

解决：
出现找不到nvcc的情况

unset CUDA_HOME NVCC CUDA_PATH

# 2. 显式指定正确路径（注意不要带前导/后导冒号）
export CUDA_HOME=/usr/local/cuda-13.0
export NVCC=$CUDA_HOME/bin/nvcc
export PATH=$CUDA_HOME/bin:$PATH
export LD_LIBRARY_PATH=$CUDA_HOME/lib64:$LD_LIBRARY_PATH

# 3. 验证 nvcc 是否可正常调用
$NVCC --version

rm -rf ~/.cache/flashinfer
# 若使用 VLLM 默认缓存目录，也可清理：
rm -rf ~/.cache/vllm

执行：

python -m vllm.entrypoints.openai.api_server --model /home/zhike/data/ModelScope/coder-next-nvfp4 --trust-remote-code --port 8000 --host 0.0.0.0 --gpu-memory-utilization 0.85 --served-model-name Qwen3-Coder-Next-nvfp4 --enable-auto-tool-choice --tool-call-parser qwen3_coder --max-model-len 65536 --max-num-batched-tokens 128288 --max-num-seqs 8 --block-size 64 --enable-prefix-caching --max-logprobs 20 --disable-log-stats --dtype auto --enforce-eager

参数	值示例	类型	作用与说明	⚠️ 注意事项/优化建议
`--model`	`/home/zhike/data/ModelScope/coder-next-nvfp4`	路径	指定本地模型目录路径（已量化为 NVFP4 格式）	✅ 确保路径存在 `config.json`、 `tokenizer.json`、 `model.safetensors` 等完整文件；路径中不得含空格或中文
`--trust-remote-code`	—	布尔开关	允许加载模型/分词器中的自定义 Python 代码（如 Qwen3-Coder 的 `CodeLlama` 模块）	🔒 仅用于可信来源模型；生产环境建议配合代码审计
`--port`	`8000`	整数	HTTP 服务监听端口	✅ 需防火墙放行；避免与其它服务冲突（如 8001 用于健康检查）
`--host`	`0.0.0.0`	字符串	绑定的网络接口地址	✅ `0.0.0.0` 允许内网访问；公网暴露需搭配 Nginx/SSL 与认证
`--gpu-memory-utilization`	`0.85`	浮点数（0.0~1.0）	预留给 vLLM 的 GPU 显存比例（非全部显存！）	⚠️ NVFP4 长上下文推荐 `0.7~0.75`； `0.85` 在 64K 上下文+高并发下仍可能 OOM；预留空间需 ≥ `10~15%`
`--served-model-name`	`Qwen3-Coder-Next-nvfp4`	字符串	API 中展示的模型名称（即 `/v1/models` 返回的 `id`）	✅ 客户端调用时 `model="Qwen3-Coder-Next-nvfp4"`；不影响实际加载路径
`--enable-auto-tool-choice`	—	布尔开关	启用工具调用自动识别（如函数/代码执行）	✅ 需配合 `--tool-call-parser` 使用
`--tool-call-parser`	`qwen3_coder`	字符串	指定工具调用输出解析器（对应模型类型）	🔧 仅支持 `qwen3_coder` / `qwen2` / `mistral` 等；错误选择会导致工具调用失败
`--max-model-len`	`65536`	整数（token）	上下文最大长度（KV Cache 预分配上限）	⚠️ 核心显存瓶颈！ vLLM 会按此值 8seq × 256K 预分配 KV Cache • 64K ≈ 10 ~12GB（NVFP4）<br>• 256K 可达 35~40GB ✅ 日常开发建议 `32768~65536`
`--max-num-batched-tokens`	`128288`	整数	单批次最大 token 总数（ `batch_size × seq_len`）	⚠️ 必须 ≥ `--max-model-len`（否则无法处理单条长请求） • `128288` 仅支持 2 条 64K 请求 • 若需更高吞吐 → 提高至 `262144`（支持 4 条 64K）
`--max-num-seqs`	`8`	整数	单批次最大并发序列数	⚠️ 与 `--max-num-batched-tokens` 共同决定并发能力： `max_num_seq ≤ min(8, floor(128288 / 65536)) = 1` → 实际仅支持 1 条请求！ ✅ 优化：若用 `--max-num-batched-tokens 262144`，则 `--max-num-seqs=4` 更合理
`--block-size`	`64`	整数（token）	PagedAttention 分块大小（页大小）	✅ 默认推荐值（32/64/128） • 64 是 NVFP4/长上下文友好选择（减少内部碎片）
`--enable-prefix-caching`	—	布尔开关	启用前缀缓存（相同 prompt 首部分复用计算）	✅ 大幅提升多轮/同构请求效率（如 IDE 插件） ⚠️ 首次启动会生成缓存哈希表（内存略增）
`--max-logprobs`	`20`	整数	返回的最大 `logprobs` 令牌数（用于 token 概率分析）	✅ 推荐 `5~20`；设为 `0` 可禁用以节省微小开销
`--disable-log-stats`	—	布尔开关	关闭 vLLM 后台性能日志（每秒打印 token/s、延迟等）	✅ 生产环境建议开启（减少 I/O）；调试时关闭（ `--no-disable-log-stats`）
`--dtype`	`auto`	字符串	模型权重精度	✅ `auto` 会自动识别 NVFP4 → 使用 `torch.float16` 或 `torch.bfloat16` ⚠️ 不建议手动设为 `fp16`/ `bf16`（NVFP4 本身已量化）
`--enforce-eager`	—	布尔开关	强制禁用 CUDA Graph（始终用 PyTorch eager 模式）	⚠️ 不推荐生产环境！ • 启用后：峰值显存 ↑ 15% ~30%，推理速度 ↓ 10%~25% • 仅用于调试/兼容性问题（如某些 GPU 架构报错）

openEuler 社区

openEuler 是由开放原子开源基金会孵化的全场景开源操作系统项目，面向数字基础设施四大核心场景（服务器、云计算、边缘计算、嵌入式），全面支持 ARM、x86、RISC-V、loongArch、PowerPC、SW-64 等多样性计算架构

更多推荐

亦唐科技如何应对全球电子制造业的智能化转型

通过智能化贴片机的研发、大数据与云计算的赋能、智能生产线的集成，亦唐科技不仅提升了产品的竞争力，还为全球电子制造业提供了更高效、精准、灵活的智能化生产设备。在这个变革的浪潮中，亦唐科技作为中国领先的贴片机制造商，如何在智能化趋势中抓住机遇，突破技术瓶颈，以应对全球电子制造业的智能化转型，已成为其发展的关键。通过持续的技术创新与智能化研发，亦唐科技正不断提升产品的智能化水平，以满足行业日益增长的需求

openEuler 社区

上海软件开发服务商那么多，企业数字化转型期该如何精准选择

2026年，上海有上千家软件开发公司，报价从5万到500万不等，都说自己“技术领先”“服务靠谱”。除了这些硬指标，畔游科技还有一个独特优势：一站式服务，从物联网开发、软件定制、小程序/APP、高端网站、VI/UI设计到品牌营销，六大板块由同一团队闭环交付，企业无需对接多家供应商，彻底解决“接口不通、责任推诿”的烦恼。软件开发不是一锤子买卖，系统上线后的bug修复、功能迭代、服务器维护、数据备份等等