Huawei Kunpeng (ARM) + Ascend 310P + openEuler + vllm-ascend:v0.20.2rc1-310p + Qwen3-32B


Conclusion first:
Working combination:
Weights: https://www.modelscope.cn/models/Eco-Tech/Qwen3-32B-w8a8sc-310-vllm
Inference image: http://quay.io/ascend/vllm-ascend:v0.20.2rc1-310p
Reference documentation:
https://docs.vllm.ai/projects/ascend/en/latest/tutorials/hardwares/310p.html#
Inference images that failed:
https://www.hiascend.com/developer/ascendhub/detail/44d97ca10b0845b582336f2161d1c3a8
http://quay.io/ascend/vllm-ascend:v0.18.0-310p
[root@localhost Qwen3-32B-w8a8sc-310-vllm]# lscpu
Architecture: aarch64
CPU op-mode(s): 64-bit
Byte Order: Little Endian
CPU(s): 32
On-line CPU(s) list: 0-31
Vendor ID: HiSilicon
BIOS Vendor ID: HiSilicon
Model name: Kunpeng-920
BIOS Model name: HUAWEI Kunpeng 920S
Model: 0
Thread(s) per core: 1
Core(s) per socket: 32
Socket(s): 1
Stepping: 0x1
Frequency boost: disabled
CPU max MHz: 2600.0000
CPU min MHz: 200.0000
BogoMIPS: 200.00
Flags: fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics fphp asimdhp cpuid asimdrdm jscvt fcma dcpop asimddp asimdfhm ssbs
Caches (sum of all):
L1d: 2 MiB (32 instances)
L1i: 2 MiB (32 instances)
L2: 16 MiB (32 instances)
L3: 32 MiB (1 instance)
NUMA:
NUMA node(s): 1
NUMA node0 CPU(s): 0-31
Vulnerabilities:
Gather data sampling: Not affected
Itlb multihit: Not affected
L1tf: Not affected
Mds: Not affected
Meltdown: Not affected
Mmio stale data: Not affected
Retbleed: Not affected
Spec rstack overflow: Not affected
Spec store bypass: Not affected
Spectre v1: Mitigation; __user pointer sanitization
Spectre v2: Not affected
Srbds: Not affected
Tsx async abort: Not affected
[root@localhost Qwen3-32B-w8a8sc-310-vllm]# lsmem
RANGE SIZE STATE REMOVABLE BLOCK
0x0000000000000000-0x000000007fffffff 2G online yes 0-1
0x0000002080000000-0x0000003fffffffff 126G online yes 130-255
Memory block size: 1G
Total online memory: 128G
Total offline memory: 0B
[root@localhost Qwen3-32B-w8a8sc-310-vllm]# free -h
total used free shared buff/cache available
Mem: 124Gi 36Gi 4.0Gi 258Mi 85Gi 88Gi
Swap: 127Gi 465Mi 127Gi
[root@localhost Qwen3-32B-w8a8sc-310-vllm]#
[root@localhost Qwen3-32B-w8a8sc-310-vllm]# cat /etc/os-release
NAME="openEuler"
VERSION="22.03 (LTS-SP4)"
ID="openEuler"
VERSION_ID="22.03"
PRETTY_NAME="openEuler 22.03 (LTS-SP4)"
ANSI_COLOR="0;31"
[root@localhost Qwen3-32B-w8a8sc-310-vllm]#
[root@localhost Qwen3-32B-w8a8sc-310-vllm]# cat /etc/docker/daemon.json
{"runtimes": {"ascend": {"path": "/usr/local/Ascend/Ascend-Docker-Runtime/ascend-docker-runtime","runtimeArgs": []}},"data-root":"/data/docker/","default-runtime": "ascend","registry-mirrors": ["https://registry.docker-cn.com","https://docker.m.daocloud.io","https://dockerproxy.com","https://docker.mirrors.ustc.edu.cn","https://docker.nju.edu.cn","https://iju9kaj2.mirror.aliyuncs.com","http://hub-mirror.c.163.com","https://cr.console.aliyun.com","https://hub.docker.com","http://mirrors.ustc.edu.cn","https://docker.xuanyuan.me"]}
[root@localhost Qwen3-32B-w8a8sc-310-vllm]#
[root@localhost Qwen3-32B-w8a8sc-310-vllm]# lsblk -do name,size,rota
NAME SIZE ROTA
sda 3.6T 1
nvme0n1 465.8G 0
[root@localhost Qwen3-32B-w8a8sc-310-vllm]# lsblk
NAME MAJ:MIN RM SIZE RO TYPE MOUNTPOINTS
sda 8:0 0 3.6T 0 disk /data
nvme0n1 259:0 0 465.8G 0 disk
├─nvme0n1p1 259:1 0 600M 0 part /boot/efi
├─nvme0n1p2 259:2 0 1G 0 part /boot
├─nvme0n1p3 259:3 0 128G 0 part [SWAP]
└─nvme0n1p4 259:4 0 336.2G 0 part /
[root@localhost Qwen3-32B-w8a8sc-310-vllm]# df -hT
Filesystem Type Size Used Avail Use% Mounted on
devtmpfs devtmpfs 4.0M 0 4.0M 0% /dev
tmpfs tmpfs 63G 8.0K 63G 1% /dev/shm
tmpfs tmpfs 25G 28M 25G 1% /run
tmpfs tmpfs 4.0M 0 4.0M 0% /sys/fs/cgroup
/dev/nvme0n1p4 ext4 330G 37G 277G 12% /
/dev/sda ext4 3.6T 630G 2.8T 19% /data
tmpfs tmpfs 63G 508K 63G 1% /tmp
/dev/nvme0n1p2 ext4 974M 253M 655M 28% /boot
/dev/nvme0n1p1 vfat 599M 6.5M 593M 2% /boot/efi
tmpfs tmpfs 13G 36K 13G 1% /run/user/1000
tmpfs tmpfs 13G 0 13G 0% /run/user/0
overlay overlay 3.6T 630G 2.8T 19% /data/docker/overlay2/3cec6fec0ecc250c52a46f08b8cc0fddd6910ef5fa0be1c975ead9c3820b927d/merged
shm tmpfs 10G 700K 10G 1% /data/docker/containers/ca97df9cd1f57f858e1799b265d9fe48ec1271332b7fc83909ff2a82d9db7058/mounts/shm
[root@localhost Qwen3-32B-w8a8sc-310-vllm]# pwd
/data/models/Qwen3-32B-w8a8sc-310-vllm
[root@localhost Qwen3-32B-w8a8sc-310-vllm]# ll
total 12
-rw-r--r--. 1 root root 1577 Jun 16 09:47 docker-compose.yml
-rw-r--r--. 1 root root 2178 Jun 15 17:38 README.md
drwxr-xr-x. 3 root root 4096 Jun 15 17:38 TP4
[root@localhost Qwen3-32B-w8a8sc-310-vllm]# cat docker-compose.yml
services:
  qwen3-32b:
    container_name: qwen3-32b
    image: quay.io/ascend/vllm-ascend:v0.20.2rc1-310p
    restart: unless-stopped
    devices:
      - /dev/davinci0:/dev/davinci0
      - /dev/davinci1:/dev/davinci1
      - /dev/davinci2:/dev/davinci2
      - /dev/davinci3:/dev/davinci3
      - /dev/davinci_manager:/dev/davinci_manager
      - /dev/devmm_svm:/dev/devmm_svm
      - /dev/hisi_hdc:/dev/hisi_hdc
    shm_size: 10g
    ports:
      - "8080:8080"
    volumes:
      - /usr/local/dcmi:/usr/local/dcmi
      - /usr/local/bin/npu-smi:/usr/local/bin/npu-smi
      - /usr/local/Ascend/driver/lib64/:/usr/local/Ascend/driver/lib64/
      - /usr/local/Ascend/driver/version.info:/usr/local/Ascend/driver/version.info
      - /etc/ascend_install.info:/etc/ascend_install.info
      - /data/models:/root/.cache
    environment:
      - ASCEND_RT_VISIBLE_DEVICES=0,1,2,3
      - TZ=Asia/Shanghai
    entrypoint: ["/bin/bash", "-c"]
    command:
      - >
        vllm serve /root/.cache/Qwen3-32B-w8a8sc-310-vllm/TP4/Qwen3-32B-w8a8sc-310-vllm-tp4
        --host 0.0.0.0
        --port 8080
        --tensor-parallel-size 4
        --gpu_memory_utilization 0.90
        --max_num_seqs 32
        --served_model_name qwen
        --dtype float16
        --additional-config '{"ascend_compilation_config": {"fuse_norm_quant": false}}'
        --compilation-config '{"cudagraph_mode": "FULL_DECODE_ONLY", "cudagraph_capture_sizes": [16,32]}'
        --quantization ascend
        --max_model_len 32768
        --no-enable-prefix-caching
        --load_format sharded_state
[root@localhost Qwen3-32B-w8a8sc-310-vllm]# npu-smi info
+--------------------------------------------------------------------------------------------------------+
| npu-smi 24.1.rc2 Version: 24.1.rc2 |
+-------------------------------+-----------------+------------------------------------------------------+
| NPU Name | Health | Power(W) Temp(C) Hugepages-Usage(page) |
| Chip Device | Bus-Id | AICore(%) Memory-Usage(MB) |
+===============================+=================+======================================================+
| 0 310P3 | OK | NA 44 13423 / 13423 |
| 0 0 | 0000:01:00.0 | 0 29140/ 44280 |
+-------------------------------+-----------------+------------------------------------------------------+
| 0 310P3 | OK | NA 43 13393 / 13393 |
| 1 1 | 0000:01:00.0 | 0 28303/ 43693 |
+===============================+=================+======================================================+
| 64 310P3 | OK | NA 45 13393 / 13393 |
| 0 2 | 0000:02:00.0 | 0 28765/ 44280 |
+-------------------------------+-----------------+------------------------------------------------------+
| 64 310P3 | OK | NA 43 13393 / 13393 |
| 1 3 | 0000:02:00.0 | 0 28547/ 43693 |
+===============================+=================+======================================================+
+-------------------------------+-----------------+------------------------------------------------------+
| NPU Chip | Process id | Process name | Process memory(MB) |
+===============================+=================+======================================================+
| 0 0 | 553624 | VLLMWorker_TP | 97 |
| 0 0 | 553605 | VLLMWorker_TP | 98 |
| 0 0 | 553598 | VLLMWorker_TP | 27274 |
| 0 0 | 553855 | VLLMWorker_TP | 97 |
| 0 1 | 553605 | VLLMWorker_TP | 27274 |
+===============================+=================+======================================================+
| 64 0 | 553624 | VLLMWorker_TP | 27274 |
| 64 1 | 553855 | VLLMWorker_TP | 27274 |
+===============================+=================+======================================================+
[root@localhost Qwen3-32B-w8a8sc-310-vllm]# du -sh *
4.0K docker-compose.yml
4.0K README.md
32G TP4
[root@localhost Qwen3-32B-w8a8sc-310-vllm]# cd TP4
[root@localhost TP4]# ll
total 4
drwxr-xr-x. 2 root root 4096 Jun 16 09:02 Qwen3-32B-w8a8sc-310-vllm-tp4
[root@localhost TP4]# cd Qwen3-32B-w8a8sc-310-vllm-tp4/
[root@localhost Qwen3-32B-w8a8sc-310-vllm-tp4]# ll
total 33414612
-rw-r--r--. 1 root root 727 Jun 15 17:38 config.json
-rw-r--r--. 1 root root 73 Jun 15 17:38 configuration.json
-rw-r--r--. 1 root root 35014 Jun 16 09:02 fusion_result.json
-rw-r--r--. 1 root root 239 Jun 15 17:38 generation_config.json
-rw-r--r--. 1 root root 8551452512 Jun 15 20:42 model-rank-0-part-0.safetensors
-rw-r--r--. 1 root root 8550975968 Jun 15 20:47 model-rank-1-part-0.safetensors
-rw-r--r--. 1 root root 8551050464 Jun 15 21:17 model-rank-2-part-0.safetensors
-rw-r--r--. 1 root root 8548654816 Jun 15 20:16 model-rank-3-part-0.safetensors
-rw-r--r--. 1 root root 125759 Jun 15 17:38 quant_model_description.json
-rw-r--r--. 1 root root 1246 Jun 15 17:38 README.md
-rw-r--r--. 1 root root 9732 Jun 15 17:38 tokenizer_config.json
-rw-r--r--. 1 root root 11422654 Jun 15 17:38 tokenizer.json
-rw-r--r--. 1 root root 2776833 Jun 15 17:38 vocab.json
[root@localhost Qwen3-32B-w8a8sc-310-vllm-tp4]# du -sh .
32G .
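As a quick sanity check on the listing above, the four rank shards can be summed to see whether they account for the 32G that du reports. The sketch below only does arithmetic on the byte counts copied from the ll output; the helper names are illustrative. With --tensor-parallel-size 4 and --load_format sharded_state, each rank presumably loads just its own shard, so the per-shard figure is a rough estimate of the weight footprint per NPU chip.

```python
# Shard sizes in bytes, copied from the `ll` listing above.
shard_bytes = {
    "model-rank-0-part-0.safetensors": 8551452512,
    "model-rank-1-part-0.safetensors": 8550975968,
    "model-rank-2-part-0.safetensors": 8551050464,
    "model-rank-3-part-0.safetensors": 8548654816,
}

GIB = 1024 ** 3  # du -h reports binary units


def total_gib(sizes):
    """Return the combined size of all shards in GiB."""
    return sum(sizes.values()) / GIB


def per_shard_gib(sizes):
    """Return a mapping of shard name to size in GiB."""
    return {name: size / GIB for name, size in sizes.items()}


for name, gib in per_shard_gib(shard_bytes).items():
    print(f"{name}: {gib:.2f} GiB")
print(f"total: {total_gib(shard_bytes):.2f} GiB")
```

The total comes to about 31.85 GiB, roughly 7.96 GiB per shard, which is consistent with the 32G shown by du once the small JSON and tokenizer files are included.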
[root@localhost Qwen3-32B-w8a8sc-310-vllm-tp4]# docker images|grep 310
quay.io/ascend/vllm-ascend v0.20.2rc1-310p 1ba018a56294 12 days ago 15.8GB
quay.io/ascend/vllm-ascend v0.18.0-310p e17f09537569 6 weeks ago 14.3GB
swr.cn-south-1.myhuaweicloud.com/ascendhub/bge-large-zh-v1.5 7.1.T9-310p-aarch64 bf32177dea37 10 months ago 18GB
[root@localhost Qwen3-32B-w8a8sc-310-vllm-tp4]#
[root@localhost Qwen3-32B-w8a8sc-310-vllm-tp4]# docker ps
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
ca97df9cd1f5 quay.io/ascend/vllm-ascend:v0.20.2rc1-310p "/bin/bash -c 'vllm …" 10 minutes ago Up 10 minutes 0.0.0.0:8080->8080/tcp qwen3-32b
[root@localhost Qwen3-32B-w8a8sc-310-vllm-tp4]#
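A container that docker ps shows as "Up" may still be loading weights, so it can help to wait until the API actually answers before sending requests. The following is a minimal sketch, assuming the server exposes the OpenAI-compatible GET /v1/models route on port 8080 and lists the model under the served name qwen from the compose file. The function names, retry count, and delay are illustrative choices, not part of vLLM.

```python
import json
import time
import urllib.error
import urllib.request


def fetch_model_ids(base_url, timeout=5.0):
    """Return the model ids listed by GET {base_url}/v1/models."""
    with urllib.request.urlopen(f"{base_url}/v1/models", timeout=timeout) as resp:
        payload = json.load(resp)
    return [entry["id"] for entry in payload.get("data", [])]


def wait_until_ready(base_url, expected_model, attempts=60, delay=10.0,
                     fetch=fetch_model_ids, sleep=time.sleep):
    """Poll the server until expected_model is listed; return True on success."""
    for _ in range(attempts):
        try:
            if expected_model in fetch(base_url):
                return True
        except (urllib.error.URLError, OSError, ValueError):
            pass  # server not accepting connections or not returning JSON yet
        sleep(delay)
    return False


# Example call (address taken from the curl command in this post):
# wait_until_ready("http://192.168.20.123:8080", "qwen")
```

The fetch and sleep parameters are injectable only so the loop can be exercised without a live server.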
[root@localhost Qwen3-32B-w8a8sc-310-vllm-tp4]# curl http://192.168.20.123:8080/v1/completions -H "Content-Type: application/json" -d '{
"prompt": "你好,你是谁,擅长做什么",
"max_completion_tokens": 64,
"temperature": 0.0
}'
{"id":"cmpl-be1ec386f55233a4","object":"text_completion","created":1781575140,"model":"qwen","choices":[{"index":0,"text":"? 您好!我是通义千问,是阿里巴巴集团旗下的通","logprobs":null,"finish_reason":"length","stop_reason":null,"token_ids":null,"prompt_logprobs":null,"prompt_token_ids":null}],"service_tier":null,"system_fingerprint":null,"usage":{"prompt_tokens":7,"total_tokens":23,"completion_tokens":16,"prompt_tokens_details":null},"kv_transfer_params":null}
[root@localhost Qwen3-32B-w8a8sc-310-vllm-tp4]#
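One detail in the response above: the request asked for 64 tokens but the reply stopped at completion_tokens 16 with finish_reason "length". A likely explanation is that the /v1/completions route reads max_tokens (which defaults to 16 in the OpenAI-style completions API), while max_completion_tokens is a chat-completions parameter and is probably ignored here. The sketch below builds the same request with max_tokens instead. It is an illustration under that assumption; the helper names are made up for this example, and the model field uses the served name qwen from the compose file.

```python
import json
import urllib.request


def build_completion_request(base_url, prompt, max_tokens=64,
                             temperature=0.0, model="qwen"):
    """Build a POST request for the OpenAI-compatible /v1/completions route."""
    body = {
        "model": model,
        "prompt": prompt,
        "max_tokens": max_tokens,  # not max_completion_tokens
        "temperature": temperature,
    }
    return urllib.request.Request(
        f"{base_url}/v1/completions",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def summarize(response):
    """Pick out the fields that show whether the token limit took effect."""
    choice = response["choices"][0]
    return {
        "text": choice["text"],
        "finish_reason": choice["finish_reason"],
        "completion_tokens": response["usage"]["completion_tokens"],
    }


# Example call against the server from this post:
# req = build_completion_request("http://192.168.20.123:8080", "你好,你是谁,擅长做什么")
# with urllib.request.urlopen(req, timeout=120) as resp:
#     print(summarize(json.load(resp)))
```

If the assumption holds, completion_tokens should be able to go past 16 and reach up to 64 before finish_reason reports "length".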
[root@localhost Qwen3-32B-w8a8sc-310-vllm-tp4]# docker-compose -v
Docker Compose version v2.32.4
[root@localhost Qwen3-32B-w8a8sc-310-vllm-tp4]#
[root@localhost Qwen3-32B-w8a8sc-310-vllm-tp4]# docker -v
Docker version 18.09.0, build d51e3ad
[root@localhost Qwen3-32B-w8a8sc-310-vllm-tp4]# docker info
Containers: 2
Running: 1
Paused: 0
Stopped: 1
Images: 8
Server Version: 18.09.0
Storage Driver: overlay2
Backing Filesystem: extfs
Supports d_type: true
Native Overlay Diff: true
Logging Driver: json-file
Cgroup Driver: cgroupfs
Hugetlb Pagesize: 2MB, 64KB, 32MB, 1GB, 64KB, 32MB, 2MB, 1GB (default is 2MB)
Plugins:
Volume: local
Network: bridge host macvlan null overlay
Log: awslogs fluentd gcplogs gelf journald json-file local logentries splunk syslog
Swarm: inactive
Runtimes: ascend runc
Default Runtime: ascend
Init Binary: docker-init
containerd version: 871075eb7cc979944ba2d987719cb534bbb87e5c
runc version: N/A
init version: N/A (expected: )
Security Options:
seccomp
Profile: default
Kernel Version: 5.10.0-254.0.0.158.oe2203sp4.aarch64
Operating System: openEuler 22.03 (LTS-SP4)
OSType: linux
Architecture: aarch64
CPUs: 32
Total Memory: 124.3GiB
Name: localhost.localdomain
ID: WPOB:6KZK:4WJK:C6PV:5SO4:4JOY:5VTN:5XCI:CG2Z:BCS6:EQGT:GYNC
Docker Root Dir: /data/docker
Debug Mode (client): false
Debug Mode (server): false
Registry: https://index.docker.io/v1/
Labels:
Experimental: false
Insecure Registries:
127.0.0.0/8
Registry Mirrors:
https://registry.docker-cn.com/
https://docker.m.daocloud.io/
https://dockerproxy.com/
https://docker.mirrors.ustc.edu.cn/
https://docker.nju.edu.cn/
https://iju9kaj2.mirror.aliyuncs.com/
http://hub-mirror.c.163.com/
https://cr.console.aliyun.com/
https://hub.docker.com/
http://mirrors.ustc.edu.cn/
https://docker.xuanyuan.me/
Live Restore Enabled: true
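Tip: while the vLLM inference service is running, the bge model cannot be started at the same time, because the NPU cards are held exclusively and cannot be shared.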
