DeepSeek V4 on Huawei Ascend: A Practical Guide to Running Frontier AI on Domestic Hardware
The release of the DeepSeek V4 preview marks a significant shift in the AI hardware landscape. For the first time, a frontier-class model ships with first-class support for Huawei Ascend NPUs — meaning you can run competitive AI inference without a single NVIDIA GPU.
This is a big deal for Chinese developers, research institutions, and enterprises that have been constrained by GPU availability. Let me walk through what this means and how to get started.
Hardware Landscape
Ascend 910B vs. NVIDIA A100 (Key Specs)
┌────────────────────┬────────────────────┬────────────────────┐
│ Spec │ Ascend 910B │ NVIDIA A100 │
├────────────────────┼────────────────────┼────────────────────┤
│ Compute (FP16) │ 320 TFLOPS │ 312 TFLOPS │
│ Memory │ 64GB HBM2e │ 80GB HBM2e │
│ Memory Bandwidth │ 1.5 TB/s │ 2.0 TB/s │
│ Interconnect │ HCCS 56GB/s │ NVLink 600GB/s │
│ TDP │ 310W │ 400W │
│ Availability │ High (domestic) │ Constrained* │
└────────────────────┴────────────────────┴────────────────────┘
* NVIDIA export restrictions to certain markets
The numbers tell an interesting story. Raw compute is comparable — the 910B actually edges ahead on FP16 TFLOPS. The real gaps are in memory bandwidth and interconnect bandwidth, which limit large-batch inference and multi-card scaling respectively. But for single-card inference and small-batch serving, the gap is narrowing fast.
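Why does memory bandwidth matter so much for inference? At batch size 1, generating each token requires streaming every model weight through memory once, so bandwidth sets a hard ceiling on decode speed. Here is a back-of-envelope sketch of that ceiling using the figures from the table. It is an upper bound, not a benchmark; real throughput lands well below it:

```python
# Rough decode-speed ceiling: at batch 1, each generated token reads all
# weights once, so tokens/s <= memory_bandwidth / weight_bytes. Real systems
# land below this due to KV-cache traffic, kernel overhead, and scheduling.

def decode_ceiling_tps(params_billion: float, bandwidth_tbs: float,
                       bytes_per_param: float = 2.0) -> float:
    """Upper bound on single-stream tokens/s (bf16 = 2 bytes per parameter)."""
    weight_bytes = params_billion * 1e9 * bytes_per_param
    return bandwidth_tbs * 1e12 / weight_bytes

for name, bw_tbs in [("Ascend 910B", 1.5), ("NVIDIA A100", 2.0)]:
    print(f"{name}: 7B ceiling ~{decode_ceiling_tps(7, bw_tbs):.0f} t/s, "
          f"70B ceiling ~{decode_ceiling_tps(70, bw_tbs):.0f} t/s")
```

For a 7B model in BF16, the 910B's 1.5 TB/s works out to a ceiling of roughly 107 tokens/s, which is consistent with the ~45 t/s observed in the benchmarks later in this post once real-world overheads are included.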
Architecture Overview
┌─────────────────────────────────────────────────────────────┐
│ DeepSeek V4 on Ascend — Deployment Stack │
├─────────────────────────────────────────────────────────────┤
│ │
│ ┌───────────────────────────────────────────────────────┐ │
│ │ Client Layer │ │
│ │ (Chat UI / API Client / curl) │ │
│ └────────────────────┬──────────────────────────────────┘ │
│ │ HTTP/WebSocket │
│ ┌────────────────────▼──────────────────────────────────┐ │
│ │ Serving Layer │ │
│ │ vLLM-Ascend / TGI-Ascend │ │
│ └────────────────────┬──────────────────────────────────┘ │
│ │ CANN (Compute Architecture) │
│ ┌────────────────────▼──────────────────────────────────┐ │
│ │ CANN Stack │ │
│ │ ├── ACL (Ascend Compute Language) │ │
│ │ ├── GE (Graph Engine) │ │
│ │ └── Runtime Driver │ │
│ └────────────────────┬──────────────────────────────────┘ │
│ │ │
│ ┌────────────────────▼──────────────────────────────────┐ │
│ │ Hardware │ │
│ │ Ascend 910B / 910 Pro │ │
│ └───────────────────────────────────────────────────────┘ │
│ │
└─────────────────────────────────────────────────────────────┘
Deployment Guide
Prerequisites
# System requirements
- OS: Ubuntu 22.04 / EulerOS
- Kernel: 5.10+
- NPU: Ascend 910B (at least 1 card)
- Memory: 64GB+ system RAM
- Disk: 200GB+ free space
Step 1: Install CANN Toolkit
# Download CANN from Huawei's support site
chmod +x Ascend-cann-toolkit_*.run
./Ascend-cann-toolkit_*.run --install --quiet

# Load the CANN environment (default install path; add to ~/.bashrc to persist)
source /usr/local/Ascend/ascend-toolkit/set_env.sh

# Verify installation
npu-smi info
# Should show available Ascend NPUs
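Beyond npu-smi, you can confirm the stack from Python if you plan to work with PyTorch directly. A minimal check, assuming you have installed torch and Huawei's torch_npu adapter (the Ascend backend for PyTorch):

```python
# NPU sanity check (assumes torch and the torch_npu Ascend adapter are installed)
import torch
import torch_npu  # registers the "npu" device type with PyTorch

print("NPU available:", torch.npu.is_available())
print("NPU count:", torch.npu.device_count())

# Run a tiny matmul on the first NPU to exercise the full driver stack
x = torch.randn(2, 2).npu()
print((x @ x).cpu())
```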
Step 2: Set Up Docker Environment
docker pull deepseek-ai/deepseek-v4-ascend:latest
# Map the NPU devices into the container (add /dev/davinci1, /dev/davinci2, ...
# for multi-card setups) and share the host CANN install with the container
docker run --rm -it \
  --device=/dev/davinci0 \
  --device=/dev/davinci_manager \
  --device=/dev/hisi_hdc \
  -v /usr/local/Ascend:/usr/local/Ascend \
  -p 8000:8000 \
  deepseek-ai/deepseek-v4-ascend:latest
Step 3: Start Inference Server
# Inside the container
# --served-model-name lets clients request the model by the short name used below
python -m vllm.entrypoints.openai.api_server \
  --model /models/deepseek-v4-preview \
  --served-model-name deepseek-v4-preview \
  --trust-remote-code \
  --dtype bfloat16 \
  --max-model-len 4096 \
  --gpu-memory-utilization 0.9
Step 4: Test It
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "deepseek-v4-preview",
        "messages": [{"role": "user", "content": "Hello, what can you do?"}]
      }'
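Since the server exposes an OpenAI-compatible API, the standard openai Python client works too; point base_url at the local server (the API key can be any placeholder unless you configured authentication):

```python
# Minimal chat request via the OpenAI-compatible endpoint (pip install openai)
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

resp = client.chat.completions.create(
    model="deepseek-v4-preview",
    messages=[{"role": "user", "content": "Hello, what can you do?"}],
)
print(resp.choices[0].message.content)
```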
Performance Observations
Early benchmarks from the community show promising results:
| Model | Hardware | Tokens/s | Memory | Notes |
|---|---|---|---|---|
| V4 Preview (7B) | 1× Ascend 910B | ~45 t/s | 14GB | Fast, fits single card |
| V4 Preview (14B) | 1× Ascend 910B | ~22 t/s | 28GB | Usable for production |
| V4 Preview (70B) | 4× Ascend 910B | ~15 t/s | 63GB | Requires quantization |
| V4 Preview (70B) | 1× A100 80GB | ~35 t/s | 70GB | Reference baseline |
The gap narrows with optimized CANN kernels. For the 7B and 14B models, the experience is genuinely production-ready.
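To reproduce tokens/s figures like these on your own hardware, stream a completion and time the chunks as they arrive. A rough sketch against the local endpoint from the deployment steps above (counting chunks approximates counting tokens, and the timer includes prompt processing, so treat the result as a ballpark):

```python
# Rough decode-throughput measurement via streaming
import time
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

start, n_chunks = time.time(), 0
stream = client.chat.completions.create(
    model="deepseek-v4-preview",
    messages=[{"role": "user", "content": "Write a short essay about rivers."}],
    max_tokens=256,
    stream=True,
)
for chunk in stream:
    # Most servers emit roughly one token per streamed chunk
    if chunk.choices and chunk.choices[0].delta.content:
        n_chunks += 1

print(f"~{n_chunks / (time.time() - start):.1f} tokens/s")
```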
Six Tips for Developers
- Use vLLM-Ascend, not raw CANN — The community fork of vLLM with an Ascend backend handles most of the optimization work for you
- Enable Flash Attention — The Ascend implementation (--enable-flash-attn) gives a 1.5-2x speedup on longer sequences
- Watch your batch size — Memory bandwidth is the bottleneck; small batches (1-4) give the best latency/throughput trade-off (see the sketch after this list)
- Use BF16, not INT8 — While INT8 is faster, the quality degradation on Ascend is more noticeable than on CUDA due to different quantization calibration
- Update CANN regularly — Each release brings significant performance improvements. 7.0.0 was good; 8.0.0+ is noticeably better
- Join the community — The Ascend AI community on GitHub and Chinese developer forums is active and helpful
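On the batch-size tip, the trade-off is easy to measure: fire a few concurrent requests and compare per-request latency against aggregate throughput. A minimal sketch, assuming the same local endpoint as in the deployment steps (the tok/s figure assumes responses reach max_tokens, so it is an estimate):

```python
# Latency vs. aggregate throughput at small concurrency levels
import time
from concurrent.futures import ThreadPoolExecutor
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

def one_request(_: int) -> float:
    """Send one fixed chat request and return its wall-clock latency."""
    t0 = time.time()
    client.chat.completions.create(
        model="deepseek-v4-preview",
        messages=[{"role": "user", "content": "Summarize the water cycle."}],
        max_tokens=128,
    )
    return time.time() - t0

for batch in (1, 2, 4, 8):
    t0 = time.time()
    with ThreadPoolExecutor(max_workers=batch) as pool:
        latencies = list(pool.map(one_request, range(batch)))
    wall = time.time() - t0
    print(f"batch={batch}: mean latency {sum(latencies) / batch:.2f}s, "
          f"~{batch * 128 / wall:.0f} tok/s aggregate")
```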
The Bigger Picture
DeepSeek V4 on Ascend is more than just another deployment option. It represents a decoupling moment — when AI model development and AI hardware ecosystem development can proceed independently. For Chinese developers, this means access to frontier AI without geopolitical constraints. For the global community, it means a more diverse and resilient hardware ecosystem.
The gap with CUDA isn’t closed yet. But it’s narrowing, and the rate of improvement is accelerating.