DeepSeek V4 on Huawei Ascend: A Practical Guide to Running Frontier AI on Domestic Hardware
The release of the DeepSeek V4 preview marks a significant shift in the AI hardware landscape. For the first time, a frontier-class model ships with first-class support for Huawei Ascend NPUs — meaning you can run competitive AI inference without a single NVIDIA GPU.
This is a big deal for Chinese developers, research institutions, and enterprises that have been constrained by GPU availability. Let me walk through what this means and how to get started.
Hardware Landscape
Ascend 910B vs. NVIDIA A100 (Key Specs)
┌────────────────────┬────────────────────┬────────────────────┐
│ Spec │ Ascend 910B │ NVIDIA A100 │
├────────────────────┼────────────────────┼────────────────────┤
│ Compute (FP16) │ 320 TFLOPS │ 312 TFLOPS │
│ Memory │ 64GB HBM2e │ 80GB HBM2e │
│ Memory Bandwidth │ 1.5 TB/s │ 2.0 TB/s │
│ Interconnect │ HCCS 56GB/s │ NVLink 600GB/s │
│ TDP │ 310W │ 400W │
│ Availability │ High (domestic) │ Constrained* │
└────────────────────┴────────────────────┴────────────────────┘
* NVIDIA export restrictions to certain markets
The numbers tell an interesting story. Raw compute is comparable — the 910B actually edges ahead on FP16 TFLOPS. The real gaps are in memory bandwidth and interconnect bandwidth, which limit large-batch inference and multi-card scaling respectively. But for single-card inference and small-batch serving, the gap is narrowing fast.
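Why does memory bandwidth matter so much for inference? At batch size 1, generating each token requires streaming every model weight through memory once, so bandwidth sets a hard ceiling on decode speed. Here is a back-of-envelope sketch of that ceiling using the figures from the table. It is an upper bound, not a benchmark; real throughput lands well below it:

```python
# Rough decode-speed ceiling: at batch 1, each generated token reads all
# weights once, so tokens/s <= memory_bandwidth / weight_bytes. Real systems
# land below this due to KV-cache traffic, kernel overhead, and scheduling.

def decode_ceiling_tps(params_billion: float, bandwidth_tbs: float,
                       bytes_per_param: float = 2.0) -> float:
    """Upper bound on single-stream tokens/s (bf16 = 2 bytes per parameter)."""
    weight_bytes = params_billion * 1e9 * bytes_per_param
    return bandwidth_tbs * 1e12 / weight_bytes

for name, bw_tbs in [("Ascend 910B", 1.5), ("NVIDIA A100", 2.0)]:
    print(f"{name}: 7B ceiling ~{decode_ceiling_tps(7, bw_tbs):.0f} t/s, "
          f"70B ceiling ~{decode_ceiling_tps(70, bw_tbs):.0f} t/s")
```

For a 7B model in BF16, the 910B's 1.5 TB/s works out to a ceiling of roughly 107 tokens/s, which is consistent with the ~45 t/s observed in the benchmarks later in this post once real-world overheads are included.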
Architecture Overview
┌─────────────────────────────────────────────────────────────┐
│ DeepSeek V4 on Ascend — Deployment Stack │
├─────────────────────────────────────────────────────────────┤
│ │
│ ┌───────────────────────────────────────────────────────┐ │
│ │ Client Layer │ │
│ │ (Chat UI / API Client / curl) │ │
│ └────────────────────┬──────────────────────────────────┘ │
│ │ HTTP/WebSocket │
│ ┌────────────────────▼──────────────────────────────────┐ │
│ │ Serving Layer │ │
│ │ vLLM-Ascend / TGI-Ascend │ │
│ └────────────────────┬──────────────────────────────────┘ │
│ │ CANN (Compute Architecture) │
│ ┌────────────────────▼──────────────────────────────────┐ │
│ │ CANN Stack │ │
│ │ ├── ACL (Ascend Compute Language) │ │
│ │ ├── GE (Graph Engine) │ │
│ │ └── Runtime Driver │ │
│ └────────────────────┬──────────────────────────────────┘ │
│ │ │
│ ┌────────────────────▼──────────────────────────────────┐ │
│ │ Hardware │ │
│ │ Ascend 910B / 910 Pro │ │
│ └───────────────────────────────────────────────────────┘ │
│ │
└─────────────────────────────────────────────────────────────┘
Deployment Guide
Prerequisites
# System requirements
- OS: Ubuntu 22.04 / EulerOS
- Kernel: 5.10+
- NPU: Ascend 910B (at least 1 card)
- Memory: 64GB+ system RAM
- Disk: 200GB+ free space
Step 1: Install CANN Toolkit
# Download CANN from Huawei's support site
chmod +x Ascend-cann-toolkit_*.run
./Ascend-cann-toolkit_*.run --install --quiet

# Load the CANN environment (default install path; add to ~/.bashrc to persist)
source /usr/local/Ascend/ascend-toolkit/set_env.sh

# Verify installation
npu-smi info
# Should show available Ascend NPUs
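Beyond npu-smi, you can confirm the stack from Python if you plan to work with PyTorch directly. A minimal check, assuming you have installed torch and Huawei's torch_npu adapter (the Ascend backend for PyTorch):

```python
# NPU sanity check (assumes torch and the torch_npu Ascend adapter are installed)
import torch
import torch_npu  # registers the "npu" device type with PyTorch

print("NPU available:", torch.npu.is_available())
print("NPU count:", torch.npu.device_count())

# Run a tiny matmul on the first NPU to exercise the full driver stack
x = torch.randn(2, 2).npu()
print((x @ x).cpu())
```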
Step 2: Set Up Docker Environment
docker pull deepseek-ai/deepseek-v4-ascend:latest
# Map the NPU devices into the container (add /dev/davinci1, /dev/davinci2, ...
# for multi-card setups) and share the host CANN install with the container
docker run --rm -it \
  --device=/dev/davinci0 \
  --device=/dev/davinci_manager \
  --device=/dev/hisi_hdc \
  -v /usr/local/Ascend:/usr/local/Ascend \
  -p 8000:8000 \
  deepseek-ai/deepseek-v4-ascend:latest
Step 3: Start Inference Server
# Inside the container
# --served-model-name lets clients request the model by the short name used below
python -m vllm.entrypoints.openai.api_server \
  --model /models/deepseek-v4-preview \
  --served-model-name deepseek-v4-preview \
  --trust-remote-code \
  --dtype bfloat16 \
  --max-model-len 4096 \
  --gpu-memory-utilization 0.9
Step 4: Test It
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "deepseek-v4-preview",
        "messages": [{"role": "user", "content": "Hello, what can you do?"}]
      }'
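Since the server exposes an OpenAI-compatible API, the standard openai Python client works too; point base_url at the local server (the API key can be any placeholder unless you configured authentication):

```python
# Minimal chat request via the OpenAI-compatible endpoint (pip install openai)
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

resp = client.chat.completions.create(
    model="deepseek-v4-preview",
    messages=[{"role": "user", "content": "Hello, what can you do?"}],
)
print(resp.choices[0].message.content)
```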
Performance Observations
Early benchmarks from the community show promising results:
| Model | Hardware | Tokens/s | Memory | Notes |
|---|---|---|---|---|
| V4 Preview (7B) | 1× Ascend 910B | ~45 t/s | 14GB | Fast, fits single card |
| V4 Preview (14B) | 1× Ascend 910B | ~22 t/s | 28GB | Usable for production |
| V4 Preview (70B) | 4× Ascend 910B | ~15 t/s | 63GB | Requires quantization |
| V4 Preview (70B) | 1× A100 80GB | ~35 t/s | 70GB | Reference baseline |
The gap narrows with optimized CANN kernels. For the 7B and 14B models, the experience is genuinely production-ready.
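To reproduce tokens/s figures like these on your own hardware, stream a completion and time the chunks as they arrive. A rough sketch against the local endpoint from the deployment steps above (counting chunks approximates counting tokens, and the timer includes prompt processing, so treat the result as a ballpark):

```python
# Rough decode-throughput measurement via streaming
import time
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

start, n_chunks = time.time(), 0
stream = client.chat.completions.create(
    model="deepseek-v4-preview",
    messages=[{"role": "user", "content": "Write a short essay about rivers."}],
    max_tokens=256,
    stream=True,
)
for chunk in stream:
    # Most servers emit roughly one token per streamed chunk
    if chunk.choices and chunk.choices[0].delta.content:
        n_chunks += 1

print(f"~{n_chunks / (time.time() - start):.1f} tokens/s")
```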
Six Tips for Developers
- Use vLLM-Ascend, not raw CANN — The community fork of vLLM with an Ascend backend handles most of the optimization work for you
- Enable Flash Attention — The Ascend implementation (--enable-flash-attn) gives a 1.5-2x speedup on longer sequences
- Watch your batch size — Memory bandwidth is the bottleneck; small batches (1-4) give the best latency/throughput trade-off (see the sketch after this list)
- Use BF16, not INT8 — While INT8 is faster, the quality degradation on Ascend is more noticeable than on CUDA due to different quantization calibration
- Update CANN regularly — Each release brings significant performance improvements. 7.0.0 was good; 8.0.0+ is noticeably better
- Join the community — The Ascend AI community on GitHub and Chinese developer forums is active and helpful
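On the batch-size tip, the trade-off is easy to measure: fire a few concurrent requests and compare per-request latency against aggregate throughput. A minimal sketch, assuming the same local endpoint as in the deployment steps (the tok/s figure assumes responses reach max_tokens, so it is an estimate):

```python
# Latency vs. aggregate throughput at small concurrency levels
import time
from concurrent.futures import ThreadPoolExecutor
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

def one_request(_: int) -> float:
    """Send one fixed chat request and return its wall-clock latency."""
    t0 = time.time()
    client.chat.completions.create(
        model="deepseek-v4-preview",
        messages=[{"role": "user", "content": "Summarize the water cycle."}],
        max_tokens=128,
    )
    return time.time() - t0

for batch in (1, 2, 4, 8):
    t0 = time.time()
    with ThreadPoolExecutor(max_workers=batch) as pool:
        latencies = list(pool.map(one_request, range(batch)))
    wall = time.time() - t0
    print(f"batch={batch}: mean latency {sum(latencies) / batch:.2f}s, "
          f"~{batch * 128 / wall:.0f} tok/s aggregate")
```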
The Bigger Picture
DeepSeek V4 on Ascend is more than just another deployment option. It represents a decoupling moment — when AI model development and AI hardware ecosystem development can proceed independently. For Chinese developers, this means access to frontier AI without geopolitical constraints. For the global community, it means a more diverse and resilient hardware ecosystem.
The gap with CUDA isn’t closed yet. But it’s narrowing, and the rate of improvement is accelerating.