Fix vLLM RuntimeError: NCCL error: unhandled system error

Running vLLM with --tensor-parallel-size greater than 1 triggered this error:

RuntimeError: NCCL error: unhandled system error (run with NCCL_DEBUG=INFO for details)
Exception: WorkerProc initialization failed due to an exception in a background process. See stack trace for root cause.
(EngineCore_0 pid=236) Process EngineCore_0:

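This error is not about NCCL_P2P_DISABLE=1. The message is vague, but the cause is shared memory: when --tensor-parallel-size spreads the model across multiple GPUs, the worker processes need shared memory (/dev/shm) to exchange data with each other, and Docker gives a container only 64 MB of it by default.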

So the solution is to add --shm-size 10g to the docker run command. While investigating, remove all the environment variables passed to Docker first. Be careful: an environment variable that itself causes a vLLM error can mask the effect of the others.
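To confirm that shared memory is the limiting factor, it can help to check how large /dev/shm actually is inside the container. The snippet below is a minimal sketch, not part of vLLM or NCCL: the helper names shm_size_mib and shm_at_least_mib are hypothetical, and the 1024 MiB threshold in the usage line is an arbitrary illustration rather than a documented requirement.

```shell
#!/bin/sh
# Hypothetical helpers for inspecting shared memory inside a container.

# Print the total size of a mounted filesystem in MiB (default: /dev/shm).
# df -P forces one line per filesystem; -k reports sizes in 1024-byte blocks.
shm_size_mib() {
  df -Pk "${1:-/dev/shm}" | awk 'NR==2 { printf "%d\n", $2 / 1024 }'
}

# Succeed if the filesystem ($2, default /dev/shm) holds at least $1 MiB.
shm_at_least_mib() {
  [ "$(shm_size_mib "${2:-/dev/shm}")" -ge "$1" ]
}
```

Inside the container, something like `shm_at_least_mib 1024 || echo "/dev/shm is only $(shm_size_mib) MiB"` shows whether --shm-size took effect. Without the flag, Docker's default of 64 MB should be reported.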

Here is an example that works:

docker run --rm -it \
  --gpus all \
  --network host -p 8000:8000 -p 8080:8080 --shm-size 10g \
  -e NCCL_P2P_DISABLE=1 \
  -v /model/llama-3-2-1b:/model \
  nvcr.io/nvidia/ai-dynamo/vllm-runtime:0.4.1 \
  python3 -m vllm.entrypoints.openai.api_server \
    --model /model \
    --tensor-parallel-size 2 \
    --served-model-name model \
    --dtype bfloat16 \
    --gpu-memory-utilization 0.90 \
    --max-model-len 8192
