
Fix VLLM RuntimeError: NCCL error: unhandled system error

Running VLLM with a tensor-parallel-size greater than 1 triggered this error:

RuntimeError: NCCL error: unhandled system error (run with NCCL_DEBUG=INFO for details)
Exception: WorkerProc initialization failed due to an exception in a background process. See stack trace for root cause.
(EngineCore_0 pid=236) Process EngineCore_0:

This error is not about NCCL_P2P_DISABLE=1. The message is vague, but the real cause is that when tensor-parallel-size spans multiple GPUs, the worker processes need shared memory to exchange data with each other, and the container's default shared memory is too small.

So the solution is to add --shm-size 10g to the docker run command. To investigate, remove all environment variables passed to Docker and add them back one at a time. Be careful: an environment variable that triggers a VLLM error may cancel out the effect of another one.

Here is an example that works:

docker run --rm -it \
  --gpus all \
  --network host -p 8000:8000 -p 8080:8080 --shm-size 10g \
  -e NCCL_P2P_DISABLE=1 \
  -v /model/llama-3-2-1b:/model \
  nvcr.io/nvidia/ai-dynamo/vllm-runtime:0.4.1 \
  python3 -m vllm.entrypoints.openai.api_server \
    --model /model \
    --tensor-parallel-size 2 \
    --served-model-name model \
    --dtype bfloat16 \
    --gpu-memory-utilization 0.90 \
    --max-model-len 8192
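
To confirm the diagnosis before changing anything, you can check how much shared memory the container actually got and re-run with NCCL debug logging, as the error message itself suggests. This is a sketch that reuses the image and paths from the example above; adjust them to your setup.

# Check shared memory inside the container; Docker grants only 64 MB by default without --shm-size
docker run --rm -it nvcr.io/nvidia/ai-dynamo/vllm-runtime:0.4.1 df -h /dev/shm

# Re-run with NCCL debug logging to see the underlying system error
docker run --rm -it --gpus all --shm-size 10g \
  -e NCCL_DEBUG=INFO \
  -v /model/llama-3-2-1b:/model \
  nvcr.io/nvidia/ai-dynamo/vllm-runtime:0.4.1 \
  python3 -m vllm.entrypoints.openai.api_server \
    --model /model --tensor-parallel-size 2 --served-model-name model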

Run GPT OSS 20B on VLLM with RTX 4090

Here is a quick way to run OpenAI GPT OSS 20B on an RTX 4090 GPU:

docker run --name vllm --gpus all -v /YOUR_PATH_TO_MODEL/models--gpt-oss-20b:/model -e VLLM_ATTENTION_BACKEND='TRITON_ATTN_VLLM_V1' \
    -p 8000:8000 \
    --ipc=host \
    vllm/vllm-openai:gptoss \
    --model /model --served-model-name model

You can download the model with:

hf download openai/gpt-oss-20b --local-dir ./
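
Once the container is up, you can sanity-check it through vLLM's OpenAI-compatible API. A minimal sketch, assuming the flags above (-p 8000:8000 and --served-model-name model):

curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "model",
        "messages": [{"role": "user", "content": "Say hello in one sentence."}],
        "max_tokens": 64
      }'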

Set fan speed Nvidia GPU Ubuntu Server Headless

Here are the quick commands to adjust NVIDIA GPU fan speed on a headless Ubuntu server.

Run this, then reboot:

sudo nvidia-xconfig --allow-empty-initial-configuration --enable-all-gpus --cool-bits=7

Start an X display, then run your nvidia-settings fan speed commands:

X :1 &
export DISPLAY=:1
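
For example, assuming GPU 0 with its first fan at index 0 (indices vary per card), you can then force the fan to 60%:

nvidia-settings -a "[gpu:0]/GPUFanControlState=1"
nvidia-settings -a "[fan:0]/GPUTargetFanSpeed=60"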

Or, a simpler way is to save the script below as fan.sh in your home directory, then make it executable with chmod a+x ~/fan.sh.

Usage: `~/fan.sh 50 50`, which sets the fan speed to 50% on both of the 2x RTX 4090 GPUs.

❯ cat fan.sh       
#!/bin/bash

# Check that both arguments are provided
if [ -z "$1" ] || [ -z "$2" ]; then
    echo "Usage: $0 <fan_speed_gpu0> <fan_speed_gpu1>"
    echo "Please provide fan speed percentages (0-100)."
    exit 1
fi

# Validate input (must be a number between 0 and 100)
if ! [[ "$1" =~ ^[0-9]+$ ]] || [ "$1" -lt 0 ] || [ "$1" -gt 100 ]; then
    echo "Error: Fan speed for GPU0 must be an integer between 0 and 100."
    exit 1
fi

if ! [[ "$2" =~ ^[0-9]+$ ]] || [ "$2" -lt 0 ] || [ "$2" -gt 100 ]; then
    echo "Error: Fan speed for GPU1 must be an integer between 0 and 100."
    exit 1
fi

FAN_SPEED=$1
FAN_SPEED_TWO=$2

# Ensure X server is running
if ! pgrep -x "Xorg" > /dev/null && ! pgrep -x "X" > /dev/null; then
    echo "X server not running, starting a new one..."
    export XDG_SESSION_TYPE=x11
    export DISPLAY=:0
    startx -- $DISPLAY &
    sleep 5
else
    echo "X server is already running."
    export DISPLAY=:0
fi

# Set fan control state and speed for GPU 0
echo "Setting fan speed to $FAN_SPEED% for GPU 0..."
nvidia-settings -a "[gpu:0]/GPUFanControlState=1"
nvidia-settings -a "[fan:0]/GPUTargetFanSpeed=$FAN_SPEED"
nvidia-settings -a "[fan:1]/GPUTargetFanSpeed=$FAN_SPEED"

# Set fan control state and speed for GPU 1
echo "Setting fan speed to $FAN_SPEED_TWO% for GPU 1..."
nvidia-settings -a "[gpu:1]/GPUFanControlState=1"
nvidia-settings -a "[fan:2]/GPUTargetFanSpeed=$FAN_SPEED_TWO"
nvidia-settings -a "[fan:3]/GPUTargetFanSpeed=$FAN_SPEED_TWO"

echo "Fan speed set to $FAN_SPEED% (GPU 0) and $FAN_SPEED_TWO% (GPU 1)."