Categories
Google Cloud

Fix Vertex AI Custom Job torch_xla $PJRT_DEVICE is not set.

Fix the problem of running Vertex AI local-run with the GPU-based training Docker image asia-docker.pkg.dev/vertex-ai/training/pytorch-gpu.2-3.py310:latest, which produces an error with the Transformers Trainer().

gcloud ai custom-jobs local-run --gpu --executor-image-uri=asia-docker.pkg.dev/vertex-ai/training/pytorch-gpu.2-3.py310:latest --local-package-path=YOUR_PYTHON_PACKAGE --script=YOUR_SCRIPT_PYTHON_FILE

The error appears as:

/opt/conda/lib/python3.10/site-packages/transformers/training_args.py:1575: FutureWarning: `evaluation_strategy` is deprecated and will be removed in version 4.46 of 🤗 Transformers. Use `eval_strategy` instead
  warnings.warn(
Setting up Trainer...
Starting training...
  0%|          | 0/3060 [00:00<?, ?it/s]terminate called after throwing an instance of 'std::runtime_error'
  what():  torch_xla/csrc/runtime/runtime.cc:31 : $PJRT_DEVICE is not set.

exit status 139
ERROR: (gcloud.ai.custom-jobs.local-run) 
        Docker failed with error code 139.
        Command: docker run --rm --runtime nvidia -v -e  --ipc host 

This problem (what():  torch_xla/csrc/runtime/runtime.cc:31 : $PJRT_DEVICE is not set.) apparently happens because the prebuilt PyTorch image ships with torch_xla, which the Transformers Trainer picks up, and the XLA runtime then aborts because the PJRT_DEVICE environment variable is not set.
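
A workaround sketch that follows directly from the error message (not an official fix; adjust to your setup) is to set PJRT_DEVICE yourself before torch_xla gets initialized, for example at the very top of the training script:

# Workaround sketch: define PJRT_DEVICE before anything initializes torch_xla.
# "CUDA" is one of the device types accepted by torch_xla's PJRT runtime.
import os
os.environ.setdefault("PJRT_DEVICE", "CUDA")

# Import transformers only after the environment variable is set
from transformers import Trainer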

Categories
Google Cloud

Running Vertex AI Docker Training Locally

Pulling the Vertex AI training Docker image and running it locally with a plain docker run will trigger an `exit 1` error. The quick solution is to use

gcloud ai custom-jobs local-run

The details are at https://cloud.google.com/vertex-ai/docs/training/containerize-run-code-local
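
For example, to run a GPU training package locally with a prebuilt image (the same flags as in the first post; replace the package path and script with your own):

gcloud ai custom-jobs local-run \
  --gpu \
  --executor-image-uri=asia-docker.pkg.dev/vertex-ai/training/pytorch-gpu.2-3.py310:latest \
  --local-package-path=YOUR_PYTHON_PACKAGE \
  --script=YOUR_SCRIPT_PYTHON_FILE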

Categories
Google Cloud

Fix Google Cloud Vertex AI Attempted to access the data pointer on an invalid python storage.

After training with Transformers, calling model.save_pretrained(path) triggers this error on a Vertex AI Deep Learning VM. I'm using an NVIDIA L4 instance and Jupyter Notebook.

This problem is not caused by the transformers version; I tried several versions.

This error happens because the model is still in CUDA GPU memory. To fix it, move the model to the CPU first:

# Move the model off the GPU before serializing it
model.to('cpu')
model.save_pretrained('path')
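
If you still need the model on the GPU afterwards (for evaluation or further training), move it back once the save is done; a minimal sketch assuming a CUDA device is available:

model.to('cpu')
model.save_pretrained('path')
model.to('cuda')  # optional: return the model to the GPU to keep using it
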
Categories
Ubuntu

Fix docker: Error response from daemon: could not select device driver “” with capabilities: [[gpu]]

To fix the error "docker: Error response from daemon: could not select device driver "" with capabilities: [[gpu]]" when running a Docker container with the GPU, you need to install the NVIDIA Container Toolkit:

curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg \
  && curl -s -L https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list | \
    sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | \
    sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list
sudo apt-get update
sudo apt-get install -y nvidia-container-toolkit
sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker
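
You can verify the setup by running a sample workload that prints the GPU information:

sudo docker run --rm --runtime=nvidia --gpus all ubuntu nvidia-smi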

Reference:

https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/latest/install-guide.html#installing-with-yum-or-dnf

Categories
LLM

Solve Punica Installation and output must be a CUDA tensor error

Punica is a very interesting project that shows how to run multiple LoRA models on a single GPU. A few things need to be done to make this project work locally and to avoid issues like:

  • _kernels.rms_norm(o, x, w, eps) RuntimeError: output must be a CUDA tensor
  • /torch/utils/cpp_extension.py", line 2120, in _run_ninja_build
  • raise RuntimeError(message) from e
  • RuntimeError: Error compiling objects for extension
  • error: subprocess-exited-with-error
  • the rich module not being installed, and so on

Here are the steps

  1. Change the NVCC version. I downgraded mine to CUDA 12.1 (see the version check after step 2 below).
  2. Install GCC and G++ (version 10):
# Install GCC/G++ 10 and point the gcc alternative at gcc-10
MAX_GCC_VERSION=10
sudo apt install gcc-$MAX_GCC_VERSION g++-$MAX_GCC_VERSION
sudo update-alternatives --install /usr/bin/gcc gcc /usr/bin/gcc-$MAX_GCC_VERSION $MAX_GCC_VERSION

sudo apt install g++
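
To confirm which compiler and CUDA toolkit the build will pick up (relevant to steps 1 and 2):

# Check the toolchain versions the build will use
nvcc --version
gcc --version
g++ --version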

3. Install the right torch version based on your CUDA version

pip install torch==2.5.1+cu121 --index-url https://download.pytorch.org/whl/cu121
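
A quick sanity check that this torch build actually sees the GPU and reports the expected CUDA version:

python -c "import torch; print(torch.cuda.is_available(), torch.version.cuda)"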

4. Build from source!

pip install ninja numpy torch

# Clone punica
git clone https://github.com/punica-ai/punica.git
cd punica
git submodule sync
git submodule update --init

# If you encounter problems during compilation, set TORCH_CUDA_ARCH_LIST to your GPU architecture.
# I'm using an RTX 4090 (Ada), so the compute capability is 8.9. Check yours.
export TORCH_CUDA_ARCH_LIST="8.9" 

# Build and install punica
pip install -v --no-build-isolation .
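
After the build finishes, a quick smoke test (assuming the package is importable as punica):

python -c "import punica; print('punica imported successfully')"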

Why does building from source work? Because Punica relies on a new custom CUDA kernel design, SGMV (Segmented Gather Matrix-Vector multiplication), which has to be compiled for your GPU.