Categories
Google Cloud

Fix Vertex AI Custom Job torch_xla $PJRT_DEVICE is not set.

This post covers an error that appears when running Vertex AI local-run with the GPU training image asia-docker.pkg.dev/vertex-ai/training/pytorch-gpu.2-3.py310:latest and the Transformers Trainer().

gcloud ai custom-jobs local-run --gpu --executor-image-uri=asia-docker.pkg.dev/vertex-ai/training/pytorch-gpu.2-3.py310:latest --local-package-path=YOUR_PYTHON_PACKAGE --script=YOUR_SCRIPT_PYTHON_FILE

The following error appears:

/opt/conda/lib/python3.10/site-packages/transformers/training_args.py:1575: FutureWarning: `evaluation_strategy` is deprecated and will be removed in version 4.46 of 🤗 Transformers. Use `eval_strategy` instead
  warnings.warn(
Setting up Trainer...
Starting training...
  0%|          | 0/3060 [00:00<?, ?it/s]terminate called after throwing an instance of 'std::runtime_error'
  what():  torch_xla/csrc/runtime/runtime.cc:31 : $PJRT_DEVICE is not set.

exit status 139
ERROR: (gcloud.ai.custom-jobs.local-run) 
        Docker failed with error code 139.
        Command: docker run --rm --runtime nvidia -v -e  --ipc host 

The error what(): torch_xla/csrc/runtime/runtime.cc:31 : $PJRT_DEVICE is not set. is apparently caused by a PyTorch issue.
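
One workaround often suggested for this message, and not something this post confirms, is to set the PJRT_DEVICE environment variable before torch_xla is loaded. The sketch below assumes the training script can be edited and that CUDA is the right value for a GPU container; ensure_pjrt_device is a hypothetical helper name.

```python
import os


def ensure_pjrt_device(default: str = "CUDA") -> str:
    """Set PJRT_DEVICE only if the environment has not already set it."""
    # Assumption: "CUDA" suits a GPU image; torch_xla also accepts "CPU" and "TPU".
    return os.environ.setdefault("PJRT_DEVICE", default)


# Call this at the very top of the training script, before importing
# transformers or torch_xla, so the runtime sees the variable at startup.
device = ensure_pjrt_device()
print(f"PJRT_DEVICE={device}")
```

Because setdefault is used, a value already exported in the container is left untouched.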

Categories
Google Cloud

Running Vertex AI Docker Training Locally

Pulling a Vertex AI training image and running it directly with docker run fails with an `exit 1` error. The quick solution is to use

gcloud ai custom-jobs local-run

Details are at https://cloud.google.com/vertex-ai/docs/training/containerize-run-code-local

Categories
Google Cloud

Fix Google Cloud Vertex AI Attempted to access the data pointer on an invalid python storage.

After training with transformers, calling model.save_pretrained(path) triggers this error on a Vertex AI Deep Learning VM. I’m using NVIDIA L4 instances and Jupyter Notebook.

The transformers version is not the cause: I tried several versions and the error was the same.

This error happens because the model is still in CUDA GPU memory. To fix it, move the model to the CPU first.

model.to('cpu')
model.save_pretrained('path')
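
The two calls above can be wrapped in a small helper so the move to CPU is never forgotten. This is a minimal sketch; save_on_cpu is a hypothetical name, and it assumes the object passed in exposes .to() and .save_pretrained() the way a transformers model does.

```python
def save_on_cpu(model, path: str) -> None:
    """Move the model to CPU, then write it with save_pretrained."""
    # Assumption: `model` is a transformers model, so it has
    # .to() and .save_pretrained() as used in the post.
    model.to("cpu")
    model.save_pretrained(path)


# Usage, with a trained model in scope:
#   save_on_cpu(model, "path")
```

If training continues afterwards, the model has to be moved back to the GPU, for example with model.to('cuda').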
Categories
Ubuntu

Fix docker: Error response from daemon: could not select device driver “” with capabilities: [[gpu]]

To fix the Docker GPU error “docker: Error response from daemon: could not select device driver “” with capabilities: [[gpu]]”, install the NVIDIA Container Toolkit:

curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg \
  && curl -s -L https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list | \
    sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | \
    sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list
sudo apt-get update
sudo apt-get install -y nvidia-container-toolkit
sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker

Reference:

https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/latest/install-guide.html#installing-with-yum-or-dnf
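
After the install, it can help to confirm that Docker actually registered the nvidia runtime. The sketch below is one way to check, assuming the docker CLI is on PATH and the daemon is running; has_nvidia_runtime and docker_runtimes_json are hypothetical helper names.

```python
import json
import subprocess


def has_nvidia_runtime(runtimes_json: str) -> bool:
    """Return True if "nvidia" is a key in Docker's runtime map."""
    runtimes = json.loads(runtimes_json) or {}
    return "nvidia" in runtimes


def docker_runtimes_json() -> str:
    """Ask the local Docker daemon for its registered runtimes as JSON."""
    result = subprocess.run(
        ["docker", "info", "--format", "{{json .Runtimes}}"],
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout


# Usage on a machine with Docker installed:
#   print(has_nvidia_runtime(docker_runtimes_json()))
```

The parsing is kept separate from the subprocess call so it can be checked without a Docker daemon.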

Categories
LLM

Solve Punica Installation and output must be a CUDA tensor error

Punica is a very interesting project that shows how to run multiple LoRA models on a single GPU. A few things need to be done to make it work locally and to avoid issues like

  • _kernels.rms_norm(o, x, w, eps) RuntimeError: output must be a CUDA tensor
  • /torch/utils/cpp_extension.py”, line 2120, in _run_ninja_build
  • raise RuntimeError(message) from e
  • RuntimeError: Error compiling objects for extension
  • error: subprocess-exited-with-error
  • rich modules not installed and so on

Here are the steps

  1. Change the NVCC version; I downgraded to CUDA 12.1.
  2. Install G++ and GCC (version 10)
MAX_GCC_VERSION=10
sudo apt install gcc-$MAX_GCC_VERSION g++-$MAX_GCC_VERSION
sudo update-alternatives --install /usr/bin/gcc gcc /usr/bin/gcc-$MAX_GCC_VERSION $MAX_GCC_VERSION

sudo apt install g++

  3. Install the torch build that matches your CUDA version

pip install torch==2.5.1+cu121 --index-url https://download.pytorch.org/whl/cu121

  4. Build from source!

pip install ninja numpy torch

# Clone punica
git clone https://github.com/punica-ai/punica.git
cd punica
git submodule sync
git submodule update --init

# If you encounter problems during compilation, set TORCH_CUDA_ARCH_LIST to your GPU's CUDA architecture.
# I'm using an RTX 4090 (Ada), which is 8.9. Check the value for your own GPU.
export TORCH_CUDA_ARCH_LIST="8.9"

# Build and install punica
pip install -v --no-build-isolation .

Why does building from source work? Because Punica has to compile its own CUDA kernel, SGMV (Segmented Gather Matrix-Vector multiplication).
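
To find the value for TORCH_CUDA_ARCH_LIST without looking it up, the compute capability can be read from torch. This is a minimal sketch assuming a CUDA-enabled torch is already installed; arch_list_entry and detect_arch_list_entry are hypothetical helper names.

```python
def arch_list_entry(capability: tuple[int, int]) -> str:
    """Format a (major, minor) compute capability, e.g. (8, 9) -> "8.9"."""
    major, minor = capability
    return f"{major}.{minor}"


def detect_arch_list_entry(device: int = 0) -> str:
    """Read the capability of a local GPU; requires torch with CUDA available."""
    import torch  # imported lazily so the formatter works without torch

    return arch_list_entry(torch.cuda.get_device_capability(device))


# Usage on the build machine:
#   print(detect_arch_list_entry())  # then export the printed value
```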

Categories
LLM

Fix VLLM Torch libnvJitLink.so.12 issue

Installing the latest vLLM or SGLang with pip install vllm -U triggers an error like the one below:

ImportError: python3.11/site-packages/torch/lib/../../nvidia/cusparse/lib/libcusparse.so.12: undefined symbol: __nvJitLinkComplete_12_4, version libnvJitLink.so.12

To solve this problem, uninstall the torch packages and reinstall torch 2.6.0 built for cu121. I’m using nvcc version 12.2 and an NVIDIA driver that reports CUDA 12.7.

pip uninstall torch torchvision torchaudio -y

pip install --pre torch torchvision torchaudio --index-url https://download.pytorch.org/whl/nightly/cu121
Categories
Machine Learning

Install P2P Dual RTX 4090 Ubuntu 24.04

Good news: P2P can be enabled on two or more RTX 4090s. Before it is enabled, the simpleP2P sample reports the result below. Don’t worry, the steps to enable it come next!

[./simpleP2P] - Starting...
Checking for multiple GPUs...
CUDA-capable device count: 2

Checking GPU(s) for support of peer to peer memory access...
> Peer access from NVIDIA GeForce RTX 4090 (GPU0) -> NVIDIA GeForce RTX 4090 (GPU1) : No
> Peer access from NVIDIA GeForce RTX 4090 (GPU1) -> NVIDIA GeForce RTX 4090 (GPU0) : No
Two or more GPUs with Peer-to-Peer access capability are required for ./simpleP2P.
Peer to Peer access is not available amongst GPUs in the system, waiving test.

First, enable Resizable BAR and disable IOMMU in the BIOS (I’m using an ASUS WRX80SAGE).

Next, uninstall all NVIDIA drivers (yes, that’s right!)

# Uninstall all nvidia
sudo apt-get --purge remove "*nvidia*"
sudo apt-get --purge remove "*cuda*" "*cudnn*" "*cublas*" "*cufft*" "*cufile*" "*curand*" "*cusolver*" "*cusparse*" "*gds-tools*" "*npp*" "*nvjpeg*" "nsight*" "*nvvm*" "*libnccl*"

# check that IOMMU is disabled (the directory should be empty)
ll /sys/class/iommu/

# install dependencies
sudo apt install git cmake

# reboot
sudo reboot
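
The ll /sys/class/iommu/ check above can also be scripted, for example to run it again after the reboot. This is a small sketch that treats an empty or missing directory as "IOMMU disabled"; that interpretation and the helper name iommu_active are assumptions.

```python
import os


def iommu_active(sysfs_dir: str = "/sys/class/iommu") -> bool:
    """Return True if the kernel lists at least one IOMMU device."""
    # A missing or empty directory is treated as "IOMMU disabled".
    return os.path.isdir(sysfs_dir) and bool(os.listdir(sysfs_dir))


print("IOMMU active:", iommu_active())
```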
Categories
LLM

Solve Unsloth AttributeError: ‘NoneType’ object has no attribute ‘attn_bias’

I got this error when running SFT training with Unsloth:

trainer.py:3731, in Trainer.compute_loss(self, model, inputs, return_outputs, num_items_in_batch)
   3729         loss_kwargs["num_items_in_batch"] = num_items_in_batch
   3730     inputs = {**inputs, **loss_kwargs}
-> 3731 outputs = model(**inputs)
   3732 # Save past state if it exists
...
    198     causal_mask = xformers.attn_bias.BlockDiagonalCausalMask\
    199         .from_seqlens([q_len]*bsz)\
    200         .make_local_attention(window_size = sliding_window)

AttributeError: 'NoneType' object has no attribute 'attn_bias'

The solution is

pip install pip3-autoremove
pip-autoremove torch torchvision torchaudio -y
pip install torch torchvision torchaudio xformers --index-url https://download.pytorch.org/whl/cu121
pip install unsloth
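
The traceback is consistent with xformers being missing or failing to import, so that the name ends up as None; that reading is an assumption, not something confirmed here. A quick way to see whether the packages are at least visible to Python after the reinstall is sketched below (module_available is a hypothetical helper name).

```python
import importlib.util


def module_available(name: str) -> bool:
    """Return True if the import system can locate a top-level module."""
    return importlib.util.find_spec(name) is not None


for pkg in ("torch", "xformers"):
    print(f"{pkg}: {'found' if module_available(pkg) else 'missing'}")
```

This only shows that the package can be located; a build that does not match the installed torch can still fail when it is actually imported.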
Categories
Tensorflow

Solve GPU not detected in Tensorflow 2

When you run this code, your NVIDIA GPU is not detected:

python3 -c "import tensorflow as tf; print(tf.config.list_physical_devices('GPU'))"

[]

The solution is

pip install "tensorflow[and-cuda]==2.15.1"
Categories
Ubuntu

Fix headphone bluetooth keep disconnecting Ubuntu

After setting up PipeWire with AAC support on Ubuntu 24.04, my Bose QuietComfort kept disconnecting. The log kept saying that the input failed to connect.

grep blue /var/log/syslog

I tried several things, such as reinstalling packages, setting AutoEnable=true and ControllerMode = bredr in /etc/bluetooth/main.conf, and setting `UserspaceHID=true` in /etc/bluetooth/input.conf, but none of them worked.

It turns out the solution at https://knowledgebase.frame.work/ubuntu-bluetooth-S1PGxfho works well!

sudo rm -r /var/lib/bluetooth/
rm ~/.config/pulse/* 
sudo apt reinstall --purge bluez gnome-bluetooth
systemctl restart bluetooth

This resets Bluetooth and removes your headphones from the paired devices. Then put your headphones into pairing mode.

Next, open the Bluetooth terminal:

bluetoothctl

Then find and pair the device:

power on
scan on
pair DEVICE_ID
devices
trust DEVICE_ID

If a notification asks for authorization, choose “yes”. A paired device that was never authorized is usually what causes this problem: the next time it connects, it runs into trouble.