
Zeppelin and Spark installation on Ubuntu 24.04

To install the latest Zeppelin (I recommend 0.11.0, since 0.11.1 has bugs) and Spark 3.5.1, we need to go through several steps:

  1. Required packages
sudo apt-get install openjdk-11-jdk build-essential 
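
To double-check that the JDK is picked up, you can print its version (it should report OpenJDK 11):

java -version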

2. Spark installation

wget -c https://dlcdn.apache.org/spark/spark-3.5.1/spark-3.5.1-bin-hadoop3.tgz

sudo tar -xvf spark-3.5.1-bin-hadoop3.tgz -C /opt/

sudo mv /opt/spark-3.5.1-bin-hadoop3 /opt/spark
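
Optionally, add SPARK_HOME to your ~/.bashrc so the Spark binaries are on your PATH (these two lines are my own habit, not required by Zeppelin):

export SPARK_HOME=/opt/spark
export PATH="$SPARK_HOME/bin:$PATH"

After opening a new shell or running source ~/.bashrc, spark-submit --version should report 3.5.1.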

3. Zeppelin installation

wget -c https://downloads.apache.org/zeppelin/zeppelin-0.11.0/zeppelin-0.11.0-bin-all.tgz

sudo tar -xvf zeppelin-0.11.0-bin-all.tgz -C /opt/

sudo mv /opt/zeppelin-0.11.0-bin-all /opt/zeppelin
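
If you prefer running Zeppelin under your own user instead of with sudo, you can hand the directory over to your account first (my preference, not a requirement):

sudo chown -R $USER:$USER /opt/zeppelin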

4. Install Mambaforge

wget -c https://github.com/conda-forge/miniforge/releases/download/24.1.2-0/Mambaforge-24.1.2-0-Linux-x86_64.sh

bash Mambaforge-24.1.2-0-Linux-x86_64.sh -b -p ~/anaconda3

~/anaconda3/bin/mamba init

source ~/.bashrc

mamba create -n test python=3.10
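
The Spark interpreter only needs the python binary from this environment, but you may also want pandas and pyarrow in it if you plan to convert Spark DataFrames with toPandas() (the package list is just my suggestion):

mamba install -n test pandas pyarrow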

5. Run and Configure Zeppelin

sudo /opt/zeppelin/bin/zeppelin-daemon.sh start
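
Zeppelin listens on port 8080 by default, so after a few seconds you can confirm the daemon is up before opening the UI:

curl -I http://localhost:8080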

Then go to the Interpreter settings (top-right menu), search for Spark, and adjust these properties:

  1. SPARK_HOME = /opt/spark
  2. PYSPARK_PYTHON = /home/dev/anaconda3/envs/test/bin/python
  3. PYSPARK_DRIVER_PYTHON = /home/dev/anaconda3/envs/test/bin/python

Now you are done, enjoy using Zeppelin on your Ubuntu machine!


Fix Zeppelin Spark-interpreter-0.11.1.jar and Scala

When installing Zeppelin 0.11.1 and Spark 3.3.3 via Docker from my GitHub repository ( https://github.com/yodiaditya/docker-rapids-spark-zeppelin ), I received this error:

Caused by: org.apache.zeppelin.interpreter.InterpreterException: Fail to open SparkInterpreter
	at org.apache.zeppelin.spark.SparkInterpreter.open(SparkInterpreter.java:140)
	at org.apache.zeppelin.interpreter.LazyOpenInterpreter.open(LazyOpenInterpreter.java:70)
	... 12 more
Caused by: scala.reflect.internal.FatalError: Error accessing /opt/zeppelin/interpreter/spark/._spark-interpreter-0.11.1.jar
	at scala.tools.nsc.classpath.AggregateClassPath.$anonfun$list$3(AggregateClassPath.scala:113)

Apparently, the solution for this is very simple.

All you need to do is delete the files that cause the problem. In my case:

rm /opt/zeppelin/interpreter/spark/._spark-interpreter-0.11.1.jar

rm /opt/zeppelin/interpreter/spark/scala-2.12/._spark-scala-2.12-0.11.1.jar
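
These ._* files appear to be AppleDouble metadata left behind by macOS archiving tools. If more of them show up elsewhere under the Zeppelin tree, a sweep like this removes them all (list first, then delete), and then restart the Spark interpreter:

find /opt/zeppelin -name "._*"
find /opt/zeppelin -name "._*" -delete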

Solve Pandas PerformanceWarning: DataFrame is highly fragmented.

When running toPandas() or similar operations, I received this warning:

/usr/lib/spark/python/pyspark/sql/pandas/conversion.py:186: PerformanceWarning: DataFrame is highly fragmented.  This is usually the result of calling `frame.insert` many times, which has poor performance.  Consider joining all columns at once using pd.concat(axis=1) instead. To get a de-fragmented frame, use `newframe = frame.copy()`
  df[column_name] = series
(the same warning repeats for every column being converted)

The quick solution is to enable Arrow-based conversion between Spark and pandas:

spark.conf.set("spark.sql.execution.arrow.pyspark.enabled", "true")
# the key below is the pre-Spark-3.0 name, kept for reference
# spark.conf.set("spark.sql.execution.arrow.enabled", "true")
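
If you want Arrow enabled for every session instead of per notebook, the same key can go into spark-defaults.conf (the path below assumes the /opt/spark layout from the installation post above; adjust it to your Spark home):

echo "spark.sql.execution.arrow.pyspark.enabled true" | sudo tee -a /opt/spark/conf/spark-defaults.conf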

Solve /sbin/ldconfig.real: /usr/local/cuda-11.8/targets/x86_64-linux/lib/libcudnn_ops_infer.so.8 is not a symbolic link

I received this error when installing packages on Ubuntu 23.10. To solve it, fix the cuDNN installation as follows:

  1. Check the ~/.bashrc
export PATH="/usr/local/cuda/bin:$PATH"
export LD_LIBRARY_PATH="/usr/local/cuda/lib64:$LD_LIBRARY_PATH"
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/cuda-11.8/lib64/

2. Copy cuDNN the right way (run these from the directory where you extracted the cuDNN archive)

sudo cp -av include/cudnn*.h /usr/local/cuda/include
sudo cp -av lib64/libcudnn* /usr/local/cuda/lib64
sudo chmod a+r /usr/local/cuda/include/cudnn*.h /usr/local/cuda/lib64/libcudnn*
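
Then refresh the linker cache; if the copy preserved the library symlinks, the warning should no longer appear:

sudo ldconfig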

This should fix the "not a symbolic link" problem!


Install Miniforge / Mamba to replace Anaconda in Ubuntu

Moving to Miniforge / Mamba makes package installation faster. The first step is to uninstall Anaconda from your machine.

Reverse any Anaconda init scripts

conda activate
conda init --reverse --all

Remove Anaconda folders

rm -rf ~/anaconda3
rm -rf ~/.conda
rm -rf ~/.condarc

The Miniforge / Mamba Installation

wget -c https://github.com/conda-forge/miniforge/releases/latest/download/Miniforge3-Linux-x86_64.sh
bash Miniforge3-Linux-x86_64.sh 
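
If you prefer a non-interactive install (same -b and -p flags as the Mambaforge step in the Zeppelin post; the target directory here is just my choice):

bash Miniforge3-Linux-x86_64.sh -b -p ~/miniforge3
~/miniforge3/bin/conda init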

Load the environment

source ~/.bashrc
conda install conda-libmamba-solver
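
You can also make libmamba the default solver so regular conda commands use it:

conda config --set solver libmamba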

Now you are good to go!