
Zeppelin and Spark installation on Ubuntu 24.04

To install the latest Zeppelin (I recommend 0.11.0, since 0.11.1 has bugs) and Spark 3.5.1, we need to go through several steps:

  1. Required packages
sudo apt-get install openjdk-11-jdk build-essential 
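
To double-check that the JDK is picked up, you can print its version (it should report OpenJDK 11):

java -version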

2. Spark installation

wget -c https://dlcdn.apache.org/spark/spark-3.5.1/spark-3.5.1-bin-hadoop3.tgz

sudo tar -xvf spark-3.5.1-bin-hadoop3.tgz -C /opt/

sudo mv /opt/spark-3.5.1-bin-hadoop3 /opt/spark
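
Optionally, add SPARK_HOME to your ~/.bashrc so the Spark binaries are on your PATH (these two lines are my own habit, not required by Zeppelin):

export SPARK_HOME=/opt/spark
export PATH="$SPARK_HOME/bin:$PATH"

After opening a new shell or running source ~/.bashrc, spark-submit --version should report 3.5.1.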

3. Zeppelin installation

wget -c https://downloads.apache.org/zeppelin/zeppelin-0.11.0/zeppelin-0.11.0-bin-all.tgz

sudo tar -xvf zeppelin-0.11.0-bin-all.tgz -C /opt/

sudo mv /opt/zeppelin-0.11.0-bin-all /opt/zeppelin
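
If you prefer running Zeppelin under your own user instead of with sudo, you can hand the directory over to your account first (my preference, not a requirement):

sudo chown -R $USER:$USER /opt/zeppelin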

4. Install Mambaforge

wget -c https://github.com/conda-forge/miniforge/releases/download/24.1.2-0/Mambaforge-24.1.2-0-Linux-x86_64.sh

bash Mambaforge-24.1.2-0-Linux-x86_64.sh -b -p ~/anaconda3

~/anaconda3/bin/mamba init

source ~/.bashrc

mamba create -n test python=3.10
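
The Spark interpreter only needs the python binary from this environment, but you may also want pandas and pyarrow in it if you plan to convert Spark DataFrames with toPandas() (the package list is just my suggestion):

mamba install -n test pandas pyarrow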

5. Run and Configure Zeppelin

sudo /opt/zeppelin/bin/zeppelin-daemon.sh start
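
Zeppelin listens on port 8080 by default, so after a few seconds you can confirm the daemon is up before opening the UI:

curl -I http://localhost:8080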

Then go to the Interpreter settings (top-right menu), search for Spark, and adjust these properties:

  1. SPARK_HOME = /opt/spark
  2. PYSPARK_PYTHON = /home/dev/anaconda3/envs/test/bin/python
  3. PYSPARK_DRIVER_PYTHON = /home/dev/anaconda3/envs/test/bin/python

Now you are done, enjoy using Zeppelin on your Ubuntu machine!


Fix Zeppelin Spark-interpreter-0.11.1.jar and Scala

When installing Zeppelin 0.11.1 and Spark 3.3.3 via Docker from my GitHub repository ( https://github.com/yodiaditya/docker-rapids-spark-zeppelin ), I received this error:

Caused by: org.apache.zeppelin.interpreter.InterpreterException: Fail to open SparkInterpreter
	at org.apache.zeppelin.spark.SparkInterpreter.open(SparkInterpreter.java:140)
	at org.apache.zeppelin.interpreter.LazyOpenInterpreter.open(LazyOpenInterpreter.java:70)
	... 12 more
Caused by: scala.reflect.internal.FatalError: Error accessing /opt/zeppelin/interpreter/spark/._spark-interpreter-0.11.1.jar
	at scala.tools.nsc.classpath.AggregateClassPath.$anonfun$list$3(AggregateClassPath.scala:113)

Apparently, the solution for this is very simple.

All you need to do is delete the files that cause the problem. In my case:

rm /opt/zeppelin/interpreter/spark/._spark-interpreter-0.11.1.jar

rm /opt/zeppelin/interpreter/spark/scala-2.12/._spark-scala-2.12-0.11.1.jar
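
These ._* files appear to be AppleDouble metadata left behind by macOS archiving tools. If more of them show up elsewhere under the Zeppelin tree, a sweep like this removes them all (list first, then delete), and then restart the Spark interpreter:

find /opt/zeppelin -name "._*"
find /opt/zeppelin -name "._*" -delete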

Solve Pandas PerformanceWarning: DataFrame is highly fragmented.

When running toPandas() or similar operations, I received this warning:

/usr/lib/spark/python/pyspark/sql/pandas/conversion.py:186: PerformanceWarning: DataFrame is highly fragmented.  This is usually the result of calling `frame.insert` many times, which has poor performance.  Consider joining all columns at once using pd.concat(axis=1) instead. To get a de-fragmented frame, use `newframe = frame.copy()`
  df[column_name] = series
(the same warning repeats for every column being converted)

The quick solution is to enable Arrow-based conversion between Spark and pandas:

spark.conf.set("spark.sql.execution.arrow.pyspark.enabled", "true")
# the key below is the pre-Spark-3.0 name, kept for reference
# spark.conf.set("spark.sql.execution.arrow.enabled", "true")
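
If you want Arrow enabled for every session instead of per notebook, the same key can go into spark-defaults.conf (the path below assumes the /opt/spark layout from the installation post above; adjust it to your Spark home):

echo "spark.sql.execution.arrow.pyspark.enabled true" | sudo tee -a /opt/spark/conf/spark-defaults.conf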

Solve /sbin/ldconfig.real: /usr/local/cuda-11.8/targets/x86_64-linux/lib/libcudnn_ops_infer.so.8 is not a symbolic link

I received this error when installing packages on Ubuntu 23.10. To solve it, fix the cuDNN installation as follows:

  1. Check the ~/.bashrc
export PATH="/usr/local/cuda/bin:$PATH"
export LD_LIBRARY_PATH="/usr/local/cuda/lib64:$LD_LIBRARY_PATH"
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/cuda-11.8/lib64/

2. Copy cuDNN the right way (run these from the directory where you extracted the cuDNN archive)

sudo cp -av include/cudnn*.h /usr/local/cuda/include
sudo cp -av lib64/libcudnn* /usr/local/cuda/lib64
sudo chmod a+r /usr/local/cuda/include/cudnn*.h /usr/local/cuda/lib64/libcudnn*
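
Then refresh the linker cache; if the copy preserved the library symlinks, the warning should no longer appear:

sudo ldconfig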

This should fix the "not a symbolic link" problem!


Install Miniforge / Mamba to replace Anaconda in Ubuntu

Moving to Miniforge / Mamba makes package installation faster. The first step is to uninstall Anaconda from your machine.

Reverse any Anaconda init scripts

conda activate
conda init --reverse --all

Remove Anaconda folders

rm -rf ~/anaconda3
rm -rf ~/.conda
rm -rf ~/.condarc

The Miniforge / Mamba Installation

wget -c https://github.com/conda-forge/miniforge/releases/latest/download/Miniforge3-Linux-x86_64.sh
bash Miniforge3-Linux-x86_64.sh 
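
If you prefer a non-interactive install (same -b and -p flags as the Mambaforge step in the Zeppelin post; the target directory here is just my choice):

bash Miniforge3-Linux-x86_64.sh -b -p ~/miniforge3
~/miniforge3/bin/conda init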

Load the environment

source ~/.bashrc
conda install conda-libmamba-solver
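
You can also make libmamba the default solver so regular conda commands use it:

conda config --set solver libmamba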

Now you are good to go!