Fix PySpark "contains a task of very large size. The maximum recommended task size is 1000 KiB"

If you see the following warning when running your notebook:

contains a task of very large size. The maximum recommended task size is 1000 KiB

This means PySpark is warning you to increase the number of partitions or the level of parallelism (and possibly the memory as well).

Example code to configure it, which you can adjust based on your workstation's memory. In my case, my maximum memory is 192 GB:

import os
os.environ['PYSPARK_SUBMIT_ARGS'] = '--driver-memory 192g --executor-memory 16g pyspark-shell'

# add this to your Spark configuration once the SparkSession ("spark") has been created
spark.conf.set("spark.sql.execution.arrow.pyspark.enabled", "true")

Full implementation

import os
from pyspark.sql import SparkSession

# submit arguments: driver/executor memory and executor cores (adjust to your machine)
os.environ['PYSPARK_SUBMIT_ARGS'] = '--driver-memory 192g --executor-memory 16g --executor-cores 10 pyspark-shell'
os.environ['PYARROW_IGNORE_TIMEZONE'] = '1'

builder = SparkSession.builder
# allow larger results to be collected back to the driver
builder = builder.config("spark.driver.maxResultSize", "5G")

spark = builder.master("local[*]").appName("FMClassifier_MovieLens").getOrCreate()
spark.conf.set("spark.sql.analyzer.failAmbiguousSelfJoin", "false")
# use Arrow for faster pandas <-> Spark conversion
spark.conf.set("spark.sql.execution.arrow.pyspark.enabled", "true")

As an additional bonus, if you would like to increase the number of executor instances as well:


# request 4 executor instances (takes effect when running on a cluster manager such as YARN)
spark = SparkSession.builder.config('spark.executor.instances', 4).getOrCreate()
spark.conf.set("spark.sql.analyzer.failAmbiguousSelfJoin", "false")
