Hadoop Dependencies Installation
Starting with version 3.2, Spark ships fewer Hadoop libraries in its downloadable distribution. Unfortunately, some
of the removed libraries are required by certain Flowman plugins (for example, the S3 and Delta plugins need the
hadoop-commons library). Because Flowman deliberately does not bundle these libraries either, you have to install
them yourself and place them in the $SPARK_HOME/jars folder.
Automated Installation
To simplify downloading the appropriate Hadoop libraries and placing them in the correct Spark directory,
Flowman provides a small script called install-hadoop-dependencies, which downloads and installs the missing jars:
export SPARK_HOME=your-spark-home
cd $FLOWMAN_HOME
bin/install-hadoop-dependencies
Note that you need write permissions for the $SPARK_HOME/jars directory, so you may need to run the script with
super-user privileges.
Also note that the script downloads and installs the Hadoop libraries in the version Flowman was built against, not in the version of the Hadoop libraries already present in your Spark installation.
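Both notes can be checked by hand before running the script. The following is a minimal sketch, not part of Flowman: the function names can_write_spark_jars and list_hadoop_jars are hypothetical helpers, and the sketch assumes that the Hadoop jars shipped with Spark follow the hadoop-*.jar naming pattern.

```shell
#!/bin/sh
# Pre-flight sketch. The function names are hypothetical and not part of Flowman.

# Succeeds if <spark-home>/jars exists and is writable by the current user.
can_write_spark_jars() {
    [ -d "$1/jars" ] && [ -w "$1/jars" ]
}

# Prints the file names of Hadoop jars already present in <spark-home>/jars,
# one per line. Assumes these jars are named hadoop-*.jar.
list_hadoop_jars() {
    for jar in "$1"/jars/hadoop-*.jar; do
        [ -e "$jar" ] || continue   # the glob matched nothing
        basename "$jar"
    done
}
```

For example, can_write_spark_jars "$SPARK_HOME" && list_hadoop_jars "$SPARK_HOME" prints the existing Hadoop jars only if the directory is writable. The version numbers in the printed file names show which Hadoop version your Spark installation already contains, so you can spot a mismatch with the libraries the script installs.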