Installation Guide
This setup guide walks you through installing Apache Spark and Flowman on your local Linux box. If you are using Windows, you will find some hints for setting up the required “Hadoop Winutils”, but we generally recommend using Linux. You can also run a Flowman Docker image, which is the simplest way to get up to speed.
1. Requirements
Flowman brings many dependencies with the installation archive, but everything related to Hadoop or Spark needs to be provided by your platform. This approach ensures that the existing Spark and Hadoop installation is used together with all patches and extensions available on your platform. Specifically, this means that Flowman requires the following components to be present on your system:
Java 11 (or 1.8 when using Spark 2.4)
Apache Spark with a matching minor version
Apache Hadoop with a matching minor version
Note that Flowman can be built for different Hadoop and Spark versions, and the major and minor versions of the build need to match those of your platform.
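As a quick sanity check before installing anything else, you can verify which Java version is active on your machine; this is just a small sketch, and the output will of course differ from system to system:
# Verify that Java 11 (or 1.8 when targeting Spark 2.4) is the active JVM
java -version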
Download & Install Spark
As of this writing, the latest release of Flowman is 1.0.0, which is available prebuilt for Spark 3.3.2. Matching Spark distributions are provided on the Spark homepage, so we download the appropriate Spark distribution from the Apache archive and unpack it.
# Create a nice playground which doesn't mess up your system
mkdir playground
cd playground
# Download and unpack Spark & Hadoop
curl -L https://archive.apache.org/dist/spark/spark-3.3.2/spark-3.3.2-bin-hadoop3.tgz | tar xvzf -
# Create a nice link
ln -snf spark-3.3.2-bin-hadoop3 spark
The Spark package already contains Hadoop, so with this single download you already have both installed and integrated with each other.
Once you have installed Spark, you should set the environment variable SPARK_HOME, so Flowman can find it:
export SPARK_HOME=<your-spark-installation-directory>
It might be a good idea to add a corresponding line to your .bashrc or .profile.
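For example, you could persist the setting and verify it like this (the path is just a placeholder for your own playground directory):
# Make SPARK_HOME available in every new shell (adjust the path to your setup)
echo 'export SPARK_HOME="$HOME/playground/spark"' >> ~/.bashrc
source ~/.bashrc
# Quick check that Spark can be found
$SPARK_HOME/bin/spark-submit --version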
Download & Install Hadoop Utils for Windows
If you are trying to run Flowman on Windows, you also need the Hadoop Winutils, which is a set of DLLs required for the Hadoop libraries to work. You can get a copy at https://github.com/kontext-tech/winutils.
Once you have downloaded the appropriate version, you need to place the DLLs into a directory $HADOOP_HOME/bin, where HADOOP_HOME refers to some arbitrary location of your choice on your Windows PC. You also need to set the following environment variables:
HADOOP_HOME should point to the parent directory of the bin directory
PATH should also contain $HADOOP_HOME/bin
The documentation contains a dedicated section for Windows users.
2. Downloading Flowman
Since version 0.14.1, prebuilt releases are provided on the Flowman homepage and on GitHub. This is probably the simplest way to grab a working Flowman package. Note that for each release, different packages are provided for different Spark and Hadoop versions. The naming is straightforward:
flowman-dist-<version>-oss-spark<spark-version>-hadoop<hadoop-version>-bin.tar.gz
You simply have to use the package which fits the Spark and Hadoop versions of your environment. For example, the package for Flowman 1.0.0 with Spark 3.3 and Hadoop 3.3 would be
flowman-dist-1.0.0-oss-spark3.3-hadoop3.3-bin.tar.gz
and the full URL then would be
https://github.com/dimajix/flowman/releases/download/1.0.0/flowman-dist-1.0.0-oss-spark3.3-hadoop3.3-bin.tar.gz
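Downloading and unpacking the package then follows the same pattern as for Spark above (a small sketch; adjust the file name to the variant you actually picked):
# Download the Flowman package matching your Spark/Hadoop versions and unpack it
curl -L https://github.com/dimajix/flowman/releases/download/1.0.0/flowman-dist-1.0.0-oss-spark3.3-hadoop3.3-bin.tar.gz | tar xvzf -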
Supported Spark Environments
Flowman is available for many different Spark/Hadoop environments. The following variants are available:
Distribution | Spark | Hadoop | Java | Scala | Variant |
---|---|---|---|---|---|
Open Source | 2.4.8 | 2.6 | 1.8 | 2.11 | oss-spark2.4-hadoop2.6 |
Open Source | 2.4.8 | 2.7 | 1.8 | 2.11 | oss-spark2.4-hadoop2.7 |
Open Source | 3.0.3 | 2.7 | 11 | 2.12 | oss-spark3.0-hadoop2.7 |
Open Source | 3.0.3 | 3.2 | 11 | 2.12 | oss-spark3.0-hadoop3.2 |
Open Source | 3.1.2 | 2.7 | 11 | 2.12 | oss-spark3.1-hadoop2.7 |
Open Source | 3.1.2 | 3.2 | 11 | 2.12 | oss-spark3.1-hadoop3.2 |
Open Source | 3.2.3 | 2.7 | 11 | 2.12 | oss-spark3.2-hadoop2.7 |
Open Source | 3.2.3 | 3.3 | 11 | 2.12 | oss-spark3.2-hadoop3.3 |
Open Source | 3.3.2 | 2.7 | 11 | 2.12 | oss-spark3.3-hadoop2.7 |
Open Source | 3.3.2 | 3.3 | 11 | 2.12 | oss-spark3.3-hadoop3.3 |
Open Source | 3.4.0 | 3.3 | 11 | 2.12 | oss-spark3.4-hadoop3.3 |
AWS EMR 6.10 | 3.3.1 | 3.3 | 1.8 | 2.12 | emr6.10-spark3.3-hadoop3.3 |
Azure Synapse | 3.3.1 | 3.3 | 1.8 | 2.12 | synapse3.3-spark3.3-hadoop3.3 |
Cloudera CDH 6.3 | 2.4.0 | 3.0 | 1.8 | 2.11 | cdh6-spark2.4-hadoop3.0 |
Cloudera CDP 7.1 | 2.4.8 | 3.1 | 1.8 | 2.11 | cdp7-spark2.4-hadoop3.1 |
Cloudera CDP 7.1 | 3.2.1 | 3.1 | 11 | 2.12 | cdp7-spark3.2-hadoop3.1 |
Cloudera CDP 7.1 | 3.3.0 | 3.1 | 11 | 2.12 | cdp7-spark3.3-hadoop3.1 |
Building Flowman
As an alternative to downloading a prebuilt distribution of Flowman, you might also want to build Flowman yourself in order to match your environment. This is not a difficult task for anyone with basic Maven experience.
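A build typically follows the usual Maven workflow, sketched below; note that the Maven profile names used here for selecting a Spark/Hadoop combination are assumptions, so please check the build instructions in the repository for the profiles that actually exist:
# Clone the sources and build a distribution (profile names are assumptions)
git clone https://github.com/dimajix/flowman.git
cd flowman
mvn clean install -DskipTests -Pspark-3.3 -Phadoop-3.3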
3. Local Installation
Flowman is distributed as a tar.gz file, which simply needs to be extracted at some location on your computer or server. This can be done via
tar xvzf flowman-dist-X.Y.Z-bin.tar.gz
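Optionally, you can create a version-independent symlink and point FLOWMAN_HOME at it, analogous to SPARK_HOME above; Flowman normally detects its home directory automatically, so this is only a convenience (X.Y.Z is a placeholder for the version you actually downloaded):
# Optional: create a stable link and export FLOWMAN_HOME
ln -snf flowman-dist-X.Y.Z flowman
export FLOWMAN_HOME=$(pwd)/flowman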
Directory Layout
Once you downloaded and unpacked Flowman, you will get a new directory which looks as follows:
├── bin
├── conf
├── examples
│ ├── sftp-upload
│ │ ├── config
│ │ ├── data
│ │ └── job
│ └── weather
│ ├── config
│ ├── job
│ ├── mapping
│ ├── model
│ └── target
├── lib
├── libexec
├── yaml-schema
└── plugins
├── flowman-aws
├── flowman-azure
├── flowman-impala
├── flowman-kafka
├── flowman-mariadb
└── flowman-...
The bin directory contains the Flowman executables
The conf directory contains global configuration files
The lib directory contains all Java jars
The libexec directory contains some internal helper scripts
The plugins directory contains more subdirectories, each containing a single plugin
The yaml-schema directory contains YAML schema files for syntax highlighting and auto-completion inside the code editor of your choice
The examples directory contains some example projects
4. Install additional Hadoop dependencies
Starting with version 3.2, Spark has reduced the number of Hadoop libraries which are part of the downloadable Spark distribution. Unfortunately, some of the libraries which have been removed are required by some Flowman plugins (for example, the S3 and Delta plugins need the hadoop-commons library). Since Flowman, for good reasons, does not include these missing libraries either, you have to install them yourself and put them into the $SPARK_HOME/jars folder.
In order to simplify getting the appropriate Hadoop libraries and placing them into the correct Spark directory, Flowman provides a small script called install-hadoop-dependencies, which will download and install the missing jars:
export SPARK_HOME=your-spark-home
cd $FLOWMAN_HOME
bin/install-hadoop-dependencies
Note that you need appropriate write permissions for the $SPARK_HOME/jars directory, so you may need to execute this with super-user privileges.
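On a system where Spark is installed in a root-owned directory, the call might look like the following sketch; the -E flag simply preserves your SPARK_HOME and FLOWMAN_HOME environment variables for the script (whether this is necessary depends on your sudo configuration):
# Run the helper script with elevated privileges, keeping the current environment
sudo -E bin/install-hadoop-dependencies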
5. Configuration (optional)
You probably need to perform some basic global configuration of Flowman. The relevant files are stored in the conf directory.
flowman-env.sh
The flowman-env.sh script sets up an execution environment on a system level. Here some very fundamental Spark and Hadoop properties can be configured, for example:
SPARK_HOME, HADOOP_HOME and related environment variables
KRB_PRINCIPAL and KRB_KEYTAB for using Kerberos
Generic Java options like HTTP proxy and more
Example
#!/usr/bin/env bash
# Specify Java home (just in case)
#
#export JAVA_HOME
# Explicitly override Flowman's home. These settings are detected automatically,
# but can be overridden
#
#export FLOWMAN_HOME
#export FLOWMAN_CONF_DIR
# Specify any environment settings and paths
#
#export SPARK_HOME
#export HADOOP_HOME
#export HADOOP_CONF_DIR=${HADOOP_CONF_DIR=$HADOOP_HOME/conf}
#export YARN_HOME
#export HDFS_HOME
#export MAPRED_HOME
#export HIVE_HOME
#export HIVE_CONF_DIR=${HIVE_CONF_DIR=$HIVE_HOME/conf}
# Set the Kerberos principal in YARN cluster
#
#KRB_PRINCIPAL=
#KRB_KEYTAB=
# Specify the YARN queue to use
#
#YARN_QUEUE=
# Use a different spark-submit (for example spark2-submit in Cloudera)
#
#SPARK_SUBMIT=
# Apply any proxy settings from the system environment
#
if [[ "$PROXY_HOST" != "" ]]; then
SPARK_DRIVER_JAVA_OPTS="
-Dhttp.proxyHost=${PROXY_HOST}
-Dhttp.proxyPort=${PROXY_PORT}
-Dhttps.proxyHost=${PROXY_HOST}
-Dhttps.proxyPort=${PROXY_PORT}
$SPARK_DRIVER_JAVA_OPTS"
SPARK_EXECUTOR_JAVA_OPTS="
-Dhttp.proxyHost=${PROXY_HOST}
-Dhttp.proxyPort=${PROXY_PORT}
-Dhttps.proxyHost=${PROXY_HOST}
-Dhttps.proxyPort=${PROXY_PORT}
$SPARK_EXECUTOR_JAVA_OPTS"
SPARK_OPTS="
--conf spark.hadoop.fs.s3a.proxy.host=${PROXY_HOST}
--conf spark.hadoop.fs.s3a.proxy.port=${PROXY_PORT}
$SPARK_OPTS"
fi
# Set AWS credentials if required. You can also specify these in project config
#
if [[ "$AWS_ACCESS_KEY_ID" != "" ]]; then
SPARK_OPTS="
--conf spark.hadoop.fs.s3a.access.key=${AWS_ACCESS_KEY_ID}
--conf spark.hadoop.fs.s3a.secret.key=${AWS_SECRET_ACCESS_KEY}
$SPARK_OPTS"
fi
system.yml
After the execution environment has been set up, the system.yml is the first configuration file processed by the Java application. Its main purpose is to load some fundamental plugins, which are already required by the next level of configuration.
Example
plugins:
- flowman-impala
default-namespace.yml
On top of the global settings, Flowman also supports so-called namespaces. Each project is executed within the context of one namespace, which is the default namespace if nothing else is specified. Each namespace contains some configuration, such that different namespaces might represent different tenants or different staging environments.
Example
name: "default"
history:
  kind: jdbc
  connection: flowman_state
  retries: 3
  timeout: 1000
hooks:
  - kind: web
    jobSuccess: http://some-host.in.your.net/success&job=$URL.encode($job)&force=$force
connections:
  flowman_state:
    driver: $System.getenv('FLOWMAN_HISTORY_DRIVER', 'org.apache.derby.jdbc.EmbeddedDriver')
    url: $System.getenv('FLOWMAN_HISTORY_URL', $String.concat('jdbc:derby:', $System.getenv('FLOWMAN_HOME'), '/logdb;create=true'))
    username: $System.getenv('FLOWMAN_HISTORY_USER', '')
    password: $System.getenv('FLOWMAN_HISTORY_PASSWORD', '')
config:
  - spark.sql.warehouse.dir=$System.getenv('FLOWMAN_HOME')/hive/warehouse
  - hive.metastore.uris=
  - javax.jdo.option.ConnectionURL=jdbc:derby:;databaseName=$System.getenv('FLOWMAN_HOME')/hive/db;create=true
  - datanucleus.rdbms.datastoreAdapterClassName=org.datanucleus.store.rdbms.adapter.DerbyAdapter
plugins:
  - flowman-aws
  - flowman-azure
  - flowman-kafka
  - flowman-mariadb
store:
  kind: file
  location: $System.getenv('FLOWMAN_HOME')/examples
log4j.properties
In order to control the console (logging) output at a very detailed level, you can provide your own version of a Log4j configuration file inside the conf directory. You will find templates both for Log4j 1.x and 2.x.
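A common approach is to start from one of the shipped templates; the template file names below are assumptions and may differ between Flowman versions, so check the contents of your conf directory first:
# Copy a shipped Log4j template and adjust log levels as needed (file names are assumptions)
cd $FLOWMAN_HOME/conf
cp log4j2.properties.template log4j2.properties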
6. Running Flowman
Now that you have installed Spark and Flowman, you can easily start Flowman via
cd <your-flowman-directory>
export SPARK_HOME=<your-spark-directory>
bin/flowshell -f examples/weather
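Inside the Flowman shell you can then build the example project; the session below is only a minimal sketch (the job name main matches the weather example, but verify it in your copy, and the command for leaving the shell may be exit or quit depending on the version):
# Inside flowshell: build the example job, then leave the shell
job build main
exit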