Lesson 12 — Deployment
So far, we have seen how to create, execute and debug a project in Flowman. But this still leaves open the question
of what a development workflow could look like. Of course, you could simply install Flowman on some server and then
copy all project files to the server before using flowexec to execute some jobs.
But Flowman also offers a more streamlined process using Apache Maven as a build system. This workflow integrates easily into an existing CI/CD infrastructure.
1. What to Expect
Objectives
You will learn a robust development workflow, including creating and deploying artifacts
You will know how to use the Flowman Maven plugin
You can find the full source code of this lesson on GitHub
Description
Since this lesson focuses on the workflow and not on core features, we will reuse the project from lesson 5.
We will restructure the project to be processed by the Flowman Maven plugin. The result will support the following workflow:
Development on your local machine.
Build a deployable artifact. This can be done on your local machine, but also on some CI/CD server like Jenkins.
Deploy artifact to some remote location.
Prerequisites
This lesson is not executed within the Docker container. It should be executed directly on your local machine. You need Java 11 and Maven installed on your machine.
2. Project Setup
In order to use Maven with the Flowman plugin, we need to slightly restructure the project: We move all
project-related files into a subdirectory weather (the name of the project). We will also add a directory conf
containing the default-namespace.yml configuration file. Finally, we add the files pom.xml for Maven and
deployment.yml for the Flowman Maven plugin.
2.1 Project Structure
The final directory structure looks as follows:
├── conf
│   └── default-namespace.yml
├── weather
│   ├── config
│   │   ├── aws.yml
│   │   ...
│   ├── job
│   │   └── main.yml
│   ├── mapping
│   │   ├── measurements.yml
│   │   ...
│   ├── model
│   │   ├── measurements-raw.yml
│   │   ...
│   ├── project.yml
│   ├── schema
│   │   ├── measurements.json
│   │   ...
│   └── target
│       ├── aggregates.yml
│       ...
├── deployment.yml
├── pom.xml
└── README.md
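Before running Maven, it can be useful to check that the restructured project contains the files this lesson relies on. The following is only a minimal sketch: the function name check_layout is hypothetical and not part of Flowman, and the file list is taken from the description above, with the deployment descriptor named deployment.yml as referenced in the pom.xml.

```shell
# Hypothetical helper (not part of Flowman): verify that the files named in
# this lesson exist below a given project root before invoking Maven.
check_layout() {
    root="${1:-.}"
    missing=0
    for path in conf/default-namespace.yml weather/project.yml deployment.yml pom.xml; do
        if [ ! -f "$root/$path" ]; then
            echo "missing: $path" >&2
            missing=1
        fi
    done
    # Exit status 0 means all files were found, 1 means at least one is missing
    return "$missing"
}
```

Calling check_layout without an argument checks the current directory. Each missing file is reported on standard error.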
2.2 Maven Build Process
The pom.xml generated by the archetype will look as follows:
<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
    <modelVersion>4.0.0</modelVersion>

    <groupId>com.dimajix.flowman.tutorial</groupId>
    <artifactId>flowman-tutorial-weather</artifactId>
    <version>1.0.0-SNAPSHOT</version>
    <packaging>pom</packaging>
    <name>Flowman Weather Data</name>
    <description>Small demo project for Flowman using publicly available weather data</description>

    <properties>
        <!-- Encoding related settings -->
        <encoding>UTF-8</encoding>
        <project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
        <project.reporting.outputEncoding>UTF-8</project.reporting.outputEncoding>
        <flowman.version>1.0.0</flowman.version>
    </properties>

    <build>
        <plugins>
            <plugin>
                <groupId>com.dimajix.flowman.maven</groupId>
                <artifactId>flowman-maven-plugin</artifactId>
                <version>0.3.0</version>
                <extensions>true</extensions>
                <configuration>
                    <deploymentDescriptor>deployment.yml</deploymentDescriptor>
                </configuration>
            </plugin>
        </plugins>
    </build>
</project>
As you can see, the Maven project looks almost trivial, but the flowman-maven-plugin takes care of a lot of
functionality, such as running the tests and building the packages.
2.3 Deployment Descriptor
In addition to the Maven pom.xml, you will also find a deployment.yml file, which contains the packaging details
for the Flowman Maven plugin. Its contents look as follows:
flowman:
  version: ${flowman.version}
  plugins:
    - flowman-avro
    - flowman-aws

# List of subdirectories containing Flowman projects
projects:
  - weather

# List of packages to be built
packages:
  # The first package is called "dist"
  dist:
    kind: dist

  # The second package is called "jar"
  jar:
    # The package is a "fatjar" package, i.e. a single jar file containing both Flowman and your project
    kind: fatjar

execution:
  javaOptions:
    - -Dhttp.proxyHost=${http.proxyHost}
    - -Dhttp.proxyPort=${http.proxyPort}
    - -Dhttps.proxyHost=${https.proxyHost}
    - -Dhttps.proxyPort=${https.proxyPort}
This deployment descriptor will create two packages, using the Maven coordinates (groupId, artifactId and version)
of the pom.xml file. Each package is published under its own classifier:
The jar package will create a Maven artifact with coordinates com.dimajix.flowman.tutorial:flowman-tutorial-weather:1.0.0-SNAPSHOT:jar:jar, i.e.
Property | Value |
---|---|
groupId | com.dimajix.flowman.tutorial |
artifactId | flowman-tutorial-weather |
version | 1.0.0-SNAPSHOT |
classifier | jar |
packaging | jar |
The jar file is a so-called “fat jar” and contains both the Flowman code and your project files. This self-contained
file can be run directly with spark-submit.
The dist package will create a Maven artifact with coordinates com.dimajix.flowman.tutorial:flowman-tutorial-weather:1.0.0-SNAPSHOT:tar.gz:dist, i.e.
Property | Value |
---|---|
groupId | com.dimajix.flowman.tutorial |
artifactId | flowman-tutorial-weather |
version | 1.0.0-SNAPSHOT |
classifier | dist |
packaging | tar.gz |
The dist package will create a tar.gz file, which contains all Flowman libraries, executables and plugins along
with your project. To run Flowman from this package, you first need to unpack the tar.gz file and then use
the Flowman binaries like flowexec.
We will later use these Maven coordinates in the deployment step to retrieve the desired artifact from the artifact repository (like Nexus).
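To make the structure of such a coordinate string explicit, the following sketch splits it into the five fields listed in the tables above. The function name describe_coords is hypothetical, the field order groupId:artifactId:version:packaging:classifier is assumed to be the one used for fetching artifacts later in this lesson, and the version in the example call is the one declared in the pom.xml.

```shell
# Hypothetical helper: split Maven coordinates of the form
#   groupId:artifactId:version:packaging:classifier
# into one "name=value" line per field.
describe_coords() {
    printf '%s\n' "$1" | {
        # IFS is set to ':' for this read only, so the string is split at the colons
        IFS=: read -r group_id artifact_id version packaging classifier
        printf 'groupId=%s\nartifactId=%s\nversion=%s\npackaging=%s\nclassifier=%s\n' \
            "$group_id" "$artifact_id" "$version" "$packaging" "$classifier"
    }
}

describe_coords com.dimajix.flowman.tutorial:flowman-tutorial-weather:1.0.0-SNAPSHOT:tar.gz:dist
```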
3. Building
Once you are happy with your results, you can build a self-contained redistributable package with Maven via
mvn clean install
This will run all tests and create (possibly multiple) packages inside the target directory. The type and
details of the packages are defined in the deployment.yml file. The example above will create the following two
artifacts:
The jar package will create a Maven artifact with coordinates com.dimajix.flowman.tutorial:flowman-tutorial-weather:1.0.0-SNAPSHOT:jar:jar, i.e.
Property | Value |
---|---|
groupId | com.dimajix.flowman.tutorial |
artifactId | flowman-tutorial-weather |
version | 1.0.0-SNAPSHOT |
classifier | jar |
packaging | jar |
The dist package will create a Maven artifact with coordinates com.dimajix.flowman.tutorial:flowman-tutorial-weather:1.0.0-SNAPSHOT:tar.gz:dist, i.e.
Property | Value |
---|---|
groupId | com.dimajix.flowman.tutorial |
artifactId | flowman-tutorial-weather |
version | 1.0.0-SNAPSHOT |
classifier | dist |
packaging | tar.gz |
Which type of package is preferable (dist or fatjar) depends on your infrastructure and deployment pipelines. People
with a dedicated Hadoop cluster (Cloudera, AWS EMR) will probably be happy with a dist package, while folks with a
serverless infrastructure (Azure Synapse, AWS EMR Serverless) will probably prefer a completely self-contained
fatjar package.
Note for Windows users: Maven will also execute all tests in your Flowman project. The Hadoop dependency will require the so-called Winutils to be installed on your machine.
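On a CI/CD server, the build step could be wrapped as in the following sketch. It assumes that mvn is on the PATH and that the script runs in the directory containing the pom.xml. The function name build_packages is hypothetical, and since the exact file names in the target directory depend on the plugin, the sketch simply lists every jar and tar.gz file it finds there.

```shell
# Hypothetical CI helper: build all packages non-interactively and list the
# archives that the build left directly inside the target directory.
build_packages() {
    # --batch-mode disables interactive prompts, which suits CI servers
    mvn --batch-mode clean install || return 1
    find target -maxdepth 1 -type f \( -name '*.tar.gz' -o -name '*.jar' \)
}
```

The function returns a non-zero status if the Maven build (including the tests) fails, so a pipeline can stop before the publishing step.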
4. Publishing
This step should ideally be performed via a CI/CD pipeline (for example, Jenkins). Of course, the details heavily depend on your infrastructure, but basically the following command will do the job:
mvn deploy
This will deploy the packaged self-contained redistributable archive to a remote repository manager like Nexus. Of
course, you will need to configure appropriate credentials in your Maven settings.xml (this is a user-specific
settings file, and not part of the project).
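In a pipeline, the settings file holding the credentials is often provided by the CI/CD server rather than taken from the home directory. The following sketch passes such a file explicitly using Maven's --settings option. The function name publish_packages and the way the file path is supplied are assumptions for illustration only.

```shell
# Hypothetical CI helper: publish the packages using an explicitly provided
# Maven settings file (for example, one injected by the CI/CD server).
publish_packages() {
    settings_file="$1"
    if [ ! -f "$settings_file" ]; then
        echo "settings file not found: $settings_file" >&2
        return 1
    fi
    mvn --batch-mode --settings "$settings_file" deploy
}
```

Checking for the file first gives a clearer error than a failed upload would if the credentials were never provided.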
5. Deploying to Production
This is the most difficult part and completely depends on your build and deployment infrastructure and on your target environment (Kubernetes, Cloudera, EMR, …). But generally, the following steps need to be performed:
5.1 Fetch redistributable package from remote repository
You can use Maven again to retrieve the correct package via
mvn dependency:get -Dartifact=<groupId>:<artifactId>:<version>:<packaging>:<classifier> -Ddest=<your-dest-directory>
For example, to download the tar.gz package of our example into the /tmp directory, you would run
the following command:
mvn dependency:get -Dartifact=com.dimajix.flowman.tutorial:flowman-tutorial-weather:1.0.0-SNAPSHOT:tar.gz:dist -Ddest=/tmp
Similarly, for fetching the fat jar, you need to run the following Maven command:
mvn dependency:get -Dartifact=com.dimajix.flowman.tutorial:flowman-tutorial-weather:1.0.0-SNAPSHOT:jar:jar -Ddest=/tmp
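Depending on the version of the maven-dependency-plugin in use, the dest parameter of dependency:get may be deprecated or ignored. As a possible alternative under that assumption, the following sketch uses the dependency:copy goal with its outputDirectory parameter instead. The function name fetch_artifact is hypothetical, and the repository to download from is assumed to be configured in your Maven settings.

```shell
# Hypothetical helper: copy an artifact, given by its Maven coordinates
# (groupId:artifactId:version:packaging:classifier), into a local directory.
fetch_artifact() {
    coords="$1"
    dest="$2"
    mkdir -p "$dest" || return 1
    mvn --batch-mode dependency:copy -Dartifact="$coords" -DoutputDirectory="$dest"
}
```

For example, fetch_artifact followed by the coordinates of the dist package and a directory such as /tmp would place the tar.gz file into that directory.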
5.2 Unpack redistributable package at appropriate location
If you pulled a tar.gz file containing a full Flowman “dist” package, then you will need to install it.
You can easily unpack the package, which will provide a complete Flowman installation (minus Spark and Hadoop):
tar xvzf <artifactId>-<version>-dist-bin.tar.gz
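If the package should be installed into a dedicated directory rather than the current one, the unpacking step can be sketched as follows. The function name unpack_dist and both arguments are assumptions for illustration. The internal layout of the archive is not inspected here.

```shell
# Hypothetical helper: unpack a downloaded dist archive into an
# installation directory, creating that directory if necessary.
unpack_dist() {
    archive="$1"
    dest="$2"
    if [ ! -f "$archive" ]; then
        echo "archive not found: $archive" >&2
        return 1
    fi
    # -C makes tar extract into the given directory instead of the current one
    mkdir -p "$dest" && tar -xzf "$archive" -C "$dest"
}
```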
5.3 Run on your infrastructure
Within the installation directory, you can easily run Flowman via
bin/flowexec -f flow test run
Or you can, of course, also start the Flowman Shell via
bin/flowshell -f flow
6. Next Lesson
In the next lesson, we will learn what kind of execution metrics are collected by Flowman, how to define new data-dependent metrics, and how to publish them to Prometheus.