Using Flowman Maven Plugin#

This documentation describes a more streamlined development workflow using Apache Maven as the deployment tool. Maven was chosen because it can safely be assumed to be present in any Big Data environment, so no additional installation is required on developer machines or on the CI/CD infrastructure.

Flowman Development Workflow

In contrast to the more conservative approach described in Classic Workflow with Maven, we will use the Flowman Maven plugin, which will significantly reduce the complexity of the Maven pom.xml.

0. Prerequisites#

You need Java (JDK 11 or later) and Maven (version 3.6.3 or later) installed.

1. Creating a new project from a Maven Archetype#

First, you need to create a new Flowman project. You can either copy/paste from one of the official Flowman examples, or you can create a new project from the provided Maven archetype. The latter can be done as follows:

mvn archetype:generate \
  -DarchetypeGroupId=com.dimajix.flowman.maven \
  -DarchetypeArtifactId=flowman-archetype-quickstart \
  -DgroupId=<your-group-id> \
  -DartifactId=<your-artifact-id>

This will create a new directory <your-artifact-id>, which looks as follows:

├── conf
│   ├── default-namespace.yml
│   └── flowman-env.sh
├── flow
│   ├── config
│   │   ├── aws.yml
│   │   ├── config.yml
│   │   ├── connections.yml
│   │   └── environment.yml
│   ├── documentation.yml
│   ├── job
│   │   └── main.yml
│   ├── mapping
│   │   └── measurements.yml
│   ├── model
│   │   ├── measurements-raw.yml
│   │   └── measurements.yml
│   ├── project.yml
│   ├── schema
│   │   └── measurements.json
│   ├── target
│   │   ├── documentation.yml
│   │   └── measurements.yml
│   └── test
│       └── test-measurements.yml
├── deployment.yml
├── pom.xml
└── README.md

The project provides a skeleton structure with the following entities:

  • A couple of relations (one source measurements_raw and two sinks measurements and measurements_raw)

  • A couple of mappings to extract measurement information from measurements_raw

  • Two targets for writing the extracted measurements as files and to a JDBC database

  • One main job containing both targets

  • A small test suite in the flow/test directory

  • Some configuration options in the flow/config directory
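To give you a feel for the entities listed above, a relation definition in Flowman's YAML dialect looks approximately like the following. This is an illustrative sketch with placeholder names and paths, not a verbatim copy of the generated skeleton files:

```yaml
# Illustrative sketch only - the relation name, format and location are placeholders
relations:
  measurements:
    kind: file          # a file-based relation; Flowman also supports other kinds (JDBC, Hive, ...)
    format: parquet
    location: "$basedir/measurements/"
```

The skeleton files in flow/model and flow/mapping follow this general pattern, so they are a good starting point for your own definitions.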

Maven Build Process#

The pom.xml generated by the archetype will look as follows:

<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
    <modelVersion>4.0.0</modelVersion>
    <groupId>my.company</groupId>
    <artifactId>quickstart</artifactId>
    <version>1.0-SNAPSHOT</version>
    <packaging>pom</packaging>

    <name>quickstart</name>

    <properties>
        <encoding>UTF-8</encoding>
        <project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
    </properties>

    <build>
        <plugins>
            <plugin>
                <groupId>com.dimajix.flowman.maven</groupId>
                <artifactId>flowman-maven-plugin</artifactId>
                <version>0.4.0</version>
                <extensions>true</extensions>
                <configuration>
                    <deploymentDescriptor>deployment.yml</deploymentDescriptor>
                </configuration>

                <!-- Additional plugin dependencies for specific deployment targets -->
                <dependencies>
                    <!-- Support for deploying to S3 storage -->
                    <dependency>
                        <groupId>com.dimajix.flowman.maven</groupId>
                        <artifactId>flowman-provider-aws</artifactId>
                        <version>0.1.0</version>
                    </dependency>
                </dependencies>
            </plugin>
        </plugins>
    </build>
</project>

As you can see, the Maven project itself is almost trivial; the flowman-maven-plugin takes care of most of the build logic.

Deployment Descriptor#

In addition to the Maven pom.xml you will also find a deployment.yml file which contains the packaging details for the Flowman Maven plugin. Its contents look as follows:

flowman:
  # Specify the Flowman version to use
  version: 1.1.0
  plugins:
    # Specify the list of plugins to use
    - flowman-avro
    - flowman-aws
    - flowman-mariadb
    - flowman-mysql


# List of subdirectories containing Flowman projects
projects:
  - flow


# Specify possibly multiple redistributable packages to be built
packages:
  # The first package is called "distd"
  distd:
    # The package is a "dist" package, i.e. a tar.gz file containing both Flowman and your project
    kind: dist

  # The second package is called "jard"
  jard:
    # The package is a "fatjar" package, i.e. a single jar file containing both Flowman and your project
    kind: fatjar

This deployment descriptor will create two packages, using the Maven coordinates (groupId, artifactId and version) of the pom.xml file. Each package is published under a separate classifier:

  • The jard package will create a Maven artifact with coordinates my.company:quickstart:1.0-SNAPSHOT:jar:jard, i.e.

Property     Value
-----------  ------------
groupId      my.company
artifactId   quickstart
version      1.0-SNAPSHOT
classifier   jard
packaging    jar

The jar file is a so-called “fat jar” and contains all Flowman code together with your project files. This self-contained file can be executed directly with spark-submit.

  • The distd package will create a Maven artifact with coordinates my.company:quickstart:1.0-SNAPSHOT:tar.gz:distd, i.e.

Property     Value
-----------  ------------
groupId      my.company
artifactId   quickstart
version      1.0-SNAPSHOT
classifier   distd
packaging    tar.gz

The dist package will create a tar.gz file, which contains all Flowman libraries, executables and plugins along with your project. For running Flowman from this package, you first need to unpack the tar.gz file, and then use the Flowman binaries like flowexec.

We will later use these Maven coordinates in the deployment step to retrieve the desired artifact from the artifact repository (like Nexus).
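Since the packages are later retrieved by these coordinates, it helps to know the file name Maven derives from them. The pattern below is standard Maven behaviour; the concrete values are taken from the example above:

```shell
# Maven derives the artifact file name as <artifactId>-<version>-<classifier>.<extension>.
# For the two packages defined in the deployment descriptor:
artifactId=quickstart
version=1.0-SNAPSHOT

jar_file="${artifactId}-${version}-jard.jar"
dist_file="${artifactId}-${version}-distd.tar.gz"

echo "$jar_file"
echo "$dist_file"
```

These are the file names you will see in the local target directory after a build and in your artifact repository after deployment.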

2. Implementing your logic#

With this small project, you can now start implementing your business logic. The project contains some predefined relations, mappings, jobs and targets. These will not be of direct use to you, but they provide guidance on how to implement your own logic with the Flowman framework.

You should focus on the following entities:

  • Relations, which define the data sources and sinks

  • Targets, which define the execution targets to be executed

  • Jobs, which bundle multiple related targets into a single executable job

Moreover, you might want to adjust environment and connection settings in the config subdirectory.
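As an illustration of the last point, a job definition (as found in flow/job/main.yml) has roughly the following shape. This is an illustrative sketch, and the target names are placeholders:

```yaml
# Illustrative sketch only - target names are placeholders
jobs:
  main:
    targets:
      - measurements
      - documentation
```

Running the job will then build all listed targets in an appropriate order.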

Once you have implemented your initial logic, you should remove all remaining parts of the original skeleton; specifically, remove (or replace) all of its mappings, relations, jobs and targets.

3. Testing your logic#

Once you have implemented your business logic and tidied up the original skeleton relations, mappings, etc., you should perform a first test on your local machine. In order to do so, you can either use a local installation of Flowman (a good approach on Linux machines) or run Flowman within a Docker container (the simplest method for all environments, like Linux, Windows and macOS).

Choose how to set up Flowman locally#

1. Running with installed Flowman#

In order to run tests with a local Flowman installation, you first need to set up Flowman on your local machine as described in the documentation.

2. Running with Docker#

A much simpler option than setting up a local Flowman development installation is to use the pre-built Docker images. This approach is recommended especially for Windows users, but is also very simple for Linux and Mac users.

docker run --rm -ti --mount type=bind,source=<your-project-dir>,target=/opt/flowman/project dimajix/flowman:1.0.0-oss-spark3.3-hadoop3.3 bash

Using Flowman Shell#

Once you have decided on the approach (local installation or Docker) for running Flowman, you can easily start the Flowman shell via

bin/flowshell -f <your-project-dir>

Please read more about using the Flowman Shell in the corresponding documentation.

Whenever you change something in your project, you can easily reload the project in the shell via

project reload

4. Building a complete package#

Once you are happy with your results, you can build a self-contained redistributable package with Maven via

mvn clean install

This will run all tests and create (possibly multiple) packages contained inside the target directory. The type and details of the package are defined in the deployment.yml file. The example above will create the following two artifacts:

  • The jard package will create a Maven artifact with coordinates my.company:quickstart:1.0-SNAPSHOT:jar:jard, i.e.

Property     Value
-----------  ------------
groupId      my.company
artifactId   quickstart
version      1.0-SNAPSHOT
classifier   jard
packaging    jar

  • The distd package will create a Maven artifact with coordinates my.company:quickstart:1.0-SNAPSHOT:tar.gz:distd, i.e.

Property     Value
-----------  ------------
groupId      my.company
artifactId   quickstart
version      1.0-SNAPSHOT
classifier   distd
packaging    tar.gz

What type of package is preferable (dist or fatjar) depends on your infrastructure and deployment pipelines. People with a dedicated Hadoop cluster (Cloudera, AWS EMR) will probably be happy with a dist package, while folks with a serverless infrastructure (Azure Synapse, AWS EMR serverless) will probably prefer a completely self-contained fatjar package.

Note for Windows users: Maven will also execute all tests in your Flowman project. The Hadoop dependency will require the so-called Winutils to be installed on your machine, please read more about setting up your Windows environment.

5. Pushing to remote Repository#

This step should ideally be performed by a CI/CD pipeline (for example, Jenkins). The details heavily depend on your infrastructure, but basically the following command will do the job:

mvn deploy

This will deploy the packaged self-contained redistributable archive to a remote repository manager like Nexus. Of course, you will need to configure appropriate credentials in your Maven settings.xml (this is a user-specific settings file, and not part of the project).
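A minimal server entry in your settings.xml could look as follows. This is a sketch using standard Maven configuration; the id is a placeholder and must match the repository id in the distributionManagement section of your pom.xml:

```xml
<settings xmlns="http://maven.apache.org/SETTINGS/1.0.0">
  <servers>
    <server>
      <!-- The id must match the <repository> id in the pom's <distributionManagement> -->
      <id>company-nexus</id>
      <username>deployment</username>
      <password>secret</password>
    </server>
  </servers>
</settings>
```

In practice you should use encrypted passwords or token-based authentication as supported by your repository manager, rather than plain-text credentials.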

6. Deploying to Production#

This is the most difficult part and completely depends on your build and deployment infrastructure and on your target environment (Kubernetes, Cloudera, EMR, …). But generally, the following steps need to be performed:

1. Fetch redistributable package from remote repository#

You can use Maven again to retrieve the correct package via

mvn dependency:get -Dartifact=<groupId>:<artifactId>:<version>:<packaging>:<classifier> -Ddest=<your-dest-directory>

For example, for downloading the tar.gz package of our example into the /tmp directory, you would need to perform the following command:

mvn dependency:get -Dartifact=my.company:quickstart:1.0-SNAPSHOT:tar.gz:distd -Ddest=/tmp

2. Unpack redistributable package at appropriate location#

If you pulled a tar.gz file containing a full Flowman “dist” package, then you will need to install it. You can easily unpack the package, which will provide a complete Flowman installation (minus Spark and Hadoop):

tar xvzf <artifactId>-<version>-distd.tar.gz
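If you want to see what the unpack step does without a real package at hand, the following self-contained sketch builds a dummy archive mimicking the layout of a dist package (the directory name and file contents are illustrative placeholders) and then unpacks it:

```shell
# Build a dummy archive that mimics the layout of a distd package (placeholder content)
mkdir -p quickstart-1.0-SNAPSHOT/bin
touch quickstart-1.0-SNAPSHOT/bin/flowexec quickstart-1.0-SNAPSHOT/bin/flowshell
tar czf quickstart-1.0-SNAPSHOT-distd.tar.gz quickstart-1.0-SNAPSHOT
rm -r quickstart-1.0-SNAPSHOT

# The actual unpack step - this recreates the installation directory
tar xzf quickstart-1.0-SNAPSHOT-distd.tar.gz
ls quickstart-1.0-SNAPSHOT/bin
```

A real package additionally contains the conf, lib and plugins directories of a full Flowman installation.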

3. Run on your infrastructure#

Within the installation directory, you can easily run Flowman via

bin/flowexec -f flow test run

Or you can, of course, also start the Flowman Shell via

bin/flowshell -f flow