Using Standard Maven Plugins#
This documentation describes a conservative development workflow using Apache Maven as the build and deployment tool. Maven was chosen because it can be assumed to be present in most Big Data environments, so no additional installation on developer machines or CI/CD infrastructure is required.
1. Creating a new project from a Maven Archetype#
First, you need to create a new Flowman project. You can either copy one of the official Flowman examples, or you can create a new project from the provided Maven archetype. This can be done as follows:
mvn archetype:generate \
-DarchetypeGroupId=com.dimajix.flowman.maven \
-DarchetypeArtifactId=flowman-archetype-assembly \
-DgroupId=<your-group-id> \
-DartifactId=<your-artifact-id>
This will create a new directory <your-artifact-id>, which looks as follows:
├── conf
│   ├── default-namespace.yml
│   └── flowman-env.sh
├── flow
│   ├── config
│   │   ├── aws.yml
│   │   ├── config.yml
│   │   ├── connections.yml
│   │   └── environment.yml
│   ├── documentation.yml
│   ├── job
│   │   └── main.yml
│   ├── mapping
│   │   └── measurements.yml
│   ├── model
│   │   ├── measurements-raw.yml
│   │   └── measurements.yml
│   ├── project.yml
│   ├── schema
│   │   └── measurements.json
│   ├── target
│   │   ├── documentation.yml
│   │   └── measurements.yml
│   └── test
│       └── test-measurements.yml
├── assembly.xml
├── pom.xml
└── README.md
The project provides a skeleton structure with the following entities:
- A couple of relations (one source measurements_raw and two sinks, measurements and measurements_raw)
- A couple of mappings to extract measurement information from measurements_raw
- Two targets for writing the extracted measurements as files and to a JDBC database
- One main job containing both targets
- A small test suite in the flow/test directory
- Some configuration options in the flow/config directory
Maven Build Process#
The pom.xml generated by the archetype will look as follows:
<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
    <modelVersion>4.0.0</modelVersion>
    <groupId>my.company</groupId>
    <artifactId>quickstart</artifactId>
    <version>1.0-SNAPSHOT</version>
    <packaging>pom</packaging>
    <name>quickstart</name>
    <parent>
        <groupId>com.dimajix.flowman</groupId>
        <artifactId>flowman-parent</artifactId>
        <version>1.0.0-SNAPSHOT</version>
    </parent>
    <build>
        <resources>
            <!-- Define which resources should be processed. Do not forget to also add any new directories to assembly.xml! -->
            <resource>
                <directory>flow</directory>
                <targetPath>${project.build.outputDirectory}/flow</targetPath>
                <filtering>true</filtering>
            </resource>
            <resource>
                <directory>conf</directory>
                <targetPath>${project.build.outputDirectory}/conf</targetPath>
                <filtering>true</filtering>
            </resource>
        </resources>
        <plugins>
            <plugin>
                <!-- 1. Unpack Flowman distribution, which provides a working setup -->
                <groupId>org.apache.maven.plugins</groupId>
                <artifactId>maven-dependency-plugin</artifactId>
                <executions>
                    <execution>
                        <phase>process-sources</phase>
                        <goals>
                            <goal>unpack</goal>
                        </goals>
                        <configuration>
                            <outputDirectory>target</outputDirectory>
                            <artifactItems>
                                <artifactItem>
                                    <groupId>com.dimajix.flowman</groupId>
                                    <artifactId>flowman-dist</artifactId>
                                    <version>${flowman.version}</version>
                                    <type>tar.gz</type>
                                    <classifier>bin</classifier>
                                </artifactItem>
                            </artifactItems>
                        </configuration>
                    </execution>
                </executions>
            </plugin>
            <plugin>
                <!-- 2. Process project resources -->
                <groupId>org.apache.maven.plugins</groupId>
                <artifactId>maven-resources-plugin</artifactId>
                <configuration>
                    <delimiters>
                        <delimiter>@</delimiter>
                    </delimiters>
                    <useDefaultDelimiters>false</useDefaultDelimiters>
                </configuration>
                <executions>
                    <execution>
                        <id>default-resources</id>
                        <phase>process-resources</phase>
                        <goals>
                            <goal>resources</goal>
                        </goals>
                    </execution>
                </executions>
            </plugin>
            <plugin>
                <!-- 3. Run all tests -->
                <groupId>org.codehaus.mojo</groupId>
                <artifactId>exec-maven-plugin</artifactId>
                <executions>
                    <execution>
                        <id>exec-flowman-test</id>
                        <phase>test</phase>
                        <goals>
                            <goal>exec</goal>
                        </goals>
                    </execution>
                </executions>
                <configuration>
                    <includeProjectDependencies>true</includeProjectDependencies>
                    <classpathScope>compile</classpathScope>
                    <executable>java</executable>
                    <skip>${skipTests}</skip>
                    <environmentVariables>
                        <!-- Set FLOWMAN_HOME to the unpacked dist directory -->
                        <FLOWMAN_HOME>${project.build.directory}/flowman-${flowman.version}</FLOWMAN_HOME>
                        <!-- Use the configuration provided in the "conf" directory -->
                        <FLOWMAN_CONF_DIR>${project.build.outputDirectory}/conf</FLOWMAN_CONF_DIR>
                    </environmentVariables>
                    <arguments>
                        <argument>-classpath</argument>
                        <classpath/>
                        <argument>com.dimajix.flowman.tools.exec.Driver</argument>
                        <argument>-f</argument>
                        <argument>${project.build.outputDirectory}/flow</argument>
                        <argument>test</argument>
                        <argument>run</argument>
                    </arguments>
                </configuration>
            </plugin>
            <plugin>
                <!-- 4. Create final deployable package containing Flowman and Project -->
                <groupId>org.apache.maven.plugins</groupId>
                <artifactId>maven-assembly-plugin</artifactId>
                <executions>
                    <execution>
                        <id>make-assembly</id>
                        <phase>package</phase>
                        <goals>
                            <goal>single</goal>
                        </goals>
                        <configuration>
                            <finalName>${project.artifactId}-${project.version}</finalName>
                            <descriptors>
                                <descriptor>assembly.xml</descriptor>
                            </descriptors>
                        </configuration>
                    </execution>
                </executions>
            </plugin>
        </plugins>
    </build>
    <dependencies>
        <!-- Main Flowman executable for running tests -->
        <dependency>
            <groupId>com.dimajix.flowman</groupId>
            <artifactId>flowman-spark-dependencies</artifactId>
            <type>pom</type>
        </dependency>
        <dependency>
            <groupId>com.dimajix.flowman</groupId>
            <artifactId>flowman-tools</artifactId>
        </dependency>
    </dependencies>
</project>
This pom.xml will instruct Maven to perform the following build steps:
1. Download and unpack the Flowman distribution via the maven-dependency-plugin. This step will provide a working setup.
2. Process all project resources (your flow definition and configuration) via the maven-resources-plugin. This will replace any Maven variable enclosed in the configured @ delimiters in your project files.
3. Run all Flowman tests via the exec-maven-plugin.
4. Create the final deployable package containing Flowman and your project using the maven-assembly-plugin.
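To illustrate the resource filtering in step 2, the following fragment is a minimal sketch of how a custom Maven property could be made available to the filtered files. The property name warehouse.dir and its value are hypothetical placeholders and are not part of the generated project.

```xml
<!-- Sketch only: add inside the <project> element of pom.xml -->
<properties>
    <!-- Hypothetical property; name and value are placeholders -->
    <warehouse.dir>/tmp/flowman/warehouse</warehouse.dir>
</properties>
<!--
  Because the maven-resources-plugin is configured with "@" as its only
  delimiter, a file below flow/ or conf/ would reference the property as

      @warehouse.dir@

  During the process-resources phase, the copy of that file written to
  ${project.build.outputDirectory} contains the value instead of the token.
  Built-in properties such as @project.version@ are resolved the same way.
-->
```

Expressions written in the usual ${...} form are left untouched by Maven in this setup, since the default delimiters are switched off via useDefaultDelimiters.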
The final artifact of the example pom.xml above will have the following Maven coordinates:
Property | Value |
---|---|
groupId | my.company |
artifactId | quickstart |
version | 1.0-SNAPSHOT |
classifier | bin |
packaging | tar.gz |
This approach is very flexible, since it uses only standard Maven plugins, but it is also rather low-level: you need to take care of all the small details within Maven yourself.
2. Implementing your logic#
With this small project, you can now start implementing your business logic. The project contains some predefined relations, mappings, jobs and targets. These will not be of direct use to you, but they give you some guidance on how to implement your logic with the Flowman framework.
You should focus on the following entities:
- Relations, which define the data sources and sinks
- Targets, which define the build targets to be executed
- Jobs, which bundle multiple related targets into a single executable job

Moreover, you might want to adjust environment and connection settings in the config subdirectory.
Once you have implemented your initial logic, you should remove all remaining parts of the original skeleton; specifically, remove (or replace) its mappings, relations, jobs and targets.
3. Testing your logic#
Once you have implemented your business logic and tidied up the original skeleton relations, mappings, etc., you should perform a first test on your local machine. In order to do so, you can either use a local installation of Flowman (a good approach on Linux machines) or run Flowman within a Docker container (the simplest method for all environments, like Linux, Windows and macOS).
Choose how to set up Flowman locally#
1. Running with installed Flowman#
In order to run tests with a local Flowman installation, you first need to set up Flowman on your local machine as described in the documentation.
2. Running with Docker#
A much simpler option than setting up a local Flowman development installation is to use the pre-built Docker images. This approach is recommended especially for Windows users, but is also very simple for Linux and Mac users.
docker run --rm -ti --mount type=bind,source=<your-project-dir>,target=/opt/flowman/project dimajix/flowman:1.0.0-oss-spark3.3-hadoop3.3 bash
Using Flowman Shell#
Once you have decided on the approach (local installation or Docker) for running Flowman, you can easily start the Flowman shell via
bin/flowshell -f <your-project-dir>
Please read more about using the Flowman Shell in the corresponding documentation.
Whenever you change something in your project, you can easily reload the project in the shell via
project reload
4. Building a complete package#
Once you are happy with your results, you can build a self-contained redistributable package with Maven via
mvn clean install
This will run all tests and create the package <artifactId>-<version>-dist-bin.tar.gz inside the target directory. The package will contain both Flowman and your project. It will not include Spark or Hadoop; these still need to be provided by your environment.
The pom.xml of the example will create an artifact with the following Maven coordinates:
Property | Value |
---|---|
groupId | my.company |
artifactId | quickstart |
version | 1.0-SNAPSHOT |
classifier | bin |
packaging | tar.gz |
Note for Windows users: Maven will also execute all tests in your Flowman project. The Hadoop dependency requires the so-called Winutils to be installed on your machine. Please read more about setting up your Windows environment.
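The exec-maven-plugin configuration in the pom.xml contains <skip>${skipTests}</skip>, so the Flowman test run follows Maven's usual skipTests property. As a sketch, a profile like the following could be added to the pom.xml to build the package on a machine where the tests cannot run yet. The profile id no-flowman-tests is a hypothetical name.

```xml
<!-- Sketch only: add inside the <project> element of pom.xml -->
<profiles>
    <profile>
        <!-- Hypothetical profile id, chosen for illustration -->
        <id>no-flowman-tests</id>
        <properties>
            <!-- Picked up by <skip>${skipTests}</skip> of the exec-maven-plugin -->
            <skipTests>true</skipTests>
        </properties>
    </profile>
</profiles>
```

The profile would be activated with mvn clean install -Pno-flowman-tests. Passing -DskipTests=true on the command line should have the same effect without any change to the pom.xml.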
5. Pushing to remote Repository#
This step should preferably be performed via a CI/CD pipeline (for example, Jenkins). Of course, the details heavily depend on your infrastructure, but basically the following command will do the job:
mvn deploy
This will deploy the packaged self-contained redistributable archive to a remote repository manager like Nexus. Of course, you will need to configure appropriate credentials in your Maven settings.xml (this is a user-specific settings file, and not part of the project).
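A minimal sketch of such a settings.xml entry is shown below. The server id company-nexus, the user name and the environment variable NEXUS_PASSWORD are hypothetical placeholders. The id has to match the id of the repository declared in the distributionManagement section that applies to your project.

```xml
<!-- Sketch of ~/.m2/settings.xml; all values are placeholders -->
<settings xmlns="http://maven.apache.org/SETTINGS/1.0.0">
    <servers>
        <server>
            <!-- Must match the repository id used in distributionManagement -->
            <id>company-nexus</id>
            <username>deploy-user</username>
            <!-- Read from an environment variable instead of storing the password in the file -->
            <password>${env.NEXUS_PASSWORD}</password>
        </server>
    </servers>
</settings>
```

In a CI/CD pipeline, the password would typically be injected as a secret into the environment of the build job.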
6. Deploying to Production#
This is the most difficult part and completely depends on your build and deployment infrastructure and on your target environment (Kubernetes, Cloudera, EMR, …). But generally, the following steps need to be performed:
1. Fetch redistributable package from remote repository#
You can use Maven again to retrieve the correct package via
mvn dependency:get -Dartifact=<groupId>:<artifactId>:<version>:tar.gz:bin -Ddest=<your-dest-directory>
For the example above, you would need to execute the following command (the artifact coordinates follow the pattern groupId:artifactId:version:packaging:classifier):
mvn dependency:get -Dartifact=my.company:quickstart:1.0-SNAPSHOT:tar.gz:bin -Ddest=/tmp
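As an alternative sketch, the package could also be fetched declaratively with a small helper POM that uses the copy goal of the maven-dependency-plugin, mirroring the artifactItem notation of the project's own pom.xml. The helper artifactId quickstart-fetch, the execution id and the output directory /tmp are assumptions made for illustration.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- Sketch of a helper POM, e.g. saved as fetch-pom.xml (hypothetical file name) -->
<project xmlns="http://maven.apache.org/POM/4.0.0">
    <modelVersion>4.0.0</modelVersion>
    <groupId>my.company</groupId>
    <!-- Hypothetical artifactId of the helper project -->
    <artifactId>quickstart-fetch</artifactId>
    <version>1.0-SNAPSHOT</version>
    <packaging>pom</packaging>
    <build>
        <plugins>
            <plugin>
                <groupId>org.apache.maven.plugins</groupId>
                <artifactId>maven-dependency-plugin</artifactId>
                <executions>
                    <execution>
                        <id>fetch-flowman-package</id>
                        <phase>package</phase>
                        <goals>
                            <goal>copy</goal>
                        </goals>
                        <configuration>
                            <!-- Destination directory for the downloaded archive -->
                            <outputDirectory>/tmp</outputDirectory>
                            <artifactItems>
                                <artifactItem>
                                    <groupId>my.company</groupId>
                                    <artifactId>quickstart</artifactId>
                                    <version>1.0-SNAPSHOT</version>
                                    <type>tar.gz</type>
                                    <classifier>bin</classifier>
                                </artifactItem>
                            </artifactItems>
                        </configuration>
                    </execution>
                </executions>
            </plugin>
        </plugins>
    </build>
</project>
```

Running mvn package -f fetch-pom.xml would then resolve the archive from the configured repositories and copy it into the output directory. Spelling out type and classifier as separate elements avoids any ambiguity in the colon-separated coordinate string.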
2. Unpack redistributable package at appropriate location#
You can easily unpack the package, which will provide a complete Flowman installation (minus Spark and Hadoop):
tar xvzf <artifactId>-<version>-dist-bin.tar.gz
3. Run on your infrastructure#
Within the installation directory, you can easily run Flowman via
bin/flowexec -f flow test run
Of course, you can also start the Flowman Shell via
bin/flowshell -f flow