Flowman Quickstart Guide¶
This quickstart guide will walk you through the first steps working with Flowman. We will be using Flowman in Docker to save setup and configuration of Spark and Flowman.
Of course running Flowman inside Docker will probably prevent a proper integration with any existing Hadoop environment like Cloudera or EMR. This requires a local installation of Flowman on your Linux (or Windows) machine, which is described in more detail in installation guide.
1. Start Docker image¶
Of course, you need a working Docker installation, which should be quite easy. Then you can start a Docker image containing Flowman via the following command:
docker run --rm -ti dimajix/flowman:0.27.0-oss-spark3.3-hadoop3.3 bash
Note that this simply starts a bash shell, but Flowman is only away some fingertips.
You probably might want to mount some local directory into the Docker container running Flowman. For example you may want to make local data accessible to Flowman or you may simply want to execute some Flowman project stored on your local machine. This can be easily achieved as follows:
docker run --rm -ti -v /home/kaya/flowman/example/weather:/home/flowman/local dimajix/flowman:0.27.0-oss-spark3.3-hadoop3.3 bash
This will mount the local directory
/home/kaya/flowman/example/weather on your host computer into the Docker
2. Flowman Shell¶
The example data is stored in a publicly accessible S3 bucket. Since the data is publicly available and the project is configured to use anonymous AWS authentication, you do not need to provide your AWS credentials (you even do not even need to have an account on AWS)
Start interactive Flowman shell¶
We start Flowman by running the interactive Flowman shell. While this is not the tool that would be used in automatic
batch processing (
flowexec is the right tool for that scenario), it gives us a good idea how ETL projects in Flowman
cd flowman bin/flowshell -f examples/weather
Now we can inspect some of the relations defined in the project. First we list all relations
flowman:weather> relation list
Now we can peek inside the relations
measurements_raw. Since the second relation is partitioned
by years, we explicitly specify the year via the option
flowman:weather> relation show stations_raw flowman:weather> relation show measurements_raw -p year=2011
Running a Job¶
Now we want to execute the projects main job. Again the job is parametrized by year, so we need to specify the year that we’d like to process.
flowman:weather> job build main year=2011
Now we’d like to inspect some of the mappings which have been used during job execution. Since some mappings depend
on job-specific variables, we need to create a job context, which can be done by
job enter <job-name> <job-args>
flowman:weather> job enter main year=2011
Note how the prompt has changed and will now include the job name. Now we can inspect some mappings:
flowman:weather/main> mapping list flowman:weather/main> mapping show measurements_raw flowman:weather/main> mapping show measurements-extracted flowman:weather/main> mapping show stations_raw
Finally we’d like to leave the job context again.
flowman:weather/main> job leave
The job execution has written its results into some relations again. We can now inspect them again
flowman:weather> relation show stations flowman:weather> relation show measurements flowman:weather> relation show aggregates -p year=2011
Flowman also provides an execution history. In the trivial deployment, this information is stored locally in a Derby database, but other databases like MySQL, MariaDB etc are also supported.
flowman:weather> history job search flowman:weather> history target search -J 1
Flowman cannot only execute all the data transformations specified in the example project, it can also generate a documentation, which will be stored as an html file
flowman:weather> documentation generate
This will create a file in the directory
examples/weather/generated-documentation/project.html which can be viewed
by any web browser of your choice.
Finally, we quit the Flowman shell via the
3. Flowman Batch Execution¶
So far we have only used the Flowman shell for interactive work with projects. Actually, the shell was developed as a
second step to help to analyze problems and debugging data flows. The primary command for working with Flowman projects
flowexec which is used for non-interactive batch execution, for example within cron-jobs.
It shares a lot of code with the Flowman shell, so the commands are often exactly the same. The main difference is
flowexec you specify the commands on the command line while
flowshell provides its own prompt.
For example for running the “build” lifecycle of the weather project for the year 2014, you only need to run:
bin/flowexec -f examples/weather job build main year=2014
A very special Thank You! goes to all of you who try to follow the example hands-on on your local machine. If you have problems with following the example, please leave me a note — it’s always difficult to streamline such a process, and I might have overseen some issues.