What is Flowman
Flowman is a Spark-based data build tool that simplifies writing data transformation applications. Flowman can be seen as an ETL tool with a strong focus on transformation and schema management.
The main idea is that developers define all input/output tables and the whole transformation logic in purely declarative YAML files instead of writing complex Spark jobs in Scala or Python. The main advantage of this approach is that many technical details of a correct and robust implementation are encapsulated and the user can concentrate on the data transformations themselves.
In addition to writing and executing data transformations, Flowman can also be used for managing physical data models, such as Hive tables and JDBC tables. Flowman creates such tables with the correct schema from a specification, and it also provides mechanisms to automatically migrate these tables when the schema changes due to updated transformation logic (e.g. new columns are added or data types are changed).
This helps to keep all aspects (like transformations and schema information) in a single place managed by a single application.
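The schema migration described above can be pictured as comparing the schema a table currently has with the schema the specification now requires. The following is a minimal sketch of that idea in plain Python, not Flowman's actual implementation: the function `plan_migration`, the column names, and the textual step format are all hypothetical and chosen only for illustration.

```python
from typing import Dict, List


def plan_migration(current: Dict[str, str], desired: Dict[str, str]) -> List[str]:
    """Return readable migration steps that turn `current` into `desired`.

    Both arguments map column names to type names. This is an illustrative
    sketch only; it does not reflect how Flowman represents schemas.
    """
    steps = []
    # Columns that are new or whose type differs from the existing table
    for name, dtype in desired.items():
        if name not in current:
            steps.append(f"ADD COLUMN {name} {dtype}")
        elif current[name] != dtype:
            steps.append(f"CHANGE COLUMN {name} {current[name]} -> {dtype}")
    # Columns that exist in the table but are no longer specified
    for name in current:
        if name not in desired:
            steps.append(f"DROP COLUMN {name}")
    return steps


# Hypothetical schemas: the existing table and the updated specification
current_schema = {"id": "int", "name": "string", "legacy_flag": "boolean"}
desired_schema = {"id": "bigint", "name": "string", "country": "string"}

migration_steps = plan_migration(current_schema, desired_schema)
for step in migration_steps:
    print(step)
```

A real migration also has to decide whether a type change is safe to apply in place or requires recreating the table; the sketch leaves that decision out.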
Flowman is well suited to the requirements of a modern Big Data stack that serves multiple purposes such as reporting, analytics, ML and more. Building on Spark's ability to integrate different data sources, Flowman serves as the central place in your value chain for preparing data for the next steps.
- Declarative syntax in YAML files
- Full lifecycle management of data models (create, migrate and destroy Hive tables, JDBC tables or file based storage)
- Flexible expression language
- Jobs for managing build targets (like copying files or uploading data via SFTP)
- Automatic dependency analysis to build targets in the correct order
- Powerful yet simple command line tool for batch execution
- Powerful command line tool for interactive data flow analysis
- History server that provides an overview of past jobs and targets including lineage
- Metric system that can publish metrics to servers like Prometheus
- Extensible via plugins
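The automatic dependency analysis listed above amounts to ordering build targets so that each one runs only after the targets it depends on. As a rough sketch of that idea (not Flowman's implementation), the example below uses Python's standard `graphlib` module, available from Python 3.9. The target names and the hand-written dependency map are hypothetical; Flowman determines the dependencies itself.

```python
from graphlib import TopologicalSorter

# Hypothetical build targets, each mapped to the set of targets it depends on.
# Here the dependencies are written out by hand purely for illustration.
dependencies = {
    "raw_import": set(),
    "cleaned": {"raw_import"},
    "aggregates": {"cleaned"},
    "report_upload": {"aggregates", "cleaned"},
}

# static_order() yields every target after all of its dependencies
build_order = list(TopologicalSorter(dependencies).static_order())
print(build_order)
```

A topological sort also detects cyclic dependencies: `TopologicalSorter` raises `graphlib.CycleError` when no valid build order exists.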
Where to go from here
Quickstart & Tutorial
A small quickstart guide will lead you through a simple example. After you have finished the introduction, you may want to proceed with the Flowman tutorial to get more in-depth knowledge step by step.
Flowman provides a command line utility (CLI) for running flows. Details are described in the following sections:
So-called specifications describe the logical data flow, data sources and more. A full specification contains multiple entities like mappings, data models and jobs to be executed. These items are described in more detail in the following sections:
- Specification Overview: An introduction to writing new flows
- Mappings: Documentation of available data transformations
- Relations: Documentation of available data sources and sinks
- Targets: Documentation of available build targets
- Schema: Documentation of available schema descriptions
- Jobs: Documentation on creating jobs and building targets