Modules#

Flowman YAML specifications can be split up into an arbitrary number of files. From a project perspective, these files form modules, and the collection of all modules makes up the project.

Modules (either individual files or whole directories) are declared in the project main file.
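
For example, a minimal project main file might declare its modules like this (the project name and module directories below are purely illustrative):

name: example-project
version: "1.0"
modules:
  # files or subdirectories containing the specification files
  - config
  - relations
  - mappings
  - jobs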

Each module supports the following top level entries:

config:
    ...
    
environment:
    ...

profiles:
    ...

relations:
    ...
    
connections:
    ...
    
mappings:   
    ...

targets:
    ...

tests:
    ...

templates:
    ...

jobs:
    ...

Each top level entry may appear at most once per file, but multiple files can contain the same top level entries. This makes it possible to split the whole specification across multiple files and thereby helps to organize your data flow.

Module Sections#

As explained above, each file belonging to a module can contain multiple sections. The meaning and contents of each section are explained below.

config Section#

The config section contains a list of Hadoop, Spark or Flowman configuration properties, for example:

config:
  - spark.hadoop.fs.s3a.endpoint=s3.eu-central-1.amazonaws.com
  - spark.hadoop.fs.s3a.access.key=$System.getenv('AWS_ACCESS_KEY_ID')
  - spark.hadoop.fs.s3a.secret.key=$System.getenv('AWS_SECRET_ACCESS_KEY')
  - spark.hadoop.fs.s3a.proxy.host=$System.getenv('S3_PROXY_HOST', $System.getenv('AWS_PROXY_HOST'))
  - spark.hadoop.fs.s3a.proxy.port=$System.getenv('S3_PROXY_PORT', $System.getenv('AWS_PROXY_PORT' ,'-1'))
  - spark.hadoop.fs.s3a.proxy.username=
  - spark.hadoop.fs.s3a.proxy.password=

As you can see, each property has to be specified as key=value. Configuration properties are evaluated in the order in which they are specified within a single file.

All Spark config properties are passed to Spark when the Spark session is created. Note that expression evaluation can be used in the values, but not in the keys.

environment Section#

The environment section contains key-value pairs which can be accessed via expression evaluation in almost any value definition in the specification files. A typical environment section may look as follows:

environment:
  - start_year=2007
  - end_year=2014
  - export_location=hdfs://export/weather-data

All values specified in the environment can be overridden either by profiles or by explicitly setting them as property definitions on the command line.

Note the difference between environment and config: while the former provides user-defined variables to be used as placeholders in the specification, all entries in config affect the execution and are used either directly by Flowman or by its underlying libraries like Hadoop and Spark.

profiles Section#

TBD.
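
As a rough sketch, a profile can provide alternative environment and config entries which become active when the profile is enabled (the profile name and values below are illustrative):

profiles:
  production:
    environment:
      # override the default export location from the environment section
      - export_location=hdfs://prod/export/weather-data
    config:
      - spark.executor.memory=8g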

relations Section#

The relations section simply contains a map of named relations, for example:

relations:
  measurements-raw:
    kind: file
    format: text
    location: "s3a://dimajix-training/data/weather/"
    pattern: "${year}"
    schema:
      kind: inline
      fields:
        - name: raw_data
          type: string
          description: "Raw measurement data"
    partitions:
      - name: year
        type: integer
        granularity: 1

This will define a relation called measurements-raw which can be accessed from other elements like mappings (for reading from the relation) or output operations (for writing to the relation). The list and syntax of available relations are described in detail in the Relations documentation.

connections Section#

Similar to relations, the connections section contains a map of named connections, for example:

connections:
  my-sftp-server:
    kind: sftp
    host: "${sftp_host}"
    port: ${sftp_port}
    username: "${sftp_username}"
    password: "${sftp_password}"
    keyFile: "${sftp_keyfile}"
    knownHosts: "$System.getProperty('user.home')/.ssh/known_hosts"

This will declare a connection called my-sftp-server of kind sftp, which can be referenced by specific mappings or tasks (for example inside an SFTP upload task). Detailed descriptions of all supported connections are provided in the Connections documentation.
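
As a sketch of how such a connection might be used, a job task could reference it by name via a connection property. Note that the task kind sftp-upload and its parameters below are assumptions for illustration only; consult the Connections documentation for the actual syntax:

jobs:
  upload-results:
    tasks:
      # kind name and parameters are illustrative assumptions
      - kind: sftp-upload
        connection: my-sftp-server
        source: "${export_location}"
        target: "/incoming/weather"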

mappings Section#

Again, the mappings section contains named mappings, which describe the data flow and any data transformations. For example:

mappings:
  measurements-raw:
    kind: read-relation
    source: measurements-raw
    partitions:
      year:
        start: $start_year
        end: $end_year
    columns:
      raw_data: String

This defines a mapping called measurements-raw which reads data from the relation of the same name. As you can see, the same name can be reused in different sections; for example, measurements-raw can denote a relation, a mapping and an output at the same time.
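
Further mappings can then use this mapping as their input, forming a data flow. A small sketch, assuming a select mapping kind which maps column names to SQL expressions (the expression below is illustrative):

mappings:
  measurements-usaf:
    kind: select
    input: measurements-raw
    columns:
      # extract the USAF station id from the raw text record (illustrative)
      usaf: "SUBSTR(raw_data, 5, 6)"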

You can read all about mappings in the Mappings section.

targets Section#

The targets section contains a map of named output operations, like writing to files or relations, or simply dumping the contents of a mapping onto the console. For example:

targets:
  measurements-dump:
    kind: dump
    enabled: false
    input: measurements
    limit: 100

This defines a target called measurements-dump which shows the first 100 records of a mapping called measurements.
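
A very common kind of target writes the result of a mapping into a relation. A minimal sketch, assuming a relation target kind which connects a mapping to a relation (the names and partition value below are illustrative):

targets:
  measurements-store:
    kind: relation
    mapping: measurements
    relation: measurements-processed
    partition:
      year: "${year}"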

You can read all about build targets in the Targets section.

tests Section#

Flowman also provides a built-in test framework for creating unit tests of your logic. The test framework is able to replace relations and mappings with mocked data, so tests do not require any external data sources.
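
A rough sketch of what such a test might look like, assuming mappings can be replaced by mocked records and results checked via SQL assertions (all kinds, names and records below are illustrative; consult the testing documentation for the exact syntax):

tests:
  test-measurements:
    overrideMappings:
      measurements-raw:
        # replace the real input with a single fake record (illustrative)
        kind: mock
        records:
          - ["some raw measurement line"]
    assertions:
      record-count:
        kind: sql
        query: "SELECT COUNT(*) FROM measurements"
        expected: 1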

jobs Section#

Finally, there is the jobs section, which contains one or more named job specifications, each containing a list of tasks to be executed. Jobs sit one layer above the data flow itself; they are used to build complex processing pipelines which may also require additional actions like uploading files via SFTP.

A typical job specification may look as follows:

jobs:
  main:
    description: "Main job"
    tasks:
      - kind: show-environment
      - kind: print
        text:
          - "project.name=${project.name}"
          - "project.version=${project.version}"
          - "project.basedir=${project.basedir}"
          - "project.filename=${project.filename}"
      - kind: call
        job: dump-all
        force: true

This creates a single job called main containing three tasks which are executed sequentially: the first task shows all environment variables, the second prints some information on the console, and the last one calls another job called dump-all.

Every project should contain one job called main, which is executed whenever the whole project is run via the Flowman CLI.

templates Section#

Flowman 0.18.0 introduced a new templating mechanism which helps you avoid repeating similar specification blocks.
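
As a sketch of the idea, a template defines a parametrized specification block once, which can then be instantiated multiple times via a template/... kind (the names, parameters and the values relation below are illustrative):

templates:
  key_value:
    kind: relation
    parameters:
      - name: key
        type: string
      - name: value
        type: int
        default: 12
    template:
      kind: values
      records:
        # parameters are available as variables inside the template
        - ["$key", $value]

relations:
  my_relation:
    # instantiate the template with a concrete parameter value
    kind: template/key_value
    key: some_value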