Jobs#

In addition to a completely data centric data flow specification, Flowman also supports so-called jobs, which simply provide a list of targets to be built. The correct build order of all specified build targets is determined automatically by Flowman by examining the artifacts being generated and required by each target.

Example#

jobs:
  main:
    description: "Processes all outputs"
    extends:
      - some_parent_job
    parameters:
      - name: processing_date
        type: string
        description: "Specifies the date in yyyy-MM-dd for which the job will be run"
    environment:
      - start_ts=$processing_date
      - end_ts=$Date.parse($processing_date).plusDays(1)
    hooks:
      - kind: web
        jobSuccess: http://0.0.0.0/success&startdate=$URL.encode($start_ts)&enddate=$URL.encode($end_ts)&period=$processing_duration&force=$force

    targets:
      - invoices_daily
      - transactions_daily
      - customers_full
      - documentation

    # The executions block allows fine-grained control over which targets should participate in which execution phase.
    # This helps to reduce the total amount of work, especially when executing parameter ranges.
    executions:
      # The following entry completely disables the VALIDATE phase
      - phase: validate
        cycle: never
      # The CREATE phase should only be executed once at the beginning of each parameter range
      - phase: create
        cycle: first
        # You can also omit the targets altogether, if you want to execute all of them. 
        targets: .*
      # You are allowed to specify a single phase more than once
      - phase: build
        cycle: always
        targets:
          # The following regular expressions matches all targets ending with "_daily"
          - .*_daily
      - phase: build
        cycle: last
        targets:
          # The following regular expressions matches all targets ending with "_full"
          - .*_full
      # Documentation should only be generated for the last entry in the execution sequence
      - phase: verify
        cycle: last
        targets: documentation

Fields#

description (optional) (type: string): A textual description of the job
extends (optional) (type: list:string): A list of other job names, which should be extended by this job. All environment variables, parameters, build targets, hooks and metrics will be inherited from the parent jobs. This helps to split up a big job into smaller ones or to reuse some configuration in slightly different but related jobs.
targets (optional) (type: list:string): A list of names of all targets that should be built as part of this job.
environment (optional) (type: list:string): A list of key=value pairs for defining or overriding environment variables which can be accessed in expressions. You can also access the job parameters in the environment definition for deriving new values.
parameters (optional) (type: list:parameter): A list of job parameters. Values for job parameters have to be specified for each job execution, be it either directly via the command line or via setting an environment variable in a derived job.
hooks (optional) (type: list:hook): A list of hooks which will be called before and after each job and target is executed. Hooks provide some ways to notify external systems (or possibly plugins) about the current execution status of jobs and targets.
metrics (optional) (type: list:hook): A list of metrics that should be published after job execution. See below for more details.
executions (optional) (type: list:execution) (since Flowman 0.30.0): This optional section provides fine-grained control over when the individual phases are to be executed. This allows reducing the amount of redundant work when a whole (date) range is used for a parameter. Within this section you can explicitly state when each phase should be executed. Each entry of the list has three attributes
- phase (required) (type: string) - the execution phase to be configured. You can have multiple entries per phase, these will be logically merged during execution.
- cycle (optional) (type: string) (default: always) - specifies when this block is active when executing a whole range of job parameters via command line. Possible values are:
  - always - this phase will be executed for all parameter instances
  - never - the corresponding phase will never be executed
  - first - only to be executed for the first parameter instance
  - last - only to be executed for the last parameter instance
- targets (optional) (type: regex) (default: .*) - list of regular expressions to match build targets. Note that you still need to specify all targets in the jobs targets main list. This list in the executions sections acts as a filter on top of the jobs main target list. Defaults to .*, which simply selects all job targets for execution.

Please find more information about executions in the Cookbook for Execution Phases. When this section is omitted, then all targets will participate in all execution phases, and all phases will be executed for the full parameter range (if specified on the command line).

Metrics#

For each job Flowman provides the following execution metrics:

metric: “job_runtime”
labels:
- category: “job”
- kind: “job”
- namespace: The name of the namespace
- project: The name of the project

Job Parameters#

A Job optionally can have parameters, which play a special role. They are available as environment variables, but explicitly provided as part of the job invocation. Parameters are defined as part of the job:

jobs:
  main:
    parameters:
      - name: processing_date
        type: date
        description: "Specifies the date in yyyy-MM-dd for which the job will be run"
        granularity: 1
        default: "2022-03-10"

Parameter Fields#

name (mandatory) (type: string): The name of the parameter.
type (optional) (type: datatype) (default: string): The data type of the parameter. See Fields and Data Types for a complete list of supported data types
description (optional) (type: string): A description of the parameter
default (optional) (type: object): Provides a default value of the parameter.
granularity (optional) (type: integer) (default: 1): Defines the step size of the parameter.

Using Parameters#

Job parameters have to be specified when a job is run from the command line (via flowexec job run param=value), except if there is a default value defined for a parameter.

Flowman can be configured such that every run of a job is logged into a database. Each log entry includes the job’s name and also all values for all parameters. This way it is possible to identify individual runs of a job.

With these explanations in mind, you should only declare job parameters which have an influence on the data processing result (for example, the processing date range). Other settings like credentials should not be provided as job parameters, but as normal environment variables instead.

Note that you can also execute a whole range of values for a given parameter as follows:

flowexec job build daily processing_datetime:start=2021-06-01T00:00 processing_datetime:end=2021-08-10T00:00 processing_datetime:step=P1D --target parquet_lineitem --no-lifecycle -j 4

This would assume that the job parameter processing_datetime is of type date.

Job Isolation#

Because a Job might be invoked with different values for the same set of parameters, each Job will be executed in a logically isolated environment, where all cached data is cleared after the Job is finished. This way it is ensured that all mappings, which rely on specific parameter values, are reevaluated when the same Job is run multiple times within a project.

Publishing Metrics#

Each job can define a set of metrics to be published. The job only contains the logical definition of metrics, the type and endpoint for publishing the metrics is defined in the namespace.

Sub Pages#

Flowman Job Execution