In addition to a completely data-centric data flow specification, Flowman also supports so-called jobs, which simply provide a list of targets to be built. Flowman determines the correct build order of the specified targets automatically by examining the artifacts that each target generates and requires.
```yaml
jobs:
  main:
    description: "Processes all outputs"
    extends:
      - some_parent_job
    parameters:
      - name: processing_date
        type: string
        description: "Specifies the date in yyyy-MM-dd for which the job will be run"
    environment:
      - start_ts=$processing_date
      - end_ts=$Date.parse($processing_date).plusDays(1)
    hooks:
      - kind: web
        jobSuccess: http://0.0.0.0/success&startdate=$URL.encode($start_ts)&enddate=$URL.encode($end_ts)&period=$processing_duration&force=$force
    targets:
      - invoices_daily
      - transactions_daily
      - customers_full
      - documentation
    # The executions block allows fine-grained control over which targets should participate in which execution phase.
    # This helps to reduce the total amount of work, especially when executing parameter ranges.
    executions:
      # The following entry completely disables the VALIDATE phase
      - phase: validate
        cycle: never
      # The CREATE phase should only be executed once at the beginning of each parameter range
      - phase: create
        cycle: first
        # You can also omit the targets altogether, if you want to execute all of them.
        targets: .*
      # You are allowed to specify a single phase more than once
      - phase: build
        cycle: always
        targets:
          # The following regular expressions matches all targets ending with "_daily"
          - .*_daily
      - phase: build
        cycle: last
        targets:
          # The following regular expressions matches all targets ending with "_full"
          - .*_full
      # Documentation should only be generated for the last entry in the execution sequence
      - phase: verify
        cycle: last
        targets: documentation
```
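The automatic ordering of targets can be pictured as a topological sort, where a target depends on whichever target produces an artifact it requires. The following Python sketch only illustrates that idea and is not Flowman's actual implementation. The artifact names and the `provides`/`requires` layout are hypothetical; only the target names are taken from the example job.

```python
from graphlib import TopologicalSorter

# Hypothetical description of targets: which artifacts each one
# produces and which ones it needs. These artifact names are made up.
targets = {
    "transactions_daily": {"provides": {"transactions"}, "requires": set()},
    "customers_full": {"provides": {"customers"}, "requires": set()},
    "invoices_daily": {"provides": {"invoices"}, "requires": {"transactions"}},
    "documentation": {
        "provides": set(),
        "requires": {"invoices", "transactions", "customers"},
    },
}


def build_order(targets):
    """Return the target names in an order that respects artifact dependencies."""
    # Map every artifact to the target that produces it.
    producer = {
        artifact: name
        for name, spec in targets.items()
        for artifact in spec["provides"]
    }
    # A target depends on the producers of all artifacts it requires.
    # Artifacts without a known producer are treated as external inputs.
    graph = {
        name: {producer[a] for a in spec["requires"] if a in producer}
        for name, spec in targets.items()
    }
    return list(TopologicalSorter(graph).static_order())


order = build_order(targets)
print(order)
```

In this sketch, `transactions_daily` always comes before `invoices_daily`, and `documentation` comes last, regardless of the order in which the targets are listed.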
description (optional) (type: string): A textual description of the job
extends (optional) (type: list:string): A list of other job names, which should be extended by this job. All environment variables, parameters, build targets, hooks and metrics will be inherited from the parent jobs. This helps to split up a big job into smaller ones or to reuse some configuration in slightly different but related jobs.
targets (optional) (type: list:string): A list of names of all targets that should be built as part of this job.
environment (optional) (type: list:string): A list of key=value pairs for defining or overriding environment variables which can be accessed in expressions. You can also access the job parameters in the environment definition for deriving new values.
parameters (optional) (type: list:parameter): A list of job parameters. Values for job parameters have to be specified for each job execution, be it either directly via the command line or via setting an environment variable in a derived job.
hooks (optional) (type: list:hook): A list of hooks which will be called before and after each job and target is executed. Hooks provide some ways to notify external systems (or possibly plugins) about the current execution status of jobs and targets.
metrics (optional) (type: list:hook): A list of metrics that should be published after job execution. See below for more details.
executions (optional) (type: list:execution) (since Flowman 0.30.0): This optional section provides fine-grained control over when the individual phases are to be executed. This allows reducing the amount of redundant work when a whole (date) range is used for a parameter. Within this section you can explicitly state when each phase should be executed. Each entry of the list has three attributes:
phase (required) (type: string) - the execution phase to be configured. You can have multiple entries per phase; these will be logically merged during execution.
cycle (optional) (type: string) (default: always) - specifies when this block is active when executing a whole range of job parameters via the command line. Possible values are:
always - this phase will be executed for all parameter instances
never - the corresponding phase will never be executed
first - only to be executed for the first parameter instance
last - only to be executed for the last parameter instance
targets (optional) (type: regex) (default: .*) - list of regular expressions to match build targets. Note that you still need to specify all targets in the job's main targets list. The list in the executions section acts as a filter on top of the job's main target list. Defaults to .*, which simply selects all job targets for execution.
Please find more information about executions in the Cookbook for Execution Phases.
When this section is omitted, then all targets will participate in all execution phases, and all phases will be executed for the full parameter range (if specified on the command line).
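The interplay of phase, cycle and targets in the executions section can be sketched in a few lines of Python. This is a simplified model for illustration, not Flowman's code. It assumes that a regular expression has to match the whole target name, that parameter instances are counted from zero, and that a phase without any active entry selects no targets; none of these details is confirmed by the text above. The function names are hypothetical.

```python
import re

# Targets and executions taken from the example job above.
JOB_TARGETS = ["invoices_daily", "transactions_daily", "customers_full", "documentation"]

EXECUTIONS = [
    {"phase": "validate", "cycle": "never"},
    {"phase": "create", "cycle": "first", "targets": [".*"]},
    {"phase": "build", "cycle": "always", "targets": [".*_daily"]},
    {"phase": "build", "cycle": "last", "targets": [".*_full"]},
    {"phase": "verify", "cycle": "last", "targets": ["documentation"]},
]


def is_active(cycle, index, count):
    """Decide whether an entry applies to parameter instance `index` of `count`."""
    return {
        "always": True,
        "never": False,
        "first": index == 0,
        "last": index == count - 1,
    }[cycle]


def selected_targets(phase, index, count, executions=EXECUTIONS, job_targets=JOB_TARGETS):
    """Return the job targets that take part in `phase` for one parameter instance."""
    if not executions:
        # Section omitted: every target participates in every phase.
        return list(job_targets)
    patterns = []
    for entry in executions:
        # Entries for the same phase are merged by collecting their patterns.
        if entry["phase"] == phase and is_active(entry.get("cycle", "always"), index, count):
            patterns.extend(entry.get("targets", [".*"]))
    # Assumption: a pattern must match the complete target name.
    return [t for t in job_targets if any(re.fullmatch(p, t) for p in patterns)]


# A range with three parameter instances (indices 0, 1, 2):
print(selected_targets("build", 0, 3))   # ['invoices_daily', 'transactions_daily']
print(selected_targets("build", 2, 3))   # ['invoices_daily', 'transactions_daily', 'customers_full']
print(selected_targets("verify", 1, 3))  # []
```

The two build entries show the merging behaviour: the daily targets are built for every instance, while `customers_full` is only added for the last one.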
For each job Flowman provides the following execution metrics:
namespace: The name of the namespace
project: The name of the project
A job can optionally have parameters, which play a special role. They are available as environment variables, but they are provided explicitly as part of the job invocation. Parameters are defined as part of the job:
```yaml
jobs:
  main:
    parameters:
      - name: processing_date
        type: date
        description: "Specifies the date in yyyy-MM-dd for which the job will be run"
        granularity: 1
        default: "2022-03-10"
```
name (mandatory) (type: string): The name of the parameter.
type (optional) (type: datatype) (default: string): The data type of the parameter. See Fields and Data Types for a complete list of supported data types.
description (optional) (type: string): A description of the parameter.
default (optional) (type: object): Provides a default value of the parameter.
granularity (optional) (type: integer) (default: 1): Defines the step size of the parameter.
Job parameters have to be specified when a job is run from the command line (via flowexec job run param=value), except if there is a default value defined for a parameter.
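The rule for supplying parameter values can be summarized as: an explicitly given value wins, otherwise the default is used, and a parameter with neither is an error. The following Python sketch models this rule only; the function name and the error type are hypothetical and do not reflect Flowman's internals.

```python
def resolve_parameters(declared, provided):
    """Combine declared job parameters with values given at invocation time.

    `declared` is a list of dicts mirroring the YAML parameter entries,
    `provided` maps parameter names to values passed on the command line.
    """
    resolved = {}
    for param in declared:
        name = param["name"]
        if name in provided:
            # An explicitly provided value takes precedence.
            resolved[name] = provided[name]
        elif "default" in param:
            # Fall back to the declared default value.
            resolved[name] = param["default"]
        else:
            # Hypothetical error handling for a missing mandatory value.
            raise ValueError(f"Missing value for job parameter '{name}'")
    return resolved


declared = [{"name": "processing_date", "type": "date", "default": "2022-03-10"}]

print(resolve_parameters(declared, {}))
# {'processing_date': '2022-03-10'}
print(resolve_parameters(declared, {"processing_date": "2022-04-01"}))
# {'processing_date': '2022-04-01'}
```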
Flowman can be configured such that every run of a job is logged into a database. Each log entry includes the job’s name and also all values for all parameters. This way it is possible to identify individual runs of a job.
With these explanations in mind, you should only declare job parameters which have an influence on the data processing result (for example, the processing date range). Other settings like credentials should not be provided as job parameters, but as normal environment variables instead.
Note that you can also execute a whole range of values for a given parameter as follows:
```shell
flowexec job build daily processing_datetime:start=2021-06-01T00:00 processing_datetime:end=2021-08-10T00:00 processing_datetime:step=P1D --target parquet_lineitem --no-lifecycle -j 4
```
This would assume that the job parameter processing_datetime is of type
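To get a feeling for how many job runs such a range produces, the start, end and step values from the command can be expanded in Python. This sketch assumes that the end value is exclusive, which the text does not state, and it writes the step P1D directly as a one-day `timedelta` instead of parsing the ISO 8601 duration. The function name is hypothetical.

```python
from datetime import datetime, timedelta


def expand_range(start, end, step):
    """List all parameter values from `start` up to, but not including, `end`."""
    values = []
    current = start
    while current < end:  # assumption: the end of the range is exclusive
        values.append(current)
        current += step
    return values


instances = expand_range(
    start=datetime(2021, 6, 1, 0, 0),
    end=datetime(2021, 8, 10, 0, 0),
    step=timedelta(days=1),  # corresponds to the step P1D
)

print(len(instances))   # 70
print(instances[0])     # 2021-06-01 00:00:00
print(instances[-1])    # 2021-08-09 00:00:00
```

Under these assumptions the job would run 70 times, once per day, and the first and last of these instances are the ones addressed by the first and last cycle settings.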
Because a Job might be invoked with different values for the same set of parameters, each Job will be executed in a logically isolated environment, where all cached data is cleared after the Job is finished. This way it is ensured that all mappings, which rely on specific parameter values, are reevaluated when the same Job is run multiple times within a project.
Each job can define a set of metrics to be published. The job only contains the logical definition of the metrics; the type and endpoint for publishing them are defined in the namespace.