Data Validation before Build#
In many cases, you’d like to perform some validation of input data before you start processing. For example when joining data, you often assume some uniqueness constraint on the join key in some tables or mappings. If that constraint was violated, your processing would probably create multiple entries per key as a result of the join operation, which might be very undesirable.
To detect such cases and prevent Flowman from running the processing (which would probably create wrong results
due to the wrong assumptions on uniqueness of some columns), Flowman provides the ability to run some validations
before the processing itself starts. Validations can be specified using the
validate targets, which build on the core functionality of the testing framework,
description: "There should be at most one entry per campaign"
query: "SELECT campaign,COUNT() FROM some_table GROUP BY campaign HAVING COUNT() > 1"
description: "There should be no NULL values in the campaign column"
query: "SELECT COUNT(*) FROM some_table WHERE campaign IS NULL"
The example above will validate assumptions on
some_table mapping, which reads from some relation also called
some_table (for example that could be a Hive table or a JDBC table). The first assertion validates that the column
campaign only contains unique values while the second assertion validates that the column doesn’t contain any
validate targets are executed during the VALIDATE phase, which is executed before any other
build phase. If one of these targets fail, Flowman will stop execution on return an error. This helps to prevent
building invalid data.
In addition to validating data quality before a Flowman job starts its main work later in the
BUILD phase, Flowman also provides the ability to verify the results of all data transformations after the
BUILD execution phase, namely in the
VERIFY phase. In order to implement a verification, you simply need
to use a verify target, which works precisely like the
validate target with the only
difference that it is executed after the
Note that when you are concerned about the quality of the data produced by your Flowman job, the
is only one of multiple possibilities to implement meaningful checks. Read more in the
data quality cookbook about available options.