Data Quality Checks#
Data quality is an important topic, which is also addressed in Flowman in multiple, complementary ways.
Verification and Validation#
First you might want to add some validate and verify targets
to your job. The validate
the target will be executed before the CREATE
phase and is well suited for performing some tests
on the source data. If these tests fail, you may either emit a simple warning or stop the build altogether in failed
state (which is the default behavior).
The verify
target will be executed in the VERIFY
phase after the BUILD
phase and is well suited for conducting
data quality tests after the build itself has finished. Again a failing verify
target may either only generate a
warning, or may fail the build.
Example#
targets:
validate_input:
kind: validate
mode: failFast
assertions:
assert_primary_key:
kind: sql
tests:
- query: "SELECT id,count(*) FROM source GROUP BY id HAVING count(*) > 0"
expected: []
assert_measurement_count:
kind: sql
tests:
- query: "SELECT COUNT(*) FROM measurements_extracted"
expected: 2
Data Quality Checks as Documentation#
With the new documentation framework, Flowman adds the possibility not only to document
mappings and relations, but also to add test cases. These will be executed as part of the documentation (which is
generated with an independent command with flowexec
).
Data Quality Metrics#
In addition to the validate
and verify
targets, Flowman also offers a special measure target.
This target provides some means to collect some important metrics from data and provide the results as metrics. These
in turn can be published to Prometheus or other metric collectors.
Example#
targets:
measures:
kind: measure
measures:
record_stats:
kind: sql
query: "
SELECT
COUNT(*) AS record_count
SUM(column IS NULL) AS column_sum
FROM some_mapping"
This example will publish two metrics, record_count
and column_sum
, which then can be sent to a
metric sink configured in the namespace.
When to use what#
All three approaches are complementary and can be used together. It all depends on what you want to achieve.
Checking Pre- and Post-Conditions#
If you want to verify that certain pre- or post-conditions in the source or output data are met, then the
validate
and verify
targets should be used. They
will perform arbitrary tests either before the CREATE
and BUILD
phase (in case of the validate
target) or after
the BUILD
phase (in case of the verify
target). In case any of the tests fail, the whole build will fail and not
proceed any processing. This approach can be used to only start the data transformations when input data is clean and
matches your expectations.
Continuous Monitoring of Data Quality#
If you want to set up some continuous monitoring of your data quality (either input or output or both), then the
measure
target is the right choice. It will collect arbitrary numerical metrics from
the data and publish it to a metrics sink like Prometheus. Typically, metric collectors are used in conjunction with
a dashboard (like Grafana), which then can be used to display the whole history of these metrics over time. This way
you can see if data quality improves or gets worse, and many of these tools also allow you to set up alarms when
some threshold is reached.
Documenting Expectations with Reality Check#
Finally, the whole documentation subsystem is the right tool for specifying your expectations on the data quality and have these expectations automatically checked with the real data. In combination with continuous monitoring this can help to better understand what might be going wrong. In contrast to pre-/post-condition checking, a failed check in the documentation will not fail the build - it will simply be marked as failed in the documentation, but that’s all what will happen.