Relations¶
Physical data sources (like Hive tables, relational SQL databases, files, etc.) are specified using so-called relations. Data flows can read from and write to relations via appropriate mappings (Read Mapping) or build targets (Relation Target).
A relation always contains all the information required to connect to the data source (for example, the table name and a JDBC Connection in the case of a JDBC relation).
Some relations (like files, Hive tables and views) can also be created directly from Flowman, thereby providing a single tool for schema management and data ingestion into tables.
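As a minimal sketch, a JDBC table relation might look as follows. All names here are purely illustrative, and the exact set of supported properties depends on the relation type (see the relation type pages listed below):

```yaml
# Sketch of a JDBC connection plus a relation that uses it (illustrative names)
connections:
  frontend_db:
    kind: jdbc
    url: "jdbc:postgresql://localhost:5432/frontend"
    username: "$db_username"
    password: "$db_password"

relations:
  users:
    kind: jdbcTable          # relation kind, see "JDBC Table Relations" below
    connection: frontend_db  # all connectivity details live in the connection
    table: "users"           # the physical table represented by this relation
```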
Schema Conversion¶
Most relations support implicit schema conversions, which means that
- when reading from a relation with a user-specified schema, the actual schema of the physical data source is converted to the desired user-defined schema
- when writing to a relation, the schema of the data being written is automatically converted to the actual schema of the data sink
The details of these conversions can be controlled via the following config variables:
- `flowman.default.relation.input.columnMismatchPolicy` (default is `IGNORE`)
- `flowman.default.relation.input.typeMismatchPolicy` (default is `IGNORE`)
- `flowman.default.relation.input.charVarcharPolicy` (default is `PAD_AND_TRUNCATE`)
- `flowman.default.relation.output.columnMismatchPolicy` (default is `ADD_REMOVE_COLUMNS`)
- `flowman.default.relation.output.typeMismatchPolicy` (default is `CAST_ALWAYS`)
- `flowman.default.relation.output.charVarcharPolicy` (default is `PAD_AND_TRUNCATE`)
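For example, these defaults can be overridden in the `config` section of the namespace or project. The following sketch simply restates the default values, assuming the usual `key=value` config syntax:

```yaml
# Hypothetical excerpt of a project manifest restating the conversion defaults
config:
  - flowman.default.relation.input.columnMismatchPolicy=IGNORE
  - flowman.default.relation.input.typeMismatchPolicy=IGNORE
  - flowman.default.relation.input.charVarcharPolicy=PAD_AND_TRUNCATE
  - flowman.default.relation.output.columnMismatchPolicy=ADD_REMOVE_COLUMNS
  - flowman.default.relation.output.typeMismatchPolicy=CAST_ALWAYS
  - flowman.default.relation.output.charVarcharPolicy=PAD_AND_TRUNCATE
```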
The schema conversion covers two aspects. The first is a mismatch between column names, which is handled according to the `columnMismatchPolicy`:
- `IGNORE` will simply pass through the input columns unchanged
- `ERROR` will fail the build once a mismatch between the actual and requested schema is detected
- `ADD_COLUMNS_OR_IGNORE` will add (`NULL`) columns from the requested schema to the input schema, and will keep columns in the input schema which are not present in the requested schema
- `ADD_COLUMNS_OR_ERROR` will add (`NULL`) columns from the requested schema to the input schema, but will fail the build if the input schema contains columns not present in the requested schema
- `REMOVE_COLUMNS_OR_IGNORE` will remove columns from the input schema which are not present in the requested schema
- `REMOVE_COLUMNS_OR_ERROR` will remove columns from the input schema which are not present in the requested schema, and will fail if the input schema is missing requested columns
- `ADD_REMOVE_COLUMNS` will essentially pass through the requested schema as is
| Column Mismatch Policy | Input Column | Requested Column | Result |
|---|---|---|---|
| `IGNORE` | present | missing | present |
| `IGNORE` | missing | present | missing |
| `ERROR` | present | missing | build error |
| `ERROR` | missing | present | build error |
| `ADD_COLUMNS_OR_IGNORE` | present | missing | present |
| `ADD_COLUMNS_OR_IGNORE` | missing | present | present |
| `ADD_COLUMNS_OR_ERROR` | present | missing | build error |
| `ADD_COLUMNS_OR_ERROR` | missing | present | present |
| `REMOVE_COLUMNS_OR_IGNORE` | present | missing | missing |
| `REMOVE_COLUMNS_OR_IGNORE` | missing | present | missing |
| `REMOVE_COLUMNS_OR_ERROR` | present | missing | missing |
| `REMOVE_COLUMNS_OR_ERROR` | missing | present | build error |
| `ADD_REMOVE_COLUMNS` | present | missing | missing |
| `ADD_REMOVE_COLUMNS` | missing | present | present |
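As a concrete (hypothetical) illustration of the default output policy `ADD_REMOVE_COLUMNS`: given a file relation with the explicit schema below, any record set written to it is reshaped to exactly that schema:

```yaml
# Illustrative file relation with an explicit three-column schema
relations:
  measurements:
    kind: file
    format: parquet
    location: "$basedir/measurements"
    schema:
      kind: inline
      fields:
        - name: id
          type: long
        - name: ts
          type: timestamp
        - name: value
          type: double
```

Writing records with the columns (`id`, `value`, `debug_info`) would then drop the extra column `debug_info` and add a `NULL` column `ts`, so the data on disk always contains exactly (`id`, `ts`, `value`).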
The second aspect is the conversion of data types. Again, this can be configured using the `typeMismatchPolicy` as follows:

| Type Mismatch Policy | Input vs Requested Data Type | Result |
|---|---|---|
| `IGNORE` | Source can be safely cast to dest | Input Data Type |
| `IGNORE` | Source cannot be safely cast to dest | Input Data Type |
| `ERROR` | Source can be safely cast to dest | build error |
| `ERROR` | Source cannot be safely cast to dest | build error |
| `CAST_COMPATIBLE_OR_ERROR` | Source can be safely cast to dest | Target Data Type |
| `CAST_COMPATIBLE_OR_ERROR` | Source cannot be safely cast to dest | build error |
| `CAST_COMPATIBLE_OR_IGNORE` | Source can be safely cast to dest | Target Data Type |
| `CAST_COMPATIBLE_OR_IGNORE` | Source cannot be safely cast to dest | Input Data Type |
| `CAST_ALWAYS` | Source can be safely cast to dest | Target Data Type |
| `CAST_ALWAYS` | Source cannot be safely cast to dest | Target Data Type |
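For example, to fail the build instead of silently casting on writes, the output policy could be tightened. This sketch assumes that widening numeric conversions (such as `int` to `long`) count as safe casts while lossy conversions (such as `string` to `int`) do not:

```yaml
# Hypothetical override: only allow safe casts on write, otherwise fail
config:
  - flowman.default.relation.output.typeMismatchPolicy=CAST_COMPATIBLE_OR_ERROR
```

Under this setting, writing an `int` column into a `long` column would be converted to the target data type, while writing a `string` column into an `int` column would fail the build; with the default `CAST_ALWAYS`, the same write would be cast unconditionally.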
The two options `flowman.default.relation.input.charVarcharPolicy` and `flowman.default.relation.output.charVarcharPolicy` control how Flowman will treat the `VARCHAR(n)` and `CHAR(n)` data types. The possible values are:
- `IGNORE` - Do not apply any length restrictions
- `PAD_AND_TRUNCATE` - Truncate `VARCHAR(n)`/`CHAR(n)` strings which are too long and pad `CHAR(n)` strings which are too short
- `PAD` - Pad `CHAR(n)` strings which are too short
- `TRUNCATE` - Truncate `VARCHAR(n)`/`CHAR(n)` strings which are too long
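For instance, with the default `PAD_AND_TRUNCATE` policy, a schema fragment like the following (purely illustrative, assuming `char(n)`/`varchar(n)` field types) would normalize string lengths:

```yaml
# Illustrative schema fragment using length-restricted string types
schema:
  kind: inline
  fields:
    - name: country_code
      type: char(2)      # "D" is padded to "D ", "DEU" is truncated to "DE"
    - name: city
      type: varchar(12)  # strings longer than 12 characters are truncated
```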
Relation Types¶
Flowman provides direct support for the most important data sources, which are also natively supported by Spark. Additional data sources can be implemented as plugins if required.
- Delta File Relations
- Delta Table Relations
- Empty Relation
- File Relations
- Generic Relation
- Hive Query Relations
- Hive Table Relations
- Hive Union Table
- Hive View Relations
- JDBC Query Relations
- JDBC Table Relations
- JDBC View Relations
- Kafka Relations
- Local Relations
- Mock Relation
- SQL Server Relations
- Template Relation
- Values Relation