Relations#
Physical data sources (like Hive tables, relational SQL databases, files etc.) are specified using so-called relations. Data flows can read and write from and to relations via appropriate mappings (Read Mapping) or build targets (Relation Target).
A relation always contains all required information to connect to the data source (for example the table name and a JDBC Connection in case of a JDBC relation).
Some relations (like files, Hive tables and views) can also be directly created from Flowman, thereby providing a single tool for schema management and data ingestion into tables.
Schema Conversion#
Most relations support implicit schema conversions, which means that
when reading from a relation with a user specified schema, the actual schema of the physical data source is converted to the desired user defined schema
when writing to a relation, the schema of the data written is automatically converted to the actual schema of the data sink
The details of these conversions can be controlled via some config variables
flowman.default.relation.input.columnMismatchPolicy(default isIGNORE)flowman.default.relation.input.typeMismatchPolicy(default isIGNORE)flowman.default.relation.input.charVarcharPolicy(default isPAD_AND_TRUUNCATE)flowman.default.relation.output.columnMismatchPolicy(default isADD_REMOVE_COLUMNS)flowman.default.relation.output.typeMismatchPolicy(default isCAST_ALWAYS)flowman.default.relation.output.charVarcharPolicy(default isPAD_AND_TRUUNCATE)
The schema conversion is implemented using two aspects. The first is a mismatch between column (names). This can be
configured using the columnMismatchPolicy as follows. Basically the idea is that
IGNOREwill simply pass through the input columns unchangedERRORwill fail the build once a mismatch between actual and requested schema is detectedADD_COLUMNS_OR_IGNOREwill add (NULL) columns from the requested schema to the input schema, and will keep columns in the input schema which are not present in the requested schemaADD_COLUMNS_OR_ERRORwill add (NULL) columns from the requested schema to the input schema, but will fail the build if the input schema contains columns not present in the requested schemaREMOVE_COLUMNS_OR_IGNOREwill remove columns from the input schema which are not present in the requested schemaREMOVE_COLUMNS_OR_ERRORwill remove columns from the input schema which are not present in the requested schema and will fail if the input schema is missing requested columnsADD_REMOVE_COLUMNSwill essentially pass through the requested schema as is
| Column Mismatch Policy | Input Column | Requested Column | Result |
|---|---|---|---|
IGNORE |
present | missing | present |
IGNORE |
missing | present | missing |
ERROR |
present | missing | build error |
ERROR |
missing | present | build error |
ADD_COLUMNS_OR_IGNORE |
present | missing | present |
ADD_COLUMNS_OR_IGNORE |
missing | present | present |
ADD_COLUMNS_OR_ERROR |
present | missing | build error |
ADD_COLUMNS_OR_ERROR |
missing | present | missing |
REMOVE_COLUMNS_OR_IGNORE |
present | missing | missing |
REMOVE_COLUMNS_OR_IGNORE |
missing | present | missing |
REMOVE_COLUMNS_OR_ERROR |
present | missing | missing |
REMOVE_COLUMNS_OR_ERROR |
missing | present | build error |
ADD_REMOVE_COLUMNS |
present | missing | missing |
ADD_REMOVE_COLUMNS |
missing | present | present |
The second aspect is the conversion of data types. Again this can be configured using the typeMismatchPolicy as
follows:
| Type Mismatch Policy | Input vs Requested Data Type | Result |
|---|---|---|
IGNORE |
Source can be cast to destination | Input Data Type |
IGNORE |
Source cannot be safely cast to destination | Input Data Type |
ERROR |
Source can be cast to destination | build error |
ERROR |
Source cannot be safely cast to destination | build error |
CAST_COMPATIBLE_OR_ERROR |
Source can be cast to destination | Target Data Type |
CAST_COMPATIBLE_OR_ERROR |
Source cannot be safely cast to destination | build error |
CAST_COMPATIBLE_OR_IGNORE |
Source can be safely cast to destination | Target Data Type |
CAST_COMPATIBLE_OR_IGNORE |
Source cannot be safely cast to destination | Input Data Type |
CAST_ALWAYS |
Source can be safely cast to destination | Target Data Type |
CAST_ALWAYS |
Source cannot be safely cast to destination | Target Data Type |
The two options flowman.default.relation.input.charVarcharPolicy and flowman.default.relation.output.charVarcharPolicy
control how Flowman will treat VARCHAR(n) and CHAR(n) data types. The possible values are
IGNORE- Do not apply any length restrictionsPAD_AND_TRUNCATE- TruncateVARCHAR(n)/CHAR(n)strings which are too long and padCHAR(n)strings which are too shortPAD- PadCHAR(n)strings which are too shortTRUNCATE- TruncateVARCHAR(n)/CHAR(n)strings which are too long
Relation Types#
Flowman directly provides support for the most important data sources, which are also supported directly by Spark. Additional data sources can be implemented as plugins if required.
- Delta File Relations
- Delta Table Relations
- Empty Relation
- File Relations
- Generic Relation
- Hive Query Relations
- Hive Table Relations
- Hive Union Table
- Hive View Relations
- JDBC Query Relations
- JDBC Table Relations
- JDBC View Relations
- Kafka Relations
- Local Relations
- Mock Relation
- SQL Server Relations
- Template Relation
- Values Relation