Relations¶
Physical data sources (like Hive tables, relational SQL databases, files etc) are specified using so called relations. Data flows can read and write from and to relations via appropriate mappings (Read Mapping) or build targets (Relation Target).
A relation always contains all required information to connect to the data source (for example the table name and a JDBC Connection in case of a JDBC relation).
Some relations (like files, Hive tables and views) can also be directly created from Flowman, thereby providing a single tool for schema management and data ingestion into tables.
Schema Conversion¶
Most relations support implicit schema conversions, which means that
- when reading from a relation with a user specified schema, the actual schema of the physical data source is converted to the desired user defined schema
- when writing to a relation, the schema of the data written is automatically converted to the actual schema of the data sink
The details of these conversions can be controlled via some config variables
flowman.default.relation.input.columnMismatchPolicy(default isIGNORE)flowman.default.relation.input.typeMismatchPolicy(default isIGNORE)flowman.default.relation.input.charVarcharPolicy(default isPAD_AND_TRUUNCATE)flowman.default.relation.output.columnMismatchPolicy(default isADD_REMOVE_COLUMNS)flowman.default.relation.output.typeMismatchPolicy(default isCAST_ALWAYS)flowman.default.relation.output.charVarcharPolicy(default isPAD_AND_TRUUNCATE)
The schema conversion is implemented using two aspects. The first is a mismatch between column (names). This can be
configured using the columnMismatchPolicy as follows. Basically the idea is that
IGNOREwill simply pass through the input columns unchangedERRORwill fail the build once a mismatch between actual and requested schema is detectedADD_COLUMNS_OR_IGNOREwill add (NULL) columns from the requested schema to the input schema, and will keep columns in the input schema which are not present in the requested schemaADD_COLUMNS_OR_ERRORwill add (NULL) columns from the requested schema to the input schema, but will fail the build if the input schema contains columns not present in the requested schemaREMOVE_COLUMNS_OR_IGNOREwill remove columns from the input schema which are not present in the requested schemaREMOVE_COLUMNS_OR_ERRORwill remove columns from the input schema which are not present in the requested schema and will fail if the input schema is missing requested columnsADD_REMOVE_COLUMNSwill essentially pass through the requested schema as is
| Column Mismatch Policy | Input Column | Requested Column | Result |
|---|---|---|---|
IGNORE |
present | missing | present |
IGNORE |
missing | present | missing |
ERROR |
present | missing | build error |
ERROR |
missing | present | build error |
ADD_COLUMNS_OR_IGNORE |
present | missing | present |
ADD_COLUMNS_OR_IGNORE |
missing | present | present |
ADD_COLUMNS_OR_ERROR |
present | missing | build error |
ADD_COLUMNS_OR_ERROR |
missing | present | missing |
REMOVE_COLUMNS_OR_IGNORE |
present | missing | missing |
REMOVE_COLUMNS_OR_IGNORE |
missing | present | missing |
REMOVE_COLUMNS_OR_ERROR |
present | missing | missing |
REMOVE_COLUMNS_OR_ERROR |
missing | present | build error |
ADD_REMOVE_COLUMNS |
present | missing | missing |
ADD_REMOVE_COLUMNS |
missing | present | present |
The second aspect is the conversion of data types. Again this can be configured using the typeMismatchPolicy as
follows:
| Type Mismatch Policy | Input vs Requested Data Type | Result |
|---|---|---|
IGNORE |
Source can be cast to dest | Input Data Type |
IGNORE |
Source cannot be safely cast to dest | Input Data Type |
ERROR |
Source can be cast to dest | build error |
ERROR |
Source cannot be safely cast to dest | build error |
CAST_COMPATIBLE_OR_ERROR |
Source can be cast to dest | Target Data Type |
CAST_COMPATIBLE_OR_ERROR |
Source cannot be safely cast to dest | build error |
CAST_COMPATIBLE_OR_IGNORE |
Source can be safely cast to dest | Target Data Type |
CAST_COMPATIBLE_OR_IGNORE |
Source cannot be safely cast to dest | Input Data Type |
CAST_ALWAYS |
Source can be safely cast to dest | Target Data Type |
CAST_ALWAYS |
Source cannot be safely cast to dest | Target Data Type |
The two options flowman.default.relation.input.charVarcharPolicy and flowman.default.relation.output.charVarcharPolicy
control how Flowman will treat VARCHAR(n) and CHAR(n) data types. The possible values are
IGNORE- Do not apply any length restrictionsPAD_AND_TRUNCATE- TruncateVARCHAR(n)/CHAR(n)strings which are too long and padCHAR(n)strings which are too shortPAD- PadCHAR(n)strings which are too shortTRUNCATE- TruncateVARCHAR(n)/CHAR(n)strings which are too long
Relation Types¶
Flowman directly provides support for the most important data sources, which are also supported directly by Spark. Additional data sources can be implemented as plugins if required.
- Delta File Relations
- Delta Table Relations
- Empty Relation
- File Relations
- Generic Relation
- Hive Query Relations
- Hive Table Relations
- Hive Union Table
- Hive View Relations
- JDBC Query Relations
- JDBC Table Relations
- JDBC View Relations
- Kafka Relations
- Local Relations
- Mock Relation
- SQL Server Relations
- Template Relation
- Values Relation