Project Mapping#
The project mapping performs a projection of an input mapping onto a specific set of columns. This corresponds to a trivial SQL SELECT with a series of simple column names, with optional CAST operations.
Example#
The project mapping supports two different syntaxes. The first, concise version only selects specific columns:
mappings:
  partial_facts:
    kind: project
    input: facts
    columns:
      - id
      - temperature
      - wind_speed
The second version additionally supports renaming a column, changing its data type, and adding an optional description:
mappings:
  partial_facts:
    kind: project
    input: facts
    columns:
      - column: id
      - column: air_temperature
        name: temperature
        type: FLOAT
        description: "The air temperature"
      - column: wind_speed
        type: FLOAT
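To make the SQL analogy concrete, the following sketch repeats the second example and adds a comment with a roughly comparable SELECT statement. The SQL is shown for illustration only and is an assumption about the equivalent query, not necessarily the statement Flowman executes internally.

```yaml
mappings:
  partial_facts:
    kind: project
    input: facts
    columns:
      - column: id
      - column: air_temperature
        name: temperature
        type: FLOAT
      - column: wind_speed
        type: FLOAT

# Roughly comparable SQL (illustrative assumption):
#   SELECT
#     id,
#     CAST(air_temperature AS FLOAT) AS temperature,
#     CAST(wind_speed AS FLOAT) AS wind_speed
#   FROM facts
```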
Fields#
kind (mandatory) (type: string): project
broadcast (optional) (type: boolean) (default: false): Hint for broadcasting the result of this mapping for map-side joins.
cache (optional) (type: string) (default: NONE): Cache mode for the results of this mapping. Supported values are:
- NONE - Disables caching of the results of this mapping.
- DISK_ONLY - Caches the results on disk.
- MEMORY_ONLY - Caches the results in memory. If not enough memory is available, some records will not be cached.
- MEMORY_ONLY_SER - Caches the results in memory in a serialized format. If not enough memory is available, some records will not be cached.
- MEMORY_AND_DISK - Caches the results first in memory and then spills to disk.
- MEMORY_AND_DISK_SER - Caches the results first in memory in a serialized format and then spills to disk.
input (mandatory) (type: string): Specifies the name of the input mapping to project.
columns (mandatory) (type: list:string): Specifies the list of columns to be present in the output. The list can either be a plain list of column names, or it can contain more complex column descriptors:
columns:
  - name: name_of_output_column
    column: name_of_incoming_column
    type: string
    description: "This is the (optional) description of the column"
You can also mix both column types in a single project mapping.
filter (optional) (type: string) (default: empty): An optional SQL filter expression that is applied after projection.
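The fields above can be combined in a single mapping. The following is a minimal sketch that mixes plain column names with column descriptors and sets the optional cache, broadcast and filter fields. The mapping name warm_facts and the filter expression are hypothetical. Because the filter is described as being applied after projection, the sketch assumes that the expression refers to the output column name temperature rather than the incoming name air_temperature.

```yaml
mappings:
  warm_facts:                      # hypothetical mapping name
    kind: project
    input: facts
    cache: MEMORY_AND_DISK         # one of the cache modes listed above
    broadcast: false               # default value, shown for completeness
    columns:
      - id                         # plain column name
      - column: air_temperature    # full column descriptor
        name: temperature
        type: FLOAT
      - wind_speed                 # plain column name again
    # Assumption: evaluated against the projected (renamed) columns
    filter: "temperature > 0"
```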
Outputs#
main - the only output of the mapping
Remarks#
The project mapping is similar to both the cast mapping and the schema mapping. The differences are as follows:
- The project mapping only performs simple type conversions and will only emit the columns specified in the columns list. It will drop all columns not specified in the list and cannot add new columns.
- The schema mapping allows specifying arbitrarily complex data types, including nested types. It will make sure that the result precisely matches the specified schema, i.e. it will add and/or drop columns as required.
- The cast mapping will only change the data type of the specified columns and will keep all other columns unchanged. This means that it will not add or drop any of the incoming columns.
Supported data types#
The following simple data types are supported by Apache Spark and Flowman:
- string, text - text and strings of arbitrary length
- binary - binary data of arbitrary length
- tinyint, byte - 8-bit signed numbers
- smallint, short - 16-bit signed numbers
- int, integer - 32-bit signed numbers
- bigint, long - 64-bit signed numbers
- boolean - true or false
- float - 32-bit floating point number
- double - 64-bit floating point number
- decimal(a,b)
- varchar(n) - text with up to n characters. Note that this data type is only supported for specifying input or output data types. Internally Spark, and therefore Flowman, converts these columns to a string column of arbitrary length.
- char(n) - text with exactly n characters. Note that this data type is only supported for specifying input or output data types. Internally Spark, and therefore Flowman, converts these columns to a string column of arbitrary length.
- date - date type
- timestamp - timestamp type (date and time)
- duration - duration type
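As an illustration of how these type names might appear in a project mapping, the sketch below casts several columns to different simple types. The mapping name typed_facts and the incoming column names station, price and measured_at are hypothetical and do not come from the examples above.

```yaml
mappings:
  typed_facts:                  # hypothetical mapping name
    kind: project
    input: facts
    columns:
      - column: id
        type: bigint            # 64-bit signed number
      - column: station         # hypothetical column
        type: varchar(32)       # handled internally as a string column
      - column: price           # hypothetical column
        type: decimal(10,2)
      - column: measured_at     # hypothetical column
        type: timestamp         # date and time
```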