Repartition Mapping
Shuffles data to produce a specified number of Spark partitions, using the specified columns to determine the partitioning. This operation is useful for caching carefully repartitioned data before performing multiple join operations.
Example
mappings:
  repartitioned_customer_basket:
    kind: repartition
    input: customer_basket
    partitions: 200
    columns:
      - customer_id
      - product_id
    sort: true
Fields
kind
(mandatory) (string): repartition
broadcast
(optional) (type: boolean) (default: false): Hint for broadcasting the result of this mapping for map-side joins.
cache
(optional) (type: string) (default: NONE): Cache mode for the results of this mapping. Supported values are
NONE - Disables caching of the results of this mapping
DISK_ONLY - Caches the results on disk
MEMORY_ONLY - Caches the results in memory. If not enough memory is available, records will be uncached.
MEMORY_ONLY_SER - Caches the results in memory in a serialized format. If not enough memory is available, records will be uncached.
MEMORY_AND_DISK - Caches the results first in memory and then spills to disk.
MEMORY_AND_DISK_SER - Caches the results first in memory in a serialized format and then spills to disk.
input
(mandatory) (string): The name of the input mapping
columns
(mandatory) (list:string): The list of column names used for partitioning the data
partitions
(optional) (integer): The number of output partitions
sort
(optional) (boolean): Specifies if the records within each partition should also be sorted
filter
(optional) (type: string) (default: empty): An optional SQL filter expression that is applied before the repartition operation itself.
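Putting the optional fields together, a fuller mapping might look like the sketch below. The cache and filter values are illustrative assumptions, not part of the original example:

```yaml
mappings:
  repartitioned_customer_basket:
    kind: repartition
    input: customer_basket
    partitions: 200
    columns:
      - customer_id
      - product_id
    sort: true
    # Keep the repartitioned result on disk so later consumers reuse it
    cache: DISK_ONLY
    # Hypothetical filter: drop incomplete rows before the shuffle
    filter: "customer_id IS NOT NULL"
```

Since the filter runs before the repartition itself, filtering early like this reduces the amount of data that has to be shuffled.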
Outputs
main
- the only output of the mapping
Description
This transformation is useful as a processing optimization: when several subsequent joins all use the same columns, repartitioning (and caching) the result of a mapping once by those columns avoids shuffling the same data repeatedly.
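The intended usage pattern can be sketched as follows. The join mapping kind and its field names shown here are hypothetical placeholders for whatever join mechanism your pipeline uses; only the repartition mapping follows the spec documented above:

```yaml
mappings:
  repartitioned_customer_basket:
    kind: repartition
    input: customer_basket
    partitions: 200
    columns:
      - customer_id
    # Cache once so both downstream joins reuse the shuffled data
    cache: MEMORY_AND_DISK

  # Both joins key on customer_id, so the shuffle happens only once
  basket_with_profile:
    kind: join          # hypothetical join mapping kind
    inputs:
      - repartitioned_customer_basket
      - customer_profile
    columns:
      - customer_id

  basket_with_address:
    kind: join          # hypothetical join mapping kind
    inputs:
      - repartitioned_customer_basket
      - customer_address
    columns:
      - customer_id
```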