# Configuration Properties
Flowman supports a number of configuration properties that influence its behaviour. These properties can be set
on the command line via `--conf` (see the flowexec documentation), in the `config` section of the flow
specification (see the module documentation), or in the namespace configuration (see the namespace documentation).
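As an illustrative sketch of the command-line route, a property can be passed with `--conf` when invoking `flowexec`. The project, job, and phase names below are placeholders, not taken from this document:

```
# Hypothetical invocation: disables Hive support for this run only
flowexec --conf flowman.spark.enableHive=false -f my-project/ job build main
```

Properties set this way override values from the namespace or project configuration for that single invocation.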
## List of Configuration Properties
### General Properties
- `flowman.spark.enableHive` *(type: boolean) (default: `true`)* If set to `false`, Hive support will be disabled in Flowman.
- `flowman.home` *(type: string)* Contains the home location of the Flowman installation. This is set implicitly from the system environment variable `FLOWMAN_HOME`.
- `flowman.conf.directory` *(type: string)* Contains the location of the Flowman configuration directory. This is set implicitly from the system environment variable `FLOWMAN_CONF_DIR` or derived from `FLOWMAN_HOME`.
- `flowman.plugin.directory` *(type: string)* Contains the location of the Flowman plugin directory. This is set implicitly from the system environment variable `FLOWMAN_PLUGIN_DIR` or derived from `FLOWMAN_HOME`.
- `flowman.hive.analyzeTable` *(type: boolean) (default: `true`)* If enabled (i.e. set to `true`), Flowman will perform an `ANALYZE TABLE` for all Hive table updates.
- `flowman.impala.computeStats` *(type: boolean) (default: `true`)* If enabled (i.e. set to `true`), Flowman will perform a `COMPUTE STATS` within the Impala catalog plugin whenever a Hive table is updated. The `REFRESH` statements will always be executed by the plugin.
- `flowman.externalCatalog.ignoreErrors` *(type: boolean) (default: `false`)* If enabled (i.e. set to `true`), Flowman will ignore all errors from external catalogs like Impala. This is often desirable, so that such errors do not block processing.
## Workarounds
Sometimes workarounds are required, especially for not-quite-open-source Big Data platforms.
- `flowman.workaround.analyze_partition` *(type: boolean) (since Flowman 0.18.0)* Enables a workaround for CDP 7.1, where `ANALYZE TABLE` would not always work correctly (especially in unit tests). The workaround is enabled by default if the Spark version matches `?.?.?.7.?.?.?.+` (e.g. `2.4.0.7.1.6.0-297`) AND the Spark repository URL contains "cloudera".
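The auto-detection condition above can be sketched in Python. This is an approximation, not Flowman's actual implementation: it assumes each `?` in the documented pattern stands for a single numeric version component.

```python
import re

# Hedged sketch of the CDP 7.1 detection: a version like 2.4.0.7.1.6.0-297
# has three leading components, then "7", then three more components.
CDP_PATTERN = re.compile(r"^\d+\.\d+\.\d+\.7(\.\d+){3}.*$")

def needs_analyze_partition_workaround(spark_version: str, repo_url: str) -> bool:
    """True when BOTH the version pattern and the repository URL indicate CDP."""
    return bool(CDP_PATTERN.match(spark_version)) and "cloudera" in repo_url

# A Cloudera-built Spark matches; a stock Apache Spark does not.
print(needs_analyze_partition_workaround(
    "2.4.0.7.1.6.0-297", "https://repository.cloudera.com/artifactory"))  # True
print(needs_analyze_partition_workaround(
    "3.2.1", "https://repo.maven.apache.org"))  # False
```

Both conditions must hold; a matching version downloaded from a non-Cloudera repository leaves the workaround disabled, which is why the property exists for manual override.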
## Example
You can set the properties either at namespace level or at project level in the `config` section as follows:

```yaml
# default-namespace.yml
config:
  # Generic Spark configs
  - spark.sql.shuffle.partitions=20
  - spark.sql.session.timeZone=UTC
  # Flowman specific config
  - flowman.workaround.analyze_partition=true
  - flowman.default.relation.migrationStrategy=FAIL