Configuration Properties
Flowman supports a number of configuration properties that influence its behavior. These properties can be set on the command line via --conf (see the flowexec documentation), in the config section of the flow specification (see the module documentation), or in the namespace configuration (see the namespace documentation).
List of Configuration Properties
General Properties
flowman.spark.enableHive
(type: boolean) (default: true) If set to false, then Hive support will be disabled in Flowman.
flowman.home
(type: string) Contains the home location of the Flowman installation. This will be set implicitly by the system environment variable FLOWMAN_HOME.
flowman.conf.directory
(type: string) Contains the location of the Flowman configuration directory. This will be set implicitly by the system environment variable FLOWMAN_CONF_DIR or FLOWMAN_HOME.
flowman.plugin.directory
(type: string) Contains the location of the Flowman plugin directory. This will be set implicitly by the system environment variable FLOWMAN_PLUGIN_DIR or FLOWMAN_HOME.
flowman.hive.analyzeTable
(type: boolean) (default: true) If enabled (i.e. set to true), then Flowman will perform an ANALYZE TABLE for all Hive table updates.
flowman.impala.computeStats
(type: boolean) (default: true) If enabled (i.e. set to true), then Flowman will perform a COMPUTE STATS within the Impala Catalog plugin whenever a Hive table is updated. The REFRESH statements will always be executed by the plugin.
flowman.externalCatalog.ignoreErrors
(type: boolean) (default: false) If enabled (i.e. set to true), then Flowman will ignore all errors from external catalogs like Impala. This is often desirable, so that failures in an external catalog do not block processing.
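As a rough illustration of how the general properties above might be combined, the following sketch uses the same config list syntax as the example at the end of this page. The chosen values are only an assumption about a setup without Hive statistics and with a best-effort Impala catalog; they are not recommended defaults.

```yaml
config:
  # Skip ANALYZE TABLE after Hive table updates (the default is true)
  - flowman.hive.analyzeTable=false
  # Skip COMPUTE STATS in the Impala Catalog plugin; REFRESH is still executed
  - flowman.impala.computeStats=false
  # Do not let errors from external catalogs such as Impala block processing
  - flowman.externalCatalog.ignoreErrors=true
```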
Workarounds
Workarounds are sometimes required, especially for not-quite-open-source Big Data platforms.
flowman.workaround.analyze_partition
(type: boolean) (since Flowman 0.18.0) Enables a workaround for CDP 7.1, where ANALYZE TABLE wouldn’t always work correctly (especially in unit tests). The workaround is enabled by default if the Spark version matches ?.?.?.7.?.?.?.+ (e.g. 2.4.0.7.1.6.0-297) AND if the Spark repository URL contains “cloudera”.
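If the automatic detection does not fit your environment, the property can be set explicitly. The sketch below assumes the same config list syntax used elsewhere on this page, and assumes that an explicit value takes precedence over the detected default, which the text above does not state.

```yaml
config:
  # Explicitly switch the CDP 7.1 workaround off
  # (assumption: an explicit setting overrides the automatic detection)
  - flowman.workaround.analyze_partition=false
```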
Example
You can set the properties at either the namespace level or the project level in the config section, as follows:
# default-namespace.yml
config:
  # Generic Spark configs
  - spark.sql.shuffle.partitions=20
  - spark.sql.session.timeZone=UTC
  # Flowman specific config
  - flowman.workaround.analyze_partition=true
  - flowman.default.relation.migrationStrategy=FAIL
The default namespace is configured with the conf/default-namespace.yml file in your Flowman installation directory.
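For the project level, a config section with the same syntax can be placed in one of the project's module files. The following is only a sketch: the file name config.yml is a hypothetical choice, and the settings are taken from the namespace example above.

```yaml
# config.yml (hypothetical module file inside a project)
config:
  # Spark setting applied to this project
  - spark.sql.session.timeZone=UTC
  # Flowman setting applied to this project
  - flowman.hive.analyzeTable=false
```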