Configuration Properties
Flowman supports a number of configuration properties that influence its behavior. These properties can be set
on the command line via --conf (see the flowexec documentation), in the config section
of the flow specification (see the module documentation), or in the namespace configuration (see the
namespace documentation).
List of Configuration Properties
General Properties
flowman.spark.enableHive (type: boolean) (default: true) If set to false, then Hive support will be disabled in Flowman.
flowman.home (type: string) Contains the home location of the Flowman installation. This will be set implicitly by the system environment variable FLOWMAN_HOME.
flowman.conf.directory (type: string) Contains the location of the Flowman configuration directory. This will be set implicitly by the system environment variable FLOWMAN_CONF_DIR or FLOWMAN_HOME.
flowman.plugin.directory (type: string) Contains the location of the Flowman plugin directory. This will be set implicitly by the system environment variable FLOWMAN_PLUGIN_DIR or FLOWMAN_HOME.
flowman.hive.analyzeTable (type: boolean) (default: true) If enabled (i.e. set to true), then Flowman will perform an ANALYZE TABLE for all Hive table updates.
flowman.impala.computeStats (type: boolean) (default: true) If enabled (i.e. set to true), then Flowman will perform a COMPUTE STATS within the Impala Catalog plugin whenever a Hive table is updated. The REFRESH statements will always be executed by the plugin.
flowman.externalCatalog.ignoreErrors (type: boolean) (default: false) If enabled (i.e. set to true), then Flowman will ignore all errors from external catalogs like Impala. This is desirable in many cases, so that such errors do not block processing.
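As a minimal sketch, assuming the same config list syntax used in the Example section of this page, some of the general properties could be set as shown here. The values are illustrative assumptions only, not recommendations.

```yaml
# default-namespace.yml (illustrative values only)
config:
  # Skip ANALYZE TABLE after Hive table updates (default: true)
  - flowman.hive.analyzeTable=false
  # Skip COMPUTE STATS in the Impala Catalog plugin (default: true);
  # REFRESH statements are still executed by the plugin
  - flowman.impala.computeStats=false
  # Do not let errors from external catalogs such as Impala block processing (default: false)
  - flowman.externalCatalog.ignoreErrors=true
```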
Workarounds
Workarounds are sometimes required, especially for not-quite-open-source Big Data platforms.
flowman.workaround.analyze_partition (type: boolean) (since Flowman 0.18.0) Enables a workaround for CDP 7.1, where ANALYZE TABLE wouldn’t always work correctly (especially in unit tests). The workaround is enabled by default if the Spark version matches ?.?.?.7.?.?.?.+ (e.g. 2.4.0.7.1.6.0-297) AND if the Spark repository URL contains “cloudera”.
Example
You can set the properties either at namespace level or at project level in the config section as follows:
# default-namespace.yml
config:
  # Generic Spark configs
  - spark.sql.shuffle.partitions=20
  - spark.sql.session.timeZone=UTC
  # Flowman specific config
  - flowman.workaround.analyze_partition=true
  - flowman.default.relation.migrationStrategy=FAIL
The default namespace is configured with the conf/default-namespace.yml file in your Flowman installation directory.
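At project level, the same entries go into the config section of a module. The following is only a sketch: the file path config/config.yml is a hypothetical name chosen for illustration, and the list syntax is assumed to be identical to the namespace example above.

```yaml
# config/config.yml (hypothetical module file inside a project)
config:
  # Generic Spark config, applied to this project only
  - spark.sql.session.timeZone=UTC
  # Flowman specific config
  - flowman.externalCatalog.ignoreErrors=true
```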