Configuration Properties

Flowman supports a number of configuration properties that influence its behaviour. These properties can be set either on the command line via --conf (see the flowexec documentation), in the config section of the flow specification (see the module documentation), or in the namespace configuration (see the namespace documentation).
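
As a sketch of the command-line route, a single property can be overridden when running a job. The job name main below is only a placeholder and the exact flowexec invocation may differ between versions; only the --conf mechanism itself is taken from the description above:

flowexec --conf flowman.hive.analyzeTable=false job build main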

List of Configuration Properties

General Properties

  • flowman.spark.enableHive (type: boolean) (default:true) If set to false, then Hive support will be disabled in Flowman.
  • flowman.home (type: string) Contains the home location of the Flowman installation. This will be set implicitly by the system environment variable FLOWMAN_HOME.
  • flowman.conf.directory (type: string) Contains the location of the Flowman configuration directory. This will be set implicitly by the system environment variable FLOWMAN_CONF_DIR or FLOWMAN_HOME.
  • flowman.plugin.directory (type: string) Contains the location of the Flowman plugin directory. This will be set implicitly by the system environment variable FLOWMAN_PLUGIN_DIR or FLOWMAN_HOME.
  • flowman.hive.analyzeTable (type: boolean) (default:true) If enabled (i.e. set to true), then Flowman will perform an ANALYZE TABLE for all Hive table updates.
  • flowman.impala.computeStats (type: boolean) (default:true) If enabled (i.e. set to true), then Flowman will perform a COMPUTE STATS within the Impala Catalog plugin whenever a Hive table is updated. The REFRESH statements will always be executed by the plugin.
  • flowman.externalCatalog.ignoreErrors (type: boolean) (default:false) If enabled (i.e. set to true), then Flowman will ignore all errors from external catalogs like Impala. This is often desirable so that catalog errors do not block processing (see the example after this list).
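
For example, a namespace configuration that runs without Hive support and tolerates failures of an external catalog could look like the following sketch; the values are purely illustrative, not recommendations:

# default-namespace.yml

config:
  # Run on a plain Spark installation without Hive support
  - flowman.spark.enableHive=false
  # Do not let errors from external catalogs (e.g. Impala) abort processing
  - flowman.externalCatalog.ignoreErrors=true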

Workarounds

Some workarounds are occasionally required, especially for not-quite-open-source Big Data platforms.

  • flowman.workaround.analyze_partition (type: boolean) (since Flowman 0.18.0) Enables a workaround for CDP 7.1, where ANALYZE TABLE would not always work correctly (especially in unit tests). The workaround is enabled by default if the Spark version matches ?.?.?.7.?.?.?.+ (e.g. 2.4.0.7.1.6.0-297) AND the Spark repository URL contains "cloudera".

Example

You can set these properties in the config section, either at namespace level or at project level, as follows:

# default-namespace.yml

config:
  # Generic Spark configs  
  - spark.sql.shuffle.partitions=20
  - spark.sql.session.timeZone=UTC
  # Flowman specific config  
  - flowman.workaround.analyze_partition=true
  - flowman.default.relation.migrationStrategy=FAIL
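
The same properties can also be placed at project level. The following sketch assumes a module file inside the project (see the module documentation) that carries a config section; the file name flow.yml is only a placeholder:

# flow.yml (a module inside the project)

config:
  # Skip ANALYZE TABLE and COMPUTE STATS for faster (but unanalyzed) table updates
  - flowman.hive.analyzeTable=false
  - flowman.impala.computeStats=false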