Configuration Properties

Flowman supports a number of configuration properties that influence its behavior. These properties can be set on the command line via --conf (see the flowexec documentation), in the config section of the flow specification (see the module documentation), or in the namespace configuration (see the namespace documentation).

List of Configuration Properties

General Properties

  • flowman.spark.enableHive (type: boolean) (default:true) If set to false, then Hive support will be disabled in Flowman.

  • flowman.home (type: string) Contains the home location of the Flowman installation. This will be set implicitly by the system environment variable FLOWMAN_HOME.

  • flowman.conf.directory (type: string) Contains the location of the Flowman configuration directory. This will be set implicitly by the system environment variable FLOWMAN_CONF_DIR or FLOWMAN_HOME.

  • flowman.plugin.directory (type: string) Contains the location of the Flowman plugin directory. This will be set implicitly by the system environment variable FLOWMAN_PLUGIN_DIR or FLOWMAN_HOME.

  • flowman.hive.analyzeTable (type: boolean) (default:true) If enabled (i.e. set to true), then Flowman will perform an ANALYZE TABLE for all Hive table updates.

  • flowman.impala.computeStats (type: boolean) (default:true) If enabled (i.e. set to true), then Flowman will perform a COMPUTE STATS within the Impala Catalog plugin whenever a Hive table is updated. The REFRESH statements will always be executed by the plugin.

  • flowman.externalCatalog.ignoreErrors (type: boolean) (default:false) If enabled (i.e. set to true), then Flowman will ignore all errors from external catalogs like Impala. This is often desirable, so that failures in an external catalog do not block processing.
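As an illustration of the properties listed above, the following sketch overrides some of the defaults in the config section of a module. The file name is hypothetical and the chosen values are only examples. Only the property names and the key=value list syntax are taken from this page.

```yaml
# config/settings.yml (hypothetical module file inside a project)

config:
  # Skip ANALYZE TABLE after Hive table updates (default: true)
  - flowman.hive.analyzeTable=false
  # Skip COMPUTE STATS in the Impala Catalog plugin (default: true)
  - flowman.impala.computeStats=false
  # Do not let errors from external catalogs block processing (default: false)
  - flowman.externalCatalog.ignoreErrors=true
```

The same key=value pairs should also be usable with --conf on the command line or in the namespace configuration, as described at the top of this page.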

Workarounds

Workarounds are sometimes required, especially for not-quite-open-source Big Data platforms.

  • flowman.workaround.analyze_partition (type: boolean) (since Flowman 0.18.0) Enables a workaround for CDP 7.1, where ANALYZE TABLE wouldn’t always work correctly (especially in unit tests). The workaround is enabled by default if the Spark version matches ?.?.?.7.?.?.?.+ (e.g. 2.4.0.7.1.6.0-297) and the Spark repository URL contains “cloudera”.

Example

You can set the properties either at namespace level or at project level in the config section as follows:

# default-namespace.yml

config:
  # Generic Spark configs
  - spark.sql.shuffle.partitions=20
  - spark.sql.session.timeZone=UTC
  # Flowman specific config
  - flowman.workaround.analyze_partition=true
  - flowman.default.relation.migrationStrategy=FAIL

The default namespace is configured via the conf/default-namespace.yml file in your Flowman installation directory.