Impala Metadata#

Impala is another “SQL on Hadoop” execution engine mainly developed and backed up by Cloudera. Impala allows you to access data stored in Hadoop and registered in the Hive metastore, just like Hive itself, but often at a significantly better performance. Unfortunately Impala requires that all changes to Hive tables are either performed directly via Impala itself or that the changed tables are synced from Hive to Impala after changes.

In order to better support environments which rely on Impala, Flowman also supports the second approach of automatically syncing all changes to Hive tables performed in Flowman. This feature is accomplished by the “Impala” plugin for Flowman. The plugin only needs to be enabled and properly configured, and then all changes to Hive will be automatically propagated to Impala

Configuration of Impala Plugin#

First you need to enable the Impala plugin. This needs to be done in the system.yml configuration file in the conf directory of Flowman:

# The system configuration loads plugin before namespaces are instantiated. The Impala plugin may already be required
# within a namespace to define an external catalog, therefore it needs to be loaded in advance.
plugins:
  - flowman-impala

Next you need to configure the Impala plugin as an “external catalog provider” within the namespace configuration file. The default namespace is configured via conf/default-namespace.yml. You need to add the following sections:

# Define the connection to Impala
connections:
  impala:
    kind: jdbc
    url: jdbc:impala://IMPALA_HOST:21050
    properties:
      SocketTimeout: 0

# Setup Impala as an additional catalog besides Hive
catalog:
  kind: impala
  connection: impala

Please consult the Cloudera Impala documentation for a comprehensive list of JDBC properties, which can be set under properties in the connection above.

Using Impala with Kerberos#

When you are using Impala with Kerberos authentication, you also need to specify some additional details (please consult the Cloudera Impala documentation for a comprehensive list of JDBC properties):

# Define the connection to Impala with Kerberos enabled
connections:
  impala:
    kind: jdbc
    url: jdbc:impala://IMPALA_HOST:21050
    properties:
      SocketTimeout: 0
      AuthMech: 1
      AuthType: 1
      KrbRealm: MY.KERBEROS.REALM
      KrbHostFQDN: IMPALA_HOST
      KrbServiceName: impala
      AllowSelfSignedCerts: 1
      CAIssuedCertsMismatch: 1
      SSL: 1

In addition to the namespace configuration, you also need to provide valid Kerberos credentials stored in a keytab. Impala then in turn requires a valid JAAS configuration file, which refers to that keytab. That file may look as follows (of course you need to replace KRB_PRINCIPAL and MY.KERBEROS.REALM):

Client {
  com.sun.security.auth.module.Krb5LoginModule required
  useKeyTab=true
  keyTab="conf/KRB_PRINCIPAL.keytab"
  useTicketCache=true
  principal="KRB_PRINCIPAL@MY.KERBEROS.REALM"
  doNotPrompt=true
  debug=false;
};

Finally, you need to tell Flowman to read in this JAAS file. This can be done by specifying a Java command line option in conf/flowman-env.sh as follows:

SPARK_DRIVER_JAVA_OPTS="-Djava.security.auth.login.config=$FLOWMAN_CONF_DIR/jaas.conf"