JDBC Query Relations

The jdbcQuery relation allows you to access databases using a JDBC driver. Note that you need to put an appropriate JDBC driver onto the classpath of Flowman. This can be done by using an appropriate plugin.

Example

# First specify a connection. This can be used by multiple JDBC relations
connections:
  frontend:
    kind: jdbc
    driver: "$frontend_db_driver"
    url: "$frontend_db_url"
    username: "$frontend_db_username"
    password: "$frontend_db_password"

relations:
  lineitem:
    kind: jdbcQuery
    connection: frontend
    sql: "
      SELECT
        CONCAT('DIR_', li.id) AS lineitem,
        li.campaign_id AS campaign,
        IF(c.demand_type_system = 1, 'S', IF(li.demand_type_system = 1, 'S', 'D')) AS demand_type
      FROM
        line_item AS li
      INNER JOIN
        campaign c
        ON c.id = li.campaign_id
    "
    schema:
      kind: inline
      fields:
        - name: lineitem
          type: string
        - name: campaign
          type: long
        - name: demand_type
          type: string

It is also possible to directly embed the connection as follows:

relations:
  frontend_users:
    kind: jdbcQuery
    # Specify the name of the connection to use
    connection:
      kind: jdbc
      driver: "$frontend_db_driver"
      url: "$frontend_db_url"
      username: "$frontend_db_username"
      password: "$frontend_db_password"
    sql: "
      SELECT
        CONCAT('DIR_', li.id) AS lineitem,
        li.campaign_id AS campaign,
        IF(c.demand_type_system = 1, 'S', IF(li.demand_type_system = 1, 'S', 'D')) AS demand_type
      FROM
        line_item AS li
      INNER JOIN
        campaign c
        ON c.id = li.campaign_id
    "

For most cases, it is recommended not to embed the connection, since this prevents reusing the same connection in multiple places.

The schema is still optional in this case, but it will help mocking the relation for unittests.

Fields

kind (mandatory) (type: string): jdbcQuery
schema (optional) (type: schema) (default: empty): Explicitly specifies the schema of the JDBC source. Alternatively Flowman will automatically try to infer the schema.
description (optional) (type: string) (default: empty): A description of the relation. This is purely for informational purpose.
connection (mandatory) (type: string): The connection field specifies the name of a Connection object which has to be defined elsewhere.
sql (optional) (type: string): This property specifies the SQL query which will be executed by the database for retrieving data. Of course, then only read operations are possible. Since the SQL is directly forwarded to the database, you also need to use the SQL dialect of the corresponding database. Cannot be used together with file.
file (optional) (string) (default: empty): Contains the name of a file containing SQL code of the Hive query. Cannot be used together with sql.
properties (optional) (type: map:string) (default: empty): Specifies any additional properties passed to the JDBC connection. Note that both the JDBC relation and the JDBC connection can define properties. So it is advisable to define all common properties in the connection and more table specific properties in the relation. The connection properties are applied first, then the relation properties. This means that a relation property can overwrite a connection property if it has the same name.

Schema Conversion

The JDBC relation fully supports automatic schema conversion on input and output operations as described in the corresponding section of relations.

Output Modes

The jdbcQuery relation does not support write operations.

Remarks

Mocking JDBC relations

Note that Flowman will rely on schema inference in some important situations, like mocking and generally for describing the schema of a relation. This might create unwanted connections to the physical data source, particular in case of self-contained tests. To prevent Flowman from creating a connection to the physical data source, you simply need to explicitly specify a schema, which will then be used instead of the physical schema in all situations where only schema information is required.