MS SQL Server Plugin#
The MS SQL Server plugin provides a JDBC driver to access MS SQL Server and Azure SQL Server databases via
the JDBC relation. Moreover, it also provides a specialized
sqlserver
relation which uses bulk copy to speed up writing process, and it
also uses temp tables to encapsulate the whole data upload within a transaction.
Provided Entities#
Activation#
The plugin can be easily activated by adding the following section to the default-namespace.yml
plugins:
- flowman-mssqlserver
Fixing JDBC driver version issues#
Some Hadoop distributions (e.g. Cloudera) come along with an outdated MS SQL JDBC connector. This causes problems with
the plugin. Fortunately you can manually explicitly force Spark to use the correct JDBC.
You need to add the following lines to your custom flowman-env.sh
file which is stored in the conf
subdirectory:
# Add MS SQL JDBC Driver. Normally this is handled by the plugin mechanism, but Cloudera already provides some
# old version of the JDBC driver, and this is the only place where we can force to use our JDBC driver
SPARK_JARS="$FLOWMAN_HOME/plugins/flowman-mssqlserver/mssql-jdbc-9.2.1.jre8.jar"
SPARK_OPTS="--conf spark.executor.extraClassPath=mssql-jdbc-9.2.1.jre8.jar --conf spark.driver.extraClassPath=$FLOWMAN_HOME/plugins/flowman-mssqlserver/mssql-jdbc-9.2.1.jre8.jar"
Usage#
In order to connect to an MS SQL Server or Azure SQL database, you need to specify a JDBC connection and use that one in a JDBC relation as follows:
# First specify a connection. This can be used by multiple JDBC relations
connections:
frontend:
kind: jdbc
driver: "com.microsoft.sqlserver.jdbc.SQLServerDriver"
url: "jdbc:sqlserver://my-azure-database.domain.com"
username: "my_username"
password: "secret!password"
properties:
# We need to specify the database name already as part of connection, otherwise the login won't work
databaseName: my_database
relations:
frontend_users:
kind: jdbcTable
# Specify the name of the connection to use
connection: frontend
# Specify schema (within the database)
database: "dbo"
# Specify the table
table: "users"
In order to benefit from faster write operations, you should use the sqlserver
relation
instead of the generic jdbc
relation:
# First specify a connection. This can be used by multiple JDBC relations
connections:
frontend:
kind: jdbc
url: "jdbc:sqlserver://my-azure-database.domain.com"
username: "my_username"
password: "secret!password"
properties:
# We need to specify the database name already as part of connection, otherwise the login won't work
databaseName: my_database
relations:
frontend_users:
kind: sqlserver
# Specify the name of the connection to use
connection: frontend
# Specify schema (within the database)
database: "dbo"
# Specify the table
table: "users"
Data Types#
Flowman will map its built in data types to the following data types in MariaDB
Flowman/Spark Datatype | MariaDB Datatype |
---|---|
string , text |
NVARCHAR(MAX) |
binary |
VARBINARY(MAX) |
tinyint , byte |
BYTE |
smallint , short |
SMALLINT |
int , integer |
INTEGER |
bigint , long |
BIGINT |
boolean , bool |
BIT |
float |
FLOAT |
double |
DOUBLE PRECISION |
decimal(a,b) |
DECIMAL(a,b) |
varchar(n) |
NVARCHAR(n) |
char(n) |
NCHAR(n) |
date |
DATE |
timestamp |
DATETIME2 |
duration |
unsupported |