There are two supported connection methods for Spark targets:
Use the `thrift` connection method if you are connecting to a Thrift server sitting in front of a Spark cluster.
```yaml
your_profile_name:
  target: dev
  outputs:
    dev:
      type: spark
      method: thrift
      schema: [database/schema name]
      host: [hostname]
      port: [port]
      user: [user]
```
Use the `http` method if your Spark provider supports connections over HTTP (e.g. Databricks).
```yaml
your_profile_name:
  target: dev
  outputs:
    dev:
      type: spark
      method: http
      schema: [database/schema name]
      host: [yourorg.sparkhost.com]
      organization: [org id]   # required if Azure Databricks, exclude if AWS Databricks
      port: [port]
      token: [abc123]
      cluster: [cluster id]
      connect_timeout: 60      # optional, default 10
      connect_retries: 5       # optional, default 0
```
Databricks interactive clusters can take several minutes to start up. You may include the optional profile configs `connect_timeout` and `connect_retries`, and dbt will periodically retry the connection.
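The retry behavior can be sketched in plain Python. This is illustrative only, not the adapter's actual implementation: try the connection, and on failure wait `connect_timeout` seconds between up to `connect_retries` additional attempts.

```python
import time


def connect_with_retries(connect, connect_timeout=10, connect_retries=0):
    """Call `connect()`; on failure, sleep `connect_timeout` seconds and
    retry up to `connect_retries` more times (sketch of the profile configs,
    not dbt's real code)."""
    last_error = None
    for attempt in range(connect_retries + 1):
        try:
            return connect()
        except Exception as exc:
            last_error = exc
            if attempt < connect_retries:
                time.sleep(connect_timeout)
    raise last_error
```

With `connect_retries: 0` (the default) a single failure is fatal, which is why a profile pointed at a cold Databricks cluster benefits from raising it.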
Installation and Distribution
dbt's Spark adapter is managed in its own repository, dbt-spark. To use the Spark adapter, you must install the dbt-spark plugin. The following command will install the latest version of dbt-spark as well as the requisite version of dbt Core:
```shell
pip install dbt-spark
```
Usage with EMR
To connect to Spark running on an Amazon EMR cluster, you will need to run `sudo /usr/lib/spark/sbin/start-thriftserver.sh` on the master node of the cluster to start the Thrift server (see the docs for more information). You will also need to connect to port 10001, which connects to the Spark backend Thrift server; port 10000 will instead connect to a Hive backend, which does not work correctly with dbt.
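Before pointing dbt at the cluster, it can help to confirm that something is actually listening on port 10001. A minimal stdlib sketch (the hostname below is a placeholder for your EMR master node):

```python
import socket


def port_open(host, port, timeout=5):
    """Return True if a TCP connection to (host, port) succeeds,
    False otherwise. Useful for checking that the Spark Thrift
    server is up on 10001 (and not just Hive on 10000)."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


# Example with a placeholder hostname:
# port_open("emr-master.internal", 10001)
```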
Most dbt Core functionality is supported, but some features are only available on Delta Lake (Databricks).
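For example, models are typically opted into Delta Lake per project or per model via the adapter's file-format config. A sketch, assuming the dbt-spark `file_format` model config (verify the exact config name and supported values against your adapter version):

```yaml
# dbt_project.yml (sketch)
models:
  my_project:
    +file_format: delta
```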
Some dbt features, available on the core adapters, are not yet supported on Spark:
- Persisting column-level descriptions as database comments