Apache Hive setup

Maintained by: Cloudera
Authors: Cloudera
GitHub repo: cloudera/dbt-hive
PyPI package: dbt-hive
Slack channel: #db-hive
Supported dbt Core version: v1.1.0 and newer
dbt Cloud support: Not Supported
Minimum data platform version: n/a

Installing dbt-hive

Use pip to install the adapter. Before 1.8, installing the adapter would automatically install dbt-core and any additional dependencies. Beginning in 1.8, installing an adapter does not automatically install dbt-core. This is because adapters and dbt Core versions have been decoupled from each other so we no longer want to overwrite existing dbt-core installations. Use the following command for installation:

Configuring dbt-hive

For Hive-specific configuration, please refer to Hive configs.

Connection Methods

dbt-hive can connect to Apache Hive and Cloudera Data Platform clusters. The Impyla library is used to establish connections to Hive.

dbt-hive supports two transport mechanisms:

binary
HTTP(S)

The default mechanism is binary. To use HTTP transport, use the boolean option. For example, use_http_transport: true.

Authentication Methods

dbt-hive supports two authentication mechanisms:

insecure No authentication is used, only recommended for testing.
ldap Authentication via LDAP

Insecure

This method is only recommended if you have a local install of Hive and want to test out the dbt-hive adapter.

~/.dbt/profiles.yml

your_profile_name:
  target: dev
  outputs:
    dev:
      type: hive
      host: localhost
      port: PORT # default value: 10000
      schema: SCHEMA_NAME
      

LDAP

LDAP allows you to authenticate with a username and password when Hive is configured with LDAP Auth. LDAP is supported over Binary & HTTP connection mechanisms.

This is the recommended authentication mechanism to use with Cloudera Data Platform (CDP).

~/.dbt/profiles.yml

your_profile_name:
  target: dev
  outputs:
    dev:
     type: hive
     host: HOST_NAME
     http_path: YOUR/HTTP/PATH # optional, http path to Hive default value: None
     port: PORT # default value: 10000
     auth_type: ldap
     use_http_transport: BOOLEAN # default value: true
     use_ssl: BOOLEAN # TLS should always be used with LDAP to ensure secure transmission of credentials, default value: true
     username: USERNAME
     password: PASSWORD
     schema: SCHEMA_NAME

Note: When creating workload user in CDP, make sure the user has CREATE, SELECT, ALTER, INSERT, UPDATE, DROP, INDEX, READ and WRITE permissions. If you need the user to execute GRANT statements, you should also configure the appropriate GRANT permissions for them. When using Apache Ranger, permissions for allowing GRANT are typically set using "Delegate Admin" option. For more information, see grants and on-run-start & on-run-end.

Kerberos

The Kerberos authentication mechanism uses GSSAPI to share Kerberos credentials when Hive is configured with Kerberos Auth.

~/.dbt/profiles.yml

your_profile_name:
  target: dev
  outputs:
    dev:
      type: hive
      host: HOSTNAME
      port: PORT # default value: 10000
      auth_type: GSSAPI
      kerberos_service_name: KERBEROS_SERVICE_NAME # default value: None
      use_http_transport: BOOLEAN # default value: true
      use_ssl: BOOLEAN # TLS should always be used to ensure secure transmission of credentials, default value: true
      schema: SCHEMA_NAME

Note: A typical setup of Cloudera Private Cloud will involve the following steps to setup Kerberos before one can execute dbt commands:

Get the correct realm config file for your installation (krb5.conf)
Set environment variable to point to the config file (export KRB5_CONFIG=/path/to/krb5.conf)
Set correct permissions for config file (sudo chmod 644 /path/to/krb5.conf)
Obtain keytab using kinit (kinit username@YOUR_REALM.YOUR_DOMAIN)
The keytab is valid for certain period after which you will need to run kinit again to renew validity of the keytab.
User will need CREATE, DROP, INSERT permissions on the schema provided in profiles.yml

Instrumentation

By default, the adapter will collect instrumentation events to help improve functionality and understand bugs. If you want to specifically switch this off, for instance, in a production environment, you can explicitly set the flag usage_tracking: false in your profiles.yml file.

Installation and Distribution

dbt's adapter for Apache Hive is managed in its own repository, dbt-hive. To use it, you must install the dbt-hive plugin.

Using pip

The following commands will install the latest version of dbt-hive as well as the requisite version of dbt-core and impyla driver used for connections.

python -m pip install dbt-hive

Supported Functionality

Name	Supported
Materialization: Table	Yes
Materialization: View	Yes
Materialization: Incremental - Append	Yes
Materialization: Incremental - Insert+Overwrite	Yes
Materialization: Incremental - Merge	No
Materialization: Ephemeral	No
Seeds	Yes
Tests	Yes
Snapshots	No
Documentation	Yes
Authentication: LDAP	Yes
Authentication: Kerberos	Yes

Installing dbt-hive

Configuring dbt-hive

Connection Methods​

Authentication Methods​

Insecure​

LDAP​

Kerberos​

Instrumentation​

Installation and Distribution​

Using pip​

Supported Functionality​

Connection Methods

Authentication Methods

Insecure

LDAP

Kerberos

Instrumentation

Installation and Distribution

Using pip

Supported Functionality