Monitor dbt performance with Bigeye

What is dbt observability and why is it important?

Observability for dbt means having monitoring, alerting, and visibility over the dbt run and test results produced by your jobs. Just like any other production environment, your dbt project benefits from monitoring and logging. Without them, as your dbt project grows it becomes extremely difficult to understand which models and tests were executed on each run and what the results were. Data owners benefit from knowing how long models take to run, when jobs hit bottlenecks, when data quality tests fail, and which tests are flaky. This article provides a step-by-step guide to implementing a dbt observability solution with Bigeye.

Configure dbt run & test logging

Before we can gather insights into the behavior of our dbt runs over time, we first have to log the results of each run and test to a table. This can be done in four relatively simple steps:

  1. Create a table to store results
  2. Parse the run results
  3. Use an on-run-end hook to insert results into the table
  4. Update dbt jobs

Step 1 - Create a table to store dbt results

The execution of a resource in dbt generates what is called a Result object. This object contains a wealth of information about the timing and metadata of that execution. At the end of an invocation, dbt records these objects in a file named run_results.json (dbt's documentation includes an example run_results.json file). The first step in logging dbt results is to review the Result object and create an empty table containing the fields you would like to store.

Here is a sample model definition using an empty select query whose schema matches the fields we believe are worth storing. The name of this model will be important in later steps; here we have named it dbt_results.

{{  
 config(  
   materialized = 'incremental',  
   transient = False,  
   unique_key = 'result_id'  
 )  
}}  
with empty_table as (  
   select  
       null as result_id,  
       null as invocation_id,  
       null as unique_id,  
       null as database_name,  
       null as schema_name,  
       null as name,  
       null as resource_type,  
       null as status,  
       cast(null as float) as execution_time_seconds,  
       cast(null as int) as rows_affected,  
       cast(null as timestamp) as dbt_run_at
)

select * from empty_table  
-- This is a filter so we will never actually insert these values  
where 1 = 0

Step 2 - Parse the run results

The Result object contains many fields, some of them deeply nested. To load the data into the table defined above, we need a macro that flattens each result object and extracts only the fields we are interested in storing. Here is a macro that parses those results.

{% macro parse_dbt_results(results) %}  
   -- Create a list of parsed results  
    {%- set parsed_results = [] %}
   -- Flatten results and add to list  
   {% for run_result in results %}  
       -- Convert the run result object to a simple dictionary  
       {% set run_result_dict = run_result.to_dict() %}  
       -- Get the underlying dbt graph node that was executed  
       {% set node = run_result_dict.get('node') %}  
       {% set rows_affected = run_result_dict.get('adapter_response', {}).get('rows_affected', 0) %}  
       {%- if not rows_affected -%}  
           {% set rows_affected = 0 %}  
       {%- endif -%}  
       {% set parsed_result_dict = {  
               'result_id': invocation_id ~ '.' ~ node.get('unique_id'),  
               'invocation_id': invocation_id,  
               'unique_id': node.get('unique_id'),  
               'database_name': node.get('database'),  
               'schema_name': node.get('schema'),  
               'name': node.get('name'),  
               'resource_type': node.get('resource_type'),  
               'status': run_result_dict.get('status'),  
               'execution_time_seconds': run_result_dict.get('execution_time'),  
               'rows_affected': rows_affected  
               }%}  
       {% do parsed_results.append(parsed_result_dict) %}  
   {% endfor %}  
   {{ return(parsed_results) }}  
{% endmacro %}

Step 3 - Use an on-run-end hook to insert results into the table

With the table defined and the macro in place to extract the relevant fields, we can now create a new macro that inserts the parsed results into the table. The results Jinja variable is only available after a run has completed, so this macro needs to be called from an on-run-end hook. Here is some example code to do just that.

{% macro log_dbt_results(results) %}  
   -- depends_on: {{ ref('dbt_results') }}  
   {%- if execute -%}  
       {%- set parsed_results = parse_dbt_results(results) -%}  
       {%- if parsed_results | length  > 0 -%}  
           {% set insert_dbt_results_query -%}  
               insert into {{ ref('dbt_results') }}  
                   (  
                       result_id,  
                       invocation_id,  
                       unique_id,  
                       database_name,  
                       schema_name,  
                       name,  
                       resource_type,  
                       status,  
                       execution_time_seconds,  
                       rows_affected,  
                       dbt_run_at  
               ) values  
                   {%- for parsed_result_dict in parsed_results -%}  
                       (  
                           '{{ parsed_result_dict.get('result_id') }}',  
                           '{{ parsed_result_dict.get('invocation_id') }}',  
                           '{{ parsed_result_dict.get('unique_id') }}',  
                           '{{ parsed_result_dict.get('database_name') }}',  
                           '{{ parsed_result_dict.get('schema_name') }}',  
                           '{{ parsed_result_dict.get('name') }}',  
                           '{{ parsed_result_dict.get('resource_type') }}',  
                           '{{ parsed_result_dict.get('status') }}',  
                           {{ parsed_result_dict.get('execution_time_seconds') }},  
                           {{ parsed_result_dict.get('rows_affected') }},  
                           current_timestamp  
                       ) {{- "," if not loop.last else "" -}}  
                   {%- endfor -%}  
           {%- endset -%}  
           {%- do run_query(insert_dbt_results_query) -%}  
       {%- endif -%}  
   {%- endif -%}  
    -- This macro is called from an on-run-end hook and therefore must return query text to run. Returning an empty string will do the trick
   {{ return ('') }}  
{% endmacro %}

Make sure to add this macro to the dbt_project.yml file as an on-run-end hook, like so:

on-run-end:
- "{{ log_dbt_results(results) }}"

Step 4 - Update dbt jobs

Since the macro runs as a hook at the end of each invocation, the dbt_results table needs to be built before anything can be inserted into it. You can either do this as a one-off before executing your first job, or add it as a quick first step to any pre-built jobs you already have, as shown below.

dbt run --select dbt_results
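For example, the command list of an existing job could become something like the following; the later commands are illustrative and should simply mirror whatever your job already runs.

dbt run --select dbt_results   # make sure the logging table exists first
dbt run                        # your existing run step
dbt test                       # your existing test step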

Set up Bigeye metrics to monitor results

The next time you run your dbt project, the dbt_results table will be populated with useful logging information that helps you keep track of your results. With this in place, you can create Bigeye metrics to monitor the behavior of your dbt project and give data owners the information they need. The first step is to make sure the new dbt_results table can be queried by your Bigeye service account (an example grant statement follows the list below). Once the table is visible in Bigeye, set its row creation time to dbt_run_at so metric queries avoid full scans of this logging table. We recommend the following suite of metrics (a full bigconfig file containing all of them can be found at the bottom of this article):

  • Job performance by model
  • Job outcomes by model
  • Job success rate by model
  • Testing success rates
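To grant that read access, a few statements along these lines are usually enough. This sketch assumes Snowflake, a database and schema named analytics and logging, and a role named BIGEYE_ROLE; all of these are placeholders you should swap for your own warehouse objects and service account role.

-- Placeholders: analytics (database), logging (schema), BIGEYE_ROLE (Bigeye service account role)
grant usage on database analytics to role BIGEYE_ROLE;
grant usage on schema analytics.logging to role BIGEYE_ROLE;
grant select on table analytics.logging.dbt_results to role BIGEYE_ROLE;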

Metric 1 - Job Performance by Model

One question most dbt developers ask is: how performant are my models, and have recent changes to those models meaningfully shifted that performance? To track this in Bigeye, here is our metric recommendation (only non-default settings listed), followed by a roughly equivalent SQL query for intuition.

  • Column Name: execution_time_seconds
  • Metric Type: Average
  • Metric Name: [Average] Execution Time in Seconds by Model
  • Description: Tracks the average execution time per day of dbt model runs, using autothresholds.
  • Conditions: resource_type = 'model'
  • Group By: name
  • Schedule: How often dbt jobs are executed (i.e. every 12 hours)
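
For intuition, this metric behaves roughly like the query below, run once per scheduling interval. The one-day window mirrors the lookback in the bigconfig file at the bottom of this article, and dateadd is Snowflake-flavored, so adjust the date math for your warehouse.

-- Roughly equivalent to the [Average] Execution Time in Seconds by Model metric
select
    name,
    avg(execution_time_seconds) as avg_execution_time_seconds
from dbt_results
where resource_type = 'model'
  and dbt_run_at >= dateadd('day', -1, current_timestamp)
group by name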

Metric 2 - Job Outcomes by Model

The natural next question is: what was the outcome of my model runs, and more specifically, how many rows were affected on each run? To track this in Bigeye, here is our metric recommendation (only non-default settings listed).

  • Column Name: rows_affected
  • Metric Type: Average
  • Metric Name: [Average] Rows Affected by Model
  • Description: Tracks the average number of rows affected per day of dbt model runs, using autothresholds.
  • Conditions: resource_type = 'model'
  • Group By: name
  • Schedule: How often dbt jobs are executed (i.e. every 12 hours)

Metric 3 - Job Success Rate by Model

Beyond row counts, you will want to know whether your model runs actually succeeded, and how often. To track a daily success rate per model in Bigeye, here is our metric recommendation (only non-default settings listed), followed by a roughly equivalent SQL query.

  • Column Name: status
  • Metric Type: Value in List %
  • Metric Name: [Success Rate %] Daily Success Rate by Model
  • Description: Tracks the percentage of models that succeeded per day of dbt model runs, using autothresholds.
  • List (string): success
  • Conditions: resource_type = 'model'
  • Group By: name
  • Schedule: How often dbt jobs are executed (i.e. every 12 hours)
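
Conceptually, this Value in List % metric is close to the query below; as before, the date math is Snowflake-flavored and the one-day window mirrors the bigconfig lookback, while Bigeye evaluates the real metric on the schedule you configure.

-- Roughly equivalent to the [Success Rate %] Daily Success Rate by Model metric
select
    name,
    100.0 * avg(case when status = 'success' then 1 else 0 end) as success_rate_pct
from dbt_results
where resource_type = 'model'
  and dbt_run_at >= dateadd('day', -1, current_timestamp)
group by name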

You may want to consider setting a constant threshold with a lower bound of 100% for this metric. This depends on whether you think a single model failure warrants an alert sent to your team or not.

Metric 4 - Testing Success Rates

Switching context a bit, another behavior you may want to track is your dbt tests. Depending on your workflow, you may not want dbt tests to act as a circuit breaker that kills your dbt jobs, but you will still want to know the results of those tests. To track this in Bigeye, here is our metric recommendation (only non-default settings listed).

  • Column Name: status
  • Metric Type: Value in List %
  • Metric Name: [Success Rate %] Daily Testing Success Rate
  • Description: Tracks the percentage of all tests that passed per day, using autothresholds.
  • List (string): pass
  • Conditions: resource_type = 'test'
  • Schedule: How often dbt jobs are executed (i.e. every 12 hours)

You may want to consider setting a constant threshold with a lower bound of 100% for this metric. This depends on whether you think a single test failure warrants an alert sent to your team or not.

All Metrics - Bigconfig Sample File

If you would prefer to deploy these metrics via bigconfig, here is a file that deploys all of them (and a few extras) into a collection called ‘dbt Performance Monitoring’. The only required step before deploying this file is to replace the source, database, and schema placeholders in the column selectors so they point at your dbt_results table (and adjust the table name if you named the model something else).

type: BIGCONFIG_FILE
auto_apply_on_indexing: True

row_creation_times:
  column_selectors:
    - name: "<source-name>.<database-name>.<schema-name>.DBT_RESULTS.DBT_RUN_AT"

tag_deployments:
  - collection:
      name: dbt Performance Monitoring
      description: 'Track the performance and outcomes of dbt runs and tests'
      notification_channels:
        - slack: '#dbt-monitoring'
    deployments:
      - column_selectors:
          - name: <source-name>.<database-name>.<schema-name>.DBT_RESULTS.*
        metrics:
          - metric_type:
              predefined_metric: FRESHNESS
          - metric_type:
              predefined_metric: VOLUME
      - column_selectors:
          - name: <source-name>.<database-name>.<schema-name>.DBT_RESULTS.EXECUTION_TIME_SECONDS
        metrics:
          - metric_type:
              predefined_metric: AVERAGE
            metric_name: '[Average] Execution Time in Seconds by Model'
            description: 'Tracks the average execution time per day of dbt model runs, using autothresholds. '
            schedule_frequency: &every12hours
              interval_type: HOURS
              interval_value: 12
            lookback: &1daylookback
              lookback_type: DATA_TIME
              lookback_window:
                interval_type: DAYS
                interval_value: 1
            conditions: &modelsonly
              - "resource_type = 'model'"
            group_by:
              - name
          - metric_type:
              predefined_metric: SUM
            metric_name: '[Sum] Total Execution Time in Seconds Per Day'
            description: 'Tracks the total execution time per day of all dbt model runs, using autothresholds. '
            schedule_frequency: *every12hours
            lookback: *1daylookback
            conditions: *modelsonly
      - column_selectors:
          - name: <source-name>.<database-name>.<schema-name>.DBT_RESULTS.ROWS_AFFECTED
        metrics:
          - metric_type:
              predefined_metric: AVERAGE
            metric_name: '[Average] Rows Affected by Model'
            description: 'Tracks the average number of rows affected per day of dbt model runs, using autothresholds. '
            schedule_frequency: *every12hours
            lookback: *1daylookback
            conditions: *modelsonly
            group_by:
              - name
          - metric_type:
              predefined_metric: SUM
            metric_name: '[Sum] Total Rows Affected Per Day'
            description: 'Tracks the total number of rows that are affected per day using autothresholds.'
            schedule_frequency: *every12hours
            lookback: *1daylookback
            conditions: *modelsonly
      - column_selectors:
          - name: <source-name>.<database-name>.<schema-name>.DBT_RESULTS.STATUS
        metrics:
          - metric_type:
              predefined_metric: PERCENT_VALUE_IN_LIST
            parameters:              
              - key: list
                string_value: "success" 
            metric_name: '[Success Rate %] Daily Success Rate by Model'
            description: 'Tracks the percentage of models that succeeded per day of dbt model runs, using autothresholds. '
            schedule_frequency: *every12hours
            lookback: *1daylookback
            conditions: *modelsonly
            group_by:
              - name
          - metric_type:
              predefined_metric: PERCENT_VALUE_IN_LIST
            parameters:              
              - key: list
                string_value: "success" 
            metric_name: '[Success Rate %] Overall Daily Success Rate'
            description: 'Tracks the percentage of all models that succeeded per day of dbt model runs, using autothresholds.'
            schedule_frequency: *every12hours
            lookback: *1daylookback
            conditions: *modelsonly
          - metric_type:
              predefined_metric: PERCENT_VALUE_IN_LIST
            parameters:              
              - key: list
                string_value: "pass" 
            metric_name: '[Success Rate %] Daily Testing Success Rate'
            description: 'Tracks the percentage of all tests that passed per day, using autothresholds. '
            schedule_frequency: *every12hours
            lookback: *1daylookback
            conditions:
              - "resource_type = 'test'"