Bigconfig

Overview

Bigconfig is a YAML-based, declarative monitoring-as-code solution. Data engineers can deploy Bigeye metrics from the command line for convenient and scalable data quality monitoring.

Prerequisites

Bigconfig can be applied via the Bigeye-CLI. See here for instructions on how to install the Bigeye-CLI and log into your workspace.

Bigconfig CLI Commands

Bigconfig has two relevant CLI commands: plan and apply. Run bigeye bigconfig --help in the CLI for convenient access to relevant commands and parameters.

Plan

bigeye bigconfig plan

The plan command performs a “dry run” and returns a report of the changes expected when the provided Bigconfig file is applied. If the YAML file has validation errors, a FIXME_ YAML file is generated with inline errors to help you amend your Bigconfig. Otherwise, the command produces a report listing the metrics to be created, updated, and deleted, as well as the columns to be set as row creation time.

Command options:

  • -ip, --input_path (optional): Input path containing the Bigconfig YAML file. If none is provided, any .yaml or .yml files in the current working directory with type: BIGCONFIG_FILE will be used.
  • -op, --output_path (optional): Output path where reports and FIXME files will be saved. If no output path is defined, the current working directory will be used.
  • -psn, --purge_source_name (optional): Deletes all Bigconfig metrics in a particular source. Accepts a list of data source names to purge, e.g. -psn source_1 -psn source_2.
  • -purge_all, --purge_all_sources (optional): deletes all Bigconfig metrics in the workspace: True or False. [default: False]

Apply

bigeye bigconfig apply

The apply command executes the Bigconfig to create, update, and delete metrics, set row creation times, and configure SLAs. If the YAML file has validation errors, a FIXME_ YAML file is generated with inline errors to help you amend your Bigconfig. Otherwise, the command produces a report listing the metrics created, updated, and deleted, as well as the columns set as row creation time.

Command options:

  • -ip, --input_path (optional): Input path containing the Bigconfig YAML file. If none is provided, any .yaml or .yml files in the current working directory with type: BIGCONFIG_FILE will be used.
  • -op, --output_path (optional): Output path where reports and FIXME files will be saved. If no output path is defined, the current working directory will be used.
  • -psn, --purge_source_name (optional): Deletes all Bigconfig metrics in a particular source. Accepts a list of data source names to purge, e.g. -psn source_1 -psn source_2.
  • -purge_all, --purge_all_sources (optional): deletes all Bigconfig metrics in the workspace: True or False. [default: False]

Bigconfig YAML File

The bigconfig YAML file is made up of several modules. Either the table deployment or the tag deployment module is required. All other modules are optional and are designed to help you scale your data observability across your warehouse(s).

View a complete sample bigconfig at the bottom of this page to get started. Read below for more tips on how to customize to your data and monitoring goals.

Start Simple, Build for Scale

We recommend starting simple: the most basic bigconfig contains only the table deployment module. All other modules are optional and are designed to help you scale your data observability across your warehouse(s).
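
As a rough illustration of this advice, a minimal file might look like the sketch below. It reuses the table name from the examples on this page and assumes that the sla block and row_creation_time can be omitted, which the text implies but does not state outright.

```yaml
# Minimal Bigconfig sketch: one table, one table-level metric.
# Assumption: sla and row_creation_time are optional here.
type: BIGCONFIG_FILE
table_deployments:
  - deployments:
      - fq_table_name: analytics_warehouse.kpi_reports.maus
        table_metrics:
          - metric_type:
              predefined_metric: COUNT_ROWS
```

Running bigeye bigconfig plan against a file like this is a low-risk way to confirm the structure before adding more modules.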

Table Deployments

Deploy metrics for important tables by listing metrics for each column.

Table deployments include a fully qualified table name, row creation time, table metrics, and metrics by column. When listing metrics, you can either use a [saved_metric_id](https://docs.bigeye.com/docs/bigconfig#saved-metric-definitions-optional) or inline your metric definition. As with saved_metric_definitions, only metric_type is required; all other attributes are optional and fall back to workspace defaults if not specified.

You can group one or more table deployments into an SLA so you are able to route notifications effectively and see a status summary of metrics in the Bigeye application. You can use an existing SLA or create a new SLA. SLA attributes include:

  • name: required
  • notification_channels:
    • type: SLACK or EMAIL
    • value: slack channel with # or email address, one per list item

Valid table metrics include:

  • COUNT_ROWS
  • ROWS_INSERTED
  • HOURS_SINCE_LAST_LOAD
  • COUNT_READ_QUERIES

Valid column metrics depend on the column type. For example, AVERAGE metrics apply only to numeric columns. See the full list of available metrics here. The FIXME file will contain errors if you attempt to deploy an invalid metric on a column in a table deployment.

Note: either a table_deployment or a tag_deployment is required to create metrics via bigconfig. Wildcards are not supported in table deployments.

Example:

table_deployments:
  - sla:
      name: prod_analytics_monthly_actives
      notification_channels:
        - slack: '#prod_analytics'
    deployments:
      - fq_table_name: analytics_warehouse.kpi_reports.maus
        row_creation_time: updated_at
        table_metrics:
          - metric_type: 
              predefined_metric: COUNT_ROWS
          - metric_type: 
              predefined_metric: HOURS_SINCE_LAST_LOAD
        columns:  
          - column_name: user_id
            metrics:  
              - metric_type:
                  predefined_metric: PERCENT_NULL
              - metric_type:
                  predefined_metric: COUNT_DUPLICATES
          - column_name: device_type
            metrics:
              - metric_type:
                  predefined_metric: PERCENT_NULL
              - metric_type:
                  predefined_metric: COUNT_DISTINCT
          - column_name: total_logins
            metrics:
              - metric_type:
                  predefined_metric: PERCENT_NULL
              - metric_type:
                  predefined_metric: AVERAGE
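
The example above inlines every metric. A deployment can also mix in saved metrics by ID. The sketch below assumes a saved metric named no_nulls is defined under saved_metric_definitions in the same file (as in the Saved Metric Definitions example), and that a saved_metric_id reference takes the same form in a table deployment as it does in a tag deployment.

```yaml
table_deployments:
  - sla:
      name: prod_analytics_monthly_actives
      notification_channels:
        - slack: '#prod_analytics'
    deployments:
      - fq_table_name: analytics_warehouse.kpi_reports.maus
        row_creation_time: updated_at
        columns:
          - column_name: user_id
            metrics:
              # Reference to a saved metric (assumed to be defined elsewhere in the file)
              - saved_metric_id: no_nulls
              # Inline definition that relies on workspace defaults
              - metric_type:
                  predefined_metric: COUNT_DUPLICATES
```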

Saved Metric Definitions (Optional)

Define reusable metrics with custom configurations. You can deploy these metrics on a specified tag and/or in a table deployment later. For example, you may want to save NULL and duplicate metrics with a constant threshold of zero to apply to ID columns that should never be NULL or duplicated.

To define a saved metric, you need to specify a saved_metric_id and a metric_type; both are required. All other attributes are optional. If they are not provided, metrics will be created with workspace defaults. A full list of attributes and their defaults is below:

  • saved_metric_id: required
  • metric_type: required
  • metric_name: optional name used to identify the monitor; shown in notifications and metric charts. Defaults to the metric type.
  • description: optional description shown on the metric page
  • schedule_frequency: defaults to every 6 hours
    • interval_type: DAYS or HOURS
    • interval_value: integer value
  • lookback: only valid if row creation time is set on the table
    • lookback_type: DATA_TIME, CLOCK_TIME, or METRIC_TIME. Learn more about lookback types. Defaults to DATA_TIME.
    • lookback_window: defaults to 2 days
      • interval_type: DAYS or HOURS
      • interval_value: integer value
    • bucket_size_seconds: aggregation for the metric query; only for metrics with METRIC_TIME lookback. Defaults to 86400 (1 day).
  • conditions: list of conditions to add to the metric query as WHERE clauses; conditions are combined with AND
  • group_by: list of fields in the metric table to group by
  • threshold: defaults to autothresholds with medium width
    • type: AUTO, RELATIVE, STDDEV, or CONSTANT. Learn more.
    • sensitivity: autothresholds only; defaults to medium
    • upper_bound: constant, relative, and stddev thresholds only
    • lower_bound: constant, relative, and stddev thresholds only
    • lookback: relative and stddev thresholds only
  • notification_channels: accepts a list of notification channels, each with the following:
    • type: SLACK or EMAIL
    • value: slack channel name or email address

Note that this module is optional. You can always inline metrics and custom configurations as needed.

Default Settings Keep Things Clean

Only a saved_metric_id and a metric_type are required in saved_metric_definitions. Bigconfig automatically applies your workspace default settings for everything else, so you only have to specify the attributes you want to customize.

Example:

saved_metric_definitions:
  metrics:
    - saved_metric_id: no_nulls
      metric_type: 
        predefined_metric: PERCENT_NULL
      threshold:
        type: CONSTANT
        upper_bound: 0 
    - saved_metric_id: no_dupes
      metric_type: 
        predefined_metric: COUNT_DUPLICATES
      threshold:
        type: CONSTANT
        upper_bound: 0
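
The saved metrics above override only the threshold. As a sketch of how more of the attributes listed earlier might combine, the following definition sets a name, schedule, lookback, condition, and constant threshold. The saved_metric_id, metric name, condition string, and channel are hypothetical, and the exact value formats (for example, how conditions are quoted) are assumptions worth verifying with bigeye bigconfig plan.

```yaml
saved_metric_definitions:
  metrics:
    - saved_metric_id: daily_null_check        # hypothetical ID
      metric_type:
        predefined_metric: PERCENT_NULL
      metric_name: Daily NULL check            # shown in notifications and charts
      description: Percent of NULL values over the last day
      schedule_frequency:
        interval_type: HOURS
        interval_value: 24
      lookback:                                # needs row creation time set on the table
        lookback_type: DATA_TIME
        lookback_window:
          interval_type: DAYS
          interval_value: 1
      conditions:
        - "is_deleted = false"                 # hypothetical; combined with AND
      threshold:
        type: CONSTANT
        upper_bound: 0
      notification_channels:
        - slack: '#data_alerts'                # form used in the SLA examples on this page
```

Any attribute left out of a definition like this falls back to the workspace default described in the list above.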

Tag Definitions (Optional)

Define tags to deploy metrics on a dynamic set of columns. You can also use tags to select columns that will be used as row creation times across multiple tables. For example, you may want to tag all ID columns for format, duplicate and NULL monitoring. Alternatively, you may want to tag an important schema or collection of tables for analytics monitoring.

The tag_definitions module consists of one or more tag_ids, each defined by one or more column_selectors. Each tag_id must be unique in the workspace. column_selectors are fully qualified column names including source, schema, table, and column. You can use wildcards in column_selectors.

Example:

tag_definitions: 
  - tag_id: ID_FIELDS
    column_selectors:
      - name: analytics_warehouse.*.*.*_id
      - name: analytics_warehouse.*.*.*_ID
      - name: analytics_warehouse.*.*.*Id
  - tag_id: INCREMENTAL_IDS
    column_selectors:
      - name: analytics_warehouse.kpi_reports.*.created_at
      - name: analytics_warehouse.revenue_reports.*.updated_at   
  - tag_id: PROD_TABLES
    column_selectors:
      - name: analytics_warehouse.*_prod.*.*
  - tag_id: EXEC_REPORTING
    column_selectors:
      - name: analytics_warehouse.kpi_reports.*.*
      - name: analytics_warehouse.revenue_reports.*.*

Row Creation Times (Optional)

Set row creation time across multiple tables with a tag_id or inline column_selectors. For example, you may want to always set the updated_at column as the row creation time across your source.

All columns must be valid timestamp columns. Only one column per table may be selected as row creation time.

Example:

row_creation_times:
  tag_ids:
    - INCREMENTAL_IDS 
  column_selectors:
    - name: analytics_warehouse.staging.*.createdate
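
The updated_at case described above could be written with a single wildcard selector. This is a sketch that assumes every matched updated_at column is a valid timestamp column and that no table also matches another row creation time selector.

```yaml
row_creation_times:
  column_selectors:
    # Match updated_at in every schema and table of the source
    - name: analytics_warehouse.*.*.updated_at
```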

Tag Deployments

Deploy metrics on predefined tags. For example, you can apply your saved NULL and duplicate metrics on the ID_FIELDS tag to ensure all ID columns across your warehouse are consistently monitored.

When listing metrics, you can either use a saved_metric_id or inline your metric definition. As with saved_metric_definitions, only metric_type is required; all other attributes are optional and fall back to workspace defaults if not specified.

When deploying metrics via tag, metrics are automatically created only on valid column types. For example, if you deploy an AVERAGE metric on a tag that includes all columns and tables in a schema, average metrics will only be created on numeric columns. You can also deploy table-level metrics on a tag, as long as the tag definition uses a column wildcard, e.g. source.schema.table.*

You can group one or more tag deployments into an SLA so you are able to route notifications effectively and see a status summary of metrics in the Bigeye application. You can use an existing SLA or create a new SLA. SLA attributes include:

  • name: required
  • notification_channels:
    • type: SLACK or EMAIL
    • value: slack channel with # or email address, one per list item

Note: either a table_deployment or a tag_deployment is required to create metrics via bigconfig.

Example:

tag_deployments:
  - sla:
      name: prod_data_ops
      notification_channels:
        - slack: '#data_alerts'
    deployments:
        - tag_id: ID_FIELDS 
          metrics:
            - saved_metric_id: no_nulls
            - saved_metric_id: no_dupes
        - tag_id: PROD_TABLES
          metrics:
            - metric_type: 
                predefined_metric: COUNT_ROWS
            - metric_type: 
                predefined_metric: HOURS_SINCE_LAST_LOAD
            - metric_type: 
                predefined_metric: PERCENT_NULL
        - tag_id: EXEC_REPORTING 
          metrics:
            - metric_type: 
                predefined_metric: COUNT_ROWS
            - metric_type: 
                predefined_metric: HOURS_SINCE_LAST_LOAD
            - metric_type: 
                predefined_metric: AVERAGE
            - metric_type: 
                predefined_metric: MIN
            - metric_type: 
                predefined_metric: MAX
            - metric_type: 
                predefined_metric: VARIANCE
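
Inline metrics in a tag deployment can carry the same optional attributes as saved metrics. The sketch below reuses the ID_FIELDS tag and overrides the threshold inline instead of referencing a saved metric. The SLA name is hypothetical, and the threshold keys are assumed to match those listed under Saved Metric Definitions.

```yaml
tag_deployments:
  - sla:
      name: id_field_checks          # hypothetical SLA name
      notification_channels:
        - slack: '#data_alerts'
    deployments:
      - tag_id: ID_FIELDS
        metrics:
          - metric_type:
              predefined_metric: PERCENT_NULL
            # Inline override; same intent as the no_nulls saved metric
            threshold:
              type: CONSTANT
              upper_bound: 0
```

A saved metric is usually the better choice when the same override is needed in several deployments, while an inline override keeps a one-off customization next to the tag it applies to.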

Example Template

Below is a complete bigconfig example with all modules included. Copy/paste it as a template to create your own.

type: BIGCONFIG_FILE
tag_definitions: 
  - tag_id: ID_FIELDS
    column_selectors:
      - name: analytics_warehouse.*.*.*_id
      - name: analytics_warehouse.*.*.*_ID
      - name: analytics_warehouse.*.*.*Id
  - tag_id: INCREMENTAL_IDS
    column_selectors:
      - name: analytics_warehouse.kpi_reports.*.created_at
      - name: analytics_warehouse.revenue_reports.*.updated_at   
  - tag_id: PROD_TABLES
    column_selectors:
      - name: analytics_warehouse.*_prod.*.*
  - tag_id: EXEC_REPORTING
    column_selectors:
      - name: analytics_warehouse.kpi_reports.*.*
      - name: analytics_warehouse.revenue_reports.*.*

row_creation_times:
  tag_ids:
    - INCREMENTAL_IDS 
  column_selectors:
    - name: analytics_warehouse.staging.*.createdate

saved_metric_definitions:
  metrics:
    - saved_metric_id: no_nulls
      metric_type: 
        predefined_metric: PERCENT_NULL
      threshold:
        type: CONSTANT
        upper_bound: 0 
    - saved_metric_id: no_dupes
      metric_type: 
        predefined_metric: COUNT_DUPLICATES
      threshold:
        type: CONSTANT
        upper_bound: 0 
        
tag_deployments:
  - sla:
      name: prod_data_ops
      notification_channels:
        - slack: '#data_alerts'
    deployments:
        - tag_id: ID_FIELDS 
          metrics:
            - saved_metric_id: no_nulls
            - saved_metric_id: no_dupes
        - tag_id: PROD_TABLES
          metrics:
            - metric_type: 
                predefined_metric: COUNT_ROWS
            - metric_type: 
                predefined_metric: HOURS_SINCE_LAST_LOAD
            - metric_type: 
                predefined_metric: PERCENT_NULL
        - tag_id: EXEC_REPORTING 
          metrics:
            - metric_type: 
                predefined_metric: COUNT_ROWS
            - metric_type: 
                predefined_metric: HOURS_SINCE_LAST_LOAD
            - metric_type: 
                predefined_metric: AVERAGE
            - metric_type: 
                predefined_metric: MIN
            - metric_type: 
                predefined_metric: MAX
            - metric_type: 
                predefined_metric: VARIANCE

table_deployments:
  - sla:
      name: prod_analytics_monthly_actives
      notification_channels:
        - slack: '#prod_analytics'
    deployments:
      - fq_table_name: analytics_warehouse.kpi_reports.maus
        row_creation_time: updated_at
        table_metrics:
          - metric_type: 
              predefined_metric: COUNT_ROWS
          - metric_type: 
              predefined_metric: HOURS_SINCE_LAST_LOAD
        columns:  
          - column_name: user_id
            metrics:  
              - metric_type:
                  predefined_metric: PERCENT_NULL
              - metric_type:
                  predefined_metric: COUNT_DUPLICATES
          - column_name: device_type
            metrics:
              - metric_type:
                  predefined_metric: PERCENT_NULL
              - metric_type:
                  predefined_metric: COUNT_DISTINCT
          - column_name: total_logins
            metrics:
              - metric_type:
                  predefined_metric: PERCENT_NULL
              - metric_type:
                  predefined_metric: AVERAGE