Bigconfig

Overview

Bigconfig is a YAML-based, declarative monitoring-as-code solution. Data engineers can deploy Bigeye metrics from the command line for convenient and scalable data quality monitoring.

Prerequisites

Bigconfig is applied via the Bigeye CLI. See here for instructions on how to install the Bigeye CLI and log in to your workspace.

Bigconfig CLI Commands

Bigconfig has two relevant CLI commands: plan and apply. Run bigeye bigconfig --help in the CLI for convenient access to relevant commands and parameters.

Plan

bigeye bigconfig plan

The plan command performs a “dry run” and returns a report of the expected changes when the provided Bigconfig file is applied. If there are validation errors in the YAML file, a FIXME_ YAML file will be generated with inline errors to help you amend your Bigconfig. Otherwise, a report will be created listing the metrics to be created, updated, and deleted, as well as the columns set as row creation time.

Command options:

  • -ip, --input_path (optional): List of paths containing Bigconfig YAML file(s). If none is provided, any .yaml or .yml files in the current working directory with type: BIGCONFIG_FILE will be used.
  • -op, --output_path (optional): Output path where reports and FIXME files will be saved. If no output path is defined, the current working directory will be used.
  • -psn, --purge_source_name (optional): Deletes all Bigconfig metrics in a particular source. Accepts a list of data source names to purge, e.g. -psn source_1 -psn source_2.
  • -purge_all, --purge_all_sources (optional): Deletes all Bigconfig metrics in the workspace: True or False. [default: False]
  • -r, --recursive (optional): Search ALL input directories recursively. [default: False]
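For example, a minimal sketch of a plan run, assuming your Bigconfig files live in a ./bigconfig directory and you want reports written to ./reports (both paths are placeholders):

# Dry run against Bigconfig files in ./bigconfig; write the plan report (or FIXME files) to ./reports
bigeye bigconfig plan -ip ./bigconfig -op ./reports

# Same, but search ./bigconfig recursively for type: BIGCONFIG_FILE yaml files
bigeye bigconfig plan -ip ./bigconfig -op ./reports -r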

Apply

bigeye bigconfig apply

The apply command executes the Bigconfig: it creates, updates, and deletes metrics, and sets row creation times and SLAs. If there are validation errors in the YAML file, a FIXME_ YAML file will be generated with inline errors to help you amend your Bigconfig. Otherwise, a report will be created listing the metrics created, updated, and deleted, as well as the columns set as row creation time. By default, an apply will submit the bigconfig to a queue.

Command options

  • -ip, --input_path (optional): List of paths containing Bigconfig YAML file(s). If none is provided, any .yaml or .yml files in the current working directory with type: BIGCONFIG_FILE will be used.
  • -op, --output_path (optional): Output path where reports and FIXME files will be saved. If no output path is defined, the current working directory will be used.
  • -psn, --purge_source_name (optional): Deletes all Bigconfig metrics in a particular source. Accepts a list of data source names to purge, e.g. -psn source_1 -psn source_2.
  • -purge_all, --purge_all_sources (optional): Deletes all Bigconfig metrics in the workspace: True or False. [default: False]
  • -nq, --no-queue (optional): Bypass the queue endpoint for bigconfig. [default: False]
  • -r, --recursive (optional): Search ALL input directories recursively. [default: False]
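For example, a sketch of an apply run, again assuming placeholder paths and a placeholder source name (legacy_source):

# Apply Bigconfig files in ./bigconfig (submitted to the queue by default)
bigeye bigconfig apply -ip ./bigconfig -op ./reports

# Bypass the queue and purge all Bigconfig metrics from a source you no longer manage
bigeye bigconfig apply -ip ./bigconfig -nq -psn legacy_source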

Bigconfig YAML File

The bigconfig YAML file is made up of several modules. Either the table deployment or tag deployment module is required; all other modules are optional but are designed to help you scale your data observability across your warehouse(s).

View a complete sample bigconfig at the bottom of this page to get started. Read on for more tips on how to customize it for your data and monitoring goals.

👍

Start Simple, Build for Scale

We recommend starting simple -- the most basic bigconfig contains only the table deployment module. All other modules are optional but are designed to help you scale your data observability across your warehouse(s).
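For instance, a minimal sketch of a bigconfig that monitors a single column in one table (the table and column names are illustrative, and the optional collection is omitted for brevity):

type: BIGCONFIG_FILE
table_deployments:
  - deployments:
      - fq_table_name: analytics_warehouse.kpi_reports.maus
        columns:
          - column_name: user_id
            metrics:
              - metric_type:
                  predefined_metric: PERCENT_NULL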

Auto Apply on Indexing (Optional, default False)

Applies a deployed bigconfig when sources are indexed. Indexing occurs automatically once a day, and ad hoc when users select "rescan" in the catalog. This will automatically create metrics when newly added datasets or columns match an existing tag definition or column selector in a tag deployment.

📘

This setting must be consistent across all bigconfig files. If you add auto_apply_on_indexing: True to one file, you must add it to all of them, since the default value is False.
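The setting is declared at the top level of each file, alongside the type declaration, as in the full template at the bottom of this page:

type: BIGCONFIG_FILE
auto_apply_on_indexing: True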

Table Deployments

Deploy metrics for important tables by listing metrics for each column.

Table deployments include a fully qualified table name, row creation time, table metrics, and metrics by column. When listing metrics, you can either use a saved_metric_id (see Saved Metric Definitions below) or inline your metric definition. As with saved_metric_definitions, only metric_type is required; all other attributes are optional and will use workspace defaults if not specified.

You can group one or more table deployments into a collection so you are able to route notifications effectively and see a status summary of metrics in the Bigeye application. You can use an existing collection or create a new collection. Collection attributes include:

  • name: required
  • notification_channels:
    • type: SLACK, EMAIL, or WEBHOOK
    • value: slack channel with #, email address, or webhook url; one per list item

Valid table metrics include:

  • COUNT_ROWS
  • ROWS_INSERTED
  • HOURS_SINCE_LAST_LOAD
  • COUNT_READ_QUERIES
  • FRESHNESS
  • VOLUME

Valid column metrics depend on the column type - for example AVERAGE metrics are only applicable to numeric columns - see the full list of available metrics here. The FIXME file will contain errors if you attempt to deploy an invalid metric on a column in a table deployment.

Note: either a table_deployment or a tag_deployment is required to create metrics via bigconfig. Wildcards are not supported in table deployments.

Example:

table_deployments:
  - collection:
      name: prod_analytics_monthly_actives
      notification_channels:
        - slack: '#prod_analytics'
    deployments:
      - fq_table_name: analytics_warehouse.kpi_reports.maus
        row_creation_time: updated_at
        table_metrics:
          - metric_type: 
              predefined_metric: FRESHNESS
          - metric_type: 
              predefined_metric: VOLUME
        columns:  
          - column_name: user_id
            metrics:  
              - metric_type:
                  predefined_metric: PERCENT_NULL
              - metric_type:
                  predefined_metric: COUNT_DUPLICATES
          - column_name: device_type
            metrics:
              - metric_type:
                  predefined_metric: PERCENT_NULL
              - metric_type:
                  predefined_metric: COUNT_DISTINCT
          - column_name: total_logins
            metrics:
              - metric_type:
                  predefined_metric: PERCENT_NULL
              - metric_type:
                  predefined_metric: AVERAGE

Saved Metric Definitions (Optional)

Define reusable metrics with custom configurations. You can deploy these metrics on a specified tag and/or in a table deployment later. For example, you may want to save NULL and duplicate metrics with a constant threshold of zero to apply to ID columns that should never be NULL or duplicated.

To define a saved metric, you need to specify a saved_metric_id and metric_type - both are required. All other attributes are optional; if not provided, metrics will be created with workspace defaults. A full list of attributes and their defaults is below:

  • saved_metric_id: required

  • metric_type: required

  • metric_name: optional name used to identify the monitor; shows in notifications and metric charts. Defaults to the metric type.

  • description: optional description that shows on the metric page

  • schedule_frequency: defaults to every 6 hours

    • interval_type: DAYS or HOURS
    • interval_value: integer value
  • metric_schedule: optional; either a named schedule defined in the workspace or a schedule frequency

    • named_schedule: optional cron schedule in the workspace
      • name: the name of the schedule
  • lookback: Note this attribute is only valid if row creation time is set on the table.

    • lookback_type: DATA_TIME, CLOCK_TIME, or METRIC_TIME. Learn more about lookback types.

      The default value depends on "Data time window as default" in the advanced settings of the workspace. If enabled, the default is METRIC_TIME (data time window); otherwise DATA_TIME.

    • lookback_window: Default is 2 days.

      • interval_type: DAYS or HOURS
      • interval_value: integer value
    • bucket_size: Aggregation for the metric query, HOUR or DAY. Note: only for metrics with a METRIC_TIME lookback.

  • conditions: list of conditions to add to the metric query as WHERE clauses; each is added with AND

  • group_by: list of fields in the metric table to group by

  • threshold: defaults to autothresholds with medium width

    • type: AUTO, RELATIVE, STDDEV, or CONSTANT. Learn more.
    • sensitivity: autothresholds only: NARROW, MEDIUM, WIDE, or XWIDE; defaults to MEDIUM
    • upper_bound: constant, relative, and stddev thresholds only
    • lower_bound: constant, relative, and stddev thresholds only
    • lookback: Note this attribute is only valid, and required, for relative and stddev thresholds
      • interval_type: DAYS
      • interval_value: integer value
  • notification_channels: accepts a list of notification channels, each with the following:

    • type: SLACK, EMAIL, or WEBHOOK
    • value: slack channel name, email address, or webhook url

Note that saved metric definitions are optional; you can always inline metrics and custom configurations as you need.

👍

Default Settings Keep Things Clean

Only a saved_metric_id and metric_type are required in saved_metric_definitions; bigconfig will automatically apply your other workspace default settings, so you only have to specify the attributes you want to customize.

Example:

saved_metric_definitions:
  metrics:
    - saved_metric_id: no_nulls
      metric_type: 
        predefined_metric: PERCENT_NULL
      threshold:
        type: CONSTANT
        upper_bound: 0 
    - saved_metric_id: no_dupes
      metric_type: 
        predefined_metric: COUNT_DUPLICATES
      threshold:
        type: CONSTANT
        upper_bound: 0
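As a further sketch, the saved metric below layers on several of the optional attributes from the list above -- the id, name, schedule, and lookback values are placeholders, and the lookback only takes effect on tables with a row creation time set:

saved_metric_definitions:
  metrics:
    - saved_metric_id: nulls_daily
      metric_type:
        predefined_metric: PERCENT_NULL
      metric_name: Daily percent null check
      description: Checks for NULL values once a day over a one-day data window
      schedule_frequency:
        interval_type: HOURS
        interval_value: 24
      lookback:
        lookback_type: DATA_TIME
        lookback_window:
          interval_type: DAYS
          interval_value: 1
      threshold:
        type: AUTO
        sensitivity: NARROW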

Tag Definitions (Optional)

Define tags to deploy metrics on a dynamic set of columns. You can also use tags to select columns that will be used as row creation times across multiple tables. For example, you may want to tag all ID columns for format, duplicate and NULL monitoring. Alternatively, you may want to tag an important schema or collection of tables for analytics monitoring.

The tag_definitions module consists of many tag_ids, each defined by one or more column_selectors. tag_ids must be unique in the workspace. column_selectors can be defined by name, using fully qualified column names including source, schema, table, and column, or they can be defined by column type. You can use wildcards in column_selectors defined by name.

Note that you can combine column selectors in tag definitions to create AND requirements. The example below selects all columns that are in analytics_warehouse.*.*.* AND have type INT:

column_selectors:
    - name: analytics_warehouse.*.*.*
      type: INT

See additional examples below:

tag_definitions: 
  - tag_id: ID_FIELDS
    column_selectors:
      - name: analytics_warehouse.*.*.*_id
      - name: analytics_warehouse.*.*.*_ID
      - name: analytics_warehouse.*.*.*Id
  - tag_id: INCREMENTAL_IDS
    column_selectors:
      - name: analytics_warehouse.kpi_reports.*.created_at
      - name: analytics_warehouse.revenue_reports.*.updated_at   
  - tag_id: PROD_TABLES
    column_selectors:
      - name: analytics_warehouse.*_prod.*.*
  - tag_id: EXEC_REPORTING
    column_selectors:
      - name: analytics_warehouse.kpi_reports.*.*
      - name: analytics_warehouse.revenue_reports.*.*
  - tag_id: STRING_TYPES
    column_selectors:
      - type: STRING
  - tag_id: ANALYTICS_NUMERICS
    column_selectors:
      - name: analytics_warehouse.*.*.*
        type: INT

Row Creation Times (Optional)

Set row creation time across multiple tables with a tag_id or inline column_selectors. For example, you may want to always set the updated_at column as the row creation time across your source.

All columns must be valid timestamp columns. Only one column per table may be selected as row creation time.

Example:

row_creation_times:
  tag_ids:
    - INCREMENTAL_IDS 
  column_selectors:
    - name: analytics_warehouse.staging.*.createdate

Tag Deployments

Deploy metrics on predefined tags. For example, you can apply your saved NULL and duplicate metrics on the ID_FIELD tag to ensure all ID columns across your warehouse are consistently monitored.

When listing metrics, you can either use a saved_metric_id or inline your metric definition. As with saved_metric_definitions, only metric_type is required; all other attributes are optional and will use workspace defaults if not specified.

Metrics are only created on valid column types automatically (see the tip below). You can also deploy table-level metrics on a tag, as long as the tag definition uses a column wildcard, e.g. source.schema.table.*

👍

Automatic Tag Deployments by Type

When deploying metrics via tag, metrics will only be created on valid column types automatically. For example, if you deploy an AVERAGE metric on a tag that includes all columns and tables in a schema, average metrics will only be created on numeric columns.

You can group one or more tag deployments into a collection so you are able to route notifications effectively and see a status summary of metrics in the Bigeye application. You can use an existing collection or create a new collection. Collection attributes include:

  • name: required
  • notification_channels:
    • type: SLACK, EMAIL, or WEBHOOK
    • value: slack channel with #, email address, or webhook url; one per list item

Note: either a table_deployment or a tag_deployment is required to create metrics via bigconfig.

Example:

tag_deployments:
  - collection:
      name: prod_data_ops
      notification_channels:
        - slack: '#data_alerts'
    deployments:
        - tag_id: ID_FIELDS 
          metrics:
            - saved_metric_id: no_nulls
            - saved_metric_id: no_dupes
        - tag_id: PROD_TABLES
          metrics:
            - metric_type: 
                predefined_metric: COUNT_ROWS
            - metric_type: 
                predefined_metric: HOURS_SINCE_LAST_LOAD
            - metric_type: 
                predefined_metric: PERCENT_NULL
        - tag_id: EXEC_REPORTING 
          metrics:
            - metric_type: 
                predefined_metric: COUNT_ROWS
            - metric_type: 
                predefined_metric: HOURS_SINCE_LAST_LOAD
            - metric_type: 
                predefined_metric: AVERAGE
            - metric_type: 
                predefined_metric: MIN
            - metric_type: 
                predefined_metric: MAX
            - metric_type: 
                predefined_metric: VARIANCE

Example Template

Below is a complete bigconfig example with all modules included. Copy/paste it as a template to create your own.

type: BIGCONFIG_FILE
auto_apply_on_indexing: True
tag_definitions: 
  - tag_id: ID_FIELDS
    column_selectors:
      - name: analytics_warehouse.*.*.*_id
      - name: analytics_warehouse.*.*.*_ID
      - name: analytics_warehouse.*.*.*Id
  - tag_id: INCREMENTAL_IDS
    column_selectors:
      - name: analytics_warehouse.kpi_reports.*.created_at
      - name: analytics_warehouse.revenue_reports.*.updated_at   
  - tag_id: PROD_TABLES
    column_selectors:
      - name: analytics_warehouse.*_prod.*.*
  - tag_id: EXEC_REPORTING
    column_selectors:
      - name: analytics_warehouse.kpi_reports.*.*
      - name: analytics_warehouse.revenue_reports.*.*

row_creation_times:
  tag_ids:
    - INCREMENTAL_IDS 
  column_selectors:
    - name: analytics_warehouse.staging.*.createdate

saved_metric_definitions:
  metrics:
    - saved_metric_id: no_nulls
      metric_type: 
        predefined_metric: PERCENT_NULL
      threshold:
        type: CONSTANT
        upper_bound: 0 
    - saved_metric_id: no_dupes
      metric_type: 
        predefined_metric: COUNT_DUPLICATES
      threshold:
        type: CONSTANT
        upper_bound: 0
      metric_schedule:   
        named_schedule:              
          name: 'Everyday, 8:00 UTC'
        
tag_deployments:
  - collection:
      name: prod_data_ops
      notification_channels:
        - slack: '#data_alerts'
        - webhook: https://automation.atlassian.com/pro/hooks
    deployments:
        - tag_id: ID_FIELDS 
          metrics:
            - saved_metric_id: no_nulls
            - saved_metric_id: no_dupes
        - tag_id: PROD_TABLES
          metrics:
            - metric_type: 
                predefined_metric: COUNT_ROWS
            - metric_type: 
                predefined_metric: HOURS_SINCE_LAST_LOAD
            - metric_type: 
                predefined_metric: PERCENT_NULL
        - tag_id: EXEC_REPORTING 
          metrics:
            - metric_type: 
                predefined_metric: COUNT_ROWS
            - metric_type: 
                predefined_metric: HOURS_SINCE_LAST_LOAD
            - metric_type: 
                predefined_metric: AVERAGE
            - metric_type: 
                predefined_metric: MIN
            - metric_type: 
                predefined_metric: MAX
            - metric_type: 
                predefined_metric: VARIANCE
              threshold:
                type: RELATIVE
                lower_bound: 5
                lookback:
                  interval_type: DAYS
                  interval_value: 1

table_deployments:
  - collection:
      name: prod_analytics_monthly_actives
      notification_channels:
        - slack: '#prod_analytics'
        - webhook: https://dev.service-now.com/api/workspace
    deployments:
      - fq_table_name: analytics_warehouse.kpi_reports.maus
        row_creation_time: updated_at
        table_metrics:
          - metric_type: 
              predefined_metric: COUNT_ROWS
          - metric_type: 
              predefined_metric: HOURS_SINCE_LAST_LOAD
        columns:  
          - column_name: user_id
            metrics:  
              - metric_type:
                  predefined_metric: PERCENT_NULL
              - metric_type:
                  predefined_metric: COUNT_DUPLICATES
          - column_name: device_type
            metrics:
              - metric_type:
                  predefined_metric: PERCENT_NULL
              - metric_type:
                  predefined_metric: COUNT_DISTINCT
              - metric_type:
                  predefined_metric: PERCENT_VALUE_IN_LIST
                parameters:              
                  - key: list
                    string_value: "DEVICE1,DEVICE2,DEVICE3" 
          - column_name: total_logins
            metrics:
              - metric_type:
                  predefined_metric: PERCENT_NULL
              - metric_type:
                  predefined_metric: AVERAGE
              - metric_type:
                  type: TEMPLATE
                  template_id: 240
                  aggregation_type: COUNT
                  template_name: Is Greater than 0
                threshold:
                  type: AUTO
                  sensitivity: NARROW
                parameters:
                  - key: column
                    column_name: total_logins
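To try the template, save it as a .yaml file in your working directory (keeping type: BIGCONFIG_FILE at the top), review the plan output, and then apply:

# Preview the metrics, collections, and row creation times the template would create
bigeye bigconfig plan

# Apply the changes once the plan looks right
bigeye bigconfig apply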