Bigconfig

Overview

Bigconfig is a YAML-based, declarative monitoring-as-code solution. Data engineers can deploy Bigeye metrics from the command line for convenient and scalable data quality monitoring.

Prerequisites

Bigconfig is applied through the Bigeye CLI. See here for instructions on how to install the Bigeye CLI and log into your workspace.

The user authenticated to the Bigeye CLI must have permissions to edit the tables and metrics being configured. Users of the Bigeye CLI and Bigconfig do not need administrative privileges.

Bigconfig CLI Commands

Bigconfig has two relevant CLI commands: plan and apply. Run bigeye bigconfig --help in the CLI for convenient access to relevant commands and parameters.

Plan

bigeye bigconfig plan

The plan command performs a "dry run" and returns a report of the changes expected when the provided Bigconfig file is applied. If the YAML file has validation errors, a FIXME_ YAML file is generated with inline errors to help you amend your Bigconfig. Otherwise, a report is created listing the metrics to be created, updated, and deleted, as well as the columns to be set as row creation time.

Command options:

  • -ip, --input_path (optional): List of paths containing Bigconfig YAML file(s). If none is provided, then any .yaml or .yml files in the current working directory with type: BIGCONFIG_FILE will be used.
  • -op, --output_path (optional): Output path where reports and FIXME files will be saved. If no output path is defined the current working directory will be used.
  • -psn, --purge_source_name (optional): Deletes all Bigconfig metrics in a particular source. Accepts a list of data source names to purge, e.g. -psn source_1 -psn source_2.
  • -purge_all, --purge_all_sources (optional): Deletes all Bigconfig metrics in the workspace: True or False. [default: False]
  • -nq, --no-queue (optional): Bypasses the queue endpoint for Bigconfig. [default: False]
  • -r, --recursive (optional): Search ALL input directories recursively. [default: False]
  • -strict, --strict_mode (optional): API errors cause an exception if True. Otherwise errors are returned in PLAN reports but the process exits successfully. [default: False]

Apply

bigeye bigconfig apply

The apply command executes the Bigconfig to create, update, and delete metrics and to set row creation times and SLAs. If the YAML file has validation errors, a FIXME_ YAML file is generated with inline errors to help you amend your Bigconfig. Otherwise, a report is created listing the metrics created, updated, and deleted, as well as the columns set as row creation time. By default, apply submits the Bigconfig to a queue.

Command options:

  • -ip, --input_path (optional): List of paths containing Bigconfig YAML file(s). If none is provided, then any .yaml or .yml files in the current working directory with type: BIGCONFIG_FILE will be used.
  • -op, --output_path (optional): Output path where reports and FIXME files will be saved. If no output path is defined the current working directory will be used.
  • -psn, --purge_source_name (optional): Deletes all Bigconfig metrics in a particular source. Accepts a list of data source names to purge, e.g. -psn source_1 -psn source_2.
  • -purge_all, --purge_all_sources (optional): Deletes all Bigconfig metrics in the workspace: True or False. [default: False]
  • -nq, --no-queue (optional): Bypasses the queue endpoint for Bigconfig. [default: False]
  • -r, --recursive (optional): Search ALL input directories recursively. [default: False]
  • -strict, --strict_mode (optional): API errors cause an exception if True. Otherwise errors are returned in APPLY reports but the process exits successfully. [default: False]
  • -auto-approve, --auto-approve (optional): If set, manual plan approval is not needed prior to metric deployment. [default: False]

Note: if you run bigeye bigconfig apply as part of a CI/CD process, include the -auto-approve flag in the command to avoid failed jobs.
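
As an illustration of the CI/CD case, the apply step might be wired up as in the sketch below. It assumes GitHub Actions as the CI system, a repository directory named bigconfig/ holding the YAML files, and that the CLI is installed from PyPI under the name bigeye-cli. All three are assumptions made for the example, and the authentication step is left as a placeholder because it depends on how your workspace credentials are stored.

```yaml
# Hypothetical GitHub Actions workflow: apply Bigconfig on every push to main.
name: bigconfig-apply
on:
  push:
    branches: [main]
jobs:
  apply:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: '3.11'
      # Assumed package name; follow the Bigeye CLI install instructions.
      - run: pip install bigeye-cli
      # Placeholder: authenticate the CLI to your workspace before this step.
      # -ip and -auto-approve are the apply options listed above;
      # bigconfig/ is a hypothetical directory in this repository.
      - run: bigeye bigconfig apply -ip bigconfig/ -auto-approve
```

Running plan in pull requests and apply only after merge is a common variation on this layout.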

Auto Apply on Indexing

In addition to running apply from the CLI, you can set your bigconfig to auto-apply on indexing. This will detect when newly added datasets or columns match an existing tag definition or column selector and automatically create the relevant metrics based on tag deployments.

Indexing is Bigeye's process to update your catalog with any new schema changes. Indexing occurs automatically once a day, and ad hoc when users select "rescan" in the catalog.

To auto-apply your bigconfig on indexing, add auto_apply_on_indexing: True to the top of your YAML file.

Note that this setting must be consistent across all bigconfig files. If you add auto_apply_on_indexing: True to one file, you must add it to all of them, since the default value is False.
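
As a sketch, the top of each Bigconfig file would then look like the fragment below. The key uses YAML's colon syntax, as in the complete template at the bottom of this page; the comments are reminders only and not part of the schema.

```yaml
# Top of every Bigconfig file in the project (repeat in each file).
type: BIGCONFIG_FILE
auto_apply_on_indexing: True
# Modules such as tag_definitions or tag_deployments follow here.
```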

Bigconfig YAML File

You can deploy monitoring with Bigconfig using one or more yaml files at a time. Remember that the input path accepts a list of paths, and any .yaml or .yml files in the specified directory with type: BIGCONFIG_FILE will be included.

The bigconfig YAML file is made up of several modules. Either the table deployment or the tag deployment module is required; all other modules are optional and are designed to help you scale your data observability across your warehouse(s). You may want to use a separate file for each module, or separate files for independent table or tag deployments belonging to different teams.

View a complete sample bigconfig at the bottom of this page to get started. Read below for more tips on how to customize to your data and monitoring goals.

Start Simple, Build for Scale

We recommend starting simple: the most basic bigconfig contains only the table deployment module. All other modules are optional and are designed to help you scale your data observability across your warehouse(s).
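
A minimal file along these lines might look like the following sketch. The collection name is hypothetical, the table name is borrowed from the examples further down this page, and every attribute left out is expected to fall back to workspace defaults.

```yaml
# Minimal Bigconfig sketch: one table, one table-level metric.
type: BIGCONFIG_FILE
table_deployments:
  - collection:
      name: my_first_collection        # hypothetical collection name
    deployments:
      - fq_table_name: analytics_warehouse.kpi_reports.maus
        table_metrics:
          - metric_type:
              predefined_metric: FRESHNESS
```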

Best Practice: Use Version Control

Checking your YAML files into a source control system such as Git will allow you to track changes to configuration-driven metrics, solicit feedback on changes, and work as a team on a suite of metrics.

Table Deployments

Deploy metrics for important tables by listing metrics for each column.

Table deployments include a fully qualified table name, row creation time, table metrics, and metrics by column. When listing metrics, you can either use a [saved_metric_id](https://docs.bigeye.com/docs/bigconfig#saved-metric-definitions-optional) or inline your metric definition. As with saved_metric_definitions, only metric_type is required; all other attributes are optional and use workspace defaults if not specified.

You can group one or more table deployments into a collection so you are able to route notifications effectively and see a status summary of metrics in the Bigeye application. You can use an existing collection or create a new collection. Collection attributes include:

  • name: required
  • notification_channels:
    • slack: slack channel with #
    • email: email address
    • webhook: webhook url

Valid table metrics include:

  • COUNT_ROWS
  • ROWS_INSERTED
  • COUNT_READ_QUERIES
  • FRESHNESS
  • VOLUME

Valid column metrics depend on the column type; for example, AVERAGE metrics are only applicable to numeric columns. See the full list of available metrics here. The FIXME file will contain errors if you attempt to deploy an invalid metric on a column in a table deployment.

Note: either a table_deployment or a tag_deployment is required to create metrics via bigconfig. Wildcards are not supported in table deployments.

Example:

table_deployments:
  - collection:
      name: prod_analytics_monthly_actives
      notification_channels:
        - slack: '#prod_analytics'
    deployments:
      - fq_table_name: analytics_warehouse.kpi_reports.maus
        row_creation_time: updated_at
        table_metrics:
          - metric_type: 
              predefined_metric: FRESHNESS
          - metric_type: 
              predefined_metric: VOLUME
        columns:  
          - column_name: user_id
            metrics:  
              - metric_type:
                  predefined_metric: PERCENT_NULL
              - metric_type:
                  predefined_metric: COUNT_DUPLICATES
          - column_name: device_type
            metrics:
              - metric_type:
                  predefined_metric: PERCENT_NULL
              - metric_type:
                  predefined_metric: COUNT_DISTINCT
          - column_name: total_logins
            metrics:
              - metric_type:
                  predefined_metric: PERCENT_NULL
              - metric_type:
                  predefined_metric: AVERAGE

Saved Metric Definitions (Optional)

Define reusable metrics with custom configurations. You can deploy these metrics on a specified tag and/or in a table deployment later. For example, you may want to save NULL and duplicate metrics with a constant threshold of zero to apply to ID columns that should never be NULL or duplicated.

To define a saved metric, you need to specify a saved_metric_id and a metric_type; both are required. All other attributes are optional; if they are not provided, metrics are created with workspace defaults. A full list of attributes and their defaults is below:

  • saved_metric_id: required
  • metric_type: required
  • metric_name: optional name used to identify the monitor; shown in notifications and metric charts. Defaults to the metric type.
  • description: optional description shown on the metric page
  • schedule_frequency: defaults to every 6 hours
    • interval_type: DAYS or HOURS
    • interval_value: integer value
  • metric_schedule: optional named schedule defined in a workspace, or a schedule frequency
    • named_schedule: optional cron schedule in a workspace
      • name: the name of the schedule
  • lookback: only valid if row creation time is set on the table
    • lookback_type: DATA_TIME, CLOCK_TIME, or METRIC_TIME. Learn more about lookback types. The default value depends on "Data time window as default" in the advanced settings of a workspace: if enabled, the default is METRIC_TIME (data time window); otherwise it is DATA_TIME.
    • lookback_window: defaults to 2 days
      • interval_type: DAYS or HOURS
      • interval_value: integer value
    • bucket_size: aggregation for the metric query, HOUR or DAY. Only for metrics with METRIC_TIME lookback.
  • conditions: list of conditions to add to the metric query as WHERE clauses; each is added with AND
  • group_by: list of fields in the metric table to group by
  • threshold: defaults to autothresholds with medium width
    • type: AUTO, RELATIVE, STDDEV, or CONSTANT. Learn more.
    • sensitivity: autothresholds only; NARROW, MEDIUM, WIDE, or XWIDE. Defaults to MEDIUM.
    • upper_bound: constant, relative, and stddev thresholds only
    • lower_bound: constant, relative, and stddev thresholds only
    • upper_bound_only: autothresholds only; True or False. Defaults to False.
    • lower_bound_only: autothresholds only; True or False. Defaults to False.
    • lookback: only valid, and required, for relative and stddev thresholds
      • interval_type: DAYS
      • interval_value: integer value
  • notification_channels: accepts a list of notification channels, each with the following:
    • slack: slack channel with #
    • email: email address
    • webhook: webhook url

Note that this module is optional; you can always inline metrics and custom configurations as you need.

Default Settings Keep Things Clean

Only a saved_metric_id and metric_type are required in saved_metric_definitions. Bigconfig automatically applies your other workspace default settings, so you only have to specify the attributes you want to customize.

Example:

saved_metric_definitions:
  metrics:
    - saved_metric_id: no_nulls
      metric_type: 
        predefined_metric: PERCENT_NULL
      threshold:
        type: CONSTANT
        upper_bound: 0 
    - saved_metric_id: no_dupes
      metric_type: 
        predefined_metric: COUNT_DUPLICATES
      threshold:
        type: CONSTANT
        upper_bound: 0
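
For a fuller picture, the sketch below shows a saved metric that overrides several of the optional attributes at once. The key nesting follows the attribute list above; the id, name, condition, and values are illustrative assumptions, and the lookback block only applies when row creation time is set on the table.

```yaml
# Sketch of a saved metric that overrides several defaults.
# Key nesting follows the attribute list above; values are illustrative.
saved_metric_definitions:
  metrics:
    - saved_metric_id: daily_login_average      # hypothetical id
      metric_type:
        predefined_metric: AVERAGE
      metric_name: Average logins per day       # hypothetical name
      description: Tracks the average of total_logins
      schedule_frequency:
        interval_type: HOURS
        interval_value: 12
      lookback:                                 # needs row creation time on the table
        lookback_type: DATA_TIME
        lookback_window:
          interval_type: DAYS
          interval_value: 7
      conditions:
        - device_type IS NOT NULL               # added to the WHERE clause
      group_by:
        - device_type
      threshold:
        type: AUTO
        sensitivity: WIDE
      notification_channels:
        - slack: '#data_alerts'
```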

Tag Definitions (Optional)

Define tags to deploy metrics on a dynamic set of columns. You can also use tags to select columns that will be used as row creation times across multiple tables. For example, you may want to tag all ID columns for format, duplicate and NULL monitoring. Alternatively, you may want to tag an important schema or collection of tables for analytics monitoring.

The tag_definitions module consists of one or more tag_ids, each defined by one or more column_selectors. Tag_ids must be unique in the workspace. column_selectors can be defined by name, using fully qualified column names including source, schema, table, and column, or they can be defined by column type. You can use wildcards in column_selectors defined by name. You can also combine exclusions with names to define a reduced set of columns.

Tag definitions can also use regular expressions to specify exact matches for a deployment. Periods separating objects must be escaped with a backslash, e.g. source\.database\.schema\.table\.column

Note that you can combine column selectors in tag definitions to create AND requirements. The example below selects all columns that match analytics_warehouse.*.*.* AND have type INT:

column_selectors:
    - name: analytics_warehouse.*.*.*
      type: INT

See additional examples below:

tag_definitions: 
  - tag_id: ID_FIELDS
    column_selectors:
      - name: analytics_warehouse.*.*.*_id
      - name: analytics_warehouse.*.*.*_ID
      - name: analytics_warehouse.*.*.*Id
  - tag_id: INCREMENTAL_IDS
    column_selectors:
      - name: analytics_warehouse.kpi_reports.*.created_at
      - name: analytics_warehouse.revenue_reports.*.updated_at   
  - tag_id: PROD_TABLES
    column_selectors:
      - name: analytics_warehouse.*_prod.*.*
  - tag_id: EXEC_REPORTING
    column_selectors:
      - name: analytics_warehouse.kpi_reports.*.*
        exclude: analytics_warehouse.kpi_reports.*_history.*
        exclude: analytics_warehouse.kpi_reports.*_archived.*
      - name: analytics_warehouse.revenue_*.*.*
        exclude: analytics_warehouse.revenue_*_nonprod.*.*
  - tag_id: STRING_TYPES
    column_selectors:
      - type: STRING
  - tag_id: ANALYTICS_NUMERICS
    column_selectors:
      - name: analytics_warehouse.*.*.*
        type: INT
  - tag_id: ANALYTICS_REGEX
    column_selectors:
      - regex: analytics_warehouse\.kpi_reports\.reporting_(?!.*(history|archived).)\..*
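
To make the combination rules concrete, the sketch below puts both forms in one tag. The tag_id is hypothetical, and the reading that separate selectors add to the selected set is an assumption inferred from the ID_FIELDS example, where several name patterns are listed side by side.

```yaml
# Sketch: how selectors combine within one tag (tag_id is hypothetical).
tag_definitions:
  - tag_id: KPI_INTS_AND_ALL_IDS
    column_selectors:
      # name AND type in one selector: only INT columns in kpi_reports.
      - name: analytics_warehouse.kpi_reports.*.*
        type: INT
      # A separate selector is assumed to add its matches to the set.
      - name: analytics_warehouse.*.*.*_id
```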

Row Creation Times (Optional)

Set row creation time across multiple tables with a tag_id or inline column_selectors. For example, you may want to always set the updated_at column as the row creation time across your source.

All columns must be valid timestamp columns. Only one column per table may be selected as row creation time.

Example:

row_creation_times:
  tag_ids:
    - INCREMENTAL_IDS 
  column_selectors:
    - name: analytics_warehouse.staging.*.createdate

Tag Deployments

Deploy metrics on predefined tags. For example, you can apply your saved NULL and duplicate metrics on the ID_FIELDS tag to ensure all ID columns across your warehouse are consistently monitored.

When listing metrics, you can either use a saved_metric_id or inline your metric definition. As with saved_metric_definitions, only metric_type is required; all other attributes are optional and use workspace defaults if not specified.

When deploying metrics via tag, metrics will only be created on valid column types automatically. For example, if you deploy an AVERAGE metric on a tag that includes all columns and tables in a schema, average metrics will only be created on numeric columns. You can also deploy table-level metrics on a tag, as long as the tag definition uses a column wildcard, ex: source.schema.table.*

You can group one or more tag deployments into a collection so you are able to route notifications effectively and see a status summary of metrics in the Bigeye application. You can use an existing collection or create a new collection. Collection attributes include:

  • name: required
  • notification_channels:
    • slack: slack channel with #
    • email: email address
    • webhook: webhook url

Note: either a table_deployment or a tag_deployment is required to create metrics via bigconfig.

Example:

tag_deployments:
  - collection:
      name: prod_data_ops
      notification_channels:
        - slack: '#data_alerts'
    deployments:
        - tag_id: ID_FIELDS 
          metrics:
            - saved_metric_id: no_nulls
            - saved_metric_id: no_dupes
        - tag_id: PROD_TABLES
          metrics:
            - metric_type: 
                predefined_metric: COUNT_ROWS
            - metric_type: 
                predefined_metric: HOURS_SINCE_MAX_TIMESTAMP
            - metric_type: 
                predefined_metric: PERCENT_NULL
        - tag_id: EXEC_REPORTING 
          metrics:
            - metric_type: 
                predefined_metric: COUNT_ROWS
            - metric_type: 
                predefined_metric: HOURS_SINCE_MAX_TIMESTAMP
            - metric_type: 
                predefined_metric: AVERAGE
            - metric_type: 
                predefined_metric: MIN
            - metric_type: 
                predefined_metric: MAX
            - metric_type: 
                predefined_metric: VARIANCE
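
Table-level metrics can be deployed through a tag in the same way, provided the tag selects columns with a wildcard. The sketch below assumes a hypothetical tag id and reuses the table and collection names from the examples on this page.

```yaml
# Sketch: table-level metrics deployed through a tag.
# The tag must select columns with a wildcard (source.schema.table.*).
tag_definitions:
  - tag_id: MAUS_TABLE                  # hypothetical tag id
    column_selectors:
      - name: analytics_warehouse.kpi_reports.maus.*
tag_deployments:
  - collection:
      name: prod_data_ops
    deployments:
      - tag_id: MAUS_TABLE
        metrics:
          - metric_type:
              predefined_metric: FRESHNESS
          - metric_type:
              predefined_metric: VOLUME
```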

Example Template

Below is a complete bigconfig example with all modules included. Copy/paste it as a template to create your own.

type: BIGCONFIG_FILE
auto_apply_on_indexing: True
tag_definitions: 
  - tag_id: ID_FIELDS
    column_selectors:
      - name: analytics_warehouse.*.*.*_id
      - name: analytics_warehouse.*.*.*_ID
      - name: analytics_warehouse.*.*.*Id
  - tag_id: INCREMENTAL_IDS
    column_selectors:
      - name: analytics_warehouse.kpi_reports.*.created_at
      - name: analytics_warehouse.revenue_reports.*.updated_at   
  - tag_id: PROD_TABLES
    column_selectors:
      - name: analytics_warehouse.*_prod.*.*
  - tag_id: EXEC_REPORTING
    column_selectors:
      - name: analytics_warehouse.kpi_reports.*.*
      - name: analytics_warehouse.revenue_reports.*.*

row_creation_times:
  tag_ids:
    - INCREMENTAL_IDS 
  column_selectors:
    - name: analytics_warehouse.staging.*.createdate

saved_metric_definitions:
  metrics:
    - saved_metric_id: no_nulls
      metric_type: 
        predefined_metric: PERCENT_NULL
      threshold:
        type: CONSTANT
        upper_bound: 0 
    - saved_metric_id: no_dupes
      metric_type: 
        predefined_metric: COUNT_DUPLICATES
      threshold:
        type: CONSTANT
        upper_bound: 0
      metric_schedule:   
        named_schedule:              
          name: 'Everyday, 8:00 UTC'
        
tag_deployments:
  - collection:
      name: prod_data_ops
      notification_channels:
        - slack: '#data_alerts'
        - webhook: https://automation.atlassian.com/pro/hooks
    deployments:
        - tag_id: ID_FIELDS 
          metrics:
            - saved_metric_id: no_nulls
            - saved_metric_id: no_dupes
        - tag_id: PROD_TABLES
          metrics:
            - metric_type: 
                predefined_metric: COUNT_ROWS
            - metric_type: 
                predefined_metric: HOURS_SINCE_MAX_TIMESTAMP
            - metric_type: 
                predefined_metric: PERCENT_NULL
        - tag_id: EXEC_REPORTING 
          metrics:
            - metric_type: 
                predefined_metric: COUNT_ROWS
            - metric_type: 
                predefined_metric: HOURS_SINCE_MAX_TIMESTAMP
            - metric_type: 
                predefined_metric: AVERAGE
            - metric_type: 
                predefined_metric: MIN
            - metric_type: 
                predefined_metric: MAX
            - metric_type: 
                predefined_metric: VARIANCE
              threshold:
                type: RELATIVE
                lower_bound: 5
                lookback:
                  interval_type: DAYS
                  interval_value: 1

table_deployments:
  - collection:
      name: prod_analytics_monthly_actives
      notification_channels:
        - slack: '#prod_analytics'
        - webhook: https://dev.service-now.com/api/workspace
    deployments:
      - fq_table_name: analytics_warehouse.kpi_reports.maus
        row_creation_time: updated_at
        table_metrics:
          - metric_type: 
              predefined_metric: COUNT_ROWS
          - metric_type: 
              predefined_metric: HOURS_SINCE_MAX_TIMESTAMP
        columns:  
          - column_name: user_id
            metrics:  
              - metric_type:
                  predefined_metric: PERCENT_NULL
              - metric_type:
                  predefined_metric: COUNT_DUPLICATES
          - column_name: device_type
            metrics:
              - metric_type:
                  predefined_metric: PERCENT_NULL
              - metric_type:
                  predefined_metric: COUNT_DISTINCT
              - metric_type:
                  predefined_metric: PERCENT_VALUE_IN_LIST
                parameters:              
                  - key: list
                    string_value: "DEVICE1,DEVICE2,DEVICE3" 
          - column_name: total_logins
            metrics:
              - metric_type:
                  predefined_metric: PERCENT_NULL
              - metric_type:
                  predefined_metric: AVERAGE
              - metric_type:
                  type: TEMPLATE
                  template_id: 240
                  aggregation_type: COUNT
                  template_name: Is Greater than 0
                threshold:
                  type: AUTO
                  sensitivity: NARROW
                parameters:
                  - key: column
                    column_name: total_logins