Bigconfig
Overview
Bigconfig is a YAML-based, declarative monitoring-as-code solution. Data engineers can deploy Bigeye metrics from the command line for convenient and scalable data quality monitoring.
Prerequisites
Bigconfig can be applied via the Bigeye-CLI. See here for instructions on how to install the Bigeye-CLI and log into your workspace.
The user authenticated to the Bigeye CLI must have permissions to edit the tables and metrics being configured. Users of the Bigeye CLI and Bigconfig do not need administrative privileges.
Bigconfig CLI Commands
Bigconfig has two relevant CLI commands: plan and apply. Run bigeye bigconfig --help in the CLI for convenient access to relevant commands and parameters.
Plan
bigeye bigconfig plan
The plan command performs a “dry run” and returns a report of the expected changes when the provided Bigconfig file is applied. If there are validation errors in the YAML file, a FIXME_ YAML file will be generated with inline errors to help you amend your Bigconfig. Otherwise, a report will be created listing the metrics to be created, updated, and deleted, as well as the columns to be set as row creation time.
Command options:
- -ip, --input_path (optional): List of paths containing Bigconfig YAML file(s). If none is provided, then any .yaml or .yml files in the current working directory with type: BIGCONFIG_FILE will be used.
- -op, --output_path (optional): Output path where reports and FIXME files will be saved. If no output path is defined, the current working directory will be used.
- -psn, --purge_source_name (optional): deletes all Bigconfig metrics in a particular source. Accepts a list of data source names to purge, e.g. -psn source_1 -psn source_2.
- -purge_all, --purge_all_sources (optional): deletes all Bigconfig metrics in the workspace: True or False. [default: False]
- -nq, --no-queue (optional): bypass the queue endpoint for bigconfig [default: False]
- -r, --recursive (optional): Search ALL input directories recursively. [default: False]
- -strict, --strict_mode (optional): API errors cause an exception if True. Otherwise errors are returned in PLAN reports but the process exits successfully. [default: False]
Apply
bigeye bigconfig apply
The apply command executes the Bigconfig to create, update, and delete metrics, and to set row creation times and SLAs. If there are validation errors in the YAML file, a FIXME_ YAML file will be generated with inline errors to help you amend your Bigconfig. Otherwise, a report will be created listing the metrics created, updated, and deleted, as well as the columns set as row creation time. By default, an apply will submit the bigconfig to a queue.
Command options:
- -ip, --input_path (optional): List of paths containing Bigconfig YAML file(s). If none is provided, then any .yaml or .yml files in the current working directory with type: BIGCONFIG_FILE will be used.
- -op, --output_path (optional): Output path where reports and FIXME files will be saved. If no output path is defined, the current working directory will be used.
- -psn, --purge_source_name (optional): deletes all Bigconfig metrics in a particular source. Accepts a list of data source names to purge, e.g. -psn source_1 -psn source_2.
- -purge_all, --purge_all_sources (optional): deletes all Bigconfig metrics in the workspace: True or False. [default: False]
- -nq, --no-queue (optional): bypass the queue endpoint for bigconfig [default: False]
- -r, --recursive (optional): Search ALL input directories recursively. [default: False]
- -strict, --strict_mode (optional): API errors cause an exception if True. Otherwise errors are returned in APPLY reports but the process exits successfully [default: False]
- -auto-approve, --auto-approve (optional): manual plan approval is not needed prior to metric deployment [default: False]
Note that if you are running bigeye bigconfig apply as part of a CI/CD process, make sure that the -auto-approve flag is provided in the command to avoid failed jobs.
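As a rough illustration, a CI job could run the apply step without manual approval. The sketch below uses GitHub Actions workflow syntax. The workflow name, the bigconfig/ directory, the Python version, and the pip package name bigeye-cli are assumptions, and the sketch presumes that Bigeye CLI credentials have already been made available to the job as described in the CLI login instructions.

```yaml
# Hypothetical CI workflow; adapt names and paths to your repository.
name: bigconfig-apply
on:
  push:
    branches: [main]
jobs:
  apply:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.11"
      # Assumption: the Bigeye CLI is installed from PyPI under this name.
      - run: pip install bigeye-cli
      # Assumption: CLI credentials are already configured for this job.
      # -ip points at the directory holding the Bigconfig files;
      # -auto-approve removes the manual plan approval step.
      - run: bigeye bigconfig apply -ip bigconfig/ -auto-approve
```

Running bigeye bigconfig plan in a pull-request job and reserving apply for the main branch is one possible split, since plan only reports the expected changes.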
Auto Apply on Indexing
In addition to running apply from the CLI, you can set your bigconfig to auto-apply on indexing. This will detect when newly added datasets or columns match an existing tag definition or column selector and automatically create the relevant metrics based on tag deployments.
Indexing is Bigeye's process to update your catalog with any new schema changes. Indexing occurs automatically once a day, and ad hoc when users select "rescan" in the catalog.
To auto-apply your bigconfig on indexing, add auto_apply_on_indexing: True to the top of your YAML file.
Note that this setting must be consistent across all bigconfig files. If you add auto_apply_on_indexing: True to one file, you must add it to all of them, since the default value is False.
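For instance, the top of a Bigconfig file with this setting enabled might look like the following minimal sketch, with the remaining modules omitted:

```yaml
type: BIGCONFIG_FILE
auto_apply_on_indexing: True   # must be set the same way in every Bigconfig file

# tag_definitions, tag_deployments, and other modules follow here
```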
Bigconfig YAML File
You can deploy monitoring with Bigconfig using one or more YAML files at a time. Remember that the input path accepts a list of paths, and any .yaml or .yml files in the specified directory with type: BIGCONFIG_FILE will be included.
The bigconfig YAML file is made up of several modules. Either the table deployment or tag deployment module is required. All other modules are optional but are designed to help you scale your data observability across your warehouse(s). You may want to use separate files for each module, or separate files for independent table or tag deployments owned by different teams.
View a complete sample bigconfig at the bottom of this page to get started. Read below for more tips on how to customize to your data and monitoring goals.
Start Simple, Build for Scale
We recommend starting simple: the most basic bigconfig contains only the table deployment module. All other modules are optional but are designed to help you scale your data observability across your warehouse(s).
Best Practice: Use Version Control
Checking your YAML files into a source control system such as Git will allow you to track changes to configuration-driven metrics, solicit feedback on changes, and work as a team on a suite of metrics.
Table Deployments
Deploy metrics for important tables by listing metrics for each column.
Table deployments include a fully qualified table name, row creation time, table metrics, and metrics by column. When listing metrics, you can either use a [saved_metric_id](https://docs.bigeye.com/docs/bigconfig#saved-metric-definitions-optional) or inline your metric definition. Similar to saved_metric_definitions, only metric_type is required; all other attributes are optional and will use workspace defaults if not specified.
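The two ways of listing metrics can be mixed on the same column. The sketch below assumes a saved metric with the ID no_nulls has been defined under saved_metric_definitions; the table and column names are illustrative.

```yaml
table_deployments:
  - collection:
      name: prod_analytics_monthly_actives
    deployments:
      - fq_table_name: analytics_warehouse.kpi_reports.maus
        columns:
          - column_name: user_id
            metrics:
              # Reference a saved metric by its ID
              - saved_metric_id: no_nulls
              # Or define the metric inline; only metric_type is required
              - metric_type:
                  predefined_metric: COUNT_DUPLICATES
                threshold:
                  type: CONSTANT
                  upper_bound: 0
```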
You can group one or more table deployments into a collection so you are able to route notifications effectively and see a status summary of metrics in the Bigeye application. You can use an existing collection or create a new collection. Collection attributes include:
- name: required
- notification_channels:
  - slack: slack channel with #
  - email: email address
  - webhook: webhook url
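A collection that routes to all three channel types might be sketched as follows. The email address and webhook URL are hypothetical placeholders, and the nesting follows the attribute list above.

```yaml
table_deployments:
  - collection:
      name: prod_analytics_monthly_actives
      notification_channels:
        - slack: '#prod_analytics'
        - email: data-team@example.com                # hypothetical address
        - webhook: https://example.com/hooks/bigeye   # hypothetical URL
    deployments:
      - fq_table_name: analytics_warehouse.kpi_reports.maus
        table_metrics:
          - metric_type:
              predefined_metric: FRESHNESS
```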
Valid table metrics include:
COUNT_ROWS
ROWS_INSERTED
COUNT_READ_QUERIES
FRESHNESS
VOLUME
Valid column metrics depend on the column type. For example, AVERAGE metrics are only applicable to numeric columns. See the full list of available metrics here. The FIXME file will contain errors if you attempt to deploy an invalid metric on a column in a table deployment.
Note: either a table_deployment or a tag_deployment is required to create metrics via bigconfig. Wildcards are not supported in table deployments.
Example:
table_deployments:
  - collection:
      name: prod_analytics_monthly_actives
      notification_channels:
        - slack: '#prod_analytics'
    deployments:
      - fq_table_name: analytics_warehouse.kpi_reports.maus
        table_metrics:
          - metric_type:
              predefined_metric: FRESHNESS
          - metric_type:
              predefined_metric: VOLUME
        columns:
          - column_name: user_id
            metrics:
              - metric_type:
                  predefined_metric: PERCENT_NULL
              - metric_type:
                  predefined_metric: COUNT_DUPLICATES
          - column_name: device_type
            metrics:
              - metric_type:
                  predefined_metric: PERCENT_NULL
              - metric_type:
                  predefined_metric: COUNT_DISTINCT
          - column_name: total_logins
            metrics:
              - metric_type:
                  predefined_metric: PERCENT_NULL
              - metric_type:
                  predefined_metric: AVERAGE
Saved Metric Definitions (Optional)
Define reusable metrics with custom configurations. You can deploy these metrics on a specified tag and/or in a table deployment later. For example, you may want to save NULL and duplicate metrics with a constant threshold of zero to apply to ID columns that should never be NULL or duplicated.
To define a saved metric, you need to specify a saved_metric_id and metric_type; both are required. All other attributes are optional. If they are not provided, metrics will be created with workspace defaults. A full list of attributes and their defaults is below:
- saved_metric_id: required
- metric_type: required
  - type: either PREDEFINED or TEMPLATE
  - metric: one of the predefined metric names
- metric_name: optional name used to identify the monitor, will show in notifications and metric charts. Defaults to metric type.
- description: optional description that shows on metric page
- rct_overrides: optional list of column names to set as row creation time for the metric. The metric will attempt to use the first column in the list. If it does not exist, it will attempt the next column until a column is found or the end of the list is reached.
- schedule_frequency: defaults to every 6 hours
  - interval_type: DAYS or HOURS
  - interval_value: integer value
- metric_schedule: optional named schedule defined in a workspace or schedule frequency
  - named_schedule: optional cron schedule in a workspace (the schedule MUST exist in Bigeye)
    - name: the name of the schedule
- lookback: Note this attribute is only valid if row creation time is set.
  - lookback_type: DATA_TIME, CLOCK_TIME, or METRIC_TIME. Learn more about lookback types. Default value depends on "Data time window as default" in advanced settings of a workspace. If enabled, default is METRIC_TIME (data time window), otherwise DATA_TIME.
  - lookback_window: Default is 2 days.
    - interval_type: DAYS or HOURS
    - interval_value: integer value
  - bucket_size: Aggregation for metric query, HOUR or DAY. Note only for metrics with METRIC_TIME lookback.
- conditions: list of conditions to add to metric query as WHERE clauses, each is added with AND
- group_by: list of fields in the metric table to group by.
- threshold: defaults to autothresholds with medium width
  - type: AUTO, RELATIVE, STDDEV, or CONSTANT. Learn more.
  - sensitivity: autothresholds only: NARROW, MEDIUM, WIDE or XWIDE, defaults to MEDIUM
  - upper_bound: constant, relative, and stddev thresholds only
  - lower_bound: constant, relative, and stddev thresholds only
  - upper_bound_only: autothresholds only: True or False, defaults to False.
  - lower_bound_only: autothresholds only: True or False, defaults to False.
  - lookback: Note this attribute is only valid and required for relative and stddev thresholds
    - interval_type: DAYS
    - interval_value: integer value
- notification_channels: accepts a list of notification channels, each with the following:
  - slack: slack channel with #
  - email: email address
  - webhook: webhook url
Note that saved metric definitions are optional. You can always inline metrics and custom configurations as you need.
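To show how several of the optional attributes fit together, here is a sketch of a more heavily customized saved metric. The saved_metric_id, metric name, description, and the condition's column and value are hypothetical, and the nesting is inferred from the attribute list above rather than taken from a verified configuration.

```yaml
saved_metric_definitions:
  metrics:
    - saved_metric_id: active_rows_null_check   # hypothetical ID
      metric_type:
        predefined_metric: PERCENT_NULL
      metric_name: Null rate for active rows    # shown in notifications and charts
      description: Tracks NULLs among active rows only
      schedule_frequency:
        interval_type: HOURS
        interval_value: 12
      lookback:                                 # only valid when row creation time is set
        lookback_type: DATA_TIME
        lookback_window:
          interval_type: DAYS
          interval_value: 7
      conditions:
        - "status = 'active'"                   # hypothetical column and value
      threshold:
        type: AUTO
        sensitivity: WIDE
        upper_bound_only: True
```

Any attribute left out of the definition falls back to the workspace default, so a definition like this only needs the settings that differ from those defaults.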
Default Settings Keep Things Clean
Only a saved_metric_id and metric_type are required in saved_metric_definitions. Bigconfig will automatically apply your other workspace default settings, so you only have to specify attributes you want to customize.
Example:
saved_metric_definitions:
  metrics:
    - saved_metric_id: no_nulls
      metric_type:
        predefined_metric: PERCENT_NULL
      threshold:
        type: CONSTANT
        upper_bound: 0
    - saved_metric_id: no_dupes
      metric_type:
        predefined_metric: COUNT_DUPLICATES
      threshold:
        type: CONSTANT
        upper_bound: 0
Tag Definitions (Optional)
Define tags to deploy metrics on a dynamic set of columns. You can also use tags to select columns that will be used as row creation times across multiple tables. For example, you may want to tag all ID columns for format, duplicate and NULL monitoring. Alternatively, you may want to tag an important schema or collection of tables for analytics monitoring.
The tag_definitions module consists of many tag_ids, which are defined by one or more column_selectors. Tag_ids must be unique in the workspace. column_selectors can be defined by name, using fully qualified column names including source, schema, table, and column, or they can be defined by column type. You can use wildcards in column_selectors defined by name. You can also combine exclusions with names to define a reduced set of column_selectors.
Tag definitions can also use regular expressions to specify exact matches for a deployment. Periods separating objects need to be escaped with a backslash, e.g. source\.database\.schema\.table\.column
Note that you can combine column selectors in tag definitions to create AND requirements. The example below selects all columns that match analytics_warehouse.*.*.* AND have type INT:
column_selectors:
  - name: analytics_warehouse.*.*.*
    type: INT
See additional examples below:
tag_definitions:
  - tag_id: ID_FIELDS
    column_selectors:
      - name: analytics_warehouse.*.*.*_id
      - name: analytics_warehouse.*.*.*_ID
      - name: analytics_warehouse.*.*.*Id
  - tag_id: INCREMENTAL_IDS
    column_selectors:
      - name: analytics_warehouse.kpi_reports.*.created_at
      - name: analytics_warehouse.revenue_reports.*.updated_at
  - tag_id: PROD_TABLES
    column_selectors:
      - name: analytics_warehouse.*_prod.*.*
  - tag_id: EXEC_REPORTING
    column_selectors:
      - name: analytics_warehouse.kpi_reports.*.*
        exclude: analytics_warehouse.kpi_reports.*_history.*
        exclude: analytics_warehouse.kpi_reports.*_archived.*
      - name: analytics_warehouse.revenue_*.*.*
        exclude: analytics_warehouse.revenue_*_nonprod.*.*
  - tag_id: STRING_TYPES
    column_selectors:
      - type: STRING
  - tag_id: ANALYTICS_NUMERICS
    column_selectors:
      - name: analytics_warehouse.*.*.*
        type: INT
  - tag_id: ANALYTICS_REGEX
    column_selectors:
      - regex: analytics_warehouse\.kpi_reports\.reporting_(?!.*(history|archived).)\..*
Row Creation Times (Optional)
NOTE: It is recommended to use the rct_overrides parameter to set row creation time at the metric level.
Set row creation time across multiple tables with a tag_id or inline column_selectors. For example, you may want to always set the updated_at column as the row creation time across your source.
All columns must be valid timestamp columns. Only one column per table may be selected as row creation time.
Example:
row_creation_times:
  tag_ids:
    - INCREMENTAL_IDS
  column_selectors:
    - name: analytics_warehouse.staging.*.createdate
Tag Deployments
Deploy metrics on predefined tags. For example, you can apply your saved NULL and duplicate metrics on the ID_FIELDS tag to ensure all ID columns across your warehouse are consistently monitored.
When listing metrics, you can either use a saved_metric_id or inline your metric definition. Similar to saved_metric_definitions, only metric_type is required; all other attributes are optional and will use workspace defaults if not specified.
When deploying metrics via tag, metrics will only be created on valid column types automatically. For example, if you deploy an AVERAGE metric on a tag that includes all columns and tables in a schema, average metrics will only be created on numeric columns. You can also deploy table-level metrics on a tag, as long as the tag definition uses a column wildcard, ex: source.schema.table.*
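As an illustration of both points, the sketch below tags a single table with a column wildcard and deploys one table-level metric and one column-level metric on it. The tag ID MAUS_TABLE and the table name are hypothetical.

```yaml
tag_definitions:
  - tag_id: MAUS_TABLE                 # hypothetical tag ID
    column_selectors:
      # Column wildcard, so the tag can also carry table-level metrics
      - name: analytics_warehouse.kpi_reports.maus.*

tag_deployments:
  - collection:
      name: prod_data_ops
    deployments:
      - tag_id: MAUS_TABLE
        metrics:
          - metric_type:
              predefined_metric: COUNT_ROWS   # table-level metric
          - metric_type:
              predefined_metric: AVERAGE      # created on numeric columns only
```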
You can group one or more tag deployments into a collection so you are able to route notifications effectively and see a status summary of metrics in the Bigeye application. You can use an existing collection or create a new collection. Collection attributes include:
- name: required
- notification_channels:
  - slack: slack channel with #
  - email: email address
  - webhook: webhook url
Note: either a table_deployment or a tag_deployment is required to create metrics via bigconfig.
Example:
tag_deployments:
  - collection:
      name: prod_data_ops
      notification_channels:
        - slack: '#data_alerts'
    deployments:
      - tag_id: ID_FIELDS
        metrics:
          - saved_metric_id: no_nulls
          - saved_metric_id: no_dupes
      - tag_id: PROD_TABLES
        metrics:
          - metric_type:
              predefined_metric: COUNT_ROWS
          - metric_type:
              predefined_metric: HOURS_SINCE_MAX_TIMESTAMP
          - metric_type:
              predefined_metric: PERCENT_NULL
      - tag_id: EXEC_REPORTING
        metrics:
          - metric_type:
              predefined_metric: COUNT_ROWS
          - metric_type:
              predefined_metric: HOURS_SINCE_MAX_TIMESTAMP
          - metric_type:
              predefined_metric: AVERAGE
          - metric_type:
              predefined_metric: MIN
          - metric_type:
              predefined_metric: MAX
          - metric_type:
              predefined_metric: VARIANCE
Example Template
Below is a complete bigconfig example with all modules included. Copy/paste it as a template to create your own.
type: BIGCONFIG_FILE
auto_apply_on_indexing: True
tag_definitions:
  - tag_id: ID_FIELDS
    column_selectors:
      - name: analytics_warehouse.*.*.*_id
      - name: analytics_warehouse.*.*.*_ID
      - name: analytics_warehouse.*.*.*Id
  - tag_id: INCREMENTAL_IDS
    column_selectors:
      - name: analytics_warehouse.kpi_reports.*.created_at
      - name: analytics_warehouse.revenue_reports.*.updated_at
  - tag_id: PROD_TABLES
    column_selectors:
      - name: analytics_warehouse.*_prod.*.*
  - tag_id: EXEC_REPORTING
    column_selectors:
      - name: analytics_warehouse.kpi_reports.*.*
      - name: analytics_warehouse.revenue_reports.*.*
saved_metric_definitions:
  metrics:
    - saved_metric_id: no_nulls
      metric_type:
        predefined_metric: PERCENT_NULL
      rct_overrides:
        - UPDATE_TIMESTAMP
      threshold:
        type: CONSTANT
        upper_bound: 0
    - saved_metric_id: no_dupes
      metric_type:
        predefined_metric: COUNT_DUPLICATES
      threshold:
        type: CONSTANT
        upper_bound: 0
      metric_schedule:
        named_schedule:
          name: 'Everyday, 8:00 UTC'
tag_deployments:
  - collection:
      name: prod_data_ops
      notification_channels:
        - slack: '#data_alerts'
        - webhook: https://automation.atlassian.com/pro/hooks
    deployments:
      - tag_id: ID_FIELDS
        metrics:
          - saved_metric_id: no_nulls
          - saved_metric_id: no_dupes
      - tag_id: PROD_TABLES
        metrics:
          - metric_type:
              predefined_metric: COUNT_ROWS
          - metric_type:
              predefined_metric: HOURS_SINCE_MAX_TIMESTAMP
          - metric_type:
              predefined_metric: PERCENT_NULL
      - tag_id: EXEC_REPORTING
        metrics:
          - metric_type:
              predefined_metric: COUNT_ROWS
          - metric_type:
              predefined_metric: HOURS_SINCE_MAX_TIMESTAMP
          - metric_type:
              predefined_metric: AVERAGE
          - metric_type:
              predefined_metric: MIN
          - metric_type:
              predefined_metric: MAX
          - metric_type:
              predefined_metric: VARIANCE
            threshold:
              type: RELATIVE
              lower_bound: 5
              lookback:
                interval_type: DAYS
                interval_value: 1
table_deployments:
  - collection:
      name: prod_analytics_monthly_actives
      notification_channels:
        - slack: '#prod_analytics'
        - webhook: https://dev.service-now.com/api/workspace
    deployments:
      - fq_table_name: analytics_warehouse.Bigeye Virtual Schema.maus
        row_creation_time: updated_at
        table_metrics:
          - metric_type:
              predefined_metric: COUNT_ROWS
          - metric_type:
              predefined_metric: HOURS_SINCE_MAX_TIMESTAMP
        columns:
          - column_name: user_id
            metrics:
              - metric_type:
                  predefined_metric: PERCENT_NULL
              - metric_type:
                  predefined_metric: COUNT_DUPLICATES
          - column_name: device_type
            metrics:
              - metric_type:
                  predefined_metric: PERCENT_NULL
              - metric_type:
                  predefined_metric: COUNT_DISTINCT
              - metric_type:
                  predefined_metric: PERCENT_VALUE_IN_LIST
                parameters:
                  - key: list
                    string_value: "DEVICE1,DEVICE2,DEVICE3"
          - column_name: total_logins
            metrics:
              - metric_type:
                  predefined_metric: PERCENT_NULL
              - metric_type:
                  predefined_metric: AVERAGE
              - metric_type:
                  type: TEMPLATE
                  template_id: 240
                  aggregation_type: COUNT
                  template_name: Is Greater than 0
                threshold:
                  type: AUTO
                  sensitivity: NARROW
                parameters:
                  - key: column
                    column_name: total_logins