Checks & Validators¶
Part of the Data Contracts Technical Specification.
Overview¶
Data contract checks validate the quality of a single dataset. Each check computes a metric — a row count, a null percentage, a column mean — and compares it against a declared acceptance condition. Checks that exceed their acceptance condition produce failures at the declared severity level, giving data teams a structured signal to act on.
Data contracts support two categories of checks:
- Table-Level Checks: Validate table-wide properties (row counts, freshness, duplicates, partitions)
- Column-Level Checks: Validate individual column values (nullability, uniqueness, ranges, patterns, distributions)
Naming Conventions:
- Check names should be descriptive business statements (e.g., "Order ID is unique")
- Severity levels:
P0(critical),P1(important, default),P2(nice-to-have),P3(informational). Omittingseveritydefaults toP1. - All checks support optional
tagsparameter for categorization - Check type names in YAML contracts use
snake_case(e.g.,num_rows,duplicates,missing).
Check Structure:
Every check shares a common structure:
- name: "Descriptive business name" # required
type: <check_type> # required
severity: P0 | P1 | P2 | P3 # optional, defaults to P1
<validator>: <value> # optional — exactly one of: min, max, between, not_between, equals
tolerance: <value> # optional
# check-specific params (e.g. return: count/pct, columns:, values:, pattern:)
The name is a descriptive business statement. The type identifies which check to run. severity sets the priority level and defaults to P1 when omitted. A single validator — min, max, between, not_between, or equals — defines the acceptance condition; only one may be specified per check. The optional tolerance parameter allows acceptable variance. Check-specific parameters (such as return, columns, values, pattern) are documented in each check's detail section.
Note: Cross-dataset validation (referential integrity, schema consistency, aggregate reconciliation) should be handled at the orchestration/workflow level rather than in individual data contracts. This keeps contracts focused on single-dataset quality while allowing orchestration tools to manage relationships between datasets.
Checks Summary¶
Table-Level Checks¶
Validate table-wide properties and aggregates. Specified in the top-level checks array.
| Check Type | Description | Return |
|---|---|---|
num_rows |
Total row count validation | count |
duplicates |
Count of duplicate rows | count or pct |
freshness |
Data recency validation | age_hours |
completeness |
Partition gap detection | gap_count |
Total: 4 table-level checks
Column-Level Checks¶
Validate individual column values and properties. Specified within the checks array of a column definition.
Statistical Checks¶
Statistical checks compute aggregate metrics over the entire column and return a single numeric value.
| Check Type | Description |
|---|---|
cardinality |
Count of distinct values |
min |
Minimum value in column |
max |
Maximum value in column |
mean |
Average value validation |
sum |
Sum of values validation |
count |
Count of non-null values |
variance |
Variance validation |
stddev |
Standard deviation validation |
percentile |
Percentile value validation |
Value Checks¶
Value checks validate individual values within a column. Each check returns either an absolute count (count) or percentage (pct) controlled by the return parameter. For missing and duplicates, the return value represents the count/percentage of missing or duplicate values found. For whitelist, blacklist, pattern, min_length, max_length, and avg_length, the return value represents the count/percentage of valid (conforming) rows.
| Check Type | Description |
|---|---|
missing |
Null value validation |
duplicates |
Duplicate value validation |
whitelist |
Values in allowed set |
blacklist |
Values in disallowed set |
pattern |
Values matching regex pattern |
min_length |
Minimum length of string, list, or map values |
max_length |
Maximum length of string, list, or map values |
avg_length |
Average length of string, list, or map values |
Total: 17 column-level checks (9 Statistical + 8 Value Checks)
Validators¶
Every check returns a single float metric. Validators compare that metric against a declared acceptance condition. At most one validator may be specified per check. The optional tolerance parameter applies to all four and is a float that defaults to 1e-9.
min¶
Sets a lower bound. The check passes if actual ≥ (min - tolerance).
max¶
Sets an upper bound. The check passes if actual ≤ (max + tolerance).
between¶
Inclusive range — shorthand for specifying both min and max. The check passes if actual ≥ (min - tolerance) AND actual ≤ (max + tolerance). Cannot be combined with min or max; doing so raises a ContractValidationError.
not_between¶
Inverse range — the check passes if actual < (min - tolerance) OR actual > (max + tolerance). Cannot be combined with min, max, or between; doing so raises a ContractValidationError.
- name: "Order ID is not in reserved test range"
type: num_rows
not_between: [9000000, 9999999]
severity: P1
equals¶
Exact match. The check passes if |actual - expected| ≤ tolerance.
- name: "Total allocated budget is exactly $1M"
type: sum
equals: 1000000.0
tolerance: 0.01
severity: P0
Noop (no validator)¶
Omitting all validators produces a noop: the check runs and records the computed metric but never fails on the value. Use this for informational checks or when you want to observe a metric without enforcing a bound.
Table-Level Checks¶
Table-level checks are specified in the top-level checks array (sibling to columns). They validate table-wide properties and aggregates.
Volume Checks¶
num_rows¶
The num_rows check validates that the total row count falls within specified bounds.
Validators:
min,max,between,not_between,equals— see Validators.
Example 1: Range Validation
checks:
- name: "Daily volume within bounds"
type: num_rows
between: [100, 1000000] # Between 100 and 1M rows
severity: P1
Example 2: Minimum Volume
checks:
- name: "At least 1000 transactions per day"
type: num_rows
min: 1000 # At least 1000 rows
severity: P0
Example 3: Exact Row Count
checks:
- name: "Partition has exactly 1000 records"
type: num_rows
equals: 1000 # Exactly 1000 rows
tolerance: 0
severity: P1
duplicates¶
The duplicates check validates that the number of duplicate rows (based on specified columns) is within bounds.
Parameters:
| Parameter | Type | Required | Default | Description |
|---|---|---|---|---|
columns |
list[string] |
Yes | None | Columns to check for duplicates (composite key) |
return |
string |
No | count |
Return type: "count" (absolute) or "pct" (percentage 0-1) |
Validators:
min,max,between,not_between,equals— see Validators.
Example 1: No Duplicates (Most Common)
checks:
- name: "No duplicate orders"
type: duplicates
columns: ["order_id"]
max: 0 # No duplicates allowed
severity: P0
Example 2: Allow Small Number of Duplicates
checks:
- name: "Few duplicate user-date combinations"
type: duplicates
columns: ["user_id", "event_date"]
max: 10 # At most 10 duplicate rows
severity: P1
Example 3: Exact Duplicate Count
checks:
- name: "Expected duplicate rate"
type: duplicates
columns: ["transaction_id", "timestamp"]
equals: 50 # Exactly 50 duplicates expected
tolerance: 5 # Within 5
severity: P2
Example 4: Duplicate Percentage
checks:
- name: "Low duplicate rate"
type: duplicates
columns: ["order_id"]
return: pct
max: 0.01 # At most 1% duplicate rows
severity: P1
Freshness Checks¶
freshness¶
The freshness check validates that data is not stale — the most recent timestamp falls within an acceptable age.
Returns: age_hours — the age of the most recent (or oldest, depending on aggregation) record in hours. Validated implicitly: the check passes when age_hours <= max_age_hours.
Parameters:
| Parameter | Type | Required | Default | Description |
|---|---|---|---|---|
max_age_hours |
float |
Yes | None | Maximum allowed age in hours |
timestamp_column |
string |
Yes | None | Column containing timestamps |
aggregation |
string |
No | "max" | Aggregation method ("max" or "min") |
Validators: not applicable — uses
max_age_hoursinstead of standard validators.
Example 1: Basic Usage (Auto-generated from SLA)
checks:
- name: "Data is fresh"
type: freshness
max_age_hours: 24
timestamp_column: created_at
severity: P0
Example 2: Partition Freshness
checks:
- name: "Latest partition is within 2 hours"
type: freshness
max_age_hours: 2.0
timestamp_column: event_hour
severity: P0
Example 3: Oldest Record Within 7 Days
checks:
- name: "Oldest record within 7 days"
type: freshness
max_age_hours: 168.0 # oldest record must be no more than 7 days old
timestamp_column: ingestion_time
aggregation: min
severity: P2
completeness¶
The completeness check validates that partitioned data has no missing partitions within a time range (gap detection). This check only applies to tables with metadata.partitioned_by defined.
Returns: gap_count — the number of missing partitions found. Validated implicitly: the check passes when gap_count <= max_gap_count.
Note: For column-level null validation, use missing check (with return: count for absolute counts or return: pct for percentages).
Parameters:
| Parameter | Type | Required | Default | Description |
|---|---|---|---|---|
partition_column |
string |
Yes | None | Column used for partitioning (date/timestamp) |
granularity |
string |
Yes | None | Expected granularity ("hourly", "daily", "weekly", "monthly") |
lookback_days |
int |
No | 30 |
Days to check for gaps |
allow_future_gaps |
bool |
No | true |
Ignore gaps in future dates |
max_gap_count |
int |
No | 0 |
Maximum allowed missing partitions |
Validators: not applicable — uses
max_gap_countinstead of standard validators.
Example 1: Daily Partitions (No Gaps)
checks:
- name: "No missing daily partitions in last 7 days"
type: completeness
partition_column: order_date
granularity: daily
lookback_days: 7
allow_future_gaps: true
severity: P1
Example 2: Hourly Partitions
checks:
- name: "Hourly partitions are complete"
type: completeness
partition_column: event_hour
granularity: hourly
lookback_days: 1
severity: P0
Example 3: Allow Some Gaps (Best Effort)
checks:
- name: "Weekly partitions mostly complete"
type: completeness
partition_column: report_week
granularity: weekly
lookback_days: 90
max_gap_count: 2
severity: P2
Column-Level Checks¶
Column-level checks are specified within the checks array of a column definition. They validate individual column values and statistical properties.
Value Checks¶
Value checks validate individual values within a column (rather than computing aggregates over the column). Each check defines its own return semantics. For missing and duplicates, the return value represents missing or duplicate values found. For whitelist, blacklist, pattern, min_length, max_length, and avg_length, the return value represents conforming (valid) rows. Use the return parameter to specify the return type (count or pct). When return: pct is used, the percentage denominator is always the total row count including nulls; the numerator differs by check family: for missing and duplicates the numerator is the count of missing or duplicate values respectively, while for whitelist, blacklist, pattern, min_length, max_length, and avg_length the numerator is the count of conforming (valid) rows.
missing¶
The missing check validates null values in a column, returning the count or percentage of null values based on the return parameter.
Return Parameter: Use return: count (default) to return absolute null count, or return: pct to return null percentage (0-1 scale).
Parameters:
| Parameter | Type | Required | Default | Description |
|---|---|---|---|---|
return |
string |
No | count |
Return type: "count" (absolute) or "pct" (percentage 0-1) |
Validators:
min,max,between,not_between,equals— see Validators.
Example 1: Maximum Bound (Most Common)
columns:
- name: notes
type: string
nullable: true
description: "Optional customer notes"
checks:
- name: "Most orders have notes"
type: missing
max: 1000 # At most 1000 nulls
severity: P2
Example 2: Range Validation
columns:
- name: deleted_at
type:
kind: timestamp
nullable: true
description: "Deletion timestamp (null if not deleted)"
checks:
- name: "Few deletions expected"
type: missing
between: [0, 50] # Between 0 and 50 nulls
severity: P1
Example 3: Exact Null Count
columns:
- name: optional_metadata
type: string
nullable: true
description: "Optional metadata field"
checks:
- name: "Exactly 100 records have metadata"
type: missing
equals: 900 # Expecting 900 nulls (100 non-null)
tolerance: 0
severity: P2
Example 4: Using Percentage Return
columns:
- name: email
type: string
nullable: true
description: "Customer email address"
checks:
- name: "Low null percentage"
type: missing
return: pct
max: 0.10 # At most 10% nulls
severity: P1
whitelist¶
The whitelist check validates that non-null values match a set of allowed values, returning the count or percentage of rows that conform.
Return Parameter: Use return: count (default) to return the count of matching rows, or return: pct to return the percentage (0-1 scale).
Parameters:
| Parameter | Type | Required | Default | Description |
|---|---|---|---|---|
values |
list |
Yes | None | List of allowed values |
return |
string |
No | count |
Return type: "count" (absolute) or "pct" (percentage 0-1) |
case_sensitive |
bool |
No | true |
Whether comparison is case-sensitive (strings only) |
Validators:
min,max,between,not_between,equals— see Validators.
Example 1: Basic Usage
columns:
- name: status
type: string
nullable: false
description: "Order status"
checks:
- name: "Status is valid"
type: whitelist
values: ["pending", "processing", "shipped", "delivered", "cancelled"]
severity: P0
Example 2: Case-Insensitive
columns:
- name: country_code
type: string
nullable: false
description: "ISO country code"
checks:
- name: "Country code is valid"
type: whitelist
values: ["US", "CA", "MX", "GB", "DE", "FR"]
case_sensitive: false
severity: P1
Example 3: Numeric Enum
columns:
- name: priority
type: int
nullable: false
description: "Task priority level"
checks:
- name: "Priority is valid (1-5)"
type: whitelist
values: [1, 2, 3, 4, 5]
severity: P0
Example 4: Using Percentage Return
columns:
- name: status
type: string
nullable: false
description: "Order status"
checks:
- name: "Most statuses are valid"
type: whitelist
values: ["pending", "processing", "shipped", "delivered", "cancelled"]
return: pct
min: 0.95 # At least 95% match
severity: P1
blacklist¶
The blacklist check validates that non-null values do not match any forbidden value, returning the count or percentage of rows that pass (rows not in the blacklist).
Return Parameter: Use return: count (default) to return the count of passing rows, or return: pct to return the percentage (0-1 scale). Both return values represent conforming rows — use min to set a lower bound (e.g., at least 95% must pass).
Parameters:
| Parameter | Type | Required | Default | Description |
|---|---|---|---|---|
values |
list |
Yes | None | List of forbidden values |
return |
string |
No | count |
Return type: "count" (absolute) or "pct" (percentage 0-1) |
case_sensitive |
bool |
No | true |
Whether comparison is case-sensitive (strings only) |
Validators:
min,max,between,not_between,equals— see Validators.
Example 1: Basic Usage
columns:
- name: username
type: string
nullable: false
description: "User username"
checks:
- name: "Username is not reserved"
type: blacklist
values: ["admin", "root", "system", "null", "undefined"]
case_sensitive: false
severity: P0
Example 2: Exclude Test Data
columns:
- name: email
type: string
nullable: false
description: "Customer email address"
checks:
- name: "Email is not test account"
type: blacklist
values: ["test@example.com", "noreply@example.com", "admin@test.com"]
severity: P1
Example 3: Percentage Passing
columns:
- name: username
type: string
nullable: false
description: "User username"
checks:
- name: "Almost all usernames are not reserved"
type: blacklist
values: ["admin", "root", "system"]
return: pct
min: 0.99 # At least 99% must pass
severity: P1
duplicates¶
The duplicates check (column-level) validates that the count of duplicate values in a column is within specified bounds. This is the column-level version of the table-level duplicates check, which validates duplicates across multiple columns.
Semantics: Duplicate count is computed as COUNT(*) - COUNT(DISTINCT col) — the total number of non-first occurrences. If value "A" appears 3 times, it contributes 2 to the count. NULL handling: COUNT(*) includes all rows while COUNT(DISTINCT col) excludes NULLs, so NULL values are counted as duplicates. For example, [A, A, NULL] yields a duplicate count of 2 (one extra "A" plus one NULL). This is the intended behavior: NULLs are treated as duplicate occurrences. If you need to exclude NULLs, use COUNT(col) - COUNT(DISTINCT col) in a manual check instead.
Return Parameter: Use return: count (default) to return absolute duplicate count, or return: pct to return duplicate percentage (0-1 scale).
Parameters:
| Parameter | Type | Required | Default | Description |
|---|---|---|---|---|
return |
string |
No | count |
Return type: "count" (absolute) or "pct" (percentage 0-1) |
Validators:
min,max,between,not_between,equals— see Validators.
Note: To enforce uniqueness (no duplicates), use max: 0 or equals: 0.
Example 1: No Duplicates (Uniqueness)
columns:
- name: order_id
type: int
nullable: false
description: "Unique order identifier"
checks:
- name: "Order ID is unique"
type: duplicates
max: 0 # No duplicates allowed
severity: P0
Example 2: Allow Some Duplicates
columns:
- name: user_id
type: int
nullable: false
description: "User identifier (repeat customers allowed)"
checks:
- name: "User ID has reasonable duplicates"
type: duplicates
max: 100 # Up to 100 duplicate user IDs
severity: P1
Example 3: Exact Duplicate Count (Testing)
columns:
- name: test_category
type: string
nullable: false
description: "Test category with expected duplicates"
checks:
- name: "Category has expected duplicates"
type: duplicates
equals: 50 # Expect exactly 50 duplicates
tolerance: 0
severity: P2
Example 4: Duplicate Percentage
columns:
- name: order_id
type: int
nullable: false
description: "Order identifier"
checks:
- name: "Low duplicate rate"
type: duplicates
return: pct
max: 0.01 # At most 1% duplicate values
severity: P1
String and Collection Checks¶
String and collection checks are part of Value Checks. See the Value Checks section above for context.
pattern¶
The pattern check validates that string values match a pattern, supporting either explicit regex or predefined format shortcuts, and returns the count or percentage of conforming values.
Pattern Specification: Use either:
patternparameter with a regex pattern (e.g.,"^[A-Z]{2}\\d{6}$")formatparameter with a predefined format name (e.g.,"email","phone","uuid")
Return Parameter: Use return: count (default) to return count of matching values, or return: pct to return percentage (0-1 scale) of matching values.
Parameters:
| Parameter | Type | Required | Default | Description |
|---|---|---|---|---|
pattern |
string |
No* | None | Regular expression pattern to match |
format |
string |
No* | None | Predefined format name ("email", "phone", "uuid", "url", "ipv4", "ipv6", "date", "datetime") |
flags |
list[string] |
No | [] |
Regex flags (e.g., "IGNORECASE", "MULTILINE") - only for pattern |
return |
string |
No | count |
Return type: "count" (absolute) or "pct" (percentage 0-1) |
* Exactly ONE of pattern or format must be specified. Specifying flags together with format is invalid and raises a ContractValidationError.
Validators:
min,max,between,not_between,equals— see Validators.
Example 1: Email Validation
columns:
- name: email
type: string
nullable: false
description: "Customer email address"
checks:
- name: "Email format is valid"
type: pattern
pattern: "^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\\.[a-zA-Z]{2,}$"
severity: P0
Example 2: Phone Number Format
columns:
- name: phone
type: string
nullable: true
description: "Phone number (US format)"
checks:
- name: "Phone number matches US format"
type: pattern
pattern: "^\\+1-\\d{3}-\\d{3}-\\d{4}$"
severity: P1
Example 3: Product SKU Pattern
columns:
- name: sku
type: string
nullable: false
description: "Product SKU code"
checks:
- name: "SKU follows naming convention"
type: pattern
pattern: "^[A-Z]{2}\\d{6}[A-Z]$"
severity: P0
Example 4: Percentage Return
columns:
- name: email
type: string
nullable: false
description: "Customer email address"
checks:
- name: "95% emails are valid"
type: pattern
pattern: "^[^@]+@[^@]+\\.[^@]+$"
return: pct
min: 0.95 # At least 95%
severity: P1
Example 5: Using Predefined Format (Email)
columns:
- name: email
type: string
nullable: false
description: "Customer email address"
checks:
- name: "All emails are valid"
type: pattern
format: email # Predefined format instead of regex
return: pct
min: 0.95 # At least 95% valid
severity: P1
Example 6: Using Predefined Format (Phone)
columns:
- name: phone
type: string
nullable: true
description: "Phone number"
checks:
- name: "Phone numbers are valid"
type: pattern
format: phone # Matches common phone formats
return: count
min: 1000 # At least 1000 valid phone numbers
severity: P2
Example 7: Using Predefined Format (UUID)
columns:
- name: transaction_id
type: string
nullable: false
description: "Transaction UUID"
checks:
- name: "Transaction IDs are valid UUIDs"
type: pattern
format: uuid # Standard UUID format
return: pct
equals: 1.0 # 100% must be valid UUIDs
severity: P0
Example 8: Using Predefined Format (URL)
columns:
- name: website
type: string
nullable: true
description: "Website URL"
checks:
- name: "Websites are valid URLs"
type: pattern
format: url # HTTP/HTTPS URLs
return: pct
min: 0.90 # At least 90% valid
severity: P1
Supported Format Shortcuts:
| Format | Description | Equivalent Regex Pattern |
|---|---|---|
email |
Email address | ^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\\.[a-zA-Z]{2,}$ |
phone |
Phone number (international) | ^\\+?[1-9]\\d{1,14}$ |
uuid |
UUID (any version) | ^[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{12}$ |
url |
HTTP/HTTPS URL | ^https?://[^\\s/$.?#].[^\\s]*$ |
ipv4 |
IPv4 address | ^(?:[0-9]{1,3}\\.){3}[0-9]{1,3}$ |
ipv6 |
IPv6 address | ^(?:[0-9a-fA-F]{1,4}:){7}[0-9a-fA-F]{1,4}$ |
date |
ISO 8601 date (YYYY-MM-DD) | ^\\d{4}-\\d{2}-\\d{2}$ |
datetime |
ISO 8601 datetime | ^\\d{4}-\\d{2}-\\d{2}T\\d{2}:\\d{2}:\\d{2} |
Note: Format shortcuts provide common patterns for convenience. For custom formats or more specific validation, use the pattern parameter with an explicit regex.
min_length¶
The min_length check computes the minimum length across all non-null values in a string, list, or map column. For strings, length is the character count. For lists and maps, length is the element count.
Parameters:
| Parameter | Type | Required | Default | Description |
|---|---|---|---|---|
return |
string |
No | count |
Return type: count (rows below minimum) or pct |
Validators:
min,max,between,not_between,equals— see Validators.
Example 1: String Column — Minimum Character Count
columns:
- name: username
type: string
nullable: false
description: "User username"
checks:
- name: "Username is at least 3 characters"
type: min_length
min: 3
severity: P1
Example 2: List Column — Minimum Element Count
columns:
- name: tags
type:
kind: list
value_type: string
nullable: true
description: "Product tags"
checks:
- name: "Tags list has at least one element"
type: min_length
min: 1
severity: P2
max_length¶
The max_length check computes the maximum length across all non-null values in a string, list, or map column. For strings, length is the character count. For lists and maps, length is the element count.
Parameters:
| Parameter | Type | Required | Default | Description |
|---|---|---|---|---|
return |
string |
No | count |
Return type: count (rows exceeding maximum) or pct |
Validators:
min,max,between,not_between,equals— see Validators.
Example 1: String Column — Maximum Character Count
columns:
- name: description
type: string
nullable: true
description: "Product description"
checks:
- name: "Description does not exceed 500 characters"
type: max_length
max: 500
severity: P2
Example 2: Map Column — Maximum Element Count
columns:
- name: properties
type:
kind: map
key_type: string
value_type: string
nullable: true
description: "Custom properties"
checks:
- name: "Properties map has at most 50 entries"
type: max_length
max: 50
severity: P2
avg_length¶
The avg_length check computes the average length of non-null values in a string, list, or map column, returning a single float metric. For strings, length is the character count. For lists and maps, length is the element count.
Parameters:
| Parameter | Type | Required | Default | Description |
|---|---|---|---|---|
return |
string |
No | count |
Return type: count (rows below/above threshold) or pct |
Validators:
min,max,between,not_between,equals— see Validators.
Example 1: Average String Length Within Range
columns:
- name: product_name
type: string
nullable: false
description: "Product display name"
checks:
- name: "Product names have reasonable average length"
type: avg_length
between: [5, 100]
severity: P2
Example 2: Average List Length
columns:
- name: tags
type:
kind: list
value_type: string
nullable: true
description: "Product tags"
checks:
- name: "Tags list has reasonable average size"
type: avg_length
between: [1, 10]
severity: P3
Example 3: Using Percentage Return
columns:
- name: description
type: string
nullable: true
description: "Product description"
checks:
- name: "Most descriptions have reasonable average length"
type: avg_length
return: pct
min: 0.90 # At least 90% are within acceptable range
severity: P2
Statistical Checks¶
Statistical checks compute aggregate metrics over the entire column and return a single numeric value.
cardinality¶
The cardinality check counts distinct non-null values in a column and validates that the count falls within specified bounds.
Parameters:
| Parameter | Type | Required | Default | Description |
|---|---|---|---|---|
min |
number |
No | None | Minimum allowed distinct value count |
max |
number |
No | None | Maximum allowed distinct value count |
between |
[number, number] |
No | None | Inclusive range (shorthand for min + max) |
equals |
number |
No | None | Exact expected distinct value count |
tolerance |
float |
No | 1e-9 |
Acceptable variance |
Validators:
min,max,between,not_between,equals— see Validators.
Example 1: Low Cardinality (Categorical)
columns:
- name: status
type: string
nullable: false
description: "Order status"
checks:
- name: "Status is low-cardinality categorical"
type: cardinality
max: 10 # At most 10 distinct values
severity: P1
Example 2: Exact Cardinality
columns:
- name: priority
type: int
nullable: false
description: "Priority level"
checks:
- name: "Priority has exactly 5 levels"
type: cardinality
equals: 5 # Exactly 5 distinct values
severity: P0
Example 3: Cardinality Range
columns:
- name: customer_segment
type: string
nullable: false
description: "Customer segmentation tier"
checks:
- name: "Segment cardinality is stable"
type: cardinality
between: [3, 8] # Between 3 and 8 segments
severity: P1
min¶
The min check validates that the minimum value in a column meets specified criteria — confirming that no value falls below an acceptable floor.
Parameters:
| Parameter | Type | Required | Default | Description |
|---|---|---|---|---|
min |
number |
No | None | Minimum allowed value for the column minimum |
max |
number |
No | None | Maximum allowed value for the column minimum |
between |
[number, number] |
No | None | Inclusive range (shorthand for min + max) |
equals |
number |
No | None | Exact expected value for the column minimum |
tolerance |
float |
No | 1e-9 |
Acceptable variance |
Validators:
min,max,between,not_between,equals— see Validators.
Example 1: Minimum Must Be Non-Negative
columns:
- name: price
type: decimal
nullable: false
description: "Product price in USD"
checks:
- name: "All prices are non-negative"
type: min
min: 0.0
severity: P0
Example 2: Minimum Value in Range
columns:
- name: temperature
type: float
nullable: false
description: "Temperature in Celsius"
checks:
- name: "Minimum temperature is reasonable"
type: min
between: [-50.0, 0.0]
severity: P1
max¶
The max check validates that the maximum value in a column meets specified criteria — confirming that no value exceeds an acceptable ceiling.
Parameters:
| Parameter | Type | Required | Default | Description |
|---|---|---|---|---|
min |
number |
No | None | Minimum allowed value for the column maximum |
max |
number |
No | None | Maximum allowed value for the column maximum |
between |
[number, number] |
No | None | Inclusive range (shorthand for min + max) |
equals |
number |
No | None | Exact expected value for the column maximum |
tolerance |
float |
No | 1e-9 |
Acceptable variance |
Validators:
min,max,between,not_between,equals— see Validators.
Example 1: Maximum Must Be Reasonable
columns:
- name: age
type: int
nullable: false
description: "Customer age in years"
checks:
- name: "Maximum age is reasonable"
type: max
max: 120
severity: P1
Example 2: Maximum Value Equals Expected
columns:
- name: priority
type: int
nullable: false
description: "Task priority (1-5)"
checks:
- name: "Maximum priority is 5"
type: max
equals: 5
severity: P0
mean¶
The mean check validates that the arithmetic mean of numeric values in a column falls within specified bounds.
Parameters:
| Parameter | Type | Required | Default | Description |
|---|---|---|---|---|
min |
number |
No | None | Minimum allowed mean value |
max |
number |
No | None | Maximum allowed mean value |
between |
[number, number] |
No | None | Inclusive range (shorthand for min + max) |
equals |
number |
No | None | Exact expected mean value |
tolerance |
float |
No | 1e-9 |
Acceptable variance |
Validators:
min,max,between,not_between,equals— see Validators.
Use min, max, between, or equals to set bounds. Most use cases require range validation to detect drift in either direction.
Example 1: Range Validation
columns:
- name: order_amount
type: decimal
nullable: false
description: "Order amount in USD"
checks:
- name: "Average order value is typical"
type: mean
between: [50.0, 150.0] # Between $50 and $150 average
severity: P2
Example 2: Monitoring Drift
columns:
- name: response_time_ms
type: int
nullable: false
description: "API response time in milliseconds"
checks:
- name: "Response time is acceptable"
type: mean
max: 500.0 # Average response under 500ms
severity: P1
Example 3: Exact Mean
columns:
- name: rating
type: int
nullable: false
description: "Product rating (1-5)"
checks:
- name: "Average rating matches target"
type: mean
equals: 3.5 # Exactly 3.5 average rating
tolerance: 0.1 # Within 0.1
severity: P2
sum¶
The sum check validates that the total of all non-null values in a column meets specified criteria.
Parameters:
| Parameter | Type | Required | Default | Description |
|---|---|---|---|---|
min |
number |
No | None | Minimum allowed sum |
max |
number |
No | None | Maximum allowed sum |
between |
[number, number] |
No | None | Inclusive range (shorthand for min + max) |
equals |
number |
No | None | Exact expected sum |
tolerance |
float |
No | 1e-9 |
Acceptable variance |
Validators:
min,max,between,not_between,equals— see Validators.
Example 1: Exact Sum
columns:
- name: allocated_budget
type: decimal
nullable: false
description: "Budget allocated to departments"
checks:
- name: "Total allocated budget is exactly $1M"
type: sum
equals: 1000000.0
tolerance: 0.01
severity: P0
Example 2: Sum Range
columns:
- name: daily_sales
type: decimal
nullable: false
description: "Daily sales amount"
checks:
- name: "Daily sales within expected range"
type: sum
between: [10000.0, 100000.0]
severity: P1
Example 3: Minimum Sum
columns:
- name: quantity
type: int
nullable: false
description: "Order quantity"
checks:
- name: "Total quantity is positive"
type: sum
min: 0
tolerance: 0
severity: P0
count¶
The count check validates that the number of non-null values in a column falls within specified bounds.
Parameters:
| Parameter | Type | Required | Default | Description |
|---|---|---|---|---|
min |
number |
No | None | Minimum allowed non-null count |
max |
number |
No | None | Maximum allowed non-null count |
between |
[number, number] |
No | None | Inclusive range (shorthand for min + max) |
equals |
number |
No | None | Exact expected non-null count |
tolerance |
float |
No | 1e-9 |
Acceptable variance |
Validators:
min,max,between,not_between,equals— see Validators.
Example 1: Exact Count
columns:
- name: employee_id
type: int
nullable: false
description: "Employee identifier"
checks:
- name: "Exactly 500 active employees"
type: count
equals: 500
tolerance: 0
severity: P1
Example 2: Minimum Count
columns:
- name: transaction_id
type: int
nullable: false
description: "Transaction identifier"
checks:
- name: "At least 1000 transactions daily"
type: count
min: 1000
severity: P2
Example 3: Count Range
columns:
- name: order_id
type: int
nullable: false
description: "Order identifier"
checks:
- name: "Daily order count is stable"
type: count
between: [100, 1000]
severity: P1
variance¶
The variance check validates that the statistical variance of numeric values in a column falls within specified bounds, providing a measure of data spread.
Parameters:
| Parameter | Type | Required | Default | Description |
|---|---|---|---|---|
min |
number |
No | None | Minimum allowed variance |
max |
number |
No | None | Maximum allowed variance |
between |
[number, number] |
No | None | Inclusive range (shorthand for min + max) |
equals |
number |
No | None | Exact expected variance |
tolerance |
float |
No | 1e-9 |
Acceptable variance |
Validators:
min,max,between,not_between,equals— see Validators.
Example 1: Maximum Bound (Most Common)
columns:
- name: temperature
type: float
nullable: false
description: "Sensor temperature reading"
checks:
- name: "Temperature variance is stable"
type: variance
max: 10.0 # Variance ≤ 10.0
severity: P2
Example 2: Variance Range
columns:
- name: daily_sales
type: decimal
nullable: false
description: "Daily sales amount"
checks:
- name: "Sales variance is within expected range"
type: variance
between: [1000.0, 10000.0] # Expected variance range
severity: P2
Example 3: Exact Variance
columns:
- name: transaction_amount
type: decimal
nullable: false
description: "Transaction amount"
checks:
- name: "Transaction variance matches expectation"
type: variance
equals: 50000.0 # Exactly 50000.0 variance
tolerance: 100.0 # Within 100
severity: P1
stddev¶
The stddev check computes the standard deviation of numeric values in a column, returning a single float metric.
Validators:
min,max,between,not_between,equals— see Validators.
Example 1: Maximum Standard Deviation
columns:
- name: response_time_ms
type: float
nullable: false
description: "API response time in milliseconds"
checks:
- name: "Response time is not too volatile"
type: stddev
max: 500.0
severity: P1
Example 2: Standard Deviation Within Expected Range
columns:
- name: daily_revenue
type: decimal
nullable: false
description: "Daily revenue in USD"
checks:
- name: "Revenue standard deviation is within expected range"
type: stddev
between: [100.0, 50000.0]
severity: P2
percentile¶
The percentile check validates that a specific percentile value in a column falls within specified bounds.
Parameters:
| Parameter | Type | Required | Default | Description |
|---|---|---|---|---|
percentile |
float |
Yes | None | Percentile to compute, in [0.0, 1.0] (e.g. 0.5 for median, 0.99 for 99th percentile) |
min |
number |
No | None | Minimum allowed percentile value |
max |
number |
No | None | Maximum allowed percentile value |
between |
[number, number] |
No | None | Inclusive range (shorthand for min + max) |
equals |
number |
No | None | Exact expected percentile value |
tolerance |
float |
No | 1e-9 |
Acceptable variance |
Validators:
min,max,between,not_between,equals— see Validators. Thepercentileparameter shares the same name as the check type but operates at a different YAML scope —typeselects the check,percentilespecifies the value to compute. They do not conflict.
Use min, max, between, or equals to set bounds. Most use cases require range validation to detect drift in either direction.
Example 1: P95 Response Time (Max Only)
columns:
- name: response_time_ms
type: int
nullable: false
description: "API response time in milliseconds"
checks:
- name: "P95 response time is acceptable"
type: percentile
percentile: 0.95
max: 1000.0 # P95 under 1 second
severity: P1
Example 2: Median Value (Range)
columns:
- name: order_amount
type: decimal
nullable: false
description: "Order amount in USD"
checks:
- name: "Median order value is typical"
type: percentile
percentile: 0.5
between: [30.0, 100.0] # Median between $30-$100
severity: P2
Example 3: Exact Percentile Value
columns:
- name: processing_time_sec
type: float
nullable: false
description: "Processing time in seconds"
checks:
- name: "P99 processing time matches SLA"
type: percentile
percentile: 0.99
equals: 300.0 # P99 exactly 300 seconds
tolerance: 5.0 # Within 5 seconds
severity: P0
Check Composition Patterns¶
Checks can be combined to create comprehensive validation strategies. Here are common patterns:
Pattern 1: Layered Validation (Progressive Severity)¶
columns:
- name: email
type: string
nullable: false
description: "Customer email address"
checks:
# Layer 1: Important - must have valid format
- name: "Email format is valid"
type: pattern
format: email
severity: P0
# Layer 2: Nice-to-have - cardinality check
- name: "Email domains are diverse"
type: cardinality
min: 10 # At least 10 distinct domains
severity: P2
Pattern 2: Range Validation with Context¶
columns:
- name: order_amount
type: decimal
nullable: false
description: "Order amount in USD"
checks:
# Hard constraint: Minimum value must be non-negative
- name: "Minimum order amount is non-negative"
type: min
min: 0.0
severity: P0
# Business rule: Maximum value is reasonable
- name: "Maximum order amount is reasonable"
type: max
max: 1000000.0
severity: P1
# Statistical monitoring: Average is stable
- name: "Average order value is typical"
type: mean
between: [50.0, 500.0]
severity: P2
Pattern 3: Multi-Level Validation¶
columns:
- name: phone_number
type: string
nullable: true
description: "Customer phone number"
checks:
# Null percentage constraint (uses 0-1 scale)
- name: "Most customers provide phone number"
type: missing
return: pct
max: 0.20 # At most 20% null
severity: P1
# Format validation (when present)
- name: "Phone number format is valid"
type: pattern
pattern: "^\\+?[1-9]\\d{1,14}$"
severity: P0
# Length validation
- name: "Phone number length is reasonable"
type: min_length
min: 10
severity: P1
Pattern 4: Table-Level Reconciliation¶
# Table-level checks
checks:
# Volume check
- name: "Daily volume within bounds"
type: num_rows
between: [1000, 100000]
severity: P1
# Freshness check
- name: "Data arrived within SLA"
type: freshness
max_age_hours: 25.0
timestamp_column: order_date
severity: P0
# Completeness check
- name: "No missing daily partitions"
type: completeness
partition_column: order_date
granularity: daily
lookback_days: 7
severity: P1
Pattern 5: Categorical with Cardinality¶
columns:
- name: status
type: string
nullable: false
description: "Order status"
checks:
# Whitelist validation
- name: "Status is valid"
type: whitelist
values: ["pending", "processing", "shipped", "delivered", "cancelled"]
severity: P0
# Low cardinality validation
- name: "Status is low-cardinality"
type: cardinality
max: 10 # At most 10 distinct values
severity: P1