Skip to content

Checks & Validators

Part of the Data Contracts Technical Specification.


Overview

Data contract checks validate the quality of a single dataset. Each check computes a metric — a row count, a null percentage, a column mean — and compares it against a declared acceptance condition. Checks that exceed their acceptance condition produce failures at the declared severity level, giving data teams a structured signal to act on.

Data contracts support two categories of checks:

  1. Table-Level Checks: Validate table-wide properties (row counts, freshness, duplicates, partitions)
  2. Column-Level Checks: Validate individual column values (nullability, uniqueness, ranges, patterns, distributions)

Naming Conventions:

  • Check names should be descriptive business statements (e.g., "Order ID is unique")
  • Severity levels: P0 (critical), P1 (important, default), P2 (nice-to-have), P3 (informational). Omitting severity defaults to P1.
  • All checks support optional tags parameter for categorization
  • Check type names in YAML contracts use snake_case (e.g., num_rows, duplicates, missing).

Check Structure:

Every check shares a common structure:

- name: "Descriptive business name"   # required
  type: <check_type>                   # required
  severity: P0 | P1 | P2 | P3         # optional, defaults to P1
  <validator>: <value>                 # optional — exactly one of: min, max, between, not_between, equals
  tolerance: <value>                   # optional
  # check-specific params (e.g. return: count/pct, columns:, values:, pattern:)

The name is a descriptive business statement. The type identifies which check to run. severity sets the priority level and defaults to P1 when omitted. A single validatormin, max, between, not_between, or equals — defines the acceptance condition; only one may be specified per check. The optional tolerance parameter allows acceptable variance. Check-specific parameters (such as return, columns, values, pattern) are documented in each check's detail section.

Note: Cross-dataset validation (referential integrity, schema consistency, aggregate reconciliation) should be handled at the orchestration/workflow level rather than in individual data contracts. This keeps contracts focused on single-dataset quality while allowing orchestration tools to manage relationships between datasets.


Checks Summary

Table-Level Checks

Validate table-wide properties and aggregates. Specified in the top-level checks array.

Check Type Description Return
num_rows Total row count validation count
duplicates Count of duplicate rows count or pct
freshness Data recency validation age_hours
completeness Partition gap detection gap_count

Total: 4 table-level checks


Column-Level Checks

Validate individual column values and properties. Specified within the checks array of a column definition.

Statistical Checks

Statistical checks compute aggregate metrics over the entire column and return a single numeric value.

Check Type Description
cardinality Count of distinct values
min Minimum value in column
max Maximum value in column
mean Average value validation
sum Sum of values validation
count Count of non-null values
variance Variance validation
stddev Standard deviation validation
percentile Percentile value validation

Value Checks

Value checks validate individual values within a column. Each check returns either an absolute count (count) or percentage (pct) controlled by the return parameter. For missing and duplicates, the return value represents the count/percentage of missing or duplicate values found. For whitelist, blacklist, pattern, min_length, max_length, and avg_length, the return value represents the count/percentage of valid (conforming) rows.

Check Type Description
missing Null value validation
duplicates Duplicate value validation
whitelist Values in allowed set
blacklist Values in disallowed set
pattern Values matching regex pattern
min_length Minimum length of string, list, or map values
max_length Maximum length of string, list, or map values
avg_length Average length of string, list, or map values

Total: 17 column-level checks (9 Statistical + 8 Value Checks)


Validators

Every check returns a single float metric. Validators compare that metric against a declared acceptance condition. At most one validator may be specified per check. The optional tolerance parameter applies to all four and is a float that defaults to 1e-9.

min

Sets a lower bound. The check passes if actual ≥ (min - tolerance).

- name: "At least 1000 rows daily"
  type: num_rows
  min: 1000
  severity: P0

max

Sets an upper bound. The check passes if actual ≤ (max + tolerance).

- name: "Low null percentage"
  type: missing
  return: pct
  max: 0.05
  severity: P1

between

Inclusive range — shorthand for specifying both min and max. The check passes if actual ≥ (min - tolerance) AND actual ≤ (max + tolerance). Cannot be combined with min or max; doing so raises a ContractValidationError.

- name: "Stable row count"
  type: num_rows
  between: [1000, 100000]
  severity: P1

not_between

Inverse range — the check passes if actual < (min - tolerance) OR actual > (max + tolerance). Cannot be combined with min, max, or between; doing so raises a ContractValidationError.

- name: "Order ID is not in reserved test range"
  type: num_rows
  not_between: [9000000, 9999999]
  severity: P1

equals

Exact match. The check passes if |actual - expected| ≤ tolerance.

- name: "Total allocated budget is exactly $1M"
  type: sum
  equals: 1000000.0
  tolerance: 0.01
  severity: P0

Noop (no validator)

Omitting all validators produces a noop: the check runs and records the computed metric but never fails on the value. Use this for informational checks or when you want to observe a metric without enforcing a bound.

- name: "Record average order amount (informational)"
  type: mean
  severity: P3

Table-Level Checks

Table-level checks are specified in the top-level checks array (sibling to columns). They validate table-wide properties and aggregates.


Volume Checks

num_rows

The num_rows check validates that the total row count falls within specified bounds.

Validators: min, max, between, not_between, equals — see Validators.

Example 1: Range Validation

checks:
  - name: "Daily volume within bounds"
    type: num_rows
    between: [100, 1000000]  # Between 100 and 1M rows
    severity: P1

Example 2: Minimum Volume

checks:
  - name: "At least 1000 transactions per day"
    type: num_rows
    min: 1000  # At least 1000 rows
    severity: P0

Example 3: Exact Row Count

checks:
  - name: "Partition has exactly 1000 records"
    type: num_rows
    equals: 1000  # Exactly 1000 rows
    tolerance: 0
    severity: P1

duplicates

The duplicates check validates that the number of duplicate rows (based on specified columns) is within bounds.

Parameters:

Parameter Type Required Default Description
columns list[string] Yes None Columns to check for duplicates (composite key)
return string No count Return type: "count" (absolute) or "pct" (percentage 0-1)

Validators: min, max, between, not_between, equals — see Validators.

Example 1: No Duplicates (Most Common)

checks:
  - name: "No duplicate orders"
    type: duplicates
    columns: ["order_id"]
    max: 0  # No duplicates allowed
    severity: P0

Example 2: Allow Small Number of Duplicates

checks:
  - name: "Few duplicate user-date combinations"
    type: duplicates
    columns: ["user_id", "event_date"]
    max: 10  # At most 10 duplicate rows
    severity: P1

Example 3: Exact Duplicate Count

checks:
  - name: "Expected duplicate rate"
    type: duplicates
    columns: ["transaction_id", "timestamp"]
    equals: 50  # Exactly 50 duplicates expected
    tolerance: 5  # Within 5
    severity: P2

Example 4: Duplicate Percentage

checks:
  - name: "Low duplicate rate"
    type: duplicates
    columns: ["order_id"]
    return: pct
    max: 0.01  # At most 1% duplicate rows
    severity: P1

Freshness Checks

freshness

The freshness check validates that data is not stale — the most recent timestamp falls within an acceptable age.

Returns: age_hours — the age of the most recent (or oldest, depending on aggregation) record in hours. Validated implicitly: the check passes when age_hours <= max_age_hours.

Parameters:

Parameter Type Required Default Description
max_age_hours float Yes None Maximum allowed age in hours
timestamp_column string Yes None Column containing timestamps
aggregation string No "max" Aggregation method ("max" or "min")

Validators: not applicable — uses max_age_hours instead of standard validators.

Example 1: Basic Usage (Auto-generated from SLA)

checks:
  - name: "Data is fresh"
    type: freshness
    max_age_hours: 24
    timestamp_column: created_at
    severity: P0

Example 2: Partition Freshness

checks:
  - name: "Latest partition is within 2 hours"
    type: freshness
    max_age_hours: 2.0
    timestamp_column: event_hour
    severity: P0

Example 3: Oldest Record Within 7 Days

checks:
  - name: "Oldest record within 7 days"
    type: freshness
    max_age_hours: 168.0  # oldest record must be no more than 7 days old
    timestamp_column: ingestion_time
    aggregation: min
    severity: P2

completeness

The completeness check validates that partitioned data has no missing partitions within a time range (gap detection). This check only applies to tables with metadata.partitioned_by defined.

Returns: gap_count — the number of missing partitions found. Validated implicitly: the check passes when gap_count <= max_gap_count.

Note: For column-level null validation, use missing check (with return: count for absolute counts or return: pct for percentages).

Parameters:

Parameter Type Required Default Description
partition_column string Yes None Column used for partitioning (date/timestamp)
granularity string Yes None Expected granularity ("hourly", "daily", "weekly", "monthly")
lookback_days int No 30 Days to check for gaps
allow_future_gaps bool No true Ignore gaps in future dates
max_gap_count int No 0 Maximum allowed missing partitions

Validators: not applicable — uses max_gap_count instead of standard validators.

Example 1: Daily Partitions (No Gaps)

checks:
  - name: "No missing daily partitions in last 7 days"
    type: completeness
    partition_column: order_date
    granularity: daily
    lookback_days: 7
    allow_future_gaps: true
    severity: P1

Example 2: Hourly Partitions

checks:
  - name: "Hourly partitions are complete"
    type: completeness
    partition_column: event_hour
    granularity: hourly
    lookback_days: 1
    severity: P0

Example 3: Allow Some Gaps (Best Effort)

checks:
  - name: "Weekly partitions mostly complete"
    type: completeness
    partition_column: report_week
    granularity: weekly
    lookback_days: 90
    max_gap_count: 2
    severity: P2

Column-Level Checks

Column-level checks are specified within the checks array of a column definition. They validate individual column values and statistical properties.


Value Checks

Value checks validate individual values within a column (rather than computing aggregates over the column). Each check defines its own return semantics. For missing and duplicates, the return value represents missing or duplicate values found. For whitelist, blacklist, pattern, min_length, max_length, and avg_length, the return value represents conforming (valid) rows. Use the return parameter to specify the return type (count or pct). When return: pct is used, the percentage denominator is always the total row count including nulls; the numerator differs by check family: for missing and duplicates the numerator is the count of missing or duplicate values respectively, while for whitelist, blacklist, pattern, min_length, max_length, and avg_length the numerator is the count of conforming (valid) rows.


missing

The missing check validates null values in a column, returning the count or percentage of null values based on the return parameter.

Return Parameter: Use return: count (default) to return absolute null count, or return: pct to return null percentage (0-1 scale).

Parameters:

Parameter Type Required Default Description
return string No count Return type: "count" (absolute) or "pct" (percentage 0-1)

Validators: min, max, between, not_between, equals — see Validators.

Example 1: Maximum Bound (Most Common)

columns:
  - name: notes
    type: string
    nullable: true
    description: "Optional customer notes"
    checks:
      - name: "Most orders have notes"
        type: missing
        max: 1000  # At most 1000 nulls
        severity: P2

Example 2: Range Validation

columns:
  - name: deleted_at
    type:
      kind: timestamp
    nullable: true
    description: "Deletion timestamp (null if not deleted)"
    checks:
      - name: "Few deletions expected"
        type: missing
        between: [0, 50]  # Between 0 and 50 nulls
        severity: P1

Example 3: Exact Null Count

columns:
  - name: optional_metadata
    type: string
    nullable: true
    description: "Optional metadata field"
    checks:
      - name: "Exactly 100 records have metadata"
        type: missing
        equals: 900  # Expecting 900 nulls (100 non-null)
        tolerance: 0
        severity: P2

Example 4: Using Percentage Return

columns:
  - name: email
    type: string
    nullable: true
    description: "Customer email address"
    checks:
      - name: "Low null percentage"
        type: missing
        return: pct
        max: 0.10  # At most 10% nulls
        severity: P1

whitelist

The whitelist check validates that non-null values match a set of allowed values, returning the count or percentage of rows that conform.

Return Parameter: Use return: count (default) to return the count of matching rows, or return: pct to return the percentage (0-1 scale).

Parameters:

Parameter Type Required Default Description
values list Yes None List of allowed values
return string No count Return type: "count" (absolute) or "pct" (percentage 0-1)
case_sensitive bool No true Whether comparison is case-sensitive (strings only)

Validators: min, max, between, not_between, equals — see Validators.

Example 1: Basic Usage

columns:
  - name: status
    type: string
    nullable: false
    description: "Order status"
    checks:
      - name: "Status is valid"
        type: whitelist
        values: ["pending", "processing", "shipped", "delivered", "cancelled"]
        severity: P0

Example 2: Case-Insensitive

columns:
  - name: country_code
    type: string
    nullable: false
    description: "ISO country code"
    checks:
      - name: "Country code is valid"
        type: whitelist
        values: ["US", "CA", "MX", "GB", "DE", "FR"]
        case_sensitive: false
        severity: P1

Example 3: Numeric Enum

columns:
  - name: priority
    type: int
    nullable: false
    description: "Task priority level"
    checks:
      - name: "Priority is valid (1-5)"
        type: whitelist
        values: [1, 2, 3, 4, 5]
        severity: P0

Example 4: Using Percentage Return

columns:
  - name: status
    type: string
    nullable: false
    description: "Order status"
    checks:
      - name: "Most statuses are valid"
        type: whitelist
        values: ["pending", "processing", "shipped", "delivered", "cancelled"]
        return: pct
        min: 0.95  # At least 95% match
        severity: P1

blacklist

The blacklist check validates that non-null values do not match any forbidden value, returning the count or percentage of rows that pass (rows not in the blacklist).

Return Parameter: Use return: count (default) to return the count of passing rows, or return: pct to return the percentage (0-1 scale). Both return values represent conforming rows — use min to set a lower bound (e.g., at least 95% must pass).

Parameters:

Parameter Type Required Default Description
values list Yes None List of forbidden values
return string No count Return type: "count" (absolute) or "pct" (percentage 0-1)
case_sensitive bool No true Whether comparison is case-sensitive (strings only)

Validators: min, max, between, not_between, equals — see Validators.

Example 1: Basic Usage

columns:
  - name: username
    type: string
    nullable: false
    description: "User username"
    checks:
      - name: "Username is not reserved"
        type: blacklist
        values: ["admin", "root", "system", "null", "undefined"]
        case_sensitive: false
        severity: P0

Example 2: Exclude Test Data

columns:
  - name: email
    type: string
    nullable: false
    description: "Customer email address"
    checks:
      - name: "Email is not test account"
        type: blacklist
        values: ["test@example.com", "noreply@example.com", "admin@test.com"]
        severity: P1

Example 3: Percentage Passing

columns:
  - name: username
    type: string
    nullable: false
    description: "User username"
    checks:
      - name: "Almost all usernames are not reserved"
        type: blacklist
        values: ["admin", "root", "system"]
        return: pct
        min: 0.99  # At least 99% must pass
        severity: P1

duplicates

The duplicates check (column-level) validates that the count of duplicate values in a column is within specified bounds. This is the column-level version of the table-level duplicates check, which validates duplicates across multiple columns.

Semantics: Duplicate count is computed as COUNT(*) - COUNT(DISTINCT col) — the total number of non-first occurrences. If value "A" appears 3 times, it contributes 2 to the count. NULL handling: COUNT(*) includes all rows while COUNT(DISTINCT col) excludes NULLs, so NULL values are counted as duplicates. For example, [A, A, NULL] yields a duplicate count of 2 (one extra "A" plus one NULL). This is the intended behavior: NULLs are treated as duplicate occurrences. If you need to exclude NULLs, use COUNT(col) - COUNT(DISTINCT col) in a manual check instead.

Return Parameter: Use return: count (default) to return absolute duplicate count, or return: pct to return duplicate percentage (0-1 scale).

Parameters:

Parameter Type Required Default Description
return string No count Return type: "count" (absolute) or "pct" (percentage 0-1)

Validators: min, max, between, not_between, equals — see Validators.

Note: To enforce uniqueness (no duplicates), use max: 0 or equals: 0.

Example 1: No Duplicates (Uniqueness)

columns:
  - name: order_id
    type: int
    nullable: false
    description: "Unique order identifier"
    checks:
      - name: "Order ID is unique"
        type: duplicates
        max: 0  # No duplicates allowed
        severity: P0

Example 2: Allow Some Duplicates

columns:
  - name: user_id
    type: int
    nullable: false
    description: "User identifier (repeat customers allowed)"
    checks:
      - name: "User ID has reasonable duplicates"
        type: duplicates
        max: 100  # Up to 100 duplicate user IDs
        severity: P1

Example 3: Exact Duplicate Count (Testing)

columns:
  - name: test_category
    type: string
    nullable: false
    description: "Test category with expected duplicates"
    checks:
      - name: "Category has expected duplicates"
        type: duplicates
        equals: 50  # Expect exactly 50 duplicates
        tolerance: 0
        severity: P2

Example 4: Duplicate Percentage

columns:
  - name: order_id
    type: int
    nullable: false
    description: "Order identifier"
    checks:
      - name: "Low duplicate rate"
        type: duplicates
        return: pct
        max: 0.01  # At most 1% duplicate values
        severity: P1

String and Collection Checks

String and collection checks are part of Value Checks. See the Value Checks section above for context.

pattern

The pattern check validates that string values match a pattern, supporting either explicit regex or predefined format shortcuts, and returns the count or percentage of conforming values.

Pattern Specification: Use either:

  • pattern parameter with a regex pattern (e.g., "^[A-Z]{2}\\d{6}$")
  • format parameter with a predefined format name (e.g., "email", "phone", "uuid")

Return Parameter: Use return: count (default) to return count of matching values, or return: pct to return percentage (0-1 scale) of matching values.

Parameters:

Parameter Type Required Default Description
pattern string No* None Regular expression pattern to match
format string No* None Predefined format name ("email", "phone", "uuid", "url", "ipv4", "ipv6", "date", "datetime")
flags list[string] No [] Regex flags (e.g., "IGNORECASE", "MULTILINE") - only for pattern
return string No count Return type: "count" (absolute) or "pct" (percentage 0-1)

* Exactly ONE of pattern or format must be specified. Specifying flags together with format is invalid and raises a ContractValidationError.

Validators: min, max, between, not_between, equals — see Validators.

Example 1: Email Validation

columns:
  - name: email
    type: string
    nullable: false
    description: "Customer email address"
    checks:
      - name: "Email format is valid"
        type: pattern
        pattern: "^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\\.[a-zA-Z]{2,}$"
        severity: P0

Example 2: Phone Number Format

columns:
  - name: phone
    type: string
    nullable: true
    description: "Phone number (US format)"
    checks:
      - name: "Phone number matches US format"
        type: pattern
        pattern: "^\\+1-\\d{3}-\\d{3}-\\d{4}$"
        severity: P1

Example 3: Product SKU Pattern

columns:
  - name: sku
    type: string
    nullable: false
    description: "Product SKU code"
    checks:
      - name: "SKU follows naming convention"
        type: pattern
        pattern: "^[A-Z]{2}\\d{6}[A-Z]$"
        severity: P0

Example 4: Percentage Return

columns:
  - name: email
    type: string
    nullable: false
    description: "Customer email address"
    checks:
      - name: "95% emails are valid"
        type: pattern
        pattern: "^[^@]+@[^@]+\\.[^@]+$"
        return: pct
        min: 0.95  # At least 95%
        severity: P1

Example 5: Using Predefined Format (Email)

columns:
  - name: email
    type: string
    nullable: false
    description: "Customer email address"
    checks:
      - name: "All emails are valid"
        type: pattern
        format: email  # Predefined format instead of regex
        return: pct
        min: 0.95  # At least 95% valid
        severity: P1

Example 6: Using Predefined Format (Phone)

columns:
  - name: phone
    type: string
    nullable: true
    description: "Phone number"
    checks:
      - name: "Phone numbers are valid"
        type: pattern
        format: phone  # Matches common phone formats
        return: count
        min: 1000  # At least 1000 valid phone numbers
        severity: P2

Example 7: Using Predefined Format (UUID)

columns:
  - name: transaction_id
    type: string
    nullable: false
    description: "Transaction UUID"
    checks:
      - name: "Transaction IDs are valid UUIDs"
        type: pattern
        format: uuid  # Standard UUID format
        return: pct
        equals: 1.0  # 100% must be valid UUIDs
        severity: P0

Example 8: Using Predefined Format (URL)

columns:
  - name: website
    type: string
    nullable: true
    description: "Website URL"
    checks:
      - name: "Websites are valid URLs"
        type: pattern
        format: url  # HTTP/HTTPS URLs
        return: pct
        min: 0.90  # At least 90% valid
        severity: P1

Supported Format Shortcuts:

Format Description Equivalent Regex Pattern
email Email address ^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\\.[a-zA-Z]{2,}$
phone Phone number (international) ^\\+?[1-9]\\d{1,14}$
uuid UUID (any version) ^[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{12}$
url HTTP/HTTPS URL ^https?://[^\\s/$.?#].[^\\s]*$
ipv4 IPv4 address ^(?:[0-9]{1,3}\\.){3}[0-9]{1,3}$
ipv6 IPv6 address ^(?:[0-9a-fA-F]{1,4}:){7}[0-9a-fA-F]{1,4}$
date ISO 8601 date (YYYY-MM-DD) ^\\d{4}-\\d{2}-\\d{2}$
datetime ISO 8601 datetime ^\\d{4}-\\d{2}-\\d{2}T\\d{2}:\\d{2}:\\d{2}

Note: Format shortcuts provide common patterns for convenience. For custom formats or more specific validation, use the pattern parameter with an explicit regex.


min_length

The min_length check computes the minimum length across all non-null values in a string, list, or map column. For strings, length is the character count. For lists and maps, length is the element count.

Parameters:

Parameter Type Required Default Description
return string No count Return type: count (rows below minimum) or pct

Validators: min, max, between, not_between, equals — see Validators.

Example 1: String Column — Minimum Character Count

columns:
  - name: username
    type: string
    nullable: false
    description: "User username"
    checks:
      - name: "Username is at least 3 characters"
        type: min_length
        min: 3
        severity: P1

Example 2: List Column — Minimum Element Count

columns:
  - name: tags
    type:
      kind: list
      value_type: string
    nullable: true
    description: "Product tags"
    checks:
      - name: "Tags list has at least one element"
        type: min_length
        min: 1
        severity: P2

max_length

The max_length check computes the maximum length across all non-null values in a string, list, or map column. For strings, length is the character count. For lists and maps, length is the element count.

Parameters:

Parameter Type Required Default Description
return string No count Return type: count (rows exceeding maximum) or pct

Validators: min, max, between, not_between, equals — see Validators.

Example 1: String Column — Maximum Character Count

columns:
  - name: description
    type: string
    nullable: true
    description: "Product description"
    checks:
      - name: "Description does not exceed 500 characters"
        type: max_length
        max: 500
        severity: P2

Example 2: Map Column — Maximum Element Count

columns:
  - name: properties
    type:
      kind: map
      key_type: string
      value_type: string
    nullable: true
    description: "Custom properties"
    checks:
      - name: "Properties map has at most 50 entries"
        type: max_length
        max: 50
        severity: P2

avg_length

The avg_length check computes the average length of non-null values in a string, list, or map column, returning a single float metric. For strings, length is the character count. For lists and maps, length is the element count.

Parameters:

Parameter Type Required Default Description
return string No count Return type: count (rows below/above threshold) or pct

Validators: min, max, between, not_between, equals — see Validators.

Example 1: Average String Length Within Range

columns:
  - name: product_name
    type: string
    nullable: false
    description: "Product display name"
    checks:
      - name: "Product names have reasonable average length"
        type: avg_length
        between: [5, 100]
        severity: P2

Example 2: Average List Length

columns:
  - name: tags
    type:
      kind: list
      value_type: string
    nullable: true
    description: "Product tags"
    checks:
      - name: "Tags list has reasonable average size"
        type: avg_length
        between: [1, 10]
        severity: P3

Example 3: Using Percentage Return

columns:
  - name: description
    type: string
    nullable: true
    description: "Product description"
    checks:
      - name: "Most descriptions have reasonable average length"
        type: avg_length
        return: pct
        min: 0.90  # At least 90% are within acceptable range
        severity: P2

Statistical Checks

Statistical checks compute aggregate metrics over the entire column and return a single numeric value.

cardinality

The cardinality check counts distinct non-null values in a column and validates that the count falls within specified bounds.

Parameters:

Parameter Type Required Default Description
min number No None Minimum allowed distinct value count
max number No None Maximum allowed distinct value count
between [number, number] No None Inclusive range (shorthand for min + max)
equals number No None Exact expected distinct value count
tolerance float No 1e-9 Acceptable variance

Validators: min, max, between, not_between, equals — see Validators.

Example 1: Low Cardinality (Categorical)

columns:
  - name: status
    type: string
    nullable: false
    description: "Order status"
    checks:
      - name: "Status is low-cardinality categorical"
        type: cardinality
        max: 10  # At most 10 distinct values
        severity: P1

Example 2: Exact Cardinality

columns:
  - name: priority
    type: int
    nullable: false
    description: "Priority level"
    checks:
      - name: "Priority has exactly 5 levels"
        type: cardinality
        equals: 5  # Exactly 5 distinct values
        severity: P0

Example 3: Cardinality Range

columns:
  - name: customer_segment
    type: string
    nullable: false
    description: "Customer segmentation tier"
    checks:
      - name: "Segment cardinality is stable"
        type: cardinality
        between: [3, 8]  # Between 3 and 8 segments
        severity: P1

min

The min check validates that the minimum value in a column meets specified criteria — confirming that no value falls below an acceptable floor.

Parameters:

Parameter Type Required Default Description
min number No None Minimum allowed value for the column minimum
max number No None Maximum allowed value for the column minimum
between [number, number] No None Inclusive range (shorthand for min + max)
equals number No None Exact expected value for the column minimum
tolerance float No 1e-9 Acceptable variance

Validators: min, max, between, not_between, equals — see Validators.

Example 1: Minimum Must Be Non-Negative

columns:
  - name: price
    type: decimal
    nullable: false
    description: "Product price in USD"
    checks:
      - name: "All prices are non-negative"
        type: min
        min: 0.0
        severity: P0

Example 2: Minimum Value in Range

columns:
  - name: temperature
    type: float
    nullable: false
    description: "Temperature in Celsius"
    checks:
      - name: "Minimum temperature is reasonable"
        type: min
        between: [-50.0, 0.0]
        severity: P1

max

The max check validates that the maximum value in a column meets specified criteria — confirming that no value exceeds an acceptable ceiling.

Parameters:

Parameter Type Required Default Description
min number No None Minimum allowed value for the column maximum
max number No None Maximum allowed value for the column maximum
between [number, number] No None Inclusive range (shorthand for min + max)
equals number No None Exact expected value for the column maximum
tolerance float No 1e-9 Acceptable variance

Validators: min, max, between, not_between, equals — see Validators.

Example 1: Maximum Must Be Reasonable

columns:
  - name: age
    type: int
    nullable: false
    description: "Customer age in years"
    checks:
      - name: "Maximum age is reasonable"
        type: max
        max: 120
        severity: P1

Example 2: Maximum Value Equals Expected

columns:
  - name: priority
    type: int
    nullable: false
    description: "Task priority (1-5)"
    checks:
      - name: "Maximum priority is 5"
        type: max
        equals: 5
        severity: P0

mean

The mean check validates that the arithmetic mean of numeric values in a column falls within specified bounds.

Parameters:

Parameter Type Required Default Description
min number No None Minimum allowed mean value
max number No None Maximum allowed mean value
between [number, number] No None Inclusive range (shorthand for min + max)
equals number No None Exact expected mean value
tolerance float No 1e-9 Acceptable variance

Validators: min, max, between, not_between, equals — see Validators.

Use min, max, between, or equals to set bounds. Most use cases require range validation to detect drift in either direction.

Example 1: Range Validation

columns:
  - name: order_amount
    type: decimal
    nullable: false
    description: "Order amount in USD"
    checks:
      - name: "Average order value is typical"
        type: mean
        between: [50.0, 150.0]  # Between $50 and $150 average
        severity: P2

Example 2: Monitoring Drift

columns:
  - name: response_time_ms
    type: int
    nullable: false
    description: "API response time in milliseconds"
    checks:
      - name: "Response time is acceptable"
        type: mean
        max: 500.0  # Average response under 500ms
        severity: P1

Example 3: Exact Mean

columns:
  - name: rating
    type: int
    nullable: false
    description: "Product rating (1-5)"
    checks:
      - name: "Average rating matches target"
        type: mean
        equals: 3.5  # Exactly 3.5 average rating
        tolerance: 0.1  # Within 0.1
        severity: P2

sum

The sum check validates that the total of all non-null values in a column meets specified criteria.

Parameters:

Parameter Type Required Default Description
min number No None Minimum allowed sum
max number No None Maximum allowed sum
between [number, number] No None Inclusive range (shorthand for min + max)
equals number No None Exact expected sum
tolerance float No 1e-9 Acceptable variance

Validators: min, max, between, not_between, equals — see Validators.

Example 1: Exact Sum

columns:
  - name: allocated_budget
    type: decimal
    nullable: false
    description: "Budget allocated to departments"
    checks:
      - name: "Total allocated budget is exactly $1M"
        type: sum
        equals: 1000000.0
        tolerance: 0.01
        severity: P0

Example 2: Sum Range

columns:
  - name: daily_sales
    type: decimal
    nullable: false
    description: "Daily sales amount"
    checks:
      - name: "Daily sales within expected range"
        type: sum
        between: [10000.0, 100000.0]
        severity: P1

Example 3: Minimum Sum

columns:
  - name: quantity
    type: int
    nullable: false
    description: "Order quantity"
    checks:
      - name: "Total quantity is positive"
        type: sum
        min: 0
        tolerance: 0
        severity: P0

count

The count check validates that the number of non-null values in a column falls within specified bounds.

Parameters:

Parameter Type Required Default Description
min number No None Minimum allowed non-null count
max number No None Maximum allowed non-null count
between [number, number] No None Inclusive range (shorthand for min + max)
equals number No None Exact expected non-null count
tolerance float No 1e-9 Acceptable variance

Validators: min, max, between, not_between, equals — see Validators.

Example 1: Exact Count

columns:
  - name: employee_id
    type: int
    nullable: false
    description: "Employee identifier"
    checks:
      - name: "Exactly 500 active employees"
        type: count
        equals: 500
        tolerance: 0
        severity: P1

Example 2: Minimum Count

columns:
  - name: transaction_id
    type: int
    nullable: false
    description: "Transaction identifier"
    checks:
      - name: "At least 1000 transactions daily"
        type: count
        min: 1000
        severity: P2

Example 3: Count Range

columns:
  - name: order_id
    type: int
    nullable: false
    description: "Order identifier"
    checks:
      - name: "Daily order count is stable"
        type: count
        between: [100, 1000]
        severity: P1

variance

The variance check validates that the statistical variance of numeric values in a column falls within specified bounds, providing a measure of data spread.

Parameters:

Parameter Type Required Default Description
min number No None Minimum allowed variance
max number No None Maximum allowed variance
between [number, number] No None Inclusive range (shorthand for min + max)
equals number No None Exact expected variance
tolerance float No 1e-9 Acceptable variance

Validators: min, max, between, not_between, equals — see Validators.

Example 1: Maximum Bound (Most Common)

columns:
  - name: temperature
    type: float
    nullable: false
    description: "Sensor temperature reading"
    checks:
      - name: "Temperature variance is stable"
        type: variance
        max: 10.0  # Variance ≤ 10.0
        severity: P2

Example 2: Variance Range

columns:
  - name: daily_sales
    type: decimal
    nullable: false
    description: "Daily sales amount"
    checks:
      - name: "Sales variance is within expected range"
        type: variance
        between: [1000.0, 10000.0]  # Expected variance range
        severity: P2

Example 3: Exact Variance

columns:
  - name: transaction_amount
    type: decimal
    nullable: false
    description: "Transaction amount"
    checks:
      - name: "Transaction variance matches expectation"
        type: variance
        equals: 50000.0  # Exactly 50000.0 variance
        tolerance: 100.0  # Within 100
        severity: P1

stddev

The stddev check computes the standard deviation of numeric values in a column, returning a single float metric.

Validators: min, max, between, not_between, equals — see Validators.

Example 1: Maximum Standard Deviation

columns:
  - name: response_time_ms
    type: float
    nullable: false
    description: "API response time in milliseconds"
    checks:
      - name: "Response time is not too volatile"
        type: stddev
        max: 500.0
        severity: P1

Example 2: Standard Deviation Within Expected Range

columns:
  - name: daily_revenue
    type: decimal
    nullable: false
    description: "Daily revenue in USD"
    checks:
      - name: "Revenue standard deviation is within expected range"
        type: stddev
        between: [100.0, 50000.0]
        severity: P2

percentile

The percentile check validates that a specific percentile value in a column falls within specified bounds.

Parameters:

Parameter Type Required Default Description
percentile float Yes None Percentile to compute, in [0.0, 1.0] (e.g. 0.5 for median, 0.99 for 99th percentile)
min number No None Minimum allowed percentile value
max number No None Maximum allowed percentile value
between [number, number] No None Inclusive range (shorthand for min + max)
equals number No None Exact expected percentile value
tolerance float No 1e-9 Acceptable variance

Validators: min, max, between, not_between, equals — see Validators. The percentile parameter shares the same name as the check type but operates at a different YAML scope — type selects the check, percentile specifies the value to compute. They do not conflict.

Use min, max, between, or equals to set bounds. Most use cases require range validation to detect drift in either direction.

Example 1: P95 Response Time (Max Only)

columns:
  - name: response_time_ms
    type: int
    nullable: false
    description: "API response time in milliseconds"
    checks:
      - name: "P95 response time is acceptable"
        type: percentile
        percentile: 0.95
        max: 1000.0  # P95 under 1 second
        severity: P1

Example 2: Median Value (Range)

columns:
  - name: order_amount
    type: decimal
    nullable: false
    description: "Order amount in USD"
    checks:
      - name: "Median order value is typical"
        type: percentile
        percentile: 0.5
        between: [30.0, 100.0]  # Median between $30-$100
        severity: P2

Example 3: Exact Percentile Value

columns:
  - name: processing_time_sec
    type: float
    nullable: false
    description: "Processing time in seconds"
    checks:
      - name: "P99 processing time matches SLA"
        type: percentile
        percentile: 0.99
        equals: 300.0  # P99 exactly 300 seconds
        tolerance: 5.0  # Within 5 seconds
        severity: P0

Check Composition Patterns

Checks can be combined to create comprehensive validation strategies. Here are common patterns:

Pattern 1: Layered Validation (Progressive Severity)

columns:
  - name: email
    type: string
    nullable: false
    description: "Customer email address"
    checks:
      # Layer 1: Important - must have valid format
      - name: "Email format is valid"
        type: pattern
        format: email
        severity: P0

      # Layer 2: Nice-to-have - cardinality check
      - name: "Email domains are diverse"
        type: cardinality
        min: 10  # At least 10 distinct domains
        severity: P2

Pattern 2: Range Validation with Context

columns:
  - name: order_amount
    type: decimal
    nullable: false
    description: "Order amount in USD"
    checks:
      # Hard constraint: Minimum value must be non-negative
      - name: "Minimum order amount is non-negative"
        type: min
        min: 0.0
        severity: P0

      # Business rule: Maximum value is reasonable
      - name: "Maximum order amount is reasonable"
        type: max
        max: 1000000.0
        severity: P1

      # Statistical monitoring: Average is stable
      - name: "Average order value is typical"
        type: mean
        between: [50.0, 500.0]
        severity: P2

Pattern 3: Multi-Level Validation

columns:
  - name: phone_number
    type: string
    nullable: true
    description: "Customer phone number"
    checks:
      # Null percentage constraint (uses 0-1 scale)
      - name: "Most customers provide phone number"
        type: missing
        return: pct
        max: 0.20  # At most 20% null
        severity: P1

      # Format validation (when present)
      - name: "Phone number format is valid"
        type: pattern
        pattern: "^\\+?[1-9]\\d{1,14}$"
        severity: P0

      # Length validation
      - name: "Phone number length is reasonable"
        type: min_length
        min: 10
        severity: P1

Pattern 4: Table-Level Reconciliation

# Table-level checks
checks:
  # Volume check
  - name: "Daily volume within bounds"
    type: num_rows
    between: [1000, 100000]
    severity: P1

  # Freshness check
  - name: "Data arrived within SLA"
    type: freshness
    max_age_hours: 25.0
    timestamp_column: order_date
    severity: P0

  # Completeness check
  - name: "No missing daily partitions"
    type: completeness
    partition_column: order_date
    granularity: daily
    lookback_days: 7
    severity: P1

Pattern 5: Categorical with Cardinality

columns:
  - name: status
    type: string
    nullable: false
    description: "Order status"
    checks:
      # Whitelist validation
      - name: "Status is valid"
        type: whitelist
        values: ["pending", "processing", "shipped", "delivered", "cancelled"]
        severity: P0

      # Low cardinality validation
      - name: "Status is low-cardinality"
        type: cardinality
        max: 10  # At most 10 distinct values
        severity: P1