Data Contracts Technical Specification
| Field | Value |
|---|---|
| Author | DQX |
| Created | 2025-02-15 |
| Last Updated | 2026-03-07 |
| Version | 2.3 |
| Status | Ready for review |
| Tags | data-contracts, data-quality, sla |
| Related | None |
Overview
A data contract is a versioned YAML document that defines the schema, quality checks, and freshness guarantees for a dataset. Each contract names its dataset, declares the type and nullability of every column, specifies optional SLA schedules, and attaches quality checks directly to the columns or dataset they govern.
Teams need contracts because data requirements must be explicit, version-controlled, testable, and portable. Without contracts, quality rules live in ad-hoc Python scripts that differ across environments, drift from the actual schema, and cannot be reviewed as data specifications. A contract replaces that scattered Python code with a single declarative YAML file that any engineer can read, diff, and own.
Contracts generate lists of DecoratedCheck functions. The user combines contract-generated checks with hand-coded checks in a single suite, so contract-based and custom validations run together and produce AssertionResult objects.
Architecture
Core Design Principle
Contracts are column-centric YAML specifications that generate checks composable with hand-coded checks inside a standard VerificationSuite.
Contract YAML (schema + checks)
↓ Contract.from_yaml()
Contract instance (with resolved schema)
↓ contract.to_checks()
list[DecoratedCheck]
↓ VerificationSuite(checks=contract.to_checks() + [...], ...)
VerificationSuite
↓ suite.run([datasource], result_key)
None
↓ suite.collect_results()
list[AssertionResult]
Contract.from_yaml() parses the YAML and builds a Contract instance with a fully resolved schema. contract.to_checks() translates every column definition and check into a list of DecoratedCheck functions. The user merges contract-generated checks with any hand-coded checks, as in VerificationSuite(checks=contract.to_checks() + [custom_check], db=db, name=...), and calls suite.run([datasource], result_key) to execute all checks. suite.run() returns None; results are collected separately via suite.collect_results(), which returns list[AssertionResult].
Two exception types separate contract problems from data problems. ContractValidationError is raised when the YAML cannot be parsed into a valid Contract (e.g., missing required fields, an invalid cron expression, or an unknown check type). SchemaValidationError is raised when a column's actual storage type does not match the declared contract type.
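The flow above can be sketched with stand-in classes. Everything in this block (ToyResult, ToyContract, ToySuite, and the row-based run() signature) is hypothetical and only mirrors the shape of the calls named in this section; it is not DQX's implementation, and the real suite takes a datasource list and a result key rather than raw rows.

```python
from dataclasses import dataclass, field
from typing import Callable

# Stand-ins only: these toy classes mimic the call shapes described above.
# They are NOT the DQX classes and skip schema resolution entirely.

Row = dict


@dataclass
class ToyResult:
    check_name: str
    passed: bool


Check = Callable[[list[Row]], ToyResult]


@dataclass
class ToyContract:
    dataset: str
    min_rows: int

    def to_checks(self) -> list[Check]:
        # One generated check: the row count must reach min_rows.
        def row_count_check(rows: list[Row]) -> ToyResult:
            return ToyResult("Row count check", len(rows) >= self.min_rows)

        return [row_count_check]


@dataclass
class ToySuite:
    checks: list[Check]
    name: str
    _results: list[ToyResult] = field(default_factory=list)

    def run(self, rows: list[Row]) -> None:
        # As in the flow above, run() returns nothing; results are stored.
        self._results = [check(rows) for check in self.checks]

    def collect_results(self) -> list[ToyResult]:
        return list(self._results)


def no_negative_amounts(rows: list[Row]) -> ToyResult:
    # A hand-coded check that runs alongside the generated one.
    return ToyResult("No negative amounts", all(r["amount"] >= 0 for r in rows))


contract = ToyContract(dataset="orders", min_rows=2)
suite = ToySuite(checks=contract.to_checks() + [no_negative_amounts], name="orders suite")
run_value = suite.run([{"amount": 10.0}, {"amount": -1.0}])
results = suite.collect_results()
```

The point of the sketch is the list concatenation: generated and hand-written checks share one type, so the suite does not need to know where a check came from.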
Contract Structure
The complete contract below shows every top-level section. The prose paragraphs that follow explain each section.
```yaml
# Metadata
name: "Contract Name"
version: "1.0.0"
description: "What this data represents"
owner: "team-name"
dataset: "table_name"
tags: ["tag1", "tag2"]

# Optional SLA (2 fields)
sla:
  schedule: "0 0 * * *"  # Cron expression
  lag_hours: 24          # Availability lag

# Optional partitioning (timestamp_column inferred for freshness checks)
metadata:
  partitioned_by: ["event_date"]

# Optional table-level checks
checks:
  - name: "Row count check"
    type: num_rows
    min: 100
    severity: P1

# Schema with unified type field
columns:
  - name: column_name
    type: int  # Simple type (string)
    nullable: false
    description: "Required description"

  - name: complex_column
    type:  # Complex type (object)
      kind: list
      value_type: string
    nullable: true
    description: "Required description"
    checks:  # Optional checks
      - name: "Check name"
        type: duplicates
        max: 0
        severity: P0
```
Metadata. Every contract begins with five required metadata fields that identify the dataset and its owner: name (a human-readable label), version (a version string; semantic versioning is recommended but not enforced — e.g., "1.0.0" or "2025-03-07"), description (a plain-English statement of what the data represents), owner (the responsible team), and dataset (the table or view name used at query time). An optional tags field accepts a list of strings for filtering and discovery.
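A minimal sketch of the required-field rule, assuming the contract has already been parsed from YAML into a plain dict. The helper name missing_metadata and the decision to treat blank strings as missing are assumptions made for illustration; DQX's own validation may differ and reports problems through ContractValidationError.

```python
# The five required metadata keys listed above; tags is optional.
REQUIRED_METADATA = ("name", "version", "description", "owner", "dataset")


def missing_metadata(contract: dict) -> list[str]:
    """Return the required metadata keys that are absent or blank.

    Hypothetical helper: treating blank strings as missing is an assumption.
    """
    missing = []
    for key in REQUIRED_METADATA:
        value = contract.get(key)
        if not isinstance(value, str) or not value.strip():
            missing.append(key)
    return missing


parsed = {
    "name": "Orders Contract",
    "version": "1.0.0",
    "description": "Daily order records",
    "dataset": "orders",
    "tags": ["revenue"],
}
problems = missing_metadata(parsed)
```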
SLA. The optional sla block defines when data should arrive. It takes two fields: schedule, a standard 5-field cron expression that declares the expected delivery cadence, and lag_hours, the number of hours (fractional values such as 1.5 are allowed) that data may lag behind the scheduled time before the check fails. When both fields are present, DQX auto-generates a freshness check; no explicit check entry is needed, although a non-partitioned table must also set metadata.timestamp_column so the check knows which column to read. See SLA Specification for cron format reference and examples.
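The arithmetic behind lag_hours can be sketched as follows. The function names are hypothetical, the cron check only counts fields rather than parsing them, and the breach rule (data missing or arriving after the deadline) is one plausible reading of the text above, not a statement of how DQX evaluates its generated freshness check.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


def has_five_cron_fields(schedule: str) -> bool:
    # Shape check only: a standard cron expression has five whitespace-separated fields.
    return len(schedule.split()) == 5


def sla_deadline(scheduled_at: datetime, lag_hours: float) -> datetime:
    # Data for a scheduled run may arrive up to lag_hours after the run time.
    return scheduled_at + timedelta(hours=lag_hours)


def sla_breached(
    scheduled_at: datetime,
    lag_hours: float,
    arrived_at: Optional[datetime],
    now: datetime,
) -> bool:
    # Assumption: a run is breached if its data is still missing after the
    # deadline, or if the data arrived later than the deadline.
    deadline = sla_deadline(scheduled_at, lag_hours)
    if arrived_at is None:
        return now > deadline
    return arrived_at > deadline


midnight = datetime(2026, 3, 7, 0, 0, tzinfo=timezone.utc)
daily_deadline = sla_deadline(midnight, 24)      # schedule "0 0 * * *", lag_hours 24
tight_deadline = sla_deadline(midnight, 1.5)     # fractional lag_hours
```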
Partitioning. The optional metadata block declares the partitioning columns for the dataset. DQX reads partitioned_by to infer which column carries the timestamp used in freshness and completeness checks: when an sla block is present and partitioned_by is set, DQX uses the first listed column as the timestamp column automatically. A non-partitioned table with an SLA has nothing to infer from and must name the column explicitly with timestamp_column.
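The inference rule can be sketched as a small lookup over the parsed metadata block. The function name is hypothetical, the precedence when both keys are present is an assumption (this sketch follows the partitioned_by rule first), and the sketch raises a plain ValueError where DQX would presumably report a contract error.

```python
from typing import Optional


def resolve_timestamp_column(metadata: Optional[dict]) -> str:
    """Pick the timestamp column for freshness and completeness checks."""
    metadata = metadata or {}
    partitioned_by = metadata.get("partitioned_by") or []
    if partitioned_by:
        # Partitioned table: the first listed partition column is used.
        return partitioned_by[0]
    explicit = metadata.get("timestamp_column")
    if explicit:
        # Non-partitioned table: the column must be named explicitly.
        return explicit
    raise ValueError("no partitioned_by or timestamp_column to read timestamps from")


inferred = resolve_timestamp_column({"partitioned_by": ["event_date", "region"]})
explicit = resolve_timestamp_column({"timestamp_column": "updated_at"})
```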
Table-level checks. The top-level checks section validates properties of the dataset as a whole. num_rows asserts that the row count falls within a specified range. duplicates asserts that the number of duplicate rows stays below a threshold. freshness asserts that data is not stale by comparing record age against max_age_hours; it measures the most recent record by default, and aggregation: min switches it to the oldest. completeness asserts that partition gaps (missing dates or time windows) stay below a specified count. num_rows and duplicates accept the standard validators (min, max, between, not_between, equals) plus the auxiliary tolerance parameter. freshness and completeness take their implicit parameters, max_age_hours and max_gap_count respectively, in place of the standard validators.
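The two checks with implicit parameters can be sketched over in-memory values. All function names are hypothetical, "max" as the name of the default aggregation is an assumption, a gap is counted here as one missing calendar date, and both comparisons are treated as inclusive; none of these details are confirmed by this specification.

```python
from datetime import date, datetime, timedelta, timezone


def freshness_passes(timestamps, now, max_age_hours, aggregation="max"):
    # Default: age of the most recent record; aggregation="min" uses the oldest.
    reference = max(timestamps) if aggregation == "max" else min(timestamps)
    age_hours = (now - reference).total_seconds() / 3600
    return age_hours <= max_age_hours


def missing_dates(present, start, end):
    # Every calendar date in [start, end] that has no partition.
    total_days = (end - start).days + 1
    expected = (start + timedelta(days=i) for i in range(total_days))
    return [day for day in expected if day not in present]


def completeness_passes(present, start, end, max_gap_count):
    return len(missing_dates(present, start, end)) <= max_gap_count


now = datetime(2026, 3, 7, 12, 0, tzinfo=timezone.utc)
record_times = [
    datetime(2026, 3, 5, 12, 0, tzinfo=timezone.utc),
    datetime(2026, 3, 7, 6, 0, tzinfo=timezone.utc),
]
partitions = {date(2026, 3, 1), date(2026, 3, 2), date(2026, 3, 4), date(2026, 3, 5)}
gaps = missing_dates(partitions, date(2026, 3, 1), date(2026, 3, 5))
```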
Columns. The columns section is the heart of the contract. Each entry co-locates four pieces of information that belong together: the column's type (one of 12 contract types designed for simplicity and broad storage compatibility — int accepts any integer width, float accepts 32-bit and 64-bit), its nullable flag (defaults to true when omitted), its required description, and an optional checks list. Co-locating schema and checks in a single entry makes the contract self-documenting: a reader sees the column's semantics and its quality requirements in one place. See Type System for the full compatibility matrix.
Complete Schema Structure
The annotated schema below shows every field a contract file accepts, with types and defaults.
```yaml
# Required: Contract metadata
name: string          # Contract name (1-255 characters)
version: string       # Version string (e.g., "1.0.0"); semantic versioning recommended but not enforced
description: string   # Contract/table description
owner: string         # Team or individual owner
dataset: string       # Dataset name to validate (must match datasource.name)
tags: [string, ...]   # Optional tags (e.g., ["revenue", "core"])

# Optional: Structured SLA (see SLA Specification section)
sla:
  schedule: string    # Cron expression for data arrival schedule
  lag_hours: number   # Hours after schedule until data available (fractional values allowed, e.g. 1.5)

# Optional: Table-level metadata (flat at top level)
metadata:
  partitioned_by: [string, ...]  # Column names used for partitioning
  timestamp_column: string       # Required for non-partitioned SLA tables
  # ... custom metadata key-value pairs

# Optional: Table-level checks
checks:
  - name: string                   # Check name (required)
    type: string                   # Check type (e.g., "num_rows", "freshness")
    severity: "P0"|"P1"|"P2"|"P3"  # Required
    # Type-specific parameters...

# Required: Unified columns (schema + checks together)
columns:
  - name: string           # Required: Column name
    type: string | object  # Required: Simple type (string) or complex type (object)
    nullable: true|false   # Optional: Defaults to true if not specified
    description: string    # Required: Column description

    # Optional: Field-level metadata
    metadata:
      # ... custom metadata key-value pairs

    # Optional: Column checks (can be omitted for schema-only columns)
    checks:
      - name: string                   # Check name (required)
        type: string                   # Check type (e.g., "duplicates", "min")
        severity: "P0"|"P1"|"P2"|"P3"  # Required
        # Type-specific parameters...
```
Omitting the checks key from a column produces a schema-only column: DQX validates its type and nullability but runs no quality assertions against it. Checks attach only to top-level columns, not to nested struct fields.
Co-location Principle
Schema definitions and quality checks live together inside each column entry by design. Proximity keeps related information together, so a reader sees a column's type, nullability, and constraints in one place without jumping between sections. It also eliminates a common class of authoring error: a check that references a column not present in the schema cannot be written, because the check must nest inside a column that already declares its type.
Type Field Format
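A type is written either as a plain string or as an object with a kind field. Complex types always use the object form, and a simple type such as timestamp uses it when it carries parameters like tz: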
```yaml
# Simple type (string)
- name: order_id
  type: int
  nullable: false
  description: "Order ID"

# Object form (kind plus parameters)
- name: created_at
  type:
    kind: timestamp
    tz: "UTC"
  nullable: false
  description: "Creation timestamp"
```
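Both spellings of the type field can be reduced to one shape before any further processing. The sketch below assumes the column has already been parsed into Python values; normalize_type is a hypothetical helper, and the set of kinds is taken from the twelve contract types listed in the Type System Summary.

```python
# The twelve contract types named in this specification.
CONTRACT_KINDS = {
    "int", "float", "bool", "string", "bytes",
    "date", "timestamp", "time",
    "decimal",
    "list", "struct", "map",
}


def normalize_type(type_field) -> dict:
    """Return the type as a dict with a 'kind' key, whichever form was used."""
    if isinstance(type_field, str):
        spec = {"kind": type_field}          # string form: type: int
    elif isinstance(type_field, dict) and "kind" in type_field:
        spec = dict(type_field)              # object form: type: {kind: ..., ...}
    else:
        raise ValueError(f"type must be a string or an object with 'kind': {type_field!r}")
    if spec["kind"] not in CONTRACT_KINDS:
        raise ValueError(f"unknown contract type: {spec['kind']!r}")
    return spec


order_id_type = normalize_type("int")
created_at_type = normalize_type({"kind": "timestamp", "tz": "UTC"})
```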
Minimal Contract Example
A minimal contract defines only metadata and columns. Without checks, DQX enforces the declared schema at load time but generates no quality checks.
```yaml
name: "Products Contract"
version: "1.0.0"
description: "Product catalog records"
owner: "catalog-team"
dataset: "products"

columns:
  - name: product_id
    type: int
    nullable: false
    description: "Unique product identifier"

  - name: name
    type: string
    nullable: false
    description: "Product display name"

  - name: price_usd
    type: decimal
    nullable: false
    description: "List price in USD"

  - name: discontinued
    type: bool
    nullable: false
    description: "Whether the product is discontinued"
```
Basic Contract Example
```yaml
name: "Orders Contract"
version: "1.0.0"
description: "Daily order records"
owner: "data-platform-team"
dataset: "orders"
tags: ["revenue"]

metadata:
  partitioned_by: ["order_date"]

columns:
  - name: order_id
    type: int
    nullable: false
    description: "Unique order identifier"
    metadata:
      primary_key: "true"
    checks:
      - name: "Order ID is unique"
        type: duplicates
        max: 0
        severity: P0
      - name: "Order ID is positive"
        type: min
        min: 1
        severity: P0

  - name: customer_id
    type: int
    nullable: false
    description: "Customer identifier"
    checks:
      - name: "Customer ID is positive"
        type: min
        min: 1
        severity: P1

  - name: total_amount
    type: decimal
    nullable: false
    description: "Total order amount in USD"
    checks:
      - name: "Amount is non-negative"
        type: min
        min: 0.0
        severity: P1
      - name: "Amount is reasonable"
        type: max
        max: 1000000.0
        severity: P1

  - name: status
    type: string
    nullable: false
    description: "Order status"
    checks:
      - name: "Status is valid"
        type: whitelist
        values: ["pending", "processing", "shipped", "delivered", "cancelled"]
        severity: P0

  # Schema-only columns (no checks)
  - name: order_date
    type: date
    nullable: false
    description: "Order date (for partitioning)"

  - name: payment_method
    type: string
    nullable: false
    description: "Payment method used"

  - name: is_gift
    type: bool
    nullable: false
    description: "Whether order is a gift"

  - name: notes
    type: string
    nullable: true
    description: "Order notes from customer"

# Table-level checks
checks:
  - name: "Daily volume within bounds"
    type: num_rows
    between: [100, 1000000]
    severity: P1
```
This contract generates checks for four columns (order_id, customer_id, total_amount, status) plus one table-level check. The four schema-only columns (order_date, payment_method, is_gift, notes) produce no checks; DQX validates their types and nullability at load time. Because the checks bind to the dataset name orders, the same checks run unchanged against any datasource whose registered name matches and whose schema satisfies the declared types.
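The split between checked and schema-only columns can be confirmed by walking the parsed contract. The dict below is an abbreviated, hand-written stand-in for the Orders contract (only names and check types are kept), and summarize is a hypothetical helper; it counts check entries in the YAML, which is not necessarily the number of DecoratedCheck functions DQX generates.

```python
def _col(name, *check_types):
    # Abbreviated column entry: only the fields this summary needs.
    return {"name": name, "checks": [{"type": t} for t in check_types]}


orders_contract = {
    "dataset": "orders",
    "checks": [{"name": "Daily volume within bounds", "type": "num_rows"}],
    "columns": [
        _col("order_id", "duplicates", "min"),
        _col("customer_id", "min"),
        _col("total_amount", "min", "max"),
        _col("status", "whitelist"),
        {"name": "order_date"},       # schema-only: no checks key at all
        {"name": "payment_method"},
        {"name": "is_gift"},
        {"name": "notes"},
    ],
}


def summarize(contract: dict) -> dict:
    columns = contract.get("columns", [])
    checked = [c["name"] for c in columns if c.get("checks")]
    schema_only = [c["name"] for c in columns if not c.get("checks")]
    column_checks = sum(len(c.get("checks", [])) for c in columns)
    table_checks = len(contract.get("checks", []))
    return {
        "checked": checked,
        "schema_only": schema_only,
        "column_checks": column_checks,
        "table_checks": table_checks,
    }


summary = summarize(orders_contract)
```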
Type System Summary
DQX defines its own contract type system aimed at simplicity and broad data quality coverage. The 12 contract types map common data concepts — integers, decimals, timestamps, lists — to their storage representations without requiring users to know the underlying storage format. A column declared as int passes validation for any integer width; only the semantic category matters.
| Category | Types | Format |
|---|---|---|
| Primitive | int, float, bool, string, bytes | type: int |
| Temporal | date, timestamp, time | type: date or type: {kind: timestamp} |
| Decimal | decimal | type: decimal |
| Complex | list, struct, map | type: {kind: list, value_type: string} |
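Simple types (primitive, temporal, decimal) use a plain string value; a simple type that carries parameters, such as a timestamp with a tz, may instead be written in object form, as in type: {kind: timestamp}. Complex types (list, struct, map) always use an object with a kind field and optional subtype fields. See Type System for the full compatibility rules per type.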
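The "any integer width" rule can be sketched as a lookup from contract type to accepted storage types. The storage type names below (int8 through int64, float32, float64) are assumptions chosen for illustration, only the two contract types whose rules are stated in this document are covered, and the full matrix belongs to the Type System reference.

```python
# Assumed storage type names; only int and float are described in this document.
ACCEPTED_STORAGE = {
    "int": {"int8", "int16", "int32", "int64"},   # any integer width
    "float": {"float32", "float64"},              # 32-bit and 64-bit
}


def storage_satisfies(contract_type: str, storage_type: str) -> bool:
    """True if the storage type falls in the contract type's semantic category."""
    try:
        accepted = ACCEPTED_STORAGE[contract_type]
    except KeyError:
        raise NotImplementedError(
            f"no sketch rule for contract type {contract_type!r}"
        ) from None
    return storage_type in accepted


narrow_int_ok = storage_satisfies("int", "int16")
float_as_int_ok = storage_satisfies("int", "float64")
```

In the real system a False result here corresponds to the mismatch that raises SchemaValidationError.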
Checks Summary
DQX contracts define 21 check types across two scopes.
4 table-level checks validate the dataset as a whole:
- num_rows: asserts total row count
- duplicates: asserts count of duplicate rows
- freshness: asserts that data is not stale (record age does not exceed max_age_hours; defaults to most recent, optionally oldest via aggregation: min)
- completeness: asserts absence of partition gaps
17 column-level checks validate individual columns. 9 are statistical:
- cardinality: distinct value count
- min: minimum value
- max: maximum value
- mean: arithmetic mean
- sum: column sum
- count: non-null count
- variance: statistical variance
- stddev: standard deviation
- percentile: value at a specified percentile
8 are value checks:
- missing: null value count
- duplicates: duplicate value count within the column
- whitelist: all values belong to an allowed set
- blacklist: no values belong to a forbidden set
- pattern: all values match a regular expression
- min_length: minimum string, list, or map element count
- max_length: maximum string, list, or map element count
- avg_length: average string, list, or map element count
Most checks, table-level or column-level, support validators: min, max, between, not_between, and equals. tolerance is an auxiliary parameter used alongside a validator (not a mutually exclusive validator itself). Exceptions are freshness (uses max_age_hours) and completeness (uses max_gap_count), which use check-specific implicit parameters instead. See Checks & Validators for validators and composition patterns.
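A check of this kind has two halves: a metric computed from the data and a validator applied to that metric. The sketch below illustrates both with hypothetical helpers. Inclusive bounds, an absolute tolerance that applies only to equals, and counting duplicates as extra occurrences of a value are all assumptions; the Checks & Validators reference is the authority on the actual semantics.

```python
from collections import Counter


def missing_count(values: list) -> int:
    # Metric for a missing check: number of null values.
    return sum(1 for v in values if v is None)


def duplicate_count(values: list) -> int:
    # Metric for a duplicates check: occurrences beyond the first of each value.
    counts = Counter(v for v in values if v is not None)
    return sum(n - 1 for n in counts.values())


def passes(metric: float, params: dict) -> bool:
    """Apply whichever validators appear in the check's parameters."""
    tolerance = params.get("tolerance", 0.0)
    if "min" in params and metric < params["min"]:
        return False
    if "max" in params and metric > params["max"]:
        return False
    if "between" in params:
        low, high = params["between"]
        if not (low <= metric <= high):
            return False
    if "not_between" in params:
        low, high = params["not_between"]
        if low <= metric <= high:
            return False
    if "equals" in params and abs(metric - params["equals"]) > tolerance:
        return False
    return True


order_ids = [1, 2, 2, 3]
unique_ok = passes(duplicate_count(order_ids), {"max": 0})
volume_ok = passes(500, {"between": [100, 1000000]})
```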
Detailed References
- Type System — Contract type definitions: primitive, temporal, decimal, and complex types
- SLA Specification — Service level agreements, scheduling, auto-generated checks, and examples
- Checks & Validators — Overview, parameter conventions, table-level checks, column-level checks, and composition patterns