Data Validation

Verified · Safe

Provides data validation patterns for custom business rules, including validation functions, test assertions, and schema evolution. Useful for data engineers building validation pipelines with checks like duplicates, referential integrity, and date ranges.

by Skills Guide Bot
Data & AI · Intermediate
1006/2/2026
Claude Code
#data-validation #data-quality #etl-validation #schema-validation #business-rules

Our review

This skill provides data validation patterns with custom functions, schema evolution handling, and test assertions for data pipelines.

Strengths

  • Modular approach with reusable validators and configurable pipelines.
  • Test assertions integrated for easy quality assurance in unit tests.
  • Supports Pydantic, Pandera, and Great Expectations depending on use case.
  • Enables checking referential integrity, duplicates, and date ranges.

Limitations

  • Requires basic pandas knowledge to configure pipelines.
  • Examples focus on DataFrames and are less suited to streaming data.
  • Documentation does not cover production error handling (logging, alerts).

When to use it

Use this skill when you need to implement custom business rule validation on DataFrames in an ETL pipeline or for data quality tests.

When not to use it

Avoid this skill if you work with real-time streaming data that requires on-the-fly validation without pandas.

Security analysis

Safe
Quality score: 78/100

The skill describes data validation patterns using pandas and custom Python functions. It does not instruct or perform any harmful actions, network operations, or execution of arbitrary commands. The allowed tools (Read, Write, Edit, Bash) are listed but not exploited for risky purposes in the provided examples. No exfiltration, obfuscation, or destructive instructions are present.

No concerns found

Examples

Validate duplicates in a DataFrame
I have a DataFrame 'df' with column 'id'. Use the validate_no_duplicates function to check for duplicate rows and print the failed rows.
Build a validation pipeline
Create a DataValidator that checks for no duplicates on 'id' and ensures 'created_at' is within 2020-01-01 to 2025-12-31. Apply it to df and print failed checks.
Test assertion for null values
In a pytest test, assert that columns 'id' and 'email' have no null values using assert_no_nulls.

name: data-validation
description: Data validation patterns and pipeline helpers. Custom validation functions, schema evolution, and test assertions.
allowed-tools: Read Write Edit Bash

Data Validation

Audience: Data engineers building validation pipelines.

Goal: Provide validation patterns for custom business rules.

Framework-specific skills:

  • pydantic-validation - Record-level validation with Pydantic
  • pandera-validation - DataFrame schema validation
  • great-expectations - Pipeline expectations and monitoring

Scripts

Import the validation functions from scripts/validators.py:

from scripts.validators import (
    ValidationResult,
    DataValidator,
    validate_no_duplicates,
    validate_referential_integrity,
    validate_date_range,
    validate_value_in_set,
    run_validation_pipeline,
    validate_with_schema_version,
    assert_schema_match,
    assert_no_nulls,
    assert_unique,
    assert_values_in_set
)
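The import list names validate_with_schema_version, but the source never shows its signature or behavior. The following is only a sketch of how version-aware schema checking could work, assuming a hypothetical registry (SCHEMAS) that maps a version number to expected column dtypes and a hypothetical dict return shape. The real function may differ in every one of these details.

```python
import pandas as pd

# Hypothetical registry: version -> {column name: expected dtype string}.
# Version 2 adds a "discount" column on top of version 1.
SCHEMAS = {
    1: {"id": "int64", "amount": "float64"},
    2: {"id": "int64", "amount": "float64", "discount": "float64"},
}


def validate_with_schema_version(df, version, schemas=SCHEMAS):
    """Assumed signature: compare df against the schema for one version."""
    if version not in schemas:
        raise KeyError(f"Unknown schema version: {version}")
    expected = schemas[version]
    actual = {col: str(dtype) for col, dtype in df.dtypes.items()}

    problems = []
    for col, dtype in expected.items():
        if col not in actual:
            problems.append(f"missing column '{col}'")
        elif actual[col] != dtype:
            problems.append(f"'{col}' is {actual[col]}, expected {dtype}")
    return {"version": version, "passed": not problems, "problems": problems}


# A frame written under version 1 passes v1 but not v2.
v1_df = pd.DataFrame({
    "id": pd.Series([1, 2], dtype="int64"),
    "amount": pd.Series([9.5, 3.0], dtype="float64"),
})
check_v1 = validate_with_schema_version(v1_df, version=1)
check_v2 = validate_with_schema_version(v1_df, version=2)
```

Keeping older versions in the registry lets historical files be validated against the schema they were written with, rather than the latest one.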

Framework Selection

| Use Case | Framework |
|----------|-----------|
| API request/response | Pydantic |
| Record-by-record ETL | Pydantic |
| DataFrame validation | Pandera |
| Type hints for DataFrames | Pandera |
| Pipeline monitoring | Great Expectations |
| Data warehouse checks | Great Expectations |
| Custom business rules | Custom functions (this skill) |

Usage Examples

Basic Validation

from scripts.validators import validate_no_duplicates, validate_referential_integrity

# Check duplicates
result = validate_no_duplicates(df, cols=['id'])
if not result.passed:
    print(f"Error: {result.message}")
    print(result.failed_rows)

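# Check referential integrity: every df['user_id'] should exist in users_df['id']
result = validate_referential_integrity(df, 'user_id', users_df, 'id')
if not result.passed:
    print(f"Error: {result.message}")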
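The contents of scripts/validators.py are not shown, so the following is a minimal sketch of what the two checks above might look like. The ValidationResult fields (passed, message, failed_rows) come from the usage example. The function bodies, the messages, and the choice to skip null foreign keys are assumptions.

```python
from dataclasses import dataclass
from typing import Optional

import pandas as pd


@dataclass
class ValidationResult:
    """Outcome of one check; field names mirror the usage example."""
    passed: bool
    message: str
    failed_rows: Optional[pd.DataFrame] = None


def validate_no_duplicates(df, cols):
    # keep=False flags every member of a duplicate group, not only the repeats
    dupes = df[df.duplicated(subset=cols, keep=False)]
    if dupes.empty:
        return ValidationResult(True, f"No duplicates on {cols}")
    return ValidationResult(False, f"{len(dupes)} duplicate rows on {cols}", dupes)


def validate_referential_integrity(df, fk_col, ref_df, ref_col):
    # Assumption: null foreign keys mean "no reference" and are not failures
    present = df[fk_col].notna()
    orphans = df[present & ~df[fk_col].isin(ref_df[ref_col])]
    if orphans.empty:
        return ValidationResult(True, f"All {fk_col} values exist in {ref_col}")
    return ValidationResult(
        False, f"{len(orphans)} rows have {fk_col} not found in {ref_col}", orphans
    )


# Sample data (hypothetical): one duplicated id and one orphaned user_id
orders = pd.DataFrame({"id": [1, 2, 2], "user_id": [10, 11, 99]})
users = pd.DataFrame({"id": [10, 11]})

dup_result = validate_no_duplicates(orders, cols=["id"])
ref_result = validate_referential_integrity(orders, "user_id", users, "id")
```

Returning the offending rows alongside the pass/fail flag is what makes `print(result.failed_rows)` in the example useful for debugging.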

Validation Pipeline

from scripts.validators import DataValidator, validate_no_duplicates, validate_date_range

validator = DataValidator()
validator.add_check(lambda df: validate_no_duplicates(df, ['id']))
validator.add_check(lambda df: validate_date_range(df, 'created_at', '2020-01-01', '2025-12-31'))

results = validator.validate(df)
if not results['passed']:
    for check in results['checks']:
        if not check['passed']:
            print(f"Failed: {check['message']}")
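For readers without access to scripts/validators.py, here is one way DataValidator and validate_date_range could be put together. The add_check/validate interface and the result keys ('passed', 'checks', 'message') follow the example above. The internals, the CheckResult stand-in, and the treatment of unparseable dates as failures are assumptions.

```python
from dataclasses import dataclass

import pandas as pd


@dataclass
class CheckResult:
    """Hypothetical stand-in for the skill's ValidationResult."""
    passed: bool
    message: str


def validate_date_range(df, col, start, end):
    # Unparseable or missing dates become NaT and are counted as out of range
    dates = pd.to_datetime(df[col], errors="coerce")
    in_range = dates.between(pd.Timestamp(start), pd.Timestamp(end))
    bad = int((~in_range).sum())
    if bad == 0:
        return CheckResult(True, f"{col}: all values within {start}..{end}")
    return CheckResult(False, f"{col}: {bad} value(s) outside {start}..{end}")


class DataValidator:
    """Collects check callables and runs them all against one DataFrame."""

    def __init__(self):
        self._checks = []

    def add_check(self, check):
        self._checks.append(check)
        return self  # returning self allows chaining (a design assumption)

    def validate(self, df):
        checks = []
        for check in self._checks:
            outcome = check(df)
            checks.append({"passed": outcome.passed, "message": outcome.message})
        return {"passed": all(c["passed"] for c in checks), "checks": checks}


# Sample data (hypothetical): the second date is past the allowed range
events = pd.DataFrame({"created_at": ["2021-05-01", "2030-01-01"]})
validator = DataValidator()
validator.add_check(
    lambda df: validate_date_range(df, "created_at", "2020-01-01", "2025-12-31")
)
results = validator.validate(events)
```

Every check runs even after an earlier one fails, so a single pass reports all problems rather than only the first.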

Config-Driven Pipeline

from scripts.validators import run_validation_pipeline

config = {
    'unique_columns': ['id'],
    'date_ranges': {
        'created_at': ('2020-01-01', '2025-12-31'),
        'updated_at': ('2020-01-01', '2025-12-31')
    }
}

clean_df, results = run_validation_pipeline(df, config)
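The source does not say what clean_df contains or how results is shaped. A plausible reading, sketched below, is that rows failing any configured check are dropped and a summary is returned alongside. Only the config keys ('unique_columns', 'date_ranges') and the two-value return come from the example. Everything else is an assumption.

```python
import pandas as pd


def run_validation_pipeline(df, config):
    """Sketch: apply the configured checks, return (passing rows, summary)."""
    keep = pd.Series(True, index=df.index)
    checks = []

    unique_cols = config.get("unique_columns")
    if unique_cols:
        # keep="first" retains the first occurrence and drops later repeats
        dupes = df.duplicated(subset=unique_cols, keep="first")
        checks.append({
            "check": f"unique:{unique_cols}",
            "passed": not dupes.any(),
            "failed_count": int(dupes.sum()),
        })
        keep &= ~dupes

    for col, (start, end) in config.get("date_ranges", {}).items():
        dates = pd.to_datetime(df[col], errors="coerce")
        in_range = dates.between(pd.Timestamp(start), pd.Timestamp(end))
        checks.append({
            "check": f"date_range:{col}",
            "passed": bool(in_range.all()),
            "failed_count": int((~in_range).sum()),
        })
        keep &= in_range

    results = {"passed": all(c["passed"] for c in checks), "checks": checks}
    return df[keep], results


# Sample data (hypothetical): one repeated id, one date before 2020
raw = pd.DataFrame({
    "id": [1, 2, 2, 3],
    "created_at": ["2021-01-01", "2022-01-01", "2022-06-01", "2019-01-01"],
})
config = {
    "unique_columns": ["id"],
    "date_ranges": {"created_at": ("2020-01-01", "2025-12-31")},
}
clean_df, results = run_validation_pipeline(raw, config)
```

Keeping the rules in a plain dict means they can be loaded from YAML or JSON and changed without touching pipeline code.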

Test Assertions

from scripts.validators import assert_schema_match, assert_no_nulls, assert_unique

# In pytest
def test_data_quality():
    assert_schema_match(df, {'id': 'int64', 'email': 'object'})
    assert_no_nulls(df, ['id', 'email'])
    assert_unique(df, ['id'])
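The assertion helpers are imported from scripts/validators.py, which is not shown. A minimal sketch of how they might be written follows, assuming each raises AssertionError with a descriptive message so pytest reports the failure directly. The bodies and messages are illustrative, not the skill's actual code.

```python
import pandas as pd


def assert_schema_match(df, expected):
    """expected maps column name -> dtype string, e.g. {'id': 'int64'}."""
    actual = {col: str(dtype) for col, dtype in df.dtypes.items()}
    missing = sorted(set(expected) - set(actual))
    assert not missing, f"Missing columns: {missing}"
    mismatched = {
        col: (actual[col], dtype)
        for col, dtype in expected.items()
        if actual[col] != dtype
    }
    assert not mismatched, f"Dtype mismatches (actual, expected): {mismatched}"


def assert_no_nulls(df, cols):
    null_counts = df[cols].isna().sum()
    bad = null_counts[null_counts > 0].to_dict()
    assert not bad, f"Null values found: {bad}"


def assert_unique(df, cols):
    n_dupes = int(df.duplicated(subset=cols).sum())
    assert n_dupes == 0, f"{n_dupes} duplicate row(s) on {cols}"


# Sample data (hypothetical); dtypes are set explicitly so the schema is stable
users = pd.DataFrame({
    "id": pd.Series([1, 2, 3], dtype="int64"),
    "email": pd.Series(["a@example.com", "b@example.com", None], dtype="object"),
})

assert_schema_match(users, {"id": "int64", "email": "object"})  # passes
assert_unique(users, ["id"])                                    # passes

null_error = None
try:
    assert_no_nulls(users, ["id", "email"])  # email has one null
except AssertionError as exc:
    null_error = str(exc)
```

Depending on the pandas version, string columns may report a dedicated string dtype rather than object, so the expected dtype for a column like 'email' may need adjusting.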

Dependencies

pandas