Our review
This skill provides data validation patterns with custom functions, schema evolution handling, and test assertions for data pipelines.
Strengths
- Modular approach with reusable validators and configurable pipelines.
- Integrated test assertions make data quality checks easy to add to unit tests.
- Supports Pydantic, Pandera, and Great Expectations depending on use case.
- Enables checking referential integrity, duplicates, and date ranges.
Limitations
- Requires basic pandas knowledge to configure pipelines.
- Examples focus on DataFrames, so the patterns are less suited to streaming data.
- Documentation does not cover production error handling (logging, alerts).
Use this skill when you need to implement custom business rule validation on DataFrames in an ETL pipeline or for data quality tests.
Avoid this skill if you work with real-time streaming data that requires on-the-fly validation without pandas.
Security analysis
Safe
The skill describes data validation patterns using pandas and custom Python functions. It does not instruct or perform any harmful actions, network operations, or execution of arbitrary commands. The allowed tools (Read, Write, Edit, Bash) are listed but not exploited for risky purposes in the provided examples. No exfiltration, obfuscation, or destructive instructions are present.
No concerns found
Examples
- I have a DataFrame 'df' with column 'id'. Use the validate_no_duplicates function to check for duplicate rows and print the failed rows.
- Create a DataValidator that checks for no duplicates on 'id' and ensures 'created_at' is within 2020-01-01 to 2025-12-31. Apply it to df and print failed checks.
- In a pytest test, assert that columns 'id' and 'email' have no null values using assert_no_nulls.

name: data-validation
description: Data validation patterns and pipeline helpers. Custom validation functions, schema evolution, and test assertions.
allowed-tools: Read Write Edit Bash
Data Validation
Audience: Data engineers building validation pipelines.
Goal: Provide validation patterns for custom business rules.
Framework-specific skills:
- pydantic-validation: Record-level validation with Pydantic
- pandera-validation: DataFrame schema validation
- great-expectations: Pipeline expectations and monitoring
Scripts
Import the validation functions from scripts/validators.py:
from scripts.validators import (
    ValidationResult,
    DataValidator,
    validate_no_duplicates,
    validate_referential_integrity,
    validate_date_range,
    validate_value_in_set,
    run_validation_pipeline,
    validate_with_schema_version,
    assert_schema_match,
    assert_no_nulls,
    assert_unique,
    assert_values_in_set,
)
Framework Selection
| Use Case | Framework |
|----------|-----------|
| API request/response | Pydantic |
| Record-by-record ETL | Pydantic |
| DataFrame validation | Pandera |
| Type hints for DataFrames | Pandera |
| Pipeline monitoring | Great Expectations |
| Data warehouse checks | Great Expectations |
| Custom business rules | Custom functions (this skill) |
Usage Examples
Basic Validation
from scripts.validators import validate_no_duplicates, validate_referential_integrity
# Check duplicates
result = validate_no_duplicates(df, cols=['id'])
if not result.passed:
    print(f"Error: {result.message}")
    print(result.failed_rows)
# Check referential integrity
result = validate_referential_integrity(df, 'user_id', users_df, 'id')
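The source does not show the contents of scripts/validators.py, so the following is only a minimal sketch of how the two checks above might be implemented with pandas. The field names `passed`, `message`, and `failed_rows` come from the usage above; everything else (the dataclass layout, the message wording, the sample frames `orders` and `users`) is an assumption made for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional

import pandas as pd


@dataclass
class ValidationResult:
    """Assumed shape of the result object; the real class may carry more fields."""
    passed: bool
    message: str
    failed_rows: Optional[pd.DataFrame] = None


def validate_no_duplicates(df: pd.DataFrame, cols: List[str]) -> ValidationResult:
    # keep=False flags every member of a duplicate group, not only the repeats
    mask = df.duplicated(subset=cols, keep=False)
    if mask.any():
        return ValidationResult(
            False, f"{int(mask.sum())} duplicate rows on {cols}", df[mask]
        )
    return ValidationResult(True, f"No duplicates on {cols}")


def validate_referential_integrity(
    df: pd.DataFrame, fk_col: str, ref_df: pd.DataFrame, ref_col: str
) -> ValidationResult:
    # A row is an orphan when its foreign key never appears in the reference column
    orphans = df[~df[fk_col].isin(ref_df[ref_col])]
    if not orphans.empty:
        return ValidationResult(
            False,
            f"{len(orphans)} rows in '{fk_col}' have no match in '{ref_col}'",
            orphans,
        )
    return ValidationResult(True, f"All '{fk_col}' values exist in '{ref_col}'")


# Hypothetical sample data: one duplicated id and one unknown user_id
orders = pd.DataFrame({"id": [1, 2, 2], "user_id": [10, 11, 99]})
users = pd.DataFrame({"id": [10, 11]})

dup_result = validate_no_duplicates(orders, ["id"])
ref_result = validate_referential_integrity(orders, "user_id", users, "id")
print(dup_result.message)
print(ref_result.message)
```

Returning the offending rows alongside a boolean lets the caller decide whether to log, drop, or quarantine them, which is consistent with how `result.failed_rows` is printed above.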
Validation Pipeline
from scripts.validators import DataValidator, validate_no_duplicates, validate_date_range
validator = DataValidator()
validator.add_check(lambda df: validate_no_duplicates(df, ['id']))
validator.add_check(lambda df: validate_date_range(df, 'created_at', '2020-01-01', '2025-12-31'))
results = validator.validate(df)
if not results['passed']:
    for check in results['checks']:
        if not check['passed']:
            print(f"Failed: {check['message']}")
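To make the pipeline's contract concrete, here is a rough sketch of a validator that collects check callables and returns the dictionary shape used above (`passed` plus a list of `checks`). The class body, the `CheckResult` stand-in, and the `check_date_range` helper are assumptions for illustration, not the skill's actual code.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, List

import pandas as pd


@dataclass
class CheckResult:
    """Hypothetical stand-in for the skill's ValidationResult."""
    passed: bool
    message: str


class DataValidator:
    """Sketch: runs each registered check and aggregates the outcomes."""

    def __init__(self) -> None:
        self._checks: List[Callable[[pd.DataFrame], CheckResult]] = []

    def add_check(self, check: Callable[[pd.DataFrame], CheckResult]) -> None:
        self._checks.append(check)

    def validate(self, df: pd.DataFrame) -> Dict[str, Any]:
        checks = []
        for check in self._checks:
            result = check(df)
            checks.append({"passed": result.passed, "message": result.message})
        # The pipeline passes only when every individual check passed
        return {"passed": all(c["passed"] for c in checks), "checks": checks}


def check_date_range(df: pd.DataFrame, col: str, start: str, end: str) -> CheckResult:
    # Unparseable values become NaT and are counted as out of range
    dates = pd.to_datetime(df[col], errors="coerce")
    outside = ~dates.between(pd.Timestamp(start), pd.Timestamp(end))
    if outside.any():
        return CheckResult(
            False, f"{int(outside.sum())} rows in '{col}' fall outside {start} to {end}"
        )
    return CheckResult(True, f"All '{col}' values fall within {start} to {end}")


# Hypothetical sample data: the second date is past the allowed range
events = pd.DataFrame({"id": [1, 2], "created_at": ["2021-05-01", "2030-01-01"]})

validator = DataValidator()
validator.add_check(lambda df: check_date_range(df, "created_at", "2020-01-01", "2025-12-31"))
report = validator.validate(events)
print(report)
```

Wrapping each check in a lambda, as the usage above does, keeps the validator unaware of per-check arguments: it only needs callables that take a DataFrame.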
Config-Driven Pipeline
from scripts.validators import run_validation_pipeline
config = {
    'unique_columns': ['id'],
    'date_ranges': {
        'created_at': ('2020-01-01', '2025-12-31'),
        'updated_at': ('2020-01-01', '2025-12-31')
    }
}
clean_df, results = run_validation_pipeline(df, config)
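One plausible reading of the config-driven call is sketched below: each config key maps to a check, rows that fail any check are dropped, and the function returns the surviving rows together with a per-check report. Whether the real `run_validation_pipeline` drops rows this way is an assumption inferred from the name `clean_df`; the report keys (`check`, `passed`, `failed_count`) and the sample frame are likewise hypothetical.

```python
from typing import Any, Dict, List, Tuple

import pandas as pd


def run_validation_pipeline(
    df: pd.DataFrame, config: Dict[str, Any]
) -> Tuple[pd.DataFrame, List[Dict[str, Any]]]:
    results: List[Dict[str, Any]] = []
    keep = pd.Series(True, index=df.index)  # rows that survive every check

    unique_cols = config.get("unique_columns")
    if unique_cols:
        # keep="first" retains the first occurrence and flags later repeats
        dup = df.duplicated(subset=unique_cols, keep="first")
        results.append(
            {"check": "unique_columns", "passed": not dup.any(), "failed_count": int(dup.sum())}
        )
        keep &= ~dup

    for col, (start, end) in config.get("date_ranges", {}).items():
        dates = pd.to_datetime(df[col], errors="coerce")
        outside = ~dates.between(pd.Timestamp(start), pd.Timestamp(end))
        results.append(
            {"check": f"date_range:{col}", "passed": not outside.any(), "failed_count": int(outside.sum())}
        )
        keep &= ~outside

    return df[keep].copy(), results


# Hypothetical sample data: one repeated id and one date before the allowed range
raw = pd.DataFrame(
    {
        "id": [1, 1, 2, 3],
        "created_at": ["2021-01-01", "2021-01-02", "2019-06-01", "2022-03-03"],
    }
)
sample_config = {
    "unique_columns": ["id"],
    "date_ranges": {"created_at": ("2020-01-01", "2025-12-31")},
}

clean, check_results = run_validation_pipeline(raw, sample_config)
print(clean)
print(check_results)
```

Keeping the rules in a plain dictionary means they can be loaded from a YAML or JSON file and changed without touching pipeline code.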
Test Assertions
from scripts.validators import assert_schema_match, assert_no_nulls, assert_unique
# In pytest
def test_data_quality():
    assert_schema_match(df, {'id': 'int64', 'email': 'object'})
    assert_no_nulls(df, ['id', 'email'])
    assert_unique(df, ['id'])
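The assertion helpers are not shown in the source, so the sketch below is one way they could be written: each raises `AssertionError` with a descriptive message, which pytest reports as a normal test failure. The signatures mirror the calls above; the bodies, message wording, and the choice to compare dtypes by their string names are assumptions.

```python
from typing import Dict, List

import pandas as pd


def assert_schema_match(df: pd.DataFrame, expected: Dict[str, str]) -> None:
    # Compare dtype names as strings, e.g. "int64" or "float64"
    actual = {col: str(dtype) for col, dtype in df.dtypes.items()}
    missing = sorted(set(expected) - set(actual))
    if missing:
        raise AssertionError(f"Missing columns: {missing}")
    mismatched = {
        col: {"expected": dtype, "actual": actual[col]}
        for col, dtype in expected.items()
        if actual[col] != dtype
    }
    if mismatched:
        raise AssertionError(f"Dtype mismatch: {mismatched}")


def assert_no_nulls(df: pd.DataFrame, cols: List[str]) -> None:
    null_counts = df[cols].isna().sum()
    offenders = null_counts[null_counts > 0].to_dict()
    if offenders:
        raise AssertionError(f"Null values found: {offenders}")


def assert_unique(df: pd.DataFrame, cols: List[str]) -> None:
    n_dupes = int(df.duplicated(subset=cols).sum())
    if n_dupes:
        raise AssertionError(f"{n_dupes} duplicate rows on {cols}")


# Hypothetical sample data with explicit dtypes so the schema check is deterministic
good = pd.DataFrame(
    {
        "id": pd.Series([1, 2, 3], dtype="int64"),
        "score": pd.Series([0.5, 0.7, 0.9], dtype="float64"),
    }
)

assert_schema_match(good, {"id": "int64", "score": "float64"})
assert_no_nulls(good, ["id", "score"])
assert_unique(good, ["id"])
```

Raising explicitly rather than using a bare `assert` statement keeps the helpers active even when Python runs with the `-O` flag, which strips `assert` statements.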
Dependencies
pandas