Validators

Pavise provides validators to enforce data quality constraints beyond type checking. Validators are attached to column types using typing.Annotated.

Available Validators

Range

Validates that numeric values are within a specified range.

from typing import Annotated, Protocol
from pavise.validators import Range
from pavise.pandas import DataFrame

class Schema(Protocol):
    age: Annotated[int, Range(0, 150)]
    score: Annotated[float, Range(0.0, 100.0)]

validated_df = DataFrame[Schema](df)

Error message example:

ValueError: Column 'age': values must be in range [0, 150]

Sample invalid values (showing first 3 of 5):
  Row 1: 200
  Row 3: -5
  Row 5: 300

Unique

Validates that column values are unique (no duplicates).

from pavise.validators import Unique

class Schema(Protocol):
    user_id: Annotated[int, Unique()]
    email: Annotated[str, Unique()]

Error message example:

ValueError: Column 'user_id': contains duplicate values

Sample duplicate values (showing first 2):
  Value 2 at rows: [1, 3]
  Value 5 at rows: [2, 4]

In

Validates that column values are within a set of allowed values.

from pavise.validators import In

class Schema(Protocol):
    status: Annotated[str, In(["pending", "approved", "rejected"])]
    priority: Annotated[int, In([1, 2, 3, 4, 5])]

Error message example:

ValueError: Column 'status': contains values not in allowed values

Sample invalid values (showing first 2 of 3):
  Row 1: 'invalid'
  Row 3: 'bad'

Regex

Validates that string values match a regular expression pattern.

from pavise.validators import Regex

class Schema(Protocol):
    email: Annotated[str, Regex(r'^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$')]
    phone: Annotated[str, Regex(r'^\d{3}-\d{4}-\d{4}$')]

Error message example:

ValueError: Column 'email': contains values that don't match the pattern

Sample invalid values (showing first 2 of 4):
  Row 1: 'invalid'
  Row 3: 'bad@'

MinLen / MaxLen

Validates string length constraints.

from pavise.validators import MinLen, MaxLen

class Schema(Protocol):
    username: Annotated[str, MinLen(3), MaxLen(20)]
    password: Annotated[str, MinLen(8)]

Error message example:

ValueError: Column 'username': contains strings shorter than minimum length

Sample invalid values (showing first 2 of 3):
  Row 0: 'ab' (length: 2)
  Row 2: 'x' (length: 1)

Custom

Create custom validators for business-specific logic.

from pavise.validators import Custom

def is_positive(value) -> bool:
    return value > 0

def is_business_day(value) -> bool:
    return value.weekday() < 5

class Schema(Protocol):
    amount: Annotated[int, Custom(is_positive, "must be positive")]
    date: Annotated[datetime.date, Custom(is_business_day, "must be a business day")]

Error message example:

ValueError: Column 'amount': must be positive

Sample invalid values (showing first 3 of 5):
  Row 1: -100
  Row 3: 0
  Row 7: -50

Combining Multiple Validators

You can attach multiple validators to a single column:

from typing import Annotated, Protocol
from pavise.validators import Range, Custom

def is_even(value) -> bool:
    return value % 2 == 0

class Schema(Protocol):
    score: Annotated[int, Range(0, 100), Custom(is_even, "must be even")]

All validators are checked in order, and the first failure is reported.

Performance Considerations

Validators are only executed during runtime validation:

# No validation, no performance cost
def process(df: DataFrame[Schema]) -> DataFrame[Schema]:
    return df

# Validation happens here
validated_df = DataFrame[Schema](raw_df)

# No further validation cost
result = process(validated_df)

For large DataFrames, consider validating only at system boundaries and relying on type checking for internal functions.