Getting Started

Installation

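Pavise requires Python 3.9 or later. Install it with pip, selecting your preferred backend as an extra. Quoting the package name keeps shells such as zsh from interpreting the square brackets as a glob pattern:

For pandas backend:

pip install "pavise[pandas]"

For polars backend:

pip install "pavise[polars]"

For both backends:

pip install "pavise[all]"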

Basic Usage

Define a Schema

Define your DataFrame schema as a typing.Protocol class, with one annotated attribute per column:

from typing import Protocol

class UserSchema(Protocol):
    user_id: int
    name: str
    age: int
    email: str
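
The attributes of a Protocol class are ordinary class annotations, so they can be read back at runtime with the standard library. The sketch below only illustrates that mechanism using typing.get_type_hints; it is an assumption about how a schema can be inspected, not a description of Pavise's internals.

```python
from typing import Protocol, get_type_hints


class UserSchema(Protocol):
    user_id: int
    name: str
    age: int
    email: str


# Map each declared column name to its annotated type.
columns = get_type_hints(UserSchema)

for column_name, column_type in columns.items():
    print(f"{column_name}: {column_type.__name__}")
```

Because the schema is plain Python, the same class serves both the static type checker and any code that needs the column names and types at runtime.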

Static Type Checking

Use the schema for static type checking with mypy, pyright, or other type checkers:

from pavise.pandas import DataFrame

def process_users(df: DataFrame[UserSchema]) -> DataFrame[UserSchema]:
    # Type checker validates the schema
    # No runtime overhead
    return df

Runtime Validation

Validate DataFrames at runtime, typically at system boundaries:

import pandas as pd
from pavise.pandas import DataFrame
from pavise.exceptions import ValidationError

# Load data from external source
raw_df = pd.read_csv("users.csv")

# Validate at system boundary
try:
    validated_df = DataFrame[UserSchema](raw_df)
except ValidationError as e:
    print(f"Validation failed: {e}")

If validation fails, the ValidationError message names the offending column, the expected and actual types, and a sample of the invalid values:

Validation failed: Column 'age': expected int, got object

Sample invalid values (showing first 3 of 10):
  Row 1: 'invalid' (str)
  Row 5: None (NoneType)
  Row 8: 200.5 (float)
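
To make the shape of that report concrete, here is a minimal pure-Python sketch that collects offending rows from a list of values and prints them in a similar layout. The helper sample_invalid and the toy ages list are hypothetical, introduced only for illustration; this is not how Pavise builds its message.

```python
def sample_invalid(values, expected, limit=3):
    """Return the number of invalid values and the first `limit` (row, value) pairs."""
    invalid = [
        (row, value)
        for row, value in enumerate(values)
        if not isinstance(value, expected)
    ]
    return len(invalid), invalid[:limit]


# Toy column standing in for the 'age' column of a DataFrame.
ages = [31, "invalid", 45, 28, 52, None, 19, 40, 200.5, 33]

total, sample = sample_invalid(ages, int)
print(f"Sample invalid values (showing first {len(sample)} of {total}):")
for row, value in sample:
    print(f"  Row {row}: {value!r} ({type(value).__name__})")
```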

Using Validators

Add validators using typing.Annotated:

from typing import Annotated, Protocol
from pavise.pandas import DataFrame
from pavise.validators import Range, Regex

class UserSchema(Protocol):
    user_id: int
    name: str
    age: Annotated[int, Range(0, 150)]
    email: Annotated[str, Regex(r'^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$')]

# Runtime validation with validators
validated_df = DataFrame[UserSchema](raw_df)
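
Metadata attached with Annotated survives at runtime and can be recovered with typing.get_type_hints(..., include_extras=True). The sketch below shows that mechanism with a hypothetical stand-in Range class; it is not pavise.validators.Range, and its inclusive bounds are an assumption made for the example.

```python
from dataclasses import dataclass
from typing import Annotated, Protocol, get_args, get_type_hints


@dataclass(frozen=True)
class Range:
    """Hypothetical stand-in validator, not pavise.validators.Range."""

    low: float
    high: float

    def check(self, value) -> bool:
        # Inclusive bounds are an assumption of this sketch.
        return self.low <= value <= self.high


class UserSchema(Protocol):
    user_id: int
    age: Annotated[int, Range(0, 150)]


# include_extras=True keeps the Annotated metadata instead of stripping it.
hints = get_type_hints(UserSchema, include_extras=True)

# For Annotated[T, m1, m2, ...], get_args returns (T, m1, m2, ...).
base_type, *validators = get_args(hints["age"])

ages = [25, 151, -1, 40]
out_of_range = [a for a in ages if not all(v.check(a) for v in validators)]
print(base_type.__name__, out_of_range)
```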

Using Literal Types

Restrict column values to specific literals using Literal:

from typing import Literal, Protocol
from pavise.pandas import DataFrame

class OrderSchema(Protocol):
    order_id: int
    status: Literal["pending", "approved", "rejected"]
    priority: Literal[1, 2, 3]

# Only these exact values are allowed
validated_df = DataFrame[OrderSchema](raw_df)
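
The values permitted by a Literal annotation can be read with typing.get_args, which makes a membership check straightforward. The following is a standard-library sketch of that idea on a toy list of statuses; it is an illustration under that assumption, not Pavise's own validation code.

```python
from typing import Literal, Protocol, get_args, get_type_hints


class OrderSchema(Protocol):
    order_id: int
    status: Literal["pending", "approved", "rejected"]
    priority: Literal[1, 2, 3]


hints = get_type_hints(OrderSchema)

# get_args on a Literal type returns the permitted values as a tuple.
allowed_status = get_args(hints["status"])
allowed_priority = get_args(hints["priority"])

# Toy column values standing in for a DataFrame column.
statuses = ["pending", "shipped", "approved"]
unexpected = [s for s in statuses if s not in allowed_status]
print(allowed_status, allowed_priority, unexpected)
```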

Optional Columns

Use NotRequiredColumn[T] for columns that may not exist in the DataFrame:

from typing import Optional, Protocol

import pandas as pd
from pavise.pandas import DataFrame, NotRequiredColumn

class UserSchema(Protocol):
    user_id: int
    name: str
    age: NotRequiredColumn[int]  # Column can be missing
    email: NotRequiredColumn[Optional[str]]  # Column can be missing, or contain None

# Both of these are valid
df1 = pd.DataFrame({"user_id": [1], "name": ["Alice"]})  # age and email missing
df2 = pd.DataFrame({"user_id": [1], "name": ["Alice"], "age": [25]})  # only email missing

validated_df1 = DataFrame[UserSchema](df1)  # OK
validated_df2 = DataFrame[UserSchema](df2)  # OK

Note: NotRequiredColumn[T] means the column is optional, while Optional[T] means the column can contain None values. Use NotRequiredColumn[Optional[T]] for columns that are both optional and nullable.
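
The two ideas answer different questions: whether a column exists at all, and whether an existing column may hold None. The sketch below separates them using only the standard library and a plain dict as a toy stand-in for a DataFrame; the helper is_nullable is hypothetical and does not reflect how Pavise represents NotRequiredColumn.

```python
from typing import Optional, get_args


def is_nullable(tp) -> bool:
    """True when the annotation admits None, e.g. Optional[str]."""
    return type(None) in get_args(tp)


# Toy stand-in for a DataFrame: column name -> list of values.
columns = {"user_id": [1, 2], "email": ["a@example.com", None]}

# Question 1: is the column present? (the NotRequiredColumn concern)
age_present = "age" in columns

# Question 2: does a present column contain None? (the Optional concern)
email_has_nulls = any(value is None for value in columns["email"])

print(is_nullable(Optional[str]), is_nullable(int), age_present, email_has_nulls)
```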

Next Steps