pandas Backend

The pandas backend provides validation for pandas DataFrames.

Installation

pip install "pavise[pandas]"

Basic Usage

from typing import Protocol
from pavise.pandas import DataFrame
import pandas as pd

class UserSchema(Protocol):
    user_id: int
    name: str
    age: int

# Create a pandas DataFrame
df = pd.DataFrame({
    "user_id": [1, 2, 3],
    "name": ["Alice", "Bob", "Charlie"],
    "age": [25, 30, 35]
})

# Validate
validated_df = DataFrame[UserSchema](df)
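
Pavise's internals are not shown on this page. As a rough illustration of what a Protocol schema carries at runtime, the sketch below reads the annotations with the standard library's typing.get_type_hints and compares them to a DataFrame's columns. The name `missing` is a hypothetical helper variable, and this is not how Pavise itself is implemented.

```python
from typing import Protocol, get_type_hints

import pandas as pd


class UserSchema(Protocol):
    user_id: int
    name: str
    age: int


# The schema is just a mapping of column name -> Python type.
expected = get_type_hints(UserSchema)

# A frame that lacks the "age" column required by the schema.
df = pd.DataFrame({"user_id": [1, 2, 3], "name": ["Alice", "Bob", "Charlie"]})

# Columns the schema requires but the frame does not provide.
missing = [column for column in expected if column not in df.columns]
print(missing)
```

A frame like this one is the kind of input that validation is meant to reject before it reaches the rest of the code.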

Type Mapping

Pavise maps Python types to pandas dtypes:

Python Type    pandas dtype
-----------    ---------------------
int            int64
float          float64
str            object (str)
bool           bool
datetime       datetime64[ns]
date           datetime64[ns]
timedelta      timedelta64[ns]
Optional[T]    Nullable version of T
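
To see which dtypes pandas itself infers for plain Python values, a short pandas-only check can help when a column does not match the table. This is a sketch with made-up column names. The datetime and timedelta columns are tested with pandas.api.types helpers rather than an exact dtype string, because the time unit pandas infers can differ between pandas versions.

```python
from datetime import datetime, timedelta

import pandas as pd

df = pd.DataFrame({
    "count": [1, 2, 3],
    "ratio": [0.5, 1.5, 2.5],
    "flag": [True, False, True],
    "created": [datetime(2024, 1, 1), datetime(2024, 1, 2), datetime(2024, 1, 3)],
    "elapsed": [timedelta(seconds=1), timedelta(seconds=2), timedelta(seconds=3)],
})

# Map each column name to the string form of its inferred dtype.
dtypes = {column: str(dtype) for column, dtype in df.dtypes.items()}

# Unit-agnostic checks for the temporal columns.
created_is_datetime = pd.api.types.is_datetime64_any_dtype(df["created"])
elapsed_is_timedelta = pd.api.types.is_timedelta64_dtype(df["elapsed"])
print(dtypes)
```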

pandas ExtensionDtype

You can use pandas extension dtypes directly:

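from typing import Protocol
import pandas as pd
from pavise.pandas import DataFrame

class Schema(Protocol):
    category: pd.CategoricalDtype
    nullable_int: pd.Int64Dtype
    string: pd.StringDtype

# df is a DataFrame whose columns already use these extension dtypes
validated_df = DataFrame[Schema](df)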

This gives you more control over the exact dtype used.
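
A DataFrame matching such a schema has to be built with the extension dtypes already in place. The sketch below uses only pandas to construct one and to confirm each column's dtype class. The sample values are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    # Categorical data gets a CategoricalDtype automatically.
    "category": pd.Categorical(["a", "b", "a"]),
    # "Int64" (capital I) is the nullable integer extension dtype.
    "nullable_int": pd.array([1, None, 3], dtype="Int64"),
    # "string" selects the StringDtype extension dtype.
    "string": pd.array(["x", None, "z"], dtype="string"),
})

is_categorical = isinstance(df["category"].dtype, pd.CategoricalDtype)
is_nullable_int = isinstance(df["nullable_int"].dtype, pd.Int64Dtype)
is_string = isinstance(df["string"].dtype, pd.StringDtype)
print(is_categorical, is_nullable_int, is_string)
```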

Index Validation

Validate the index type using the special __index__ attribute:

from typing import Protocol
import pandas as pd
from pavise.pandas import DataFrame

class Schema(Protocol):
    __index__: int  # Validates index is int64
    value: float

# Create DataFrame with integer index
df = pd.DataFrame({"value": [1.0, 2.0, 3.0]}, index=[0, 1, 2])

validated_df = DataFrame[Schema](df)
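
Before validating, the index dtype can be inspected with plain pandas. The sketch below assumes nothing about Pavise. It shows that an integer index reports int64, while an index of string labels is not an integer dtype and would not satisfy an `__index__: int` annotation as described above.

```python
import pandas as pd

df = pd.DataFrame({"value": [1.0, 2.0, 3.0]}, index=[0, 1, 2])

# An explicit list of ints produces an int64 index.
index_dtype = str(df.index.dtype)
index_is_int = pd.api.types.is_integer_dtype(df.index.dtype)

# Replacing the index with string labels changes its dtype.
by_label = df.set_index(pd.Index(["a", "b", "c"]))
label_index_is_int = pd.api.types.is_integer_dtype(by_label.index.dtype)
print(index_dtype, index_is_int, label_index_is_int)
```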

Named Index Validation

Use Annotated to validate both the index type and name:

from typing import Protocol, Annotated
import pandas as pd
from pavise.pandas import DataFrame

class UserSchema(Protocol):
    __index__: Annotated[int, "user_id"]  # Validates type AND name
    username: str
    score: float

# Create DataFrame with named index
df = pd.DataFrame(
    {"username": ["alice", "bob"], "score": [95.0, 87.0]},
    index=pd.Index([1, 2], name="user_id")
)

validated_df = DataFrame[UserSchema](df)

# This will fail - wrong index name
df_wrong = pd.DataFrame(
    {"username": ["alice"], "score": [95.0]},
    index=pd.Index([1], name="id")  # Expected "user_id"
)
DataFrame[UserSchema](df_wrong)
# ValidationError: Index name expected 'user_id', got 'id'
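
When the data is correct and only the index name is wrong, the name can be changed with pandas' rename_axis before validating. This is a pandas-only sketch. The `df_fixed` variable name is illustrative.

```python
import pandas as pd

df_wrong = pd.DataFrame(
    {"username": ["alice"], "score": [95.0]},
    index=pd.Index([1], name="id"),
)

# rename_axis returns a new DataFrame; the original is left unchanged.
df_fixed = df_wrong.rename_axis("user_id")
print(df_fixed.index.name)
```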

MultiIndex Validation

For MultiIndex, use a tuple of types with a tuple of names:

from typing import Protocol, Annotated
import pandas as pd
from pavise.pandas import DataFrame

class RegionalSalesSchema(Protocol):
    __index__: Annotated[tuple[str, int], ("region", "user_id")]
    sales: float
    quantity: int

# Create DataFrame with MultiIndex
df = pd.DataFrame(
    {"sales": [100.0, 200.0, 150.0], "quantity": [5, 10, 7]},
    index=pd.MultiIndex.from_tuples(
        [("East", 1), ("East", 2), ("West", 1)],
        names=["region", "user_id"]
    )
)

validated_df = DataFrame[RegionalSalesSchema](df)
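
Data often arrives with the index levels as ordinary columns. A minimal pandas-only sketch of building the same MultiIndex with set_index, then inspecting the level names and one level's dtype, might look like this. The `flat` frame is an assumed input shape, not something Pavise requires.

```python
import pandas as pd

# Index levels start out as regular columns.
flat = pd.DataFrame({
    "region": ["East", "East", "West"],
    "user_id": [1, 2, 1],
    "sales": [100.0, 200.0, 150.0],
    "quantity": [5, 10, 7],
})

# Passing a list of column names produces a MultiIndex in that order.
df = flat.set_index(["region", "user_id"])

level_names = list(df.index.names)
user_id_dtype = str(df.index.get_level_values("user_id").dtype)
print(level_names, user_id_dtype)
```

The order of the names passed to set_index determines the level order, so it should match the order of the tuple in the schema.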

Nullable Types

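pandas' default int64 dtype has no missing-value marker, so an integer column that contains nulls is stored as float64:

from typing import Optional, Protocol
import pandas as pd
from pavise.pandas import DataFrame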

class Schema(Protocol):
    value: Optional[int]

# pandas converts int to float64 when there are nulls
df = pd.DataFrame({"value": [1, 2, None]})  # dtype: float64
validated_df = DataFrame[Schema](df)

For true nullable integers, use pd.Int64Dtype:

class Schema(Protocol):
    value: pd.Int64Dtype

df = pd.DataFrame({"value": pd.array([1, 2, None], dtype=pd.Int64Dtype())})
validated_df = DataFrame[Schema](df)
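
The difference between the two representations can be seen with pandas alone. The sketch below shows the float64 promotion and a conversion to the nullable Int64 dtype with astype, which keeps the values as integers and marks the gap as missing.

```python
import pandas as pd

df = pd.DataFrame({"value": [1, 2, None]})

# The None forces pandas to fall back to float64 with NaN.
inferred = str(df["value"].dtype)

# Converting to "Int64" restores integer values and keeps the null.
nullable = df["value"].astype("Int64")
converted = str(nullable.dtype)
null_count = int(nullable.isna().sum())
print(inferred, converted, null_count)
```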

Method Chaining

Note: pandas method chaining may lose Pavise type information:

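validated_df = DataFrame[UserSchema](df)

# Type information is lost after pandas operations.
# numeric_only=True skips the string "name" column, which cannot be averaged.
result = validated_df.groupby("age").mean(numeric_only=True)  # result is not DataFrame[UserSchema]

# Re-validate against a schema that describes the new shape
class ResultSchema(Protocol):
    __index__: int  # groupby moves "age" into the index
    user_id: float  # mean() returns float64

revalidated = DataFrame[ResultSchema](result)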
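
Writing a schema for the result requires knowing what shape the pandas operation produces. The pandas-only sketch below inspects a grouped mean. It passes numeric_only=True as an assumption about the intended aggregation, since recent pandas versions raise an error when asked to average a string column.

```python
import pandas as pd

df = pd.DataFrame({
    "user_id": [1, 2, 3],
    "name": ["Alice", "Bob", "Charlie"],
    "age": [25, 30, 30],
})

# Average only the numeric columns within each age group.
result = df.groupby("age").mean(numeric_only=True)

index_name = result.index.name             # the grouping key becomes the index
columns = list(result.columns)             # "name" is dropped, "age" moved to the index
user_id_dtype = str(result["user_id"].dtype)
print(index_name, columns, user_id_dtype)
```

The grouping key becomes a named index and the averaged column becomes float64, so the result needs its own schema rather than the input's.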

Performance Considerations

Validation checks all rows for type correctness, which can be slow for large DataFrames. For performance-critical code:

  1. Validate once at system boundaries

  2. Use type annotations without validation for internal functions

  3. Trust the type system after initial validation

# Validate once
validated_df = DataFrame[UserSchema](raw_df)

# No validation overhead in internal functions
def process(df: DataFrame[UserSchema]) -> DataFrame[UserSchema]:
    return df

result = process(validated_df)