pandas Backend
The pandas backend provides validation for pandas DataFrames.
Installation
pip install "pavise[pandas]"
Basic Usage
from typing import Protocol
from pavise.pandas import DataFrame
import pandas as pd
class UserSchema(Protocol):
    user_id: int
    name: str
    age: int

# Create a pandas DataFrame
df = pd.DataFrame({
    "user_id": [1, 2, 3],
    "name": ["Alice", "Bob", "Charlie"],
    "age": [25, 30, 35],
})
# Validate
validated_df = DataFrame[UserSchema](df)
Type Mapping
Pavise maps Python types to pandas dtypes:
| Python Type | pandas dtype |
|---|---|
| int | int64 |
| float | float64 |
| str | object (str) |
| bool | bool |
| datetime | datetime64[ns] |
| date | datetime64[ns] |
| timedelta | timedelta64[ns] |
| Optional[T] | Nullable version of T |
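Before validating, it can help to check which dtype pandas actually assigned to each column. The following is a minimal sketch using only pandas (no Pavise calls); the frame and its column names are made up for illustration.

```python
import pandas as pd

# Illustrative frame; the column names are assumptions for this sketch.
df = pd.DataFrame({
    "user_id": [1, 2, 3],            # Python ints
    "score": [0.5, 1.5, 2.5],        # Python floats
    "active": [True, False, True],   # Python bools
})

# Map each column name to the string name of its dtype.
dtype_names = {col: str(dtype) for col, dtype in df.dtypes.items()}
print(dtype_names)
```

Comparing this mapping against the table above is a quick way to spot a column that would not match its annotation.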
pandas ExtensionDtype
You can use pandas extension dtypes directly:
import pandas as pd
class Schema(Protocol):
    category: pd.CategoricalDtype
    nullable_int: pd.Int64Dtype
    string: pd.StringDtype
validated_df = DataFrame[Schema](df)
This gives you more control over the exact dtype used.
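A schema like this presumably expects the columns to already carry those extension dtypes. One way to get there from a freshly loaded frame is `DataFrame.astype`; the sketch below uses only pandas, and the sample data is invented for illustration.

```python
import pandas as pd

# Raw data as it might arrive from a file: default dtypes only.
raw = pd.DataFrame({
    "category": ["a", "b", "a"],
    "nullable_int": [1, None, 3],
    "string": ["x", "y", None],
})

# Cast each column to the extension dtype named in the schema.
typed = raw.astype({
    "category": "category",   # pd.CategoricalDtype
    "nullable_int": "Int64",  # pd.Int64Dtype (capital "I")
    "string": "string",       # pd.StringDtype
})
print(typed.dtypes)
```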
Index Validation
Validate the index type using the special __index__ attribute:
from typing import Protocol
class Schema(Protocol):
    __index__: int  # Validates index is int64
    value: float
# Create DataFrame with integer index
df = pd.DataFrame({"value": [1.0, 2.0, 3.0]}, index=[0, 1, 2])
validated_df = DataFrame[Schema](df)
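To check an index dtype by hand before validating, `pandas.api.types.is_integer_dtype` can be applied to `df.index.dtype`. This is a pandas-only sketch and says nothing about how Pavise performs its own check; `by_label` is a made-up counterexample.

```python
import pandas as pd
from pandas.api.types import is_integer_dtype

# Integer index, as in the example above.
df = pd.DataFrame({"value": [1.0, 2.0, 3.0]}, index=[0, 1, 2])

# Hypothetical counterexample: a string-labelled index.
by_label = pd.DataFrame({"value": [1.0, 2.0]}, index=["a", "b"])

int_index_ok = is_integer_dtype(df.index.dtype)
label_index_ok = is_integer_dtype(by_label.index.dtype)
print(int_index_ok, label_index_ok)
```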
Named Index Validation
Use Annotated to validate both the index type and name:
from typing import Protocol, Annotated
class UserSchema(Protocol):
    __index__: Annotated[int, "user_id"]  # Validates type AND name
    username: str
    score: float

# Create DataFrame with named index
df = pd.DataFrame(
    {"username": ["alice", "bob"], "score": [95.0, 87.0]},
    index=pd.Index([1, 2], name="user_id")
)
validated_df = DataFrame[UserSchema](df)
# This will fail - wrong index name
df_wrong = pd.DataFrame(
    {"username": ["alice"], "score": [95.0]},
    index=pd.Index([1], name="id")  # Expected "user_id"
)
DataFrame[UserSchema](df_wrong)
# ValidationError: Index name expected 'user_id', got 'id'
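When only the index name is wrong, the frame can be repaired with pandas' `rename_axis` before validating. The sketch below uses only pandas; `df_fixed` is a name introduced here for illustration.

```python
import pandas as pd

df_wrong = pd.DataFrame(
    {"username": ["alice"], "score": [95.0]},
    index=pd.Index([1], name="id"),
)

# rename_axis returns a new frame whose index is named "user_id";
# the original frame keeps its old index name.
df_fixed = df_wrong.rename_axis("user_id")
print(df_fixed.index.name)
```

The renamed frame has the index name the schema asks for, so it is the one to pass to validation.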
MultiIndex Validation
For MultiIndex, use a tuple of types with a tuple of names:
from typing import Protocol, Annotated
class RegionalSalesSchema(Protocol):
    __index__: Annotated[tuple[str, int], ("region", "user_id")]
    sales: float
    quantity: int

# Create DataFrame with MultiIndex
df = pd.DataFrame(
    {"sales": [100.0, 200.0, 150.0], "quantity": [5, 10, 7]},
    index=pd.MultiIndex.from_tuples(
        [("East", 1), ("East", 2), ("West", 1)],
        names=["region", "user_id"]
    )
)
validated_df = DataFrame[RegionalSalesSchema](df)
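To see what information the `__index__` annotation carries, it can be unpacked with the standard `typing` helpers. This is only a sketch of how such an annotation can be read at runtime, not a description of Pavise's internals; the variable names are chosen for illustration.

```python
from typing import Annotated, Protocol, get_args, get_type_hints


class RegionalSalesSchema(Protocol):
    __index__: Annotated[tuple[str, int], ("region", "user_id")]
    sales: float
    quantity: int


# include_extras=True keeps the Annotated metadata instead of stripping it.
hints = get_type_hints(RegionalSalesSchema, include_extras=True)

# Annotated[T, meta] unpacks to (T, meta).
index_type, index_names = get_args(hints["__index__"])

# tuple[str, int] unpacks to the per-level types.
level_types = get_args(index_type)

print(level_types)
print(index_names)
```

The level types and level names line up positionally, which is why the two tuples in the annotation must have the same length.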
Nullable Types
pandas handles nullable integers specially:
from typing import Optional
class Schema(Protocol):
    value: Optional[int]
# pandas converts int to float64 when there are nulls
df = pd.DataFrame({"value": [1, 2, None]}) # dtype: float64
validated_df = DataFrame[Schema](df)
For true nullable integers, use pd.Int64Dtype:
class Schema(Protocol):
    value: pd.Int64Dtype
df = pd.DataFrame({"value": pd.array([1, 2, None], dtype=pd.Int64Dtype())})
validated_df = DataFrame[Schema](df)
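The difference between the two representations can be seen directly in pandas. The sketch below uses only pandas and shows the float64 upcast, followed by a conversion to the nullable `Int64` dtype with `astype`.

```python
import pandas as pd

# A None among integers makes pandas fall back to float64 (None becomes NaN).
df = pd.DataFrame({"value": [1, 2, None]})
float_dtype = str(df["value"].dtype)

# Casting to "Int64" (capital "I") keeps integer values and stores the gap as <NA>.
converted = df.astype({"value": "Int64"})
int_dtype = str(converted["value"].dtype)

print(float_dtype, int_dtype)
```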
Method Chaining
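Note: pandas operations do not carry Pavise type information through a method chain:
validated_df = DataFrame[UserSchema](df)
# Type information is lost after pandas operations
# numeric_only=True skips the string column "name", which cannot be averaged
result = validated_df.groupby("age").mean(numeric_only=True)  # result is not DataFrame[UserSchema]
# Re-validate if needed (ResultSchema is a Protocol describing the aggregated columns)
revalidated = DataFrame[ResultSchema](result)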
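The reason a separate schema is needed becomes clear when the shape of the aggregated frame is inspected. The following pandas-only sketch uses sample data with a repeated age so that the grouping has an effect; a schema such as the `ResultSchema` mentioned above would have to describe this new layout.

```python
import pandas as pd

df = pd.DataFrame({
    "user_id": [1, 2, 3],
    "name": ["Alice", "Bob", "Charlie"],
    "age": [25, 30, 30],
})

# Group by age and average the remaining numeric columns.
result = df.groupby("age").mean(numeric_only=True)

print(result.index.name)     # the grouping key becomes the index
print(list(result.columns))  # only numeric columns survive
print(result.dtypes)         # means are floats, even for integer inputs
```

After the aggregation, `age` is the index rather than a column, `name` is gone, and `user_id` holds floats, so the original `UserSchema` no longer describes the data.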
Performance Considerations
Validation checks all rows for type correctness, which can be slow for large DataFrames. For performance-critical code:
Validate once at system boundaries
Use type annotations without validation for internal functions
Trust the type system after initial validation
# Validate once
validated_df = DataFrame[UserSchema](raw_df)
# No validation overhead in internal functions
def process(df: DataFrame[UserSchema]) -> DataFrame[UserSchema]:
    return df
result = process(validated_df)