Basic Usage
This page covers the fundamental concepts and usage patterns of Pavise.
Design Philosophy
Pavise is designed with these principles:
Type-first design: Leverage Python’s type system for DataFrame validation
Structural subtyping: Use Protocol for flexible schema definitions
Optional runtime validation: Type checking is free, validation is opt-in
Detailed error messages: Help users quickly identify and fix issues
Type Checking vs Runtime Validation
Type Checking Only
For internal functions, use type annotations without runtime overhead:
from typing import Protocol
from pavise.pandas import DataFrame

class UserSchema(Protocol):
    user_id: int
    name: str

def internal_processing(df: DataFrame[UserSchema]) -> DataFrame[UserSchema]:
    # No validation, just type hints
    # Type checker ensures schema compliance
    return df
Runtime Validation
At system boundaries (loading from CSV, database, API), validate explicitly:
import pandas as pd
from pavise.pandas import DataFrame
# Load data from external source
raw_df = pd.read_csv("users.csv")
# Validate at boundary
validated_df = DataFrame[UserSchema](raw_df)
# Now pass to internal functions with confidence
result = internal_processing(validated_df)
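To make the boundary check more concrete, here is a minimal standard-library sketch of the kind of comparison a schema validator performs: it reads column names and types from a Protocol and compares them with the data. The helper `find_schema_problems` and the dict-of-lists stand-in for a DataFrame are hypothetical and exist only for illustration. This is not Pavise's implementation.

```python
from typing import Protocol, get_type_hints


class UserSchema(Protocol):
    user_id: int
    name: str


def find_schema_problems(columns: dict[str, list], schema: type) -> list[str]:
    """Hypothetical helper: compare plain column data with a schema's annotations."""
    problems = []
    for column, expected in get_type_hints(schema).items():
        if column not in columns:
            problems.append(f"missing column: {column}")
        elif not all(isinstance(value, expected) for value in columns[column]):
            problems.append(f"wrong type in column: {column}")
    return problems


# Plain dicts of lists stand in for data loaded from an external source.
good = {"user_id": [1, 2], "name": ["Alice", "Bob"]}
bad = {"user_id": ["1", "2"]}  # strings instead of ints, and no "name" column

good_problems = find_schema_problems(good, UserSchema)
bad_problems = find_schema_problems(bad, UserSchema)
```

Running a check like this once, where data enters the system, is what lets internal functions rely on annotations alone.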
Covariance and Structural Subtyping
Pavise uses covariant type parameters, allowing schemas with more columns to be used where fewer are expected:
class MinimalSchema(Protocol):
    user_id: int

class ExtendedSchema(Protocol):
    user_id: int
    name: str
    age: int

def process_minimal(df: DataFrame[MinimalSchema]) -> None:
    pass

extended_df: DataFrame[ExtendedSchema] = ...
process_minimal(extended_df)  # OK: ExtendedSchema is compatible
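A static type checker decides this compatibility. The sketch below mirrors the idea at runtime with the standard library only: a schema fits wherever every required column is declared with the same type. The helper `provides_columns_of` is hypothetical, and real type checkers apply more detailed rules than this simplified comparison.

```python
from typing import Protocol, get_type_hints


class MinimalSchema(Protocol):
    user_id: int


class ExtendedSchema(Protocol):
    user_id: int
    name: str
    age: int


def provides_columns_of(candidate: type, required: type) -> bool:
    """Hypothetical helper: True if `candidate` declares every column of `required`."""
    candidate_hints = get_type_hints(candidate)
    return all(
        candidate_hints.get(column) == expected
        for column, expected in get_type_hints(required).items()
    )


# A wider schema satisfies a narrower one, but not the other way around.
extended_fits_minimal = provides_columns_of(ExtendedSchema, MinimalSchema)
minimal_fits_extended = provides_columns_of(MinimalSchema, ExtendedSchema)
```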
Backend Selection
pandas Backend
from pavise.pandas import DataFrame
validated_df = DataFrame[UserSchema](pandas_df)
polars Backend
from pavise.polars import DataFrame
validated_df = DataFrame[UserSchema](polars_df)
The API is identical across backends, but each backend validates against its own dtype system.
Handling Optional Columns
Nullable Columns
Use Optional[T] for nullable columns:
from typing import Optional, Protocol

class UserSchema(Protocol):
    user_id: int
    name: str
    age: Optional[int]  # Allows None values
Note: In pandas, an integer column with the default NumPy-backed dtype is stored as float64 once it contains nulls.
In polars, all types are nullable by default.
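The pandas behavior behind this note can be seen with plain pandas, independent of Pavise. The snippet below uses only standard pandas calls. How Pavise treats each of these dtypes for an Optional[int] column is not shown here.

```python
import pandas as pd

# A default integer column is upcast to float64 once it contains a null.
with_null = pd.Series([1, 2, None])

# The nullable extension dtype "Int64" keeps integers and stores the null as pd.NA.
nullable_int = pd.Series([1, 2, None], dtype="Int64")

default_dtype = str(with_null.dtype)
extension_dtype = str(nullable_int.dtype)
null_count = int(nullable_int.isna().sum())
```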
Optional Columns
Use NotRequiredColumn[T] for columns that may not exist in the DataFrame:
from typing import Optional, Protocol
import pandas as pd
from pavise.pandas import DataFrame, NotRequiredColumn

class UserSchema(Protocol):
    user_id: int
    name: str
    age: NotRequiredColumn[int]  # Column can be missing
    email: NotRequiredColumn[Optional[str]]  # Column can be missing, or contain None

# Valid: age column is missing
df1 = pd.DataFrame({"user_id": [1], "name": ["Alice"]})
validated_df1 = DataFrame[UserSchema](df1)  # OK

# Valid: age column is present
df2 = pd.DataFrame({"user_id": [1], "name": ["Alice"], "age": [25]})
validated_df2 = DataFrame[UserSchema](df2)  # OK

# Invalid: age column is present but has wrong type
df3 = pd.DataFrame({"user_id": [1], "name": ["Alice"], "age": ["invalid"]})
DataFrame[UserSchema](df3)  # ValidationError
Key differences:
Optional[T]: Column must exist, but can contain None values
NotRequiredColumn[T]: Column can be missing, but if present, must have type T (no None allowed)
NotRequiredColumn[Optional[T]]: Column can be missing, and if present, can contain None values
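These rules separate two questions: whether a column is present, and whether its values may be null. The value-level half can be read off an annotation with the standard library, as in the sketch below. The helper `split_optional` is hypothetical and handles only the `typing.Optional` / `typing.Union` spelling. It is not part of Pavise.

```python
from typing import Optional, Union, get_args, get_origin


def split_optional(annotation: object) -> tuple:
    """Hypothetical helper: return (base_type, nullable) for an annotation."""
    args = get_args(annotation)
    if get_origin(annotation) is Union and type(None) in args:
        # Optional[T] is Union[T, None]; drop the None member to recover T.
        non_null = [arg for arg in args if arg is not type(None)]
        base = non_null[0] if len(non_null) == 1 else Union[tuple(non_null)]
        return base, True
    return annotation, False


nullable_age = split_optional(Optional[int])
plain_name = split_optional(str)
```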
Supported Types
Basic Types
int: Integer values
float: Floating point values
str: String values
bool: Boolean values
Datetime Types
datetime.datetime: Date and time values
datetime.date: Date-only values
datetime.timedelta: Time duration values
Generic Types
Optional[T]: Nullable types (see “Handling Optional Columns” above)
Literal[...]: Restricts values to specific literals
The Literal type is useful for columns that should only contain specific values:
from typing import Literal, Protocol
import pandas as pd
from pavise.pandas import DataFrame

class OrderSchema(Protocol):
    order_id: int
    status: Literal["pending", "approved", "rejected"]
    priority: Literal[1, 2, 3]

# Valid data
df = pd.DataFrame({
    "order_id": [1, 2, 3],
    "status": ["pending", "approved", "rejected"],
    "priority": [1, 2, 3]
})
validated_df = DataFrame[OrderSchema](df)  # OK

# Invalid data
df_invalid = pd.DataFrame({
    "order_id": [1],
    "status": ["invalid"],  # Not in Literal values
    "priority": [1]
})
DataFrame[OrderSchema](df_invalid)  # ValidationError
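The allowed values of a Literal annotation are available at runtime through `typing.get_args`, which is what makes this kind of membership check possible. The following standard-library sketch shows the idea. The helper `find_disallowed` is hypothetical and is not Pavise's validation code.

```python
from typing import Literal, get_args

Status = Literal["pending", "approved", "rejected"]

# get_args returns the literal values in declaration order.
allowed_statuses = get_args(Status)


def find_disallowed(values: list, literal_type: object) -> list:
    """Hypothetical helper: return the values that are not members of the Literal."""
    allowed = get_args(literal_type)
    return [value for value in values if value not in allowed]


unexpected = find_disallowed(["pending", "invalid", "approved"], Status)
```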
pandas ExtensionDtype
pandas-specific extension dtypes can be used directly:
from typing import Protocol
import pandas as pd

class Schema(Protocol):
    category: pd.CategoricalDtype
    value: pd.Int64Dtype
polars DataType
polars-specific data types can be used directly:
from typing import Protocol
import polars as pl

class Schema(Protocol):
    category: pl.Categorical
    value: pl.Int64
Creating Empty DataFrames
You can create an empty DataFrame that conforms to your schema using the make_empty() class method.
This is useful for initializing DataFrames, creating templates, or testing.
pandas Backend
from typing import Protocol
from pavise.pandas import DataFrame

class UserSchema(Protocol):
    user_id: int
    name: str
    age: int

# Create an empty DataFrame with the correct schema
empty_df = DataFrame[UserSchema].make_empty()

# Result: Empty DataFrame with columns [user_id, name, age]
# - len(empty_df) == 0
# - Columns have correct dtypes (int64, object, int64)
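For comparison, a frame with the same shape as the result described above can be built by hand in plain pandas. This sketch only illustrates what "empty with correct dtypes" means. It makes no claim about how make_empty() is implemented.

```python
import pandas as pd

# Empty Series with explicit dtypes give zero rows but typed columns.
empty_equivalent = pd.DataFrame(
    {
        "user_id": pd.Series(dtype="int64"),
        "name": pd.Series(dtype="object"),
        "age": pd.Series(dtype="int64"),
    }
)

row_count = len(empty_equivalent)
column_names = list(empty_equivalent.columns)
dtype_names = [str(dtype) for dtype in empty_equivalent.dtypes]
```

Writing this out by hand means repeating the column names and dtypes outside the schema, which is the duplication a schema-driven constructor avoids.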
polars Backend
from typing import Protocol
from pavise.polars import DataFrame

class UserSchema(Protocol):
    user_id: int
    name: str
    age: int

# Create an empty DataFrame with the correct schema
empty_df = DataFrame[UserSchema].make_empty()

# Result: Empty DataFrame with columns [user_id, name, age]
# - len(empty_df) == 0
# - Columns have correct dtypes (Int64, Utf8, Int64)
Supported Features
The make_empty() method supports all schema features:
Basic types (int, float, str, bool)
Datetime types (datetime, date, timedelta)
Optional types (Optional[T])
NotRequired columns (NotRequiredColumn[T], included as empty columns)
Literal types (uses base type)
Annotated types with validators (uses base type, validators not applied)
Backend-specific types (pandas ExtensionDtype, polars DataType)