polars Backend
The polars backend provides validation for polars DataFrames.
Installation
pip install pavise[polars]
Basic Usage
from typing import Protocol
from pavise.polars import DataFrame
import polars as pl
class UserSchema(Protocol):
    user_id: int
    name: str
    age: int
# Create a polars DataFrame
df = pl.DataFrame({
    "user_id": [1, 2, 3],
    "name": ["Alice", "Bob", "Charlie"],
    "age": [25, 30, 35]
})
# Validate
validated_df = DataFrame[UserSchema](df)
Type Mapping
Pavise maps Python types to polars dtypes:
| Python Type | polars dtype |
|---|---|
| int | Int64 |
| float | Float64 |
| str | Utf8 |
| bool | Boolean |
| datetime | Datetime |
| date | Date |
| timedelta | Duration |
| Optional[T] | Nullable version of T |
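As a rough illustration of how annotations could be resolved into dtype names, the sketch below reads a Protocol's type hints with the standard library and looks each one up in a dictionary. It is not Pavise's implementation: `DTYPE_NAMES` and `schema_to_dtype_names` are hypothetical names, the dtypes are plain strings rather than polars objects, and the Python-side keys are the assumed standard types (`int`, `float`, `str`, `bool`, `datetime`, `date`, `timedelta`).

```python
from datetime import date, datetime, timedelta
from typing import Optional, Protocol, Union, get_args, get_origin, get_type_hints

# Hypothetical lookup table mirroring the mapping above (dtype names as strings).
DTYPE_NAMES = {
    int: "Int64",
    float: "Float64",
    str: "Utf8",
    bool: "Boolean",
    datetime: "Datetime",
    date: "Date",
    timedelta: "Duration",
}


def schema_to_dtype_names(schema: type) -> dict[str, tuple[str, bool]]:
    """Return {column: (dtype_name, nullable)} for a Protocol-style schema."""
    result = {}
    for column, annotation in get_type_hints(schema).items():
        nullable = False
        args = get_args(annotation)
        # Optional[T] is Union[T, None]; unwrap it and remember the null flag.
        if get_origin(annotation) is Union and len(args) == 2 and type(None) in args:
            annotation = next(a for a in args if a is not type(None))
            nullable = True
        result[column] = (DTYPE_NAMES[annotation], nullable)
    return result


class UserSchema(Protocol):
    user_id: int
    name: str
    score: Optional[float]


mapping = schema_to_dtype_names(UserSchema)
```

Keeping the lookup in a single dictionary makes an unsupported annotation fail with a `KeyError` instead of being silently accepted.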
polars DataType
You can use polars data types directly:
import polars as pl
class Schema(Protocol):
    category: pl.Categorical
    value: pl.Int64
    text: pl.Utf8
validated_df = DataFrame[Schema](df)
This gives you precise control over the polars dtype.
Nullable Types
Unlike pandas dtypes, every polars dtype can hold null values natively. Pavise still rejects nulls unless the column is annotated as Optional:
from typing import Optional
class Schema(Protocol):
    value: Optional[int]  # Allows null values
df = pl.DataFrame({"value": [1, 2, None]}) # dtype: Int64 (nullable)
validated_df = DataFrame[Schema](df)
For non-nullable columns, don’t use Optional:
class Schema(Protocol):
    value: int  # No nulls allowed
df = pl.DataFrame({"value": [1, 2, None]})
validated_df = DataFrame[Schema](df) # Raises ValueError
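The rule above (nulls are accepted only under `Optional`) can be sketched in plain Python. This is an illustration under stated assumptions, not Pavise's code: `allows_null` and `check_nulls` are hypothetical helpers, and columns are modeled as ordinary lists instead of polars Series.

```python
from typing import Optional, Protocol, Union, get_args, get_origin, get_type_hints


def allows_null(annotation: object) -> bool:
    """True when the annotation is Optional[T], i.e. Union[T, None]."""
    return get_origin(annotation) is Union and type(None) in get_args(annotation)


def check_nulls(schema: type, columns: dict[str, list]) -> None:
    """Raise ValueError if a column not annotated as Optional contains None."""
    for name, annotation in get_type_hints(schema).items():
        if not allows_null(annotation) and any(v is None for v in columns[name]):
            raise ValueError(f"column {name!r} contains null values")


class Strict(Protocol):
    value: int


class Lenient(Protocol):
    value: Optional[int]


data = {"value": [1, 2, None]}
check_nulls(Lenient, data)  # passes: Optional[int] permits None

try:
    check_nulls(Strict, data)
    strict_error = None
except ValueError as exc:
    strict_error = str(exc)  # the None in "value" is rejected
```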
Performance Considerations
polars is designed for performance, and Pavise validation is fast on polars DataFrames. Validation still adds overhead, though, so the usual guidelines apply:
Validate once at system boundaries
Use type annotations without validation for internal functions
Trust the type system after initial validation
# Validate once
validated_df = DataFrame[UserSchema](raw_df)
# No validation overhead in internal functions
def process(df: DataFrame[UserSchema]) -> DataFrame[UserSchema]:
    return df
result = process(validated_df)
LazyFrame Support
Pavise also supports polars LazyFrame for lazy evaluation workflows:
from typing import Protocol
import polars as pl
from pavise.polars import LazyFrame, DataFrame
class UserSchema(Protocol):
    user_id: int
    name: str
# Wrap a LazyFrame with schema validation
lf = pl.scan_csv("users.csv")
validated_lf = LazyFrame[UserSchema](lf)
# Schema is validated immediately (column existence and types)
# Value-based validators are applied on collect()
df: DataFrame[UserSchema] = validated_lf.collect()
LazyFrame Validation Behavior
LazyFrame validation happens in two stages:
On construction: Schema-level validation (column existence and types) using collect_schema()
On collect(): Full validation including value-based validators (Range, Unique, etc.)
from typing import Annotated
from pavise.validators import Range
class UserSchema(Protocol):
    user_id: int
    age: Annotated[int, Range(0, 150)]
lf = pl.LazyFrame({"user_id": [1, 2], "age": [25, 200]})
# Schema validation passes (types are correct)
validated_lf = LazyFrame[UserSchema](lf)
# Range validation fails on collect()
df = validated_lf.collect() # Raises ValidationError: age out of range
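The two-stage behavior can be modeled without polars. The sketch below is a toy, not Pavise's implementation: `LazySketch` is a hypothetical class, the schema is a plain dict of Python types, and rows are produced by a callable so that nothing is loaded until `collect()`.

```python
from typing import Callable


class LazySketch:
    """Toy stand-in for a lazy frame: schema is known up front, rows come later."""

    def __init__(
        self,
        schema: dict[str, type],
        load_rows: Callable[[], list[dict]],
        expected: dict[str, type],
        value_checks: dict[str, Callable[[object], bool]],
    ) -> None:
        # Stage 1: compare column names and types without loading any rows.
        for column, expected_type in expected.items():
            if column not in schema:
                raise TypeError(f"missing column {column!r}")
            if schema[column] is not expected_type:
                raise TypeError(f"column {column!r} has the wrong type")
        self._load_rows = load_rows
        self._value_checks = value_checks

    def collect(self) -> list[dict]:
        # Stage 2: materialize the rows, then run the value-based checks.
        rows = self._load_rows()
        for column, check in self._value_checks.items():
            if not all(check(row[column]) for row in rows):
                raise ValueError(f"column {column!r} failed its value check")
        return rows


lazy = LazySketch(
    schema={"user_id": int, "age": int},
    load_rows=lambda: [{"user_id": 1, "age": 25}, {"user_id": 2, "age": 200}],
    expected={"user_id": int, "age": int},
    value_checks={"age": lambda v: 0 <= v <= 150},
)  # construction succeeds: names and types match

try:
    lazy.collect()
    collect_error = None
except ValueError as exc:
    collect_error = str(exc)  # the out-of-range age is only caught here
```

Splitting the checks this way keeps construction cheap, since value checks need the data while schema checks do not.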
Creating Empty LazyFrames
Like DataFrame, LazyFrame supports make_empty():
empty_lf = LazyFrame[UserSchema].make_empty()
# Returns LazyFrame with correct schema but no rows
Differences from pandas Backend
Nullable types: polars types are nullable by default, pandas are not
Type system: polars has a richer type system (e.g., Categorical, Utf8)
Performance: polars validation is generally faster
Index: polars doesn’t have an index concept, so __index__ validation is not supported
LazyFrame: polars backend supports LazyFrame for lazy evaluation workflows
Method Chaining
polars operations return new DataFrames instead of mutating the original, but the result still loses its schema type information:
validated_df = DataFrame[UserSchema](df)
# Type information is lost after polars operations
result = validated_df.group_by("age").mean() # result is not DataFrame[UserSchema]
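# Re-validate if needed, against a schema that describes the new columns
# (ResultSchema is a separate Protocol, not shown here)
revalidated = DataFrame[ResultSchema](result)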