polars Backend

The polars backend provides validation for polars DataFrames.

Installation

pip install pavise[polars]

Basic Usage

from typing import Protocol
from pavise.polars import DataFrame
import polars as pl

class UserSchema(Protocol):
    user_id: int
    name: str
    age: int

# Create a polars DataFrame
df = pl.DataFrame({
    "user_id": [1, 2, 3],
    "name": ["Alice", "Bob", "Charlie"],
    "age": [25, 30, 35]
})

# Validate
validated_df = DataFrame[UserSchema](df)

Type Mapping

Pavise maps Python types to polars dtypes:

Python Type	polars dtype
`int`	Int64
`float`	Float64
`str`	Utf8
`bool`	Boolean
`datetime`	Datetime
`date`	Date
`timedelta`	Duration
`Optional[T]`	Nullable version of T
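
To make the table concrete, here is a minimal sketch of how a Protocol's annotations could be read and translated into the dtype names listed above. It uses only the standard library and is not Pavise's actual implementation: `DTYPE_NAMES`, `describe_schema`, and `EventSchema` are hypothetical names introduced for illustration, and dtypes are represented as plain strings rather than polars objects.

```python
from datetime import date, datetime, timedelta
from typing import Optional, Protocol, Union, get_args, get_origin, get_type_hints

# Dtype names copied from the table above, kept as strings so the
# sketch runs without polars installed (an assumption for illustration).
DTYPE_NAMES = {
    int: "Int64",
    float: "Float64",
    str: "Utf8",
    bool: "Boolean",
    datetime: "Datetime",
    date: "Date",
    timedelta: "Duration",
}


def describe_schema(schema: type) -> dict[str, tuple[str, bool]]:
    """Return {column: (dtype name, allows nulls)} for a Protocol-style schema."""
    described = {}
    for column, annotation in get_type_hints(schema).items():
        nullable = False
        # Optional[T] is Union[T, None]: unwrap T and record that nulls are allowed.
        if get_origin(annotation) is Union:
            args = get_args(annotation)
            inner = [arg for arg in args if arg is not type(None)]
            nullable = len(inner) < len(args)
            annotation = inner[0]
        described[column] = (DTYPE_NAMES[annotation], nullable)
    return described


class EventSchema(Protocol):  # hypothetical schema for this sketch
    event_id: int
    label: Optional[str]
    started_at: datetime


print(describe_schema(EventSchema))
```

The sketch handles only the `Optional[T]` spelling and the types in the table; anything else would need its own entry in the mapping.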

polars DataType

You can use polars data types directly:

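from typing import Protocol

import polars as pl
from pavise.polars import DataFrame

class Schema(Protocol):
    category: pl.Categorical
    value: pl.Int64
    text: pl.Utf8

# Build a frame whose dtypes match the schema exactly
df = pl.DataFrame(
    {"category": ["a", "b"], "value": [1, 2], "text": ["x", "y"]},
    schema={"category": pl.Categorical, "value": pl.Int64, "text": pl.Utf8},
)
validated_df = DataFrame[Schema](df)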

This gives you precise control over the polars dtype.

Nullable Types

Unlike pandas dtypes, every polars dtype can hold nulls. Pavise still rejects nulls in a column unless its annotation is Optional:

from typing import Optional

class Schema(Protocol):
    value: Optional[int]  # Allows null values

df = pl.DataFrame({"value": [1, 2, None]})  # dtype: Int64 (nullable)
validated_df = DataFrame[Schema](df)

For non-nullable columns, don’t use Optional:

class Schema(Protocol):
    value: int  # No nulls allowed

df = pl.DataFrame({"value": [1, 2, None]})
validated_df = DataFrame[Schema](df)  # Raises ValueError

Performance Considerations

polars is designed for performance, and Pavise validation is fast on polars DataFrames. Validation still has a cost, though, so the usual principles apply:

Validate once at system boundaries
Use type annotations without validation for internal functions
Trust the type system after initial validation

# Validate once
validated_df = DataFrame[UserSchema](raw_df)

# No validation overhead in internal functions
def process(df: DataFrame[UserSchema]) -> DataFrame[UserSchema]:
    return df

result = process(validated_df)

LazyFrame Support

Pavise also supports polars LazyFrame for lazy evaluation workflows:

from typing import Protocol

import polars as pl
from pavise.polars import LazyFrame, DataFrame

class UserSchema(Protocol):
    user_id: int
    name: str

# Wrap a LazyFrame with schema validation
lf = pl.scan_csv("users.csv")
validated_lf = LazyFrame[UserSchema](lf)

# Schema is validated immediately (column existence and types)
# Value-based validators are applied on collect()
df: DataFrame[UserSchema] = validated_lf.collect()

LazyFrame Validation Behavior

LazyFrame validation happens in two stages:

On construction: Schema-level validation (column existence and types) using collect_schema()
On collect(): Full validation including value-based validators (Range, Unique, etc.)

from typing import Annotated, Protocol

import polars as pl
from pavise.polars import LazyFrame
from pavise.validators import Range

class UserSchema(Protocol):
    user_id: int
    age: Annotated[int, Range(0, 150)]

lf = pl.LazyFrame({"user_id": [1, 2], "age": [25, 200]})

# Schema validation passes (types are correct)
validated_lf = LazyFrame[UserSchema](lf)

# Range validation fails on collect()
df = validated_lf.collect()  # Raises ValidationError: age out of range
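
The split between the two stages can be pictured with a toy model in pure Python. This is only a sketch of the idea of deferring value checks until data is materialized, not how Pavise or polars work internally: `DeferredTable`, `load_rows`, and the check dictionaries are hypothetical names, and a list of dicts stands in for real column data.

```python
class DeferredTable:
    """Toy stand-in for a lazy frame: column types are known up front, rows load later."""

    def __init__(self, column_types, load_rows, expected_types, value_checks):
        # Stage 1: compare the declared column types without loading any rows.
        for name, expected in expected_types.items():
            if name not in column_types:
                raise TypeError(f"missing column {name!r}")
            if column_types[name] is not expected:
                raise TypeError(f"column {name!r} has the wrong type")
        self._load_rows = load_rows
        self._value_checks = value_checks

    def collect(self):
        # Stage 2: materialize the rows, then run the value-based checks.
        rows = self._load_rows()
        for name, check in self._value_checks.items():
            bad = [row[name] for row in rows if not check(row[name])]
            if bad:
                raise ValueError(f"{name} failed validation: {bad}")
        return rows


load_calls = []  # records each time the rows are actually loaded


def load_rows():
    load_calls.append("loaded")
    return [{"user_id": 1, "age": 25}, {"user_id": 2, "age": 200}]


# Construction succeeds: the types match and no rows have been read yet.
table = DeferredTable(
    column_types={"user_id": int, "age": int},
    load_rows=load_rows,
    expected_types={"user_id": int, "age": int},
    value_checks={"age": lambda age: 0 <= age <= 150},
)
calls_after_construction = len(load_calls)

# The out-of-range value is only discovered once the rows are materialized.
error = None
try:
    table.collect()
except ValueError as exc:
    error = str(exc)
print(error)
```

The practical consequence is the same as in the example above: a successfully constructed lazy object guarantees the schema, while value-level guarantees hold only after collect() returns.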

Creating Empty LazyFrames

Like DataFrame, LazyFrame supports make_empty():

empty_lf = LazyFrame[UserSchema].make_empty()
# Returns LazyFrame with correct schema but no rows

Differences from pandas Backend

Nullable types: polars dtypes are nullable by default; pandas dtypes are not
Type system: polars has a richer type system (e.g., Categorical, Utf8)
Performance: validation on the polars backend is generally faster than on the pandas backend
Index: polars doesn’t have an index concept, so __index__ validation is not supported
LazyFrame: polars backend supports LazyFrame for lazy evaluation workflows

Method Chaining

polars operations return new DataFrames instead of mutating in place, but the result is a plain polars DataFrame, so the schema type information is still lost:

validated_df = DataFrame[UserSchema](df)

# Type information is lost after polars operations
result = validated_df.group_by("age").mean()  # result is not DataFrame[UserSchema]

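# Re-validate if needed. ResultSchema is a separate Protocol that you
# define to describe the columns of the aggregated result.
revalidated = DataFrame[ResultSchema](result)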