polars Backend ============== The polars backend provides validation for polars DataFrames. Installation ------------ .. code-block:: bash pip install pavise[polars] Basic Usage ----------- .. code-block:: python from typing import Protocol from pavise.polars import DataFrame import polars as pl class UserSchema(Protocol): user_id: int name: str age: int # Create a polars DataFrame df = pl.DataFrame({ "user_id": [1, 2, 3], "name": ["Alice", "Bob", "Charlie"], "age": [25, 30, 35] }) # Validate validated_df = DataFrame[UserSchema](df) Type Mapping ------------ Pavise maps Python types to polars dtypes: ================ ===================== Python Type polars dtype ================ ===================== ``int`` Int64 ``float`` Float64 ``str`` Utf8 ``bool`` Boolean ``datetime`` Datetime ``date`` Date ``timedelta`` Duration ``Optional[T]`` Nullable version of T ================ ===================== polars DataType --------------- You can use polars data types directly: .. code-block:: python import polars as pl class Schema(Protocol): category: pl.Categorical value: pl.Int64 text: pl.Utf8 validated_df = DataFrame[Schema](df) This gives you precise control over the polars dtype. Nullable Types -------------- Unlike pandas, polars types are nullable by default: .. code-block:: python from typing import Optional class Schema(Protocol): value: Optional[int] # Allows null values df = pl.DataFrame({"value": [1, 2, None]}) # dtype: Int64 (nullable) validated_df = DataFrame[Schema](df) For non-nullable columns, don't use Optional: .. code-block:: python class Schema(Protocol): value: int # No nulls allowed df = pl.DataFrame({"value": [1, 2, None]}) validated_df = DataFrame[Schema](df) # Raises ValueError Performance Considerations -------------------------- polars is designed for performance, and Pavise validation is fast on polars DataFrames. However, the same principles apply: 1. Validate once at system boundaries 2. Use type annotations without validation for internal functions 3. Trust the type system after initial validation .. code-block:: python # Validate once validated_df = DataFrame[UserSchema](raw_df) # No validation overhead in internal functions def process(df: DataFrame[UserSchema]) -> DataFrame[UserSchema]: return df result = process(validated_df) LazyFrame Support ----------------- Pavise also supports polars LazyFrame for lazy evaluation workflows: .. code-block:: python from pavise.polars import LazyFrame, DataFrame class UserSchema(Protocol): user_id: int name: str # Wrap a LazyFrame with schema validation lf = pl.scan_csv("users.csv") validated_lf = LazyFrame[UserSchema](lf) # Schema is validated immediately (column existence and types) # Value-based validators are applied on collect() df: DataFrame[UserSchema] = validated_lf.collect() LazyFrame Validation Behavior ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ LazyFrame validation happens in two stages: 1. **On construction**: Schema-level validation (column existence and types) using ``collect_schema()`` 2. **On collect()**: Full validation including value-based validators (Range, Unique, etc.) .. code-block:: python from typing import Annotated from pavise.validators import Range class UserSchema(Protocol): user_id: int age: Annotated[int, Range(0, 150)] lf = pl.LazyFrame({"user_id": [1, 2], "age": [25, 200]}) # Schema validation passes (types are correct) validated_lf = LazyFrame[UserSchema](lf) # Range validation fails on collect() df = validated_lf.collect() # Raises ValidationError: age out of range Creating Empty LazyFrames ~~~~~~~~~~~~~~~~~~~~~~~~~ Like DataFrame, LazyFrame supports ``make_empty()``: .. code-block:: python empty_lf = LazyFrame[UserSchema].make_empty() # Returns LazyFrame with correct schema but no rows Differences from pandas Backend -------------------------------- 1. **Nullable types**: polars types are nullable by default, pandas are not 2. **Type system**: polars has a richer type system (e.g., Categorical, Utf8) 3. **Performance**: polars validation is generally faster 4. **Index**: polars doesn't have an index concept, so ``__index__`` validation is not supported 5. **LazyFrame**: polars backend supports LazyFrame for lazy evaluation workflows Method Chaining --------------- polars preserves immutability, but type information is still lost: .. code-block:: python validated_df = DataFrame[UserSchema](df) # Type information is lost after polars operations result = validated_df.group_by("age").mean() # result is not DataFrame[UserSchema] # Re-validate if needed revalidated = DataFrame[ResultSchema](result)