pandas Backend ============== The pandas backend provides validation for pandas DataFrames. Installation ------------ .. code-block:: bash pip install pavise[pandas] Basic Usage ----------- .. code-block:: python from typing import Protocol from pavise.pandas import DataFrame import pandas as pd class UserSchema(Protocol): user_id: int name: str age: int # Create a pandas DataFrame df = pd.DataFrame({ "user_id": [1, 2, 3], "name": ["Alice", "Bob", "Charlie"], "age": [25, 30, 35] }) # Validate validated_df = DataFrame[UserSchema](df) Type Mapping ------------ Pavise maps Python types to pandas dtypes: ================ ===================== Python Type pandas dtype ================ ===================== ``int`` int64 ``float`` float64 ``str`` object (str) ``bool`` bool ``datetime`` datetime64[ns] ``date`` datetime64[ns] ``timedelta`` timedelta64[ns] ``Optional[T]`` Nullable version of T ================ ===================== pandas ExtensionDtype --------------------- You can use pandas extension dtypes directly: .. code-block:: python import pandas as pd class Schema(Protocol): category: pd.CategoricalDtype nullable_int: pd.Int64Dtype string: pd.StringDtype validated_df = DataFrame[Schema](df) This gives you more control over the exact dtype used. Index Validation ---------------- Validate the index type using the special ``__index__`` attribute: .. code-block:: python from typing import Protocol class Schema(Protocol): __index__: int # Validates index is int64 value: float # Create DataFrame with integer index df = pd.DataFrame({"value": [1.0, 2.0, 3.0]}, index=[0, 1, 2]) validated_df = DataFrame[Schema](df) Named Index Validation ^^^^^^^^^^^^^^^^^^^^^^ Use ``Annotated`` to validate both the index type and name: .. code-block:: python from typing import Protocol, Annotated class UserSchema(Protocol): __index__: Annotated[int, "user_id"] # Validates type AND name username: str score: float # Create DataFrame with named index df = pd.DataFrame( {"username": ["alice", "bob"], "score": [95.0, 87.0]}, index=pd.Index([1, 2], name="user_id") ) validated_df = DataFrame[UserSchema](df) # This will fail - wrong index name df_wrong = pd.DataFrame( {"username": ["alice"], "score": [95.0]}, index=pd.Index([1], name="id") # Expected "user_id" ) # ValidationError: Index name expected 'user_id', got 'id' MultiIndex Validation ^^^^^^^^^^^^^^^^^^^^^ For MultiIndex, use a tuple of types with a tuple of names: .. code-block:: python from typing import Protocol, Annotated class RegionalSalesSchema(Protocol): __index__: Annotated[tuple[str, int], ("region", "user_id")] sales: float quantity: int # Create DataFrame with MultiIndex df = pd.DataFrame( {"sales": [100.0, 200.0, 150.0], "quantity": [5, 10, 7]}, index=pd.MultiIndex.from_tuples( [("East", 1), ("East", 2), ("West", 1)], names=["region", "user_id"] ) ) validated_df = DataFrame[RegionalSalesSchema](df) Nullable Types -------------- pandas handles nullable integers specially: .. code-block:: python from typing import Optional class Schema(Protocol): value: Optional[int] # pandas converts int to float64 when there are nulls df = pd.DataFrame({"value": [1, 2, None]}) # dtype: float64 validated_df = DataFrame[Schema](df) For true nullable integers, use ``pd.Int64Dtype``: .. code-block:: python class Schema(Protocol): value: pd.Int64Dtype df = pd.DataFrame({"value": pd.array([1, 2, None], dtype=pd.Int64Dtype())}) validated_df = DataFrame[Schema](df) Method Chaining --------------- Note: pandas method chaining may lose Pavise type information: .. code-block:: python validated_df = DataFrame[UserSchema](df) # Type information is lost after pandas operations result = validated_df.groupby("age").mean() # result is not DataFrame[UserSchema] # Re-validate if needed revalidated = DataFrame[ResultSchema](result) Performance Considerations -------------------------- Validation checks all rows for type correctness, which can be slow for large DataFrames. For performance-critical code: 1. Validate once at system boundaries 2. Use type annotations without validation for internal functions 3. Trust the type system after initial validation .. code-block:: python # Validate once validated_df = DataFrame[UserSchema](raw_df) # No validation overhead in internal functions def process(df: DataFrame[UserSchema]) -> DataFrame[UserSchema]: return df result = process(validated_df)