Basic Usage
===========

This page covers the fundamental concepts and usage patterns of Pavise.

Design Philosophy
-----------------

Pavise is designed with these principles:

1. **Type-first design**: Leverage Python's type system for DataFrame validation
2. **Structural subtyping**: Use Protocol for flexible schema definitions
3. **Optional runtime validation**: Static type checking adds no runtime cost; runtime validation is opt-in
4. **Detailed error messages**: Help users quickly identify and fix issues

Type Checking vs Runtime Validation
------------------------------------

Type Checking Only
~~~~~~~~~~~~~~~~~~

For internal functions, use type annotations without runtime overhead:

.. code-block:: python

   from typing import Protocol
   from pavise.pandas import DataFrame

   class UserSchema(Protocol):
       user_id: int
       name: str

   def internal_processing(df: DataFrame[UserSchema]) -> DataFrame[UserSchema]:
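       # No runtime validation happens here; the annotations are type hints only.
       # A static type checker verifies that callers pass a DataFrame
       # annotated with a compatible schema.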
       return df

Runtime Validation
~~~~~~~~~~~~~~~~~~

At system boundaries (loading from CSV, database, API), validate explicitly:

.. code-block:: python

   import pandas as pd
   from pavise.pandas import DataFrame

   # UserSchema and internal_processing are defined in the previous example.

   # Load data from an external source
   raw_df = pd.read_csv("users.csv")

   # Validate at boundary
   validated_df = DataFrame[UserSchema](raw_df)

   # Now pass to internal functions with confidence
   result = internal_processing(validated_df)

Covariance and Structural Subtyping
------------------------------------

``DataFrame`` is covariant in its schema parameter, and schemas are structural ``Protocol`` types, so a DataFrame whose schema declares more columns can be used where one with fewer columns is expected:

.. code-block:: python

   from typing import Protocol
   from pavise.pandas import DataFrame

   class MinimalSchema(Protocol):
       user_id: int

   class ExtendedSchema(Protocol):
       user_id: int
       name: str
       age: int

   def process_minimal(df: DataFrame[MinimalSchema]) -> None:
       pass

   extended_df: DataFrame[ExtendedSchema] = ...
   process_minimal(extended_df)  # OK: ExtendedSchema is compatible
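
The compatibility rule can be illustrated with the standard library alone. The sketch below is not how Pavise or a type checker implements the check; ``is_structurally_compatible`` is a hypothetical helper that compares the annotations of two schema classes to show why ``ExtendedSchema`` satisfies ``MinimalSchema`` but not the reverse.

.. code-block:: python

   from typing import Protocol, get_type_hints


   class MinimalSchema(Protocol):
       user_id: int


   class ExtendedSchema(Protocol):
       user_id: int
       name: str
       age: int


   def is_structurally_compatible(candidate: type, required: type) -> bool:
       """Hypothetical helper: True if ``candidate`` declares every
       column of ``required`` with the same annotated type."""
       candidate_cols = get_type_hints(candidate)
       required_cols = get_type_hints(required)
       return all(
           name in candidate_cols and candidate_cols[name] == col_type
           for name, col_type in required_cols.items()
       )


   # More columns can stand in for fewer, but not the other way round.
   extended_ok = is_structurally_compatible(ExtendedSchema, MinimalSchema)
   minimal_ok = is_structurally_compatible(MinimalSchema, ExtendedSchema)

Here ``extended_ok`` is ``True`` and ``minimal_ok`` is ``False``, which mirrors what a static type checker reports for the ``process_minimal`` call above.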

Backend Selection
-----------------

pandas Backend
~~~~~~~~~~~~~~

.. code-block:: python

   from pavise.pandas import DataFrame

   validated_df = DataFrame[UserSchema](pandas_df)

polars Backend
~~~~~~~~~~~~~~

.. code-block:: python

   from pavise.polars import DataFrame

   validated_df = DataFrame[UserSchema](polars_df)

The API is identical across backends, but each backend validates columns against its own type system (pandas dtypes or polars data types).

Handling Optional Columns
--------------------------

Nullable Columns
~~~~~~~~~~~~~~~~

Use ``Optional[T]`` for nullable columns:

.. code-block:: python

   from typing import Optional, Protocol

   class UserSchema(Protocol):
       user_id: int
       name: str
       age: Optional[int]  # Allows None values

Note: In pandas, an integer column that contains nulls is stored as ``float64`` unless a nullable extension dtype such as ``Int64`` is used.
In polars, every data type is nullable.
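
The pandas behaviour can be seen without Pavise. The following snippet uses only plain pandas calls and shows the dtype pandas infers for an integer column with a missing value, next to the same data stored with the nullable ``Int64`` extension dtype.

.. code-block:: python

   import pandas as pd

   # A default integer column is upcast to float64 once it holds a missing value.
   with_null = pd.Series([1, 2, None])
   inferred_dtype = str(with_null.dtype)

   # Requesting the nullable extension dtype keeps integer values and uses pd.NA.
   nullable = pd.Series([1, 2, None], dtype="Int64")
   extension_dtype = str(nullable.dtype)

After running this, ``inferred_dtype`` is ``"float64"`` and ``extension_dtype`` is ``"Int64"``.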

Optional Columns
~~~~~~~~~~~~~~~~

Use ``NotRequiredColumn[T]`` for columns that may not exist in the DataFrame:

.. code-block:: python

   from typing import Optional, Protocol

   import pandas as pd
   from pavise.pandas import DataFrame, NotRequiredColumn

   class UserSchema(Protocol):
       user_id: int
       name: str
       age: NotRequiredColumn[int]  # Column can be missing
       email: NotRequiredColumn[Optional[str]]  # Column can be missing, or contain None

   # Valid: age column is missing
   df1 = pd.DataFrame({"user_id": [1], "name": ["Alice"]})
   validated_df1 = DataFrame[UserSchema](df1)  # OK

   # Valid: age column is present
   df2 = pd.DataFrame({"user_id": [1], "name": ["Alice"], "age": [25]})
   validated_df2 = DataFrame[UserSchema](df2)  # OK

   # Invalid: age column is present but has the wrong type
   df3 = pd.DataFrame({"user_id": [1], "name": ["Alice"], "age": ["invalid"]})
   DataFrame[UserSchema](df3)  # ValidationError

Key differences:

* ``Optional[T]``: Column must exist, but can contain ``None`` values
* ``NotRequiredColumn[T]``: Column can be missing, but if present, must have type ``T`` (no ``None`` allowed)
* ``NotRequiredColumn[Optional[T]]``: Column can be missing, and if present, can contain ``None`` values
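
The presence and nullability rules above can be restated as a small decision function. This is only an illustration over a plain dict of lists, not Pavise code: ``column_is_valid`` and its ``required`` and ``nullable`` flags are hypothetical names, and the sketch deliberately ignores the element type check.

.. code-block:: python

   from typing import Any, Dict, List


   def column_is_valid(
       table: Dict[str, List[Any]],
       name: str,
       *,
       required: bool,
       nullable: bool,
   ) -> bool:
       """Hypothetical helper mirroring the presence/None rules only."""
       if name not in table:
           # A missing column is acceptable only when it is not required.
           return not required
       if nullable:
           return True
       # A present, non-nullable column must not contain None.
       return all(value is not None for value in table[name])


   table = {"user_id": [1, 2], "age": [30, None]}

   # Optional[T]: column must exist, None is allowed.
   optional_present = column_is_valid(table, "age", required=True, nullable=True)
   optional_missing = column_is_valid(table, "email", required=True, nullable=True)

   # NotRequiredColumn[T]: column may be missing, None is not allowed.
   not_required_missing = column_is_valid(table, "email", required=False, nullable=False)
   not_required_with_none = column_is_valid(table, "age", required=False, nullable=False)

   # NotRequiredColumn[Optional[T]]: column may be missing, None is allowed.
   both_relaxed = column_is_valid(table, "age", required=False, nullable=True)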

Supported Types
---------------

Basic Types
~~~~~~~~~~~

* ``int``: Integer values
* ``float``: Floating point values
* ``str``: String values
* ``bool``: Boolean values

Datetime Types
~~~~~~~~~~~~~~

* ``datetime.datetime``: Date and time values
* ``datetime.date``: Date-only values
* ``datetime.timedelta``: Time duration values

Generic Types
~~~~~~~~~~~~~

* ``Optional[T]``: Nullable types (see "Handling Optional Columns" above)
* ``Literal[...]``: Restricts values to specific literals

The ``Literal`` type is useful for columns that should only contain specific values:

.. code-block:: python

   from typing import Literal, Protocol

   import pandas as pd
   from pavise.pandas import DataFrame

   class OrderSchema(Protocol):
       order_id: int
       status: Literal["pending", "approved", "rejected"]
       priority: Literal[1, 2, 3]

   # Valid data
   df = pd.DataFrame({
       "order_id": [1, 2, 3],
       "status": ["pending", "approved", "rejected"],
       "priority": [1, 2, 3]
   })
   validated_df = DataFrame[OrderSchema](df)  # OK

   # Invalid data
   df_invalid = pd.DataFrame({
       "order_id": [1],
       "status": ["invalid"],  # Not in Literal values
       "priority": [1]
   })
   DataFrame[OrderSchema](df_invalid)  # ValidationError
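
For intuition, the allowed values of a ``Literal`` annotation can be read back at runtime with ``typing.get_args``. The sketch below uses only the standard library and is not taken from Pavise; the ``Status`` alias and the membership check are illustrative assumptions about how such a constraint can be evaluated.

.. code-block:: python

   from typing import Literal, get_args

   # Hypothetical alias for the status column annotation.
   Status = Literal["pending", "approved", "rejected"]

   # get_args returns the literal values in declaration order.
   allowed = set(get_args(Status))

   values = ["pending", "invalid", "approved"]
   unexpected = [value for value in values if value not in allowed]

Here ``unexpected`` is ``["invalid"]``, the kind of value a detailed error message would need to report.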

pandas ExtensionDtype
~~~~~~~~~~~~~~~~~~~~~

pandas-specific extension dtypes can be used directly:

.. code-block:: python

   from typing import Protocol

   import pandas as pd

   class Schema(Protocol):
       category: pd.CategoricalDtype
       value: pd.Int64Dtype

polars DataType
~~~~~~~~~~~~~~~

polars-specific data types can be used directly:

.. code-block:: python

   from typing import Protocol

   import polars as pl

   class Schema(Protocol):
       category: pl.Categorical
       value: pl.Int64

Creating Empty DataFrames
--------------------------

You can create an empty DataFrame that conforms to your schema using the ``make_empty()`` class method.
This is useful for initializing DataFrames, creating templates, or testing.

pandas Backend
~~~~~~~~~~~~~~

.. code-block:: python

   from typing import Protocol
   from pavise.pandas import DataFrame

   class UserSchema(Protocol):
       user_id: int
       name: str
       age: int

   # Create an empty DataFrame with the correct schema
   empty_df = DataFrame[UserSchema].make_empty()

   # Result: Empty DataFrame with columns [user_id, name, age]
   # - len(empty_df) == 0
   # - Columns have correct dtypes (int64, object, int64)
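
For comparison, a frame with the same shape and dtypes can be built by hand in plain pandas. This is a sketch of the described result, not of how ``make_empty()`` is implemented; the variable name ``manual_empty`` is illustrative.

.. code-block:: python

   import pandas as pd

   # One empty Series per column, each with an explicit dtype.
   manual_empty = pd.DataFrame(
       {
           "user_id": pd.Series(dtype="int64"),
           "name": pd.Series(dtype="object"),
           "age": pd.Series(dtype="int64"),
       }
   )

   row_count = len(manual_empty)
   column_dtypes = [str(dtype) for dtype in manual_empty.dtypes]

After running this, ``row_count`` is ``0`` and ``column_dtypes`` is ``["int64", "object", "int64"]``.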

polars Backend
~~~~~~~~~~~~~~

.. code-block:: python

   from typing import Protocol
   from pavise.polars import DataFrame

   class UserSchema(Protocol):
       user_id: int
       name: str
       age: int

   # Create an empty DataFrame with the correct schema
   empty_df = DataFrame[UserSchema].make_empty()

   # Result: Empty DataFrame with columns [user_id, name, age]
   # - len(empty_df) == 0
   # - Columns have correct dtypes (Int64, Utf8, Int64; Utf8 is named String in newer polars releases)

Supported Features
~~~~~~~~~~~~~~~~~~

The ``make_empty()`` method supports all schema features:

* Basic types (int, float, str, bool)
* Datetime types (datetime, date, timedelta)
* Optional types (``Optional[T]``)
* NotRequired columns (``NotRequiredColumn[T]`` - included as empty columns)
* Literal types (uses base type)
* Annotated types with validators (uses base type, validators not applied)
* Backend-specific types (pandas ExtensionDtype, polars DataType)