Advanced

[1]:
%load_ext autoreload
%autoreload 2

import sys
sys.path.append("../")

Subclassing schemas

Subclassing schemas is a useful pattern for pipelines where each successive function adds a few columns.

[2]:
from strictly_typed_pandas import DataSet

class SchemaA:
    name: str

class SchemaB(SchemaA):
    id: int

df = DataSet[SchemaA]({"name": ["John", "Jane", "Jack"]})

def foo(df: DataSet[SchemaA]) -> DataSet[SchemaB]:
    return (
        df.assign(id=lambda df: range(df.shape[0]))
        .pipe(DataSet[SchemaB])
    )
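Stripped of the schema machinery, the assign-then-pipe step above is ordinary pandas: the new column is computed from the row count. A plain-pandas sketch (assumes only that pandas is installed):

```python
import pandas as pd

# Plain-pandas view of the pipeline step: .assign adds the new column,
# computed from the frame's row count.
df = pd.DataFrame({"name": ["John", "Jane", "Jack"]})
df = df.assign(id=lambda df: range(df.shape[0]))
print(df["id"].tolist())  # [0, 1, 2]
```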

Similarly, you can use it when merging (or joining or concatenating) two datasets together.

[3]:
class SchemaA:
    id: int
    name: str

class SchemaB:
    id: int
    job: str

class SchemaAB(SchemaA, SchemaB):
    pass

df1 = DataSet[SchemaA]({"id": [1, 2, 3], "name": ["John", "Jane", "Jack"]})
df2 = DataSet[SchemaB]({"id": [1, 2, 3], "job": "Data Scientist"})
(
    df1.merge(df2, on="id")
    .pipe(DataSet[SchemaAB])
)
[3]:
id name job
0 1 John Data Scientist
1 2 Jane Data Scientist
2 3 Jack Data Scientist

Creating an empty DataSet

Sometimes it’s useful to create a DataSet without any rows. This can be easily done as follows:

[4]:
class Schema:
    id: int
    name: str

DataSet[Schema]()
[4]:
id name
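In plain pandas terms (a sketch, not the library's actual implementation), an empty typed DataSet amounts to a zero-row DataFrame whose columns already carry explicit dtypes:

```python
import pandas as pd

# Zero rows, but each column already has a declared dtype.
df = pd.DataFrame(
    {
        "id": pd.Series(dtype="int64"),
        "name": pd.Series(dtype="object"),  # plain str columns are stored as object
    }
)
print(df.shape)   # (0, 2)
print(df.dtypes)
```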

Support for numpy and pandas data types

We also support numpy types and pandas types, as well as typing.Any. If you're missing support for any other data type, drop us a line and we'll see if we can add it!

[5]:
import numpy as np
import pandas as pd
from typing import Any

class Schema:
    name: pd.StringDtype
    money: np.float64
    eggs: np.int64
    potatoes: Any

df = DataSet[Schema](
    {
        "name": pd.Series(["John", "Jane", "Jack"], dtype="string"),
        "money": pd.Series([100.50, 1000.23, 123.45], dtype=np.float64),
        "eggs": pd.Series([1, 2, 3], dtype=np.int64),
        "potatoes": ["1", 0, np.nan]
    }
)

df.dtypes
[5]:
name         string
money       float64
eggs          int64
potatoes     object
dtype: object
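The difference between a plain str column and pd.StringDtype shows up in the dtype pandas reports. A quick sketch (assuming pandas >= 1.0, where the nullable string dtype was introduced):

```python
import pandas as pd

s_obj = pd.Series(["John", "Jane"])                  # default: object dtype
s_str = pd.Series(["John", "Jane"], dtype="string")  # pandas nullable StringDtype
print(s_obj.dtype)  # object
print(s_str.dtype)  # string
```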

IndexedDataSet

If you’d like to also strictly type the index, you can use the IndexedDataSet class.

[6]:
from strictly_typed_pandas import IndexedDataSet

class IndexSchema:
    id: int
    job: str

class DataSchema:
    name: str

df = (
    pd.DataFrame({"id": [1, 2, 3], "name": ["John", "Jane", "Jack"], "job": "Data Scientist"})
    .set_index(["id", "job"])
    .pipe(IndexedDataSet[IndexSchema, DataSchema])
)
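The set_index call above is plain pandas. As a sketch of what IndexedDataSet then checks, the frame's MultiIndex carries the index columns while the remaining columns hold the data:

```python
import pandas as pd

df = (
    pd.DataFrame({"id": [1, 2, 3], "name": ["John", "Jane", "Jack"], "job": "Data Scientist"})
    .set_index(["id", "job"])
)
# The two index columns live on the MultiIndex; "name" is the only data column left.
print(list(df.index.names))  # ['id', 'job']
print(list(df.columns))      # ['name']
```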

Reusing a variable (e.g. df) with different schemas

Sometimes when building a pipeline, it's useful to reuse a variable (e.g. df) with different schemas. If we do that in the following way, however, we'll get a mypy error.

[7]:
class SchemaA:
    name: str

class SchemaB(SchemaA):
    id: int

def foo(df: DataSet[SchemaA]) -> DataSet[SchemaB]:
    return (
        df.assign(id=1)
        .pipe(DataSet[SchemaB])
    )

df = DataSet[SchemaA]({"name": ["John", "Jane", "Jack"]})
df = foo(df)
# mypy(error): Incompatible types in assignment (expression has type "DataSet[SchemaB]", variable has type "DataSet[SchemaA]")

To avoid this error, we need to declare that df will be of the type DataSet (implying that the schema may differ at different points in the pipeline).

[8]:
df: DataSet
df = DataSet[SchemaA]({"name": ["John", "Jane", "Jack"]})
df = foo(df)
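The same pattern works for any generic class: annotating the variable with the bare (unparametrized) generic allows reassignment across parametrizations. A minimal sketch with a hypothetical Box class (not part of strictly_typed_pandas):

```python
from typing import Generic, TypeVar

T = TypeVar("T")

class Box(Generic[T]):
    def __init__(self, value: T) -> None:
        self.value = value

box: Box  # bare annotation: the type parameter may differ per assignment
box = Box[int](1)
box = Box[str]("one")
print(box.value)  # one
```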

No cloning

When a DataFrame is cast to a DataSet, the underlying data isn’t cloned (unless you use DataSet[Schema](..., copy=True)). This is great for memory purposes, but it does require some caution. For example, consider the following pandas script:

[9]:
df1 = pd.DataFrame({"id": [1, 2, 3], "name": ["John", "Jane", "Jack"]})
df2 = pd.DataFrame(df1)
df1.name = [1, 2, 3]
df2.name
[9]:
0    1
1    2
2    3
Name: name, dtype: int64

Here, df1 and df2 essentially point to the same data, so changing one of them changes the other one too. This behaviour extends to DataSet as well.

[10]:
class Schema:
    id: int
    name: str

df1 = pd.DataFrame({"id": [1, 2, 3], "name": ["John", "Jane", "Jack"]})
df2 = DataSet[Schema](df1)

df1.name = [1, 2, 3]
df2.name
[10]:
0    1
1    2
2    3
Name: name, dtype: int64

This is somewhat problematic: we have now violated the schema without any error being thrown whatsoever! However:

  • I essentially can’t stop you from doing this (apart from forcing DataSet to copy the data when created, which I won’t).

  • If this happens in your code, you have bigger problems than type checking.

So the bottom line is: be careful when dealing with shared data!
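If you do need an independent copy, plain pandas already provides one. A sketch using DataFrame.copy, which deep-copies the data so the two frames no longer share memory:

```python
import pandas as pd

df1 = pd.DataFrame({"id": [1, 2, 3], "name": ["John", "Jane", "Jack"]})
df2 = df1.copy()  # deep copy: df2 no longer shares data with df1

df1["name"] = [1, 2, 3]
print(df2["name"].tolist())  # ['John', 'Jane', 'Jack'] (unchanged)
```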
