Advanced
[1]:
%load_ext autoreload
%autoreload 2
import sys
sys.path.append("../")
Subclassing schemas
Subclassing schemas is a useful pattern for pipelines in which each subsequent function adds a few columns.
[2]:
from strictly_typed_pandas import DataSet

class SchemaA:
    name: str

class SchemaB(SchemaA):
    id: int

df = DataSet[SchemaA]({"name": ["John", "Jane", "Jack"]})

def foo(df: DataSet[SchemaA]) -> DataSet[SchemaB]:
    return (
        df.assign(id=lambda df: range(df.shape[0]))
        .pipe(DataSet[SchemaB])
    )
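For reference, the pandas half of foo behaves as follows; a minimal plain-pandas sketch (no schema validation involved), showing that assign adds the id column SchemaB requires:

```python
import pandas as pd

df = pd.DataFrame({"name": ["John", "Jane", "Jack"]})

# assign() returns a new frame with the extra column appended;
# range(d.shape[0]) yields one sequential id per row.
out = df.assign(id=lambda d: range(d.shape[0]))

print(out)
```

Piping the result through DataSet[SchemaB] then simply validates that the columns and dtypes match the schema.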
Similarly, you can use it when merging (or joining or concatenating) two datasets together.
[3]:
class SchemaA:
    id: int
    name: str

class SchemaB:
    id: int
    job: str

class SchemaAB(SchemaA, SchemaB):
    pass

df1 = DataSet[SchemaA]({"id": [1, 2, 3], "name": ["John", "Jane", "Jack"]})
df2 = DataSet[SchemaB]({"id": [1, 2, 3], "job": "Data Scientist"})

(
    df1.merge(df2, on="id")
    .pipe(DataSet[SchemaAB])
)
[3]:
   id  name             job
0   1  John  Data Scientist
1   2  Jane  Data Scientist
2   3  Jack  Data Scientist
Creating an empty DataSet
Sometimes it’s useful to create a DataSet without any rows. This can be easily done as follows:
[4]:
class Schema:
    id: int
    name: str

DataSet[Schema]()
[4]:
Empty DataFrame
Columns: [id, name]
Index: []
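Under the hood, this is roughly equivalent to building an empty pandas frame with typed columns (a plain-pandas sketch; the exact construction inside DataSet is my assumption):

```python
import pandas as pd

# Hypothetical plain-pandas equivalent of DataSet[Schema]():
# zero rows, but the columns and their dtypes are already in place.
# Note that a str annotation maps to pandas' object dtype.
empty = pd.DataFrame({
    "id": pd.Series(dtype="int64"),
    "name": pd.Series(dtype="object"),
})

print(empty.dtypes)
```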
Support for numpy and pandas data types
We also support numpy and pandas data types, as well as typing.Any. If you're missing support for any other data type, drop us a line and we'll see if we can add it!
[5]:
import numpy as np
import pandas as pd
from typing import Any
class Schema:
    name: pd.StringDtype
    money: np.float64
    eggs: np.int64
    potatoes: Any

df = DataSet[Schema](
    {
        "name": pd.Series(["John", "Jane", "Jack"], dtype="string"),
        "money": pd.Series([100.50, 1000.23, 123.45], dtype=np.float64),
        "eggs": pd.Series([1, 2, 3], dtype=np.int64),
        "potatoes": ["1", 0, np.nan],
    }
)
df.dtypes
[5]:
name string
money float64
eggs int64
potatoes object
dtype: object
IndexedDataSet
If you’d like to also strictly type the index, you can use the IndexedDataSet class.
[6]:
from strictly_typed_pandas import IndexedDataSet
class IndexSchema:
    id: int
    job: str

class DataSchema:
    name: str

df = (
    pd.DataFrame({"id": [1, 2, 3], "name": ["John", "Jane", "Jack"], "job": "Data Scientist"})
    .set_index(["id", "job"])
    .pipe(IndexedDataSet[IndexSchema, DataSchema])
)
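For reference, the plain-pandas part of this pipeline produces a two-level index matching IndexSchema, with only the DataSchema columns left in the data part; a quick check (pure pandas, no strictly_typed_pandas required):

```python
import pandas as pd

# The plain-pandas part of the pipeline above: two columns moved into the index.
df = (
    pd.DataFrame({"id": [1, 2, 3], "name": ["John", "Jane", "Jack"], "job": "Data Scientist"})
    .set_index(["id", "job"])
)

print(df.index.names)  # the index levels
print(df.columns)      # the remaining data columns
```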
Reusing a variable (e.g. df) with different schemas
Sometimes when building a pipeline, it's useful to reuse a variable (e.g. df) with different schemas. If we do that in the following way, however, we'll get a mypy error.
[7]:
class SchemaA:
    name: str

class SchemaB(SchemaA):
    id: int

def foo(df: DataSet[SchemaA]) -> DataSet[SchemaB]:
    return (
        df.assign(id=1)
        .pipe(DataSet[SchemaB])
    )
df = DataSet[SchemaA]({"name": ["John", "Jane", "Jack"]})
df = foo(df)
# mypy(error): Incompatible types in assignment (expression has type "DataSet[SchemaB]", variable has type "DataSet[SchemaA]")
To avoid this error, we need to declare that df will be of the type DataSet (implying that the schema may be different at different points).
[8]:
df: DataSet
df = DataSet[SchemaA]({"name": ["John", "Jane", "Jack"]})
df = foo(df)
No cloning
When a DataFrame is cast to a DataSet, the underlying data isn't cloned (unless you use DataSet[Schema](..., copy=True)). This is great for memory purposes, but it does require some caution. For example, consider the following pandas script:
[9]:
df1 = pd.DataFrame({"id": [1, 2, 3], "name": ["John", "Jane", "Jack"]})
df2 = pd.DataFrame(df1)
df1.name = [1, 2, 3]
df2.name
[9]:
0 1
1 2
2 3
Name: name, dtype: int64
Here, df1 and df2 essentially point to the same data, so changing one of them changes the other one too. This behaviour extends to DataSet as well.
[10]:
class Schema:
    id: int
    name: str
df1 = pd.DataFrame({"id": [1, 2, 3], "name": ["John", "Jane", "Jack"]})
df2 = DataSet[Schema](df1)
df1.name = [1, 2, 3]
df2.name
[10]:
0 1
1 2
2 3
Name: name, dtype: int64
This is somewhat problematic, because we have now made a change to the schema without any error being thrown! However:

1. I essentially can't stop you from doing this (apart from forcing DataSet to copy the data when created, which I won't).
2. If this happens in your code, you have bigger problems than type checking.

So the bottom line is: be careful when dealing with pointers!
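The sharing can be made explicit with numpy.shares_memory; a plain-pandas sketch showing how the copy=True option mentioned above breaks the link:

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame({"id": [1, 2, 3], "name": ["John", "Jane", "Jack"]})
shared = pd.DataFrame(df1)             # no copy: same underlying buffers
cloned = pd.DataFrame(df1, copy=True)  # explicit copy: independent buffers

# Check whether the column data occupies the same memory as df1's.
print(np.shares_memory(df1["id"].to_numpy(), shared["id"].to_numpy()))
print(np.shares_memory(df1["id"].to_numpy(), cloned["id"].to_numpy()))
```

If you want a DataSet that is guaranteed to be independent of the original frame, pass copy=True when constructing it.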