Getting started

The problem

I love pandas! But in production code I’m always a bit wary when I see:

[1]:
import pandas as pd


def foo(df: pd.DataFrame) -> pd.DataFrame:
    # do stuff
    return df

Because… How do I know which columns are supposed to be in df?

Sure, in a notebook this is often not a big problem, because we’ll likely have

  • a few hundred lines of code

  • that we’re working on alone

  • over a limited amount of time

But what if this is production code, where we have:

  • >1000 lines of code

  • that we are maintaining for years to come

  • potentially by colleagues who haven’t even been hired yet

You’ll probably want to be a bit more explicit about what these DataFrames should look like!

The solution: static type checking of pandas DataFrames

Suppose we know that our DataFrame has two columns: id (an int) and name (a str). Using strictly_typed_pandas, we may write that down as follows.

[2]:
from strictly_typed_pandas import DataSet


class Schema:
    id: int
    name: str


def foo(df: DataSet[Schema]) -> DataSet[Schema]:
    # do stuff
    return df

These type definitions can now be checked using mypy, a linter for static type checking. The big benefit of mypy is that the type checking doesn’t happen at runtime, but at linting time (i.e. while you’re coding), saving you precious time. If you haven’t already, you should really check out how to set up mypy for your IDE.
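
If you have mypy installed, you can also run it from the command line. A minimal sketch, where my_module.py is a placeholder for the file you want to check:

pip install mypy
mypy my_module.py

This reports type errors like the ones shown as comments in the examples below.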

Let’s consider an example of how this works. First, we’ll create some data. Since DataSet is a subclass of pd.DataFrame, it has (nearly) all the functionality of a DataFrame, including initialization from a dictionary:

[3]:
df = DataSet[Schema](
    {
        "id": [1, 2, 3],
        "name": ["John", "Jane", "Jack"],
    }
)
df
[3]:
   id  name
0   1  John
1   2  Jane
2   3  Jack

We can now call foo() with our data. All types check out, so nothing special happens.

[4]:
res = foo(df)

However, if we instead call foo() with a plain pd.DataFrame, mypy will throw the following error.

(Shown as a comment here, but it will show up in your IDE if you set up mypy.)

[5]:
df = pd.DataFrame(df)
res = foo(df)
# mypy(error): Argument 1 to "foo" has incompatible type "DataFrame"; expected "DataSet[Schema]"

Likewise, if we call foo() on a DataSet with an alternative schema, mypy will throw the following error.

[6]:
class AlternativeSchema:
    id: int
    first_name: str


df = DataSet[AlternativeSchema](
    {
        "id": [1, 2, 3],
        "first_name": ["John", "Jane", "Jack"],
    }
)
try:
    res = foo(df)
    # mypy(error): Argument 1 to "foo" has incompatible type "DataSet[AlternativeSchema]"; expected "DataSet[Schema]"
except Exception:
    pass

How can we be sure that a DataSet adheres to its schema?

The above is great if everyone is meticulous in keeping the schema annotations correct and up-to-date. But shouldn’t we be worried that these schema annotations get out of sync? For example:

[7]:
class Schema:
    id: int
    name: str


def foo() -> DataSet[Schema]:
    return DataSet[Schema](
        {
            "id": [1, 2, 3],
            "name": ["John", "Jane", "Jack"],
            "job": "Data Scientist",
        }
    )

Fortunately, we have some extra precautions in place that prevent the above scenario:

  • The schema of the data is validated during the DataSet creation.

  • DataSet is immutable, so its schema cannot change due to inplace modifications.

As we will see, this means that if your codebase (e.g. foo()) is unit tested, functions like the one above will result in errors and hence won’t make it to the master branch. As such, you will be able to trust the schema annotations in your codebase.
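
For example, a unit test as simple as the sketch below would already catch the out-of-sync foo() above, because the DataSet constructor raises a TypeError for the unexpected job column (the test name is hypothetical):

def test_foo_adheres_to_schema():
    # errors out: DataSet[Schema] rejects the "job" column during construction
    res = foo()
    assert list(res.columns) == ["id", "name"]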

Let’s have a look at these precautions in more detail. First, if the columns in the data do not correspond to the ones defined in the schema, we get a TypeError, for example:

[8]:
try:
    df = DataSet[Schema]({"id": [1, 2, 3]})
except TypeError as e:
    print(e)
Schema contains the following columns not present in data: {'name'}

Similarly, if the types defined in the schema don’t match the types in the data, we again get a TypeError.

[9]:
try:
    df = DataSet[Schema](
        {
            "id": [1, 2, 3],
            "name": [1, 2, 3],
        }
    )
except TypeError as e:
    print(e)
Column name is of type numpy.int64, but the schema suggests <class 'str'>

Hence, when we successfully create our DataSet[Schema], we can be certain that it adheres to the schema.

Of course, for this to work, we do need to make sure that the DataSet’s columns and datatypes cannot be changed after its creation. This brings us to our second point:

  • DataSet is immutable, so its schema cannot change due to inplace modifications.

To this end, we have disabled operations such as:

[10]:
df = DataSet[Schema](
    {
        "id": [1, 2, 3],
        "name": ["John", "Jane", "Jack"],
    }
)
ids = ["1", "2", "3"]
try:
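    # all of the following in-place modifications are disabled; the first one
    # already raises, so the rest never run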
    df["id"] = ids
    df.id = ids
    df.loc[:, "id"] = ids
    df.iloc[:, 0] = ids
    df.assign(id=ids, inplace=True)
except NotImplementedError as e:
    print(e)
To ensure that the DataSet adheres to its schema, you cannot perform inplace modifications. You can either use dataset.to_dataframe() to cast the DataSet to a DataFrame, or use operations that return a DataFrame, e.g. df = df.assign(...).

When you do need to make changes to the schema, you can either cast the DataSet back to a DataFrame:

[11]:
df = df.to_dataframe()

Or you can perform the assign() in the following way, which also casts the result to a DataFrame:

[12]:
df = df.assign(id=ids)
assert type(df) is pd.DataFrame

In practice, this often means that functions have the following sequence:

  1. The input is a DataSet[SchemaA]

  2. The data is converted to a DataFrame so changes can be made

  3. The output is cast to DataSet[SchemaB]

[13]:
class SchemaA:
    name: str


class SchemaB:
    id: int
    name: str


df = DataSet[SchemaA]({"name": ["John", "Jane", "Jack"]})


def foo(df: DataSet[SchemaA]) -> DataSet[SchemaB]:
    n = df.shape[0]
    ids = range(n)
    new_df = df.assign(id=ids)
    return DataSet[SchemaB](new_df)

Or, alternatively, in a more compact version:

[14]:
def foo(df: DataSet[SchemaA]) -> DataSet[SchemaB]:
    return df.assign(
        id=lambda df: range(df.shape[0]),
    ).pipe(DataSet[SchemaB])
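
As a quick sanity check (a sketch, reusing the df defined above), the result is again a validated DataSet:

res = foo(df)
assert isinstance(res, DataSet)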

What about functions that return Any?

So far we’ve seen that we can strictly type check our pandas data using a combination of linting checks and runtime checks. So is there anything that we haven’t covered yet? Well, it turns out there is. Consider the following example.

[15]:
class Schema:
    id: int
    name: str


def foo() -> DataSet[Schema]:
    return (
        DataSet[Schema](
            {
                "id": [1, 2, 3],
                "name": ["John", "Jane", "Jack"],
            }
        )
        .assign(job="Data Scientist")
        .iloc[:3]
    )


res = foo()

Now this is interesting: foo() clearly returns something that doesn’t adhere to the schema, but the above gives neither a linting error nor a runtime error!

It turns out that this problem often happens with functions like iloc, loc and pipe, whose return type is Any (and when you think about it, these can indeed return any possible datatype). When mypy sees that the return type is Any, it reasons that this could still be a DataSet[Schema] object, so it doesn’t raise an error. Only at runtime do we find out that the returned value is actually a DataFrame, but mypy doesn’t do any runtime checks.
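
You can see this for yourself by asking mypy to reveal the type it infers for such an expression. A minimal sketch (importing reveal_type from typing requires Python 3.11+, and the exact revealed type depends on your pandas version and stubs):

from typing import reveal_type

df = DataSet[Schema](
    {
        "id": [1, 2, 3],
        "name": ["John", "Jane", "Jack"],
    }
)
reveal_type(df.iloc[:3])  # mypy: Revealed type is "Any"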

Fortunately, Python offers other ways to do type checking during runtime. Here, we will use the typeguard package.

[16]:
from typeguard import typechecked


@typechecked
def foo() -> DataSet[Schema]:
    return (
        DataSet[Schema](
            {
                "id": [1, 2, 3],
                "name": ["John", "Jane", "Jack"],
            }
        )
        .assign(job="Data Scientist")
        .iloc[:3]
    )


try:
    res = foo()
except TypeError as e:
    print(e)
Type of the return value must be a DataSet[__main__.Schema]; got pandas.core.frame.DataFrame instead

Alright, we’ve now stopped the error dead in its tracks!

We can improve this with one more step: instead of adding the @typechecked decorator to every function by hand (which would be error-prone), typeguard can do this automatically when running the unit tests. To do this, simply run your unit tests using pytest --typeguard-packages=foo.bar (where foo.bar is your package name).
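
If you’d rather not type the flag every time, you can also put it in your pytest configuration. A sketch, with foo.bar again as a placeholder for your package name:

# pytest.ini
[pytest]
addopts = --typeguard-packages=foo.bar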

Conclusions

We can statically type check pandas in the following way:

[17]:
from strictly_typed_pandas import DataSet


class Schema:
    id: int
    name: str


def foo(df: DataSet[Schema]) -> DataSet[Schema]:
    # do stuff
    return df

Where DataSet:

  • is a subclass of pd.DataFrame and hence has (nearly) all the functionality of a DataFrame.

  • validates whether the data adheres to the provided schema upon its initialization.

  • is immutable, so its schema cannot be changed using inplace modifications.

The DataSet[Schema] annotations are compatible with:

  • mypy for type checking during linting-time (i.e. while you write your code).

  • typeguard for type checking during run-time (i.e. while you run your unit tests).

To get the most out of strictly_typed_pandas, be sure to:

  • set up mypy in your IDE.

  • run your unit tests with pytest --typeguard-packages=foo.bar (where foo.bar is your package name).