{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Getting started\n", "\n", "## The problem\n", "\n", "I love Pandas! But in production code I’m always a bit wary when I see:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import pandas as pd\n", "\n", "\n", "def foo(df: pd.DataFrame) -> pd.DataFrame:\n", " # do stuff\n", " return df" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Because… How do I know which columns are supposed to be in `df`?\n", "\n", "Sure, in a notebook this is often not a big problem, because we'll likely have\n", "\n", "* a few hundred lines of code\n", "\n", "* that you're working on alone\n", "\n", "* over a limited amount of time\n", "\n", "But what if this is production code, where we have:\n", "\n", "* \\>1000 lines of code\n", "\n", "* that we are maintaining for years to come\n", "\n", "* potentially by colleagues who haven't even been hired yet\n", "\n", "You'll probably want to be a bit more explicit about what these DataFrames should look like!\n", "\n", "## The solution: static type checking of pandas DataFrames\n", "\n", "Suppose we know that our DataFrame has two columns: `id` (an int) and `name` (a str). Using `strictly_typed_pandas`, we may write that down as follows." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from strictly_typed_pandas import DataSet\n", "\n", "\n", "class Schema:\n", " id: int\n", " name: str\n", "\n", "\n", "def foo(df: DataSet[Schema]) -> DataSet[Schema]:\n", " # do stuff\n", " return df" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "These type definitions can now be checked using `mypy`, a linter for static type checking. The big benefit of `mypy` is that the type checking doesn't happen during run-time, but rather during linting time (so while you're coding), saving you precious time. If you haven't already, you should really check out how to set up `mypy` for your IDE.\n", "\n", "Let's consider an example of how this works. First, we'll create some data. Since `DataSet` is a subclass of `pd.DataFrame`, it has (nearly) all the functionality of a `DataFrame`, including:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df = DataSet[Schema](\n", " {\n", " \"id\": [1, 2, 3],\n", " \"name\": [\"John\", \"Jane\", \"Jack\"],\n", " }\n", ")\n", "df" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can now call `foo()` with our data. All types check out, so nothing special happens." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "res = foo(df)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "However, if we instead try to run `foo()` on a `DataFrame`, mypy will throw the following error.\n", "\n", "(Shown as a comment here, but it will show up in your IDE if you set up mypy.)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df = pd.DataFrame(df)\n", "res = foo(df)\n", "# mypy(error): Argument 1 to \"foo\" has incompatible type \"DataFrame\"; expected \"DataSet[Schema]\"" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Likewise, if we call `foo()` on a `DataSet` with an alternative schema, mypy will throw the following error." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "class AlternativeSchema:\n", " id: int\n", " first_name: str\n", "\n", "\n", "df = DataSet[AlternativeSchema](\n", " {\n", " \"id\": [1, 2, 3],\n", " \"first_name\": [\"John\", \"Jane\", \"Jack\"],\n", " }\n", ")\n", "try:\n", " res = foo(df)\n", " # mypy(error): Argument 1 to \"foo\" has incompatible type \"DataSet[AlternativeSchema]\"; expected \"DataSet[Schema]\"\n", "except:\n", " pass" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## How can we be sure that a DataSet adheres to its schema?\n", "\n", "The above is great if everyone is meticulous in keeping the schema annotations correct and up-to-date. But shouldn't we be worried that these schema annotations get out of sync? For example:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "class Schema:\n", " id: int\n", " name: str\n", "\n", "\n", "def foo() -> DataSet[Schema]:\n", " return DataSet[Schema](\n", " {\n", " \"id\": [1, 2, 3],\n", " \"name\": [\"John\", \"Jane\", \"Jack\"],\n", " \"job\": \"Data Scientist\",\n", " }\n", " )" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Fortunately, we have some extra precautions in place that prevent the above scenario:\n", "\n", "* The schema of the data is validated during the `DataSet` creation.\n", "\n", "* `DataSet` is immutable, so its schema cannot change due to inplace modifications.\n", "\n", "As we will see, this means that if your codebase (e.g. `foo()`) is unit tested, functions like the above will result in errors and hence they shouldn't make it to the master branch. As such, you will be able to trust the schema annotations in your code base.\n", "\n", "Let's have a look at these precautions in more detail. First, if the columns in the data do not correspond to the ones defined in the shema, we get a TypeError, for example:\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "try:\n", " df = DataSet[Schema]({\"id\": [1, 2, 3]})\n", "except TypeError as e:\n", " print(e)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Similarly, if the types defined in the schema don't match the types in the data, we again get a `TypeError`." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "try:\n", " df = DataSet[Schema](\n", " {\n", " \"id\": [1, 2, 3],\n", " \"name\": [1, 2, 3],\n", " }\n", " )\n", "except TypeError as e:\n", " print(e)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Hence, when we successsfully create our `DataSet[Schema]`, we can be certain that it adheres to the schema. \n", "\n", "Of course, for this to work, we do need to make sure that the `DataSet`'s columns and datatypes cannot be changed after its creation. This brings us to our second point: \n", "\n", "* `DataSet` is immutable, so its schema cannot change due to inplace modifications.\n", "\n", "To this end, we have disabled operations such as:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df = DataSet[Schema](\n", " {\n", " \"id\": [1, 2, 3],\n", " \"name\": [\"John\", \"Jane\", \"Jack\"],\n", " }\n", ")\n", "ids = [\"1\", \"2\", \"3\"]\n", "try:\n", " df[\"id\"] = ids\n", " df.id = ids\n", " df.loc[:, \"id\"] = ids\n", " df.iloc[:, 0] = ids\n", " df.assign(id=ids, inplace=True)\n", "except NotImplementedError as e:\n", " print(e)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "When you do need to make changes to the schema, you can either cast the `DataSet` back to a `DataFrame`." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df = df.to_dataframe()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Or you can perform the `assign()` in the following way, which also casts it to a `DataFrame`" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df = df.assign(id=ids)\n", "assert type(df) == pd.DataFrame" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In practice, this often means that functions have the following sequence:\n", "\n", "1. The input is a `DataSet[SchemaA]`\n", "\n", "2. The data is converted to a `DataFrame` so changes can be made\n", "\n", "3. The output is cast to `DataSet[SchemaB]`" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "class SchemaA:\n", " name: str\n", "\n", "\n", "class SchemaB:\n", " id: int\n", " name: str\n", "\n", "\n", "df = DataSet[SchemaA]({\"name\": [\"John\", \"Jane\", \"Jack\"]})\n", "\n", "\n", "def foo(df: DataSet[SchemaA]) -> DataSet[SchemaB]:\n", " n = df.shape[0]\n", " ids = range(n)\n", " new_df = df.assign(id=ids)\n", " return DataSet[SchemaB](new_df)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Or alternatively in the more compact version" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "def foo(data: DataSet[SchemaA]) -> DataSet[SchemaB]:\n", " return df.assign(\n", " id=lambda df: range(df.shape[0]),\n", " ).pipe(DataSet[SchemaB])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## What about functions that return `Any`?\n", "So far we've seen that we can strictly type check our pandas data using a combination of linting checks and runtime checks. So is there anything that we haven't covered yet? Well, it turns out there is. Consider the following example.\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "class Schema:\n", " id: int\n", " name: str\n", "\n", "\n", "def foo() -> DataSet[Schema]:\n", " return (\n", " DataSet[Schema](\n", " {\n", " \"id\": [1, 2, 3],\n", " \"name\": [\"John\", \"Jane\", \"Jack\"],\n", " }\n", " )\n", " .assign(job=\"Data Scientist\")\n", " .iloc[:3]\n", " )\n", "\n", "\n", "res = foo()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now this is interesting: `foo()` clearly returns something that doesn't adhere to the schema, but the above gives neither a linting error nor a runtime error!\n", "\n", "It turns out that the above problem often happens with functions like `iloc`, `loc` and `pipe`, whose return type is `Any` (and when you think about it, these can indeed return any possible datatype). When mypy sees that the return type is `Any`, it reasons that that could still be a `DataSet[Schema]` object, so it doesn't raise an error. It's only during runtime that we find out here that the return type actually is a `DataFrame`, but `mypy` doesn't do any runtime checks.\n", "\n", "Fortunately, Python offers other ways to do type checking during runtime. Here, we will use the `typeguard` package. " ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from strictly_typed_pandas.typeguard import typechecked\n", "\n", "\n", "@typechecked\n", "def foo() -> DataSet[Schema]:\n", " return (\n", " DataSet[Schema](\n", " {\n", " \"id\": [1, 2, 3],\n", " \"name\": [\"John\", \"Jane\", \"Jack\"],\n", " }\n", " )\n", " .assign(job=\"Data Scientist\")\n", " .iloc[:3]\n", " )\n", "\n", "\n", "try:\n", " res = foo()\n", "except TypeError as e:\n", " print(e)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Alright, we now caught the error dead in its tracks! \n", "\n", "We can improve this with one more step: instead of adding the `@typechecked` decorator to every function by hand (which could be error prone), `typeguard` can do this automatically when running the unit tests. To do this, simply run your unit tests using `pytest --stp-typeguard-packages=foo.bar` (where `foo.bar` is your package name)\n", "\n", "## Conclusions\n", "\n", "We can statically type check pandas in the following way:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from strictly_typed_pandas import DataSet\n", "\n", "\n", "class Schema:\n", " id: int\n", " name: str\n", "\n", "\n", "def foo(df: DataSet[Schema]) -> DataSet[Schema]:\n", " # do stuff\n", " return df" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Where `DataSet`:\n", "\n", "* is a subclass of `pd.DataFrame` and hence has the same functionality as `DataFrame`.\n", "\n", "* validates whether the data adheres to the provided schema upon its initialization.\n", "\n", "* is immutable, so its schema cannot be changed using inplace modifications.\n", "\n", "The `DataSet[Schema]` annotations are compatible with:\n", "\n", "* `mypy` for type checking during linting-time (i.e. while you write your code).\n", "\n", "* `typeguard` for type checking during run-time (i.e. while you run your unit tests).\n", "\n", "To get the most out of `strictly_typed_pandas`, be sure to:\n", "\n", "* set up `mypy` in your IDE.\n", "\n", "* run your unit tests with `pytest --stp-typeguard-packages=foo.bar` (where `foo.bar` is your package name)." ] }, { "cell_type": "markdown", "metadata": {}, "source": [] } ], "metadata": { "interpreter": { "hash": "21955bae40816b58329a864495bd83642121ab031d49eff86d34b7b0569c6cea" }, "kernelspec": { "display_name": "Python 3.8.5 64-bit ('base': conda)", "name": "python3" }, "language_info": { "name": "python", "version": "" }, "orig_nbformat": 2 }, "nbformat": 4, "nbformat_minor": 2 }