{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Deepdive into data types" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import pandas as pd\n", "import numpy as np\n", "\n", "from typing import Any\n", "from strictly_typed_pandas import DataSet" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Numeric types\n", "\n", "Pandas stores all numeric data using numpy data types. For example, if we make the following `DataFrame` (where we explicitely define the data types using base python types):" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df = pd.DataFrame(\n", " {\n", " \"a\": pd.Series([1, 2, 3], dtype=int),\n", " \"b\": pd.Series([1.0, 2.0, 3.0], dtype=float),\n", " \"c\": pd.Series([True, False, True], dtype=bool),\n", " }\n", ")\n", "\n", "df.dtypes" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Then we see that all columns have a numpy data type." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "assert df.dtypes[\"a\"] == np.int64\n", "assert df.dtypes[\"b\"] == np.float64\n", "assert df.dtypes[\"c\"] == np.bool_" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Interestingly, numpy data types are by default equal to their base python counterparts." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "assert df.dtypes[\"a\"] == int\n", "assert df.dtypes[\"b\"] == float\n", "assert df.dtypes[\"c\"] == bool" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Following this mindset, we allow the schemas to be defined using either numpy or base python data types." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "class Schema:\n", " a: int\n", " b: float\n", " c: bool\n", "\n", "\n", "df = DataSet[Schema]()\n", "df.dtypes" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "class Schema:\n", " a: np.int64\n", " b: np.float64\n", " c: np.bool_\n", "\n", "\n", "df = DataSet[Schema]()\n", "df.dtypes" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "You can also define your schema with superclasses (e.g. `np.integer`) instead of specific classes (e.g. `np.int64`)." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "class Schema:\n", " a: np.integer\n", "\n", "\n", "df = DataSet[Schema](\n", " {\n", " \"a\": pd.Series([1, 2, 3], dtype=np.int64),\n", " }\n", ")\n", "df.dtypes" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Datetime and timedelta\n", "These too are defined using numpy.\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "class Schema:\n", " a: np.datetime64\n", " b: np.timedelta64\n", "\n", "\n", "df = DataSet[Schema]()\n", "df.dtypes" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Pandas data types\n", "Pandas has a number of its own data types, to allow for things like:\n", "\n", "* Timezones\n", "\n", "* Categorical values\n", "\n", "* Sparse data" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "class Schema:\n", " a: pd.DatetimeTZDtype(tz=\"UTC\") # type: ignore # noqa: F821\n", " b: pd.CategoricalDtype\n", " c: pd.PeriodDtype(freq=\"D\") # type: ignore # noqa: F821\n", " d: pd.SparseDtype(dtype=np.int64) # type: ignore\n", " e: pd.IntervalDtype\n", " f: pd.Int64Dtype\n", " h: pd.BooleanDtype\n", "\n", "\n", "df = DataSet[Schema]()\n", "df.dtypes" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Some of these types accept arguments (e.g. `pd.DatetimeTZDtype(tz=\"UTC\")`). While this works perfectly well during run-time, it does result in linting errors. You can suppress these without any problems by using `# type: ignore # noqa: F821`.\n", "\n", "Note that the pandas data types are not considered equivalent to their numpy or base python equivalents." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "class SchemaA:\n", " a: pd.Int64Dtype\n", "\n", "\n", "class SchemaB:\n", " a: np.int64\n", "\n", "\n", "try:\n", " DataSet[SchemaA]().pipe(DataSet[SchemaB])\n", "except TypeError as e:\n", " print(e)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Strings\n", "String types are complicated business in pandas. From pandas 1.0.0 and higher, we suggest using the `string` (i.e. `pd.StringDtype`) data type. When defining a schema, this data type is compatible with both the base python `str` annotation and the pandas `pd.StringDtype` annotation." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "class Schema:\n", " a: str\n", " b: pd.StringDtype\n", "\n", "\n", "df = DataSet[Schema](\n", " {\n", " \"a\": pd.Series([\"a\", \"b\", \"c\"], dtype=\"string\"),\n", " \"b\": pd.Series([\"a\", \"b\", \"c\"], dtype=\"string\"),\n", " }\n", ")\n", "df.dtypes" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Unfortunately, `pd.StringDtype` has only been around briefly: it isn't available in older versions of python, and as of yet it is still not used by default when creating a DataFrame with strings. Instead, strings are by default stored as the non-descript `object` type." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df = pd.DataFrame({\"a\": [\"a\", \"b\", \"c\"]})\n", "df.dtypes" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To be consistent, we have decided to set `str == object` when checking the schema, atleast until `pd.StringDtype` will be the default data type for strings in pandas." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "class Schema:\n", " a: str\n", "\n", "\n", "df = DataSet[Schema]({\"a\": [\"a\", \"b\", \"c\"]})\n", "df.dtypes" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Note that this is horribly unspecific. For example, the following `DataSet` contains a column `a` with data type `object`, which contains several things that are definitely not strings. However, since we had to agree that `object == str`, this currently passes without failure." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "class Schema:\n", " a: str\n", "\n", "\n", "df = DataSet[Schema]({\"a\": [None, 42, lambda x: x]})\n", "df.dtypes" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We hope that `pd.StringDtype` will soon be the default string type, so that we can avoid the problem outlined above. Until then, if you want to be sure that your string columns are actually strings, it's best to use `pd.StringDtype` for your type annotations." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "class Schema:\n", " a: pd.StringDtype\n", "\n", "\n", "df = DataSet[Schema]({\"a\": pd.Series([\"a\", \"b\", \"c\"], dtype=\"string\")})" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "try:\n", " DataSet[Schema]({\"a\": [None, 42, lambda x: x]})\n", "except TypeError as e:\n", " print(e)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## The `Any` type\n", "\n", "In some cases it is useful to be able to define that a column can have `Any` type. This can either be a column of a specific type (e.g. `int64`) or a mix of data types (i.e. an `object`)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "class Schema:\n", " a: Any\n", " b: Any\n", "\n", "\n", "df = DataSet[Schema](\n", " {\n", " \"a\": [1, 2, 3],\n", " \"b\": [\"1\", 2, None],\n", " }\n", ")\n", "df.dtypes" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Anything missing?\n", "There's a zoo of data types used in pandas. Is anything missing? Contact me and I'll look into it!" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "stp", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.13" }, "orig_nbformat": 2, "vscode": { "interpreter": { "hash": "0785e816af5df78c77a9de5b5385808c06b955fe7dba50fa53415245f1f2e5ee" } } }, "nbformat": 4, "nbformat_minor": 2 }