Deep dive into data types
[1]:
%load_ext autoreload
%autoreload 2
import sys
sys.path.append("../")
import pandas as pd
import numpy as np
from typing import Any
from strictly_typed_pandas import DataSet, IndexedDataSet
Numeric types
Pandas stores all numeric data using numpy data types. For example, if we make the following DataFrame (where we explicitly define the data types using base python types):
[2]:
df = pd.DataFrame(
    {
        "a": pd.Series([1, 2, 3], dtype=int),
        "b": pd.Series([1.0, 2.0, 3.0], dtype=float),
        "c": pd.Series([True, False, True], dtype=bool),
    }
)
df.dtypes
[2]:
a int64
b float64
c bool
dtype: object
Then we see that all columns have a numpy data type.
[3]:
assert df.dtypes["a"] == np.int_
assert df.dtypes["b"] == np.float_
assert df.dtypes["c"] == np.bool_
Interestingly, numpy data types are by default equal to their base python counterparts.
[4]:
assert df.dtypes["a"] == int
assert df.dtypes["b"] == float
assert df.dtypes["c"] == bool
Following this mindset, we allow the schemas to be defined using either numpy or base python data types.
[5]:
class Schema:
    a: int
    b: float
    c: bool


df = DataSet[Schema]()
df.dtypes
[5]:
a int64
b float64
c bool
dtype: object
[6]:
class Schema:
    a: np.int64
    b: np.float64
    c: np.bool_


df = DataSet[Schema]()
df.dtypes
[6]:
a int64
b float64
c bool
dtype: object
You can also define your schema with superclasses (e.g. np.integer) instead of specific classes (e.g. np.int64).
[7]:
class Schema:
    a: np.integer
    b: np.float_


df = DataSet[Schema](
    {
        "a": pd.Series([1, 2, 3], dtype=np.int64),
        "b": pd.Series([1.0, 2.0, 3.0], dtype=np.float64),
    }
)
df.dtypes
[7]:
a int64
b float64
dtype: object
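Since np.integer is the abstract superclass of all numpy integer widths, the same schema should accept other integer subtypes too, e.g. np.int32. A minimal sketch of that idea (not executed here, so no output is shown):
[ ]:
# np.int32 is a subclass of np.integer, so it should satisfy the schema above
df = DataSet[Schema](
    {
        "a": pd.Series([1, 2, 3], dtype=np.int32),
        "b": pd.Series([1.0, 2.0, 3.0], dtype=np.float64),
    }
)
df.dtypes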
Datetime and timedelta
These too are defined using numpy.
[8]:
class Schema:
    a: np.datetime64
    b: np.timedelta64


df = DataSet[Schema]()
df.dtypes
[8]:
a datetime64[ns]
b timedelta64[ns]
dtype: object
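In practice, such columns typically come out of pd.to_datetime and pd.to_timedelta. A small sketch of validating populated data against the same schema (the values are illustrative):
[ ]:
# pd.to_datetime yields datetime64[ns]; pd.to_timedelta yields timedelta64[ns]
df = DataSet[Schema](
    {
        "a": pd.Series(pd.to_datetime(["2021-01-01", "2021-01-02"])),
        "b": pd.Series(pd.to_timedelta(["1 days", "2 days"])),
    }
)
df.dtypes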
Pandas data types
Pandas has a number of its own data types, to allow for things like:
Timezones
Categorical values
Sparse data
[9]:
class Schema:
    a: pd.DatetimeTZDtype(tz="UTC")  # type: ignore # noqa: F821
    b: pd.CategoricalDtype
    c: pd.PeriodDtype(freq="D")  # type: ignore # noqa: F821
    d: pd.SparseDtype(dtype=np.int64)  # type: ignore
    e: pd.IntervalDtype
    f: pd.Int64Dtype
    h: pd.BooleanDtype


df = DataSet[Schema]()
df.dtypes
[9]:
a datetime64[ns, UTC]
b category
c period[D]
d Sparse[int64, 0]
e interval[int64, right]
f Int64
h boolean
dtype: object
Some of these types accept arguments (e.g. pd.DatetimeTZDtype(tz="UTC")). While this works perfectly well at runtime, it does result in linting errors. You can suppress these without any problems by using # type: ignore # noqa: F821.
Note that the pandas data types are not considered equivalent to their numpy or base python equivalents.
[10]:
class SchemaA:
    a: pd.Int64Dtype


class SchemaB:
    a: np.int64


try:
    (
        DataSet[SchemaA]()
        .pipe(DataSet[SchemaB])
    )
except TypeError as e:
    print(e)
Column a is of type Int64, but the schema suggests <class 'numpy.int64'>
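If you need to bridge the two, an explicit cast works. A minimal sketch, assuming the column contains no missing values (casting a nullable Int64 with NAs to np.int64 would fail):
[ ]:
(
    DataSet[SchemaA]({"a": pd.Series([1, 2, 3], dtype="Int64")})
    .astype({"a": np.int64})  # cast the nullable Int64 column to plain int64
    .pipe(DataSet[SchemaB])
)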
Strings
String types are a complicated business in pandas. From pandas 1.0.0 onward, we suggest using the string (i.e. pd.StringDtype) data type. When defining a schema, this data type is compatible with both the base python str annotation and the pandas pd.StringDtype annotation.
[11]:
class Schema:
    a: str
    b: pd.StringDtype


df = DataSet[Schema](
    {
        "a": pd.Series(["a", "b", "c"], dtype="string"),
        "b": pd.Series(["a", "b", "c"], dtype="string"),
    }
)
df.dtypes
[11]:
a string
b string
dtype: object
Unfortunately, pd.StringDtype has only been around briefly: it isn't available in older versions of pandas, and as of yet it is still not used by default when creating a DataFrame with strings. Instead, strings are by default stored as the nondescript object type.
[12]:
df = pd.DataFrame({"a": ["a", "b", "c"]})
df.dtypes
[12]:
a object
dtype: object
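If you want the dedicated string type instead, you can opt in with an explicit cast. A minimal sketch:
[ ]:
# cast the default object column to the dedicated string dtype
df = pd.DataFrame({"a": ["a", "b", "c"]}).astype({"a": "string"})
df.dtypes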
To be consistent, we have decided to set str == object when checking the schema, at least until pd.StringDtype becomes the default data type for strings in pandas.
[13]:
class Schema:
    a: str


df = DataSet[Schema]({"a": ["a", "b", "c"]})
df.dtypes
[13]:
a object
dtype: object
Note that this is horribly unspecific. For example, the following DataSet contains a column a with data type object, which contains several things that are definitely not strings. However, since we had to agree that object == str, this currently passes without failure.
[14]:
class Schema:
    a: str


df = DataSet[Schema](
    {
        "a": [None, 42, lambda x: x]
    }
)
df.dtypes
[14]:
a object
dtype: object
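If you want to see what actually ended up in such a column, one purely illustrative check is to map every value to its type:
[ ]:
# reveals NoneType, int and function rather than str
df["a"].map(type)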
We hope that pd.StringDtype will soon be the default string type, so that we can avoid the problem outlined above. Until then, if you want to be sure that your string columns are actually strings, it's best to use pd.StringDtype for your type annotations.
[15]:
class Schema:
    a: pd.StringDtype


df = DataSet[Schema](
    {
        "a": pd.Series(["a", "b", "c"], dtype="string")
    }
)
[16]:
try:
    DataSet[Schema](
        {
            "a": [None, 42, lambda x: x]
        }
    )
except TypeError as e:
    print(e)
Column a is of type numpy.object, but the schema suggests <class 'pandas.core.arrays.string_.StringDtype'>
The Any type
In some cases it is useful to be able to define that a column can have Any type. This can either be a column of a specific type (e.g. int64) or a mix of data types (i.e. an object).
[17]:
class Schema:
    a: Any
    b: Any


df = DataSet[Schema](
    {
        "a": [1, 2, 3],
        "b": ["1", 2, None],
    }
)
df.dtypes
[17]:
a int64
b object
dtype: object
Data types that are not supported in the index
There are certain data types that pandas does not support in the index. As of pandas 1.5.3, this is limited to pd.SparseDtype() columns, which will be transformed to an object column when used as an index. This means that you cannot use these data types in the index schema.
[18]:
class IndexSchema:
    a: pd.SparseDtype(dtype=np.int64)  # type: ignore # including other variations of SparseDtype


class Schema:
    b: int


try:
    IndexedDataSet[IndexSchema, Schema]()
except TypeError as e:
    print(e)
Cannot interpret 'Sparse[int64, 0]' as a data type
/Users/nanneaben/Documents/projects/2022/strictly_typed_pandas/strictly_typed_pandas/create_empty_dataframe.py:39: FutureWarning: In a future version, passing a SparseArray to pd.Index will store that array directly instead of converting to a dense numpy ndarray. To retain the old behavior, use pd.Index(arr.to_numpy()) instead
pd.concat([df_index, df_data], axis=1)
/Users/nanneaben/Documents/projects/2022/strictly_typed_pandas/strictly_typed_pandas/validate_schema.py:25: SyntaxWarning: As of Pandas 1.5.3, there is no support for the following data types in the index: [Sparse[int64, 0]]. While this may change in future versions, we suggest you proceed with caution.
warnings.warn(msg.format(dtypes), SyntaxWarning)
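If you do need these values in the index, one possible workaround (a sketch, not part of the library's API; DenseIndexSchema is a name invented here) is to densify the column via pandas' sparse accessor and declare the dense type in the index schema:
[ ]:
class DenseIndexSchema:
    a: int


sparse = pd.Series([1, 2, 3], dtype=pd.SparseDtype(dtype=np.int64))
df = (
    pd.DataFrame({"a": sparse.sparse.to_dense(), "b": [4, 5, 6]})  # densify before indexing
    .set_index("a")
    .pipe(IndexedDataSet[DenseIndexSchema, Schema])
)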
Anything missing?
There’s a zoo of data types used in pandas. Is anything missing? Contact me and I’ll look into it!