Pack Pandas DataFrames into smaller, more memory-efficient types.
When you load data into Pandas, it will use standard types by default:
- object for strings
- int64 for integers
- float64 for floating point numbers
However, for many datasets there is a much more compact representation that Pandas could be using for that data. Using a more compact representation leads to lower memory usage, and smaller binary files on disk when using formats such as Feather and Parquet.
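To make the saving concrete, here is a small sketch using only pandas and NumPy (not this library). The sample values are made up for illustration; it compares the bytes used by the same integers stored as int64 and as uint8.

```python
import numpy as np
import pandas as pd

# Illustrative data: 100 small integers stored with a 64-bit type.
wide = pd.Series(np.arange(100, dtype="int64"))

# The same values fit in an unsigned 8-bit integer (range 0..255).
narrow = wide.astype("uint8")

# memory_usage(index=False) counts only the bytes used by the values.
wide_bytes = wide.memory_usage(index=False)
narrow_bytes = narrow.memory_usage(index=False)

print(f"int64: {wide_bytes} bytes, uint8: {narrow_bytes} bytes")
```

Each value drops from 8 bytes to 1 byte, and the values themselves are unchanged.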
This library does just one thing: it shrinks your data frames to use smaller types.
pip install owid-repack
The owid.repack module exposes two functions, repack_series() and repack_frame().
repack_series() will detect the smallest type that can accurately fit the existing data in the series.
In [1]: import pandas as pd
   ...: from owid import repack
In [2]: pd.Series([1, 2, 3])
Out[2]:
0    1
1    2
2    3
dtype: int64
In [3]: repack.repack_series(pd.Series([1.5, 2, 3]))
Out[3]:
0    1.5
1    2.0
2    3.0
dtype: float32
In [4]: repack.repack_series(pd.Series([1, None, 3]))
Out[4]:
0       1
1    <NA>
2       3
dtype: UInt8
In [5]: repack.repack_series(pd.Series([-1, None, 3]))
Out[5]:
0      -1
1    <NA>
2       3
dtype: Int8
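The integer results above suggest a simple rule: take the minimum and maximum of the non-missing values and pick the narrowest type whose range contains both. The following pure-Python sketch illustrates that idea. The helper name smallest_int_dtype and the candidate ordering are assumptions made for this example, not the library's actual implementation.

```python
from typing import Iterable, Optional

# Candidate nullable integer dtypes with their inclusive ranges, narrowest
# first. At each width the unsigned type is tried before the signed one
# (an assumed ordering, chosen to match the examples above).
_INT_CANDIDATES = [
    ("UInt8", 0, 2**8 - 1),
    ("Int8", -(2**7), 2**7 - 1),
    ("UInt16", 0, 2**16 - 1),
    ("Int16", -(2**15), 2**15 - 1),
    ("UInt32", 0, 2**32 - 1),
    ("Int32", -(2**31), 2**31 - 1),
    ("UInt64", 0, 2**64 - 1),
    ("Int64", -(2**63), 2**63 - 1),
]


def smallest_int_dtype(values: Iterable[Optional[int]]) -> str:
    """Hypothetical helper: name the narrowest dtype that holds all values."""
    present = [v for v in values if v is not None]
    if not present:
        # Assumption: mirror the changelog note that all-NaN columns
        # shrink to Int8.
        return "Int8"
    lo, hi = min(present), max(present)
    for name, type_min, type_max in _INT_CANDIDATES:
        if type_min <= lo and hi <= type_max:
            return name
    raise OverflowError("values do not fit in a 64-bit integer")


print(smallest_int_dtype([1, None, 3]))
print(smallest_int_dtype([-1, None, 3]))
```

Missing values are skipped when computing the range, which is why a series containing None can still land on an 8-bit nullable type.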
The repack_frame() function applies the same shrinking to every column in your DataFrame and returns a new DataFrame.
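The column-by-column idea can be sketched with plain pandas. This is a rough stand-in for illustration, not the library's code: shrink_frame is a hypothetical name, it relies on pd.to_numeric with the downcast option, and it does not handle missing values or nullable dtypes the way the examples above do.

```python
import pandas as pd


def shrink_frame(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical sketch: downcast each numeric column, return a new frame."""
    shrunk = {}
    for name, col in df.items():
        if pd.api.types.is_integer_dtype(col):
            # Prefer an unsigned type when there are no negative values.
            kind = "unsigned" if (col >= 0).all() else "signed"
            shrunk[name] = pd.to_numeric(col, downcast=kind)
        elif pd.api.types.is_float_dtype(col):
            # Note: pandas accepts float32 when values match within a small
            # tolerance, so this step is not guaranteed to be exact.
            shrunk[name] = pd.to_numeric(col, downcast="float")
        else:
            # Non-numeric columns are passed through unchanged.
            shrunk[name] = col
    return pd.DataFrame(shrunk, index=df.index)


# Illustrative data, made up for this example.
df = pd.DataFrame(
    {
        "count": [1, 2, 3],
        "delta": [-1, 2, 3],
        "ratio": [1.5, 2.0, 3.0],
        "label": ["x", "y", "z"],
    }
)
small = shrink_frame(df)
print(small.dtypes)
```

The original frame is left untouched, and the caller receives a new one with narrower column types.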
- 0.1.3: Improve performance on float dtypes
- 0.1.2: Shrink columns with all NaNs to Int8
- 0.1.1: Fix Python support in package metadata to support 3.8.1 onwards
- 0.1.0: Migrate first version from the owid-catalog-py repo