ptype is a probabilistic approach to type inference, which is the task of identifying the data type (e.g. Boolean, date, integer or string) of a given column of data.
Existing approaches often fail on type inference for messy datasets where data is missing or anomalous. With ptype, our goal is to develop a robust method that can deal with such data.
Normal, missing and anomalous values are denoted by green, yellow and red, respectively, in the right-hand figure.
ptype uses Probabilistic Finite-State Machines (PFSMs) to model known data types, missing and anomalous data. Given a column of data, we can infer a plausible column type, and also identify any values which (conditional on that type) are deemed missing or anomalous. In contrast to more familiar finite-state machines, such as regular expressions, that either accept or reject a given data value, PFSMs assign probabilities to different values. They therefore offer the advantage of generating weighted predictions when a column of messy data is consistent with more than one type assignment.
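To make the contrast with accept/reject matching concrete, here is a minimal toy sketch in Python. It is not ptype's implementation or API: the `PFSM` class, the four tiny deterministic machines, the `WEIGHTS` mixture and the helper functions are all hypothetical names and values chosen for illustration. Each machine assigns a probability to a string, and a column is scored under each candidate type as a mixture of "normal", "missing" and "anomalous" explanations.

```python
import math
import string


class PFSM:
    """Toy deterministic probabilistic finite-state machine (illustrative only)."""

    def __init__(self, start, transitions, stop):
        self.start = start
        self.transitions = transitions  # {(state, char): (next_state, probability)}
        self.stop = stop                # {state: probability of stopping there}

    def prob(self, value):
        """Probability that the machine generates exactly `value`."""
        state, p = self.start, 1.0
        for ch in value:
            if (state, ch) not in self.transitions:
                return 0.0  # no path: the machine gives this string zero probability
            state, step = self.transitions[(state, ch)]
            p *= step
        return p * self.stop.get(state, 0.0)


# Integer: one or more digits; after the first digit, stop with probability 0.5.
integer = PFSM(
    "s0",
    {**{("s0", d): ("s1", 0.1) for d in string.digits},
     **{("s1", d): ("s1", 0.05) for d in string.digits}},
    {"s1": 0.5},
)
# Boolean: exactly "0" or "1", equally likely.
boolean = PFSM("s0", {("s0", "0"): ("s1", 0.5), ("s0", "1"): ("s1", 0.5)}, {"s1": 1.0})
# Missing: the single marker "NA" (an assumed encoding of missing data).
missing = PFSM("s0", {("s0", "N"): ("s1", 1.0), ("s1", "A"): ("s2", 1.0)}, {"s2": 1.0})
# Anomaly: any printable ASCII string, with low probability per character.
anomaly = PFSM(
    "s0",
    {("s0", c): ("s0", 0.9 / len(string.printable)) for c in string.printable},
    {"s0": 0.1},
)

# Assumed mixture weights for (normal, missing, anomalous) values.
WEIGHTS = (0.90, 0.05, 0.05)


def mixture(value, machine):
    """Weighted probability of `value` under each explanation, given a type."""
    return {
        "normal": WEIGHTS[0] * machine.prob(value),
        "missing": WEIGHTS[1] * missing.prob(value),
        "anomalous": WEIGHTS[2] * anomaly.prob(value),
    }


def column_posterior(column, types):
    """Posterior over candidate types under a uniform prior.

    Assumes every value is printable ASCII, so the anomaly term is never zero.
    """
    log_lik = {
        name: sum(math.log(sum(mixture(v, m).values())) for v in column)
        for name, m in types.items()
    }
    top = max(log_lik.values())
    unnorm = {name: math.exp(ll - top) for name, ll in log_lik.items()}
    total = sum(unnorm.values())
    return {name: u / total for name, u in unnorm.items()}


def label(value, machine):
    """Most probable explanation for one value, conditional on the column type."""
    scores = mixture(value, machine)
    return max(scores, key=scores.get)


column = ["0", "1", "1", "NA", "0", "x7"]
posterior = column_posterior(column, {"integer": integer, "boolean": boolean})
labels = [label(v, boolean) for v in column]
```

In this toy setup the column is consistent with both integer and Boolean, but the Boolean machine concentrates its probability on "0" and "1", so it receives most of the posterior weight; conditional on that type, "NA" is labelled missing and "x7" anomalous. A regular expression could only report that "x7" fails to match.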
If you use this package, please cite the ptype paper, using the following BibTeX entry:
@article{ceritli2020ptype,
title={ptype: probabilistic type inference},
author={Ceritli, Taha and Williams, Christopher KI and Geddes, James},
journal={Data Mining and Knowledge Discovery},
year={2020},
volume = {34},
number = {3},
pages={870--904},
doi = {10.1007/s10618-020-00680-1},
}
You can simply install ptype from PyPI:
pip install ptype
See the demo notebooks in the notebooks folder. You can also view them online via Binder.