Skip to content

alan-turing-institute/ptype

Repository files navigation

build-publish on release

build on develop

PyPI version

Documentation status

Downloads

Binder

Introduction

ptype is a probabilistic approach to type inference, which is the task of identifying the data type (e.g. Boolean, date, integer or string) of a given column of data.

Existing approaches often fail on type inference for messy datasets where data is missing or anomalous. With ptype, our goal is to develop a robust method that can deal with such data.

Normal, missing and anomalous values are denoted by green, yellow and red, respectively in the right hand figure.

Normal, missing and anomalous values are denoted by green, yellow and red, respectively in the right hand figure.

ptype uses Probabilistic Finite-State Machines (PFSMs) to model known data types, missing and anomalous data. Given a column of data, we can infer a plausible column type, and also identify any values which (conditional on that type) are deemed missing or anomalous. In contrast to more familiar finite-state machines, such as regular expressions, that either accept or reject a given data value, PFSMs assign probabilities to different values. They therefore offer the advantage of generating weighted predictions when a column of messy data is consistent with more than one type assignment.

If you use this package, please cite the ptype paper, using the following BibTeX entry:

@article{ceritli2020ptype,
  title={ptype: probabilistic type inference},
  author={Ceritli, Taha and Williams, Christopher KI and Geddes, James},
  journal={Data Mining and Knowledge Discovery},
  year={2020},
  volume = {34},
  number = {3},
  pages={870–-904},
  doi = {10.1007/s10618-020-00680-1},
}

Install requirements

You can simply install ptype from PyPI:

pip install ptype

Usage

See demo notebooks in notebooks folder. View them online via Binder.