GitHub - alan-turing-institute/ptype: Probabilistic type inference

1 Introduction

Contents

1 Introduction
2 Install requirements
3 Usage

ptype is a probabilistic approach to type inference, which is the task of identifying the data type (e.g. Boolean, date, integer or string) of a given column of data.

Existing approaches often fail on type inference for messy datasets where data is missing or anomalous. With ptype, our goal is to develop a robust method that can deal with such data.

https://raw.githubusercontent.com/alan-turing-institute/ptype/release/notes/motivation.png

Normal, missing and anomalous values are denoted by green, yellow and red, respectively in the right hand figure.

ptype uses Probabilistic Finite-State Machines (PFSMs) to model known data types, missing and anomalous data. Given a column of data, we can infer a plausible column type, and also identify any values which (conditional on that type) are deemed missing or anomalous. In contrast to more familiar finite-state machines, such as regular expressions, that either accept or reject a given data value, PFSMs assign probabilities to different values. They therefore offer the advantage of generating weighted predictions when a column of messy data is consistent with more than one type assignment.

If you use this package, please cite the ptype paper, using the following BibTeX entry:

@article{ceritli2020ptype,
  title={ptype: probabilistic type inference},
  author={Ceritli, Taha and Williams, Christopher KI and Geddes, James},
  journal={Data Mining and Knowledge Discovery},
  year={2020},
  volume = {34},
  number = {3},
  pages={870–-904},
  doi = {10.1007/s10618-020-00680-1},
}

2 Install requirements

You can simply install ptype from PyPI:

pip install ptype

3 Usage

See demo notebooks in notebooks folder. View them online via Binder.

Name		Name	Last commit message	Last commit date
Latest commit History 1,150 Commits
.github/workflows		.github/workflows
.idea		.idea
annotations		annotations
data		data
docs		docs
notebooks		notebooks
notes		notes
ptype		ptype
script		script
tests		tests
.bumpversion.cfg		.bumpversion.cfg
.gitignore		.gitignore
.readthedocs.yml		.readthedocs.yml
LICENSE		LICENSE
Manifest.in		Manifest.in
README.rst		README.rst
conftest.py		conftest.py
linecount.sh		linecount.sh
linecount.xlsx		linecount.xlsx
requirements.txt		requirements.txt
setup.py		setup.py

License

alan-turing-institute/ptype

Folders and files

Latest commit

History

Repository files navigation

About

Topics

Resources

License

Stars

Watchers

Forks

Languages