Clean Answers over Dirty Databases: A Probabilistic Approach

Course Code: CS702

Course Project: Distributed Database Management System

Overview

The authors of [1] propose a complementary approach that permits declarative query answering over duplicated data, where each duplicate is associated with a probability of being in the clean database. This repository contains a simulation of their work [1], written as a Python [3] script. It rewrites queries over a database containing duplicates so that each answer is returned together with the probability that the answer is in the clean database.

Reference Dataset

Synthetic Data Generator, UIS Database Generator and Cora Dataset

Running the Simulator

The simulator script (simmulator.py in this repository) should be executed as

python simmulator.py

Simulator Command Format

Select Attribute1,Attribute2,...,AttributeN
   from Table1,Table2
      where condition1,condition2,...,conditionN
         groupBy Attribute1,...,AttributeN
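To make the command format concrete, the sketch below splits a command of that shape into its four clauses. It is an illustration only, not the parser used by the simulator script: the names `COMMAND_PATTERN`, `split_list`, and `parse_command` are hypothetical, and it assumes that keywords are case-insensitive and that list items are separated by commas as in the format above.

```python
import re

# Hypothetical pattern for: Select ... from ... [where ...] [groupBy ...]
# Assumption: keywords are matched case-insensitively; this is not taken
# from the simulator script itself.
COMMAND_PATTERN = re.compile(
    r"^\s*select\s+(?P<select>.+?)"
    r"\s+from\s+(?P<tables>.+?)"
    r"(?:\s+where\s+(?P<where>.+?))?"
    r"(?:\s+groupBy\s+(?P<group>.+?))?\s*$",
    re.IGNORECASE | re.DOTALL,
)


def split_list(text):
    """Split a comma-separated clause into trimmed items (empty list if absent)."""
    if not text:
        return []
    return [item.strip() for item in text.split(",") if item.strip()]


def parse_command(command):
    """Return the clauses of a simulator-style command as lists of strings."""
    match = COMMAND_PATTERN.match(command)
    if match is None:
        raise ValueError("command does not follow the Select/from/where/groupBy format")
    return {
        "select": split_list(match.group("select")),
        "from": split_list(match.group("tables")),
        "where": split_list(match.group("where")),
        "groupBy": split_list(match.group("group")),
    }


parsed = parse_command("Select id,prob from customer where balance>10 groupBy id")
print(parsed)
```

The where and groupBy clauses are optional in this sketch, so a command such as `select id from customer` parses with empty lists for both.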

Query Re-Writing Example

Dataset Snippet of Customer Table

id	custId	name	balance	prob
c1	m1	John	20	0.7
c1	m2	John	30	0.3
c2	m3	Mary	27	0.2
c2	m4	Marion	5	0.8
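In this snippet, rows that share an id are duplicates of the same customer, and their prob values add up to 1 within each id (0.7 + 0.3 for c1, 0.2 + 0.8 for c2). The following sketch checks that property on the rows above. It reads id as the identifier of a group of duplicates, which is an interpretation of the snippet, and the names `CUSTOMER_ROWS` and `cluster_probability_totals` are hypothetical.

```python
import math
from collections import defaultdict

# Rows copied from the Customer table snippet: (id, custId, name, balance, prob)
CUSTOMER_ROWS = [
    ("c1", "m1", "John", 20, 0.7),
    ("c1", "m2", "John", 30, 0.3),
    ("c2", "m3", "Mary", 27, 0.2),
    ("c2", "m4", "Marion", 5, 0.8),
]


def cluster_probability_totals(rows):
    """Sum the prob column per id (assumed to identify a group of duplicates)."""
    totals = defaultdict(float)
    for cluster_id, _cust_id, _name, _balance, prob in rows:
        totals[cluster_id] += prob
    return dict(totals)


totals = cluster_probability_totals(CUSTOMER_ROWS)

# Collect ids whose probabilities do not add up to 1 (within float tolerance).
invalid = [cid for cid, total in totals.items()
           if not math.isclose(total, 1.0, abs_tol=1e-9)]
print(totals, invalid)
```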

A normal SQL query that fetches the id of every customer with balance > 10 returns one row per matching duplicate:

select id,prob
   from customer
      where balance>10

id	prob
c1	0.7
c1	0.3
c2	0.2

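Applying the clean-answers rewriting instead sums the probabilities of the matching duplicates within each id, so each id is returned once, together with the probability that it is an answer in the clean database: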

select id,sum(prob)
   from customer
      where balance>10
         group by id

id	prob
c1	1.0
c2	0.2
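The rewritten query can be tried out on the four sample rows with an in-memory SQLite database. This is a minimal sketch for this single-table example, not the simulator's implementation: SQLite stands in for the database backend, and the function name `clean_answers` is hypothetical. Both c1 duplicates have balance > 10, so their probabilities 0.7 and 0.3 add up to 1.0, while only one c2 duplicate (prob 0.2) qualifies.

```python
import sqlite3

# Rows copied from the Customer table snippet: (id, custId, name, balance, prob)
CUSTOMER_ROWS = [
    ("c1", "m1", "John", 20, 0.7),
    ("c1", "m2", "John", 30, 0.3),
    ("c2", "m3", "Mary", 27, 0.2),
    ("c2", "m4", "Marion", 5, 0.8),
]


def clean_answers(rows, min_balance):
    """Run the rewritten query: sum prob per id over rows with balance > min_balance."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE customer "
        "(id TEXT, custId TEXT, name TEXT, balance INTEGER, prob REAL)"
    )
    conn.executemany("INSERT INTO customer VALUES (?, ?, ?, ?, ?)", rows)
    cursor = conn.execute(
        "SELECT id, SUM(prob) FROM customer "
        "WHERE balance > ? GROUP BY id ORDER BY id",
        (min_balance,),
    )
    # Round to hide floating-point noise in the summed probabilities.
    result = [(cid, round(total, 6)) for cid, total in cursor.fetchall()]
    conn.close()
    return result


answers = clean_answers(CUSTOMER_ROWS, 10)
print(answers)  # [('c1', 1.0), ('c2', 0.2)]
```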

References

[1] P. Andritsos, A. Fuxman, R.J. Miller, "Clean Answers over Dirty Databases: A Probabilistic Approach", Proceedings of the 22nd International Conference on Data Engineering, 2006.

[2] https://github.com/mysql/mysql-server

[3] https://www.python.org/
