Skip to content
Oliver Kennedy edited this page Aug 13, 2015 · 18 revisions

Overview of the data

Here, we have some messy data. We have a "product information" dataset and have got several "product review" datasets. The product information dataset is missing some critical information, and the product review datasets have different schemas... and some mistyped data too, but let's leave that off for now.

Loading the data

Drag CSV files into the database

I have product data in my database already. I've just loaded the ratings data into my database, creating some new tables.

  • ratings1(pid, rating, review_ct)
  • ratings2(pid, evaluation, num_ratings)

CSV files don't store type information. Everything that got loaded in is a string, not something that we want to work with. The first thing we need to do is assign types. Mimir can do this for you, making some educated guesses based on the data.

** Create type inference lenses for ratings1 and ratings2 named ratings1typed and ratings2typed respectively **

Show the loaded tables

Mimir was able to make some educated guesses. For example, it currently thinks that review_ct is a number, based on the contents of the column. When we query the data, it will try to cast all of the data in that field to the type it guessed... I hope it guessed right.

Then the tables were created in the database, they look like:

  • typedratings1(pid number, rating number, review_ct number)
  • typedratings2(pid number, evaluation number, num_ratings number)

Cleaning the data

We know a few things about the dataset. We know that the product dataset is missing some attributes, and we know that the product review datasets need to be linked together under a common schema. Let's handle these one at a time.

Create a Schema-Matching lens for the product review dataset(s) named ratings2matched

We just merged the review datasets. The database is making some assumptions about how to merge things together. These are best-guess efforts, so I really hope that none of them turn out wrong... Still, let's try getting all of the ratings in one go.

SELECT * FROM ratings2matched UNION ALL SELECT * FROM rating1complete

Create a Domain-Constraint-Repair lens for the product table named productcomplete

We just cleaned the product dataset. Magic, right?

SELECT * FROM productcomplete, (SELECT * FROM rating2complete union all select * from rating1complete) ratings WHERE product.id = ratings.pid;

Cool. Results... oh, but wait, what are these red records?

Mouseover

Oh... this result is based on a guess about rating in the product data. And this one is based on a guess about the type of product, and this one on the review...

Fixing the data

Demo the 'fix' button