This repository has been archived by the owner on May 14, 2018. It is now read-only.

Roadmap for testdat #1

Open
2 of 5 tasks
karthik opened this issue Nov 6, 2013 · 11 comments

Comments

@karthik
Owner

karthik commented Nov 6, 2013

Here's a quick roadmap for the package. The goal is to have a full test suite that folks can run on their tabular data to identify problems. These can be as common as unexpected UTF-8 characters, unintended spaces in cells, and malformed values (e.g., date patterns).

Right now I have a dozen or so ‘messy’ datasets to work with.

Basic functions to implement

  • Pattern matching. Test that the data in a vector matches a regex pattern
  • Length (all data in a vector are of a specified length).
  • Check for extra spaces
  • Check for missing values.
  • Check for outliers. Somewhat tricky, but a common use case would be identifying typos, e.g. the 17 in 1.5, 1.6, 1.98, 17.
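
As a rough illustration of the five checks listed above, here is a minimal base-R sketch. All function names are hypothetical and are not taken from testdat itself. The outlier rule (distance from the median measured in MADs, with a cutoff of 3) is just one assumed choice that happens to catch the typo example in the list.

```r
# Hypothetical helper names; none of these are confirmed testdat functions.

# TRUE for each element that matches the regex
matches_pattern <- function(x, pattern) {
  grepl(pattern, as.character(x))
}

# TRUE for each element with exactly `n` characters
has_length <- function(x, n) {
  nchar(as.character(x)) == n
}

# TRUE for each element with leading, trailing, or doubled whitespace
has_extra_spaces <- function(x) {
  grepl("^[[:space:]]|[[:space:]]$|[[:space:]]{2,}", as.character(x))
}

# TRUE for each element that is NA or an empty string
is_missing <- function(x) {
  is.na(x) | as.character(x) == ""
}

# TRUE for each element more than `cutoff` MADs away from the median.
# Assumed rule for illustration; it breaks down if the MAD is zero.
is_outlier <- function(x, cutoff = 3) {
  centre <- median(x, na.rm = TRUE)
  spread <- mad(x, na.rm = TRUE)
  abs(x - centre) / spread > cutoff
}

# Usage: which rows look like typos?
which(is_outlier(c(1.5, 1.6, 1.98, 17)))
```

Each helper returns a logical vector, so `which()` gives the offending row numbers and `any()` or `all()` turns a check into a pass/fail test.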

Question: Would it be worth implementing a set of matching functions to fix the issues as well? With code unit testing one can only identify problems and point out where fixes need to occur. Here we can actually go through and clean everything up.

@sckott
Contributor

sckott commented Feb 10, 2014

WRT your question: seems like yes to me.

I wonder if you could have the function that detects the errors collect them (e.g., in attributes element of an S3 object), then the function that fixes the errors simply pulls in that metadata of what errors to fix, fixes them, and returns the fixed dataset.
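
One way this detect-then-fix idea might look, as a sketch only: the detector stores what it found in an attribute of an S3-classed data frame, and the fixer reads that attribute back. The names `detect_spaces()`, `fix_dat()`, and the class `"dat_issues"` are assumptions made for illustration, and only one issue type (leading or trailing spaces) is handled.

```r
# Hypothetical names throughout: detect_spaces(), fix_dat(), class "dat_issues".

# Record leading/trailing whitespace problems in an "issues" attribute
detect_spaces <- function(data) {
  is_text <- vapply(data, is.character, logical(1))
  issues <- list()
  for (col in names(data)[is_text]) {
    rows <- which(grepl("^[[:space:]]|[[:space:]]$", data[[col]]))
    if (length(rows) > 0) {
      issues[[length(issues) + 1]] <-
        list(column = col, rows = rows, type = "extra_space")
    }
  }
  attr(data, "issues") <- issues
  class(data) <- c("dat_issues", class(data))
  data
}

# Read the recorded issues back and repair only the flagged cells
fix_dat <- function(checked) {
  issues <- attr(checked, "issues")
  # Drop the bookkeeping so a plain data frame is returned
  attr(checked, "issues") <- NULL
  class(checked) <- setdiff(class(checked), "dat_issues")
  for (issue in issues) {
    if (issue$type == "extra_space") {
      col <- checked[[issue$column]]
      col[issue$rows] <- trimws(col[issue$rows])
      checked[[issue$column]] <- col
    }
  }
  checked
}

# Usage
raw <- data.frame(site = c("A", " B", "C "), n = c(1, 2, 3),
                  stringsAsFactors = FALSE)
checked <- detect_spaces(raw)
clean <- fix_dat(checked)
```

Keeping the issue list alongside the data means the same record can drive the fix and serve as a log of what was changed.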

@sckott
Contributor

sckott commented Feb 10, 2014

WRT checking for outliers: A simple wrapper function around GGally::ggpairs might be useful for visually looking for outliers across any set of columns.
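
A thin wrapper along these lines might be enough to start with. `plot_pairs()` is a hypothetical name, GGally has to be installed separately, and defaulting to the numeric columns is an assumption rather than anything decided in this thread.

```r
# plot_pairs() is a hypothetical name; it only forwards to GGally::ggpairs().
plot_pairs <- function(data, columns = NULL) {
  if (!requireNamespace("GGally", quietly = TRUE)) {
    stop("Package 'GGally' is required for plot_pairs()")
  }
  if (is.null(columns)) {
    # Assumed default: every numeric column
    columns <- names(data)[vapply(data, is.numeric, logical(1))]
  }
  GGally::ggpairs(data, columns = columns)
}

# Usage (needs GGally installed):
# plot_pairs(iris, columns = c("Sepal.Length", "Sepal.Width"))
```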

@karthik
Owner Author

karthik commented Feb 10, 2014

Sounds like a great idea, @sckott. Feel free to add stuff to the package if you have any interest in working on it.

Interesting idea re: the S3 object. I'll think on it some more but seems like it could be useful in a provenance context. Like

original_data <- read.csv(...)
issues <- test_dat("Testing data for following issues", {
                             ...
                            })
clean_data <- fix_dat(original_data, issues)

But for first pass it might just be simple to have a few function calls to fix issues.

@karthik
Owner Author

karthik commented Feb 12, 2014

Just realized that testdat would become a programmatic equivalent of Google Refine (now OpenRefine).

@sckott
Contributor

sckott commented Feb 12, 2014

good analogy

@emhart

emhart commented Feb 15, 2014

WRT outliers, a useful addendum to @sckott's suggestion of a plot would be to add a set of criteria to establish outliers. These could be percentiles, or maybe standard deviations if it's normalish data. Then plot the outliers in different colors, perhaps even with numbers by the points so people can quickly identify the row number. I can try and whip something up. Here's a definition from NIST.
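
A minimal sketch of those two criteria, with flagged points colored and labeled by row number. `flag_outliers()` and `plot_outliers()` are hypothetical names, and the default cutoffs (2.5% tails, 2 standard deviations) are placeholders rather than recommendations. This is not the code from the gist mentioned later in the thread.

```r
# flag_outliers() and plot_outliers() are hypothetical names.

# TRUE for values outside the chosen percentile or mean +/- n_sd * sd limits
flag_outliers <- function(x, method = c("percentile", "sd"),
                          probs = c(0.025, 0.975), n_sd = 2) {
  method <- match.arg(method)
  if (method == "percentile") {
    limits <- quantile(x, probs = probs, na.rm = TRUE, names = FALSE)
  } else {
    limits <- mean(x, na.rm = TRUE) + c(-1, 1) * n_sd * sd(x, na.rm = TRUE)
  }
  # Missing values are never flagged as outliers
  !is.na(x) & (x < limits[1] | x > limits[2])
}

# Plot values against row number; outliers in red with their row number
plot_outliers <- function(x, ...) {
  out <- flag_outliers(x, ...)
  plot(seq_along(x), x, col = ifelse(out, "red", "grey40"), pch = 19,
       xlab = "Row number", ylab = "Value")
  if (any(out)) {
    text(which(out), x[out], labels = which(out), pos = 3, cex = 0.8)
  }
  invisible(which(out))
}

# Usage
x <- c(rep(10, 20), 50)
which(flag_outliers(x, method = "sd"))
```

Only the flagged points get a label, which keeps the plot readable when most rows are unremarkable.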

@emhart

emhart commented Feb 15, 2014

Looks like putting numbers near the points might be pretty cramped. Here's a gist that will create a plot. Is this what you had in mind (before I incorporate it into the package)? https://gist.github.com/emhart/9025719

@sckott
Contributor

sckott commented Feb 15, 2014

Looks good to me, @emhart. But you could just number the points that are outliers, right? Like the 5% outliers, or 2.5%, or whatever.

@emhart

emhart commented Feb 16, 2014

Yeah, it looks like just labeling the outliers would still be tight.

@sckott
Contributor

sckott commented Feb 16, 2014

We could try out an interactive shiny/rcharts version where you can hover over points to get their metadata?

@karthik
Owner Author

karthik commented Feb 16, 2014

Sounds great. I've jotted down all these for implementation.

