viraltux/DataWrangler.jl

DataWrangler

Data wrangling refers to a number of processes designed to clean and transform data into analytics-ready datasets.

This package provides the following functionality to wrangle data:

  • Box-Cox and inverse Box-Cox transformation and estimation: boxcox, iboxcox
  • Data imputation (loess inter/extra-polation, random local density): impute, impute!
  • Data normalization (z-score, min-max, softmax, sigmoid): normalize, normalize!
  • Finite lagged difference and partial difference and its inverse: d, p
  • Outlier detection and removal: outlie, outlie!
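
To make the first item concrete, here is a minimal plain-Julia sketch of the textbook one-parameter Box-Cox transform and its inverse. The names `boxcox_sketch` and `iboxcox_sketch` are hypothetical and exist only for this illustration. They are not the package's `boxcox` and `iboxcox`, whose signatures and λ estimation are not documented in this README.

```julia
# Textbook one-parameter Box-Cox transform for strictly positive data.
# Hypothetical helper for illustration, NOT DataWrangler's own `boxcox`.
function boxcox_sketch(x::AbstractVector{<:Real}, λ::Real)
    all(>(0), x) || throw(DomainError(x, "Box-Cox requires strictly positive values"))
    # λ == 0 is the limiting case, which reduces to the natural logarithm
    return λ == 0 ? log.(x) : (x .^ λ .- 1) ./ λ
end

# Inverse of the transform above, for the same λ.
# Hypothetical helper for illustration, NOT DataWrangler's own `iboxcox`.
function iboxcox_sketch(z::AbstractVector{<:Real}, λ::Real)
    return λ == 0 ? exp.(z) : (λ .* z .+ 1) .^ (1 / λ)
end

x = [0.5, 1.0, 2.0, 4.0]
z = boxcox_sketch(x, 0.5)        # transform with λ = 0.5
x_back = iboxcox_sketch(z, 0.5)  # round trip recovers x up to rounding
```

Here λ is fixed by hand; choosing it from the data (the "estimation" part of the feature list) is left to the package.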

Examples

Data Imputation

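using DataWrangler, Plots

# Noisy sine wave sampled at n sorted random points in [0, 2π]
n = 1000
x = sort(rand(n))*2*pi;
y = Array{Union{Missing,Float64}}(undef,n);
y[:] = sin.(x).+randn(n)/10;

# Blank out four index ranges to simulate gaps in the data
mid = vcat(100:150,300:350,600:650,950:1000);
y[mid] .= missing;
scatter(x,y; label="data")

# Impute the gaps with method = "normal"
ipy = impute(x,y; method = "normal")
scatter!(x[mid],ipy[mid]; label = "imputed 'normal'", color=:white)

# Impute the gaps with the default method (labelled 'loess' in the plot)
ipy = impute(x,y)
scatter!(x[mid],ipy[mid]; label = "imputed 'loess'", color=:black, markersize = 2)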

Time Series Outlier Detection

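using DataWrangler, Plots

# Noisy sine wave sampled at n sorted random points in [0, 2π]
n = 1000
x = sort(rand(n))*2*pi;
y = Array{Union{Missing,Float64}}(undef,n);
y[:] = sin.(x).+randn(n)/10;

# Shift four index ranges upward with extra noise to create outliers
mid = vcat(100:150,300:350,600:650,950:1000);
y[mid] .= y[mid] .+ 2*(randn(length(mid)).+1);

# The result of outlie is used below as indices into y
outlist = outlie(x,y)
scatter(outlist, y[outlist]; color="blue", label="outliers", ms=6)
scatter!(y; color="lightblue", label="data")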
