invenia/Impute.jl

Impute

Impute.jl provides various methods for handling missing data in Vectors, Matrices and Tables.

Installation

julia> using Pkg; Pkg.add("Impute")

Quickstart

Let's start by loading our dependencies:

julia> using DataFrames, Impute

We'll also want some test data containing missings to work with:

julia> df = Impute.dataset("test/table/neuro") |> DataFrame
469×6 DataFrame
 Row │ V1         V2         V3       V4        V5         V6
     │ Float64?   Float64?   Float64  Float64?  Float64?   Float64?
─────┼───────────────────────────────────────────────────────────────
   1 │ missing       -203.7    -84.1      18.5  missing    missing
   2 │ missing       -203.0    -97.8      25.8      134.7  missing
   3 │ missing       -249.0    -92.1      27.8      177.1  missing
   4 │ missing       -231.5    -97.5      27.0      150.3  missing
   5 │ missing    missing     -130.1      25.8      160.0  missing
   6 │ missing       -223.1    -70.7      62.1      197.5  missing
   7 │ missing       -164.8    -12.2      76.8      202.8  missing
   8 │ missing       -221.6    -81.9      27.5      144.5  missing
  ⋮  │     ⋮          ⋮         ⋮        ⋮          ⋮          ⋮
 463 │    -242.6     -142.0    -21.8      69.8      148.7  missing
 464 │    -235.9     -128.8    -33.1      68.8      177.1  missing
 465 │ missing       -140.8    -38.7      58.1      186.3  missing
 466 │ missing       -149.5    -40.3      62.8      139.7      242.5
 467 │    -247.6     -157.8    -53.3      28.3      122.9      227.6
 468 │ missing       -154.9    -50.8      28.1      119.9      201.1
 469 │ missing       -180.7    -70.9      33.7      114.8      222.5
                                                     454 rows omitted

Our first instinct might be to drop every observation that contains a missing value, but this leaves us with too few rows to work with:

julia> Impute.filter(df; dims=:rows)
4×6 DataFrame
 Row │ V1       V2       V3       V4       V5       V6
     │ Float64  Float64  Float64  Float64  Float64  Float64
─────┼──────────────────────────────────────────────────────
   1 │  -247.0   -132.2    -18.8     28.2     81.4    237.9
   2 │  -234.0   -140.8    -56.5     28.0    114.3    222.9
   3 │  -215.8   -114.8    -18.4     65.3    171.6    249.7
   4 │  -247.6   -157.8    -53.3     28.3    122.9    227.6
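To see how much data row-wise filtering costs before committing to it, one option is to compare row counts before and after. The sketch below uses a small made-up table (the name `toy` and its values are hypothetical, not part of the package's test data) and the same `Impute.filter` call shown above:

```julia
using DataFrames, Impute

# Hypothetical toy table; column names and values are invented for illustration.
toy = DataFrame(a = [1.0, missing, 3.0, 4.0], b = [missing, 2.0, 3.0, 4.0])

# Keep only the rows that contain no missing values.
complete = Impute.filter(toy; dims=:rows)

# Fraction of the original rows that survive the filter.
retained = nrow(complete) / nrow(toy)
```

For the neuro dataset above, the same ratio is 4 of 469 rows, which is under 1% of the observations.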

We could try imputing the values with linear interpolation, but that still leaves missing data at the head and tail of our dataset:

julia> Impute.interp(df)
469×6 DataFrame
 Row │ V1           V2         V3       V4        V5         V6
     │ Float64?     Float64?   Float64  Float64?  Float64?   Float64?
─────┼───────────────────────────────────────────────────────────────────
   1 │ missing        -203.7     -84.1      18.5  missing    missing
   2 │ missing        -203.0     -97.8      25.8      134.7  missing
   3 │ missing        -249.0     -92.1      27.8      177.1  missing
   4 │ missing        -231.5     -97.5      27.0      150.3  missing
   5 │ missing        -227.3    -130.1      25.8      160.0  missing
   6 │ missing        -223.1     -70.7      62.1      197.5  missing
   7 │ missing        -164.8     -12.2      76.8      202.8  missing
   8 │ missing        -221.6     -81.9      27.5      144.5  missing
  ⋮  │      ⋮           ⋮         ⋮        ⋮          ⋮          ⋮
 463 │    -242.6      -142.0     -21.8      69.8      148.7      224.125
 464 │    -235.9      -128.8     -33.1      68.8      177.1      230.25
 465 │    -239.8      -140.8     -38.7      58.1      186.3      236.375
 466 │    -243.7      -149.5     -40.3      62.8      139.7      242.5
 467 │    -247.6      -157.8     -53.3      28.3      122.9      227.6
 468 │ missing        -154.9     -50.8      28.1      119.9      201.1
 469 │ missing        -180.7     -70.9      33.7      114.8      222.5
                                                         454 rows omitted
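The head-and-tail behaviour is easier to see on a short vector. This is a minimal sketch with a made-up input (the names `v` and `filled` are illustrative only): interior gaps sit between two observed values and can be interpolated, while the leading and trailing gaps have only one neighbour and are left alone.

```julia
using Impute

# Hypothetical input: one leading gap, one two-element interior gap, one trailing gap.
v = [missing, 1.0, missing, missing, 4.0, missing]

# Linear interpolation fills only the gap between 1.0 and 4.0.
filled = Impute.interp(v)

# The first and last elements have no observed value on one side,
# so they are still missing after interpolation.
still_missing = count(ismissing, filled)
```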

Finally, we can chain multiple simple methods together to give a complete dataset:

julia> Impute.interp(df) |> Impute.locf() |> Impute.nocb()
469×6 DataFrame
 Row │ V1        V2         V3       V4        V5        V6
     │ Float64?  Float64?   Float64  Float64?  Float64?  Float64?
─────┼────────────────────────────────────────────────────────────
   1 │   -233.6      -203.7     -84.1      18.5     134.7   222.7
   2 │   -233.6      -203.0     -97.8      25.8     134.7   222.7
   3 │   -233.6      -249.0     -92.1      27.8     177.1   222.7
   4 │   -233.6      -231.5     -97.5      27.0     150.3   222.7
   5 │   -233.6      -227.3    -130.1      25.8     160.0   222.7
   6 │   -233.6      -223.1     -70.7      62.1     197.5   222.7
   7 │   -233.6      -164.8     -12.2      76.8     202.8   222.7
   8 │   -233.6      -221.6     -81.9      27.5     144.5   222.7
  ⋮  │     ⋮           ⋮         ⋮        ⋮         ⋮        ⋮
 463 │   -242.6      -142.0     -21.8      69.8     148.7   224.125
 464 │   -235.9      -128.8     -33.1      68.8     177.1   230.25
 465 │   -239.8      -140.8     -38.7      58.1     186.3   236.375
 466 │   -243.7      -149.5     -40.3      62.8     139.7   242.5
 467 │   -247.6      -157.8     -53.3      28.3     122.9   227.6
 468 │   -247.6      -154.9     -50.8      28.1     119.9   201.1
 469 │   -247.6      -180.7     -70.9      33.7     114.8   222.5
                                                  454 rows omitted
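The division of labour in that chain can be sketched on a small vector (a made-up input; `v` and `filled` are illustrative names). Interpolation handles the interior gap, last observation carried forward (`locf`) fills the tail, and next observation carried backward (`nocb`) fills the head:

```julia
using Impute

# Hypothetical input with a leading, an interior, and a trailing gap.
v = [missing, 1.0, missing, 3.0, missing]

# Same pipeline as above, applied to a vector instead of a table.
filled = Impute.interp(v) |> Impute.locf() |> Impute.nocb()

# No missing values should remain once all three steps have run.
complete = !any(ismissing, filled)
```

As in the table output above, the element type still allows missing values after imputation, even though none remain.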

Warning:

  • Your approach should depend on the properties of your data (e.g., whether values are missing completely at random (MCAR), missing at random (MAR), or missing not at random (MNAR)).
  • In-place calls aren't guaranteed to mutate the original data, but they will try to avoid copying where possible. In the future, it may be possible to detect whether in-place operations are permitted on an array or table using traits:
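Until such detection exists, one cautious way to act on the in-place caveat in the second point is to use the value an in-place call returns rather than assuming the original binding was mutated. This is a sketch of that habit, not documented guidance from the package; it assumes the in-place variant `Impute.interp!` returns the imputed data, and the vector `v` is made up for illustration.

```julia
using Impute

# Hypothetical input with a single interior gap.
v = [1.0, missing, 3.0]

# Rebind to the returned value: this is correct whether or not
# the call was able to mutate `v` without copying.
v = Impute.interp!(v)
```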