Impute

Impute.jl provides various methods for handling missing data in Vectors, Matrices and Tables.

Installation

julia> using Pkg; Pkg.add("Impute")

Quickstart

Let's start by loading our dependencies:

julia> using DataFrames, Impute

We'll also want some test data containing missings to work with:

julia> df = Impute.dataset("test/table/neuro") |> DataFrame
469×6 DataFrame
 Row │ V1         V2         V3       V4        V5         V6
     │ Float64?   Float64?   Float64  Float64?  Float64?   Float64?
─────┼───────────────────────────────────────────────────────────────
   1 │ missing       -203.7    -84.1      18.5  missing    missing
   2 │ missing       -203.0    -97.8      25.8      134.7  missing
   3 │ missing       -249.0    -92.1      27.8      177.1  missing
   4 │ missing       -231.5    -97.5      27.0      150.3  missing
   5 │ missing    missing     -130.1      25.8      160.0  missing
   6 │ missing       -223.1    -70.7      62.1      197.5  missing
   7 │ missing       -164.8    -12.2      76.8      202.8  missing
   8 │ missing       -221.6    -81.9      27.5      144.5  missing
  ⋮  │     ⋮          ⋮         ⋮        ⋮          ⋮          ⋮
 463 │    -242.6     -142.0    -21.8      69.8      148.7  missing
 464 │    -235.9     -128.8    -33.1      68.8      177.1  missing
 465 │ missing       -140.8    -38.7      58.1      186.3  missing
 466 │ missing       -149.5    -40.3      62.8      139.7      242.5
 467 │    -247.6     -157.8    -53.3      28.3      122.9      227.6
 468 │ missing       -154.9    -50.8      28.1      119.9      201.1
 469 │ missing       -180.7    -70.9      33.7      114.8      222.5
                                                     454 rows omitted

Our first instinct might be to drop all observations, but this leaves us too few rows to work with:

julia> Impute.filter(df; dims=:rows)
4×6 DataFrame
 Row │ V1       V2       V3       V4       V5       V6
     │ Float64  Float64  Float64  Float64  Float64  Float64
─────┼──────────────────────────────────────────────────────
   1 │  -247.0   -132.2    -18.8     28.2     81.4    237.9
   2 │  -234.0   -140.8    -56.5     28.0    114.3    222.9
   3 │  -215.8   -114.8    -18.4     65.3    171.6    249.7
   4 │  -247.6   -157.8    -53.3     28.3    122.9    227.6

We could try imputing the values with linear interpolation, but that still leaves missing data at the head and tail of our dataset:

julia> Impute.interp(df)
469×6 DataFrame
 Row │ V1           V2         V3       V4        V5         V6
     │ Float64?     Float64?   Float64  Float64?  Float64?   Float64?
─────┼───────────────────────────────────────────────────────────────────
   1 │ missing        -203.7     -84.1      18.5  missing    missing
   2 │ missing        -203.0     -97.8      25.8      134.7  missing
   3 │ missing        -249.0     -92.1      27.8      177.1  missing
   4 │ missing        -231.5     -97.5      27.0      150.3  missing
   5 │ missing        -227.3    -130.1      25.8      160.0  missing
   6 │ missing        -223.1     -70.7      62.1      197.5  missing
   7 │ missing        -164.8     -12.2      76.8      202.8  missing
   8 │ missing        -221.6     -81.9      27.5      144.5  missing
  ⋮  │      ⋮           ⋮         ⋮        ⋮          ⋮           ⋮
 463 │    -242.6      -142.0     -21.8      69.8      148.7      224.125
 464 │    -235.9      -128.8     -33.1      68.8      177.1      230.25
 465 │    -239.8      -140.8     -38.7      58.1      186.3      236.375
 466 │    -243.7      -149.5     -40.3      62.8      139.7      242.5
 467 │    -247.6      -157.8     -53.3      28.3      122.9      227.6
 468 │ missing        -154.9     -50.8      28.1      119.9      201.1
 469 │ missing        -180.7     -70.9      33.7      114.8      222.5
                                                         454 rows omitted

Finally, we can chain multiple simple methods together to give a complete dataset:

julia> Impute.interp(df) |> Impute.locf() |> Impute.nocb()
469×6 DataFrame
 Row │ V1        V2         V3       V4        V5        V6
     │ Float64?  Float64?   Float64  Float64?  Float64?  Float64?
─────┼────────────────────────────────────────────────────────────
   1 │ -233.6      -203.7     -84.1      18.5     134.7   222.7
   2 │ -233.6      -203.0     -97.8      25.8     134.7   222.7
   3 │ -233.6      -249.0     -92.1      27.8     177.1   222.7
   4 │ -233.6      -231.5     -97.5      27.0     150.3   222.7
   5 │ -233.6      -227.3    -130.1      25.8     160.0   222.7
   6 │ -233.6      -223.1     -70.7      62.1     197.5   222.7
   7 │ -233.6      -164.8     -12.2      76.8     202.8   222.7
   8 │ -233.6      -221.6     -81.9      27.5     144.5   222.7
  ⋮  │    ⋮          ⋮         ⋮        ⋮         ⋮         ⋮
 463 │ -242.6      -142.0     -21.8      69.8     148.7   224.125
 464 │ -235.9      -128.8     -33.1      68.8     177.1   230.25
 465 │ -239.8      -140.8     -38.7      58.1     186.3   236.375
 466 │ -243.7      -149.5     -40.3      62.8     139.7   242.5
 467 │ -247.6      -157.8     -53.3      28.3     122.9   227.6
 468 │ -247.6      -154.9     -50.8      28.1     119.9   201.1
 469 │ -247.6      -180.7     -70.9      33.7     114.8   222.5
                                                  454 rows omitted

Warning:

Your approach should depend on the properties of you data (e.g., MCAR, MAR, MNAR).
In-place calls aren't guaranteed to mutate the original data, but it will try avoid copying if possible. In the future, it may be possible to detect whether in-place operations are permitted on an array or table using traits:
- JuliaData/Tables.jl#116
- JuliaArrays/ArrayInterface.jl#22

Name		Name	Last commit message	Last commit date
Latest commit History 264 Commits
.github/workflows		.github/workflows
docs		docs
src		src
test		test
.codecov.yml		.codecov.yml
.gitignore		.gitignore
LICENSE.md		LICENSE.md
Project.toml		Project.toml
README.md		README.md

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

.github/workflows

.github/workflows

docs

docs

src

src

test

test

.codecov.yml

.codecov.yml

.gitignore

.gitignore

LICENSE.md

LICENSE.md

Project.toml

Project.toml

README.md

README.md

Repository files navigation

Impute

Installation

Quickstart

About

Releases 20

Packages

Contributors 16

Languages

License

invenia/Impute.jl

Folders and files

Latest commit

History

Repository files navigation

Impute

Installation

Quickstart

About

Resources

License

Stars

Watchers

Forks

Languages