There are several methods in the literature and in textbooks for treating missingness in data; a summary of these methods is shown in the figure below [1]. Each method has drawbacks when used for data mining, and one must be careful to avoid bias or the under- or over-estimation of variability.
Methods Used
- Imputation Using Constant Value
- Imputation Using k-NN
- BiScaler imputation
Zero or constant imputation, as the name suggests, replaces missing values with zero or with any specified constant. The figure below shows part of the data before and after imputation with a constant value; the constant used was 1.
• It works well in practice with categorical features.
• It is easy and fast.
• It does not consider the correlations between different observed features.
• It gives poor results on encoded categorical features.
• It does not consider the uncertainty in imputations.
• It can introduce bias in the data.
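The behaviour above can be sketched with scikit-learn's `SimpleImputer`; the small matrix here is illustrative only (it is not the dataset from the figure), and the constant 1 matches the value used there:

```python
import numpy as np
from sklearn.impute import SimpleImputer

# toy matrix with missing entries marked as np.nan (illustrative values)
X = np.array([[1.0, np.nan],
              [np.nan, 3.0],
              [4.0, 5.0]])

# replace every missing entry with the constant 1
imputer = SimpleImputer(strategy="constant", fill_value=1)
X_filled = imputer.fit_transform(X)
# every np.nan in X is now 1.0; observed entries are unchanged
```

Note that the same `fill_value` is applied to every feature, which is exactly why the method ignores correlations between features.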
The k-nearest neighbors (k-NN) algorithm is a simple classification method that uses feature similarity to predict the value of a new data point: the point is assigned a predicted value based on how closely it resembles its neighbor points in the training set. This is useful for imputation: find the k nearest neighbors of the observation with missing data, then impute the missing entries from the non-missing values in that neighborhood. The algorithm works by finding the k training samples at minimum distance from the query instance; a simple majority of these neighbors (or their average, for numeric features) is then taken to predict the value of the query instance. A graph for k-NN with 30 neighbors is shown in the figure below [2].
• It can be much more accurate than simple mean, median, or most-frequent imputation, depending on the dataset.
• It is computationally expensive because it works by storing the whole training dataset in memory.
• It is quite sensitive to outliers in the data.
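A minimal sketch of k-NN imputation using scikit-learn's `KNNImputer`; the matrix and `n_neighbors=2` are illustrative assumptions, not values from the text:

```python
import numpy as np
from sklearn.impute import KNNImputer

# toy matrix with missing entries (illustrative values)
X = np.array([[1.0, 2.0, np.nan],
              [3.0, 4.0, 3.0],
              [np.nan, 6.0, 5.0],
              [8.0, 8.0, 7.0]])

# each missing entry is filled with the mean of that feature over the
# k nearest rows, using a NaN-aware Euclidean distance over the
# coordinates both rows have observed
imputer = KNNImputer(n_neighbors=2)
X_filled = imputer.fit_transform(X)
```

Because distances are computed against the whole training set, this is the step that makes the method memory-hungry and sensitive to outliers.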
BiScaler is an iterative estimation of row and column means and standard deviations that yields a doubly normalized matrix [3]. It is not guaranteed to converge but works well in practice. It brings together two approaches, nuclear-norm-regularized matrix approximation [4] and maximum-margin matrix factorization [5], leading to an efficient algorithm for large-scale matrix factorization and completion that outperforms both.
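The alternating row/column normalization can be sketched in NumPy. This is a simplified illustration of the doubly-normalized idea, not the reference implementation from [3]: it repeatedly centers and scales rows and columns on the observed entries only, leaving missing entries as NaN for a downstream completion step:

```python
import numpy as np

def biscale(X, n_iter=20, tol=1e-6):
    """Alternately center and scale rows and columns, using only the
    observed (non-NaN) entries, until the matrix stops changing.
    Simplified sketch of the BiScaler idea; convergence is not guaranteed."""
    Z = np.asarray(X, dtype=float).copy()
    for _ in range(n_iter):
        Z_prev = Z.copy()
        # subtract row means, then column means (observed entries only)
        Z = Z - np.nanmean(Z, axis=1, keepdims=True)
        Z = Z - np.nanmean(Z, axis=0, keepdims=True)
        # divide by row stds, then column stds (guard against zero std)
        rs = np.nanstd(Z, axis=1, keepdims=True)
        rs[rs == 0] = 1.0
        Z = Z / rs
        cs = np.nanstd(Z, axis=0, keepdims=True)
        cs[cs == 0] = 1.0
        Z = Z / cs
        if np.nanmax(np.abs(Z - Z_prev)) < tol:
            break
    return Z

# illustrative matrix with two missing entries
X = np.array([[1.0, 2.0, np.nan],
              [4.0, np.nan, 6.0],
              [7.0, 8.0, 9.0]])
Z = biscale(X)
```

After normalization, the observed entries of each column have (approximately) unit standard deviation, which is the property the completion algorithms in [4] and [5] benefit from.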
[1] https://archive.ics.uci.edu/ml/datasets/human+activity+recognition+using+smartphones
[3] https://arxiv.org/pdf/1410.2596.pdf
[4] Rahul Mazumder, Trevor Hastie, and Rob Tibshirani. Spectral regularization algorithms for learning large incomplete matrices. Journal of Machine Learning Research, 11:2287–2322, 2010.
[5] Nathan Srebro, Jason Rennie, and Tommi Jaakkola. Maximum margin matrix factorization. Advances in Neural Information Processing Systems, 17, 2005.