
Heterogeneous Distance Functions

The presence of heterogeneous data, comprising both continuous and categorical features, is a frequent challenge that researchers need to handle in real-world domains. Essentially, there are three main ways to handle heterogeneous data:

  1. Ignore the nature of the data: This is ultimately the worst approach, as standard distance functions (e.g., Euclidean) are not suited to all types of data. For instance, in the case of nominal features, arithmetic operations over unordered categories are meaningless.

  2. Transform features so that they are represented on the same scale:

    • Continuous features may be discretised into categories. However, determining an adequate number of categories for the discretisation of continuous features is not trivial. Moreover, if the resulting categories are treated as nominal, the order information is lost.
    • Categorical features may be transformed to binary, using a 1/0 encoding (one-hot encoding) for each existing category (which allows arithmetic operations over values). However, one-hot encoding may significantly increase data dimensionality, which adds time and memory complexity to most standard classifiers.
  3. Consider heterogeneous distance functions: These distance functions directly handle different types of features (i.e., they do not require data transformation), thus avoiding the problems described above.
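
For intuition, the per-feature distance at the heart of HEOM (Wilson & Martinez, 1997; reference 3 below) can be sketched in MATLAB as follows. The function name and signature are illustrative only; the repository's actual implementations live in the distances folder:

function d = heomFeatureDist(x, y, isCategorical, featRange)
    % Per-feature HEOM distance (Wilson & Martinez, 1997):
    %   1                     if either value is missing,
    %   overlap (0/1)         for categorical features,
    %   |x - y| / featRange   for continuous features.
    if isnan(x) || isnan(y)
        d = 1;                         % missing values: maximal distance
    elseif isCategorical
        d = double(x ~= y);            % overlap: 0 if equal, 1 otherwise
    else
        d = abs(x - y) / featRange;    % range-normalised difference
    end
end

The overall HEOM distance between two patterns is the square root of the sum of these squared per-feature distances. Note that the missing-value branch follows the original definition (missing values are maximally distant); the redefinitions discussed below adjust precisely how missing values enter this computation.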

However, real-world domains are also often affected by missing data, i.e., absent information in datasets for which the respective values are unknown. Accordingly, there is yet another factor that needs to be accounted for in the development of heterogeneous distance functions: the incorporation of missing data in the distance computation. This allows all available information to be used for imputation, without discarding any patterns or values. It also allows the presence of missing data itself to be reflected in the distance computation, i.e., the uncertainty of the missing values can be accounted for: patterns with missing values in the same feature can be considered closer (more similar), farther apart (less similar), or evaluated according to intermediate strategies.

This repository provides a set of heterogeneous distance functions (implemented in MATLAB) able to address complex scenarios involving heterogeneous data – continuous and categorical (nominal and binary) features – and missing data, where the absent values themselves are incorporated in the distance computation:

  • HEOM: Heterogeneous Euclidean-Overlap Metric
  • HVDM: Heterogeneous Value Difference Metric
  • HEOM-R, HVDM-R, and HVDM-S: Redefinitions of HEOM and HVDM
  • SIMDIST: Similarity for Heterogeneous Data
  • MDE: Mean Euclidean Distance

Machine Learning Applications

The distance functions included in this repository may be applied across a wide range of machine learning fields that operate on similarity/distance between patterns.

Distance Functions Scope

Below is an overview of machine learning areas relying on distance metric learning and how heterogeneous distance functions may be applied in each context:

Machine Learning Area | Sub-area | Methodology
--- | --- | ---
Data Classification | Neural Networks | Distance functions are embedded in the operation of algorithms (e.g., radial basis function networks, self-organising maps).
Data Classification | Instance-Based Learning | Also referred to as nearest-neighbour techniques, memory-based reasoning, or case-based reasoning methods. These systems use a distance function to determine the similarity between a new pattern and the training data, and use the nearest instance(s) to predict the target class.
Data Clustering | – | Clusters are found by identifying similar patterns. A suitable cluster solution comprises groups whose members have small distances among each other.
Data Preprocessing | Data Resampling | Resampling approaches – undersampling and oversampling – use distance functions to analyse the neighbourhood of training examples and determine which patterns to clean or replicate.
Data Preprocessing | Instance Selection | Prototype Selection and Instance Selection methods use an instance-based classifier (commonly kNN) with a distance function to obtain a representative subset of the original training data.
Data Preprocessing | Dimensionality Reduction | Distance functions are used as input to well-known dimensionality reduction algorithms, such as Multidimensional Scaling (MDS) or t-distributed Stochastic Neighbour Embedding (t-SNE).
Data Preprocessing | Data Imputation | Distance functions are used in kNN imputation as well as in other imputation algorithms that operate on distances among patterns (e.g., NN, SOM, cluster-based imputation). They can also be used as intermediate steps to improve other imputation approaches (e.g., via instance selection). Absent values of a given pattern are estimated using the available information of its closest neighbours.
Data Analysis and Meta-Learning | Data Complexity | Distance functions underlie several well-established complexity measures and instance hardness estimators (e.g., N1, N2, N3, T1, LSC, CM, R-value, kDN, among others).
Data Analysis and Meta-Learning | Data Typology | Depending on their local neighbourhoods, examples may be categorised as safe, borderline, rare, or outlier examples. Using distinct distance functions may result in different categorisations of examples (e.g., safe examples become borderline).
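
As an illustration of the imputation row above, the following sketch imputes missing entries from the k nearest neighbours given a precomputed heterogeneous distance matrix D. The function name and the mean/mode imputation rule are assumptions for illustration, not the repository's API:

function X = knnImputeFromD(X, D, feature_types, k)
    % Illustrative kNN imputation from a precomputed distance matrix D
    % (patterns x patterns). Assumes NaN marks a missing value;
    % continuous features are imputed by the neighbours' mean,
    % categorical ones by their mode. No guard is included for the
    % degenerate case where no neighbour observes the feature.
    n = size(X, 1);
    for i = 1:n
        missing = find(isnan(X(i, :)));
        if isempty(missing), continue; end
        [~, order] = sort(D(i, :));        % neighbours by increasing distance
        order(order == i) = [];            % exclude the pattern itself
        for j = missing
            donors = X(order, j);              % values in neighbour order
            donors = donors(~isnan(donors));   % keep observed values only
            vals = donors(1:min(k, numel(donors)));
            if feature_types(j) == 1
                X(i, j) = mode(vals);      % categorical: majority value
            else
                X(i, j) = mean(vals);      % continuous: neighbour average
            end
        end
    end
end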

Example of use:

The considered distance functions are implemented in the distances folder. The arff-to-mat folder contains a set of files to read .arff files, whereas the data folder contains some .arff files for testing.

Overall, the distance functions require the following arguments:

  • X: matrix of data (patterns x features);
  • T: column vector of classes;
  • feature_types: 1/0 row vector indicating whether the feature in column i is categorical (1) or continuous (0).

Consider the thoracic.arff dataset provided in the data folder, containing continuous features (PRE4, PRE5, AGE) and categorical features (the remaining), both binary and nominal. The output distance matrix D returns the distances between every pair of patterns according to the HEOM distance.

% Add the required folders to the MATLAB path
addpath('arff-to-mat');
addpath('distances');
addpath('data');

% Read the .arff file into data matrix, class labels, and feature types
data = arff2double('thoracic.arff');
X = data.X;                      % data matrix (patterns x features)
T = data.Y;                      % column vector of classes
feature_types = data.isNomBin;   % 1 = categorical, 0 = continuous

% Pairwise HEOM distances between all patterns
D = heomDist(X, feature_types, 'original');

In this case, the argument T is not necessary, but it is required by other distance functions, such as HVDM.

D = hvdmDist(X, T, feature_types, 'original');

The original implementations of HEOM and HVDM are selected with the original parameter. Their redefinitions (HEOM-R and HVDM-R) are selected with the redef parameter. In the case of HVDM-S, a different function is called.

D = heomDist(X, feature_types, 'redef'); % HEOM-R
D = hvdmDist(X, T, feature_types, 'redef'); % HVDM-R
D = hvdmSpecDist(X, T, feature_types); % HVDM-S

Note that although thoracic.arff is originally complete (i.e., it contains no missing data), the heterogeneous functions can also handle missing values internally. In the following example, 20% of the dataset X is set to missing completely at random (i.e., following an MCAR mechanism). The distance computation is then performed without changing any parameters in the previous code:

% Insert MVs in X
p = 20; % percentage of missing data
X = MCAR(X, p);

D = heomDist(X, feature_types, 'original');
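
For reference, MCAR insertion amounts to blanking a fixed percentage of cells uniformly at random. A minimal sketch is given below; it is an illustrative stand-in and may differ in detail from the repository's MCAR implementation:

function Xm = mcarInsert(X, p)
    % Blank p% of the entries of X uniformly at random (MCAR mechanism).
    % NaN is assumed to encode a missing value.
    Xm = X;
    nCells = numel(X);
    nMissing = round(nCells * p / 100);   % number of cells to blank
    idx = randperm(nCells, nMissing);     % random cell positions
    Xm(idx) = NaN;
end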

The main.m file demonstrates the basic usage of the different heterogeneous functions. To submit bugs and feature requests, please open an issue on the project page.

Citation Request:

If you plan to use this library, please consider citing the following papers:

@article{Santos2020a,
  title={How distance metrics influence missing data imputation with k-nearest neighbours},
  author={Santos, Miriam Seoane and Abreu, Pedro Henriques and Wilk, Szymon and Santos, Jo{\~a}o},
  journal={Pattern Recognition Letters},
  volume={136},
  pages={111--119},
  year={2020},
  publisher={Elsevier}
}
@inproceedings{Santos2020b,
  title={Assessing the Impact of Distance Functions on K-Nearest Neighbours Imputation of Biomedical Datasets},
  author={Santos, Miriam S and Abreu, Pedro H and Wilk, Szymon and Santos, Jo{\~a}o},
  booktitle={International Conference on Artificial Intelligence in Medicine},
  pages={486--496},
  year={2020},
  organization={Springer}
}

References

  1. Santos, M. S., Abreu, P. H., Wilk, S., & Santos, J. (2020). How distance metrics influence missing data imputation with k-nearest neighbours. Pattern Recognition Letters, 136, 111-119.
  2. Santos, M. S., Abreu, P. H., Wilk, S., & Santos, J. (2020, August). Assessing the Impact of Distance Functions on K-Nearest Neighbours Imputation of Biomedical Datasets. In International Conference on Artificial Intelligence in Medicine (pp. 486-496). Springer, Cham.
  3. Wilson, D. R., & Martinez, T. R. (1997). Improved heterogeneous distance functions. Journal of Artificial Intelligence Research, 6, 1-34.
  4. Juhola, M., & Laurikkala, J. (2007). On metricity of two heterogeneous measures in the presence of missing values. Artificial Intelligence Review, 28(2), 163-178.
  5. Belanche, L. A. (2012). Similarity networks for heterogeneous data. In Proceedings of the 20th European Symposium on Artificial Neural Networks (ESANN 2012).
  6. AbdAllah, L., & Shimshoni, I. (2016, July). k-means over incomplete datasets using mean Euclidean distance. In International Conference on Machine Learning and Data Mining in Pattern Recognition (pp. 113-127). Springer, Cham.