UIC-InDeXLab/RepresentationBias-ContinuousCoverage

Representation Bias Identification

Identifying Insufficient Data Coverage for Ordinal Continuous-Valued Attributes

Abstract

Appropriate training data is a requirement for building good machine-learned models. In this project, we study the notion of coverage for ordinal and continuous-valued attributes, by formalizing the intuition that the learned model can accurately predict only at data points for which there are "enough" similar data points in the training data set. We develop an efficient algorithm to identify uncovered regions in low-dimensional attribute feature space, by making a connection to Voronoi diagrams. We also develop a randomized approximation algorithm for use in high-dimensional attribute space.
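To make the notion of coverage concrete, here is a minimal brute-force sketch. It assumes a query point counts as covered when at least k training points lie within Euclidean distance rho of it, which matches the `-k` and `-r` options listed later but is an assumption about the exact definition. The class and method names (`CoverageSketch`, `isCovered`) are hypothetical and are not part of this repository, and the linear scan shown here is not the Voronoi-based algorithm the project implements.

```java
public class CoverageSketch {

    // Euclidean distance between two points of equal dimension.
    static double distance(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            sum += d * d;
        }
        return Math.sqrt(sum);
    }

    // Assumed definition: q is covered if at least k data points
    // lie within distance rho of it. Linear scan over all points.
    static boolean isCovered(double[][] data, double[] q, int k, double rho) {
        if (k <= 0) {
            return true; // no neighbors required
        }
        int neighbors = 0;
        for (double[] p : data) {
            if (distance(p, q) <= rho) {
                neighbors++;
                if (neighbors >= k) {
                    return true; // enough similar points found, stop early
                }
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // Toy data set: three nearby points and one isolated point.
        double[][] data = {
            {0.10, 0.10}, {0.12, 0.11}, {0.11, 0.13}, {0.90, 0.90}
        };
        // Query near the cluster: three neighbors within rho = 0.05.
        System.out.println(isCovered(data, new double[] {0.11, 0.11}, 3, 0.05));
        // Query near the isolated point: only one neighbor within rho.
        System.out.println(isCovered(data, new double[] {0.88, 0.90}, 3, 0.05));
    }
}
```

A scan like this costs one pass over the data per query point, which is why identifying entire uncovered regions calls for the more efficient algorithms described above.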

Publications to cite:

[1] Abolfazl Asudeh, Nima Shahbazi, Zhongjun Jin, H. V. Jagadish. Identifying Insufficient Data Coverage for Ordinal Continuous-Valued Attributes. SIGMOD, 2021, ACM.

Getting Started

These instructions will get you a copy of the project up and running on your local machine for development and testing purposes.

Prerequisites

Building and running the project requires a Java Development Kit (JDK) and Maven, since the commands below use mvn.

Installing (Console)

In a console, run:

mvn clean install

Installing (Eclipse)

In Eclipse or another IDE, all packages should be installed automatically once the project is imported.

Running the tests

The tests can be run from the console with Maven or from an IDE such as Eclipse.

Command line arguments

Command line arguments to use when running test scripts

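| Option | Description | Takes arguments | Allows multiple values |
|--------|-------------|-----------------|------------------------|
| -a | selected attribute values | Yes | Yes |
| -e | epsilon values | Yes | Yes |
| -h | show help | No | |
| -i | input dataset data file name | Yes | No |
| -k | k values | Yes | Yes |
| -n | number of query points | Yes | Yes |
| -o | whether to store the test result in a file | No | |
| -p | number of repeats | Yes | No |
| -phi | phi values | Yes | Yes |
| -r | rho values | Yes | Yes |
| -s | input dataset schema file name | Yes | No |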

Run tests from console

Accuracy Test

Format

mvn -e exec:java@accuracy -Dexec.args="{command-line-arguments}"

Example

mvn -e exec:java@accuracy -Dexec.args="-i data/iris.data -s data/iris.schema -a sepalLength sepalWidth petalLength -k 3 -r 0.05 0.1 0.15 -n 2000 -p 100 -e 0.1 0.2 -phi 0.1 0.2"

Efficiency Test

Format

mvn -e exec:java@efficiency -Dexec.args="{command-line-arguments}"

Example

mvn -e exec:java@efficiency -Dexec.args="-i data/iris.data -s data/iris.schema -a sepalLength sepalWidth -k 2 -r 0.05 0.1 0.15 -n 1000 2000 -p 100"

From Eclipse

In Eclipse or another IDE, run src/test/java/umichdb/coverage2/TestCoverageChecker.java

Built With

  • Smile - Java-based Machine Learning Package
  • Maven - Dependency Management

Authors

License

This project is licensed under the MIT License. See the LICENSE.md file for details.