ushashwat/Sequence-Clustering

Sequence-Clustering

Cluster sequences into inliers/outliers and generate a novel prototypical sequence for each cluster.

Description

Consider the following scenario: a process generates a set of sequences, and each sequence is encoded as a string of characters. There is an undisclosed number of distinct processes, so it should be possible to group the sequences into clusters of similar sequences. In addition, some sequences have been generated by another, unrelated process and form outliers. Each instance is therefore either an inlier or an outlier.
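One way to make "similar sequences" and "outliers" concrete is sketched below. It is a minimal illustration, not the method used in this repository: it assumes edit (Levenshtein) distance as the similarity measure and a DBSCAN-style rule in which a sequence with too few close neighbours is labelled an outlier. The names `levenshtein` and `cluster_sequences`, and the parameters `eps` and `min_pts`, are hypothetical and would need tuning for real data.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance with unit cost for insert, delete and substitute."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (ca != cb)))  # substitution
        prev = curr
    return prev[-1]


def cluster_sequences(seqs, eps=2, min_pts=2):
    """DBSCAN-style labelling (assumed approach).

    Returns one label per sequence: 0, 1, ... for clusters, -1 for outliers.
    eps is the neighbourhood radius in edit distance; min_pts is the minimum
    neighbourhood size (the sequence itself included) for a core point.
    """
    n = len(seqs)
    # Quadratic number of distance computations: fine for small data sets only.
    neighbours = [[j for j in range(n) if levenshtein(seqs[i], seqs[j]) <= eps]
                  for i in range(n)]
    labels = [None] * n
    cluster_id = 0
    for i in range(n):
        if labels[i] is not None:
            continue
        if len(neighbours[i]) < min_pts:
            labels[i] = -1  # provisional outlier; may become a border point
            continue
        labels[i] = cluster_id
        queue = list(neighbours[i])
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster_id  # border point reached from a core point
            if labels[j] is not None:
                continue
            labels[j] = cluster_id
            if len(neighbours[j]) >= min_pts:
                queue.extend(neighbours[j])  # expand through core points
        cluster_id += 1
    return labels


if __name__ == "__main__":
    demo = ["AAAA", "AAAB", "AABA", "CCCC", "CCCD", "XYZXYZXYZ"]
    print(cluster_sequences(demo, eps=1, min_pts=2))
```

A density-based rule is used here because it does not require the number of clusters in advance, which matches the "undisclosed number of distinct processes" in the scenario.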

Tasks

  1. Cluster the inliers into an appropriate number of groups.
  2. Generate a novel prototypical sequence for each cluster, i.e. the sequence that is most representative of that cluster. The prototypical sequence must be novel: it must not be one of the provided sequences.
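The second task can be sketched as follows. This is one possible reading, not the repository's implementation: it assumes "most representative" means minimal total edit distance to the cluster members, starts from the cluster medoid, and searches the medoid's single-edit neighbours for the best sequence that does not appear in the provided data. The function names and the single-edit search are assumptions made for illustration.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance with unit cost for insert, delete and substitute."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1,
                            prev[j - 1] + (ca != cb)))
        prev = curr
    return prev[-1]


def novel_prototype(members, provided):
    """Return a sequence close to the cluster centre that is not in `provided`.

    members:  sequences assigned to one cluster
    provided: every sequence in the input file (inliers and outliers)
    Returns None if no novel single-edit candidate exists.
    """
    seen = set(provided) | set(members)

    def cost(s):
        # Total edit distance from s to every member of the cluster.
        return sum(levenshtein(s, m) for m in members)

    medoid = min(members, key=cost)  # most central existing member
    alphabet = sorted(set("".join(members)))

    candidates = set()
    for i in range(len(medoid) + 1):
        for ch in alphabet:
            candidates.add(medoid[:i] + ch + medoid[i:])          # insertion
            if i < len(medoid):
                candidates.add(medoid[:i] + ch + medoid[i + 1:])  # substitution
        if i < len(medoid):
            candidates.add(medoid[:i] + medoid[i + 1:])            # deletion

    candidates -= seen        # enforce novelty
    candidates.discard("")    # an empty prototype is not useful
    if not candidates:
        return None
    # Break ties alphabetically so the result is deterministic.
    return min(candidates, key=lambda s: (cost(s), s))


if __name__ == "__main__":
    cluster = ["ABCD", "ABCE", "ABDD"]
    print(novel_prototype(cluster, provided=cluster + ["ZZZZ"]))
```

Restricting the search to single edits of the medoid keeps the sketch short; a consensus sequence built from a multiple alignment would be an alternative when members differ widely in length.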

Data

A text file, test.txt, is provided. It contains a random mixture of inlier and outlier sequences in no particular order. Each row of the file contains an integer identifier for the sequence, followed by the sequence itself.
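Reading the file could look like the sketch below. The exact delimiter in test.txt is not stated here, so the code assumes the identifier and the sequence are separated by whitespace; `parse_sequences` and `load_sequences` are hypothetical helper names.

```python
def parse_sequences(lines):
    """Map integer identifier -> sequence string.

    Assumes each non-empty row is '<integer id><whitespace><sequence>'.
    """
    records = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank rows
        ident, seq = line.split(maxsplit=1)
        records[int(ident)] = seq
    return records


def load_sequences(path="test.txt"):
    """Read and parse the data file from disk."""
    with open(path, encoding="utf-8") as handle:
        return parse_sequences(handle)
```

If the file turns out to use a different separator (for example a comma or a tab-only layout), only the `split` call needs to change.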

Outputs

  1. Print each sequence's identifier together with the ID of the cluster it belongs to, or a note that it is an outlier.
  2. Print one novel prototypical sequence for each cluster you have found.
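The two outputs can be produced by a small reporting helper such as the one sketched below. The output layout is not prescribed here, so the tab-separated format, the use of -1 as the outlier label, and the name `format_report` are all assumptions.

```python
def format_report(ids, labels, prototypes):
    """Build the report lines.

    ids:        sequence identifiers, in input order
    labels:     cluster ID per identifier, with -1 meaning outlier (assumed)
    prototypes: dict mapping cluster ID -> novel prototypical sequence
    """
    lines = []
    for ident, label in zip(ids, labels):
        status = "outlier" if label == -1 else f"cluster {label}"
        lines.append(f"{ident}\t{status}")
    for cluster_id in sorted(prototypes):
        lines.append(f"prototype for cluster {cluster_id}: {prototypes[cluster_id]}")
    return lines


if __name__ == "__main__":
    for row in format_report([7, 8], [0, -1], {0: "ABCF"}):
        print(row)
```

Returning the lines instead of printing them directly keeps the helper easy to test and lets the caller decide where the report goes.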
