Sequence-Clustering

Separate sequences into inliers and outliers, cluster the inliers, and generate a novel prototypical sequence for each cluster.

Description

Consider the following scenario: a set of sequences is generated by an undisclosed number of distinct processes, and each sequence is encoded as a string of characters. Because the processes are distinct, it should be possible to group the sequences into clusters of similar sequences. In addition, some sequences have been generated by another, unrelated process and form outliers. Each instance is therefore either an inlier or an outlier.
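The description does not say how similarity between two sequences should be measured. One common choice for character sequences, shown here purely as an assumption, is the Levenshtein (edit) distance, optionally normalized by the longer length so that sequences of different lengths can be compared. The function names `edit_distance` and `normalized_distance` are illustrative and not taken from the notebook.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: insertions, deletions and substitutions cost 1."""
    # Keep the shorter string in the inner loop to limit memory use.
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # delete ch_a
                current[j - 1] + 1,      # insert ch_b
                previous[j - 1] + cost,  # substitute (or match)
            ))
        previous = current
    return previous[-1]


def normalized_distance(a: str, b: str) -> float:
    """Edit distance scaled to [0, 1] by the length of the longer string."""
    longest = max(len(a), len(b))
    return edit_distance(a, b) / longest if longest else 0.0
```

Under this measure, sequences from the same process would be expected to sit at small distances from one another, while outliers would be far from every cluster.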

Tasks

Cluster the inliers into an appropriate number of groups.
Generate a novel prototypical sequence for each cluster, i.e. a sequence that is the most representative of that cluster. Note that the prototypical sequence must be novel, i.e. it must not be one of the provided sequences.
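The two tasks above can be sketched as follows. This is one possible approach, not necessarily the one used in the notebook: sequences are linked when their dissimilarity falls below a threshold, connected groups become clusters, and groups smaller than `min_size` are treated as outliers. The prototype is then the best single-character edit of the cluster medoid that does not already appear in the data. The `threshold` and `min_size` values, the function names, and the use of `difflib.SequenceMatcher` as the similarity measure are all assumptions.

```python
from difflib import SequenceMatcher


def dissimilarity(a, b):
    """1 - similarity ratio from the standard library's SequenceMatcher."""
    return 1.0 - SequenceMatcher(None, a, b).ratio()


def cluster_sequences(sequences, threshold=0.4, min_size=2):
    """Map each id to a cluster label, or to None for an outlier.

    sequences: dict of identifier -> string.
    """
    ids = list(sequences)
    parent = {i: i for i in ids}

    def find(x):
        # Union-find root lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    # Link every pair of sequences that is close enough.
    for n, i in enumerate(ids):
        for j in ids[n + 1:]:
            if dissimilarity(sequences[i], sequences[j]) <= threshold:
                parent[find(i)] = find(j)

    groups = {}
    for i in ids:
        groups.setdefault(find(i), []).append(i)

    labels, next_label = {}, 0
    for members in groups.values():
        if len(members) >= min_size:
            for i in members:
                labels[i] = next_label
            next_label += 1
        else:
            for i in members:
                labels[i] = None  # too isolated: treat as outlier
    return labels


def novel_prototype(members, all_sequences):
    """Return a sequence close to all members that is not in all_sequences."""
    def total_cost(candidate):
        return sum(dissimilarity(candidate, m) for m in members)

    medoid = min(members, key=total_cost)
    alphabet = sorted(set("".join(members)))
    candidates = set()
    for pos in range(len(medoid)):
        candidates.add(medoid[:pos] + medoid[pos + 1:])  # deletion
        for ch in alphabet:
            candidates.add(medoid[:pos] + ch + medoid[pos + 1:])  # substitution
    # Novelty requirement: discard anything already in the provided data.
    candidates = {c for c in candidates if c and c not in all_sequences}
    return min(sorted(candidates), key=total_cost) if candidates else None
```

The pairwise comparison is quadratic in the number of sequences, which is fine for a small file but would need a faster scheme for large inputs. The threshold would have to be tuned on the actual data.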

Data

A text file, test.txt, is provided which contains a random mixture of inlier and outlier sequences in no particular order. Each row of this file contains an integer identifier for the sequence, followed by the sequence itself.
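A minimal sketch of reading such a file is shown below. The exact delimiter between the identifier and the sequence is not stated here, so the code assumes they are separated by whitespace. The helper names `parse_sequences` and `load_sequences` are hypothetical.

```python
def parse_sequences(lines):
    """Parse rows of the assumed form '<integer id> <sequence>' into a dict."""
    sequences = {}
    for line_number, line in enumerate(lines, start=1):
        line = line.strip()
        if not line:
            continue  # skip blank rows
        parts = line.split(maxsplit=1)
        if len(parts) != 2:
            raise ValueError(f"line {line_number}: expected '<id> <sequence>'")
        identifier, sequence = parts
        sequences[int(identifier)] = sequence.strip()
    return sequences


def load_sequences(path):
    """Read a file such as test.txt and return a dict of id -> sequence."""
    with open(path, encoding="utf-8") as handle:
        return parse_sequences(handle)
```

If the file turns out to use a different separator, such as a comma, only the `split` call would need to change.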

Outputs

Print each sequence's identifier together with either the ID of the cluster it belongs to or an indication that it is an outlier.
Print one novel prototypical sequence for each cluster you have found.
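The required output could be produced along these lines. The layout of each printed row is not prescribed, so the tab-separated format, the function name `format_report`, and the convention that `None` marks an outlier are assumptions made for illustration.

```python
def format_report(labels, prototypes):
    """Build the output lines.

    labels: dict of sequence id -> cluster ID, or None for an outlier.
    prototypes: dict of cluster ID -> novel prototypical sequence.
    """
    lines = []
    for identifier in sorted(labels):
        label = labels[identifier]
        status = "outlier" if label is None else f"cluster {label}"
        lines.append(f"{identifier}\t{status}")
    for label in sorted(prototypes):
        lines.append(f"prototype for cluster {label}: {prototypes[label]}")
    return lines
```

Calling `print("\n".join(format_report(labels, prototypes)))` would then write one row per sequence followed by one row per cluster.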
