DeepthiSudharsan/CGR-for-Sequence-Similarity

CHAOS GAME REPRESENTATION FOR SEQUENCE SIMILARITY ANALYSIS

The project has two main parts: analysis of sequences using Frequency CGR (FCGR) and analysis using Coordinate CGR.

Methods implemented

image

Note: While plotting, the y-axis is inverted to follow the usual convention.

Frequency CGR

In the frequency CGR method, we divide the image into a 2D grid of size √(4^k) × √(4^k), that is 2^k × 2^k, so that each of the 4^k possible k-mers has its own cell.

For each nucleotide in a k-mer, the image is subdivided into 4 quadrants:
  • A in the top left
  • G in the top right
  • C in the bottom left
  • T in the bottom right

Each quadrant is then split recursively, following the same principle, for the next nucleotide in the k-mer.

image
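The subdivision described above can be sketched in Python as follows. This is a minimal illustration, not the project's own code: the names `QUADRANT`, `kmer_to_cell` and `fcgr` are hypothetical, and it assumes that the first nucleotide of a k-mer selects the coarsest quadrant, that row 0 is the top of the grid, and that k-mers containing other symbols (such as N) are skipped.

```python
import numpy as np

# (row, col) offset of each nucleotide's quadrant, with row 0 at the top:
# A top left, G top right, C bottom left, T bottom right.
QUADRANT = {"A": (0, 0), "G": (0, 1), "C": (1, 0), "T": (1, 1)}


def kmer_to_cell(kmer):
    """Return the (row, col) cell of a k-mer in the 2^k x 2^k grid."""
    row = col = 0
    for nucleotide in kmer:
        d_row, d_col = QUADRANT[nucleotide]
        # Each further nucleotide halves the current quadrant again.
        row = 2 * row + d_row
        col = 2 * col + d_col
    return row, col


def fcgr(sequence, k):
    """Return a 2^k x 2^k matrix of k-mer probabilities for a sequence."""
    size = 2 ** k
    counts = np.zeros((size, size))
    for i in range(len(sequence) - k + 1):
        kmer = sequence[i:i + k]
        # Assumption: k-mers with symbols other than A, C, G, T are ignored.
        if all(n in QUADRANT for n in kmer):
            counts[kmer_to_cell(kmer)] += 1
    total = counts.sum()
    # Normalise counts to probabilities; an empty result stays all zeros.
    return counts / total if total > 0 else counts
```

For example, `fcgr("AAGT", 2)` returns a 4 x 4 matrix in which the cells for AA, AG and GT each hold 1/3.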

CGR Probability Distance

The distance between two sequences is calculated as the Euclidean distance between their chaos probability matrices.

image
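A minimal sketch of this distance, assuming both probability matrices were built with the same k and therefore have the same shape. The function name `probability_distance` is hypothetical and is not taken from the project code.

```python
import numpy as np


def probability_distance(p, q):
    """Euclidean distance between two equally sized probability matrices."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    if p.shape != q.shape:
        raise ValueError("matrices must have the same shape (same k)")
    # Square root of the sum of squared cell-wise differences.
    return float(np.sqrt(np.sum((p - q) ** 2)))
```

Identical matrices give a distance of 0, and larger values indicate less similar k-mer distributions.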

Coordinate CGR

In this method, we analyse the sequences using coordinates calculated with the following steps:

  • Start from the center of the grid

  • First coordinate - plotted halfway between the center of the square and the vertex representing the first nucleotide (for example, A)

  • Successive coordinates - plotted halfway between the previous point and the vertex representing the current nucleotide

image
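The steps above can be sketched as follows. This is an illustrative version under stated assumptions: the grid is the unit square, the vertices are assigned with A top left, G top right, C bottom left and T bottom right (y pointing down, matching the inverted y-axis used for plotting), and symbols other than A, C, G, T are skipped. `VERTICES` and `cgr_coordinates` are hypothetical names.

```python
import numpy as np

# Assumed vertex positions (x, y) on the unit square, with y pointing down.
VERTICES = {
    "A": (0.0, 0.0),  # top left
    "G": (1.0, 0.0),  # top right
    "C": (0.0, 1.0),  # bottom left
    "T": (1.0, 1.0),  # bottom right
}


def cgr_coordinates(sequence):
    """Return an (n, 2) array with one CGR point per valid nucleotide."""
    point = np.array([0.5, 0.5])  # start from the center of the grid
    points = []
    for nucleotide in sequence:
        if nucleotide not in VERTICES:
            continue  # assumption: ambiguous symbols are ignored
        # Move halfway from the previous point towards the nucleotide's vertex.
        point = (point + np.array(VERTICES[nucleotide])) / 2
        points.append(point)
    return np.array(points).reshape(-1, 2)
```

With these vertex positions, `cgr_coordinates("AT")` yields the points (0.25, 0.25) and (0.625, 0.625).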

CGR Coordinate Distance

The distance is calculated as the Euclidean distance between the two chaos coordinate vectors obtained.

image
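A minimal sketch of this distance, assuming the two coordinate arrays have the same length. How sequences of different lengths are aligned is not stated here, so the sketch rejects them rather than guessing. The function name `coordinate_distance` is hypothetical.

```python
import numpy as np


def coordinate_distance(a, b):
    """Euclidean distance between two equally sized (n, 2) coordinate arrays."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    if a.shape != b.shape:
        raise ValueError("coordinate arrays must have the same shape")
    # For a 2D array, the default norm is the Euclidean norm of all entries.
    return float(np.linalg.norm(a - b))
```

Here the coordinate arrays are treated as flat vectors, so every x and y difference contributes to a single distance value.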

Annotated code for the project is provided. We also used Streamlit, an open-source app framework, to create a custom web app for the project. A folder containing the app's source code, screenshots of the expected output, and directions for running it is also provided.

Data

The data was gathered from NCBI (https://www.ncbi.nlm.nih.gov/) and GISAID (https://www.gisaid.org/). We tested two categories of data: hCov-19 and BetaCov-19 sequences (DNA_SEQUENCES folder), and human and various animal genome sequences (ANIMAL_GENOME folder).

Observations

  • Frequency Chaos Game Representation

image

  • Coordinate Chaos Game Representation

image

References

  1. https://towardsdatascience.com/chaos-game-representation-of-a-genetic-sequence-4681f1a67e14
  2. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7497811/
  3. https://www.hindawi.com/journals/aaa/2013/926519/

Additional Note

In the code, the CGRs of the sequences being analyzed are exported as PNG files to the same folder as the data. We have therefore also uploaded the images from a sample execution to those data folders.