Introduction

In this repository you will find an analysis of human chromosomes. This analysis primarily focuses on chromosomal morphology. The point of this data science project is to establish whether or not protein composition affects chromosomal alignment and ending structure. Locus locations and amino acid information were used to this end.

Background

In biochemistry, there are many phenomena in which we know little about. Protein formation is one of those. Proteins are the machines that allow our body to function daily. Our body makes proteins everyday, quite frequently. Information about how to make proteins is encoded in DNA. When DNA is organized it is called chromosomes. The only time that DNA organizes into a chromosome is when the cell in which the DNA is located is preparing to divide. All other times, DNA has no discernable structure as far as humans can tell. This chromosomal organization does however provide insight into various types of diseases.

Currently, the only way to track DNA is by radioactively tagging it. This can affect the organism in which the tag is inserted. This study attempts to classify proteins by their location on a chromosome, something that is known, by using the protein's amino acid composition. This amino acid driven classifier could potentially provide an alternative pathway to protein study and isolation for the delivery of drugs.

The type of information being focused on in this project are below in Figure 1.

Figure 1. For the 20 selected loci on chromosome 19, the type of information being evaluated.

Business Case

This tracking system could lead to enhanced drug development capabilities, medicine that is tailored to an individual's own genetic information, a growth in understanding behind chromosomal based diseases, and a variety of other health applications.

Methodology

I used the data science model for processing data: obtain, scrub, explore, model, and interpret. Ultimately this supervised machine learning model is based on human chromosome 19 and includes only loci that have more than thirty three occurrences, as seen below in Table 1.

Locus	Occurrences in Chromosome 19
KIR3DL2	94
FCAR	91
KIR2DL4	91
NCR1	83
KIR2DS4	63
KIR2DS2	61
CARD8	51
RPS9	49
OSCAR	48
ZNF266	46
ZNF415	46
DMKN	45
KIR3DL3	45
NLRP2	40
VSTM1	38
KIR2DL1	36
CEACAM21	35
MBOAT7	35
TSEN34	34
LILRB1	33

Table 1. Loci included in machine learning model to study Chromosome 19.

From this collected data, two studies were run. One where loci classification was performed, and a second study, called strand, that identified whether or not the evaluated sequence was a protein.

Model

The models evaluated include: a decision tree classification model, a bagged classification model, and a random forest classification model. A pipeline and a grid search were used to find optimal parameters for the random forest model.

Results

For the loci study, a test accuracy of 99% was achieved using the random forest classifier. For the strand study, the best performing model was the bagged classifier, with a test accuracy of 71%.

Figure 1. Confusion Matrix for best parameters for loci study.

The best parameters for the loci study were:

max_depth: 9
min_samples_split: 0.1
n_estimators: 151.

Figure 2. Confusion Matrix for best parameters for strand study.

The best parameters for the strand study were:

max_depth: 20
min_samples_split: 0.1
n_estimators: 451.

Contact

If you have additional questions or would like to discuss any of the contents, feel free to reach out to me either here or through my blog.

The post directly connected to this project can be found here.

Name		Name	Last commit message	Last commit date
Latest commit History 61 Commits
data		data
images		images
models		models
notebooks		notebooks
references		references
reports		reports
.gitattributes		.gitattributes
.gitignore		.gitignore
LICENSE.md		LICENSE.md
README.md		README.md

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

data

data

images

images

models

models

notebooks

notebooks

references

references

reports

reports

.gitattributes

.gitattributes

.gitignore

.gitignore

LICENSE.md

LICENSE.md

README.md

README.md

Repository files navigation

Introduction

Background

Business Case

Methodology

Model

Results

Contact

About

Releases

Packages

Contributors 6

Languages

License

eannefawcett/Classifying-Gene-Locus-by-Protein-Composition

Folders and files

Latest commit

History

Repository files navigation

Introduction

Background

Business Case

Methodology

Model

Results

Contact

About

Topics

Resources

License

Stars

Watchers

Forks

Languages