GitHub - gauravcodepro/arabidopsis-maf-cap-accessions: arabidopsis-maf-cap-accessions. genome extraction, alignments, visualization, phylogenomics, ancestral tree

arabidopsis-maf-cap-accessions: brief writeup

arabidopsis alignments for the maf and the cap gene clusters. multiple alignment and visualization of the aligned genes and clusters. The genes that are used are At5g65050-At5g65080, containing 4 genes (Maf2-5 cluster of genes) and At2g13540 CBP80 Cap-binding protein 80. The corresponding genome assemblies and the download link to the assemblies along with the code to download, align, annotate and draw the alignments. Analysis for the arabidopsis genomes and the accessions cited in the paper: https://www.nature.com/articles/s41586-023-06062-z#data-availability.

Analysis outlay: since the genomes reported in the paper have not been annotated, miniprot was used to align the protein sequences for the corresponding genes and then align them to the genome of all these accesssion and extract those regions and then make a alignment of the same. The code along with the runs files are present within the corresponding project execution files.

Methods

the accession used for the analysis are listed here: accession

downloadrecords.sh: Run this to download the sequence records from the ebi or the ena. Either you can run this or you can run the code below to generate the direct apis for the download

    *code for generating the direct apis for the arabidopsis ena*
for i in $(cat arabidopsisaccessionlinks.md | grep GCA | cut -f 2 -d "|");
do
         echo "curl https://www.ebi.ac.uk/ena/browser/api/fasta/$i.1\?download\=true\&gzip\=true -o $i.gz";
done

arabidopsis genome directapis directapis.

          *Normalize your header by running this before running the analysis*
          cat fastafile | cut -f 1 -d " " | cut -f 1 -d "." > output.fasta
          # the output fasta will be used for all the analysis.

alignmentrecords.sh: Run this to make the corresponding alignments, this follows the lift off approach by transferring the annotations.
phylogeny.R: Run this to make the phylogeny.
visualfreq.R: Run this to make the alignment visualization.
mapalignment-phylogeny.R: Run this to make the alignment, visualization.
generatemRNAs.py: Run this to extract the corresponding mRNAs.
genome-annotation-visualizer.R: Run this to make the visualization of the genomic features. You can find the code here also evoseq and here also genome-annotation

How to read this github repository

allassembly.md all accession that were studied
arabidopsisaccessionlinks.md links to the accession and the corresponding ENA archives
arabidopsis_paper.pdf arabidopsis paper
directapis.txt directapis for the ena
cap_alignments folder containing cap alignments with a readme as how to generate them
cap_final_joined_fasta folder containing the final fasta, alignments, ancestral tree, phylogenetic tree, acestral sequence, alignment visualization
cap_genes folder containing the cap genes
maf_alignments folder containing maf alignments with a readme as how to generate them
maf_final_joined_fasta folder containing the final fasta alignments, ancestral tree, phylogenetic tree, acestral sequence, alignment visualization
maf_genes folder containing the cap genes
python_scripts python scripts for analysis
r_scripts r scripts for analysis
shell_scripts shell scripts for analysis
README.md README for the complete analysis

Folder read for the analysis

cap_final_joined_fasta: File listing cap_final_joined_fasta

    alignments can be run with the following: 
       for i in *.fasta; do echo prank -d=${i} -o=${i%.*}.aligned.fasta -showanc -showtree; done

├── all.cap.gff.clipped.gff: All aligned mRNA positions. 
├── capgenes.aligned.fasta.best.anc.dnd: best phylogenetic tree 
├── capgenes.aligned.fasta.best.anc.fas: best ancestral sequence 
├── capgenes.aligned.fasta.best.dnd: best phylogenetic ancestral tree 
├── capgenes.aligned.fasta.best.fas: alignment 
└── capview.html: visualization of alignment

maf_final_joined_fasta: File listing maf_final_joined_fasta

├── AT5G65050.all.out.fasta : mRNA regions for the AT5G65050
├── AT5G65050.gff.clipped.gff : aligned position information for the AT5G65050
├── AT5G65060.all.out.fasta : mRNA regions for the AT5G65060
├── AT5G65060.gff.clipped.gff : aligned position information for the AT5G65060
├── AT5G65070.all.out.fasta : mRNA regions for the AT5G65070
├── AT5G65070.gff.clipped.gff : aligned position information for the AT5G65070
├── AT5G65080.all.out.fasta : mRNA regions for the AT5G65080
├── AT5G65080.gff.clipped.gff : aligned position information for the AT5G65070
├── final.all.linear.tar.bz : all arabidopsis accessions
├── maf_aligned_ancestral_tree : ancestral tree for each of the indiviual.
├── maf_aligned_best : aligned regions for each of the indiviuals along with the visualization
├── maf_ancestral_sequence : ancestral sequences for each of them.

Uncompress the tar archive by using the tar -xJf TAIR10_GFF3_genes.tar.xz for the genome annotations. if you have any questions i can be contacted at gaurav.sablok@uni-potsdam.de or sablokg@gmail.com

Gaurav
Academic Staff Member
Bioinformatics
Institute for Biochemistry and Biology
University of Potsdam
Potsdam,Germany

Name		Name	Last commit message	Last commit date
Latest commit History 209 Commits
cap_alignments		cap_alignments
cap_final_joined_fasta		cap_final_joined_fasta
cap_genes		cap_genes
maf_alignments		maf_alignments
maf_final_joined_fasta		maf_final_joined_fasta
maf_genes		maf_genes
python_scripts		python_scripts
r_scripts		r_scripts
shell_scripts		shell_scripts
LICENSE		LICENSE
README.md		README.md
allassembly.md		allassembly.md
arabidopsis_paper.pdf		arabidopsis_paper.pdf
arabidopsisaccessionlinks.md		arabidopsisaccessionlinks.md
directapis.txt		directapis.txt
nextflow.nf		nextflow.nf

License

gauravcodepro/arabidopsis-maf-cap-accessions

Folders and files

Latest commit

History

Repository files navigation

arabidopsis-maf-cap-accessions: brief writeup

About

Topics

Resources

License

Stars

Watchers

Forks

Languages