FJBiocode/ShortSequenceRastreator

ShortSequenceRastreator

Developed by Francisco Javier Bielsa (graduate in Biotechnology and Master in Molecular and Cell Biology, University of Zaragoza). Greetings! I hope my work helps yours and your SSR studies. This code was written during the development of my Master's thesis and was tested with genomic data from Pyrus communis L. held at the CITA of Aragón germplasm bank.

ShortSequenceRastreator supports genetic diversity analysis and input file preparation for STRUCTURE 2.3.4 (Stanford), DARwin 6.0 (CIRAD) and SPAGeDi (Iowa State University).

This program works with CSV UTF-8 encoded dataframes. The header is expected in the first row. Each following row represents one accession. The first columns of your dataframe should contain descriptive information such as name, origin or ploidy. These should be followed by the loci columns containing the genomic data, with the alleles of each genotype separated by '/'. For example:

| Accession | Name | Origin | Ploidy | Locus.1 | Locus.2 |
|---|---|---|---|---|---|
| 100 | M.1 | SPA | 2 | 123/123 | 241/243 |
| 101 | M.2 | SPA | 3 | 123/125/127 | 241/243/247 |
| 102 | M.3 | FRA | 2 | 123/125 | 241/247 |
| 103 | M.4 | SPA | 2 | 121/123 | 241/243 |

This program supports diploid and triploid SSR analysis. Insert the path of the file you want to work with, e.g. 'C:/Users/Admin/Documents/SSR/Dataframe1.csv'.

Before executing the code, check the set of information (non-locus) column names and change it to match the header of your dataframe: set(['Ord', 'Nclon', 'AccessionName', 'Origin', 'Ploidy']).
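As an illustration of why that set matters, here is a minimal sketch (not the repository's own code) that reads a CSV and separates information columns from loci columns. It uses only the standard library; the names `INFO_COLUMNS`, `SAMPLE_CSV` and `split_columns` are hypothetical, and the inline sample stands in for a file on disk.

```python
import csv
import io

# Hypothetical set of information columns; adjust it to your own header.
INFO_COLUMNS = {"Accession", "Name", "Origin", "Ploidy"}

# Inline sample standing in for a CSV file on disk.
SAMPLE_CSV = """Accession,Name,Origin,Ploidy,Locus.1,Locus.2
100,M.1,SPA,2,123/123,241/243
101,M.2,SPA,3,123/125/127,241/243/247
"""


def split_columns(handle, info_columns):
    """Return the rows plus the information and loci column names."""
    reader = csv.DictReader(handle)
    rows = list(reader)
    header = reader.fieldnames or []
    info = [name for name in header if name in info_columns]
    loci = [name for name in header if name not in info_columns]
    return rows, info, loci


rows, info_cols, loci_cols = split_columns(io.StringIO(SAMPLE_CSV), INFO_COLUMNS)
print(info_cols)
print(loci_cols)
```

For a real file, the handle could be opened with `open(path, newline="", encoding="utf-8-sig")`, which also tolerates the byte order mark that spreadsheet programs often write in "CSV UTF-8" exports. Any column missing from the set is treated as a locus, which is why the set has to match your header exactly.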

FEATURES:

ACCESSIONS SUMMARY: Number of accessions, number of loci, and percentage of diploids and triploids.
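A summary of this kind could be computed roughly as follows. This is a sketch under the assumption that ploidy is stored in a column named `Ploidy`; the function name `summarize` and the sample rows are illustrative only.

```python
INFO_COLUMNS = {"Accession", "Name", "Origin", "Ploidy"}

rows = [
    {"Accession": "100", "Name": "M.1", "Origin": "SPA", "Ploidy": "2",
     "Locus.1": "123/123", "Locus.2": "241/243"},
    {"Accession": "101", "Name": "M.2", "Origin": "SPA", "Ploidy": "3",
     "Locus.1": "123/125/127", "Locus.2": "241/243/247"},
    {"Accession": "102", "Name": "M.3", "Origin": "FRA", "Ploidy": "2",
     "Locus.1": "123/125", "Locus.2": "241/247"},
    {"Accession": "103", "Name": "M.4", "Origin": "SPA", "Ploidy": "2",
     "Locus.1": "121/123", "Locus.2": "241/243"},
]


def summarize(rows, info_columns):
    """Count accessions and loci, and report ploidy percentages."""
    total = len(rows)
    loci = [name for name in rows[0] if name not in info_columns]
    diploids = sum(1 for row in rows if str(row["Ploidy"]) == "2")
    triploids = sum(1 for row in rows if str(row["Ploidy"]) == "3")
    return {
        "accessions": total,
        "loci": len(loci),
        "pct_diploid": 100 * diploids / total,
        "pct_triploid": 100 * triploids / total,
    }


summary = summarize(rows, INFO_COLUMNS)
print(summary)
```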

CREATION OF HITS MATRIX / DARwin MATRIX: This step creates a 'Hits' dataframe (accession vs. number of times an allele appears at a given locus). The matrix can be used in DARwin and can be saved to the path you are working on. Be careful with DARwin parameters: the matrix may contain values other than 0 or 1, which may cause errors in DARwin. This happens because homozygotes and triploids give 2 hits for some alleles. To make the matrix work in DARwin, replace every value of 2 with 1.
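The idea of the hits matrix and its conversion to presence/absence can be sketched as below. This is an assumption about the general approach, not the repository's implementation; `hits_matrix` and `to_presence_absence` are hypothetical names, and plain dictionaries are used instead of a dataframe so the example has no dependencies.

```python
from collections import Counter

genotypes = {
    "M.1": {"Locus.1": "123/123", "Locus.2": "241/243"},
    "M.2": {"Locus.1": "123/125/127", "Locus.2": "241/243/247"},
}


def hits_matrix(genotypes):
    """Count how often each (locus, allele) pair occurs per accession."""
    columns = sorted({
        (locus, allele)
        for calls in genotypes.values()
        for locus, call in calls.items()
        for allele in call.split("/")
    })
    matrix = {}
    for name, calls in genotypes.items():
        counts = Counter(
            (locus, allele)
            for locus, call in calls.items()
            for allele in call.split("/")
        )
        matrix[name] = [counts[column] for column in columns]
    return columns, matrix


def to_presence_absence(matrix):
    """Cap every count at 1 so the matrix only holds 0 and 1."""
    return {name: [min(value, 1) for value in row] for name, row in matrix.items()}


columns, hits = hits_matrix(genotypes)
binary = to_presence_absence(hits)
print(columns)
print(hits["M.1"])    # the homozygote 123/123 contributes a 2
print(binary["M.1"])  # the same row capped at 1
```

Capping with `min(value, 1)` rather than replacing only the value 2 also covers any count above 2, should one occur in your data.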

ALLELE SUMMARY: Relative frequencies, rare alleles and unique alleles.
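A per-locus allele summary might look like the sketch below. The definitions are assumptions made for illustration: a rare allele is one whose relative frequency falls below a threshold (0.05 here), and a unique allele is one carried by a single accession. The repository may use different cut-offs.

```python
from collections import Counter


def allele_summary(calls, rare_threshold=0.05):
    """Return relative frequencies, rare alleles and unique alleles for one locus."""
    counts = Counter()    # allele copies across all accessions
    carriers = Counter()  # number of accessions carrying each allele
    for call in calls:
        alleles = call.split("/")
        counts.update(alleles)
        carriers.update(set(alleles))
    total = sum(counts.values())
    freqs = {allele: n / total for allele, n in counts.items()}
    rare = sorted(a for a, f in freqs.items() if f < rare_threshold)
    unique = sorted(a for a, n in carriers.items() if n == 1)
    return freqs, rare, unique


locus_1 = ["123/123", "123/125/127", "123/125", "121/123"]
freqs, rare, unique = allele_summary(locus_1)
print(freqs)
print(rare)
print(unique)
```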

DISCRIMINANT POWER
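One common definition of the power of discrimination of a locus is PD = 1 - Σ gᵢ², where gᵢ is the frequency of the i-th genotype. The sketch below assumes that definition; the README does not state which formula the program uses, and `discriminating_power` is a hypothetical name.

```python
from collections import Counter


def discriminating_power(calls):
    """PD = 1 - sum of squared genotype frequencies for one locus."""
    # Sort the alleles so that '125/123' and '123/125' count as one genotype.
    genotypes = Counter("/".join(sorted(call.split("/"))) for call in calls)
    total = sum(genotypes.values())
    return 1 - sum((n / total) ** 2 for n in genotypes.values())


locus_1 = ["123/123", "123/125/127", "123/125", "121/123"]
locus_2 = ["241/243", "241/243/247", "241/247", "241/243"]
pd_1 = discriminating_power(locus_1)
pd_2 = discriminating_power(locus_2)
print(pd_1, pd_2)
```

Locus.1 separates all four sample accessions, so it scores higher than Locus.2, where two accessions share the genotype 241/243.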

OBSERVED AND EXPECTED HETEROZYGOSITY
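For reference, a minimal sketch of both quantities for a single locus. It assumes the usual definitions: observed heterozygosity Ho is the proportion of accessions carrying more than one distinct allele, and expected heterozygosity He is 1 - Σ pᵢ², with pᵢ the relative frequency of allele i. How the program treats triploids is not stated in the README, so counting every allele copy is an assumption.

```python
from collections import Counter


def observed_heterozygosity(calls):
    """Proportion of accessions with at least two distinct alleles."""
    heterozygous = sum(1 for call in calls if len(set(call.split("/"))) > 1)
    return heterozygous / len(calls)


def expected_heterozygosity(calls):
    """He = 1 - sum of squared allele frequencies."""
    counts = Counter(allele for call in calls for allele in call.split("/"))
    total = sum(counts.values())
    return 1 - sum((n / total) ** 2 for n in counts.values())


locus_1 = ["123/123", "123/125/127", "123/125", "121/123"]
ho = observed_heterozygosity(locus_1)
he = expected_heterozygosity(locus_1)
print(ho, he)
```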

F-STAT AND NULL ALLELES
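Both values can be derived from Ho and He. The sketch below assumes the fixation index F = 1 - Ho/He and Brookfield's first null allele estimator r = (He - Ho)/(1 + He); the README does not say which estimators the program applies, so treat these as one possible choice.

```python
def fixation_index(ho, he):
    """F = 1 - Ho/He; undefined (None) when He is zero."""
    if he == 0:
        return None
    return 1 - ho / he


def null_allele_frequency(ho, he):
    """Brookfield's estimator 1: r = (He - Ho) / (1 + He)."""
    return (he - ho) / (1 + he)


# Illustrative values, not results from real data.
f_value = fixation_index(0.5, 0.75)
r_value = null_allele_frequency(0.5, 0.75)
print(f_value, r_value)
```

A positive F indicates fewer heterozygotes than expected, which is the pattern that null alleles can produce.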

DUPLICATE IDENTIFICATION: Get a dataframe with your duplicates! WARNING: In 'col[6:(len(new_df.columns))]' the number 6 needs to be checked and replaced with the last position of the information columns in your dataframe.
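The role of that index is easier to see in a small example. The sketch below is not the repository's code: it groups accessions whose loci columns are identical, and `FIRST_LOCUS` plays the part of the position where the loci columns start (4 for the sample header used here). The fifth row is made up to provide a duplicate.

```python
from collections import defaultdict

header = ["Accession", "Name", "Origin", "Ploidy", "Locus.1", "Locus.2"]
rows = [
    ["100", "M.1", "SPA", "2", "123/123", "241/243"],
    ["101", "M.2", "SPA", "3", "123/125/127", "241/243/247"],
    ["102", "M.3", "FRA", "2", "123/125", "241/247"],
    ["104", "M.5", "FRA", "2", "123/123", "241/243"],  # made-up duplicate of M.1
]

# Index of the first locus column; everything before it is information.
FIRST_LOCUS = 4


def find_duplicates(rows, first_locus, name_index=1):
    """Group accession names that share exactly the same genotype profile."""
    groups = defaultdict(list)
    for row in rows:
        groups[tuple(row[first_locus:])].append(row[name_index])
    return [names for names in groups.values() if len(names) > 1]


duplicates = find_duplicates(rows, FIRST_LOCUS)
print(duplicates)
```

If the index is too small, information columns such as the accession number enter the comparison and no two rows ever match; if it is too large, some loci are ignored and distinct genotypes may be reported as duplicates.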

TRIALLELIC LOCI SEARCH
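A search of this kind could be sketched as follows, assuming that a triallelic call is a genotype with three distinct alleles at one locus. The function name `triallelic_calls` is hypothetical.

```python
genotypes = {
    "M.1": {"Locus.1": "123/123", "Locus.2": "241/243"},
    "M.2": {"Locus.1": "123/125/127", "Locus.2": "241/243/247"},
    "M.3": {"Locus.1": "123/125", "Locus.2": "241/247"},
}


def triallelic_calls(genotypes):
    """List (accession, locus) pairs whose call has three distinct alleles."""
    found = []
    for name, calls in genotypes.items():
        for locus, call in calls.items():
            if len(set(call.split("/"))) == 3:
                found.append((name, locus))
    return found


triallelic = triallelic_calls(genotypes)
print(triallelic)
```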

CREATION OF STRUCTURE INPUT FILE: Enjoy saving 2 hours of Excel work! WARNING: Insert the file path of your unique genotypes matrix. Change the information columns in '{'Ord', 'Nclon', 'AccessionName', 'Origin', 'Ploidy'}' and in St = Stru.drop(['Ord', 'Nclon', 'Origin', 'Ploidy'], axis = 1) to match your dataframe.
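To show what such a conversion involves, here is a hedged sketch of one possible layout: a first row with marker names, then one row per allele copy for each accession, with -9 as the missing-data code. Padding diploids up to the maximum ploidy with missing values is an assumption about how mixed ploidy is handled, and `structure_lines` is a hypothetical name; check the layout against your STRUCTURE settings before use.

```python
MISSING = "-9"  # missing-data code assumed for this sketch

genotypes = {
    "M.1": {"Locus.1": "123/123", "Locus.2": "241/243"},
    "M.2": {"Locus.1": "123/125/127", "Locus.2": "241/243/247"},
}


def structure_lines(genotypes, loci, ploidy=3):
    """Build text lines: a marker-name row, then `ploidy` rows per accession."""
    lines = [" ".join(loci)]
    for name, calls in genotypes.items():
        per_locus = []
        for locus in loci:
            alleles = calls[locus].split("/")
            # Pad lower-ploidy calls with the missing-data code.
            alleles += [MISSING] * (ploidy - len(alleles))
            per_locus.append(alleles)
        for copy in range(ploidy):
            lines.append(" ".join([name] + [alleles[copy] for alleles in per_locus]))
    return lines


lines = structure_lines(genotypes, ["Locus.1", "Locus.2"])
print("\n".join(lines))
```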

CREATION OF SPAGeDi INPUT FILE: Feel free to dance to the rhythm of https://www.youtube.com/watch?v=Z3w5gVM_4y8 while Python does your work.
