rubyCorrSieve

Michael G. Campana
Smithsonian Conservation Biology Institute

Ruby implementation of CorrSieve

Licensing

Original Ruby source code (CorrSieve versions <= 1.6-5) copyright (c) Michael G. Campana, 2010-2011 is licensed under the GNU General Public License (version 3 or later). See included LICENSE file for details.

Public domain updates by Michael G. Campana (2019) to the original Ruby source code (CorrSieve versions >= 1.7-0) are United States government works. These modifications are annotated in the modified source code.

Introduction

CorrSieve is a Ruby and R package that filters Q value output from the programs STRUCTURE (Pritchard et al. 2000) and INSTRUCT (Gao et al. 2007) by correlation values. It outputs matrices showing significant correlations between individual runs for each K. It can also calculate ΔK (Evanno et al. 2005), mean F_STs and ΔF_ST. These measures help identify meaningful values of K.
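
As a rough illustration of the ΔK statistic mentioned above, the sketch below computes it from replicate Ln P(D) values. It assumes one common reading of Evanno et al. (2005): the absolute second-order difference of the mean Ln P(D) across successive K, divided by the standard deviation of Ln P(D) at K. The method names (mean, sample_sd, delta_k) and the example numbers are hypothetical and are not taken from the CorrSieve source, which may differ in detail.

```ruby
# Illustrative sketch only; not taken from the CorrSieve source.
# ln_pd maps each K to the Ln P(D) values of its replicate runs.

def mean(values)
  values.sum.to_f / values.size
end

# Sample standard deviation (n - 1 denominator); this choice is an assumption.
def sample_sd(values)
  m = mean(values)
  Math.sqrt(values.sum { |v| (v - m)**2 } / (values.size - 1))
end

# delta K = |mean L(K+1) - 2 * mean L(K) + mean L(K-1)| / sd(L(K)).
# Returns a hash of K => delta K for every K that has both neighbours.
def delta_k(ln_pd)
  ln_pd.keys.sort.each_with_object({}) do |k, result|
    next unless ln_pd.key?(k - 1) && ln_pd.key?(k + 1)

    second_diff = mean(ln_pd[k + 1]) - 2 * mean(ln_pd[k]) + mean(ln_pd[k - 1])
    result[k] = second_diff.abs / sample_sd(ln_pd[k])
  end
end

# Made-up Ln P(D) values for K = 1..3, three runs each.
ln_pd = {
  1 => [-1000.0, -1002.0, -998.0],
  2 => [-900.0, -901.0, -899.0],
  3 => [-895.0, -896.0, -894.0]
}
delta_k(ln_pd).each { |k, dk| puts "K=#{k}\tdeltaK=#{dk.round(3)}" }
```

ΔK is undefined for the smallest and largest K tested, because each lacks one neighbour, so the sketch skips them.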

Installation and Compatibility

rubyCorrSieve is compatible with Windows, Linux, and UNIX (including macOS) operating systems. rubyCorrSieve requires the Ruby interpreter. Installation files are available at www.ruby-lang.org/en/downloads. Install the appropriate interpreter for your operating system.

Clone this repository to your system. On a Linux/UNIX command line, you can do this with git:
git clone https://github.com/campanam/rubyCorrSieve

You may need to make the CorrSieve-1.7-0.rb file executable. On Linux/UNIX, enter:
chmod +x rubyCorrSieve/*.rb

Move the CorrSieve-1.7-0.rb executable and LICENSE file to your chosen execution directory.

Usage

Prepare input for CorrSieve. CorrSieve reads directly from STRUCTURE and INSTRUCT output files, but requires that all files be in a single folder. Do not place other files in this folder. All file names should end in ‘_f’ without an extension, e.g.:
TEST_11_f
TEST_12_f
TEST_13_f
TEST_21_f
TEST_22_f
TEST_23_f
Launch a terminal window (Linux/Unix) or command prompt (Windows). On Windows, ensure that you launch the command prompt with the Ruby interpreter that came with the installed Ruby package.
Execute the CorrSieve script. If CorrSieve is in your $PATH (Linux/Unix), you can omit the ruby command:
ruby CorrSieve-1.7-0.rb
The splash screen will load. Enter 'C' to continue with program execution, 'L' to see the licensing information, or 'X' to exit the program, then press ENTER. Capitalization of command choices does not matter for any CorrSieve prompts.
Once you continue execution, the program will prompt you for the path to the folder containing the STRUCTURE or INSTRUCT raw data. If that folder is in the same folder as the script (e.g. both are on the desktop), simply type the name of the folder. Otherwise, type the full file path (e.g. C:\Users\<username>\Desktop on Windows or /Users/<username>/Desktop/ on macOS).
The program will ask you to enter the name of the run. This is the name of the output files generated by CorrSieve. Type the name and press ENTER.
The program will ask you to enter the path to the folder in which to save the output files. Type the folder path and press ENTER. Pressing ENTER without typing a folder name will place the files in the current directory. If the folder does not exist, CorrSieve will create it.
The program will ask whether you wish to calculate the Q matrix correlations. Type ‘yes’/‘y’ or ‘no’/‘n’ and press ENTER to continue.
If the Q matrix correlations are calculated, the program will ask for the minimum Pearson correlation value (r value) to be considered significant. Enter the appropriate value and press ENTER to continue.
If the Q matrix correlations are calculated, the program will then ask if you also wish to filter the data by the significance level. Type ‘yes’/‘y’ or ‘no’/‘n’ and press ENTER to continue.

NOTE: The average maximum correlation algorithm ignores non-significant values as potential maximum correlations. The columns-and-rows method filters first by correlation and then again by significance.
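
To make the minimum r value concrete, here is a minimal sketch of a Pearson correlation between the Q values of two runs, followed by a threshold check. The method names (pearson_r, passes_threshold?) and the Q values are hypothetical. The sketch also assumes that a correlation passes when r is at least the chosen minimum. CorrSieve's own implementation may differ.

```ruby
# Illustrative sketch only; not taken from the CorrSieve source.

# Pearson correlation coefficient between two equal-length numeric arrays.
def pearson_r(x, y)
  n = x.size.to_f
  mean_x = x.sum / n
  mean_y = y.sum / n
  cov = x.zip(y).sum { |a, b| (a - mean_x) * (b - mean_y) }
  var_x = x.sum { |a| (a - mean_x)**2 }
  var_y = y.sum { |b| (b - mean_y)**2 }
  cov / Math.sqrt(var_x * var_y)
end

# Assumption: a correlation passes when r >= the user-supplied minimum.
def passes_threshold?(r, min_r)
  r >= min_r
end

# Made-up Q values for one cluster in two runs (one value per individual).
run_a = [0.9, 0.8, 0.1, 0.2]
run_b = [0.85, 0.75, 0.15, 0.25]

r = pearson_r(run_a, run_b)
puts "r = #{r.round(4)}, passes 0.99 threshold: #{passes_threshold?(r, 0.99)}"
```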

The program will ask if you wish to estimate the p value or calculate an exact p. Selecting yes will estimate p and prompt you for the number of permutations to use. Selecting no will calculate the exact p. Type ‘yes’/‘y’ or ‘no’/‘n’ and press ENTER to continue.

WARNING: For large data sets, calculating the exact p will be EXTREMELY slow. This should only be used if necessary.
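
The difference in cost between the two options can be sketched as follows, assuming a two-sided permutation test on the Pearson correlation (the text above does not spell out CorrSieve's exact procedure, so this is an assumption). The estimated p uses a fixed number of random shuffles, while the exact p enumerates every ordering, which grows factorially with the number of values. All method names here are hypothetical.

```ruby
# Illustrative sketch only; not taken from the CorrSieve source.

def pearson_r(x, y)
  n = x.size.to_f
  mean_x = x.sum / n
  mean_y = y.sum / n
  cov = x.zip(y).sum { |a, b| (a - mean_x) * (b - mean_y) }
  cov / Math.sqrt(x.sum { |a| (a - mean_x)**2 } * y.sum { |b| (b - mean_y)**2 })
end

TOLERANCE = 1e-12 # guards against floating-point ties

# Estimated p: fraction of random shuffles of y whose |r| is at least the
# observed |r|. Adding 1 to both counts keeps the estimate above zero.
def permutation_p(x, y, permutations, rng = Random.new)
  observed = pearson_r(x, y).abs
  hits = permutations.times.count do
    pearson_r(x, y.shuffle(random: rng)).abs >= observed - TOLERANCE
  end
  (hits + 1.0) / (permutations + 1)
end

# Exact p: the same comparison over all n! orderings of y.
def exact_p(x, y)
  observed = pearson_r(x, y).abs
  total = 0
  hits = 0
  y.permutation do |perm|
    total += 1
    hits += 1 if pearson_r(x, perm).abs >= observed - TOLERANCE
  end
  hits.to_f / total
end

x = [1, 2, 3, 4, 5, 6]
y = [2, 4, 6, 8, 10, 12]
puts "estimated p: #{permutation_p(x, y, 1000, Random.new(42))}"
puts "exact p (720 orderings): #{exact_p(x, y)}"
```

With six values the exact calculation visits 720 orderings. With 20 values it would visit more than 2 × 10^18, which is why a fixed number of random permutations is the practical choice for larger data sets.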

If the p-value filter was selected, the program will ask for the maximum p value to be considered significant. Enter the appropriate value and press ENTER to continue.
The program will prompt you to decide between the average maximum correlation filter method outlined in Cockram et al. (2008) or the columns-and-rows method described in Campana et al. (2011). Type ‘yes’/‘y’ or ‘no’/‘n’ and press ENTER to continue.
The program will then ask if you wish to output the unfiltered correlation matrices. If yes, the program will output the raw correlations (and p-values if selected) in a separate file. Type ‘yes’/‘y’ or ‘no’/‘n’ and press ENTER to continue.
The program will ask if you wish to summarise Ln P(D) and calculate ΔK. If yes, this will output these statistics in a separate file. Type ‘yes’/‘y’ or ‘no’/‘n’ and press ENTER to continue.
The program will ask if you wish to calculate F_ST statistics (and ΔF_ST). If yes, this will output these statistics in a separate file. Type ‘yes’/‘y’ or ‘no’/‘n’ and press ENTER to continue.

NOTE: F_ST statistics are only available from STRUCTURE data generated under the admixture model. Output generated in INSTRUCT (even under the admixture model) or under other STRUCTURE models will cause an error.

If you opted to calculate F_ST statistics, the program will ask you to choose an optimization procedure for ordering the F_ST values, since F_ST output will not necessarily be in the same order in each run. Option 1 uses no optimization procedure. Option 2 orders the raw F_STs by value, while option 3 orders them using the matrix correlations.
The program will then process and output the data. The files containing the filtered matrices, the ΔK and Ln P(D) values, the F_STs and the unfiltered correlation matrices will be named “-filtered.txt”, “-deltaK.txt”, “-Fst.txt” and “-matrix_correlations.txt”, respectively.

NOTE: The filtered matrices, ΔK and F_ST output files are tab-delimited text files. They can therefore be directly opened in spreadsheet programs such as Microsoft Excel.

NOTE: For K = 1, STRUCTURE will always generate a Q value of 1.0. This causes a divide by zero error (the meaning of ‘NaN’ in the raw matrix correlations), resulting in a non-significant correlation.
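
The K = 1 case can be reproduced in a few lines. In this sketch (hypothetical names, not CorrSieve's own code), every Q value is 1.0, so both variances are zero and the Pearson formula evaluates 0.0 / 0.0, which Ruby floats represent as NaN. Any comparison against NaN is false, so the correlation can never pass a threshold.

```ruby
# Illustrative sketch only; not taken from the CorrSieve source.

def pearson_r(x, y)
  n = x.size.to_f
  mean_x = x.sum / n
  mean_y = y.sum / n
  cov = x.zip(y).sum { |a, b| (a - mean_x) * (b - mean_y) }
  cov / Math.sqrt(x.sum { |a| (a - mean_x)**2 } * y.sum { |b| (b - mean_y)**2 })
end

# With K = 1, every individual has Q = 1.0 in every run.
q_run_a = [1.0, 1.0, 1.0, 1.0]
q_run_b = [1.0, 1.0, 1.0, 1.0]

r = pearson_r(q_run_a, q_run_b)
puts r.nan?    # true: zero variance gives 0.0 / 0.0
puts r >= 0.9  # false: comparisons with NaN never succeed
```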

NOTE: In the F_STs output file, the 'Overall Mean' and 'Overall Standard Deviation' are the mean and standard deviation of all F_STs, ignoring cluster assignment. The 'St. Dev. of Means' is the standard deviation of the mean F_STs of the individual clusters. The 'Mean St. Dev.' is the mean of the within-cluster standard deviations of the F_STs. The 'St. Dev. of St. Devs.' is the standard deviation of those within-cluster standard deviations.
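
The five summary statistics described above can be sketched as follows. The method names, the hash layout (cluster label mapped to that cluster's F_ST values across runs) and the example numbers are hypothetical. The sketch assumes the sample standard deviation (n - 1 denominator), and CorrSieve may use a different convention.

```ruby
# Illustrative sketch only; not taken from the CorrSieve source.

def mean(values)
  values.sum.to_f / values.size
end

# Sample standard deviation (n - 1 denominator); this choice is an assumption.
def sample_sd(values)
  m = mean(values)
  Math.sqrt(values.sum { |v| (v - m)**2 } / (values.size - 1))
end

# fst_by_cluster maps a cluster label to its F_ST values across runs.
def fst_summary(fst_by_cluster)
  all_values = fst_by_cluster.values.flatten
  cluster_means = fst_by_cluster.values.map { |v| mean(v) }
  cluster_sds = fst_by_cluster.values.map { |v| sample_sd(v) }
  {
    overall_mean: mean(all_values),     # ignores cluster assignment
    overall_sd: sample_sd(all_values),  # ignores cluster assignment
    sd_of_means: sample_sd(cluster_means),
    mean_sd: mean(cluster_sds),
    sd_of_sds: sample_sd(cluster_sds)
  }
end

# Made-up F_ST values for two clusters over three runs.
fst = { 1 => [0.10, 0.12, 0.14], 2 => [0.30, 0.30, 0.30] }
fst_summary(fst).each { |name, value| puts "#{name}\t#{value.round(4)}" }
```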

Bugs and Contributing

Please report all bugs (and any suggestions for improvements) to Michael G. Campana (campanam@si.edu).

CorrSieve Citation

Campana, M.G. et al. 2011. CorrSieve: software for summarizing and evaluating Structure output. Mol. Ecol. Resour. 11:349-352. https://onlinelibrary.wiley.com/doi/abs/10.1111/j.1755-0998.2010.02917.x

Acknowledgments

Rita Cannas helpfully checked the method for calculating ΔK. Dent Earl and Michał Żmihorski identified bugs in the software.

References

Campana, M.G. et al. 2011. CorrSieve: software for summarizing and evaluating Structure output. Mol. Ecol. Resour. 11:349-352. https://onlinelibrary.wiley.com/doi/abs/10.1111/j.1755-0998.2010.02917.x

Cockram et al. 2008. Association mapping of partitioning loci in barley. BMC Genet. 9: 16–29. https://bmcgenet.biomedcentral.com/articles/10.1186/1471-2156-9-16.

Evanno et al. 2005. Detecting the number of clusters of individuals using the software STRUCTURE: a simulation study. Mol. Ecol. 14: 2611-2620. https://onlinelibrary.wiley.com/doi/abs/10.1111/j.1365-294X.2005.02553.x.

Gao et al. 2007. A Markov Chain Monte Carlo approach for joint inference of population structure and inbreeding rates from multilocus genotype data. Genetics. 176: 1635-1651. https://www.genetics.org/content/176/3/1635.

Pritchard et al. 2000. Inference of population structure using multilocus genotype data. Genetics. 155: 945–49. https://www.genetics.org/content/155/2/945.
