Create README.md

RPGroup-PBoC · May 22, 2018 · ca3b0b8
1 parent 6bfa9b9
commit ca3b0b8
Showing 1 changed file with 142 additions and 0 deletions.
diff --git a/README.md b/README.md
@@ -0,0 +1,142 @@
+# A systematic approach for dissecting the molecular mechanisms of transcriptional regulation in bacteria.
+doi: [10.1101/239335](https://doi.org/10.1101/239335), [10.1073/pnas.1722055115](https://doi.org/10.1073/pnas.1722055115)
+
+The _jupyter_notebook/_ folder contains a variety of Jupyter notebooks related
+to the work in the appendices. In addition, the following two Jupyter notebooks
+can be used to visualize the Sort-Seq data:  
+- Sort-Seq_promoter_data_visualization.ipynb can be
+used to view plots of expression shifts, information footprints, and mutation
+rates for each promoter.  
+- Sort-Seq_energy_matrix_visualization.ipynb
+can be used to visualize the inferred energy matrices.
+
+All processed data is available in tidy-formatted .csv files that are most
+conveniently viewed using pandas in Python, but can also be loaded with a
+spreadsheet editor such as Microsoft Excel.
+
+All code used for analysis and plotting was written in Python. The following
+dependencies may be required to run the files. Note that energy matrix
+processing and some of the figure-plotting _.py_ code
+require PyMC 2.3.X and a Python 2.X installation.  
+- matplotlib  
+- numpy  
+- pandas  
+- scipy  
+- seaborn  
+- ipython  
+- biopython  
+- pymc
+- corner
+
+---
+
+# Sort-Seq experiments
+
+## Processed results
+
+Processed Sort-Seq data can be found in _code/sortseq/_. Each folder contains
+processed Sort-Seq data from a single sequencing run, which may include multiple
+experiments for different promoters and experimental conditions. Expression
+shifts, information footprints, and mutation rates are found in files that end in
+summary.csv. Energy matrices are found in separate .csv files.
+
+## Processing new data
+
+### Pipeline to calculate expression shifts, information footprints, etc:
+
+For each Sort-Seq experiment, a configuration file is made to list the
+experimental details associated with it. These .cfg files are placed in the
+_code/sortseq/(sortseq_experiment_name)/cfg\_files/_ folder.
+
+To process new Sort-Seq data (multiple quality-filtered ...bin*.fasta or
+...bin*.fastq files; names must end with the bin number as shown):  
+- Create a new folder for the Sort-Seq data in the _code/sortseq/_ directory.  
+- Create a .cfg file in a new folder called _cfg\_files/_ inside it. Add the appropriate details and the location of the sequencing files (see the existing .cfg files for examples).  
+- In the new folder, run the Python script from the command prompt (e.g. Terminal on Mac):  
+```
+python ../../processing/processing_seq.py cfg_files/(config_filename).cfg
+```
+- Once that has completed (this takes several hours with the current scripts), run the following to generate plots:  
+```
+python ../../processing/analysis.py cfg_files/(config_filename).cfg
+```
+
+### Processing .sql files from MPAthic:
+
+The energy matrices were inferred by MCMC using the MPAthic software
+(doi: https://doi.org/10.1101/054676), which provided .sql
+files (20 MCMC runs per inference). Note that the processing script (run as shown below) expects a certain file naming
+format and may need modification for new files. In _code/sortseq/(sortseq_experiment_name)/cfg\_files/_, the associated config file
+is edited to include several lines related to matrix identity, position, and
+length, as well as the location of the .sql files. For example:
+
+```
+emat_dir_sql = ../../../data/sortseq_MPAthic_MCMC/  
+# position information for each energy weight matrix model  
+[CRP]  
+TF = crp  
+TF_type = 1  
+mut_region_start = 26  
+mut_region_length = 26
+```
+To process the .sql files from each MCMC run, run the following command in the command prompt from _code/sortseq/(Sort-Seq folder
+name)/_:
+
+    python ../processing_emat.py cfg_files/(name of cfg file).cfg CRP
+
+---
+
+# DNA affinity chromatography and mass spectrometry experiments
+
+## Processed results
+
+Processed data can be found in _code/mass_spec/_. Each folder contains a summary
+.csv file with protein enrichment and details related to the experiment, such
+as the DNA target sequence.
+
+## Data analysis:
+
+Protein enrichment values are obtained from the 'ProteinGroups.txt' file that is
+generated by the software MaxQuant (http://www.maxquant.org), which is used to analyze
+Thermo '.raw' LC/MS/MS data files. Within each experimental folder (contained in
+_code/mass_spec/_), there is a .py Python file used to extract the relevant data
+from the 'ProteinGroups.txt' file found in the MaxQuant analysis txt folder.
+This script compiles a summarized '.csv' file that also contains additional
+experimental details.
+
+---
+
+# Miscellaneous
+
+_code/processing/_  
+- Scripts used to process Sort-Seq .fastq or .fasta files, and .sql files associated with energy matrix MCMC
+
+_code/analysis/_  
+- Jupyter notebooks and other analysis used in the work.
+
+_code/flow/_  
+- Histogram data from flow cytometry experiments to measure expression of promoter plasmids. Data is used in several figures.
+
+_code/figures/_  
+- Contains all Python scripts used to generate figures for main text and SI material.
+- Note that in many cases, the formatting or arrangement of figures was modified in Adobe Illustrator.
+- Note also that several of the SI figures require the full sequence data files, which are
+available upon request.
+
+_misc/plasmid_sequences/_  
+- Promoter plasmid sequences in .gb GenBank format
+
+_misc/primers_oligo_sequences/_  
+- Primer and oligo sequences used in the work
+
+_data/mass_spec_  
+- Contains the ProteinGroups.txt output files from MaxQuant analysis of raw Thermo LC/MS/MS data files.
+
+_data/sortseq_raw_  
+- Contains raw sequencing files from the Sort-Seq experiments (one sorted bin per file)
+
+_data/sortseq_pymc_dump_  
+- Contains the _mar_ promoter file of sequences across all sorted bins, used for the library analysis in Supplemental Fig. S1E,F.
+
+_data/sortseq_MPAthic_MCMC_  
+- Contains the _.sql_ files obtained from running MCMC for energy matrix inference with the MPAthic software.
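
The energy-matrix config excerpt in the README's MPAthic section (the `[CRP]` block with `TF`, `TF_type`, `mut_region_start`, and `mut_region_length`) can be read with Python's standard library. The following is a minimal sketch of that, not the repository's `processing_emat.py`, whose internals are not shown here. It uses the Python 3 module name `configparser` (the Python 2.X installation mentioned in the README would use `ConfigParser`). The `[Input]` section name and the `read_matrix_settings` helper are assumptions introduced for illustration: the excerpt does not show which section `emat_dir_sql` belongs to, and `configparser` requires every key to sit under a section header.

```python
import configparser

# Config text modeled on the README excerpt. The [Input] section name is
# hypothetical; only the keys and the [CRP] block come from the README.
CFG_TEXT = """
[Input]
emat_dir_sql = ../../../data/sortseq_MPAthic_MCMC/

# position information for each energy weight matrix model
[CRP]
TF = crp
TF_type = 1
mut_region_start = 26
mut_region_length = 26
"""


def read_matrix_settings(cfg_text, matrix_name):
    """Return the settings for one energy-matrix section as a dict."""
    parser = configparser.ConfigParser()
    parser.optionxform = str  # keep key case (TF, TF_type) as written
    parser.read_string(cfg_text)
    section = parser[matrix_name]
    return {
        "tf": section["TF"],
        "tf_type": section.getint("TF_type"),
        "start": section.getint("mut_region_start"),
        "length": section.getint("mut_region_length"),
        # Directory holding the .sql files (section name is an assumption).
        "sql_dir": parser["Input"]["emat_dir_sql"],
    }


settings = read_matrix_settings(CFG_TEXT, "CRP")
print(settings)
```

Passing the section name as an argument mirrors the trailing `CRP` on the README's `processing_emat.py` command line. Whether the actual script looks the section up this way is not confirmed by the README.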