From ca3b0b8092bbe6deaf1b82b2dab67b4bcca679f2 Mon Sep 17 00:00:00 2001
From: nbellive
Date: Tue, 22 May 2018 10:35:57 -0700
Subject: [PATCH] Create README.md

---
 README.md | 142 ++++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 142 insertions(+)
 create mode 100644 README.md

diff --git a/README.md b/README.md
new file mode 100644
index 0000000..6105818
--- /dev/null
+++ b/README.md
@@ -0,0 +1,142 @@
+# A systematic approach for dissecting the molecular mechanisms of transcriptional regulation in bacteria.
+doi: [10.1101/239335](https://doi.org/10.1101/239335), [10.1073/pnas.1722055115](https://doi.org/10.1073/pnas.1722055115)
+
+The _jupyter_notebook/_ folder contains a variety of Jupyter Notebooks related
+to the work in the appendices. In addition, the following two Jupyter notebooks
+can be used to visualize the Sort-Seq data.
+- Sort-Seq_promoter_data_visualization.ipynb can be
+used to view plots of expression shifts, information footprints, and mutation
+rate for each promoter.
+- Sort-Seq_energy_matrix_visualization.ipynb
+can be used to visualize the inferred energy matrices.
+
+All processed data is available in tidy-formatted .csv files that are most
+conveniently viewed using pandas in Python, but can also be loaded with a
+spreadsheet editor such as Microsoft Excel.
+
+All code used for analysis and plotting was written in Python. The following
+dependencies may be required to run the files. Note that energy matrix
+processing and some of the figure-plotting _.py_ code
+require PyMC 2.3.X and a Python 2.X installation.
+- matplotlib
+- numpy
+- pandas
+- scipy
+- seaborn
+- ipython
+- biopython
+- pymc
+- corner
+
+---
+
+# Sort-Seq experiments
+
+## Processed results
+
+Processed Sort-Seq data can be found in _code/sortseq/_. Each folder contains
+processed Sort-Seq data from a single sequencing run, which may include multiple
+experiments for different promoters and experimental conditions.
+Expression shifts, information footprints, and mutation rates are found in files
+that end in summary.csv. Energy matrices are found in separate .csv files.
+
+## Processing new data
+
+### Pipeline to calculate expression shifts, information footprints, etc:
+
+For each Sort-Seq experiment, a configuration file is made to list the
+experimental details associated with it. These .cfg files are placed in the
+_code/sortseq/(sortseq_experiment_name)/cfg\_files/_ folder.
+
+To process new Sort-Seq data (multiple quality-filtered ...bin*.fasta or
+...bin*.fastq files; each file name must end with its bin number as shown):
+- Create a new folder for the Sort-Seq data in the _code/sortseq/_ directory.
+- Create a .cfg file in a new folder called _cfg\_files/_. Add the appropriate details and the location of the sequencing files (see the existing .cfg files for examples).
+- In the new folder, run the Python script from the command prompt (e.g. Terminal on a Mac):
+```
+python ../../processing/processing_seq.py cfg_files/(config_filename).cfg
+```
+- Once that has completed (several hours with the current scripts), run the following to generate plots:
+```
+python ../../processing/analysis.py cfg_files/(config_filename).cfg
+```
+
+### Processing .sql files from MPAthic:
+
+The energy matrices were inferred by MCMC using the MPAthic software
+(doi: https://doi.org/10.1101/054676), which provided .sql files (20 MCMC runs
+per inference). Note that the processing script expects a certain file naming format and may need
+modification for new files. In _code/sortseq/(sortseq_experiment_name)/cfg\_files/_, the associated config file
+is edited to include several lines specifying matrix identity, position, and length, as well as the location of the .sql files.
+For example:
+
+```
+emat_dir_sql = ../../../data/sortseq_MPAthic_MCMC/
+# position information for each energy weight matrix model
+[CRP]
+TF = crp
+TF_type = 1
+mut_region_start = 26
+mut_region_length = 26
+```
+To process the .sql files from each MCMC run, in _code/sortseq/(Sort-Seq folder
+name)/_ run the following command in the command prompt:
+
+    python ../processing_emat.py cfg_files/(name of cfg file).cfg CRP
+
+---
+
+# DNA affinity chromatography and mass spectrometry experiments
+
+## Processed results
+
+Processed data can be found in _code/mass_spec/_. Each folder contains a summary
+.csv file with protein enrichment and details related to the experiment such
+as the DNA target sequence.
+
+## Data analysis
+
+Protein enrichment values are obtained from the 'ProteinGroups.txt' file that is
+generated by the software MaxQuant (http://www.maxquant.org), used to analyze
+Thermo '.raw' LC/MS/MS data files. Within each experimental folder (contained in
+_code/mass_spec/_), there is a .py Python file used to extract the relevant data
+from the 'ProteinGroups.txt' file found in the MaxQuant analysis txt folder.
+This script compiles a summary '.csv' file that also contains additional
+experimental details.
+
+---
+
+# Miscellaneous
+
+_code/processing/_
+- Scripts used to process Sort-Seq .fastq or .fasta files, and .sql files associated with energy matrix MCMC
+
+_code/analysis/_
+- Jupyter Notebooks and other analysis used in this work.
+
+_code/flow/_
+- Histogram data from flow cytometry experiments to measure expression of promoter plasmids. Data is used in several figures.
+
+_code/figures/_
+- Contains all Python scripts used to generate figures for main text and SI material.
+- Note that in many cases, the formatting or arrangement of figures was modified in Adobe Illustrator.
+- Note also that several of the SI figures require the full sequence data files. These are
+available upon request.
+
+_misc/plasmid_sequences/_
+- Promoter plasmid sequences in .gb GenBank format
+
+_misc/primers_oligo_sequences/_
+- Primer and oligo sequences used in this work
+
+_data/mass_spec_
+- Contains the ProteinGroups.txt output files from MaxQuant analysis of raw Thermo LC/MS/MS data files.
+
+_data/sortseq_raw_
+- Contains raw sequencing files from the Sort-Seq experiments (one sorted bin per file)
+
+_data/sortseq_pymc_dump_
+- Contains the _mar_ promoter file of sequences across all sorted bins, used for the library analysis in Supplemental Fig. S1E,F.
+
+_data/sortseq_MPAthic_MCMC_
+- Contains the _.sql_ files obtained from running MCMC for energy matrix inference with the MPAthic software.
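
The per-matrix config block shown under "Processing .sql files from MPAthic" can be read with Python's standard `configparser` module. The sketch below is only an illustration under stated assumptions, not the repository's own parsing code: the `[Input]` section name and the `read_matrix_region` helper are hypothetical (the README excerpt does not show which section `emat_dir_sql` belongs to), and treating `mut_region_start + mut_region_length` as an exclusive end position is an assumption about the coordinate convention.

```python
import configparser

# Text modeled on the README's config excerpt. The [Input] section name is a
# placeholder assumption; the keys under [CRP] are copied from the README.
CFG_TEXT = """
[Input]
emat_dir_sql = ../../../data/sortseq_MPAthic_MCMC/

# position information for each energy weight matrix model
[CRP]
TF = crp
TF_type = 1
mut_region_start = 26
mut_region_length = 26
"""


def read_matrix_region(cfg_text, section):
    """Return the TF name, TF type, and mutated-region span for one section."""
    parser = configparser.ConfigParser()
    parser.read_string(cfg_text)
    start = parser.getint(section, "mut_region_start")
    length = parser.getint(section, "mut_region_length")
    return {
        "tf": parser.get(section, "TF"),
        "tf_type": parser.getint(section, "TF_type"),
        "start": start,
        # Exclusive end position; this convention is an assumption.
        "end": start + length,
    }


region = read_matrix_region(CFG_TEXT, "CRP")
print(region)
```

The section name passed to the helper (`CRP` here) plays the same role as the final argument of the `processing_emat.py` command in the README. Section names are case-sensitive in `configparser`, while option names such as `TF` are matched case-insensitively.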