Basset

Deep convolutional neural networks for DNA sequence analysis.

Preprocess

#### preprocess_features.py

Merge a set of feature BED files for training into a single BED and activity table.

| Arguments | Type | Description |
| --- | --- | --- |
| target_beds_file | Table listing labels and BED files | One line per sample: label, then BED path |

| Options | Variable | Description |
| --- | --- | --- |
| -a | db_act_file | Existing database activity table |
| -b | db_bed | Existing database BED |
| -c | chrom_lengths_file | Table of chromosome lengths |
| -m | merge_overlap | Overlap length (after extension to feature_size) above which to merge features [Default: 200] |
| -n | no_db_activity | Do not pass along the activities of the database sequences [Default: False] |
| -o | out_prefix | Output file prefix [Default: features] |
| -s | feature_size | Extend features to this size [Default: 600] |
| -y | ignore_y | Ignore Y chromosome features [Default: False] |
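
The extend-then-merge behavior described by `-s` and `-m` can be pictured with a minimal sketch. This is not the script's actual implementation: the helper names (`extend_to_size`, `merge_features`, `activity_table`), the midpoint-centered extension, and the re-centering of merged features are all assumptions made for illustration.

```python
def extend_to_size(start, end, feature_size=600):
    """Center a window of feature_size bp on the feature midpoint (assumed behavior)."""
    mid = (start + end) // 2
    new_start = max(0, mid - feature_size // 2)
    return new_start, new_start + feature_size


def merge_features(features, feature_size=600, merge_overlap=200):
    """Extend (chrom, start, end, label) features, then merge neighbors on the
    same chromosome whose overlap exceeds merge_overlap.

    Returns a list of (chrom, start, end, labels) with labels as a set.
    """
    extended = sorted(
        (chrom, *extend_to_size(start, end, feature_size), label)
        for chrom, start, end, label in features
    )
    merged = []
    for chrom, start, end, label in extended:
        if merged:
            m_chrom, m_start, m_end, m_labels = merged[-1]
            overlap = min(end, m_end) - max(start, m_start)
            if chrom == m_chrom and overlap > merge_overlap:
                # Hypothetical choice: re-center the merged feature on the union.
                u_start, u_end = extend_to_size(
                    min(start, m_start), max(end, m_end), feature_size
                )
                merged[-1] = (chrom, u_start, u_end, m_labels | {label})
                continue
        merged.append((chrom, start, end, {label}))
    return merged


def activity_table(merged, labels):
    """One 0/1 row per merged feature, one column per sample label."""
    return [[int(label in feat[3]) for label in labels] for feat in merged]


features = [
    ("chr1", 1000, 1200, "A"),
    ("chr1", 1100, 1300, "B"),
    ("chr1", 5000, 5100, "A"),
]
merged = merge_features(features)
table = activity_table(merged, ["A", "B"])
```

In this toy input the first two features overlap by more than 200 bp after extension, so they collapse into one 600 bp feature active in both samples, while the third stays separate.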

#### seq_hdf5.py

Construct an HDF5 file, dividing the data into training, validation, and test subsets.

| Arguments | Type | Description |
| --- | --- | --- |
| fasta_file | FASTA | FASTA file of sequences. |
| targets_file | Table | Targets activity table. |
| out_file | HDF5 | Output HDF5 file. |

| Options | Variable | Description |
| --- | --- | --- |
| -b | batch_size | Align sizes with batch size |
| -c | counts | Validation and training percentages are given as raw counts [Default: False] |
| -r | permute | Permute sequences [Default: False] |
| -s | random_seed | numpy.random seed [Default: 1] |
| -t | test_pct | Test % [Default: 0] |
| -v | valid_pct | Validation % [Default: 0] |
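
How `-r`, `-s`, `-t`, `-v`, and `-c` might interact can be sketched as follows. The function `split_indices` is hypothetical, it treats the percentages as fractions between 0 and 1 (an assumption), and it uses the standard-library `random` module as a stand-in for the `numpy.random` seeding the script documents. The `-b` batch-size alignment is not modeled.

```python
import random


def split_indices(n_seqs, test_pct=0.0, valid_pct=0.0,
                  permute=False, seed=1, counts=False):
    """Return (train, valid, test) index lists covering range(n_seqs)."""
    order = list(range(n_seqs))
    if permute:
        # Seeded shuffle so that the split is reproducible.
        random.Random(seed).shuffle(order)

    if counts:
        # Values are raw sequence counts rather than fractions.
        n_test, n_valid = int(test_pct), int(valid_pct)
    else:
        n_test = int(round(n_seqs * test_pct))
        n_valid = int(round(n_seqs * valid_pct))

    n_train = n_seqs - n_test - n_valid
    if n_train < 0:
        raise ValueError("test and validation sizes exceed the number of sequences")

    train = order[:n_train]
    valid = order[n_train:n_train + n_valid]
    test = order[n_train + n_valid:]
    return train, valid, test


train, valid, test = split_indices(100, test_pct=0.1, valid_pct=0.2, permute=True)
```

Permuting before slicing matters when the input FASTA is sorted (for example by chromosome); without it the test subset would simply be the tail of the file.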

#### basset_sample.py

Sample sequences from an existing database.

| Arguments | Type | Description |
| --- | --- | --- |
| db_bed | BED | Existing database BED. |
| db_act_file | Table | Existing database activity table. |
| sample_seqs | int | Number of sequences to sample. |
| output_prefix | str | Filename prefix for output BED and activity table files. |

| Options | Variable | Description |
| --- | --- | --- |
| -s | seed | Random number generator seed [Default: 1] |
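
Sampling has to keep the BED and the activity table in step, which a short sketch makes concrete. The function `sample_database` is hypothetical, and it assumes the activity table has one header line followed by rows in the same order as the BED entries; the real file layout may differ.

```python
import random


def sample_database(bed_lines, act_lines, sample_seqs, seed=1):
    """Pick the same sample_seqs rows from a BED and its activity table."""
    header, rows = act_lines[0], act_lines[1:]  # assumed: first line is a header
    if len(rows) != len(bed_lines):
        raise ValueError("BED and activity table have different numbers of rows")

    rng = random.Random(seed)
    # Sample without replacement, then sort to preserve the original order.
    keep = sorted(rng.sample(range(len(bed_lines)), sample_seqs))
    return [bed_lines[i] for i in keep], [header] + [rows[i] for i in keep]


bed = [f"chr1\t{i * 1000}\t{i * 1000 + 600}" for i in range(10)]
act = ["\tA\tB"] + [f"chr1:{i * 1000}-{i * 1000 + 600}\t1\t0" for i in range(10)]
bed_out, act_out = sample_database(bed, act, sample_seqs=4)
```

Drawing one set of row indices and applying it to both files is what keeps each sampled sequence paired with its activity row.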