Insert a list of accessions into the samples: section of the config.yml file! #284

Open
ccbaumler opened this issue Jun 10, 2023 · 1 comment
Labels
code documentation Improvements or additions to documentation

Comments

@ccbaumler

When working with the SRA, you can export a list of accession numbers. To insert that list directly into a config.yml file for use in genome-grist, we can use either the sed command or base Python: format the list, read the config.yml file, and insert the formatted list into the samples: section.

Using sed

My config.yml file:

samples:

outdir: outputs.trial/

sourmash_databases:
 - gtdb-rs207.genomic-reps.dna.k31.zip

The accession list, exported directly from the SRA Run Browser as a txt file:

ERR5004365
ERR5003005
ERR5003006
ERR5003008
ERR5003010
ERR5003011
ERR5003578
ERR5001725
ERR5001726
ERR5001728

To format the accession list and insert it into the samples: section of the config.yml file:

sed "s/^/ - /" short_acc_list.txt | sed "/samples:/r /dev/stdin" -i config.yml

The first sed command prepends a space, a hyphen, and another space to each line of the accession list txt file, turning each accession into a list item in the format the config file uses.

The second sed command reads the output of the first command from standard input and inserts it after the line matching samples: in the config.yml file, editing the file in place (-i). Reading /dev/stdin with the r command is a GNU sed extension, so this assumes GNU sed.

This produces a config.yml file in the format genome-grist expects:

samples:
 - ERR5004365
 - ERR5003005
 - ERR5003006
 - ERR5003008
 - ERR5003010
 - ERR5003011
 - ERR5003578
 - ERR5001725
 - ERR5001726
 - ERR5001728

outdir: outputs.trial/

sourmash_databases:
 - gtdb-rs207.genomic-reps.dna.k31.zip

Using base python

Starting from the same two files as above, a Python script can produce the same output as the sed command line.

# Read the accession list and format each non-empty line as a list item for the config file
with open('short_acc_list.txt', 'r') as fp:
    modified_lines = [' - ' + line.strip() for line in fp if line.strip()]

# Read the config file and insert the formatted list on new lines after the first `samples:`
with open('config.yml', 'r') as fp:
    content = fp.read()
modified_content = content.replace('samples:', 'samples:\n' + '\n'.join(modified_lines), 1)

# Overwrite the existing config file with the modified content that contains the formatted list
with open('config.yml', 'w') as fp:
    fp.write(modified_content)
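
Both the sed command and the script above insert the list again every time they are run, so running them twice duplicates the accessions. Below is a minimal sketch of one way to avoid that, assuming the config keeps the simple layout shown above. The function name `insert_samples` is hypothetical and not part of genome-grist, and the check for existing entries is deliberately naive: it looks at every ` - value` line in the file, not only those under samples:.

```python
from typing import Iterable


def insert_samples(config_text: str, accessions: Iterable[str]) -> str:
    """Return config_text with accessions inserted after the first `samples:` line.

    Hypothetical helper: entries that already appear anywhere in the file as
    a ` - value` line are skipped, so a second run adds nothing new.
    """
    lines = config_text.splitlines()

    # Every existing list entry in the file, with the leading "-" removed.
    existing = {ln.strip()[1:].strip() for ln in lines if ln.strip().startswith("-")}

    # Drop blank input lines and repeated accessions, keeping input order.
    wanted = dict.fromkeys(a.strip() for a in accessions if a.strip())
    new_items = [f" - {acc}" for acc in wanted if acc not in existing]

    for i, line in enumerate(lines):
        if line.strip() == "samples:":
            # Splice the new entries in directly below `samples:`.
            lines[i + 1:i + 1] = new_items
            break
    else:
        raise ValueError("no `samples:` line found in config")

    return "\n".join(lines) + "\n"
```

To apply it to the files used above, read config.yml into a string, pass it together with the lines of short_acc_list.txt to `insert_samples`, and write the returned text back to config.yml.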
@ccbaumler ccbaumler added documentation Improvements or additions to documentation code labels Jun 10, 2023
@ccbaumler
Author

A continuation using awk to parse a tab-separated dataset.

awk -F'\t' 'NR>1 && NF {print " - " $1}' assembly-test.tsv

Here is what each part of the awk command does:

  • -F'\t' sets the field separator to a tab
  • NR>1 selects only rows with a row number greater than 1 (i.e. it skips the header)
  • NF is the number of fields on the line. It is 0 for an empty line, so this condition skips empty lines
  • && requires both conditions to be true before the action runs
  • {print " - " $1} prints a space, a hyphen, and another space, followed by the contents of the first field as split by -F

This awk command can then be piped into our sed command from the previous comment:

awk -F'\t' 'NR>1 && NF {print " - " $1}' assembly-test.tsv | sed "/samples:/r /dev/stdin" -i config.yml
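
For anyone following the Python route instead, the awk step can be sketched with the standard csv module. This is an illustration only: the function name `first_column_items` is hypothetical, and the sample rows in the usage note are made up, since the contents of assembly-test.tsv are not shown here.

```python
import csv
from typing import Iterable, Iterator


def first_column_items(tsv_lines: Iterable[str]) -> Iterator[str]:
    """Yield ' - <first field>' for each non-empty row after the header."""
    # QUOTE_NONE splits on tabs only, without interpreting quote characters.
    reader = csv.reader(tsv_lines, delimiter="\t", quoting=csv.QUOTE_NONE)
    next(reader, None)  # skip the header row (awk: NR>1)
    for row in reader:
        if row:  # csv returns an empty list for a blank line (awk: NF)
            yield " - " + row[0]
```

For example, `list(first_column_items(open('assembly-test.tsv', newline='')))` gives the formatted lines, which can then be inserted after sam: in the same way as the formatted accession list in the first comment. With made-up input rows `GCA_1<TAB>foo` and `GCA_2<TAB>bar` below a header line, the result is ` - GCA_1` and ` - GCA_2`.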
