Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add tool AMBER #5970

Merged
merged 38 commits into from
May 20, 2024
Merged

Add tool AMBER #5970

merged 38 commits into from
May 20, 2024

Conversation

SantaMcCloud
Copy link
Contributor

FOR CONTRIBUTOR:

  • I have read the CONTRIBUTING.md document and this tool is appropriate for the tools-iuc repo.
  • License permits unrestricted use (educational + commercial)
  • This PR adds a new tool or tool collection
  • This PR updates an existing tool or tool collection
  • This PR does something else (explain below)

label="Genome coverages TSV file"
help="Input a TSV file where the genome coverage is stated. Look at the help section to se how this file should look like!" />
</section>
<section name="tox" title="taxonomic binning-specific option">
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you also allow using it from history, we need a test for it, I think a subsampled file as shown here could work: https://github.com/galaxyproject/tools-iuc/tree/2d380eea5900d7650c5ddfe8cf8d19d2554d0031/tools/recentrifuge/test-data/test-db

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should be done with a26dba5

<data name="metrics_bin" format="tsv" label="${tool.name}: Bin metrics" />
</outputs>
<tests>
<test expect_num_outputs="4">
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this test need an output check

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done with 80dd926

<param name="keyword" value="circular element" />
</conditional>
</section>
<output name="html">
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we check for something more meaningful ? Doesn't the tool produce the same output on each run? Then we could check if the files are the same ... ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The problem is that the HTML file is too big to upload here, but I used the desc option to bring something unique in the file, this was done with 0b53f9d

label="Genome coverages TSV file"
help="Input a TSV file where the genome coverage is stated. Look at the help section to se how this file should look like!" />
</section>
<section name="tox" title="taxonomic binning-specific option">
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@@ -0,0 +1,336 @@
<tool id="amber" name="AMBER" version="@TOOL_VERSION@+galaxy@VERSION_SUFFIX@" profile="21.05">
<description>Evaluation package for the comparative assessment of genome reconstructions and taxonomic assignments from metagenome benchmark datasets</description>
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The description is a bit long, can it be shortend?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Doen with 0e60c5f

@@ -0,0 +1,336 @@
<tool id="amber" name="AMBER" version="@TOOL_VERSION@+galaxy@VERSION_SUFFIX@" profile="21.05">
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should the tool ID and folder name be cami_amber?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done with 40d1e78 , 54c689b and 9795d98

<param argument="--min_length" type="integer" value="" optional="true"
label="Mimimum length of sequences"
help="Input how long the sequnces has to be" />
<param argument="--stdout" type="boolean" truevalue="--stdout" falsevalue="" checked="false"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How large is this? Should we skip this parameter or enable it by default? Or put it into a real optional output file?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done with 54c689b


cp './output/index.html' '$html' &&
cp './output/heatmap_bar.png' '$path_to_html' &&
cp './output/results.tsv' '$result' &&
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

instead of copying, moving would be better.

You can completely avoid it by using from_work_dir on the <output>s ...

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done with 2cbb875

</inputs>
<outputs>
<data name="html" format="html" label="${tool.name}: HTML" />
<data name="result" format="tsv" label="${tool.name}: Results" />
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
<data name="result" format="tsv" label="${tool.name}: Results" />
<data name="result" format="tabular" label="${tool.name}: Results" />

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done with 9795d98

help="Input a TSV file where the genome coverage is stated. Look at the help section to se how this file should look like!" />
</section>
<section name="tox" title="taxonomic binning-specific option">
<param argument="--ncbi_dir" type="text" value="" optional="true"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The user does not know the path. We need to use the NCBI data table here. Please look at other IUC tools that are using this data table.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should be done with a26dba5

@SantaMcCloud
Copy link
Contributor Author

Oh the link in the commit should not be there in 80dd926 . I wanted to say that the first test is better now.

</requirements>
<command detect_errors="exit_code">
<![CDATA[
mkdir -p output inputs&&
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
mkdir -p output inputs&&
mkdir -p output inputs &&

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

<param argument="--gold_standard_file" format="tabular" type="data"
label="Mapping of contigs or reads"
help="Input the gold standard file here so amber know the correct IDs for each contig/read" />
<repeat name="label" title="Binning names "
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
<repeat name="label" title="Binning names "
<repeat name="label" title="Binning names"

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@SantaMcCloud what about putting the binning-files into the repeat. And remove the multiple="0".

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

30acaae

I can do this, but it depends on your answer to my comment here: #5970 (comment)

<param argument="--desc" type="text" value=""
label="HTML description"
help="Enter the HTML page description here" />
<section name="genome" title="Genome binning-specific options" >
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we also put this into the main repeat section? I imagine something like

repeat:
- input-file
- binning-name
- min_comp
- max_comp

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can make 2 repeat sections. One with the Files and the names since the names, if you want to input some, have to match the number of files. The min_comp and max_conta can also be put together, but there are not needed for every file, since the count for all when some values are given. But for the second repeat section, it can happen that one param has more thresholds than the other, so does it make sense to combine them?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

- input-file
- binning-name

-> repeat

- min_comp
- max_comp

-> normal param, but a comma seperated list, so extra repeat to create that list for both params (only ones for tool)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

</repeat>
<param argument="--remove_genomes" type="data" format="tabular" optional="true"
label="TSV file for genomes to remove"
help="Input a TSV file with binid and type in each line. In the help section is an example. WARNING: IF THE LIST CONTAIN ALL GENOME THE PROGRAM WILL FAIL!" />
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think we need to add caps letter here :)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@SantaMcCloud
Copy link
Contributor Author

Currently, the _add_length.xml isn't working, since I have to access the package location somehow to run the .py file there. Because of this problem, I can not add the convert.xml to have a converter for the biobox format.

@bgruening
Copy link
Member

Try a combination of which and dirname to get the desired path relative to the main executable. Example here

DB_PATH="\$(dirname "\$(dirname "\$(which cojac)")")/share/cojac/voc" &&

@SantaMcCloud
Copy link
Contributor Author

SantaMcCloud commented May 5, 2024

@paulzierep when you have time you can play a bit with all tools. If there is something I should change, just let me know :)

@bgruening thank you for the help!! :D

ln -s '${gold_standard_file}' '$gold_standard_file.element_identifier' &&
ln -s '${fasta_file}' '$fasta_file.element_identifier' &&

python "\$(dirname "\$(dirname "\$(which amber.py)")")/lib/python3.11/site-packages/src/utils/add_length_column.py"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@SantaMcCloud I see now. This trick only works if the scripts are in a neutral place, like /share/ etc ...
It will break on installations with python3.10 or python3.12.

We need to fix the conda package here I think. In the conda package we symlink the lib/python3.11/site-packages/src/utils/add_length_column.py package to /bin and then everyone should be happy.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see you also PR the recipe. I will delete my PR then :)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also thank you again for the help!

@SantaMcCloud
Copy link
Contributor Author

Now all should be good!

Copy link
Contributor

@paulzierep paulzierep left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  • Initial runs for all tools work and produce correct output. Thank you very much @SantaMcCloud
  • The taxonomic binning needs a test case.
  • I need to check which DM we can use for the nodes.dmp file.
  • The HTML out needs its heatmap

@@ -0,0 +1,91 @@
<tool id="cami_amber" name="CAMI AMBER add lenght column" version="@TOOL_VERSION@+galaxy@VERSION_SUFFIX@" profile="21.05">
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the tools need a different id each

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@@ -0,0 +1,193 @@
<tool id="cami_amber" name="CAMI AMBER convert to biobox" version="@TOOL_VERSION@+galaxy@VERSION_SUFFIX@" profile="21.05">
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the tools need a different id each

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

]]>
</command>
<inputs>
<param argument="--gold_standard_file" type="data" format="tabular"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we add, that this works for any biobox file, but amber only needs seq lengths for the gold standard file ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.


**Input**

This tool required 2 inputs:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess that is the wrong help ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

]]>
</command>
<inputs>
<param name="work" type="select" label="How should the tool work?"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this needs a better description. Maybe like: Merge binning output files ? If yes all input files will be merged into one biobox output (required input by amber). If no each input will be converted into a separate biobox output file. <- although the tool can do it, I am not sure if this is ever needed for anything -> do you have an idea @SantaMcCloud ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

</param>
<conditional name="input">
<param name="is_select" type="select" label="Choose how to input the files"
help="You can either input the files manually or choose to input a collection">
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe rather something like you can provide each input seperately or as a collection

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

tools/cami_amber/cami_amber.xml Outdated Show resolved Hide resolved
<section name="tox" title="taxonomic binning-specific option">
<conditional name="ncbi">
<param name="is_select" type="select" label="Want to use the NCBI database?"
help="Select yes if you want to use the NCBI database from a data manager">
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the choice should be first: Do you want to evaluate taxonomic binning ? Then select ncbi file manually or form DM

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

</assert_contents>
</output>
</test>
<test expect_num_outputs="4">
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think to test weather the taxonomic binning evaluation works we need a test dataset we do not have. Strange that they do not provide it in their test set. But I think you can create a minimal example by manipulating the example from github:

Gold standard:

@Version:0.9.1
@SampleID:gsa

@@SEQUENCEID BINID      TAXID  LENGTH
RH|P|C37126  Sample6_89 45202  25096
RH|P|C3274   Sample9_91 32644  10009
RH|P|C26099  1053046    765201 689201
RH|P|C35075  1053046    765201 173282
RH|P|C20873  1053046    765201 339258

Binning1:

@Version:0.9.1
@SampleID:gsa

@@SEQUENCEID BINID      TAXID  LENGTH
RH|P|C37126  Sample6_89 32644   25096
RH|P|C3274   Sample9_91 32644  10009
RH|P|C26099  1053046    765201 689201
RH|P|C35075  1053046    765201 173282
RH|P|C20873  1053046    765201 339258

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And can you also check weather is also works if we only supply the nodes.dmp file ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I did check it, and you need all 3 to make it work. I just forgot it to mention it somewhere sorry

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To evaluate the taxonomic binning we do also need a test data set that has the TAXID column in the gold standard and the binning file. Then it should produce a html like this one: https://cami-challenge.github.io/AMBER/cami_ii_mouse_gut/
The test here still misses that column.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I used this data but got errors, I found out why now. I did create test_gold.tsv. test_binning.tsv and test_binning2.tsv to create the tox binning, BUT you can only create this tab in the HTML result when running this tool manually with either the download NCBI files or the DM. When using the test-db files, we never get such a output since they are mocked or compromised to fit the 1 MB restriction.

</section>
</inputs>
<outputs>
<data name="html" format="html" from_work_dir="output/index.html" label="${tool.name}: HTML" />
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The html output is missing the heatmap_bar.png either the output file needs to be patched to get it from an online source (maybe the amber github), or you need to add seconday files to the html, its a minor thing, but looks much better

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I did try to make it, but I forgot a step to implement it correctly. Now it should work after I remember what I did missed when did commit the first version of it

@SantaMcCloud
Copy link
Contributor Author

SantaMcCloud commented May 7, 2024

<tool id="cami_amber" name="CAMI AMBER" version="@TOOL_VERSION@+galaxy@VERSION_SUFFIX@" profile="21.05">
<description>Evaluation package for MAGs</description>
<macros>
<token name="@TOOL_VERSION@">2.0.4</token>
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@SantaMcCloud can you please move those macros to a separate macro file and import it into your tools. The idea behind macros is to DRY. Don't repeat yourself. requirements can also be moved into this macro file and shared. Same for citations and other stuff that can be shared.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@bgruening
Copy link
Member

Thanks @SantaMcCloud very nice work!

@bgruening bgruening merged commit 807e74c into galaxyproject:main May 20, 2024
13 checks passed
@SantaMcCloud
Copy link
Contributor Author

Thank you :D

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants