Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

standardize I/O nomenclature across all tools? #1569

Open
eboyden opened this issue Dec 22, 2021 · 17 comments
Open

standardize I/O nomenclature across all tools? #1569

eboyden opened this issue Dec 22, 2021 · 17 comments

Comments

@eboyden
Copy link

eboyden commented Dec 22, 2021

Although most tools can read from either a file or stdin and write to either a file or stdout, the manner in which to specify that seems to vary from tool to tool. It's not consistent when stdin/stdout is assumed unless a file is specified (as with sort), vs. when a file or stdin/out must be specified but an option is not required (as with fixmate), vs. when an option is required for output but not input (as with addreplacerg).

Is there any possibility of standardizing the I/O across all tools to lessen the confusion? With over 30 tools it's difficult to remember how they all behave. Personally I would vote to always require -i and -o for specifying a file name (but perhaps assuming stdin/out in the absence of either option, though -i stdin and -o stdout would also be supported).

@Poshi
Copy link

Poshi commented Dec 23, 2021

I strongly agree with this idea. In fact, I've been tempted to open an issue for similar considerations.
Using -i stdin could be ok, but I think that -i /dev/stdin would be a better option, as it just means that you should read the given file, and not interpret the stdin string as the marker for standard input. Keeping - as standard also sounds good, as it has been used for a long time.

But unifying input and output parameters is a must for ease of use and to lessen the (potentially catastrophic) errors: let's say that we want to index all bam files in a folder. Common sense say you can try samtools index *.bam. But this will likely cause the overwrite of the second bam in the list by the index of the first bam (OK, maybe I will get an error saying that the file already exists, but the point is that it is very dangerous and error prone).
Another fun trick is to run samtools merge *.bam > merged.bam. This will result on the first bam being overwritten by the merge of the other bam files, and an empty merged.bam file.
Because sometimes the output file is the first in the list (like in merge) or the last of the list (like in index).

I would advise having all the output files specified via parameter; no bare output files in the command line. And for homogeneity, the input files should follow the same policy.

@jmarshall
Copy link
Member

Samtools follows the usual Unix conventions and in particular the Unix filter conventions.1 Like the overwhelming majority of Unix tools, non-option arguments are input filenames, - in the input filenames indicates standard input, an empty list of input filenames means to read from standard input, and output can be redirected with -o FILE (typically) or via the shell (which is usually an inferior technique for production use).

These are the conventions because they work well. In contrast, -i FILE to specify input files would be an abomination — in particular, -i *.bam would not work in the Unix setting (where the shell does the wildcard expansion).

Unfortunately, for historical reasons, a few samtools commands follow a slightly different convention.

This is what you see with samtools index *.bam for example. (This could be improved, at least when there are three or more files matching *.bam, by fixing bam_index.c to validate argc more carefully.)

Where it is practical, these outlier commands can be improved to also follow the desired usual conventions. Thus PR #1434 added the conventional samtools merge -o merged.bam *.bam syntax, alongside the existing historical syntax (which is of course too entrenched in the world's pipelines to be removed without a difficult-to-coordinate multi-year transition).

IMHO there is little point in making a generalised plea like “samtools should standardize its I/O nomenclature”. Well, of course it should! And substantially it does.

What is useful, instead, is to identify individual specific pain points. Then they can each be improved, as was done for merge with -o in v1.13.

Footnotes

  1. Including the usual Unix filter filename argument conventions, which are so entrenched that they're not even discussed in those links.

@Poshi
Copy link

Poshi commented Dec 23, 2021

Makes sense to have the inputs specified as non-option arguments.

-i *.bam could work, but I agree that it is not supported by the standard command parsing tools (but it can be seen in action in python code using the standard module argparse) and is not the best idea.

I can check all tools and make a specific list, but to me, the pain points are the ones where you don't only follow the usual Unix conventions you just mentioned. Specifically, the ones that write to a file non specified by an option argument, because these are the ones that can cause data loss.
Now, merge can be used with the Unix standard syntax, but unfortunately it can still be used with the non-standard syntax, and it can cause data loss. If someone tries to merge a bunch of files using the Unix conventions and redirecting the output... data loss.
To me, 'pain point' is having to check the help/documentation of every tool every time I have to use them because I (fortunately) know that each one is different and they pose a risk. This can only be solved with an uniform way to specify outputs, so I don't have to check the manual every now and then.

I know that it is not an easy and quick step, but you could try to move all the tools to follow these conventions. For a couple versions print a big warning about command line changes. Then, remove the old ones. The old pipelines that use the old syntax will keep working, as they will keep using old versions. As soon as someone decides to update the version because they need some new functionality, they can do the command line adaptation along with the modification that triggered the update.
It's a long path, but it have to start somewhere.

Of course, that's my point of view and you are free to treat these comments as you consider :-) But moving in the direction you just pointed as standard and removing the non-standard behavior will make samtools a lot less error prone.

@eboyden
Copy link
Author

eboyden commented Dec 23, 2021

I sloppily and regrettably used shorthand in my original post; when I wrote stdin and stdout I didn't mean those terms literally, I meant /dev/stdin or /dev/stdout, or alternatively -.

@eboyden
Copy link
Author

eboyden commented Dec 23, 2021

And with respect, several of what I assume are commonly used tools, e.g. view, addreplacerg, fixmate, sort, and index, all have slightly different I/O nomenclature, and so a generalized plea is warranted - the point is the lack of standardization across tools, not intrinsic difficulty with any particular tool. @Poshi 's remark about having to constantly check the documentation hits painfully close to home, despite my using Samtools for a decade. I agree that it's not a critical bug fix, but any progress over time would be appreciated. And I don't mind updating my pipelines when I update my software - that's to be expected, and users who do so without checking the release notes do so at their own risk.

"-i FILE to specify input files would be an abomination" - not entirely sure I agree with this. Or at least lots of software follows this convention, e.g. Picard/GATK/fgbio/many aligners. Not saying this is preferable either - I can see pros/cons both ways - and so ultimately just picking a format and standardizing it across all tools would be an enormous win.

@jkbonfield
Copy link
Contributor

jkbonfield commented Dec 23, 2021

Specifically, the ones that write to a file non specified by an option argument, because these are the ones that can cause data loss. Now, merge can be used with the Unix standard syntax, but unfortunately it can still be used with the non-standard syntax, and it can cause data loss. If someone tries to merge a bunch of files using the Unix conventions and redirecting the output... data loss. To me, 'pain point' is having to check the help/documentation of every tool every time I have to use them because I (fortunately) know that each one is different and they pose a risk. This can only be solved with an uniform way to specify outputs, so I don't have to check the manual every now and then.

Not having to check the manual is good, and it can be solved by adding -o options where missing for example. However removing the old behaviour will produce vastly more pain. All those pipelines and scripts that know the merge syntax without a -o would suddenly stop working after an upgrade, and we want people to be able to keep up to date. While I agree the sytax isn't good, we're simply not going to remove it for the sake of tidyness.

Rather than a broad complaint, can we please get specifics listed here about sub-commands that need fixing?

As for /dev/stdin and /dev/stdout, on linux these are supported already as the OS handles this, however note not all commands can cope with data coming from pipes (which these are). Operations such as seeking or checking file length may fail (depending on the nature of the stdin provided). Thereoften isn't much we can do about that, but it's still worth reporting in cases where you believe the tool shouldn't need to be seeking.

@eboyden
Copy link
Author

eboyden commented Dec 23, 2021

To be clear, my original post wasn't about whether /dev/stdin and /dev/stdout are supported - I'm aware that they already are. My request only pertained to how they (or a file name) are pointed to by each tool. And I wasn't necessarily advocating for removing features where it can be avoided; if the old syntax can be maintained to limit developer effort and user pain, while adding additional syntax options to create a uniform standard across tools, then so much the better.

It seems like the preference would be to adopt a universal format of
toolname -o outputsource inputsource
where the output can always use the -o option (but doesn't necessarily have to depending on existing functionality), and the input does not use an option. All tools currently support specifying a filename or stdin without an option; some tools (like view and sort) will recognize stdin and/or stdout if no input or output is specified, others (like fixmate) require a filename or "-" or "/dev/std[in/out]" to be specified.

I could also see a case for (almost) always outputting to stdout in the absence of the -o option, and (almost) always inputting from stdin in the absence of a specified input source, though this is perhaps lower priority. (FWIW I am not a fan of explicitly specifying an output stream without the -o option, as fixmate and markdup currently require; this leads to multiple I/Os in the command line with no clear indication of whether they're inputs or outputs other than the order in which they're listed. Likewise for not using an explicit option to indicate supplementary input files e.g. bed files for coverage-based tools like ampliconstats and bedcov, reference files for tools like calmd, or the sam file for reheader. But if disabling this functionality for tools where it exists isn't in the cards, so be it.)

In decreasing priority order (I would probably advocate for focusing on Category 1, followed by perhaps adding automatic stdout functionality for collate/merge/fixmate/markdup in Category 2):

Tools that do not currently support -o:

  • tview
  • index
  • idxstats
  • flagstat
  • stats
  • bedcov
  • coverage
  • split (special case as it may produce multiple files)
  • calmd
  • fixmate
  • markdup
  • reheader
  • targetcut
  • phase

Tools that do not currently output to stdout in the absence of a specified output:

  • index (writes to a file instead)
  • collate (unless -O is specified)
  • merge
  • faidx (writes index in the same directory as input when -o not used, not to stdout)
  • fqidx (writes index in the same directory as input when -o not used, not to stdout)
  • fixmate
  • markdup

Tools that do not currently input from stdin in the absence of a specified input (seems like automatically reading from stdin, as view and sort do, is the exception rather than the rule):

  • tview
  • quickcheck
  • index
  • collate
  • idxstats
  • flagstat
  • bedcov (requires index, cannot pipe in)
  • split
  • depth
  • mpileup
  • coverage
  • merge
  • cat (special case)
  • import (special case)
  • faidx (special case)
  • fqidx (special case)
  • calmd
  • fixmate
  • markdup
  • addreplacerg
  • reheader
  • targetcut
  • phase
  • depad
  • ampliconclip
  • samples

@lh3
Copy link
Member

lh3 commented Dec 24, 2021

A side note: -i *.bam is a bad CLI design. This probably started with people who required - for every argument. Then they realized something as simple as -i *.bam wouldn't work, so they added a new rule for -i *.bam. This feature unfortunately complicates the implementation of CLI parsers, breaks the backward compatibility with unix getopt and confuses more unix users like me.

In my view, this -i *.bam feature is a typical example of creating a chain of complex problems for a minor subjective opinion. Just follow the unix practice and we wouldn't have these problems to start with.

@Poshi
Copy link

Poshi commented Dec 24, 2021

Ups, sorry @eboyden, I came to your issue and added my opinions, including feature removal, that were not on your original text. My fault :-/

I subscribe your last comment. I also agree with @lh3 message (that's syntax is not good, but it exists).

I have a couple examples of debatable situations where this updates can be discussed.

  • In quickcheck, the "quick" name comes from the ability to seek to the end of the file. If it accepts the input from stdin, then it is no longer quick, as it must ingest the whole file to get the 28 interesting bytes at the end of the stream. So accepting input from stdin makes sense from a CLI homogeneity point of view, but defeats the purpose of the tool.
  • In index, most of the use cases will be to use the standard output naming (so you don't have to provide to all the other tools the specific name you choose), so an -o won't be much used. But I had several times the need to index a bunch of files, and that is something that cannot be safely done with *.bam while keeping the old syntax. Maybe using some kind of template at the output option could help in indexing multiple files.

This is just to say that adding output options to all the tools that miss them will be a nice gain for all the users, but historical reasons will still limit the improvements.

There could be a method to enable/disable the command line formats: an environment variable (or file configuration) like SAMTOOLS_CLI_VERSION that, if set to a number, only exposes that CLI format. Old pipelines won't suffer because without this variable the old interface is available, and new pipelines and users can set this variable to "2" to be able to use only the new version and avoid the dangers of overwriting anything in case of human error.

I know I'm giving new ideas of improvements that should be in a new issue if you wanted to implement them. Just ignore these ideas, as I'm giving them for context and to broaden your minds about the possibilities. I don't want to introduce more noise to the original issue.

@eboyden
Copy link
Author

eboyden commented Dec 24, 2021

No worries @Poshi . And I too agree with @lh3 , to the point that I'm not sure it ever would have occurred to me to try using *.bam as an input with or without -i, except maybe for index. To me the value of these tools are their ability to be implemented into long pipes run on single inputs, not used in isolation on an entire directory of files. Which just means my imagination is lacking.

And I completely agree that if there are good reasons for not modifying certain tools, there isn't a point in forcing it for comprehensiveness' sake. index is a special case, and I think all of Category 3 above is optional at best - it's easy enough to just specify the input if you want to pipe in, and just because a couple tools happen to have additional functionality beyond that doesn't mean that should necessarily become the standard. I was only trying to be comprehensive in my previous post to provide a full view of the landscape. To me, user-submitted issues are merely suggestions for the developers to consider and then decide if/when/how to implement. "No that's a bad idea and here's why" and "Yeah that's a good idea but I have no time to do it so either be prepared to wait years or feel free to submit your own PR" are both perfectly acceptable and appreciated responses.

@jkbonfield
Copy link
Contributor

jkbonfield commented Jan 4, 2022

So being more practical about what can be done to improve uniformity while maintaining backwards compatibility.

Tools that do not currently support -o:

* tview

Could add, although I'll note this isn't a command we ever expect the output to be processed. It's intended for human viewing, not scripting. That said it has an html output so...

* index

* idxstats

* flagstat

* stats

* bedcov

Agreed.

* coverage

Already has a -o option.

* split (special case as it may produce multiple files)

I disagree. It doesn't have a single output file, so -o is misleading. It has a different option already for specifying the file pattern.

* calmd

Agreed. This one lacking -o surprises me the most as the output isn't just text (which could be argued in some cases isn't expected to be parsed and is just something expected to be viewed in a terminal). Especially as the -Q option talks about debug to stdout, which would break shell redirection. (I'd hope it actually meant stderr, but haven't checked.) It doesn't even have an additional argument to specify output, so it must be redirected.

* reheader

As above.

* fixmate

* markdup

Agreed.

* targetcut

* phase

Does anyone even use these? Or know what they do? :-) For consistency sake I'm tempted to agree with -o though.

Tools that do not currently output to stdout in the absence of a specified output:

* index (writes to a file instead)

* collate (unless `-O` is specified)

* merge

* faidx (writes index in the same directory as input when `-o` not used, not to stdout)

* fqidx (writes index in the same directory as input when `-o` not used, not to stdout)

* fixmate

* markdup

I disagree with making these output to stdout unless explicitly requested by the user by using "-" or "/dev/stdout" as the filename, which already works, so I don't see any need for change. The output isn't expected to be processed by humans (unlike eg flagstats), so it's inappropriate to have terminal output the default in the absence of specifying an output filename.

I'm aware some tools do act this way (eg samtools depad), but IMO it's a misfeature and it's a continual source of false bug reports due to people using shell redirect instead of -o and mixing up stderr and stdout due to tools such as nohup. We cannot go back and remove BAM-stdout from such commands, but I wouldn't wish to add more.

Tools that do not currently input from stdin in the absence of a specified input (seems like automatically reading from stdin, as view and sort do, is the exception rather than the rule):

* tview

* quickcheck

Couldn't do - ineeds random access.

* index

* collate

* idxstats

* flagstat

* split

* depth

* mpileup

* coverage

* merge

* cat (special case)

* import (special case)

* faidx (special case)

* fqidx (special case)

* calmd

* fixmate

* markdup

* addreplacerg

* reheader

* targetcut

* phase

* depad

* ampliconclip

* samples

I gave up going through all these individually, but just like the output-to-stdout comment above, these should all be capable of handling stdin by specifying the filename as "-".

Arguably some more tools could additionally check if the input is a terminal, as "view" does, and if not then automatically default to reading from stdin, but I'd argue that view is something of a special case as it's also used as a debugging tool / data exploration (ie to be read by human eyes). The user can simply be explicit and specify "-" as the input filename (if it doesn't seek) and I'm OK with that.

Edit: as a summary - I like adding the option to specify "-o filename" to more commands (especially ones that output data expected to be processed by another tool). I dislike automatic stdin/stdout selection though and think the existing unix standard of "-" is sufficient. I also dislike "-i" as it's non-standard and think it was a mistake to add it in most cases.

@jmarshall
Copy link
Member

PR #1674 regularises the index command's I/O options, addressing samtools index *.bam and adding an -o option.

@jkbonfield
Copy link
Contributor

Thanks for bumping this again John. It'd slipped off my radar, but there are a few low hanging fruits that definitely need fixing (eg calmd and fixmate).

@jmarshall
Copy link
Member

I have calmd, fixmate, and markdup in progress as well… 😄

@jkbonfield
Copy link
Contributor

Thank you

@mbhall88
Copy link

mbhall88 commented Jan 9, 2024

I dislike automatic stdin/stdout selection though and think the existing unix standard of "-" is sufficient. I also dislike "-i" as it's non-standard and think it was a mistake to add it in most cases.

@jkbonfield I am interested to know why you dislike the idea of automatic stdin selection? The unix standard is also to assume stdin. I do get confused/annoyed with which samtools commands I need to specify - as stdin and which I don't need to, especially so given I predominantly use samtools in piped commands.

@jkbonfield
Copy link
Contributor

jkbonfield commented Jan 9, 2024

It depends on the tool. For textual based ones, absolutely automatic stdin/stdout makes a lot of sense and as you say is quite standard in Unix. For binary ones, stdin/stdout is still extremely useful for piping or redirection, but it's also sometimes problematic. Even so, it's common with things like gzip. I was perhaps a bit strong to imply I dislike it per-se. If I was designing a suite of tools from the ground up, then yes I'd probably accept stdin/stdout universally, but samtools has a long history and changing even minor things to do with the interface can byte us so I'm naturally cautious about even innocuous looking changes unless given sufficient time to invesitgate it thoroughly. (You'd be amazed at the sort of things people script.)

I do agree it's a bit of a pain when some tools require "-" and some don't. It is possible to detect when stdin/stdout is a terminal device and act accordingly I guess, which samtools already does in a lot of places so we get usage help when not piped. That's not without complications though and we have to add exceptions to the testing framework for MS Windows for example where this doesn't work and instead causes hangs. I've also lost count of the number of bug reports we've had which have been caused by people doing "some_command > file" instead of "some_command -o file" and having an environment that mixes stderr/stdout together! You'd think that would be rare...

So I'm open to persuation from others still.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

6 participants