standardize I/O nomenclature across all tools? #1569
Comments
I strongly agree with this idea. In fact, I've been tempted to open an issue for similar considerations. But unifying input and output parameters is a must for ease of use and to lessen the (potentially catastrophic) errors: let's say that we want to index all BAM files in a folder. Common sense says you can try I would advise having all the output files specified via parameter; no bare output files in the command line. And for homogeneity, the input files should follow the same policy. |
Samtools follows the usual Unix conventions and in particular the Unix filter conventions. Like the overwhelming majority of Unix tools, non-option arguments are input filenames, These are the conventions because they work well. In contrast, Unfortunately, for historical reasons, a few samtools commands follow a slightly different convention. This is what you see with Where it is practical, these outlier commands can be improved to also follow the desired usual conventions. Thus PR #1434 added the conventional IMHO there is little point in making a generalised plea like “samtools should standardize its I/O nomenclature”. Well, of course it should! And substantially it does. What is useful, instead, is to identify individual specific pain points. Then they can each be improved, as was done for
|
Makes sense to have the inputs specified as non-option arguments.
I can check all tools and make a specific list, but to me, the pain points are the tools that don't stick to the usual Unix conventions you just mentioned. Specifically, the ones that write to a file not specified by an option argument, because these are the ones that can cause data loss. I know that it is not an easy and quick step, but you could try to move all the tools to follow these conventions. For a couple of versions, print a big warning about command line changes. Then, remove the old ones. The old pipelines that use the old syntax will keep working, as they will keep using old versions. As soon as someone decides to update the version because they need some new functionality, they can do the command line adaptation along with the modification that triggered the update. Of course, that's my point of view and you are free to treat these comments as you see fit :-) But moving in the direction you just pointed to as standard and removing the non-standard behavior will make samtools a lot less error-prone. |
I sloppily and regrettably used shorthand in my original post; when I wrote |
And with respect, several of what I assume are commonly used tools, e.g. "-i FILE to specify input files would be an abomination" - not entirely sure I agree with this. Or at least lots of software follows this convention, e.g. Picard/GATK/fgbio/many aligners. Not saying this is preferable either - I can see pros/cons both ways - and so ultimately just picking a format and standardizing it across all tools would be an enormous win. |
Not having to check the manual is good, and it can be solved by adding Rather than a broad complaint, can we please get specifics listed here about sub-commands that need fixing? As for /dev/stdin and /dev/stdout, on Linux these are supported already as the OS handles this, however note not all commands can cope with data coming from pipes (which these are). Operations such as seeking or checking file length may fail (depending on the nature of the stdin provided). There often isn't much we can do about that, but it's still worth reporting in cases where you believe the tool shouldn't need to be seeking. |
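The pipe limitation described above can be demonstrated outside samtools. The following is a minimal Python sketch, not samtools code: the helper name `is_seekable` is hypothetical, and it only shows the general operating-system behaviour (on POSIX systems) that a regular file accepts a seek request while a pipe rejects it.

```python
import os
import tempfile


def is_seekable(fd: int) -> bool:
    """Return True if the OS lets us query the offset of this descriptor."""
    try:
        os.lseek(fd, 0, os.SEEK_CUR)  # no-op seek: only asks for the current offset
        return True
    except OSError:
        # Pipes, FIFOs and sockets reject lseek (ESPIPE on POSIX systems).
        return False


if __name__ == "__main__":
    # A regular file supports seeking.
    with tempfile.TemporaryFile() as regular:
        print("regular file seekable:", is_seekable(regular.fileno()))

    # A pipe does not; this is what a tool sees when its input is piped in.
    read_end, write_end = os.pipe()
    try:
        print("pipe seekable:", is_seekable(read_end))
    finally:
        os.close(read_end)
        os.close(write_end)
```

A tool that must jump around in its input, or ask for the file length up front, therefore has no way to honour a piped /dev/stdin, whatever its command-line syntax looks like.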
To be clear, my original post wasn't about whether /dev/stdin and /dev/stdout are supported - I'm aware that they already are. My request only pertained to how they (or a file name) are pointed to by each tool. And I wasn't necessarily advocating for removing features where it can be avoided; if the old syntax can be maintained to limit developer effort and user pain, while adding additional syntax options to create a uniform standard across tools, then so much the better. It seems like the preference would be to adopt a universal format of I could also see a case for (almost) always outputting to stdout in the absence of the In decreasing priority order (I would probably advocate for focusing on Category 1, followed by perhaps adding automatic stdout functionality for Tools that do not currently support
Tools that do not currently output to stdout in the absence of a specified output:
Tools that do not currently input from stdin in the absence of a specified input (seems like automatically reading from stdin, as
|
A side note: In my view, this |
Oops, sorry @eboyden, I came to your issue and added my opinions, including feature removal, that were not in your original text. My fault :-/ I subscribe to your last comment. I also agree with @lh3's message (that syntax is not good, but it exists). I have a couple of examples of debatable situations where these updates can be discussed.
This is just to say that adding output options to all the tools that miss them will be a nice gain for all the users, but historical reasons will still limit the improvements. There could be a method to enable/disable the command line formats: an environment variable (or configuration file) like SAMTOOLS_CLI_VERSION that, if set to a number, only exposes that CLI format. Old pipelines won't suffer because without this variable the old interface is available, and new pipelines and users can set this variable to "2" to be able to use only the new version and avoid the dangers of overwriting anything in case of human error. I know I'm giving new ideas of improvements that should be in a new issue if you wanted to implement them. Just ignore these ideas, as I'm giving them for context and to broaden your minds about the possibilities. I don't want to introduce more noise to the original issue. |
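To make the environment-variable idea concrete, here is a minimal Python sketch of how such a gate might look. Everything in it is hypothetical: SAMTOOLS_CLI_VERSION is only the name suggested in this comment and is not a real samtools setting, the function names are invented, and the numbering (1 for the legacy syntax, 2 for the new one) is an assumption.

```python
import os
from typing import Mapping, Optional

# Name taken from the suggestion above; NOT an existing samtools setting.
ENV_VAR = "SAMTOOLS_CLI_VERSION"


def requested_cli_version(environ: Mapping[str, str] = os.environ) -> Optional[int]:
    """Return the pinned CLI version, or None when the variable is unset or empty."""
    raw = environ.get(ENV_VAR)
    if raw is None or raw.strip() == "":
        return None
    try:
        return int(raw)
    except ValueError:
        raise SystemExit(f"{ENV_VAR} must be an integer, got {raw!r}")


def syntax_allowed(syntax_version: int,
                   environ: Mapping[str, str] = os.environ) -> bool:
    """Unset variable: every syntax stays available. Set: only that version.

    Assumption for this sketch: 1 is the legacy syntax, 2 is the new one.
    """
    pinned = requested_cli_version(environ)
    return pinned is None or pinned == syntax_version


if __name__ == "__main__":
    print("legacy syntax allowed:", syntax_allowed(1))
    print("new syntax allowed:", syntax_allowed(2))
```

With the variable unset, old pipelines keep working unchanged; a user who sets it to "2" would have legacy invocations rejected before any file is opened, which is the protection against accidental overwrites that the comment is after.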
No worries @Poshi. And I too agree with @lh3, to the point that I'm not sure it ever would have occurred to me to try using And I completely agree that if there are good reasons for not modifying certain tools, there isn't a point in forcing it for comprehensiveness' sake. |
So being more practical about what can be done to improve uniformity while maintaining backwards compatibility.
Could add, although I'll note this isn't a command whose output we ever expect to be processed. It's intended for human viewing, not scripting. That said, it has an HTML output, so...
Agreed.
Already has a
I disagree. It doesn't have a single output file, so
Agreed. This one lacking
As above.
Agreed.
Does anyone even use these? Or know what they do? :-) For consistency's sake I'm tempted to agree with
I disagree with making these output to stdout unless explicitly requested by the user by using "-" or "/dev/stdout" as the filename, which already works, so I don't see any need for change. The output isn't expected to be processed by humans (unlike eg flagstats), so it's inappropriate to have terminal output the default in the absence of specifying an output filename. I'm aware some tools do act this way (eg
Couldn't do - needs random access.
I gave up going through all these individually, but just like the output-to-stdout comment above, these should all be capable of handling stdin by specifying the filename as "-". Arguably some more tools could additionally check if the input is a terminal, as "view" does, and if not then automatically default to reading from stdin, but I'd argue that view is something of a special case as it's also used as a debugging tool / data exploration (ie to be read by human eyes). The user can simply be explicit and specify "-" as the input filename (if it doesn't seek) and I'm OK with that. Edit: as a summary - I like adding the option to specify "-o filename" to more commands (especially ones that output data expected to be processed by another tool). I dislike automatic stdin/stdout selection though and think the existing unix standard of "-" is sufficient. I also dislike "-i" as it's non-standard and think it was a mistake to add it in most cases. |
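The two behaviours contrasted in this comment can be sketched in a few lines of Python. This is an illustration under stated assumptions, not samtools's implementation: the function names are hypothetical, and `resolve_input` only mimics the terminal check that the comment attributes to "view".

```python
import sys
from typing import BinaryIO, Optional


def open_input(name: str) -> BinaryIO:
    """Open a named file for reading, treating "-" as standard input."""
    if name == "-":
        return sys.stdin.buffer
    return open(name, "rb")


def open_output(name: str) -> BinaryIO:
    """Open a named file for writing, treating "-" as standard output."""
    if name == "-":
        return sys.stdout.buffer
    return open(name, "wb")


def resolve_input(name: Optional[str], stdin_is_terminal: bool) -> Optional[str]:
    """Decide what to read when the filename may be omitted.

    Returns the name to open, or None to signal "print usage instead".
    """
    if name is not None:
        return name   # an explicit name always wins, including "-"
    if stdin_is_terminal:
        return None   # nothing is piped in: show help rather than hang
    return "-"        # view-style fallback: data is being piped to us
```

A caller would pass `sys.stdin.isatty()` as the second argument. Dropping `resolve_input` and always requiring a name gives the explicit-only policy preferred in this comment, where "-" is the single way to ask for stdin or stdout.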
PR #1674 regularises the |
Thanks for bumping this again John. It'd slipped off my radar, but there are a few low hanging fruits that definitely need fixing (eg calmd and fixmate). |
I have calmd, fixmate, and markdup in progress as well… 😄 |
Thank you |
@jkbonfield I am interested to know why you dislike the idea of automatic stdin selection? The unix standard is also to assume stdin. I do get confused/annoyed with which samtools commands I need to specify |
It depends on the tool. For textual based ones, absolutely automatic stdin/stdout makes a lot of sense and as you say is quite standard in Unix. For binary ones, stdin/stdout is still extremely useful for piping or redirection, but it's also sometimes problematic. Even so, it's common with things like gzip. I was perhaps a bit strong to imply I dislike it per se. If I was designing a suite of tools from the ground up, then yes I'd probably accept stdin/stdout universally, but samtools has a long history and changing even minor things to do with the interface can byte us so I'm naturally cautious about even innocuous looking changes unless given sufficient time to investigate it thoroughly. (You'd be amazed at the sort of things people script.) I do agree it's a bit of a pain when some tools require "-" and some don't. It is possible to detect when stdin/stdout is a terminal device and act accordingly I guess, which samtools already does in a lot of places so we get usage help when not piped. That's not without complications though and we have to add exceptions to the testing framework for MS Windows for example where this doesn't work and instead causes hangs. I've also lost count of the number of bug reports we've had which have been caused by people doing "some_command > file" instead of "some_command -o file" and having an environment that mixes stderr/stdout together! You'd think that would be rare... So I'm open to persuasion from others still. |
Although most tools can read from either a file or stdin and write to either a file or stdout, the manner in which to specify that seems to vary from tool to tool. It's not consistent when stdin/stdout is assumed unless a file is specified (as with sort), vs. when a file or stdin/out must be specified but an option is not required (as with fixmate), vs. when an option is required for output but not input (as with addreplacerg).
Is there any possibility of standardizing the I/O across all tools to lessen the confusion? With over 30 tools it's difficult to remember how they all behave. Personally I would vote to always require -i and -o for specifying a file name (but perhaps assuming stdin/out in the absence of either option, though -i stdin and -o stdout would also be supported).
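The interface proposed here can be sketched with Python's argparse. This is only an illustration of the request, not of any existing samtools command: the program name "tool" is a placeholder, and using the literal strings "stdin" and "stdout" as defaults is an assumption drawn from the wording of the proposal.

```python
import argparse
from typing import List, Optional


def build_parser() -> argparse.ArgumentParser:
    """Build a parser where -i and -o name files and default to the standard streams."""
    parser = argparse.ArgumentParser(prog="tool")  # "tool" stands in for any sub-command
    parser.add_argument("-i", dest="input", default="stdin", metavar="FILE",
                        help="input file name (default: read standard input)")
    parser.add_argument("-o", dest="output", default="stdout", metavar="FILE",
                        help="output file name (default: write standard output)")
    return parser


def parse(argv: Optional[List[str]] = None) -> argparse.Namespace:
    """Parse argv (or sys.argv[1:] when argv is None)."""
    return build_parser().parse_args(argv)


if __name__ == "__main__":
    args = parse()
    print("input:", args.input)
    print("output:", args.output)
```

One design consequence of this sketch is that a file literally named "stdin" or "stdout" can no longer be addressed by its bare name, so a real implementation would need a rule for that case.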