Hopefully ending any and all discussion about how to handle quotes.... #300

Open

Mike111177 opened this issue Aug 13, 2020 · 1 comment

Mike111177 (Contributor) commented Aug 13, 2020

So the question of whether (and where) to remove quotes seems to have been plaguing this repo with endless issues and PRs.
Sorry to wake the dead here, but see #145, #82, #93, #140, #138, #153, #180, #201, #222, and probably more I didn't find.

Overall, the goal of this issue, discussion, and potential future PR is to make typing a command into your favorite shell or into your online bot interface more consistent, and also to make quote handling easier to test and more predictable.

I have two examples that demonstrate what I think the main issues currently are:

I loaded this test.js into the node REPL before both of these examples to set up the command line and tokenizers.

// Grab arguments as passed by a shell if we receive args
if (process.argv.length > 2) {
  console.log(JSON.stringify(process.argv.slice(2)))
  process.exit()
}

const fs = require('fs')
const { execSync } = require('child_process')
const parse = require('yargs-parser')

// Tokenizers
function extTokenizer (exe, ext, prefix = '') {
  return command => {
    const tkfile = `tokenize.${ext}` // Script that echoes its argv back as JSON
    fs.writeFileSync(tkfile, `${prefix}node test.js ${command}`)
    const result = execSync(`${exe}${tkfile}`)
    return JSON.parse(result)
  }
}
const basicTokenizer = str => str.split(' ') // Simple whitespace split
const yargsTokenizer = require('./tokenize-arg-string')
const cmdTokenizer = extTokenizer('cmd /C ', 'bat', '@echo off\n')
const powershellTokenizer = extTokenizer('powershell -ExecutionPolicy Bypass -File ', 'ps1')
const bashTokenizer = extTokenizer('bash ', 'sh', '/home/mike/.nvm/versions/node/v14.8.0/bin/') // Tested with WSL; the prefix points bash at the right node binary

const argfile = "command.txt"
const commandline = fs.readFileSync(argfile).toString()

I created a gist containing this quick-and-dirty tool so you can try it out for yourself.
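For reference, here is a minimal sketch of a session with the tool (it assumes test.js and tokenize-arg-string are in the working directory, and writes the command line to command.txt first):

$ printf '"positional" -a flag' > command.txt
$ node
> .load test.js
> commandline
'"positional" -a flag'
> yargsTokenizer(commandline)
[ '"positional"', '-a', 'flag' ]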

1. Inconsistent removal of quotes depending on the context of the token

Post-tokenizer, the argument parser seems to pick and choose which arguments ought to be stripped of quotes.

> commandline
`"positional" -a 'flag'`
> tokens = yargsTokenizer(commandline)
[ '"positional"', '-a', "'flag'" ]
> parse(tokens)
{ _: [ '"positional"' ], a: 'flag' }

Notice how the flag argument gets its quotes removed, but the positional argument does not. This inconsistency is confusing.

Now you may ask: then how do I enter a command like domything "to this other thing" from a string? Well, that brings me to problem 2.

2. Tokenizer inconsistency with shell interfaces.

So if tokenizeArgString is supposed to be the middleware that converts your text string into the argument (token) array when the shell you are in hasn't done it for you, then arguably the output of tokenizeArgString should match the results you would get by reading process.argv. Since process.argv is controlled by the various shells, let's see how tokenizeArgString stacks up against the most popular ones (that I could easily test).

> commandline
`'single quote' "double quote" \\"escaped quote\\"`
> yargsTokenizer(commandline)
[ "'single quote'", '"double quote"', '\\"escaped quote\\"' ]
> cmdTokenizer(commandline)
[ "'single", "quote'", 'double quote', '"escaped', 'quote"' ]
> powershellTokenizer(commandline)
[ 'single quote', 'double quote', '\\escaped quote"' ]
> bashTokenizer(commandline)
[ 'single quote', 'double quote', '"escaped', 'quote"' ]

There are a few observations to be made here:

  • yargs nicely handles both single quotes and double quotes
  • cmd does not support single quotes; however, it supports \" escapes and strips the double quotes
  • powershell is a bit weirder: it supports both single and double quotes and strips both, but its behavior is strange when using \ escapes... (it also supports backticks, or doubled quote marks, as escapes)
  • bash, where everything is nice and makes sense, supports single and double quotes as well as \ escapes

Which one of these methods is correct? None of them; they are all different flavors of the same end goal. However, people are used to them and generally expect similar behavior (whether or not they get it), and more importantly, they are testable.
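To make the "testable" claim concrete, here is a minimal sketch reusing the gist's helpers from above (it assumes bash is available, since bashTokenizer shells out to it). Today these assertions fail because yargs keeps the quotes; against a bash-flavored tokenizer they would pass:

const assert = require('assert')
const cases = [`'single quote'`, `"double quote"`, `\\"escaped quote\\"`]
for (const cmd of cases) {
  // Compare the pure-JS tokenizer against what bash actually produces
  assert.deepStrictEqual(yargsTokenizer(cmd), bashTokenizer(cmd))
}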

My Proposal

  1. Do not strip quotes from arguments passed as an array, ever. Assume they have already been tokenized by a shell or tokenizer. I would argue that parse with an argument of any[] should leave every single argument untouched, since it should assume it is receiving the arguments directly from process.argv. Otherwise it is difficult to figure out when and where quotation marks need to be added or removed, so it should assume that the supplied quotes are intended to remain. This is probably a small breaking change; you could probably gate it behind a deprecated option as well, but in my opinion maintaining both modes seems like a bit of a pain.

    In the case where the user has specified in the options that they want quotes removed, the question becomes how many times you strip the quotes; this needs further discussion, as it seems to be a post-processing measure.

  2. Have the tokenizer handle removing quotes in accordance with how users would expect it from their shell. That means supporting three different tokenizers that are flavored like cmd, powershell, and bash. This way, not only can the programmer select which style they want, there is also no longer a "but I want my quotes removed here but not there" issue, because you can respond, "you selected the powershell style and that's how powershell does it", removing any opinion on how the quotes are removed and ending this plague of issues.

    You also get the benefit of being able to unit test directly against the shells themselves. I'm not totally confident that the exact method I wrote above is very CI friendly, but a similar concept could be used: take a bunch of predefined argument strings and check whether the relevant shell produces the same output as the flavored tokenizer (the sketch under the observations above shows the general idea).

    You could lay out the top-level API like the following:

//If you know you are tokenizing, don't accept tokens
export type Tokenizer = (arg: string) => string[];
export type CommandLineParser = (arg: string, opts?: Partial<Options>) => Arguments;

//Same stuff as before for backwards compatibility
export interface Parser {
  (args: ArgsInput, opts?: Partial<Options>): Arguments; //Will now by default use tokenize.default to tokenize
  detailed(args: ArgsInput, opts?: Partial<Options>): DetailedArguments;
  camelCase(str: string): string;
  decamelize(str: string, joinString?: string): string;
  //new stuff
  //Included tokenizers; calling require('yargs-parser').tokenize("") should just use the default tokenizer
  tokenize: Tokenizer & {
    default: Tokenizer; //freeze the current tokenizer for backward compat
    cmd: Tokenizer;
    powershell: Tokenizer;
    bash: Tokenizer;
  };
  //Include tokenizer flavors as full parsers
  cmd: CommandLineParser;
  powershell: CommandLineParser;
  bash: CommandLineParser;
  //Allow programmers to feed it their own tokenizer? So less complaining about format
  parser: (tokenizer: Tokenizer) => CommandLineParser;
}
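A hypothetical usage sketch of that API (every name below comes from the layout above; none of it exists yet):

const parser = require('yargs-parser')

// Pick a flavor explicitly; quotes are stripped the way that shell would strip them
parser.bash(`--msg 'single quote' "double quote"`)
// => { _: [ 'double quote' ], msg: 'single quote' }

// Proposal 1: array input is assumed to be pre-tokenized (e.g. process.argv),
// so the quotes survive untouched
parser(['"positional"', '-a', "'flag'"])
// => { _: [ '"positional"' ], a: "'flag'" }

// Bring your own tokenizer if none of the flavors fit
const naiveParser = parser.parser(str => str.split(' '))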

I understand this goes somewhat against previous efforts (see #153), but ultimately the tokenizer is meant to take the place of a shell that would otherwise do the work for you, so it doesn't make sense not to emulate the function of the shell in this regard.

This could also open up an opportunity to separate the tokenizers into their own npm package. It would be neat to have a package designed purely to emulate how shells tokenize to argv. yargs-tokenizer, anybody?

Let me know what y'all's thoughts are on this; I'd be willing to do some of the groundwork. I also believe that most (though not all) of this can be introduced without any breaking changes...

bcoe (Member) commented Sep 20, 2020

Thanks for this great research you've done up front 👍

> Have the tokenizer handle removing quotes in accordance with how users would expect it from their shell. That means supporting three different tokenizers that are flavored like cmd, powershell, and bash.

Over the years with yargs we've tried really hard to find compromises that meet the widest variety of user expectations and, ideally, are identical in a wide variety of environments.

Sometimes we introduce configuration options when folks have irreconcilable goals ... and it seems like there are enough people on both sides of the fence that it's worth doing. Ideally these configuration options are added rarely.

I would rather start by looking at your findings and seeing if there are a few additional bugs we can fix for some shell environments without breaking the existing behavior. Only if it's absolutely necessary, e.g., a situation where we need to do things one way on powershell vs. bash and it's impossible to find a non-breaking middle ground, would I suggest adding additional configuration options.

I do not like the idea of decomposing the tokenizers into their own packages. Ideally we'd have one tokenizer that supports as wide a variety of process.argv formats as possible, and we'd maybe introduce some additional config settings for the tokenizers.
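A purely illustrative shape for that direction (nothing here is an agreed design, and the option names are made up):

// One tokenizer, with shell quirks exposed as opt-in settings
export interface TokenizeOptions {
  singleQuotes?: boolean;     // treat '...' as grouping (bash, powershell)
  backslashEscapes?: boolean; // honor \" (bash, cmd)
  stripQuotes?: boolean;      // remove the grouping quotes from the tokens
}
export declare function tokenize(arg: string, opts?: TokenizeOptions): string[];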

Note of clarification: along with moving towards supporting Deno/TypeScript/ESM, I've been putting work into consolidating yargs' footprint with regard to dependencies. Several popular alternatives like commander have zero dependencies ... I'm a proponent of decomposing libraries, but I have definitely gotten some flak for yargs' footprint. tl;dr: I've been trying to be really selective about what additional dependencies we add or break out.
