Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Should URLSearchParams be used in node.js 8 instead of querystring? #252

Open
jsilveira opened this issue Jul 9, 2017 · 23 comments
Open
Labels

Comments

@jsilveira
Copy link

jsilveira commented Jul 9, 2017

Node >= 8.0 introduced an issue in querystring.parse (see here nodejs/node#13773) that will affect all users of body-parser. It occurs when there is a trailing whitespace in a query string parameter:

Before require('querystring').parse('a=%20+&') was { a: ' ' }

In node 8.x require('querystring').parse('a=%20+&') is { a: '%20 ' }

This applies to any parameter value ending with a space.

Even though it is not body-parser's fault, this issue can cause non trivial problems that are very hard to track down. By updating any app that uses body-parser on older node versions to node 8.x, any url-encoded form input with a trailing space will start to have the encoding of special characters broken. It took us a long time to discover the issue and to find out where the problem was. In the issue mentioned above, it is suggested that the new URLSearchParams class is a faster way of parsing query strings and more similar to native browser implementations. Does it make sense to consider it?

@dougwilson
Copy link
Contributor

So yes, it sucks that Node.js 8 introduced a bug, though obviously they should own up to it and fix the bug.

Just reading the docs for the interface seems to have a few points that make it not easy to use. For example:

A leading '?', if present, is ignored.

But a leading ? has no meaning in the POST payload and should not be stripped, and if that behavior cannot be turned off, it seems to makes that unsuitable for parsing urlencoded bodes. The docs also mention multiple times it is only for URL parsing, which also seems to make it hard to use for body parsing.

Can you summarize all the differences between the new URLSearchParams class and the query string class? Understanding exactly hoe they are different would be the place for us to start.

@jsilveira
Copy link
Author

You make very good points @dougwilson, maybe URLSearchParams is not the way to go. I also don't know if there is an specification link between parsing "application/x-www-form-urlencoded" and url query strings (a quick look at https://url.spec.whatwg.org/#application/x-www-form-urlencoded did not clear that for me, although it was fun to read the note "The application/x-www-form-urlencoded format is in many ways an aberrant monstrosity" 🤣 )

I also thought it was important to open this issue here to help future people stumbling with this bug, as many are likely to look here for body-parsing issues.

@dougwilson
Copy link
Contributor

Yea, I was mostly wondering what the implementation differences between those two modules in Node.js core are. I assume that URLSearchParams is not usable by this module, otherwise I'm sure if it was actually a replacement, then "querystring" would have at least been deprecated, but it is not and as far as I can tell they do different things, of which the URLSearchParams is only a replacement for URL parsing, which is not the use-case this module is using querystring for. Some PRs and such in the Node.js repo makes it look like the querystring module was more closely aligned to urlencoded posts (like what body-parser uses it for) and not URLs so instead of making it more complicated, the fixes were rolled into the new WHATWG URL parser and the queryparser module was left as it is to continue to support the other use cases (like bodies).

At least, that is what I gathered. Let me know if you know otherwise, but it doesn't look like this is something we can just swap in this module, but instead need to await a fix in Node.js core.

@dougwilson
Copy link
Contributor

And in case it wasn't clear: the issue you highlighted in the queryparser is absolutely a bug; a space at the end of a value should not prevent decoding the value; it looks like it may be some kind of bug around string replacement, maybe.

@TimothyGu
Copy link

But a leading ? has no meaning in the POST payload and should not be stripped, and if that behavior cannot be turned off

Always prepending a ? before the string is parsed would fix this issue.

I'm sure if it was actually a replacement, then "querystring" would have at least been deprecated

URLSearchParams is in fact intended to be a replacement for all application/x-www-form-urlencoded content. It is not deprecated because:

  • querystring has been in Node.js since pretty much the very beginning and it would be basically impossible to deprecate it
  • querystring supports customizing the separator (&) and the name/value delimiter (=), while URLSearchParams does not (and probably will never) support this feature.

Based on my understanding of what body-parser does, I would urge using URLSearchParams when possible (v8.0.0+; v7.x's URLSearchParams is fairly buggy). But yes, I would agree that the issue in querystring is a bug, an important one too. I'll take a look at it later today.

@dougwilson
Copy link
Contributor

Thanks for the information, @TimothyGu ! Can you explain what the differences are between the classes? Is URLSearchParams just a drop-in replacement for the querystring module or are there other changes we'll need to make in order to use it correctly. I think that simply always prepending a ? to the string on our side is a smell, and completely goes against your assertion that it is meant to be used for application/x-www-form-urlencoded bodies, as those should not have any leading ? stripped. If you intend for that module to be used for urlencoded bodies, can you at least add a mode switch to the module to allow users to indicate they are trying to parse a urlencoded body, rather than people are supposed to know they need to prepend a ? to the string before parsing? I don't see that noted anywhere in the URLSearchParams documentation. I also don't see any support for maxKeys in URLSearchParams, which we use in this module.

@dougwilson
Copy link
Contributor

And for anyone coming across this issue, if you would like to take a stab at whatever it would take to use URLSearchParams when available, and querystring otherwise, please feel free to throw together a PR (even if it's a WIP) or use this issue as a discussion board / questions if there are any for how to go about doing this :) !

@dougwilson dougwilson changed the title Dangerous Node 8.x.x bug in querystring.parse, should URLSearchParams be used? Should URLSearchParams be used in node.js 8 instead of querystring? Jul 10, 2017
@dougwilson
Copy link
Contributor

And for those looking (trying to put together urlencoded issues here), it would be great if we have to move parsers (like to URLSearchParams from querystring), it should either support the special _charset_ field or at least allow the charset for which the bytes should be decoded as so we can plug into it. There have been a few requests to support _charset_ (which comes from http://www.w3.org/TR/html5/forms.html#url-encoded-form-data) and so far it's just been ignored since querystring doesn't do it, but if we end up making a lot of changes to get off querystring, it would be a great time to move towards the custom charset support (or, at least the path to add _charset_ is clear).

@TimothyGu
Copy link

Is URLSearchParams just a drop-in replacement for the querystring module or are there other changes we'll need to make in order to use it correctly.

No, URLSearchParams has a different API. It has the exact same interface as the URLSearchParams exposed in browsers.

  • params[key] = value is params.set(key, value)
  • Array.isArray(params[key]) ? params[key].push(value) : params[key] = [params[key], value] is params.append(key, value)
  • delete params[key] is params.delete(key)
  • key in params is params.has(key)
  • params[key] is params.get(key) or params.getAll(key)
  • Object.keys(params) is Array.from(params.keys()) (with the caveat that the latter would return duplicate keys)
  • querystring.stringify(params) is params.toString()

If you intend for that module to be used for urlencoded bodies, can you at least add a mode switch to the module to allow users to indicate they are trying to parse a urlencoded body, rather than people are supposed to know they need to prepend a ? to the string before parsing? I also don't see any support for maxKeys in URLSearchParams, which we use in this module.

It would not be possible for us to add any options not already in the spec to the URLSearchParams constructor. But we might make a custom API like url.parseSearchParams() that takes options, much like the custom url.format() we have for URL. Even if we do add such an API, in general I would still encourage the use of URLSearchParams class if possible as it has a more defined API in the entire JS world.


it would be a great time to move towards the custom charset support (or, at least the path to add _charset_ is clear)

I assume you are talking about decoding/parsing with custom charsets here.

Currently, the Node.js core only has Latin-1/UTF-8/UTF-16 decoders built-in. Based on this and WHATWG's stance

A legacy server-oriented implementation might have to support encodings other than UTF-8 as well as have special logic for tuples of which the name is _charset [sic]. Such logic is not described here as only UTF-8 is conforming.

the URLSearchParams constructor will probably never support _charset_. But when nodejs/node#13644 lands, and if we do decide to add the aforementioned custom parser API, supporting _charset_ should be easy enough on our side to add an option for.

@dougwilson
Copy link
Contributor

Ok, so for those following along looking to make a PR, it sounds from @TimothyGu above that you cannot just drop in the URLSearchParam class depending on Node.js version without also adding in routes to convert the return value back into a plain object such that our exposed API from this module can remain the same (and, more importantly, the API of req.body remains the same regardless of the extended parsing choice).

This module already includes iconv-lite for all the charset decoding of all the other content-types, so it's not clear how that would plug into the URLSearchParam, while it can be plugged into the querystring module, so need to figure that out. I looked through the linked PR but it's not 100% clear on how one would plug iconv-lite into URLSearchParam after that PR lands, so if anyone has some code examples, that would be awesome

@TimothyGu
Copy link

you cannot just drop in the URLSearchParam class depending on Node.js version without also adding in routes to convert the return value back into a plain object such that our exposed API from this module can remain the same

Correct.

it's not 100% clear on how one would plug iconv-lite into URLSearchParam after that PR lands

The point of that Node.js PR is that you wouldn't need to plug iconv-lite into search param parsing: we would have our ICU-based implementation that will recognize basically all the encodings a browser recognizes. After the work is done, one should be able to just specify url.parseSearchParams(string, { useCharset: true }) to get a correctly decoded URLSearchParams object.

@dougwilson
Copy link
Contributor

Re the charsets, gotcha. The idea, though, is that the charset support is at least consistent across all our body parsers. I assume that the ICU could be plugged into the rest of the parsers, at least? Also need to determine what it would look like to support the versions of Node.js without that ICU encoding and the ones with that, and also not being too familiar with ICU, I assume that means that with the ICU thing everyone's Node.js will have every encoding supported so when body-parser stops using iconv-lite users get all the encodings supported out-of-the-box with this module just like they do today?

@TimothyGu
Copy link

TimothyGu commented Jul 10, 2017

ICU is a piece of software written in C++, that Node.js and V8 are already using to support many Unicode- and internationalization-related things. The PR I linked will make use of ICU to provide a TextDecoder class -- the same TextDecoder class in browsers.

TextDecoder supports more encoding aliases than iconv-lite -- with the additional aliases being ones seen in the wild and supported by browsers. On the other hand, iconv-lite supports many encodings TextDecoder doesn't, but these additional encodings are probably not supported by browsers anyway. If I were you, since the transition to _charset_-aware urlencoded bodies is a semver-major change anyway, I'd just drop iconv-lite at that time.

I assume that the ICU could be plugged into the rest of the parsers, at least?

Yes.

Also need to determine what it would look like to support the versions of Node.js without that ICU encoding and the ones with that

There is a JS-only TextDecoder polyfill that's pretty widely used on npm: https://www.npmjs.com/package/text-encoding.

and also not being too familiar with ICU, I assume that means that with the ICU thing everyone's Node.js will have every encoding supported so when body-parser stops using iconv-lite users get all the encodings supported out-of-the-box with this module just like they do today?

Technically, no, as iconv-lite supports some encodings TextDecoder doesn't. But as I explained before, I expect very few people to use the encodings only iconv-lite supports.


It'll still be a while before the TextDecoder PR lands though, and another while before it can be used without a runtime CLI flag once it becomes Stable. I think converting to URLSearchParams should be done first; _charset_ support would be an icing on the cake for the future. For whoever wants to PR for this, one might find this snippet to be helpful:

const { URLSearchParams } = require('url');
// sanity check to make sure URLSearchParams is spec-compliant
const hasURLSearchParams = Boolean(URLSearchParams) && new URLSearchParams('a&b&a').toString() === 'a&b&a';

// Preferred alternative to querystring.parse() if hasURLSearchParams
function parse(str) {
  // Before an API that doesn't strip `?` is available, always add a `?` when constructing.
  // TODO: change when an API that limits maximum pairs is available.
  const params = new URLSearchParams(`?${str}`);
  const obj = Object.create(null);
  params.forEach((val, key) => {
    if (obj[key] === undefined) {
      obj[key] = val;
    } else if (Array.isArray(obj[key])) {
      obj[key].push(val);
    } else {
      obj[key] = [obj[key], val];
    }
  });
  return obj;
}

@dougwilson
Copy link
Contributor

Tanks, @TimothyGu very helpful! Re one of my questions above, it was about the following quote in that PR:

By default, with small-icu, only the UTF-8, UTF-16le and UTF-16be decoders are supported. With full-icu enabled, every encoding other than iso-8859-16 is supported.

I assume that meant that by default, most people will only have the "small-icu" installed, is that right? So then this module will have to use that polyfill you pointed to?

@TimothyGu
Copy link

TimothyGu commented Jul 10, 2017

I assume that meant that by default, most people will only have the "small-icu" installed, is that right?

Yes.

So then this module will have to use that polyfill you pointed to?

Hmm, I see what the problem is now. The parseSearchParams function will probably have to expose a decode hook just because most people will only have the Unicode encodings embedded in Node.js. And if that is done, this module can continue using iconv-lite.

@TimothyGu
Copy link

TimothyGu commented Jul 10, 2017

In the mean time I've finished up nodejs/node#14151, which when added to a v8.x release should fix the original bug.

EDIT: the pull request just landed in Node.js. It will be backported to v8.x in the next release.

@stevenvachon
Copy link

stevenvachon commented Jul 17, 2017

URLSearchParams can be used for both Node 8.x and 6.x users via universal-url.

@TimothyGu
Copy link

@stevenvachon whatwg-url's URLSearchParams is pretty slow.

@stevenvachon
Copy link

@TimothyGu I thought the Node.js implementation was all JavaScript and it's not "slow", correct?

@TimothyGu
Copy link

@stevenvachon The Node.js one is JS and not slow, because it was adapted from the optimized querystring implementation and made faster than querystring by getting rid of unneeded cruft. whatwg-url's goal is to be spec-compliant, not necessarily fast. whatwg-url works on a byte sequence (i.e. Buffer) level, because that's what the spec calls for.

Source: I wrote both parsers.

@stevenvachon
Copy link

I know that you wrote both and that's why I'm asking you 😛

@zemlanin
Copy link

blows the dust off

I was trying to match the behavior of bodyParser.urlencoded with node's url.searchParams and stumbled upon this issue…

Is it time to revisit this discussion?

@rhodgkins
Copy link

blows the dust off

I was trying to match the behavior of bodyParser.urlencoded with node's url.searchParams and stumbled upon this issue…

Is it time to revisit this discussion?

Just to bump this again now node 16 is out - querystring has now been marked as legacy.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

6 participants