Bring Dart's String support into the modern age #28404

Closed
4 tasks
Hixie opened this issue Jan 15, 2017 · 32 comments
Labels
area-core-library SDK core library issues (core, async, ...); use area-vm or area-web for platform specific libraries. area-language Dart language related items (some items might be better tracked at github.com/dart-lang/language). core-m library-core type-enhancement A request for a change that isn't a bug

Comments

@Hixie
Contributor

Hixie commented Jan 15, 2017

Admin comment: For current support for working on strings containing Unicode (extended) grapheme clusters, please see https://pub.dev/packages/characters and https://medium.com/dartlang/dart-2-7-a3710ec54e97.


For example, consider this discussion:
http://manishearth.github.io/blog/2017/01/14/stop-ascribing-meaning-to-unicode-code-points/

We should be at least as good as Swift and Perl 6 when it comes to dealing with strings.

Things we should do:

  • Stop exposing UTF-16 words as the basic unit of a String. That's just wrong in every way.
  • Make walking by grapheme cluster the simplest way of iterating over a string. (Having easy support for walking by runes (Unicode codepoints) is ok too.)
  • Remove the ability to index into a string. It's never the correct thing to do in the face of grapheme clusters, and this affects all languages, even English, given the way Emoji are built using zero-width joiners these days.
  • Make it trivial to go from Strings to byte arrays for UTF-8 (and optionally other encodings like UTF-16).
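As a rough illustration of the first and last points, here is a minimal sketch using only today's dart:core and dart:convert. The helper name unitCounts is made up for this example and is not part of any API.

```dart
import 'dart:convert';

/// Counts the units that today's String API exposes for [s].
/// Hypothetical helper, for illustration only.
Map<String, int> unitCounts(String s) => {
      'utf16CodeUnits': s.length,
      'runes': s.runes.length,
      'utf8Bytes': utf8.encode(s).length,
    };

void main() {
  // U+1F1E9 U+1F1F0: two regional-indicator code points, one visible flag.
  const flag = '\u{1F1E9}\u{1F1F0}';
  print(unitCounts(flag)); // {utf16CodeUnits: 4, runes: 2, utf8Bytes: 8}

  // Indexing returns half of a surrogate pair, not a character.
  print(flag[0].codeUnitAt(0).toRadixString(16)); // d83c
}
```

None of the three counts matches the single character a user sees, which is the gap the list above is asking to close.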

cc @sethladd, since you were asking what improvements we can make to Dart to bring it into the modern age. These changes would have massively more meaningful impact than making semicolons optional or removing other punctuation.

@lrhn lrhn added area-core-library SDK core library issues (core, async, ...); use area-vm or area-web for platform specific libraries. library-core type-enhancement A request for a change that isn't a bug area-language Dart language related items (some items might be better tracked at github.com/dart-lang/language). labels Jan 15, 2017
@lrhn
Member

lrhn commented Jan 15, 2017

Most of these are perfectly good suggestions for arbitrary Unicode strings, but they make it much harder to work with text that you know is in a more limited format (typically ASCII, maybe parsing JSON with a known format, or Dart identifiers, or something else with a simple format).

As always, it's a trade-off between making simple things easy and complex things possible. Going all Unicode-grapheme-cluster only will make simple things harder, but will also make some complex things easier - and force users to be aware that there is an issue.

@Hixie
Contributor Author

Hixie commented Jan 16, 2017

That's why I added the fourth check box. I agree that we need to make it easy to deal with byte strings, including building byte strings from string literals that only use ASCII-compatible characters (ord<128), maybe even implicitly. It's everything between straight ASCII and full Unicode that's the problem. :-)

> maybe parsing JSON with a known format

That's a security vulnerability waiting to happen, usually.

@mraleph
Member

mraleph commented Jan 16, 2017

> We should be at least as good as Swift

Being as good as Swift, I guess, also means publishing an article "Why Dart String API is so hard?" on dartlang.org similar to "Why is Swift's String API So Hard?".

@Hixie
Contributor Author

Hixie commented Jan 16, 2017

Yes. Or we can do even better, as suggested by my fourth checkbox above. We can have ways to initialize constant byte arrays from string literals and let these be accessed by index, for example. The path API could exist and work with strings as well as byte arrays, and so forth.

I'm happy to try and design a comprehensive API here if this is something that would help.

@alan-knight
Contributor

These sound nice, but there are an awful lot of minefields. Do we want

   'é' == 'é'

and if so, which normalization scheme do we use?

@Hixie
Contributor Author

Hixie commented Jan 17, 2017

We should compare grapheme clusters using NFD or NFC (it doesn't matter, they have the same result). The only other plausible option would be NFKC/NFKD but those aren't really reasonable for string comparisons. So I don't really see that as a minefield. The current behaviour (without any normalisation) is the minefield.
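To make the current behaviour concrete, here is a small sketch of how today's un-normalized comparison treats the two spellings of "é". The constant names are illustrative only.

```dart
// The same visible character, spelled two ways.
const precomposed = '\u00E9'; // LATIN SMALL LETTER E WITH ACUTE
const decomposed = 'e\u0301'; // 'e' followed by COMBINING ACUTE ACCENT

void main() {
  // String equality compares UTF-16 code units, with no normalization.
  print(precomposed == decomposed); // false
  print(precomposed.length); // 1
  print(decomposed.length); // 2
  // contains() works on code units too, so the base letter is only found
  // in the decomposed spelling.
  print(precomposed.contains('e')); // false
  print(decomposed.contains('e')); // true
}
```

Under NFC or NFD comparison both spellings would be equal, which is the behaviour argued for above.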

@rmacnak-google
Contributor

Removing the ability to index into strings probably makes it possible to implement this efficiently on JS too.

@Hixie
Contributor Author

Hixie commented Mar 7, 2017

Language proposal (this is a rough draft I just hammered out and I'm sure others will have opinions on making this better):

There are two kinds of string literals, '...' and '''...'''. They are identical except that the former cannot contain a newline; the distinction is purely to aid with finding syntax errors. Both kinds can use either single quotes like that, or double quotes, as in "..." and """...""". Strings support the same interpolation and escapes as we are used to in Dart 1.0.

String literals can be prefixed by characters that control the syntax and kind of string object generated from the literal. The r prefix, short for raw, disables all escaping and interpolation logic within the string. For example, r"""\n""" is a two-character string where the first one is a backslash and the second one is a lowercase N.

String literals as described so far create a String object, described below.

The b prefix indicates that the string literal should instead create a Uint8List object. The literal characters in such a string must be in the range U+0000 to U+007F, or in the form of \xNN escapes (where NN is in the range 0x00 to 0xFF). \u escapes are not valid in strings with the b prefix. (Other escapes are fine.) If the b and r prefixes are both specified, b must come first, and the contents can then only be in the range U+0000 to U+007F. The Uint8List object is created by taking the string's value and interpreting it as 8-bit extended ASCII (meaning a character with value U+00NN creates a byte with value 0xNN in the final list).

String literals can be placed adjacent and will be combined to form a single object. However, all the parts of this sequence must agree on the presence or absence of the b prefix.

The String class is entirely replaced. It does not have a [] accessor.

  • Strings can be concatenated using +, the second operand must also be a String.
  • Strings can be repeated using *, the second operand must be an int.
  • Strings implement start and end accessors which return StringIterators pointing at the relevant parts.
  • String.contains(), String.endsWith(), String.replaceAll, and other functions that don't expose string positions work as today.
  • String.indexOf returns a StringIterator.
  • The width arguments to functions like padLeft and padRight work in terms of counting grapheme clusters.
  • The index arguments to functions like indexOf, replaceFirst, substring, etc, work in terms of StringPositions.

StringIterator implements StringPosition.

The StringPosition class represents a position in a particular String. In checked mode, when a StringPosition is applied to another String (e.g. applying the end StringIterator of one String to the substring of another), the two strings are compared up to the offset indicated by the iterator and the operation fails if the two are not byte-for-byte identical under the hood.

  • StringIterator has a moveNext(n) method that moves forward by n grapheme clusters. n is optional and defaults to 1.
  • StringIterator has a moveNextRune(n) method that moves forward by n Unicode codepoints. n is optional and defaults to 1.
  • StringIterator has a movePrevious(n) method that does moveNext(-n) and a movePreviousRune(n) method that does moveNextRune(-n).
  • StringIterator has a clone method that returns another StringIterator instance at the same position.

StringPosition has a + operator that results in a new StringPosition whose backing string is the concatenation of the prefixes of both StringPositions' backing strings at their respective positions and whose position is the end of that string. (This StringPosition is not a StringIterator, so it is immutable.)

For example, 'aaa'.end + 'bbb'.end creates a StringPosition with the same position as 'aaabbb'.end.

There are classes that take String objects and return Uint8List objects by encoding the String accordingly. There are classes that do the reverse, also.

@rakudrama
Member

Some random notes:

What does it look like to use the API?

  • How do I access the grapheme at a StringPosition?
  • How do I iterate over all the graphemes?
  • How do I scan a string, e.g. how might I implement s1.indexOf(s2, pos) using the rest of the API? How do I write a custom sorting comparator for a table column?
  • How would I manually (i.e. no regex) validate a phone number is in the format (ddd) ddd-dddd ?
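For reference, this is roughly how the phone-number check looks with today's index-based API; any replacement API would need an equally direct answer. It is a sketch, and isPhoneNumber is a made-up helper name.

```dart
/// Returns true if [s] has the exact form "(ddd) ddd-dddd", where d is an
/// ASCII digit. Hypothetical helper using today's code-unit indexing.
bool isPhoneNumber(String s) {
  const template = '(ddd) ddd-dddd';
  if (s.length != template.length) return false;
  for (var i = 0; i < template.length; i++) {
    final unit = s.codeUnitAt(i);
    if (template[i] == 'd') {
      // 0x30..0x39 are the ASCII digits '0'..'9'.
      if (unit < 0x30 || unit > 0x39) return false;
    } else if (unit != template.codeUnitAt(i)) {
      return false;
    }
  }
  return true;
}

void main() {
  print(isPhoneNumber('(555) 123-4567')); // true
  print(isPhoneNumber('(555) 123-456\uFF17')); // false: fullwidth digit
  print(isPhoneNumber('555-123-4567')); // false
}
```

Because every accepted character is ASCII, code-unit indexing happens to be safe here; that is the "limited format" case mentioned earlier in the thread.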

Do you have a model implementation in some language?

Define .length

Mention there is no .codeUnitAt(). In practice we use this instead of [] since it does not allocate a small string.

Do trim(), trimLeft(), trimRight() need any changes?
Apps use toUpperCase() and toLowerCase() - are these different to what browsers do in JavaScript today?

Are StringPositions constants, e.g. can I write const x = 'abc'.end; ?
Can stringPositions be compared with == ? hashed? (I hope not)

Are there different normalizations?

Any tie-in with for-in? Today I can say for (var r in s.runes) ...

RegExp will probably have to change - they are part of the larger string API. ES6 has /.../u which matches what Dart calls runes. It does not have grapheme boundary matching.

I work on dart2js and I am concerned about how this can be implemented with reasonable efficiency. I understand that targeting JS is not your concern, but it would be great if we could make it work, and things which are hard to optimize on one platform tend to be hard to optimize on other platforms. If we can make it that an iterator can be reduced to a JavaScript UTF-16 code unit index and some helper code in static functions to find the next/previous index, it might be possible for the compiler to do some magic and reduce local iterators to scanning indexes. To pull this off would require the iterator API to be easily analyzed, for example, to know something is monotonic and bounded.
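A minimal sketch of what such static helpers could look like at the rune level, assuming the iterator state is just a code unit index. The function names are hypothetical, and finding grapheme cluster boundaries would additionally need Unicode break tables, which this does not attempt.

```dart
bool _isLead(int unit) => (unit & 0xFC00) == 0xD800;
bool _isTrail(int unit) => (unit & 0xFC00) == 0xDC00;

/// Returns the code unit index just past the rune that starts at [index].
int nextRuneIndex(String s, int index) {
  if (_isLead(s.codeUnitAt(index)) &&
      index + 1 < s.length &&
      _isTrail(s.codeUnitAt(index + 1))) {
    return index + 2; // surrogate pair
  }
  return index + 1;
}

/// Returns the code unit index where the rune ending before [index] starts.
int previousRuneIndex(String s, int index) {
  if (index >= 2 &&
      _isTrail(s.codeUnitAt(index - 1)) &&
      _isLead(s.codeUnitAt(index - 2))) {
    return index - 2; // surrogate pair
  }
  return index - 1;
}

void main() {
  const s = 'a\u{1F600}b'; // 'a', one astral code point, 'b'
  final starts = <int>[];
  for (var i = 0; i < s.length; i = nextRuneIndex(s, i)) {
    starts.add(i);
  }
  print(starts); // [0, 1, 3]
  print(previousRuneIndex(s, 3)); // 1
}
```

The loop index only ever increases and is bounded by s.length, which is the kind of property a compiler would want to be able to prove.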

Is there a need for a single iterator instance that can go forwards and backwards?
Do we need moveNext(n) = movePrevious(-n) ? In practice this means that moveNext and movePrevious both call _moveNext and _movePrevious.

It would be easier to optimize if there was exactly one class implementing StringPosition, so operations do not need a polymorphic call in tight loops, and hopefully it could reliably be exploded into a backing string + index.
Is it really useful for the iterator to implement StringPosition, or could it expose the position via a getter?

Could the API be written entirely with positions? e.g. pos = pos.next();

I'm not sure I understand StringPosition.+
How is it used?
It seems that the concatenated backing-string requirement could be very expensive. But I don't really understand the operation.

We could experiment with the API by having a class Characters with a static method of and putting all the code we want to experiment with on class Characters, i.e

var s = Characters.of("Te\u{301}");
s.startsWith("Te");  // --> false.

@lrhn
Member

lrhn commented Aug 17, 2017

(repost of mail-comment from March 14th, now with formatting)
@Hixie wrote:

> Language proposal (this is a rough draft I just hammered out and I'm sure
> others will have opinions on making this better):
>
> There are two kinds of string literals, '...' and '''...'''. They are
> identical except that the former cannot contain a newline; the distinction
> is purely to aid with finding syntax errors.

Not sure what the syntax errors have to do with anything?

> Both kinds can use either single quotes like that, or double quotes, as in
> "..." and """...""". Strings support the same interpolation and escapes
> as we are used to in Dart 1.0.
>
> String literals can be prefixed by characters that control the syntax and
> kind of string object generated from the literal. The r prefix, short for
> raw, disables all escaping and interpolation logic within the string. For
> example, r"""\n""" is a two-character string where the first one is a
> backslash and the second one is a lowercase N.
>
> String literals as described so far create a String object, described
> below.

Can a string literal be malformed, for example "\uDC00" or "\uD800"?
Can it become non-malformed due to interpolations?

var x = "\uDC00";  // Invalid?
var y = "\uD800$x";  // Valid?
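
For comparison, a small sketch of how today's Dart, where a String is a sequence of UTF-16 code units, treats these two values. The variable names follow the snippet above; nothing here reflects the proposed API.

```dart
// Both are accepted today, because lone surrogates are just code units.
const x = '\uDC00'; // lone trail surrogate
const y = '\uD800$x'; // lead + trail: now a well-formed surrogate pair

void main() {
  print(x.length); // 1
  print(x.runes.first.toRadixString(16)); // dc00 (reported as-is)
  print(y.length); // 2
  print(y.runes.length); // 1
  print(y.runes.first.toRadixString(16)); // 10000
}
```

So under current semantics the interpolation silently turns two ill-formed pieces into one valid code point, which is the case the question above is probing.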

> The b prefix indicates that the string literal should instead create a
> Uint8List object. The literal characters in such a string must be in the
> range U+0000 to U+007F, or in the form of \xNN escapes (where NN is in
> the range 0x00 to 0xFF). \u escapes are not valid in strings with the b
> prefix. (Other escapes are fine.) If the b and r prefixes are both
> specified, b must come first, and the contents can then only be in the
> range U+0000 to U+007F. The Uint8List object is created by taking the
> string's value and interpreting it as 8-bit extended ASCII (meaning a
> character with value U+00NN creates a byte with value 0xNN in the final
> list).

That's not a string, it's a Uint8List literal. I want those, but let's be
honest and do it as a list literal: Uint8List x = [byte, byte, byte];
(with automatic static-type based literal interpretation).

> String literals can be placed adjacent and will be combined to form a
> single object. However, all the parts of this sequence must agree on the
> presence or absence of the b prefix.

Another good sign it's not really a string literal. Keep similar things similar,
and different things different.

> The String class is entirely replaced. It does not have a [] accessor.
>
>   • Strings can be concatenated using +, the second operand must also be
>     a String.
>   • Strings can be repeated using *, the second operand must be an int.
>   • Strings implement start and end accessors which return
>     StringIterators pointing at the relevant parts.
>   • String.contains(), String.endsWith(), String.replaceAll, and other
>     functions that don't expose string positions work as today.
>   • String.indexOf returns a StringIterator.
>   • The width arguments to functions like padLeft and padRight work in
>     terms of counting grapheme clusters.
>   • The index arguments to functions like indexOf, replaceFirst,
>     substring, etc, work in terms of StringPositions.

If the string is a one-byte String, then it might even be efficient.
Except that you probably want it in normalized form, so "é" should be
represented as e+combining-acute, or?

Do you still have the String.fromCharCodes constructor? Should it throw on
invalid sequences? Which ones?

> StringIterator implements StringPosition.
>
> The StringPosition class represents a position in a particular String.

Does the StringPosition expose the string it's a position into, and
operators to continue working with it? Or is a StringPosition opaque and
can only be passed back into String methods?

If it's opaque, it's probably a bad design. Opaque classes that just
represent a magic token that another class can use are bad for modularity
and testing - that's why Dart's typed-data ByteBuffer isn't some opaque object
that the Uint8List constructor can magically access - it's the other way
around, the buffer knows how to be used as an Uint8List.
Similarly we constantly have problems because Type and Symbol are opaque
magical capabilities for the mirror system. You can't mock them, you can't
really implement them (even if the types allow it, it just won't work).
They work by shared access to private state, not by their interface. If we
can avoid that kind of code smell, it would be great.

So, I don't think a StringPosition by itself is necessarily useful, maybe
StringIterator should be the lowest denominator. But it's mutable, so that
might be annoying too.
Really, there are two good APIs hiding in this:

  • an immutable StringPosition based one where operations on String (and on
    the position) return new position, and
  • a mutable StringIterator based one where operations on the iterator
    mutate the iterator.

I'm not sure mixing them too much is a good idea, but I can see that either
can be useful.
The "best" solution might be to let String be StringPosition based (all
methods take and return StringPosition instances), and have an easy way to
go from StringPosition to StringIterator and back.

Then you can do:

var buffer = new StringBuffer();
for (StringIterator s = string.start.iterator; s.moveNext();) {
  Char c = s.current;  // GraphemeCluster, really.
  if (!isWhiteSpace(c)) {
    buffer.write(c);
  }
}
return buffer.toString();

> In checked mode, when a StringPosition is applied to another String (e.g.
> applying the end StringIterator of one String to the substring of
> another), the two strings are compared up to the offset indicated by the
> iterator and the operation fails if the two are not byte-for-byte identical
> under the hood.

Let's forget checked mode; in strong mode, and by extension in Dart 2.0,
there won't be different modes.
Just require it to be the same string, and by "same" I mean identical.

>   • StringIterator has a moveNext(n) method that moves forward by n
>     grapheme clusters. n is optional and defaults to 1.
>   • StringIterator has a moveNextRune(n) method that moves forward by n
>     Unicode codepoints. n is optional and defaults to 1.
>   • StringIterator has a movePrevious(n) method that does moveNext(-n)
>     and a movePreviousRune(n) method that does moveNextRune(-n).
>   • StringIterator has a clone method that returns another
>     StringIterator instance at the same position.

It should probably have a large chunk of String operations that are based
on the position.
So, String has StringIterator indexOf(String), and StringIterator has
void indexOf(String string) that moves it to the next position of string. Again, so
we don't just pass the string iterator back into String.
We can do that too:

  StringIterator indexOf(String s, {StringIterator start}) =>
     (start ?? this.start).indexOf(s);

If String is position based, then:

  StringPosition indexOf(String s, {StringPosition start}) =>
     (start ?? this.start).indexOf(s);

(Yes, StringPosition needs methods too, how else are the string operations
going to use them when you mock the position!)

> StringPosition has a + operator that results in a new StringPosition
> whose backing string is the concatenation of the prefixes of both
> StringPosition's backing strings at their respective positions and whose
> position is the end of that string. (This StringPosition is not a
> StringIterator, so it is immutable.)

That's far too complicated. I can see the use-case, but it's not viable.

You have strings a and b and a position, pos, in b, and you want to convert
that to a position in a+b, which you want to write as a.end + pos.

Again, doing prefix checks on the input strings is not viable. Even doing
equality checks on the backing strings is more than I would ever want, and
since we don't canonicalize string values, identity won't help you here.

What you need is a way to convert a string position from one string to
another.
Something like (on StringPosition):

  StringPosition after(StringPosition other)

That will check that the position of other has a string where the prefix of
"this" (the string of "this" from start to "this" position) occurs in the
string of "other" at position "other". It then returns the position after
that string in "other".

Just doing addition and lazily checking later requires far too much
checking.

So, for the use-case above, that means something like:

  var sum = a + b;
  var newPos = pos.after(a.end.after(sum.start));

That's annoying. We can probably do some shorthands for when it's relative
to the start of the string.

  var newPos = pos.after(a.in(sum));

This will check that sum starts with a followed by the prefix of "pos",
but it will not do it on use, only ever on conversion of positions between
strings.

> For example, 'aaa'.end + 'bbb'.end creates a StringPosition with the same
> position as 'aaabbb'.end.

You picked the easy case.

What is 'aaa'.lastIndexOf('aa') + 'bbb'.end ? (something that checks for
being preceded by "abbb"?)

Why is it different from 'bbb'.end + 'aaa'.lastIndexOf('aa')? (because
it's not commutative)

It's not that '+' can't be non-commutative (like String.+), but this is too
suggestive of there being some number reasoning behind why it's valid, and
then it isn't anyway.

> There are classes that take String objects and return Uint8List objects by
> encoding the String accordingly. There are classes that do the reverse,
> also.

That's dart:convert's Utf8Codec. We'll still have that.
That also means that for each grapheme cluster, we will still need access
to the code units. We might want to get access to encoded forms of those
code units, otherwise conversion from String to UTF-8 might convert from
the underlying UTF-8 representation to code units, and then back to UTF-8.

So, a string grapheme cluster (no matter how it's represented, as a single
Char object, a StringPosition or the current state of a StringIterator)
probably need a way to write itself into a UTF-8 buffer, a UTF-16 buffer
and a UTF-32 buffer, and/or give access to iterators for those
representations. Otherwise we need to convert to an intermediate format
before converting to anything else, and that extra overhead is not
desirable.

All in all, this sounds heavy.
A StringPosition is a class with methods and state (at least the string
itself and the index and length of the current grapheme cluster in the
underlying representation).
With luck we can inline the methods that create the objects and
allocation-sink most of them, and then recognize that they are usually
working on the same string.
It requires the user to determine when to use StringPosition and when to
use StringIterator, and if they pick the wrong one, they get too many
allocations.

Then there is normalization.
Do we normalize strings? Even across interpolations? Or additions?

  accent(bool up) => "a${up ? "\u0301" : "\u0300"}";
  eccent(bool up) => "e" + (up ? "\u0301" : "\u0300");

(That's adding the accent to a base letter computationally, which is not an
unlikely use-case, even if it's probably rare).
If not, we can't use JS indexOf to find an equivalent grapheme cluster,
only an identical one. That might be valid, a "é" is not equal to
"e\u0301", the first grapheme cluster is different, even if it's equivalent.

Not using JS string functions makes this a very expensive change for JS
compilation (as Stephen also says).

@gspencergoog
Contributor

gspencergoog commented Oct 20, 2017

I'd just like to put in a word of support for Ian's request: if Dart doesn't have this kind of support, it's extremely hard to support multilingual programs, or even just support entering emoji in a program. It's pretty hard to argue that it isn't in the wheelhouse for Dart (and that programmers should just code their own solution) because it's very non-trivial to code, requiring lookup databases, etc., and is widely applicable (many internationalized programs could use this).

If Dart is to be batteries-included, then some kind of character-level (grapheme cluster) manipulation is needed.

As a simple concrete example, it's not possible to implement an input field that limits the number of user-visible "characters" without the ability to count them and truncate the input properly.
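
A sketch of that truncation using the characters package linked in the admin note at the top of this issue. It assumes package:characters is listed in pubspec.yaml, and truncate is a made-up helper name rather than anything Flutter provides.

```dart
import 'package:characters/characters.dart';

/// Truncates [input] to at most [limit] user-perceived characters
/// (extended grapheme clusters). Hypothetical helper for illustration.
String truncate(String input, int limit) =>
    input.characters.take(limit).toString();

void main() {
  // 'H', 'i', ' ', one flag (two code points), '!'
  const input = 'Hi \u{1F1E9}\u{1F1F0}!';
  print(input.characters.length); // 5
  print(truncate(input, 4) == 'Hi \u{1F1E9}\u{1F1F0}'); // true

  // The same cut by UTF-16 code units splits the flag in half.
  print(input.substring(0, 4) == 'Hi \u{1F1E9}\u{1F1F0}'); // false
}
```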

@Hixie
Contributor Author

Hixie commented Oct 20, 2017

Ok, new proposal.

String foo = 'Hello world';
var space = foo.indexOf(' ');
var hello = foo.substring(foo.start, space);
var world = foo.substring(space + 1, foo.end);
// Count number of extended grapheme clusters in a string.
int lengthOf(String s) {
  int result = 0;
  for (String character in s.characters)
    result += 1;
  return result;
}
String zalgo = 'D̸̛͇̻̼̜̲a̤̕r̟͚̥͍̲̬ṯ̘̕͞';
for (String character in zalgo.characters)
  print('The character $character begins with ${character.firstRune.currentAsString}.');
// prints:
//  The character D̸̛͇̻̼̜̲ begins with D.
//  The character a̤̕ begins with a.
//  The character r̟͚̥͍̲̬ begins with r.
//  The character ṯ̘̕͞ begins with t.
// Naive implementation of indexOf (native implementation could just compare
// the underlying buffers and construct the resulting iterator artificially).
RuneIterator indexOf(String s, String pattern) {
  bool match(RuneIterator position1, RuneIterator position2) {
    while (position1.moveNext() || position2.moveNext()) {
      if (position1.current != position2.current)
        return false;
    }
    return true;
  }
  RuneIterator position = s.runes.first;
  while (position.moveNext()) {
    if (match(position.clone(), pattern.start))
      return position;
  }
  return null;
}

UNICODE STRINGS

Syntax

The syntax for String literals in Dart is unchanged by this proposal, except that string literals that would not be valid Unicode are compile-time syntax errors.

API

The constructors on the String class, its isEmpty, isNotEmpty, runes, and hashCode properties, its contains, endsWith, replaceAll, replaceAllMapped, split, splitMapJoin, trim, trimLeft, trimRight methods, and the *, +, and == operators, are left as today.

The String class codeUnits property, codeUnitAt method, the [] operator, and the length property are removed entirely. RuneIterator's at named constructor, and its currentSize and rawIndex properties, are removed entirely. The argument to its reset method is also removed. The property with the name last on the class Runes is replaced with a property described below.

A new property is introduced, characters, which returns a Characters object. Characters is like Runes but implements Iterable<String> instead of Iterable<int>. It iterates over the associated string by extended grapheme cluster, its value taking the value of the substring that represents the current extended grapheme cluster. Its iterator returns a CharacterIterator which is similar to RuneIterator but implements BidirectionalIterator<String>.

CharacterIterator has a property "firstRune" that returns a RuneIterator that points to the first rune of the substring pointed to by the CharacterIterator.

CharacterIterator and RuneIterator both implement StringPosition. StringPosition has a property that returns a RuneIterator. It returns "this" for a RuneIterator and "firstRune" for a CharacterIterator.

CharacterIterator and RuneIterator also both implement two new properties, previous and next, which return new iterators that point to the previous or next extended grapheme cluster or rune respectively, or throw if they are at the start or end of the string respectively.

CharacterIterator and RuneIterator also both implement the binary + and - operators, with int operands. The - operator is expressed in terms of the + operator with the operand negated. The + operator creates a clone of the iterator and then advances (or retreats, for negative operands) that new iterator as many times as specified by the operand. It then returns the new iterator.

The replaceFirst, replaceFirstMapped, replaceRange, startsWith, and substring methods are changed to take a StringPosition instead of an int for any parameter that refers to a position in a string. If the StringPosition is an iterator that refers to a different string than the one passed to the method, then in debug mode the method asserts (in release mode behaviour is undefined).

The indexOf and lastIndexOf methods return a RuneIterator instead of an int. They return null if the pattern isn't found.

The padLeft and padRight width arguments are changed to refer to runes.

Two new methods padLeftByCharacters and padRightByCharacters are introduced that are identical but whose width arguments refer to extended grapheme clusters.

Runes and Characters get two new properties, first and last, that return RuneIterators and CharacterIterators respectively that point to the first and last rune and extended grapheme cluster in the string respectively. String also gets start and end properties that return the same values as Runes.first and Runes.last respectively.

RuneIterator and CharacterIterator get a new method, clone(), which returns a new, identically-configured, iterator.

The toLowerCase and toUpperCase methods take a Locale object and perform the conversion according to the relevant locale.

String is given a new method, toUtf8(), which returns a Uint8List that represents the same string, encoded as UTF-8. String is also given a new constructor, fromUtf8, which takes a Uint8List and decodes it as UTF-8. There is no way to construct a String object with invalid Unicode. When strings are constructed, they apply NFC normalization.
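
The closest equivalent today goes through dart:convert. The sketch below wraps it in two top-level functions named after the proposed members; those names are stand-ins for illustration, and unlike the proposal, today's decoder applies no NFC normalization.

```dart
import 'dart:convert';
import 'dart:typed_data';

/// Stand-in for the proposed String.toUtf8().
Uint8List toUtf8(String s) => Uint8List.fromList(utf8.encode(s));

/// Stand-in for the proposed String.fromUtf8(); throws on malformed input.
String fromUtf8(List<int> bytes) => utf8.decode(bytes);

void main() {
  final bytes = toUtf8('h\u00E9'); // 'h' followed by é
  print(bytes); // [104, 195, 169]
  print(fromUtf8(bytes) == 'h\u00E9'); // true

  try {
    fromUtf8([0xFF]); // 0xFF never appears in well-formed UTF-8
  } on FormatException {
    print('rejected');
  }
}
```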

The actual buffer of a String, and in particular its internal encoding, cannot be determined from Dart code.

BYTE STRINGS

Syntax

A "b" prefixed in front of a string literal changes it into a byte string literal.

b'Hello\tWorld\x00\x01\x02\x03\xFF'
br"""C:\SYSTEM"""

Byte strings must not contain \u escapes and must not contain any literal characters beyond U+007F.

Byte strings can't be combined with Unicode strings using the adjacent string syntax ("foo" "bar")

API

A byte string literal creates a Uint8List whose buffer contains the scalar values of each character in the literal.
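
Until such a literal exists, roughly the same bytes can be produced with dart:convert's latin1 codec, which maps each character U+00NN to the byte 0xNN. This is a sketch of the intended result, not of the proposed syntax; byteString is a hypothetical helper, and unlike the proposal it also accepts literal characters between U+0080 and U+00FF.

```dart
import 'dart:convert';

/// Approximates the proposed b'...' literal: one byte per character,
/// rejecting anything above U+00FF. Hypothetical helper.
List<int> byteString(String literal) => latin1.encode(literal);

void main() {
  print(byteString('Hello\tWorld\x00\x01\x02\x03\xFF'));
  // [72, 101, 108, 108, 111, 9, 87, 111, 114, 108, 100, 0, 1, 2, 3, 255]

  try {
    byteString('\u0100'); // does not fit in a single byte
  } catch (e) {
    print('rejected');
  }
}
```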

The dart:io libraries that deal with filenames are changed to use Uint8List rather than String.

@Hixie
Contributor Author

Hixie commented Oct 21, 2017

(Mostly I intend these proposals to demonstrate feasibility, not to be final concrete proposals. I'm sure Dart language and library experts can come up with better things with their holistic knowledge of the platform.)

@rakudrama
Member

If you want to test the proposal (is it nice to use? is it fast enough? etc) I suggest that you put all the new and changed String methods on Characters, and have a 'of' constructor. Then you can experiment without changing String:

var foo = Characters.of('Hello world');
var space = foo.indexOf(' ');
var hello = foo.substring(foo.start, space);
var world = foo.substring(space + 1, foo.end);
// Count number of extended grapheme clusters in a string.
int lengthOf(String s) {
  return Characters.of(s).length;
}

@Hixie
Contributor Author

Hixie commented Oct 21, 2017

Hopefully not actually Characters.of(s).length since that would be O(N). :-)

@Hixie
Contributor Author

Hixie commented Oct 21, 2017

I'm very interested in seeing alternative proposals, too. The current state of String is IMHO a non-contender.

@gspencergoog
Contributor

Yet another incident of needing grapheme cluster support: Android's TalkBack allows the user to indicate that they want to move forward and backward by a "character". Without grapheme cluster support, support for that is not (easily) implementable in Flutter.

@adriancmurray

This is a bit of a blocker on my Flutter app. Any word on progress with this?

@gspencergoog
Contributor

This is definitely on the Dart team's radar. See this discussion: dart-lang/language#34

@mit-mit
Member

mit-mit commented Oct 25, 2019

We now have an experimental version of a new package characters (dart-lang/language#685) that supports operations that are Unicode/grapheme cluster aware: https://pub.dev/packages/characters

API example (full API docs):

import 'package:characters/characters.dart';

main() {
  String hi = 'Hi 🇩🇰';
  print('String is "$hi"\n');

  // Length.
  print('String.length: ${hi.length}');
  print('Characters.length: ${Characters(hi).length}\n');

  // Skip last character.
  print('String.substring: "${hi.substring(0, hi.length - 1)}"');
  print('Characters.skipLast: "${Characters(hi).skipLast(1)}"\n');

  // Replace characters.
  Characters newHi =
      Characters(hi).replaceAll(Characters('🇩🇰'), Characters('🇺🇸'));
  print('Change flag: "$newHi"');
}

Output when run:

$ dart example/main.dart
String is "Hi 🇩🇰"

String.length: 7
Characters.length: 4

String.substring: "Hi 🇩���"
Characters.skipLast: "Hi "

Change flag: "Hi 🇺🇸"

Feedback most welcome!
cc @lrhn

@hpoul

hpoul commented Oct 25, 2019

Nice, any plans of also supporting collation/string comparison/sorting, or is that out of scope of that library?

@lrhn
Member

lrhn commented Oct 25, 2019

No current plans to extend this package's scope to something requiring full Unicode data tables.
The only table it has available is for grapheme cluster breaking.

@febg11

febg11 commented Feb 16, 2020

Will the LengthLimitingTextInputFormatter() be updated to support counting the characters in emojis correctly?

@mit-mit
Member

mit-mit commented Aug 21, 2020

Closing this issue: With the characters package shipped to 1.0, we have no plans for any further immediate changes in this area.

@mit-mit mit-mit closed this as completed Aug 21, 2020
@artob

artob commented Aug 21, 2020

Could someone here perhaps shed some light on the thinking on having the characters library as a separate add-on package instead of incorporated into the standard library? (I understand it was experimental earlier, but as noted it's now stable.)

As a package maintainer, I do go "sigh" on having to add a package dependency for correct string handling. Is there a roadmap for this being incorporated into the standard library?

@mit-mit
Member

mit-mit commented Aug 21, 2020

If by standard library, you mean incorporate these APIs into the ones on String in dart:core then the main reason we decided against that is that it would be a very large breaking change.

We could have shipped it as a new dart:characters library, but we don't think there is much practical difference between that and package:characters, and shipping the package unbundled on pub.dev makes it a lot easier to version and iterate on the API.

@lrhn may have additional comments on this topic.

@lrhn
Member

lrhn commented Aug 21, 2020

I too hate adding dependencies to my packages unnecessarily (especially when they come with a multitude of transitive dependencies, at least characters doesn't do that).

That said, putting the feature into a package does indeed allow us to iterate on the API much more easily than if it was in the platform libraries. The rules against breaking changes in the platform libraries are very strict. Adding a member to a class is potentially breaking. Packages generally consider adding members to a class which is not intended as a reusable interface, to be non-breaking. Even if it does break someone, they can just stay on an earlier version of the package until things are fixed. That's not an option for platform libraries, you get the ones in the current SDK and that's it.
So, more flexibility.
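To illustrate why adding a member can be breaking, here is a minimal sketch. The `Text` interface and `FixedText` class are hypothetical names invented for this example; they do not exist in the platform libraries.

```dart
// Hypothetical "platform" interface, version 1.
abstract class Text {
  int get length;
}

// User code that implements the interface instead of extending it.
class FixedText implements Text {
  @override
  int get length => 3;
}

void main() {
  final Text t = FixedText();
  print(t.length); // 3

  // If a later version of Text added another abstract member,
  // for example `bool get isEmpty;`, FixedText would stop compiling
  // until its author implemented the new member.
}
```

With a package, the author of `FixedText` could pin the older version until they are ready to update. With a platform library, the change arrives with the SDK.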

We might eventually decide that the package is mature enough, and move it into the platform libraries. That depends on a lot of things, including how it's being used, and by how many, and how often we need to make changes. We don't know any of that yet. Adding the current package to the platform libraries could turn out to cause premature lock-in, and then we'll be stuck with it forever.

@artob

artob commented Aug 21, 2020

> We could have shipped it as a new dart:characters library [...]

Yes, that's what I meant. One would hope to see baseline functionality incorporated into the platform libraries going forward.

> We might eventually decide that the package is mature enough, and move it into the platform libraries.

I thought the 1.0 release was an indication of that.

I suppose I'm mostly wondering what differentiates this from all the churn in the platform libraries itself in the past year or two. (I've been on this ride since 2017 or so.)

It would seem that the answer is that perhaps the problem domain is somewhat novel? As in, we're all used to our bad old ways, in most contemporary programming languages, of treating strings as sequences of bytes and/or codepoints, and how to move to thinking of grapheme clusters isn't entirely obvious in terms of its API implications?

@lrhn
Member

lrhn commented Aug 21, 2020

Being 1.0 means that it's ready to be used for real. Below that, you should be worried if you use the package in production. I do believe the code quality is sufficient to make it a 1.0 release.
That doesn't mean that we have extensive usability studies of the API (we have some studies, but that cannot replace widescale use for real use-cases). There might end up being a 2.0 version too, with an even better API, but it's impossible to say yet.
What is safe to say is that it won't happen if we make it a platform library.
(And all the churn in the platform libraries has been painful for everybody, including the Dart team, so we'd prefer to minimize that in the future. Using packages is a way to make required churn more manageable.)

Grapheme cluster APIs are indeed not that well studied. The only other modern string API is Swift's. They too need to consider backwards compatibility with the 16-bit based NSString from Objective-C, and what kinds of underlying representation you need to be consistent with has a large effect on the possible APIs. Whether you expose the underlying representation at all is also important. Swift tries to hide the internal representation, which means they can use UTF-8, and then they have to emulate UTF-16 for backwards compatibility. Dart could potentially go the same direction in the future, but only when we have moved most people away from the 16-bit String, and for that, we need a place to move them.
So, if anything, this is necessarily a long term plan.

@artob

artob commented Aug 21, 2020

@lrhn All right, that all makes sense. Thank you kindly for elaborating.

@artob

artob commented Aug 21, 2020

The previous aside, this seems as good a place as any to state for the record that while Dart's evolution has been impressive, perhaps even singularly impressive, Dart strings' internal UTF-16 encoding is surely one of Dart's remaining cardinal sins.

Given the a priori unlikely retrofits already successfully made to the language (all the churn), is there perhaps any kind of long-term plan to move Dart (in Dart 3+, say) towards a UTF-8 basis? (Create a UTF-8 Text type to replace the UTF-16 String, adopting the eventual mature Characters interface for it?)

Other than the obvious performance implications of tons of unnecessary encoding conversions--particularly when working with native libraries via FFI, which I do frequently--this sometimes bites people in unexpected ways.

I want to share here a brief recent example that I found instructive in terms of the pitfalls facing a Dart novice coming from the external UTF-8 world.

On first glance, this seems an innocent HTTP response handler snippet, not that dissimilar from how you would write it in any number of languages in use today:

final html = await rootBundle.loadString(assetKey);
httpResponse
  ..headers.add("Content-Type", "text/html;charset=UTF-8")
  ..headers.add("Content-Length", html.length.toString())
  ..write(html);

But, of course, the code above is not actually well-formulated at all:

E/flutter ( 7620): [ERROR:flutter/lib/ui/ui_dart_state.cc(171)] Unhandled Exception: HttpException: Content size exceeds specified contentLength. 2974071 bytes written while expected 2974029.

I trust the reason is obvious to all participants here, but it certainly wasn't to everyone. The corrected code is:

final html = utf8.encode(await rootBundle.loadString(assetKey));
httpResponse
  ..headers.add("Content-Type", "text/html;charset=UTF-8")
  ..headers.add("Content-Length", html.length.toString())
  ..add(html);

Now, I realize, of course, all the caveats here, particularly in light of the preceding discussion. Indeed, beginning with the assumption that String#length (or #size, in some other languages) should mean byte size, instead of any of the other things it could mean. Be that as it may, I found this an insidiously-buggy snippet, given that it won't even fail in the typical case (for development, for testing) where the asset files in question only contain ASCII English.
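The mismatch is easy to reproduce in isolation. The following is a minimal sketch using only `dart:core` and `dart:convert`; the helper name `utf8ByteLength` is made up for this example. It prints the UTF-16 code unit count (`String.length`), the code point count, and the UTF-8 byte count for three sample strings.

```dart
import 'dart:convert';

/// Number of bytes [s] occupies once encoded as UTF-8.
/// This is the value a Content-Length header needs for a UTF-8 body.
int utf8ByteLength(String s) => utf8.encode(s).length;

void main() {
  const plain = 'Hello';
  const accented = 'H\u00E9llo'; // U+00E9: 1 UTF-16 code unit, 2 UTF-8 bytes
  const flag = 'Hi \u{1F1E9}\u{1F1F0}'; // two code points, each a surrogate pair

  for (final s in [plain, accented, flag]) {
    print('code units=${s.length}, '
        'code points=${s.runes.length}, '
        'UTF-8 bytes=${utf8ByteLength(s)}');
  }
  // code units=5, code points=5, UTF-8 bytes=5
  // code units=5, code points=5, UTF-8 bytes=6
  // code units=7, code points=5, UTF-8 bytes=11
}
```

Only the ASCII string gives the same number for all three, which is why the original snippet passes with English-only assets.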

@jamesderlin
Contributor

> Closing this issue: With the characters package shipped to 1.0, we have no plans for any further immediate changes in this area.

AFAICT package:characters doesn't do Unicode normalization (also see dart-lang/characters#76), and without that it's hard to do sensible string comparison. That seems kind of important.

Is there some recommended way to do Unicode normalization?
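To make the comparison problem concrete, here is a minimal sketch using only `dart:core`. It shows the problem, not a solution: the two spellings of "café" render identically but compare as different strings, because `String` equality compares code units.

```dart
void main() {
  const precomposed = 'caf\u00E9'; // é as the single code point U+00E9
  const decomposed = 'cafe\u0301'; // e followed by U+0301 COMBINING ACUTE ACCENT

  print(precomposed == decomposed); // false
  print(precomposed.length); // 4
  print(decomposed.length); // 5
}
```
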
