Releases: clipperhouse/uax29

Transforms

05 Jun 02:30
2b27b05

Add ability to transform tokens, such as lowercasing, based on golang.org/x/text/transform.

The signature of SegmentAll() has changed: the variadic parameter at the end has been removed, which is a breaking change if you used that parameter. I’m not willing to go to v2 over that; hopefully it’s not a problem. :)

Filters

29 May 01:10
619311b

Adds the ability to filter tokens. Use the Filter() method on Segmenter and Scanner.

Any func([]byte) bool can serve as a filter, with arbitrary logic. An example included filter is Wordlike, which removes whitespace and punctuation, returning only ‘words’ in the common sense.

See also the Contains() and Entirely() methods, which allow creation of filters based on Unicode categories.
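
As a rough illustration of the func([]byte) bool shape described above, here is a minimal sketch of a custom filter using only the standard library. The name hasLetterOrDigit is hypothetical and is not part of the package; the token list in main is hand-written rather than produced by a Segmenter or Scanner.

```go
package main

import (
	"fmt"
	"unicode"
	"unicode/utf8"
)

// hasLetterOrDigit is a hypothetical filter (not part of uax29).
// It reports whether the token contains at least one letter or digit,
// which matches the func([]byte) bool shape a filter needs.
func hasLetterOrDigit(token []byte) bool {
	for len(token) > 0 {
		r, size := utf8.DecodeRune(token)
		if unicode.IsLetter(r) || unicode.IsDigit(r) {
			return true
		}
		token = token[size:]
	}
	return false
}

func main() {
	// Hand-written tokens, standing in for the output of a segmenter.
	tokens := [][]byte{[]byte("Hello"), []byte(" "), []byte(","), []byte("42")}
	for _, t := range tokens {
		if hasLetterOrDigit(t) {
			fmt.Printf("%q\n", t)
		}
	}
}
```

A function like this could presumably be passed to Filter() in place of Wordlike; check the package documentation for the exact parameter type.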

New Segmenter

18 May 17:07

Introduces a new Segmenter type to tokenize []byte. It is analogous to the existing Scanner, but does not require an io.Reader. Segmenter is maybe 10% faster, assuming you are operating on an existing []byte.

Also adds a SegmentAll([]byte) convenience method, if you’d like to do this as a one-liner and are not too concerned about allocations.

Handle more invalid UTF-8

29 Jun 00:45

Invalid UTF-8 input is considered undefined behavior. That said, we’ve worked to ensure that such inputs will not cause pathological outcomes, such as a panic or infinite loop. Callers should expect “garbage-in, garbage-out”.

There are two basic tests in each package, called TestInvalidUTF8 and TestRandomBytes. Those tests pass, returning the invalid bytes verbatim, without a guarantee as to how they will be segmented.

v1.6.5

14 Jun 18:17

Many performance improvements, over 2× faster than the v1.6.0 release.

v1.6.0

12 May 15:25
ceb970c

Replace range tables with tries, using the triegen package from x/text. Performance improvement looks to be 30-40%.

This is a breaking change for consumers that depended on those range tables, which were exported. For normal usage of calling the scanner, the API is unchanged.

This should be a new major version but I can’t be bothered with the v2 renaming, in terms of discoverability, etc. Hopefully OK for most users.

v1.5.0

07 May 15:50

The implementation is now based purely on bufio.SplitFunc and bufio.Scanner. Our custom scanner and rune buffer are gone.

Perf improvements are substantial for words & graphemes; sentences seem to have regressed. Allocations are down to single digits, and are independent of input size.

This is a breaking change — the API is smaller. However, the main use case of calling (e.g.) words.NewScanner and iterating over it should be unchanged for most consumers. The methods are the same.

So, I should increment the major version but the recommended v2 business is more than I want to deal with. Hopefully it’s OK for most.

v1.2.1

01 May 14:08

Performance and docs tweaks.

v1.2.0

28 Apr 15:05

New SplitFunc for words, sentences and graphemes, for use with bufio.Scanner. Provided for convenience if you are using bufio.Scanner and want to drop in a replacement SplitFunc.

The ‘native’ (words|sentences|graphemes).NewScanner is ~10% faster; prefer that if you are starting from scratch.

v1.1.0

27 Apr 20:47

New unified uax29.Scanner, used by the subpackages, instead of each having its own Scanner type.