Releases: clipperhouse/uax29
Transforms
Adds the ability to transform tokens, such as lowercasing, based on golang.org/x/text/transform.
The signature of `SegmentAll()` has changed, removing the variadic parameter at the end, which is a breaking change if you used that param. I’m not willing to go to v2 over that; hopefully it’s not a problem. :)
Filters
Adds the ability to filter tokens, using the `Filter()` method on `Segmenter` and `Scanner`.
Any `func([]byte) bool` can serve as a filter, with arbitrary logic. One included filter is `Wordlike`, which removes whitespace and punctuation tokens, returning only ‘words’ in the common sense.
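For instance, a hand-rolled filter might keep only tokens that contain at least one letter or digit. This is merely an approximation of what `Wordlike` does; the exact categories it checks are an assumption here.

```go
package main

import (
	"fmt"
	"unicode"
	"unicode/utf8"
)

// wordlike reports whether a token contains at least one letter or
// digit. A hand-rolled approximation of the package's Wordlike
// filter; the library's exact category logic may differ.
func wordlike(token []byte) bool {
	for len(token) > 0 {
		r, size := utf8.DecodeRune(token)
		if unicode.IsLetter(r) || unicode.IsDigit(r) {
			return true
		}
		token = token[size:]
	}
	return false
}

func main() {
	for _, tok := range []string{"Hello", " ", ",", "42"} {
		fmt.Printf("%q -> %v\n", tok, wordlike([]byte(tok)))
	}
	// "Hello" -> true
	// " " -> false
	// "," -> false
	// "42" -> true
}
```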
See also the `Contains()` and `Entirely()` methods, which allow creation of filters based on Unicode categories.
New Segmenter
Introduces a new `Segmenter` type to tokenize `[]byte`. It is analogous to the existing `Scanner`, but does not require an `io.Reader`. `Segmenter` is perhaps 10% faster, assuming you are operating on an existing `[]byte`.
Also adds a `SegmentAll([]byte)` convenience function, if you’d like to do this as a one-liner and are not too concerned about allocations.
Handle more invalid UTF-8
Invalid UTF-8 input is considered undefined behavior. That said, we’ve worked to ensure that such inputs will not cause pathological outcomes, such as a panic or infinite loop. Callers should expect “garbage-in, garbage-out”.
Each package includes two basic tests, `TestInvalidUTF8` and `TestRandomBytes`. Those tests pass: the invalid bytes are returned verbatim, though without a guarantee as to how they will be segmented.
v1.6.5
Many performance improvements, over 2× faster than the v1.6.0 release.
v1.6.0
Replaces range tables with tries, using the triegen package from x/text. The performance improvement looks to be 30-40%.
This is a breaking change for consumers that depended on those range tables, which were exported. For normal usage of calling the scanner, the API is unchanged.
This should be a new major version, but I can’t be bothered with the v2 renaming, in terms of discoverability, etc. Hopefully OK for most users.
v1.5.0
The implementation is now purely based on `bufio.SplitFunc` and `bufio.Scanner`. Our custom scanner and rune buffer are gone.
Perf improvements are substantial for words and graphemes; sentences appear to have regressed. Allocations are down to single digits, and are independent of input size.
This is a breaking change: the API is smaller. However, the main use case of calling (e.g.) `words.NewScanner` and iterating over it should be unchanged for most consumers. The methods are the same.
So, I should increment the major version, but the recommended v2 business is more than I want to deal with. Hopefully it’s OK for most.
v1.2.1
Performance and docs tweaks.
v1.2.0
New `SplitFunc`s for words, sentences and graphemes, for use with `bufio.Scanner`. These are provided for convenience if you are already using `bufio.Scanner` and want a drop-in replacement `SplitFunc`.
The ‘native’ `(words|sentences|graphemes).NewScanner` is ~10% faster; prefer it if you are starting from scratch.
v1.1.0
New unified `uax29.Scanner`, used by the subpackages, instead of each having its own `Scanner` type.